Saturday, November 22, 2008

Use GREP or ACK to Search for Chinese Characters

I have been developing a solution for a Chinese customer. As a result, I need to search for Chinese characters in text files. I use GREP and ACK for this, and there is a subtlety when the search term contains Chinese characters.

To type Chinese characters at the Windows command line, the console's code page needs to be set to 936, which is the code page for GBK encoding. Running chcp 936 does this.

There are two files, a.txt and b.txt, in d:/test. Both files contain the following text:

中国
中国abc

a.txt is encoded in GBK and b.txt is encoded in UTF-8. Running grep 中国 d:/test/*.txt finds 中国 only in a.txt.
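A likely explanation is that the search term reaches grep as GBK bytes (because the console is on code page 936) and is compared byte for byte against the file contents. The following is a minimal Python 3 sketch of that idea, not a description of how GREP is implemented; the in-memory byte strings a_bytes and b_bytes are stand-ins for a.txt and b.txt.

```python
# The search term as it would be encoded by a console using code page 936 (GBK).
pattern = "中国".encode("gbk")

# The same text stored in the two encodings (stand-ins for a.txt and b.txt).
text = "中国\n中国abc\n"
a_bytes = text.encode("gbk")
b_bytes = text.encode("utf-8")

# A byte-level search only succeeds where the byte sequences agree.
found_in_a = pattern in a_bytes
found_in_b = pattern in b_bytes

print("pattern bytes (GBK):  ", pattern.hex())
print("same text in UTF-8:   ", "中国".encode("utf-8").hex())
print("found in GBK data:    ", found_in_a)
print("found in UTF-8 data:  ", found_in_b)
```

The two encodings produce entirely different bytes for the same two characters, so a GBK search term cannot match UTF-8 data at the byte level.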

The conclusion is that GREP finds multi-byte characters such as Chinese only if the command line console and the file being searched use the same encoding. The same applies to ACK.
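One way around the limitation is to decode each file to text before searching, so the comparison happens on characters rather than bytes. Below is a rough Python 3 sketch under that assumption. The helper name find_lines and the "try UTF-8 first, then GBK" order are illustrative choices, not part of GREP or ACK.

```python
def find_lines(data, needle, encodings=("utf-8", "gbk")):
    """Return the lines of `data` (raw bytes) that contain `needle` (a str).

    Each candidate encoding is tried in order; the first one that decodes
    without error is used. This is a heuristic, not real encoding detection.
    """
    for enc in encodings:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
        return [line for line in text.splitlines() if needle in line]
    return []  # none of the candidate encodings could decode the data


# Usage: the same search now matches regardless of how the bytes were stored.
sample = "中国\n中国abc\nxyz\n"
print(find_lines(sample.encode("gbk"), "中国"))
print(find_lines(sample.encode("utf-8"), "中国"))
```

UTF-8 is tried first because GBK-encoded Chinese text is rarely valid UTF-8, so a wrong guess usually fails loudly. The reverse order would often decode UTF-8 bytes as GBK without error and return garbage.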
