Saturday, November 22, 2008

Use GREP or ACK to Search for Chinese Characters

I have been developing a solution for a Chinese customer. As a result, I need to search for Chinese characters in text files. I use GREP and ACK. There is some subtlety when searching for Chinese characters.

To type Chinese characters at the Windows command line, the console code page needs to be set to 936, which is the code page for GBK encoding. Running chcp 936 does this.

There are two files, a.txt and b.txt, in d:/test. Both files contain the following text:

中国
中国abc

The encoding for a.txt is GBK. The encoding for b.txt is UTF-8. The command grep 中国 d:/test/*.txt finds 中国 only in a.txt.

The conclusion is that GREP can find multi-byte characters such as Chinese only if the command line console and the file being searched use the same encoding. This conclusion also applies to ACK.
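The result makes sense once you look at the bytes. The following Python sketch is only an illustration, and it assumes that the console hands the pattern to GREP as GBK bytes and that GREP compares raw bytes without converting encodings. The file contents are recreated in memory, and matching_lines is a hypothetical helper, not part of GREP or ACK.

```python
# Illustration only: assumes the pattern arrives as GBK bytes (code page 936)
# and that the search is a plain byte comparison.
text = "中国"

gbk_pattern = text.encode("gbk")     # what a GBK console would send
utf8_pattern = text.encode("utf-8")  # what a UTF-8 console would send

# In-memory stand-ins for a.txt (GBK) and b.txt (UTF-8).
a_txt = "中国\n中国abc\n".encode("gbk")
b_txt = "中国\n中国abc\n".encode("utf-8")


def matching_lines(pattern, data):
    """Return the lines of data (bytes) that contain pattern (bytes)."""
    return [line for line in data.splitlines() if pattern in line]


print(gbk_pattern.hex(" "))   # d6 d0 b9 fa
print(utf8_pattern.hex(" "))  # e4 b8 ad e5 9b bd
print(len(matching_lines(gbk_pattern, a_txt)))  # 2
print(len(matching_lines(gbk_pattern, b_txt)))  # 0
```

The same two characters are four bytes in GBK and six different bytes in UTF-8, so a GBK pattern can never match the UTF-8 file byte for byte.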

Use PAR to Format XML Comments

PAR is fantastic for formatting text files. It can be used to give XML comments a pretty layout. For the following XML comment:

<!-- You can recognize truth by its beauty and -->
<!-- simplicity. When you get it right, it is obvious that it is right. -->

par 50 produces

<!-- You can recognize truth by its beauty -->
<!-- and simplicity. When you get it right, -->
<!-- it is obvious that it is right. -->

But for the following text

<!-- You can recognize truth by its beauty and simplicity. When you get it right, it is obvious that it is right. -->

par 50 produces

<!-- You can recognize truth by its beauty and
simplicity. When you get it right, it is obvious
that it is right. -->

This is not what we want. Instead, par 50 -p5 -s5 can be used to produce

<!-- You can recognize truth by its beauty -->
<!-- and simplicity. When you get it right, -->
<!-- it is obvious that it is right. -->

For details on using PAR, refer to the Par documentation.
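If PAR is not available, a similar layout can be approximated with Python's textwrap module. This is a rough sketch and not PAR's algorithm: rewrap_xml_comment is a hypothetical helper, it assumes the comment delimiters are "<!-- " and " -->", and its line breaks may differ from the ones PAR chooses.

```python
import textwrap


def rewrap_xml_comment(comment, width=50):
    """Rewrap an XML comment so that every line is its own <!-- ... --> comment."""
    prefix, suffix = "<!-- ", " -->"
    words = []
    for line in comment.splitlines():
        line = line.strip()
        # Drop the delimiters so that only the comment text remains.
        if line.startswith("<!--"):
            line = line[4:]
        if line.endswith("-->"):
            line = line[:-3]
        words.extend(line.split())
    # Leave room for the prefix and suffix on every output line.
    body_width = width - len(prefix) - len(suffix)
    body_lines = textwrap.wrap(" ".join(words), width=body_width)
    return "\n".join(prefix + body + suffix for body in body_lines)


source = ("<!-- You can recognize truth by its beauty and simplicity. "
          "When you get it right, it is obvious that it is right. -->")
print(rewrap_xml_comment(source, 50))
```

Unlike PAR, this sketch fills each line greedily and does not pad the closing delimiters into a column.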

Monday, November 17, 2008

Paste Multiple Lines of Code into CLISP Console

There is a small problem when pasting multiple lines of code into the CLISP console. If TAB characters are used for code indentation, the following message appears when the code is pasted into the console.

You are in the top-level Read-Eval-Print loop.

Using spaces for code indentation will fix this problem.
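If the code already contains TAB characters, they can be converted before pasting. The following Python sketch is one possible way to do it. The tab width of 4 and the sample Lisp snippet are assumptions for illustration only.

```python
def tabs_to_spaces(source, tab_width=4):
    """Replace every TAB in source with spaces, aligned to tab_width columns."""
    return "\n".join(line.expandtabs(tab_width) for line in source.splitlines())


# Hypothetical Lisp snippet indented with TAB characters.
lisp_code = "(defun add (a b)\n\t(+ a b))"
print(tabs_to_spaces(lisp_code))
```

The converted text contains only spaces, so it can be pasted into the console without triggering the message above.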