Wednesday, March 19, 2008

Unicode & Java

Endianness


In computing, endianness is the byte (and sometimes bit) ordering used to represent
some kind of data.
Most modern computer processors agree on bit ordering "inside" individual bytes (this was not always the case). This means that any single-byte value will be read the same on almost any computer one may send it to.
Integers are usually stored as sequences of bytes, so that the encoded value can be obtained by simple concatenation. The two most common byte orders are:

  1. increasing numeric significance with increasing memory addresses, known as little-endian

  2. its opposite, most-significant byte first, called big-endian.


Intel x86 processors are little-endian. The JVM is big-endian in its class file format and its data streams, regardless of the host CPU. (The above content is from the Wikipedia entry on endianness.)
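
To see the two byte orders side by side, here is a minimal sketch using java.nio.ByteBuffer. The class name EndianDemo and the helper toHex are made up for this post; ByteBuffer, ByteOrder and their methods are standard JDK classes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Hypothetical helper: encodes a 32-bit int in the given byte order
    // and returns the four bytes as space-separated hex, lowest address first.
    static String toHex(int value, ByteOrder order) {
        byte[] bytes = ByteBuffer.allocate(4).order(order).putInt(value).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int value = 0x0A0B0C0D;
        // Most-significant byte first: 0A 0B 0C 0D
        System.out.println("big-endian:    " + toHex(value, ByteOrder.BIG_ENDIAN));
        // Least-significant byte first: 0D 0C 0B 0A
        System.out.println("little-endian: " + toHex(value, ByteOrder.LITTLE_ENDIAN));
        // Byte order of the machine running the JVM (depends on the host)
        System.out.println("native order:  " + ByteOrder.nativeOrder());
    }
}
```

A new ByteBuffer starts out big-endian whatever the hardware is, so the byte order only needs to be set explicitly when exchanging data with little-endian formats.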

Unicode


Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hexadecimal).
"Mapping of Unicode character planes" is a good explanation of Unicode planes and code points.
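
Since the codespace ends at 10FFFF, it splits into 17 planes of 65,536 code points each, and the plane number is simply the code point shifted right by 16 bits. A small sketch, where PlaneDemo and planeOf are names invented here for illustration:

```java
class PlaneDemo {
    // Hypothetical helper: returns the plane number (0 to 16) of a code point.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                "Not in 0.." + Integer.toHexString(Character.MAX_CODE_POINT));
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x0041));   // 'A', Basic Multilingual Plane
        System.out.println(planeOf(0x1D11E));  // MUSICAL SYMBOL G CLEF, plane 1
        System.out.println(planeOf(0x10FFFF)); // last code point, plane 16
    }
}
```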

UTF-16


To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.
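
The surrogate pair for a supplementary character can be worked out by hand: subtract 10000 (hex) from the code point, then put the top 10 bits of the result into the high surrogate and the bottom 10 bits into the low surrogate. Below is a rough sketch of that arithmetic, checked against the JDK's own Character.toChars. SurrogateDemo and toSurrogates are names made up for this example.

```java
class SurrogateDemo {
    // Hypothetical helper: splits a supplementary code point
    // (U+10000 to U+10FFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E; // MUSICAL SYMBOL G CLEF
        char[] manual = toSurrogates(codePoint);
        char[] library = Character.toChars(codePoint);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
    }
}
```

For U+1D11E both lines print D834 DD1E.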

Java


In the Java SE API documentation, Unicode code point is used for character values in the range between U+0000 and U+10FFFF, and Unicode code unit is used for 16-bit char values that are code units of the UTF-16 encoding.
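
The difference shows up as soon as a String holds a supplementary character: length() and charAt() work in code units, while codePointCount() and codePointAt() work in code points. A short sketch (CodePointDemo and sample are names invented for this post; the String and Character methods are standard):

```java
class CodePointDemo {
    // Hypothetical sample: "A" followed by U+1D11E (MUSICAL SYMBOL G CLEF),
    // which needs two UTF-16 code units.
    static String sample() {
        return "A" + new String(Character.toChars(0x1D11E));
    }

    public static void main(String[] args) {
        String s = sample();
        // Number of 16-bit code units: 3
        System.out.println("code units:  " + s.length());
        // Number of Unicode code points: 2
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        // charAt returns a single code unit, here the high surrogate D834
        System.out.printf("charAt(1):      %04X%n", (int) s.charAt(1));
        // codePointAt combines the surrogate pair into 1D11E
        System.out.printf("codePointAt(1): %X%n", s.codePointAt(1));
    }
}
```

When walking a string that may contain supplementary characters, stepping by Character.charCount(codePoint) instead of by 1 keeps the index from landing in the middle of a surrogate pair.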
