Posted by Christopher Diggins, 19 October 2011 10:42 am
You must understand Unicode if you want to make your Windows application ready for the international market. Unfortunately, if you jump around the internet and MSDN as usual, you have a good chance of becoming incredibly confused. This article is intended to help sort out the key issues. One of my colleagues called this article a "crash course" in the language encoding schemes available in Windows.
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems (http://en.wikipedia.org/wiki/Unicode). Usually when talking about Unicode we are also referring to the Universal Character Set (http://en.wikipedia.org/wiki/Universal_Character_Set). The Universal Character Set is a one-to-one mapping of thousands of abstract characters (also called glyphs) to individual integer representations called “code points”.
Most programmers know that there are a number of different possible encodings for Unicode (e.g. UTF-8, UTF-16, UTF-32 and so on). So why does Microsoft documentation say in several places that “Unicode is a 16-bit encoding”? For example http://www.microsoft.com/typography/unicode/cscp.htm and http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx.
There are various reasons for this inaccuracy, but I suspect the primary reason is historical. From Wikipedia (http://en.wikipedia.org/wiki/Variable-width_encoding) “Originally, both Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16 bit”.
The Windows API uses wide character (wchar_t) strings to pass Unicode text data encoded in UTF-16 (http://en.wikipedia.org/wiki/UTF-16) with the bytes in little-endian order. UTF-16 is only one of many possible encodings of Unicode. It is a variable-length encoding, in which Unicode code points (characters) are mapped to either one or two 16-bit code units (i.e. two or four bytes of data). Little-endian means that the least significant byte is stored first.
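To make the one-or-two-code-unit mapping concrete, here is a minimal sketch of the surrogate-pair arithmetic defined by the UTF-16 encoding form. This is just the standard's arithmetic, not how Windows implements it internally:

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points below U+10000 fit in a single 16-bit unit;
// the rest are split into a high/low surrogate pair.
std::vector<uint16_t> encode_utf16(uint32_t code_point) {
    std::vector<uint16_t> units;
    if (code_point < 0x10000) {
        units.push_back(static_cast<uint16_t>(code_point));
    } else {
        uint32_t v = code_point - 0x10000;    // 20 bits remain
        units.push_back(0xD800 + (v >> 10));  // high surrogate: top 10 bits
        units.push_back(0xDC00 + (v & 0x3FF)); // low surrogate: bottom 10 bits
    }
    return units;
}
```

U+007A (“z”) encodes as the single code unit 0x007A, while a code point above U+FFFF such as U+1F600 becomes the surrogate pair 0xD83D 0xDE00.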
This chart should help clarify the various terms used so far:
| Code Point | Glyph | Character | UTF-16 Code Unit | UTF-16 LE bytes |
|------------|-------|-----------|------------------|-----------------|
| U+007A | z | latin small letter z | 0x007A | 0x7A 0x00 |
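The last column follows directly from the definition of little-endian: each 16-bit code unit is stored least significant byte first. A small portable sketch of that byte split (plain arithmetic, not a Windows API):

```cpp
#include <cstdint>
#include <utility>

// Split a UTF-16 code unit into its little-endian byte sequence:
// the least significant byte comes first in memory.
std::pair<uint8_t, uint8_t> to_little_endian(uint16_t unit) {
    return { static_cast<uint8_t>(unit & 0xFF),  // low byte, stored first
             static_cast<uint8_t>(unit >> 8) };  // high byte, stored second
}
```

For “z” (code unit 0x007A) this yields 0x7A followed by 0x00, matching the table above.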
In the Windows API there are usually at least two versions of every function that accepts text parameters. One version uses arrays of char for string data and the other uses arrays of wchar_t. The name of the single-byte (char) version usually ends with the letter “A” (for example “MessageBoxA()”) and the name of the two-byte (wchar_t) version ends with the letter “W” (for example “MessageBoxW()”).
Some people mistakenly assume that "A" means ASCII or ANSI. This is not true. The "A" version of an API simply indicates that the string data is encoded using a code page. As you will see later, a code page can hold fixed-width single-byte data or, in the case of DBCS code pages, variable-length multi-byte data.
Normally when writing a Windows application you don’t call the functions ending in “A” or “W” directly; instead you use the unsuffixed name (for example MessageBox()). The Windows headers select one of the two functions depending on whether the preprocessor symbol UNICODE is defined (http://msdn.microsoft.com/en-us/library/xxh1wfhz.aspx).
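The mechanism can be illustrated with a stripped-down imitation of what the Windows headers do. The names below (ShowText and friends) are hypothetical stand-ins that mimic the MessageBoxA/MessageBoxW pattern; this is a self-contained sketch, not code from <windows.h>:

```cpp
#include <string>

// Hypothetical A/W pair mimicking the Windows header pattern.
// Each returns the character count just so the sketch does something visible.
int ShowTextA(const char* text)    { return static_cast<int>(std::string(text).size()); }
int ShowTextW(const wchar_t* text) { return static_cast<int>(std::wstring(text).size()); }

// The unsuffixed "function" is really a macro chosen at compile time,
// exactly as the Windows headers do with the UNICODE symbol.
#ifdef UNICODE
#define ShowText ShowTextW
#else
#define ShowText ShowTextA
#endif
```

With UNICODE defined, a call like ShowText("hi") fails to compile, because the macro expands to the wchar_t version; this is exactly how a mismatch between your string literals and the build setting surfaces.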
A code page is a list of character codes in a particular order. Wikipedia defines it as a synonym for character encoding (http://en.wikipedia.org/wiki/Code_page), but for Windows programming there is a more specific meaning (http://en.wikipedia.org/wiki/Windows_code_page). Different code pages map characters from different languages to the same range of numerical values.
An example of a code page is Windows-1252 (see http://msdn.microsoft.com/en-us/goglobal/cc305145 and http://en.wikipedia.org/wiki/Windows-1252), a character encoding of the Latin alphabet. This code page is often confused with the ISO-8859-1 character set, and it is commonly referred to as an “ANSI” code page. The term “ANSI” here is a misnomer: no Windows code page is in fact an ANSI standard. The term continues to be used for historical reasons.
There is also no ASCII Windows code page, but most Windows code pages are extensions of the ASCII character set. This means that ASCII characters usually have the same numerical representation (code unit) in different code pages (e.g. an English code page and a French code page). Thus ASCII data just works.
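A sketch may make both points concrete: bytes in the ASCII range map to themselves, but Windows-1252 diverges from ISO-8859-1 in the 0x80–0x9F range, where ISO-8859-1 has control characters and Windows-1252 has printable ones. The three remapped slots shown below are taken from the published Windows-1252 table; the rest of that range is omitted for brevity:

```cpp
#include <cstdint>

// Map a Windows-1252 byte to a Unicode code point, showing where
// it diverges from ISO-8859-1. Only a few of the remapped 0x80-0x9F
// slots are shown; the real table remaps most of that range.
uint32_t cp1252_to_unicode(uint8_t byte) {
    switch (byte) {
        case 0x80: return 0x20AC;  // EURO SIGN
        case 0x93: return 0x201C;  // LEFT DOUBLE QUOTATION MARK
        case 0x94: return 0x201D;  // RIGHT DOUBLE QUOTATION MARK
        default:   return byte;    // 0x00-0x7F ASCII and 0xA0-0xFF
                                   // match ISO-8859-1 byte-for-byte
    }
}
```

The default branch is why ASCII-only test data “just works” across code pages, and why tests built from it can hide localization bugs.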
An unfortunate consequence of this during development is that you may think your text handling routines are properly localized when they are not!
More information on code pages can be found at http://msdn.microsoft.com/en-us/library/dd317752.aspx and http://msdn.microsoft.com/en-US/library/8w60z792.aspx.
Prior to adopting Unicode, Microsoft used a system called MBCS (multi-byte character set) for encoding large character sets, such as those required by Chinese and Japanese. The most commonly used MBCS in Windows is DBCS (double-byte character set). In much of the MSDN documentation MBCS is used to describe all non-Unicode support for multibyte characters. In Visual C++, MBCS always means DBCS (http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx).
Despite the name, DBCS is an example of a variable-length encoding. With DBCS, characters can be one or two bytes in size and their interpretation depends on which code page is in use. In a modern Windows application you should use Unicode wherever possible, but DBCS may be necessary for compatibility with various DLLs or older code.
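The practical consequence of a variable-length byte encoding is that you cannot step through a DBCS string one byte at a time. The sketch below hard-codes the lead-byte ranges of code page 932 (Shift-JIS) purely as an illustration; real Windows code should ask the system via IsDBCSLeadByteEx() rather than assume any particular code page:

```cpp
#include <cstddef>
#include <cstdint>

// Lead-byte test for code page 932 (Shift-JIS), used here only as
// an illustration; other DBCS code pages use different ranges.
bool is_lead_byte_cp932(uint8_t b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a CP932-encoded string:
// a lead byte and its trail byte together form one character.
size_t dbcs_char_count(const char* s) {
    size_t count = 0;
    while (*s) {
        if (is_lead_byte_cp932(static_cast<uint8_t>(*s)) && s[1])
            s += 2;  // double-byte character
        else
            s += 1;  // single-byte character
        ++count;
    }
    return count;
}
```

Note that the scan only works forward from the start of the string; given an arbitrary byte in the middle, you cannot in general tell whether it is a trail byte or a new character, which is one of the reasons Unicode is preferred.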
If you are serious about learning how to write Windows software for international markets, I strongly suggest reading the following articles. Reading them in order should help minimize your confusion: