
Writing Unicode Enabled Windows Applications

Posted by Christopher Diggins, 19 October 2011 10:42 am

You must understand Unicode if you want to make your Windows application ready for the international market. Unfortunately, if you jump around the internet and MSDN as most of us do, you have a good chance of becoming thoroughly confused. This article is intended to help sort out the key issues. One of my colleagues called it a "crash course" in the language encoding schemes available in Windows.

What is Unicode

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems (http://en.wikipedia.org/wiki/Unicode). Usually when talking about Unicode we are also referring to the Universal Character Set (http://en.wikipedia.org/wiki/Universal_Character_Set). The Universal Character Set is a one-to-one mapping of more than a hundred thousand abstract characters to individual integers called “code points”. A character is not the same thing as a glyph: a glyph is the visual shape used to draw a character in a particular font.

Encodings

Most programmers know that there are a number of different possible encodings for Unicode (e.g. UTF-8, UTF-16, UTF-32 and so on). So why does Microsoft documentation say in several places that “Unicode is a 16-bit encoding”? For example, see http://www.microsoft.com/typography/unicode/cscp.htm and http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx.
There are various reasons for this inaccuracy, but I suspect the primary reason is historical. From Wikipedia (http://en.wikipedia.org/wiki/Variable-width_encoding) “Originally, both Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16 bit”.

The Windows API and Unicode Data

The Windows API uses wide character (wchar_t) strings to pass Unicode text encoded in UTF-16 (http://en.wikipedia.org/wiki/UTF-16), with the bytes in little-endian order. UTF-16 is only one of many possible encodings of Unicode. It is a variable-length encoding in which each Unicode code point is mapped to either one or two 16-bit code units (i.e. two or four bytes of data). Little-endian means that the least significant byte of each code unit is stored first.

This chart should help clarify the various terms used so far:

Code Point  Glyph  Character             UTF-16 Code Unit  UTF-16 LE Bytes
U+007A      z      latin small letter z  0x007A            0x7A 0x00
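
The path from code point to code units to bytes can be sketched in portable C++. This is only an illustration, not how Windows implements it: the function names encode_utf16, utf16le_bytes and check_utf16 are made up for this example, and no Windows headers are involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point as UTF-16 code units: one unit below U+10000,
// otherwise a surrogate pair.
std::vector<std::uint16_t> encode_utf16(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);                       // Unicode range
    assert(code_point < 0xD800 || code_point > 0xDFFF);   // not a surrogate
    if (code_point < 0x10000) {
        return { static_cast<std::uint16_t>(code_point) };
    }
    const std::uint32_t offset = code_point - 0x10000;    // 20 significant bits
    return { static_cast<std::uint16_t>(0xD800 + (offset >> 10)),
             static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF)) };
}

// Serialize the code units as bytes, least significant byte first.
std::vector<std::uint8_t> utf16le_bytes(std::uint32_t code_point) {
    std::vector<std::uint8_t> bytes;
    for (std::uint16_t unit : encode_utf16(code_point)) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));
    }
    return bytes;
}

// Hypothetical self-check that reproduces the chart row for U+007A.
void check_utf16() {
    // U+007A 'z': one code unit 0x007A, stored as bytes 0x7A 0x00
    const std::vector<std::uint8_t> z = utf16le_bytes(0x7A);
    assert(z.size() == 2 && z[0] == 0x7A && z[1] == 0x00);
    // U+1F600 lies above U+FFFF, so it needs two code units (four bytes)
    assert(encode_utf16(0x1F600).size() == 2);
}
```

Code that assumes one wchar_t per character works for the chart row, but breaks on any code point that needs a surrogate pair.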

In the Windows API there are usually two versions of every function that accepts text parameters. One version uses arrays of char for string data and the other uses arrays of wchar_t. The name of the char version usually ends with the letter “A” (for example “MessageBoxA()”) and the name of the wchar_t version ends with the letter “W” (for example “MessageBoxW()”).

Some people mistakenly assume that "A" means ASCII or ANSI. This is not true. The "A" version of an API simply indicates that string data is encoded using a code page. As you will see later, a code page can be a fixed-width single-byte encoding, or a variable-length encoding in the case of DBCS code pages.

Normally when writing a Windows application you don’t call functions with a name ending in “A” or “W”; instead you use the unsuffixed name (for example MessageBox()). The Windows headers define this name as a macro that expands to one of the two functions, depending on whether the preprocessor symbol UNICODE is defined (http://msdn.microsoft.com/en-us/library/xxh1wfhz.aspx).
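
The selection mechanism can be imitated without any Windows headers. In the sketch below ShowTextA, ShowTextW, ShowText, TEXT_LITERAL and TextChar are hypothetical stand-ins invented for this example; the real headers apply the same pattern to names such as MessageBox, TEXT() and TCHAR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for an A/W function pair. They only report how
// many code units the text contains.
std::size_t ShowTextA(const char* text) {
    assert(text != nullptr);
    return std::char_traits<char>::length(text);
}

std::size_t ShowTextW(const wchar_t* text) {
    assert(text != nullptr);
    return std::char_traits<wchar_t>::length(text);
}

// One unsuffixed name that the preprocessor maps to the W or A function,
// plus a matching character type and string literal macro.
#ifdef UNICODE
  #define ShowText ShowTextW
  #define TEXT_LITERAL(s) L##s
  typedef wchar_t TextChar;
#else
  #define ShowText ShowTextA
  #define TEXT_LITERAL(s) s
  typedef char TextChar;
#endif

// The calling code is identical in both builds.
void check_selection() {
    const TextChar* greeting = TEXT_LITERAL("hello");
    assert(ShowText(greeting) == 5);
}
```

Because the choice is made by the preprocessor, a single source file can be built in either mode, and a mismatch between the literal type and the selected function shows up as a compile error rather than a runtime bug.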

Codepages

A code page is a list of character codes in a certain order. Wikipedia defines it as a synonym for character encoding (http://en.wikipedia.org/wiki/Code_page) but for Windows programming there is a more specific meaning (http://en.wikipedia.org/wiki/Windows_code_page). Different code pages map characters from different languages to the same range of numerical values.

An example of a code page is Windows-1252 (see http://msdn.microsoft.com/en-us/goglobal/cc305145 and http://en.wikipedia.org/wiki/Windows-1252), which is a character encoding of the Latin alphabet. This code page is often confused with the ISO-8859-1 character set and as such is referred to as an “ANSI” code page. The term “ANSI” here is a common misnomer: no Windows code page is in fact an ANSI standard. The term continues to be used for historical reasons.

There is also no ASCII Windows code page, but most Windows code pages are extensions of the ASCII character set. This means that ASCII characters usually have the same numerical representation (code unit) in different code pages (e.g. in an English code page and a French code page). Thus ASCII data just works.

An unfortunate consequence of this during development is that you may think your text handling routines are properly localized when they are not!  
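
A small sketch makes the trap concrete. The two decoders below are deliberately partial and written only for this example (their names, and the kNotCovered constant, are hypothetical); in a real application you would more likely let the Windows API function MultiByteToWideChar() do the conversion.

```cpp
#include <cassert>
#include <cstdint>

// Returned for bytes this sketch does not handle (U+FFFD REPLACEMENT CHARACTER).
const std::uint32_t kNotCovered = 0xFFFD;

// Windows-1252 (Western European): bytes 0xA0-0xFF have the same values
// as the corresponding Unicode code points.
std::uint32_t decode_cp1252(std::uint8_t b) {
    if (b < 0x80) return b;              // ASCII range
    if (b >= 0xA0) return b;             // accented Latin letters and symbols
    return kNotCovered;                  // 0x80-0x9F left out of this sketch
}

// Windows-1251 (Cyrillic): bytes 0xC0-0xFF are the Cyrillic letters
// U+0410 through U+044F.
std::uint32_t decode_cp1251(std::uint8_t b) {
    if (b < 0x80) return b;              // ASCII range
    if (b >= 0xC0) return 0x0410u + (b - 0xC0u);
    return kNotCovered;                  // 0x80-0xBF left out of this sketch
}

// ASCII bytes agree across code pages; bytes above 0x7F do not.
void check_code_pages() {
    assert(decode_cp1252('z') == decode_cp1251('z'));
    assert(decode_cp1252(0xE9) == 0x00E9);   // LATIN SMALL LETTER E WITH ACUTE
    assert(decode_cp1251(0xE9) == 0x0439);   // CYRILLIC SMALL LETTER SHORT I
}
```

Test data made only of ASCII characters passes through both decoders unchanged, which is why a bug in code page handling can go unnoticed until real localized text arrives.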

More information on code pages can be found at http://msdn.microsoft.com/en-us/library/dd317752.aspx and http://msdn.microsoft.com/en-US/library/8w60z792.aspx.

DBCS is not a Two-Byte Encoding

Prior to adopting Unicode, Microsoft used a system called MBCS (multi-byte character set) for encoding large character sets, such as those required by Chinese and Japanese. The most commonly used MBCS in Windows is DBCS (double-byte character set). In much of the MSDN documentation, MBCS is used to describe all non-Unicode support for multibyte characters. In Visual C++, MBCS always means DBCS (http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx).

Despite the name, DBCS is an example of a variable-length encoding. With DBCS, characters can be 1 or 2 bytes in size and their interpretation depends on which code page is in use. In a modern Windows application you should try to use Unicode wherever possible, but DBCS may be necessary for compatibility with various DLLs or older code.
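
To illustrate the variable length, here is a minimal sketch that counts characters rather than bytes in a string encoded with code page 932 (Shift-JIS). The lead-byte ranges are hard-coded for that one code page and the function names are made up for this example; on Windows the API function IsDBCSLeadByteEx() can answer the lead-byte question for a given code page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Lead-byte ranges of code page 932, hard-coded for this sketch.
bool is_lead_byte_cp932(std::uint8_t b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters, not bytes, in a code page 932 string.
std::size_t count_chars_cp932(const std::uint8_t* bytes, std::size_t size) {
    assert(bytes != nullptr || size == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < size; ++count) {
        // A lead byte and its trail byte form one character. A lead byte
        // at the very end is truncated and counted as a single character.
        i += (is_lead_byte_cp932(bytes[i]) && i + 1 < size) ? 2 : 1;
    }
    return count;
}

void check_count() {
    // The letter "A" followed by two kanji: 5 bytes, 3 characters.
    const std::uint8_t text[] = { 0x41, 0x93, 0xFA, 0x96, 0x7B };
    assert(count_chars_cp932(text, sizeof text) == 3);
}
```

Note that the last byte of the sample, 0x7B, is a trail byte here even though the same value is "{" in ASCII. Scanning a DBCS string byte by byte for an ASCII character can therefore match in the middle of a two-byte character.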

Final Words

If you are serious about learning how to write Windows software for international markets, I strongly suggest reading the following articles. Reading them in order should help minimize your confusion:

  1. Unicode on Wikipedia - http://en.wikipedia.org/wiki/Unicode.
  2. Universal Character Set on Wikipedia - http://en.wikipedia.org/wiki/Universal_Character_Set.
  3. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) from Joel on Software - http://www.joelonsoftware.com/articles/Unicode.html.
  4. Globalization Step-by-Step on MSDN - http://msdn.microsoft.com/en-us/goglobal/bb688113.
  5. Internationalization for Windows Applications on MSDN - http://msdn.microsoft.com/en-us/library/dd318661.aspx.
  6. Unicode and MBCS on MSDN - http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx

 
