Using Unicode in C/C++
[ Written by Evan Jones ]
[ 2006-April-15 14:11 ]
I have written before about How to use Unicode with Python, but until now I had never figured out how to use Unicode in Standard C. I managed to find an extremely helpful UTF-8 and Unicode FAQ which answers most of the questions, particularly the section beginning with C Support for Unicode and UTF-8. Additionally, Tim Bray's Characters vs. Bytes provides a very readable overview of Unicode encodings.
The good news is that if you use wchar_t* strings and the family of functions related to them, such as wprintf, wcslen, and wcscat, you are dealing with Unicode values on most modern systems. (The bounded wcslcat is a BSD extension, not part of Standard C.) In the C++ world, you can use std::wstring to provide a friendly interface. My only complaint is that on most Unix-like systems these are 32-bit (4-byte) characters, so they are memory hogs for every language, plain English included. The reason for this choice is that it guarantees each possible character can be represented by one value. Note that the C standard does not fix the size of wchar_t: on Windows it is 16 bits, so the guarantee does not hold there. Contrary to popular belief, it is possible for a Unicode character to require multiple 16-bit values: in UTF-16, any code point above U+FFFF is stored as a pair of 16-bit values, called a surrogate pair. While these characters are rare, it is dangerous to hard code that belief into your software, particularly as computers spread through more of the world, and as people create new characters.
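To make the point about 16-bit values concrete, here is a minimal sketch of how a single code point maps to UTF-16. The function name utf16_encode is hypothetical and not part of any standard library; the arithmetic follows the UTF-16 surrogate scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written,
 * or 0 if cp is not a valid code point for encoding. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single 16-bit value */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+0041 (the letter A) needs one 16-bit value, while U+1D11E (the musical G clef) needs two: 0xD834 followed by 0xDD1E. Code that assumes one 16-bit value per character would split that character in half.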
The bad news is that converting to and from these wchar_t values involves C's complex locale library. I understand that this locale stuff is complicated because internationalization is hard, but I have never figured this API out. On my system, calling setlocale(LC_CTYPE, "en_CA.UTF-8") enabled UTF-8 output, although there is probably a better way to do it; for example, setlocale(LC_CTYPE, "") selects whatever locale the user's environment specifies instead of hard-coding one. If you need to do conversions to and from specific encodings, I recommend using iconv instead. If your system doesn't supply an implementation, or if the one it supplies doesn't support Unicode, use libiconv, which is released under the LGPL, allowing it to be used with commercial as well as open source code. If you want a nice wrapper around libiconv, you can use GLib, which includes Unicode string routines, among other things.
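As a rough illustration of the iconv route, the sketch below converts a UTF-8 string into an array of 32-bit code points. The helper name utf8_to_utf32 is hypothetical. It assumes the system's iconv accepts the encoding names "UTF-8", "UTF-32LE", and "UTF-32BE" (glibc and libiconv both do), and that iconv takes a char ** input pointer as POSIX specifies. On systems where iconv lives in a separate library, you may need to link with -liconv.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to UTF-32
 * code points in host byte order. Returns the number of code points
 * written to out, or (size_t)-1 on error (invalid input, or out too small). */
size_t utf8_to_utf32(const char *utf8, uint32_t *out, size_t out_cap)
{
    assert(utf8 != NULL && out != NULL);

    /* Pick the UTF-32 variant matching this machine's byte order. Naming
     * the byte order explicitly also avoids a byte order mark in the output. */
    const uint16_t probe = 1;
    const char *to = (*(const unsigned char *)&probe == 1) ? "UTF-32LE"
                                                           : "UTF-32BE";

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* conversion not supported */

    char *in = (char *)utf8;            /* iconv does not modify the input */
    size_t in_left = strlen(utf8);
    char *dst = (char *)out;
    size_t out_left = out_cap * sizeof *out;

    size_t rc = iconv(cd, &in, &in_left, &dst, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;              /* EILSEQ, EINVAL, or E2BIG */

    return out_cap - out_left / sizeof *out;
}
```

Because the conversion names both encodings explicitly, it does not depend on the current locale, which sidesteps setlocale entirely. A real program would probably inspect errno to tell malformed input apart from a full output buffer.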
There are two questions when writing software that must deal with Unicode: What format do you use for data that goes in and out of your software, and what format do you use internally? For the internal format, as Tim Bray explains: pick UTF-8 or UTF-16 and stick with it. It hardly matters which one, as long as you are consistent. I would add a small note: use UTF-32 if it makes your life easier to assume each character is a single value, or if you don't care about memory. On platforms where wchar_t is 32 bits, UTF-32 is also better supported by the standard C and C++ libraries. For external data, things get much more difficult. Tim Bray recommends using XML, but I would only do that if the application already has a dependency on an XML parser. If it doesn't, I prefer UTF-8, mostly because some of the time it will work as expected with older software: ASCII text is already valid UTF-8, and no other character is encoded using ASCII byte values.
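If you do keep UTF-32 internally and write UTF-8 externally, the encoding step is small enough to sketch by hand. The function name utf8_encode below is hypothetical; in practice iconv or a library such as GLib can do this for you.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes to out and returns how many were written,
 * or 0 if cp is not a valid code point for encoding. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII: the byte is unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes: 11110xxx then three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

The first branch is why UTF-8 often works with older software: code points below 0x80 come out as the same single byte they have in ASCII, and every byte of a multi-byte sequence has its high bit set, so it can never be mistaken for an ASCII character or a terminating NUL.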
Other Resources
- How to Use UTF-8 with Python by Evan Jones - A quick and dirty guide to using Unicode in Python.
- On the Goodness of Unicode by Tim Bray - An essay about why you should support Unicode.
- The [...] Minimum Every Software Developer [...] Must Know About Unicode [...] by Joel Spolsky - Another essay about why Unicode is good, and an introduction to how it works.
- Characters vs. Bytes by Tim Bray - An introduction to the details of Unicode encoding.
- Unicode in Python by Thijs van der Vossen - Another quick and dirty introduction to Python's Unicode support.
- Python Unicode Objects by Fredrik Lundh - A collection of tips about Python's Unicode support, like using it in regular expressions.
- Unicode for Programmers by Jason Orendorff - A detailed guide to Unicode, geared towards Python, Java, and Windows programmers.
- Unicode Home Page - The official web site for the Unicode specifications.