Using Unicode in C/C++ (evanjones.ca)

[ 2006-April-15 14:11 ]

I have written before about How to use Unicode with Python, but until now I had never figured out how to use Unicode in Standard C. I managed to find an extremely helpful UTF-8 and Unicode FAQ which answers most of the questions, particularly the section beginning with C Support for Unicode and UTF-8. Additionally, Tim Bray's Characters vs. Bytes provides a very readable overview of Unicode encodings.

The good news is that if you use wchar_t* strings and the family of functions related to them such as wprintf, wcslen, and wcscat, you are, on most modern systems, dealing with Unicode values. In the C++ world, you can use std::wstring to provide a friendly interface. My only complaint is that on my system these are 32-bit (4 byte) characters, so they are memory hogs for all languages. (The C standard leaves the width of wchar_t up to the implementation; on Windows, for example, it is 16 bits.) The reason for the 32-bit choice is that it guarantees each possible character can be represented by one value. Contrary to popular belief, it is possible for a Unicode character to require multiple 16-bit values. While these characters are rare, it is dangerous to hard code that belief into your software, particularly as computers spread through more of the world, and as people create new characters.
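To make the point about multiple 16-bit values concrete, here is a minimal sketch of how UTF-16 handles a code point above U+FFFF. The helper names utf16_units_for and utf16_surrogates are made up for this illustration, and both assume they are handed a valid code point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many 16-bit units UTF-16 needs for one code point.
 * Assumes code_point is a valid Unicode code point. */
int utf16_units_for(uint32_t code_point) {
    assert(code_point <= 0x10FFFF);
    return code_point >= 0x10000 ? 2 : 1;
}

/* Hypothetical helper: split a code point above U+FFFF into the UTF-16
 * surrogate pair (high unit first, then low unit). */
void utf16_surrogates(uint32_t code_point, uint16_t *high, uint16_t *low) {
    assert(code_point >= 0x10000 && code_point <= 0x10FFFF);
    uint32_t v = code_point - 0x10000;       /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));  /* top 10 bits */
    *low = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
}
```

For example, U+1D11E (the musical G clef) becomes the two units 0xD834 and 0xDD1E, whereas a 32-bit wchar_t holds it in a single value.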

The bad news is that converting to and from these wchar_t values involves C's complex locale library. I understand that this locale stuff is complicated because internationalization is hard, but I have never figured this API out. On my system, calling setlocale(LC_CTYPE, "en_CA.UTF-8") enabled UTF-8 output, although there probably is a better way to do it; passing an empty string, setlocale(LC_CTYPE, ""), selects whatever locale the user's environment specifies instead of hard coding one. If you need to do conversions to and from specific encodings, the minimal approach is to use iconv(), which is specified by POSIX and included in most C libraries. If your system doesn't supply an implementation, libiconv is released under the LGPL licence, allowing it to be used with commercial as well as open source code. If you need to manipulate UTF-8 strings, you may want to consider GLib, which is under the LGPL licence and includes many helpful UTF-8 string routines. Finally, if you need to do any complicated text manipulation, ICU (International Components for Unicode) is the way to go. It can do locale- and language-specific tasks such as formatting dates and times, uppercasing and lowercasing strings, and alphabetically sorting text.
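As a rough sketch of the locale-based conversion described above, the standard setlocale() and mbstowcs() functions can be combined as follows. The wrapper decode_with_locale is a name invented for this example, and which locale names are accepted depends entirely on what is installed on your system.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode a multibyte string into wchar_t values using
 * the encoding of the named locale. Returns the number of wide characters
 * written (not counting the terminator), or (size_t)-1 on failure. */
size_t decode_with_locale(const char *locale_name, const char *src,
                          wchar_t *dst, size_t dst_len) {
    assert(locale_name != NULL && src != NULL && dst != NULL && dst_len > 0);
    if (setlocale(LC_CTYPE, locale_name) == NULL) {
        return (size_t)-1; /* the system rejected this locale name */
    }
    /* Leave room for the terminator, which mbstowcs does not write if the
     * output buffer fills up. */
    size_t n = mbstowcs(dst, src, dst_len - 1);
    if (n == (size_t)-1) {
        return n; /* src is not valid in the locale's encoding */
    }
    dst[n] = L'\0';
    return n;
}
```

Calling decode_with_locale("en_CA.UTF-8", input, buffer, length) would decode UTF-8 input on a system where that locale exists. Note that setlocale() changes process-wide state, so a real program would normally call it once at startup rather than inside a helper like this.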

There are two questions when writing software that must deal with Unicode: What format do you use for data that goes in and out of your software, and what format do you use internally? For the internal format, as Tim Bray explains: pick UTF-8 or UTF-16 and stick with it. It hardly matters which one, as long as you are consistent. However, it seems to me that UTF-8 is the standard choice of most software I encounter these days, so that is what I personally recommend. For external data, things get much more difficult. Tim Bray recommends using XML, but I would only do that if the application already has a dependency on an XML parser. If it doesn't, I prefer plain UTF-8, mostly because some of the time it will work as expected with older software.
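One reason UTF-8 only works "some of the time" with older software is that byte-oriented functions such as strlen() count bytes, not characters. The sketch below shows the difference; utf8_code_points is a hypothetical helper written for this example, and it assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a UTF-8 string by skipping
 * continuation bytes, which always have the bit pattern 10xxxxxx.
 * Assumes s is valid, NUL-terminated UTF-8. */
size_t utf8_code_points(const char *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count; /* first byte of a new code point */
        }
    }
    return count;
}
```

For the UTF-8 bytes of "café" (the é is encoded as 0xC3 0xA9), strlen() reports 5 while utf8_code_points() reports 4. Pure ASCII text gives the same answer from both, which is why older software often keeps working.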

Other Resources

How to Use UTF-8 with Python by Evan Jones - A quick and dirty guide to using Unicode in Python.
On the Goodness of Unicode by Tim Bray - An essay about why you should support Unicode.
The [...] Minimum Every Software Developer [...] Must Know About Unicode [...] by Joel Spolsky - Another essay about why Unicode is good, and an introduction to how it works.
Characters vs. Bytes by Tim Bray - An introduction to the details of Unicode encoding.
Unicode in Python by Thijs van der Vossen - Another quick and dirty introduction to Python's Unicode support.
Python Unicode Objects by Fredrik Lundh - A collection of tips about Python's Unicode support, like using it in regular expressions.
Unicode for Programmers by Jason Orendorff - A detailed guide to Unicode, geared towards Python, Java, and Windows programmers.
Unicode Home Page - The official web site for the Unicode specifications.