SLUG Mailing List Archives

[coders] Converting a UTF-8 string to a wchar_t (in C)

Hello all,

I have a C string (char*) that's encoded in UTF-8. I'd like to convert this to a wide string (wchar_t*). I've done plenty of reading about mbstowcs(3), iconv(3) and friends, and from what I understand, I have two options:

1. First, setlocale() to some bogus UTF-8 locale (such as "en_US.UTF-8"), and then use mbstowcs() to perform the conversion.

2. Use the stupendously painful iconv() interface with an iconv_t from "UTF-8" to "WCHAR_T".
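
For illustration, option 1 could be sketched roughly as follows. This is only a sketch under some assumptions: the helper name utf8_to_wcs_via_locale() is made up, and the candidate locale names ("C.UTF-8", then "en_US.UTF-8") may or may not be installed on a given system. It switches LC_CTYPE temporarily, calls mbstowcs(), and then restores whatever locale was active before.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert the NUL-terminated
 * UTF-8 string src into dst, which has room for dst_len wide characters
 * including the terminator. Returns the number of wide characters written
 * (excluding the terminator), or -1 on failure. Input longer than the
 * buffer is silently truncated. */
long utf8_to_wcs_via_locale(const char *src, wchar_t *dst, size_t dst_len)
{
    /* Assumed locale names; neither is guaranteed to exist. */
    static const char *const candidates[] = { "C.UTF-8", "en_US.UTF-8" };
    char saved[256];
    const char *cur;
    size_t i;
    size_t n = (size_t)-1;
    int found = 0;

    if (dst_len == 0)
        return -1;

    /* Remember the current LC_CTYPE so it can be restored afterwards. */
    cur = setlocale(LC_CTYPE, NULL);
    if (cur == NULL || strlen(cur) >= sizeof saved)
        return -1;
    strcpy(saved, cur);

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            found = 1;
            break;
        }
    }

    if (found) {
        /* Leave one slot free for the terminator. */
        n = mbstowcs(dst, src, dst_len - 1);
        if (n != (size_t)-1)
            dst[n] = L'\0';
        setlocale(LC_CTYPE, saved);
    }

    return (n == (size_t)-1) ? -1 : (long)n;
}
```

Since setlocale() changes the locale for the whole process, a sketch like this is only reasonable in single-threaded code.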

So far, I've tried (2) -- the iconv() method -- and it doesn't work for me. It seems to work fine if the characters are ASCII, but the moment it hits any non-ASCII characters, iconv() returns -1 and errno is set to EILSEQ. I'm assuming there are some bugs in my code, which is no surprise considering how annoying iconv() is to use.
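
For comparison, a minimal sketch of the iconv() route is below. It is not a diagnosis of the EILSEQ failure described above, just one way the calls can fit together. The helper name utf8_to_wcs_via_iconv() is made up, and the "WCHAR_T" encoding name is an assumption: glibc and GNU libiconv accept it, but POSIX does not require it. One detail that is easy to get backwards is that iconv_open() takes the destination encoding first and the source encoding second.

```c
#include <iconv.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not a library function): convert the NUL-terminated
 * UTF-8 string src into dst, which has room for dst_len wide characters
 * including the terminator. Returns the number of wide characters written
 * (excluding the terminator), or -1 on failure (invalid input, output
 * buffer too small, or conversion not supported). */
long utf8_to_wcs_via_iconv(const char *src, wchar_t *dst, size_t dst_len)
{
    iconv_t cd;
    char *in = (char *)src;   /* iconv() takes char **, but does not modify the input bytes */
    char *out = (char *)dst;
    size_t in_left, out_left, rc, n;

    if (dst_len == 0)
        return -1;

    in_left = strlen(src);                       /* input size in BYTES */
    out_left = (dst_len - 1) * sizeof(wchar_t);  /* output size in BYTES, minus terminator */

    /* Destination encoding first, source encoding second. */
    cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return -1;   /* errno is EILSEQ, EINVAL or E2BIG */

    /* Work out how many wide characters were produced. */
    n = (dst_len - 1) - out_left / sizeof(wchar_t);
    dst[n] = L'\0';
    return (long)n;
}
```

Both length arguments are byte counts rather than character counts, and iconv() advances the in and out pointers as it goes, which is why the sketch works on copies of src and dst.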

So instead of actually trying to fix the bugs, I figure that using mbstowcs() is probably easier than trying to work around iconv()'s brain damage. The thing is, surely there _must_ be some way to tell mbstowcs() that the source string to convert is in UTF-8, besides using setlocale() with a dummy UTF-8 locale. I'm only concerned about the encoding type after all, not what language it's in, and I feel quite yucky doing something like setlocale(LC_CTYPE, "en_US.UTF-8"), because I'm not in the USA.

Is there something I'm missing, or is that the way that everybody really does it? I'm thinking that converting between UTF-8 and wchar_t must be somewhat common these days, but Googling for "convert utf-8 to wchar_t" really isn't being all that helpful.

(I'm also quite happy to use C++'s locale/facet/codecvt stuff, but the documentation I've found about that so far appears to be equally terse.)


% Andre Pang : trust.in.love.to.save  <http://www.algorithm.com.au/>