SLUG Mailing List Archives
[coders] Converting a UTF-8 string to a wchar_t (in C)
- To: SLUG Coders <coders@xxxxxxxxxxx>
- Subject: [coders] Converting a UTF-8 string to a wchar_t (in C)
- From: Andre Pang <ozone@xxxxxxxxxxxxxxxx>
- Date: Thu, 14 Dec 2006 01:15:12 +1100
I have a C string (char*) that's encoded in UTF-8. I'd like to
convert this to a wide string (wchar_t*). I've done plenty of
reading about mbstowcs(3), iconv(3) and friends, and from what I
understand, I have two options:
1. First, setlocale() to some bogus UTF-8 locale (such as
"en_US.UTF-8"), and then use mbstowcs() to perform the conversion.
2. Use the stupendously painful iconv() interface with an iconv_t
from "UTF-8" to "WCHAR_T".
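For concreteness, option (1) might look roughly like the sketch below. It is only a sketch under assumptions: utf8_to_wcs_locale is a hypothetical helper name, the locale passed in has to be installed on the system, and calling mbstowcs() with a NULL destination to count characters is POSIX behaviour rather than something every C library promises. It also assumes nothing else in the program changes the locale while it runs, since setlocale() affects the whole process.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a UTF-8 string to a newly allocated wide
 * string by temporarily switching LC_CTYPE to locale_name.
 * Returns NULL if the locale is unavailable, the input is not valid in
 * that locale's encoding, or memory runs out. Caller frees the result. */
wchar_t *utf8_to_wcs_locale(const char *utf8, const char *locale_name)
{
    /* Copy the current LC_CTYPE name; the string setlocale() returns
     * may be overwritten by the next setlocale() call. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved = NULL;
    if (cur != NULL) {
        saved = malloc(strlen(cur) + 1);
        if (saved != NULL)
            strcpy(saved, cur);
    }

    wchar_t *out = NULL;
    if (setlocale(LC_CTYPE, locale_name) != NULL) {
        /* First pass: count wide characters (POSIX behaviour). */
        size_t n = mbstowcs(NULL, utf8, 0);
        if (n != (size_t)-1) {
            out = malloc((n + 1) * sizeof *out);
            if (out != NULL)
                mbstowcs(out, utf8, n + 1); /* second pass: convert */
        }
    }

    /* Put the previous LC_CTYPE back, if we managed to save it. */
    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return out;
}
```

A call such as utf8_to_wcs_locale(s, "en_US.UTF-8") returns NULL when that locale is not installed, so the caller has to be ready for that case.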
So far, I've tried (2) -- the iconv() method -- and it doesn't work
for me. It seems to work fine if the characters are ASCII, but the
moment it actually hits any non-ASCII characters, iconv() throws a
return code of -1 and errno's set to EILSEQ. I'm assuming there are
some bugs in my code, which is no surprise considering how annoying
iconv() is to use.
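For comparison, a minimal sketch of the iconv() route is below. It is not the code described above, and it does not claim to explain the EILSEQ failure. utf8_to_wcs_iconv is a hypothetical name, and the sketch assumes the local iconv accepts "WCHAR_T" as a target encoding. On some implementations the meaning of "WCHAR_T" may depend on the current locale, so non-ASCII input could still fail in the default "C" locale.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to a newly
 * allocated wide string using iconv(). Returns NULL on any failure
 * (unsupported conversion, invalid or truncated input, out of memory).
 * Caller frees the result. */
wchar_t *utf8_to_wcs_iconv(const char *utf8)
{
    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(utf8);
    size_t max_chars = in_left; /* at most one wchar_t per input byte */
    wchar_t *out = malloc((max_chars + 1) * sizeof *out);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() advances these pointers and decrements the byte counts. */
    char *in_ptr = (char *)utf8; /* iconv() takes a non-const char ** */
    char *out_ptr = (char *)out;
    size_t out_left = max_chars * sizeof *out; /* in bytes, not chars */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out); /* errno holds EILSEQ, EINVAL or E2BIG here */
        return NULL;
    }

    /* iconv() does not write a terminator; add one after the output. */
    size_t written = (size_t)(out_ptr - (char *)out) / sizeof *out;
    out[written] = L'\0';
    return out;
}
```

Two details that are easy to get wrong with this interface: the output size is given in bytes rather than in wchar_t units, and iconv_open() takes the destination encoding as its first argument.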
So instead of actually trying to fix the bugs, I figure that using
mbstowcs() is probably easier than trying to work around iconv()'s
brain damage. The thing is, surely there _must_ be some way to tell
mbstowcs() that the source string to convert is in UTF-8, besides
using setlocale() with a dummy UTF-8 locale. I'm only concerned
about the encoding type after all, not what language it's in, and I
feel quite yucky doing something like setlocale(LC_CTYPE,
"en_US.UTF-8"), because I'm not in the USA.
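One way to avoid hard-coding a country is to adopt whatever locale the user's environment specifies and then check that its character encoding is UTF-8. The sketch below assumes a POSIX system with nl_langinfo(); ctype_is_utf8 is a hypothetical name. It only helps when the environment really is set to a UTF-8 locale, so it is not a general answer to the question.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: switch LC_CTYPE to the locale named by the
 * environment (LANG, LC_ALL, LC_CTYPE) and report whether its codeset
 * is UTF-8. Returns 1 if so, 0 otherwise. The LC_CTYPE change stays in
 * effect after the call. */
int ctype_is_utf8(void)
{
    /* An empty name means "use the environment's locale settings". */
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;

    /* CODESET names the character encoding of the current locale. */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

If this returns 1, mbstowcs() will interpret its input as UTF-8 without the program ever naming a particular language or country.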
Is there something I'm missing, or is that the way everybody really
does it? I'm thinking that converting between UTF-8 and wchar_t must
be somewhat common these days, but Googling for "convert utf-8 to
wchar_t" really isn't being all that helpful.
(I'm also quite happy to use C++'s locale/facet/codecvt stuff, but
the documentation I've found about that so far appears to be equally
% Andre Pang : trust.in.love.to.save <http://www.algorithm.com.au/>