Discussion:
popen and fread/fgets and UNICODE
n***@rediffmail.com
2008-07-17 06:08:32 UTC
Hi,

I am new to Linux and trying to use popen. The code I am writing will
eventually _have_ to be internationalized. Coming from Windows, I am
trying to understand how UNICODE support should be added. On Windows I
can use an #ifdef and call _popen or _wpopen, but on Linux that doesn't
seem to be an option.

After going through some posts I found that for commands that have non-
ASCII characters, I need to convert them to ASCII characters for popen
to accept them (and from what I have read, it works).

But once the command completes successfully, again from what I have
read so far, if the file contains Unicode characters, fread/fgets fail.

C++ could perhaps have helped with this, but it seems there is no easy
way to attach the FILE* returned by popen to any of the iostream objects.

I wanted to know if there is a recommended way to deal with this.

Please point me to the correct group for this question if this isn't
the right one.

Thanks in advance,
-Neel.
Ulrich Eckhardt
2008-07-19 07:10:46 UTC
Post by n***@rediffmail.com
I am new to Linux and trying to use popen. The code I am writing will
eventually _have_ to be internationalized. Coming from Windows, I am
trying to understand how UNICODE support should be added. On Windows I
can use an #ifdef and call _popen or _wpopen, but on Linux that doesn't
seem to be an option.
Just a few things up front:
- Unicode is a standard.
- UNICODE/_UNICODE are macros that control the meaning of TCHAR in the
win32 API (a simplified sketch of what they expand to follows this list).
- With _UNICODE defined, TCHAR becomes WCHAR, which is a wchar_t for most
compilers. In any case, WCHAR is a 16-bit character type, and the
encoding that MS Windows uses is UTF-16.
Post by n***@rediffmail.com
After going through some posts I found that for commands that have non-
ASCII characters, I need to convert them to ASCII characters for popen
to accept them (and from what I have read, it works).
No. You cannot convert non-ASCII characters to ASCII characters. What you
can do is convert a wchar_t string containing e.g. UTF-32 encoded Unicode
to a char string containing UTF-8 encoded Unicode. Note: go and read what
the UTF representations are; this is crucial for understanding.

Now, fopen() or popen() use char-based strings. However, the encoding of
those strings is not fixed. Rather, it depends on the system's locale and
can be one of the various codepages (see e.g. the iso_8859-15 manpage) or
even UTF-8. The function to convert a wchar_t Unicode string to a char
string with the locale's encoding is called wcstombs(). Note that this
function can fail, e.g. trying to represent Thai characters in a Cyrillic
codepage will fail. For me, that is a reason to switch systems to using a
UTF-8 locale, at least for the charset, because then these conversion
errors can't happen.
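
As a minimal sketch of that conversion step (the helper name run_command
is mine, and it assumes setlocale(LC_ALL, "") was called at program
start):

#include <stdio.h>    // popen
#include <stdlib.h>   // wcstombs, MB_CUR_MAX
#include <string>
#include <vector>

// Sketch only: convert a wide command string to the locale's char
// encoding, then hand it to popen(). Returns NULL if the command is
// not representable in the current locale.
FILE* run_command(const std::wstring& cmd)
{
    // Worst case: every wide character needs MB_CUR_MAX bytes.
    std::vector<char> narrow(cmd.size() * MB_CUR_MAX + 1);
    size_t n = wcstombs(&narrow[0], cmd.c_str(), narrow.size());
    if (n == (size_t)-1)
        return NULL;   // e.g. Thai text in a Cyrillic locale
    return popen(&narrow[0], "r");
}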
Post by n***@rediffmail.com
But once the command completes successfully, again from what I have
read so far, if the file contains Unicode characters, fread/fgets fail.
fread() only returns a stream of bytes, without interpreting them. If the
file contains UTF-16, you will not receive a useful char string from it;
you have to convert first.
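
If the child process writes text in the locale's encoding, the reverse
conversion (mbstowcs()) applies after fgets(). A sketch, with the helper
name read_wide_line being mine:

#include <stdio.h>
#include <stdlib.h>   // mbstowcs
#include <string>

// Sketch: read one line of bytes from the pipe and convert it from
// the locale's encoding to wchar_t. Assumes the child writes in the
// current locale's encoding (not, say, UTF-16). A multibyte sequence
// split across buffer boundaries would need more care than this.
bool read_wide_line(FILE* pipe, std::wstring& out)
{
    char line[1024];
    if (!fgets(line, sizeof line, pipe))
        return false;                    // EOF or read error
    wchar_t wide[1024];                  // never longer than the input
    size_t n = mbstowcs(wide, line, 1024);
    if (n == (size_t)-1)
        return false;                    // invalid byte sequence
    out.assign(wide, n);
    return true;
}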
Post by n***@rediffmail.com
C++ could perhaps have helped with this, but it seems there is no easy
way to attach the FILE* returned by popen to any of the iostream objects.
Actually, I think that many implementations of C++ offer a way to attach a
FILE* to a stream, but those are nonstandard extensions. Using some #ifdefs
this can be made to work well enough; alternatively, you could write
your own streambuffer.
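GCC's __gnu_cxx::stdio_filebuf (from <ext/stdio_filebuf.h>) is one such
nonstandard extension. A hand-written streambuffer over a FILE* could
look roughly like this minimal sketch (unbuffered, read-only, error
handling omitted):

#include <cstdio>
#include <istream>
#include <streambuf>

// Minimal read-only streambuf over a FILE*; it hands out one
// character at a time, so production code would want a real buffer.
class file_streambuf : public std::streambuf
{
public:
    explicit file_streambuf(FILE* f) : f_(f) {}
protected:
    virtual int_type underflow()
    {
        int c = std::fgetc(f_);
        if (c == EOF)
            return traits_type::eof();
        ch_ = traits_type::to_char_type(c);
        setg(&ch_, &ch_, &ch_ + 1);   // get area = this one char
        return traits_type::to_int_type(ch_);
    }
private:
    FILE* f_;
    char ch_;
};

// Usage: file_streambuf buf(pipe);   // pipe from popen()
//        std::istream in(&buf);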


One piece of general advice on getting a program to support Unicode: do
not make it dependent on the system's way of dealing with it. Rather,
convert from your internal representation to the required format at the
places where you interact with the system, such as filenames, writing
text to a GUI, or writing content to disk. These conversions might seem
cumbersome, but with the possible exception of writing to disk, they
shouldn't affect performance much.
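
To make that boundary idea concrete, here is a sketch of a disk-boundary
helper built on iconv (the function name to_utf8 is mine; glibc accepts
"WCHAR_T" as an encoding name, other platforms may spell it differently):

#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch: convert the internal wchar_t representation to UTF-8 right
// before writing it to disk. Error handling kept to a minimum.
std::string to_utf8(const std::wstring& in)
{
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::vector<char> out(in.size() * 4 + 4);  // UTF-8 worst case
    char* src = (char*)in.c_str();   // iconv wants non-const char*
    size_t src_left = in.size() * sizeof(wchar_t);
    char* dst = &out[0];
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("conversion to UTF-8 failed");
    return std::string(&out[0], out.size() - dst_left);
}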

At the moment, I'm working on a program that uses std::wstring (i.e. a
wchar_t string) internally and UTF-8 for file content. This works on both
win32 (desktop and CE variants) and POSIX platforms like Linux. If I
could, I would take a look at libICU for my next project, though, because
it seems much more complete than just using std::wstring.


Uli
