The Unicode HOWTO: Locale setup

You can already use any Unicode characters in file names. Neither the kernel nor the file utilities need modification. This is because file names in the kernel can be anything not containing a null byte, and '/' is used to delimit subdirectories. When text is encoded in UTF-8, non-ASCII characters are never encoded using null bytes or slashes. All that happens is that file and directory names occupy more bytes than they contain characters. For example, a filename consisting of five Greek characters will appear to the kernel as a 10-byte filename. The kernel does not know (and does not need to know) that these bytes are displayed as Greek.
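
The byte count is easy to check from the shell. The following is a minimal sketch, not part of the original text: the helper names greek_name, byte_count and has_forbidden_byte are made up for illustration, and the five letters (alpha through epsilon) are written as octal escapes so that the example does not depend on the encoding of the script itself.

```shell
# Five Greek letters (alpha..epsilon) as UTF-8 bytes, written in octal.
greek_name() {
    printf '\316\261\316\262\316\263\316\264\316\265'
}

# Number of bytes the kernel would see for this name.
byte_count() {
    greek_name | wc -c | tr -d ' '
}

# Succeeds if a 0x00 byte or a 0x2f ('/') byte occurs in the name.
has_forbidden_byte() {
    greek_name | od -An -v -tx1 | grep -Eq '(^| )(00|2f)( |$)'
}

echo "bytes: $(byte_count)"
if has_forbidden_byte; then
    echo "name contains a null byte or a slash"
else
    echo "name contains no null byte and no slash"
fi
```

Each of the five characters takes two bytes in UTF-8, which is where the 10-byte figure comes from.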

This is the general theory, as long as your files stay inside Linux. On filesystems which are used from other operating systems, you have mount options to control conversion of filenames to/from UTF-8:

The "vfat" filesystem has a mount option "utf8". See file:/usr/src/linux/Documentation/filesystems/vfat.txt. When you give an "iocharset" mount option different from the default (which is "iso8859-1"), the results with and without "utf8" are not consistent. Therefore I don't recommend the "iocharset" mount option.
The "msdos" and "umsdos" filesystems have the same mount option, but it appears to have no effect.
The "iso9660" filesystem has a mount option "utf8". See file:/usr/src/linux/Documentation/filesystems/isofs.txt.
Since Linux 2.2.x kernels, the "ntfs" filesystem has a mount option "utf8". See file:/usr/src/linux/Documentation/filesystems/ntfs.txt.

The other filesystems (nfs, smbfs, ncpfs, hpfs, etc.) don't convert filenames; therefore they support Unicode file names in UTF-8 encoding only if the other operating system supports them. Recall that to enable a mount option for all future remounts, you add it to the fourth column of the corresponding /etc/fstab line.
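
To illustrate what "the fourth column" means, here is a small sketch that appends an option to the options field of fstab-style lines read from standard input. The function name add_fstab_option, the device /dev/hda1 and the mount point /dos are assumptions chosen for the example. The filter only prints the modified lines; it does not touch /etc/fstab.

```shell
# Usage: add_fstab_option FSTYPE OPTION < fstab-lines
# Appends OPTION to the fourth column of every line whose third column
# (the filesystem type) equals FSTYPE. Comments and short lines pass through.
add_fstab_option() {
    fstype=$1
    option=$2
    awk -v fstype="$fstype" -v option="$option" '
        /^[ \t]*#/ || NF < 4 { print; next }
        $3 == fstype { $4 = $4 "," option }
        { print }
    '
}

# A hypothetical vfat entry, before and after.
printf '%s\n' '/dev/hda1 /dos vfat defaults 0 0'
printf '%s\n' '/dev/hda1 /dos vfat defaults 0 0' | add_fstab_option vfat utf8
```

Note that awk rebuilds a modified line with single spaces between the fields, so any column alignment is lost, and the sketch does not check whether the option is already present.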

3.2 Ttys & the kernel

Ttys are a kind of bidirectional pipe between two programs, allowing fancy features like echoing or command-line editing. When you execute the "cat" command without arguments in an xterm, you can enter and edit any number of lines, and they will be echoed back line by line. With UTF-8 input, the kernel's editing actions are not correct; in particular, the Backspace (erase) key and the Tab key are not treated correctly.

To fix this, you need to:

apply the kernel patch linux-2.0.35-tty.diff or linux-2.2.9-tty.diff or linux-2.3.12-tty.diff and recompile your kernel,
if you are using glibc2, apply the patch glibc211-tty.diff and recompile your libc (or if you are not so adventurous, it is sufficient to patch an already installed include file: glibc-tty.diff),
apply the patch stty.diff to GNU sh-utils-1.16b, and rebuild the "stty" program, then test it using "stty -a" and "stty iutf8",
add the command "stty iutf8" to the "unicode_start" script, and add the command "stty -iutf8" to the "unicode_stop" script,
apply the patch xterm.diff to xterm-109, and rebuild "xterm", then test it by starting "xterm -u8"/"xterm +u8" and running "stty -a" and interactive "cat" inside it.
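
When testing with "stty -a", it can help to pick the flag out of the listing mechanically. The sketch below assumes that the patched stty prints the flag as "iutf8" when it is set and "-iutf8" when it is cleared, like its other flags; the function name iutf8_state is made up for this example.

```shell
# Reads the output of "stty -a" on standard input and prints
# "on", "off", or "unknown" (flag not listed at all).
iutf8_state() {
    flag=$(tr ' ;' '\n\n' | grep -E -x -- '-?iutf8' | head -n 1)
    case $flag in
        iutf8)  echo on ;;
        -iutf8) echo off ;;
        *)      echo unknown ;;
    esac
}

# Only query the terminal if standard input really is one.
if [ -t 0 ]; then
    stty -a | iutf8_state
fi
```

A result of "unknown" would suggest that the running stty (or kernel) does not know the flag, that is, that one of the patches is missing.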

To make this fix persistent across rlogin and telnet, you also need to:

Define new values for the TERM environment variable, "linux-utf8" as an alias to "linux", and "xterm-utf8" as an alias to "xterm". If your system has the ncurses library and the /usr/lib/terminfo (or /usr/share/terminfo) database, do this by running
$ tic linux-utf8.terminfo
$ tic xterm-utf8.terminfo
as non-root (this will create the terminfo entries in your $HOME/.terminfo directory). Here are linux-utf8.terminfo and xterm-utf8.terminfo. I don't recommend running this as root, because it will create the terminfo entries in /usr/lib/terminfo where they might be erased next time you upgrade your system. If your system has an /etc/termcap file, you should also edit that file: copy the linux and xterm entries and give them the new names "linux-utf8" and "xterm-utf8". For an example, see termcap.diff.
Each time you call "unicode_start" or "unicode_stop" from the console, also execute "export TERM=linux-utf8" or "export TERM=linux", respectively.
Apply the patch xterm2.diff to xterm-109, rebuild "xterm", and remove any "XTerm*termName" line from /usr/X11R6/lib/X11/app-defaults/XTerm and $HOME/.Xdefaults. Now xterm sets the TERM variable to "xterm-utf8" instead of "xterm" when running in UTF-8 mode.
Apply the patches netkit.diff, netkitb.diff and telnet.diff and rebuild "rlogind" and "telnetd". Now rlogin and telnet put the tty into UTF-8 editing mode whenever the TERM environment variable is "linux-utf8" or "xterm-utf8".
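
The rule that the patched rlogind and telnetd apply (UTF-8 editing mode exactly when TERM is "linux-utf8" or "xterm-utf8") can also be expressed in a login script. This is only a sketch of the same decision, not what the patches themselves do; the function name term_wants_utf8 is an assumption, and "stty iutf8" requires the patched stty described above.

```shell
# Succeeds if the given TERM value is one of the two UTF-8 aliases.
term_wants_utf8() {
    case $1 in
        linux-utf8|xterm-utf8) return 0 ;;
        *)                     return 1 ;;
    esac
}

# Switch the tty into UTF-8 editing mode only when we are on a tty
# and TERM asks for it.
if [ -t 0 ] && term_wants_utf8 "${TERM:-}"; then
    stty iutf8
fi
```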

3.3 General data conversion

You will need a program to convert your locally (probably ISO-8859-1) encoded texts to UTF-8. (The alternative would be to keep using texts in different encodings on the same machine; this is not fun in the long run.) One such program is `iconv', which comes with glibc-2.1. Simply use


$ iconv --from-code=ISO-8859-1 --to-code=UTF-8 < old_file > new_file

Here are two handy shell scripts, called "i2u" i2u.sh (for ISO to UTF conversion) and "u2i" u2i.sh (for UTF to ISO conversion). Adapt according to your current 8-bit character set.
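
As a rough idea of what such helpers can look like, here is a sketch written as two shell filters around iconv. These are not the i2u.sh and u2i.sh scripts mentioned above, only stand-ins with the same names that read standard input and write standard output; the short options -f and -t are equivalent to --from-code and --to-code.

```shell
# ISO-8859-1 to UTF-8, as a filter.
i2u() {
    iconv -f ISO-8859-1 -t UTF-8
}

# UTF-8 to ISO-8859-1, as a filter.
u2i() {
    iconv -f UTF-8 -t ISO-8859-1
}

# A six-byte ISO-8859-1 line containing u-umlaut (octal 374) and
# sharp s (octal 337): show its size before and after conversion.
printf 'Gr\374\337e\n' | wc -c
printf 'Gr\374\337e\n' | i2u | wc -c
```

The converted line is two bytes longer, because each of the two non-ASCII characters becomes a two-byte sequence in UTF-8.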

If you don't have glibc-2.1 and iconv installed, you can use GNU recode 3.5 instead. "i2u" i2u_recode.sh is "recode ISO-8859-1..UTF-8", and "u2i" u2i_recode.sh is "recode UTF-8..ISO-8859-1". GNU recode is available from ftp://ftp.iro.umontreal.ca/pub/recode/recode-3.5.tar.gz and ftp://ftp.gnu.org/pub/gnu/recode/recode-3.5.tar.gz. Notes: You need GNU recode 3.5 or newer. To compile GNU recode 3.5 on platforms without glibc2 (i.e. on all platforms except recent Linux systems), you need to configure it with --disable-nls, otherwise it won't link.

Alternatively, you can use CLISP. Here are "i2u" i2u.lsp and "u2i" u2i.lsp written in Lisp. Note: You need a CLISP version from July 1999 or newer, available at ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.

Other data conversion programs, less powerful than GNU recode, are `trans' ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/trans113.tar.gz, `tcs' from the Plan9 operating system ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/tcs.tar.gz, and `utrans'/`uhtrans'/`hutrans' ftp://ftp.cdrom.com/pub/FreeBSD/distfiles/i18ntools-1.0.tar.gz by G. Adam Stanislav <adam@whizkidtech.net>.

3.4 Locale environment variables

You may have the following environment variables set, containing locale names:

LANGUAGE: override for LC_MESSAGES, used by GNU gettext only
LC_ALL: override for all other LC_* variables
LC_CTYPE, LC_MESSAGES, LC_COLLATE, LC_NUMERIC, LC_MONETARY, LC_TIME: individual variables for: character types and encoding, natural language messages, sorting rules, number formatting, money amount formatting, date and time display
LANG: default value for all LC_* variables

(See `man 7 locale' for a detailed description.)

Each of the LC_* and LANG variables can contain a locale name of the following form:

language[_territory[.codeset]][@modifier]

where language is an ISO 639 language code (lower case), territory is an ISO 3166 country code (upper case), codeset denotes a character set, and modifier stands for other particular attributes (for example indicating a particular language dialect, or a nonstandard orthography).
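
The structure of such a name can be taken apart with plain shell parameter expansion. The following sketch (the function name parse_locale is made up for this example) splits a name of the form language[_territory[.codeset]][@modifier] into its parts; it does no validation against ISO 639 or ISO 3166.

```shell
# Usage: parse_locale NAME
# Prints the four components of a locale name; missing parts are empty.
parse_locale() {
    name=$1
    modifier= codeset= territory=
    case $name in *@*) modifier=${name#*@}; name=${name%%@*} ;; esac
    case $name in *.*) codeset=${name#*.};  name=${name%%.*} ;; esac
    case $name in *_*) territory=${name#*_}; name=${name%%_*} ;; esac
    language=$name
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$language" "$territory" "$codeset" "$modifier"
}

parse_locale de_DE.UTF-8@euro
parse_locale de
```

The modifier is split off first because it comes last in the name, after the optional territory and codeset.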

LANGUAGE can contain several locale names, separated by colons.

In order to tell your system and all applications that you are using UTF-8, you need to add a codeset suffix of UTF-8 to your locale names. For example, if you were using


LANGUAGE=de:fr:en
LC_CTYPE=de_DE

you would change this to


LANGUAGE=de.UTF-8:fr.UTF-8:en.UTF-8
LC_CTYPE=de_DE.UTF-8
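
If you have many such settings to change, the rewriting can be automated. The sketch below is one possible way to do it, under the assumption that any codeset already present should be replaced and any @modifier kept; the function name add_utf8_codeset is hypothetical. It accepts either a single locale name or a colon-separated list as used by LANGUAGE.

```shell
# Usage: add_utf8_codeset NAME[:NAME...]
# Prints the same list with the codeset of every entry set to UTF-8.
add_utf8_codeset() {
    result=
    rest=$1
    while [ -n "$rest" ]; do
        # Take the next colon-separated entry off the front of the list.
        case $rest in
            *:*) name=${rest%%:*}; rest=${rest#*:} ;;
            *)   name=$rest;       rest= ;;
        esac
        # Keep an optional @modifier, drop any existing codeset.
        case $name in
            *@*) modifier=@${name#*@}; name=${name%%@*} ;;
            *)   modifier= ;;
        esac
        name=${name%%.*}.UTF-8$modifier
        result=${result:+$result:}$name
    done
    printf '%s\n' "$result"
}

add_utf8_codeset de:fr:en
add_utf8_codeset de_DE
```

For example, in a shell startup file you could write LC_CTYPE=$(add_utf8_codeset "$LC_CTYPE") followed by export LC_CTYPE.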

3.5 Creating the locale support files

If you have glibc-2.1 or glibc-2.1.1 or glibc-2.1.2 installed, first check using "localedef --help" that the system directory for character maps is /usr/share/i18n/charmaps. Then apply to the file /usr/share/i18n/charmaps/UTF8 the patch glibc21.diff or glibc211.diff or glibc212.diff, respectively. Then create the support files for each UTF-8 locale you intend to use, for example:


$ localedef -v -c -i de_DE -f UTF8 /usr/share/locale/de_DE.UTF-8

You typically don't need to create locales named "de" or "fr" without country suffix, because these locales are normally only used by the LANGUAGE variable and not by the LC_* variables, and LANGUAGE is only used as an override for LC_MESSAGES.
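
If you need several locales, the localedef call can be generated in a loop. The sketch below only prints the commands (a dry run), so it is safe to try as an ordinary user; the function name localedef_commands and the second locale fr_FR are assumptions for the example.

```shell
# Usage: localedef_commands LOCALE...
# Prints one localedef command per locale, in the form shown above.
localedef_commands() {
    for locale in "$@"; do
        printf 'localedef -v -c -i %s -f UTF8 /usr/share/locale/%s.UTF-8\n' \
            "$locale" "$locale"
    done
}

localedef_commands de_DE fr_FR
```

Once the printed commands look right, they can be run as root, for instance by piping the output into sh.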

3.6 Adding support to the C library

glibc-2.2 will support multibyte locales, in particular the UTF-8 locales created above. But glibc-2.1 and glibc-2.1.1 do not really support them. Therefore the only real effect of the above creation of the /usr/share/locale/de_DE.UTF-8/* files is that `setlocale(LC_ALL,"")' will return "de_DE.UTF-8", according to your environment variables, instead of stripping off the ".UTF-8" suffix.
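
A quick way to see which character set the C library associates with your current environment is the "locale charmap" command, where the locale utility is installed and supports it. This is a different check from calling setlocale directly, so take it as a rough sketch; the function name is_utf8_charmap is made up, and the accepted spellings of the name are an assumption.

```shell
# Succeeds if the given charmap name denotes UTF-8.
is_utf8_charmap() {
    case $1 in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *)                     return 1 ;;
    esac
}

# Ask the locale utility; tolerate systems where it is missing.
charmap=$(locale charmap 2>/dev/null || true)
if is_utf8_charmap "$charmap"; then
    echo "current locale uses UTF-8"
else
    echo "current locale does not report UTF-8 (charmap: ${charmap:-unknown})"
fi
```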

To add support for the UTF-8 locale, you should build and install the `libutf8_plug.so' library, from libutf8-0.5.2.tar.gz. Then you can set the LD_PRELOAD environment variable to point to the installed library:


$ export LD_PRELOAD=/usr/local/lib/libutf8_plug.so

Then, in every program launched with this environment variable set, the functions in libutf8_plug.so will override the original ones in /lib/libc.so.6. For more info about LD_PRELOAD, see "man 8 ld.so".
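
If you would rather not export LD_PRELOAD for your whole session, you can set it for single commands only. The wrapper below is a sketch of that idea; the function name run_with_utf8_plug is hypothetical, and the library path is the one used above, which may differ on your system. When the library is not installed, the command is simply run unchanged.

```shell
# Path of the preload library, as installed above.
utf8_plug=/usr/local/lib/libutf8_plug.so

# Usage: run_with_utf8_plug COMMAND [ARG...]
# Runs COMMAND with LD_PRELOAD set, but only if the library is readable.
run_with_utf8_plug() {
    if [ -r "$utf8_plug" ]; then
        LD_PRELOAD=$utf8_plug "$@"
    else
        "$@"
    fi
}

run_with_utf8_plug echo "wrapper works"
```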

This entire thing will not be necessary any more once glibc-2.2 comes out.

3.7 Converting the message catalogs

Now let's fill these new locales with content. The following /bin/sh commands can convert your message catalogs to UTF-8 format. They must be run as root, and require the programs `msgfmt' and `msgunfmt' from GNU gettext-0.10.35 to be installed. The commands are in convert-msgcat.sh.
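
To give an idea of the steps involved, here is a sketch of converting a single catalog. It is not the convert-msgcat.sh script, only an illustration of one plausible pipeline: decompile with msgunfmt, convert the text with iconv, rewrite the charset named in the catalog header, and recompile with msgfmt. The function names and the argument order are assumptions, and you have to know the source character set of each catalog yourself.

```shell
# Filter: replace the charset named in a PO "Content-Type" header line.
# It would also rewrite an identical-looking string inside a message.
rewrite_charset_header() {
    sed 's/\(Content-Type: text\/plain; charset=\)[A-Za-z0-9_.:-]*/\1UTF-8/'
}

# Usage: convert_catalog SOURCE.mo DESTINATION.mo SOURCE_CHARSET
convert_catalog() {
    msgunfmt "$1" \
        | iconv -f "$3" -t UTF-8 \
        | rewrite_charset_header \
        | msgfmt -o "$2" -
}

# Demonstrate the header rewrite on a typical header line.
printf '%s\n' '"Content-Type: text/plain; charset=ISO-8859-1\n"' \
    | rewrite_charset_header
```

A call such as convert_catalog old.mo new.mo ISO-8859-1 (file names hypothetical) would then produce the UTF-8 catalog to install under the corresponding .UTF-8 locale directory.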

This too will not be necessary any more once glibc-2.2 comes out, because by then, the gettext function will convert the strings appropriately from the translator's character set to the user's character set, using either iconv or librecode.

3. Locale setup

3.1 Files & the kernel