The Unicode HOWTO: Specific applications

4.1 Networking

rlogin

rlogin works fine with the above-mentioned patches.

telnet

telnet is not 8-bit clean by default. In order to be able to send Unicode keystrokes to the remote host, you need to set telnet into "outbinary" mode. There are two ways to do this:


$ telnet -L <host>

and


$ telnet
telnet> set outbinary
telnet> open <host>

Additionally, use the above-mentioned patches.

4.2 Browsers

Netscape

Netscape 4.05 or newer can display HTML documents in UTF-8 encoding. All a document needs is the following line between the <head> and </head> tags:


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Netscape 4.05 or newer can also display HTML and text files in UCS-2 encoding with byte-order mark.
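
To try out both modes, small test files can be generated with a few lines of Python. This is only a sketch: the function names and the sample text are made up for illustration, and UCS-2 is produced with Python's UTF-16 codec, which yields the same bytes as long as all characters lie in the Basic Multilingual Plane.

```python
import codecs
import html

# Arbitrary sample text: Latin, Greek and Cyrillic letters.
SAMPLE = "Gr\u00fc\u00dfe, \u03ba\u03cc\u03c3\u03bc\u03b5, \u043c\u0438\u0440"

def utf8_html_page(body_text):
    """Return a minimal HTML page, encoded as UTF-8, carrying the meta line."""
    page = (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        "</head>\n<body>\n<p>" + html.escape(body_text) + "</p>\n</body>\n</html>\n"
    )
    return page.encode("utf-8")

def ucs2_text_file(text):
    """Return text as big-endian UCS-2 preceded by a byte-order mark.

    Assumes every character is in the BMP, where UTF-16 and UCS-2 coincide.
    """
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be stored in UCS-2")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

Writing the returned bytes to a file in binary mode gives a page or text file that can be opened in the browser under test.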

http://www.netscape.com/computing/download/

lynx

lynx-2.8 has an options screen (key 'O') which lets you set the display character set. When running in an xterm or Linux console in UTF-8 mode, set this to "UNICODE UTF-8".

Now, again, all a document needs is the following line between the <head> and </head> tags:


<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

When you are viewing text files in UTF-8 encoding, you also need to pass the command-line option "-assume_local_charset=UTF-8" (affects only file:/... URLs) or "-assume_charset=UTF-8" (affects all URLs). In lynx-2.8.2 you can alternatively, in the options screen (key 'O'), change the assumed document character set to "utf-8".

There is also an option in the options screen, to set the "preferred document character set". But it has no effect, at least with file:/... URLs and with http://... URLs served by apache-1.3.0.

There is a spacing and line-breaking problem, however. (Look at the Russian section of x-utf8.html, or at utf-8-demo.txt.)

Also, in lynx-2.8.2, configured with --enable-prettysrc, the nice colour scheme does not work correctly any more when the display character set has been set to "UNICODE UTF-8". This is fixed by a simple patch lynx282.diff.

The Lynx developers say: "For any serious use of UTF-8 screen output with lynx, compiling with slang lib and -DSLANG_MBCS_HACK is still recommended."

ftp://ftp.gnu.org/pub/gnu/lynx/lynx-2.8.2.tar.gz
http://lynx.browser.org/
http://www.slcc.edu/lynx/
ftp://lynx.isc.org/

Test pages

Some test pages for browsers can be found at the pages of Alan Wood http://www.hclrss.demon.co.uk/unicode/#links and James Kass http://home.att.net/~jameskass/.

4.3 Editors

yudit

yudit by Gáspár Sinai http://czyborra.com/yudit/ is a first-class Unicode text editor for the X Window System. It supports simultaneous processing of many languages, input methods, and conversions for local character standards. It has facilities for entering text in all languages with only an English keyboard, using keyboard configuration maps.

It can be compiled in three versions: Xlib GUI, KDE GUI, or Motif GUI.

Customization is very easy. Typically you will first customize your font. From the font menu I chose "Unicode". Then, since the command "xlsfonts '*-*-iso10646-1'" still showed some ambiguity, I chose a font size of 13 (to match Markus Kuhn's 13-pixel fixed font).

Next, you will customize your input method. The input methods "Straight", "Unicode" and "SGML" are most remarkable. For details about the other built-in input methods, look in /usr/local/share/yudit/data/.

To make a change the default for future sessions, edit your $HOME/.yuditrc file.

The general editor functionality is limited to editing, cut&paste and search&replace. No undo.

mined98

mined98 is a small text editor by Michiel Huisjes, Achim Müller and Thomas Wolff (http://www.inf.fu-berlin.de/~wolff/mined.html). It lets you edit UTF-8 or 8-bit encoded files, in a UTF-8 or 8-bit xterm. It also has powerful capabilities for entering Unicode characters.

When it is running in xterm or Linux console in UTF-8 mode, you should set the environment variable utf8_term, or call mined with the command-line option -U.

mined lets you edit both 8-bit encoded and UTF-8 encoded files. By default it uses an autodetection heuristic. If you don't want to rely on heuristics, pass the command-line option -u when editing a UTF-8 file, or +u when editing an 8-bit encoded file. You can change the interpretation at any time from within the editor: it displays the encoding ("L:h" for 8-bit, "U:h" for UTF-8) in the menu line. Click on the first of these characters to change it.
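
To illustrate what such an autodetection heuristic can look like, here is a minimal Python sketch. It is not mined's actual algorithm (which is not described here); the function name and the three result labels are assumptions made for this example. It relies on the fact that text in a typical 8-bit encoding is rarely well-formed UTF-8.

```python
def guess_encoding(data):
    """Guess whether a byte string is ASCII, UTF-8 or some 8-bit encoding."""
    try:
        data.decode("ascii")
        return "ascii"        # no byte >= 0x80: both interpretations agree
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"
```

A heuristic of this kind can guess wrong: some byte sequences are valid in both interpretations. That is the reason for having explicit overrides such as -u and +u.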

A few caveats:

The Linux binary in the distribution is out of date and does not support UTF-8. You have to rebuild a binary from the source. Then install src/mined as /usr/local/bin/mined, and install doc/mined.help as /usr/local/man/cat1/mined.1, so that ESC h will find it.
mined ignores your "stty erase" setting. When your backspace key sends ASCII code 127 and you have set "stty erase ^?" (these are the ultimately correct settings), you have to call mined with option -B to make the backspace key erase the character to the left of the cursor.
The "Home", "End", "Delete" keys do not work.

vim

vim (as of version 5.4m) has some support for multi-byte locales, but only as far as the X library has the same support, and only for encodings with at most two bytes per character (i.e. ISO-2022 encodings). No UTF-8 support.

emacs

First of all, you should read the section "International Character Set Support" (node "International") in the Emacs manual. In particular, note that you need to start Emacs using the command


$ emacs -fn fontset-standard

so that it will use a font set comprising a lot of international characters.

In the short term, the emacs-utf package http://www.cs.ust.hk/faculty/otfried/Mule/ by Otfried Cheong provides a "unicode-utf8" encoding to Emacs. After compiling the program "utf2mule" and installing it somewhere in your $PATH, also install unicode.el, muleuni-1.el, unicode-char.el somewhere, and add the lines


(setq load-path (cons "/home/user/somewhere/emacs" load-path))
(if (not (string-match "XEmacs" emacs-version))
  (progn
    (require 'unicode)
    (if (eq window-system 'x)
      (progn
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-12-*-*-*-*-*-fontset-standard")
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-13-*-*-*-*-*-fontset-standard")
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-14-*-*-*-*-*-fontset-standard")
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-15-*-*-*-*-*-fontset-standard")
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-16-*-*-*-*-*-fontset-standard")
        (create-fontset-from-fontset-spec
          "-misc-fixed-medium-r-normal-*-18-*-*-*-*-*-fontset-standard")))))

to your $HOME/.emacs file. To activate any of the font sets, use the Mule menu item "Set Font/FontSet" or Shift-down-mouse-1. Currently the font sets with height 15 and 13 have the best Unicode coverage, due to Markus Kuhn's 9x15 and 6x13 fonts. To open a UTF-8 encoded file, type


M-x universal-coding-system-argument unicode-utf8 RET
M-x find-file filename RET

or


C-x RET c unicode-utf8 RET
C-x C-f filename RET

Note that this works with Emacs in windowing mode only, not in terminal mode.

Richard Stallman plans to add integrated UTF-8 support to Emacs in the long term.

xemacs

(This section is written by Gilbert Baumann.)

Here is how to teach XEmacs (20.4 configured with MULE) the UTF-8 encoding. Unfortunately you need its sources to be able to patch it.

First you need these files provided by Tomohiko Morioka:

http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-21.0-b55-emc-b55-ucs.diff and http://turnbull.sk.tsukuba.ac.jp/Tools/XEmacs/xemacs-ucs-conv-0.1.tar.gz

The .diff is a diff against the C sources. The tar ball is elisp code, which provides lots of code tables to map to and from Unicode. As the name of the diff file suggests it is against XEmacs-21; I needed to help `patch' a bit. The most notable difference to my XEmacs-20.4 sources is that file-coding.[ch] was called mule-coding.[ch].

For those unfamiliar with the XEmacs-MULE stuff (as I am) a quick guide:

What we call an encoding is called by MULE a `coding-system'. The most important commands are:


M-x set-file-coding-system
M-x set-buffer-process-coding-system   [comint buffers]

and the variable `file-coding-system-alist', which guides `find-file' to guess the encoding used. After stuff was running, the very first thing I did was this.

This code looks at the special mode line, introduced by -*-, somewhere in the first 600 bytes of the file about to be opened; if it contains a field "Encoding: xyz;" and the xyz encoding ("coding system" in Emacs speak) exists, that encoding is chosen. So now you could write e.g.


;;; -*- Mode: Lisp; Syntax: Common-Lisp; Package: CLEX; Encoding: utf-8; -*-

and XEmacs goes into utf-8 mode here.
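
The lookup described above can also be sketched outside of XEmacs. The following Python function is not the Lisp code referred to in this section; it only illustrates the described behaviour. The function name is made up, and Python's codec registry stands in for the list of coding systems known to XEmacs.

```python
import codecs
import re

def encoding_from_mode_line(data, limit=600):
    """Return the 'Encoding:' value from a -*- ... -*- line, or None.

    Only the first `limit` bytes are examined, and the value is returned
    only if it names an encoding that exists (here: a Python codec).
    """
    head = data[:limit].decode("latin-1")   # latin-1 accepts every byte
    match = re.search(r"-\*-(.*?)-\*-", head)
    if match is None:
        return None
    for field in match.group(1).split(";"):
        name, sep, value = field.partition(":")
        if sep and name.strip().lower() == "encoding":
            value = value.strip()
            try:
                codecs.lookup(value)        # does this encoding exist?
            except LookupError:
                return None
            return value
    return None
```

Applied to the first bytes of a file that starts with the mode line shown above, the function returns "utf-8"; for a file without such a field it returns None, and the caller falls back to its usual guess.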

After everything was running I defined \u03BB (greek lambda) as a macro like:


(defmacro \u03BB (x) `(lambda .,x))

nedit

xedit

In theory, xedit should be able to edit UTF-8 files if you set the locale accordingly (see above), and add the line "Xedit*international: true" to your $HOME/.Xdefaults file. In practice, it will recognize UTF-8 encoding of non-ASCII characters, but will display them as sequences of "@" characters.

axe

In theory, axe should be able to edit UTF-8 files if you set the locale accordingly (see above), and add the line "Axe*international: true" to your $HOME/.Xdefaults file. In practice, it will simply dump core.

pico

TeX

The teTeX 0.9 (and newer) distribution contains a Unicode adaptation of TeX, called Omega (http://www.gutenberg.eu.org/omega/, ftp://ftp.ens.fr/pub/tex/yannis/omega), but can someone point me to a tutorial for using this system?

4.4 Mailers

MIME: RFC 2279 defines UTF-8 as a MIME charset, which can be transported under the 8bit, quoted-printable and base64 encodings. The older MIME UTF-7 proposal (RFC 2152) is considered to be deprecated and should not be used any further.
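
As an illustration of the three transfer encodings, the following Python sketch takes one UTF-8 byte string and produces its quoted-printable and base64 forms with the standard library. The sample text and variable names are arbitrary choices for this example.

```python
import base64
import quopri

text = "Gr\u00fc\u00dfe"            # arbitrary sample with non-ASCII letters
raw = text.encode("utf-8")         # the bytes an 8bit body carries unchanged

as_qp = quopri.encodestring(raw)   # quoted-printable form, pure ASCII
as_b64 = base64.b64encode(raw)     # base64 form, pure ASCII

# Both 7-bit-safe forms decode back to the same UTF-8 bytes.
assert quopri.decodestring(as_qp) == raw
assert base64.b64decode(as_b64) == raw
```

Whichever transfer encoding is used, the receiver ends up with the same UTF-8 bytes; only the representation on the wire differs.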

Mail clients released after January 1, 1999, should be capable of sending and displaying UTF-8 encoded mails, otherwise they are considered deficient. But these mails have to carry the MIME labels


Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Simply piping a UTF-8 file into "mail" without caring about the MIME labels will not work.
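
For comparison, here is a minimal sketch of a message that does carry these labels, built with the email package of Python's standard library. The function name and the example.org addresses are placeholders invented for this example, and actually sending the message is left out.

```python
from email.message import EmailMessage

def utf8_mail(sender, recipient, subject, body):
    """Build a text/plain message labelled as UTF-8 with 8bit transfer."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Sets Content-Type (with charset) and Content-Transfer-Encoding.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg

# Placeholder addresses; the body contains non-ASCII characters.
msg = utf8_mail("a@example.org", "b@example.org", "test", "Gr\u00fc\u00dfe\n")
wire = msg.as_bytes()   # headers plus the body as raw UTF-8 bytes
```

The resulting headers include the charset and transfer-encoding labels, and the body bytes are the unmodified UTF-8 text.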

Mail client implementors should take a look at http://www.imc.org/imc-intl/ and http://www.imc.org/mail-i18n.html.

Now about the individual mail clients (or "mail user agents"):

pine

The situation for an unpatched pine version 4.10 is as follows.

Pine does not do character set conversions. But it allows you to view UTF-8 mails in a UTF-8 text window (Linux console or xterm).

Normally, Pine will warn about different character sets each time you view a UTF-8 encoded mail. To get rid of this warning, choose S (setup), then C (config), then change the value of "character-set" to UTF-8. This option does nothing except suppress the warnings, as Pine has no built-in knowledge of UTF-8.

Also note that Pine's notion of Unicode characters is pretty limited: It will display Latin and Greek characters, but not other kinds of Unicode characters.

A patch by Robert Brady http://www.ents.susu.soton.ac.uk/~robert/pine-utf8-0.1.diff adds UTF-8 support to Pine. With this patch, it decodes and prints headers and bodies properly. The patch depends on the GNOME libunicode http://cvs.gnome.org/lxr/source/libunicode/.

However, alignment remains broken in many places; replying to a mail does not cause the character set to be converted as appropriate; and the editor, pico, cannot deal with multibyte characters.

kmail

kmail (as of KDE 1.0) does not support UTF-8 mails at all.

Netscape Communicator

Netscape Communicator's Messenger can send and display mails in UTF-8 encoding, but it needs a little bit of manual user intervention.

To send a UTF-8 encoded mail: After opening the "Compose" window, but before starting to compose the message, select from the menu "View -> Character Set -> Unicode (UTF-8)". Then compose the message and send it.

When you receive a UTF-8 encoded mail, Netscape unfortunately does not display it in UTF-8 right away, and does not even give a visual clue that the mail was encoded in UTF-8. You have to manually select from the menu "View -> Character Set -> Unicode (UTF-8)".

For displaying UTF-8 mails, Netscape uses different fonts. You can adjust your font settings in the "Edit -> Preferences -> Fonts" dialog; choose the "Unicode" font category.

emacs (rmail, vm)

4.5 Other text-mode applications

less

Get ftp://ftp.gnu.org/pub/gnu/less/less-340.tar.gz and apply the patch less-340-utf-2.diff by Robert Brady <rwb197@ecs.soton.ac.uk>. Then set the environment variable LESSCHARSET:


$ export LESSCHARSET=utf-8

If you have a LESSKEY environment variable set, also make sure that the file it points to does not define LESSCHARSET. If necessary, regenerate this file using the `lesskey' command, or unset the LESSKEY environment variable.

expand, wc

Get the GNU textutils-2.0 and apply the patch textutils-2.0.diff, then configure, add "#define HAVE_MBRTOWC 1", "#define HAVE_FGETWC 1", "#define HAVE_FPUTWC 1" to config.h. In src/Makefile, modify CFLAGS and LDFLAGS so that they include the directories where libutf8 is installed. Then rebuild.

col, colcrt, colrm, column, rev, ul

Get the util-linux-2.9y package, configure it, then define ENABLE_WIDECHAR in defines.h, change the "#if 0" to "#if 1" in lib/widechar.h. In text-utils/Makefile, modify CFLAGS and LDFLAGS so that they include the directories where libutf8 is installed. Then rebuild.

figlet

figlet 2.2 has an option for UTF-8 input: "figlet -C utf8"

kermit

The serial communications program C-Kermit http://www.columbia.edu/kermit/, in versions 7.0beta10 or newer, understands the file and transfer encodings UTF-8 and UCS-2, and understands the terminal encoding UTF-8. Documentation of these features can be found in ftp://kermit.columbia.edu/kermit/test/text/ckermit2.txt.

4.6 Other X11 applications

Unfortunately, the X11 Xlib does not yet support a UTF-8 locale; this urgently needs to be done as well.
