
RE: Display types?

... I forgot one thing. There are problems with canonical encodings
for multilingual character sets, since the boundary between two
characters is sometimes not clear. Accents, diacritical marks, and
ligatures, as they exist in almost all languages in the world
(including English), pose a problem. Unicode is not the only place
where you have multiple ways to express such constructs. For instance,
you could represent an e-acute in at least three different ways:
[EACUTE] (precoordinated), [E][ACUTE], or [ACUTE][E] (typewriter style).
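
To make the ambiguity concrete, here is a minimal sketch using Python's
standard unicodedata module. The variable names are mine, and only the
first two forms are shown; the typewriter-style [ACUTE][E] ordering has
no canonical equivalent in Unicode and is left out.

```python
import unicodedata

# Two spellings of e-acute that render the same but compare unequal.
precomposed = "\u00e9"   # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # base letter followed by COMBINING ACUTE ACCENT

# A naive comparison treats them as different strings.
naive_equal = precomposed == decomposed

# Normalizing both to the same form makes them comparable.
nfd = unicodedata.normalize("NFD", precomposed)  # fully decomposed
nfc = unicodedata.normalize("NFC", decomposed)   # recomposed

canonical_equal = nfd == decomposed and nfc == precomposed
```

Any comparison of names or identifiers would have to normalize both
sides first, or it will see two different byte strings.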

In general, for a canonical form you want to decompose as much as
possible. In particular, you want to get rid of stylized ligatures
(which are often wrong anyway). For instance, the ff, fi, fl, ffi, and
ffl ligatures exist only in roman or gothic fonts (not in fixed-width
fonts) and are often wrong if they span a word boundary in a compound
word. English does not have many compound words, but German does.
"Giffy" might have the ffi ligature, but "Auffassung" (Auf + fassung)
should not get the ff ligature. You definitely want to keep this
typographical stuff out of SPKI.

In the context of Unicode, canonicalization and decomposition have been
formalized. See http://www.unicode.org/unicode/reports/tr15. If you
want to support multiple character sets/encodings you still have the
problem; e.g., ISO Latin-1 has only precoordinated accented vowels,
while a transcription to US-ASCII would have to postcoordinate them.
So again, riding on the Unicode train saves you a lot of headaches
that you would otherwise have to deal with yourselves.
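
A rough sketch of the Latin-1 to US-ASCII case, going through Unicode
decomposition. The function name latin1_to_ascii and the ASCII_MARKS
table are hypothetical; the choice of ASCII stand-ins for the accents is
my own assumption, not any standard transcription.

```python
import unicodedata

# Hypothetical table: combining marks mapped to ASCII stand-ins that
# follow the base letter (postcoordinated).
ASCII_MARKS = {
    "\u0300": "`",   # combining grave
    "\u0301": "'",   # combining acute
    "\u0302": "^",   # combining circumflex
    "\u0303": "~",   # combining tilde
    "\u0308": '"',   # combining diaeresis
}

def latin1_to_ascii(data: bytes) -> str:
    """Transcribe ISO Latin-1 bytes to ASCII with postcoordinated accents."""
    text = data.decode("latin-1")
    # Split each precoordinated letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    transcribed = "".join(ASCII_MARKS.get(ch, ch) for ch in decomposed)
    # Anything still outside ASCII (e.g. sharp s) becomes "?".
    return transcribed.encode("ascii", errors="replace").decode("ascii")
```

For example, latin1_to_ascii(b"caf\xe9") yields "cafe'". Without a
shared decomposition rule, every pair of character sets would need its
own table like this one.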


Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>