
RE: Display types?



Hi,

sorry for stumbling into your conversation. I believe that going with
UNICODE and choosing UTF-8 as the canonical encoding is indeed a
pretty good idea whenever you design a character-based format or
protocol standard today. UNICODE covers all languages of the world
and it's a widely accepted standard, so I guess everyone would want
to use UNICODE instead of any more limited character set. The UNICODE
characters of all contemporary scripts can be represented in a 16 bit
register.

The next question is: what is the canonical encoding? Here your
choices are

UTF-8	backwards compatible with 7 bit US-ASCII; encodes characters
	of up to 31 bits.
UTF-7	backwards compatible with 7 bit US-ASCII (with one exception,
	namely the + character); encodes 16 bits per character.
UCS-2	uses 16 bits even for US-ASCII, so it gives US-ASCII no
	advantage over ISO Latin-1.
UCS-4	uses 32 bits for everything, clearly a waste of bandwidth.

There are subchoices for UCS-2 and UCS-4 concerning the byte
ordering (little-endian vs. big-endian).
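
To see why the backwards compatibility works, here is a minimal
sketch in Java (my own illustration, not from any library) of how
UTF-8 packs a 16 bit character into one, two, or three bytes; note
how a 7 bit US-ASCII character passes through as a single, unchanged
byte:

    public class Utf8Sketch {
        /* Encode one 16 bit character as UTF-8 into buf;
           returns the number of bytes used. */
        static int encode(char c, byte[] buf) {
            if (c < 0x80) {            // 7 bit US-ASCII: one unchanged byte
                buf[0] = (byte) c;
                return 1;
            } else if (c < 0x800) {    // up to 11 bits: two bytes
                buf[0] = (byte) (0xC0 | (c >> 6));
                buf[1] = (byte) (0x80 | (c & 0x3F));
                return 2;
            } else {                   // up to 16 bits: three bytes
                buf[0] = (byte) (0xE0 | (c >> 12));
                buf[1] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buf[2] = (byte) (0x80 | (c & 0x3F));
                return 3;
            }
        }

        public static void main(String[] args) {
            byte[] buf = new byte[3];
            System.out.println(encode('A', buf));      // 1: plain US-ASCII
            System.out.println(encode('\u00FC', buf)); // 2: Latin-1 u-umlaut
            System.out.println(encode('\u4E2D', buf)); // 3: a CJK character
        }
    }

A decoder can tell from the leading bits of the first byte how many
bytes follow, which is what keeps the encoding self-synchronizing.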

UTF-8 is clearly the best supported encoding. It is backwards
compatible with 7 bit US-ASCII, so legacy systems have no headache at
all: a pure ASCII text is already valid UTF-8, byte for byte. Java
does UTF-8 without anyone even noticing it, and most other libraries
should support UTF-8 seamlessly in their current or upcoming versions
(if they are concerned with multilingual character strings at all).
So UTF-8 is not only the most widely available but also a quite
economic and backwards compatible encoding, which is a compelling
argument for making UTF-8 the canonical encoding.
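
For example, java.io.DataOutputStream has a writeUTF() method that
emits a string in (a slightly modified form of) UTF-8, so a Java
application gets the canonical encoding essentially for free; a small
demonstration:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class WriteUtfDemo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            // writeUTF prefixes a 2 byte length, then the string in
            // (modified) UTF-8; US-ASCII passes through unchanged.
            out.writeUTF("Gr\u00FC\u00DFe");  // greeting with two non-ASCII letters
            byte[] b = bytes.toByteArray();
            for (int i = 0; i < b.length; i++)
                System.out.print(Integer.toHexString(b[i] & 0xFF) + " ");
            System.out.println();  // the two Latin-1 letters take 2 bytes each
        }
    }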

The advantage over a free choice of character set is clearly that the
syntax becomes simpler and that you remove possible interoperability
problems. The problem that Ron Rivest had with his Netscape browser
is an example of the interoperability issues that exist as long as
you try to support multiple encodings directly.

Of course, every application is free to use whatever encoding it
likes internally. But for interoperability, UNICODE/UTF-8 is already
THE standard character set and encoding, and its support gets better
every day.
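
In code, that boundary between a free internal encoding and the
canonical wire encoding can look like the following sketch ("UTF8" is
the name JDK 1.1 uses for the encoding; the class and variable names
are mine):

    import java.io.UnsupportedEncodingException;

    public class WireBoundary {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            String internal = "Gr\u00FC\u00DFe"; // 16 bit chars inside the JVM
            // convert to the canonical encoding on the way out ...
            byte[] wire = internal.getBytes("UTF8");
            // ... and back to the internal form on the way in
            String back = new String(wire, "UTF8");
            System.out.println(back.equals(internal)); // prints "true"
        }
    }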

Finally, I'd like to warn you against too light-heartedly expanding
the notion of attribute types to include all kinds of multimedia
content using the MIME scheme. I admit that it is an intriguing idea,
as it would allow you to include an image of a person in an SPKI
certificate. However, it comes at a considerable cost.

For another standardization group I have written about text and
multimedia encoding, and I find that it applies very well to SPKI as
well. See http://aurora.rg.iupui.edu/v3dt/report.html#Text for a
recap on UNICODE, UTF-8, and MIME-ish multimedia, with links to the
relevant standards.

Otherwise, do the right thing!
regards
-Gunther

Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>
