
Re: Display types?

>  Carl> We started out with UTF-8 as the character set and the
>  Carl> discussion on the list pushed us back to Latin-1.  I believe
>  Carl> the primary objections were from Europe.
> That's odd.  I assume you mean "Western Europe", i.e., that part of
> Europe served by Latin-1.  This is really depressing, since that area
> contains some countries that would scream very loudly if we went to
> plain old ASCII, yet apparently don't seem to find anything wrong with 
> being equally language-chauvinistic.
> Unless you want to serve only the languages used in the EEC, Latin-1
> is grossly inadequate.  I don't believe this is a valid approach.

There is even more to say here. As you may know, Unicode per se is
backwards compatible not only with US-ASCII but also with Latin-1:
the first 256 code points match the Latin-1 byte values. It is only
the UTF-8 and UTF-7 encodings that do not preserve this backwards
compatibility at the byte level. Not long ago I tried to convince
the Unicode folks to define a UTF encoding that would preserve
Latin-1, but I did not succeed; the idea just didn't fly. Now, guess
who the biggest opponents of this Latin-1 compatibility were?
Europeans! Western Europeans! Those people were unanimous that
Latin-1 provides only weak support for Western European languages,
something I cannot confirm personally (though I am a European), but
that I take as an opinion from obvious "experts".
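For concreteness, here is a small Python sketch (my addition, not
part of the original exchange) of what this backwards compatibility
does and does not mean: the Unicode code points coincide with the
Latin-1 byte values, but the UTF-8 byte encoding differs.

```python
# Unicode code points U+0000..U+00FF coincide with Latin-1 byte values,
# but UTF-8 re-encodes everything above U+007F as two bytes.
text = "café"                        # 'é' is code point U+00E9
latin1 = text.encode("latin-1")      # b'caf\xe9'     -> 4 bytes
utf8 = text.encode("utf-8")          # b'caf\xc3\xa9' -> 5 bytes

assert ord("é") == 0xE9 == latin1[-1]   # code point == Latin-1 byte value
assert latin1 != utf8                   # but the UTF-8 byte strings differ
```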

I strongly urge you to rethink any focus on Latin-1. Latin-1 is a
dying entity. It is superseded by ISO 10646 and Unicode, and the
world is moving towards Unicode/UTF-8. So you might want to be
encoding neutral (risking interoperability problems), but please,
do not specialize for Latin-1.

>  Carl> If we were to assume UTF-8 for a character set, we also have
>  Carl> the problem that it's a variable width character set, which
>  Carl> means that the byte count that precedes a bytestring would not
>  Carl> always equal the character count.  This would have little
>  Carl> effect on a program but might get in the way of a human
>  Carl> examining the canonical form from a text display.
> Why would the mismatch between character count and byte count affect a 
> human?  I must be missing something here.  If variable length is the
> issue then Unicode with wide chars would serve (provided they don't
> start using codes outside the basic multilingual plane, i.e., codes
> outside the 16-bit space).  

Agreed. BTW, someone noted that Unicode is not 100% complete. Maybe,
but work is still in progress. Second, UTF-16 is indeed all you
should need; the planes beyond the BMP are for the likes of Akkadian,
Assyrian, and Klingon.
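To illustrate the point about UTF-16 and the Basic Multilingual
Plane (again my own sketch, not from the original message): every
BMP character is a single 16-bit code unit; only characters beyond
the BMP need a surrogate pair.

```python
# A BMP character occupies one 16-bit code unit in UTF-16.
omega = "Ω"                                  # U+03A9, inside the BMP
assert len(omega.encode("utf-16-be")) == 2   # one 16-bit unit

# A character beyond the BMP needs a surrogate pair (two units).
clef = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF
assert len(clef.encode("utf-16-be")) == 4    # two 16-bit units
```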

The point here is that the equation 1 byte = 1 character is
definitely no longer true. The byte-count problem is a non-issue.
You are about to make a *big* mistake in your protocol design if you
try to somehow preserve this old equation of 1 byte = 1 character.
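A quick Python demonstration (my addition) of why the equation
breaks down under UTF-8:

```python
# In UTF-8, byte count and character count diverge as soon as the
# text contains anything outside US-ASCII.
s = "Grüße"                  # 5 characters
b = s.encode("utf-8")        # 'ü' and 'ß' each take two bytes

assert len(s) == 5           # character count
assert len(b) == 7           # byte count != character count
```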

I believe that standards are there to be adopted, and standards
organizations should be careful about reaching beyond their scope.
You want people to trust SPKI to be a good design, so you should
trust Unicode to be a good design. Unicode/UTF is a widely adopted
standard that does supersede Latin-1. Unicode is used in the IETF
and the W3C. The Unicode folks are quite reasonable and do pretty
good work (I am not affiliated with Unicode in any way, and my
standards are high here). I think it is wise to rely on their
judgement and not to try to revise decisions that have already been
made. If you want to discuss character-encoding issues, you should
take them to the unicode mailing list.

SPKI has no business trying to preserve the nightmare of code pages
that Unicode finally overcame. I am with you in the pursuit of a
more flexible UTF, but that is definitely outside SPKI's scope. I
urge you not to make that code-page mistake; it would be a major
dirty spot in the SPKI design.

> UTF-8 has an advantage, though, in that it encodes characters from
> the Latin-1 set with the same bitstrings as Latin-1 does.

Not true, but not important here; that was the issue over which I
fought with the Unicode folks. UTF-8 uses multi-byte sequences for
everything above what fits into the first 7 bits, so the upper half
of Latin-1 (0x80-0xFF) is multi-byte encoded. But the
encoding/decoding is a really simple task, and source code is
available in virtually all languages.
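To back up the claim that this is simple, here is a minimal Python
sketch (my own, hypothetical helper name) of a Latin-1-to-UTF-8
converter; in the Latin-1 range every non-ASCII byte becomes exactly
one two-byte UTF-8 sequence:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Convert Latin-1 bytes to UTF-8 by hand: bytes < 0x80 pass
    through unchanged; bytes >= 0x80 become a two-byte sequence."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                 # ASCII: identical in UTF-8
        else:
            out.append(0xC0 | (b >> 6))   # leading byte:      110000xx
            out.append(0x80 | (b & 0x3F)) # continuation byte: 10xxxxxx
    return bytes(out)

# Agrees with the built-in codecs on the Latin-1 range:
assert latin1_to_utf8(b"caf\xe9") == "café".encode("utf-8")
```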


Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>