
Re: Display types?



Gunther Schadow <schadow@aurora.rg.iupui.edu> writes:

> UNICODE and choosing UTF-8 as the canonical encoding is indeed a
> pretty good idea wherever you design any character-based format or
> protocol standard today. UNICODE covers all languages of the World
> and its a widely accepted standard. So, to take a step, I guess
> everyone would want to use UNICODE instead of any limited character
> set. All contemporary script's UNICODE characters can be enumerated
> with a 16 bit register.

I agree that unicode is probably the way to go, for supporting
characters beyond latin1. I only object to requiring UTF-8 for
encoding of the unicode characters.

> Next question is, what is the canonical encoding? Here your
> choices are

As a side note, this use of the word "canonical" cannot have anything
to do with canonical representations of S-expressions. Converting
between character sets and encodings can _not_ be done by the sexp
reader and printer. I can expand on this point if you don't see why.

> UTF-7	backwards compatible to 7bit US-ASCII (with one exception,
> 	i.e. the + character), encodes 16 bit per character.

UTF-7 is completely out of the question, I hope, as sexp provides
strings of arbitrary eightbit octets.

> UTF-8	backwards compatible to 7bit US-ASCII, encodes at least up to
> 	32 bit per character.

As I said before, I don't like multibyte encodings like UTF-8, and I
think they should be avoided wherever possible.

UTF-8 has one unique feature, which is also the reason why it is
popular: backwards compatibility with ascii. This means that if you
have a file format where strings are delimited by ascii characters
such as colons, spaces, quotes or newlines, you won't break old usascii
parsers (provided that they are 8-bit clean, _and_ no eight-bit
characters have any special syntactic meaning, _and_ the number of
characters in a field doesn't matter): they will be able to parse the
data, although it will most likely be "misdisplayed" as someone else
called it. This is because UTF-8 makes sure that no character is
encoded as a multi-octet string that contains the octet ':' (or any
other usascii character).
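
To make that property concrete, here is a small Python sketch (the
function name and the sample string are only illustrative, not
anything from the sexp or spki drafts). It checks that every octet of
a multi-octet UTF-8 sequence has the high bit set, so splitting on an
ascii colon at the octet level never cuts through a character:

```python
def multibyte_octets_avoid_ascii(text: str) -> bool:
    """True if no multi-octet UTF-8 sequence contains an octet below 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(octet < 0x80 for octet in encoded):
            return False
    return True


# Hypothetical sample: latin1, CJK and a currency sign around an ascii colon.
sample = "na\u00efve:\u65e5\u672c\u8a9e \u20ac"
print(multibyte_octets_avoid_ascii(sample))

# An old octet-level parser that splits on ':' still finds the field boundary.
fields = sample.encode("utf-8").split(b":")
print([field.decode("utf-8") for field in fields])
```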

But this feature is _totally unneeded_ for spki; strings have a length
prefix, and are never delimited by ascii characters (except in the
advanced encoding, but I don't consider that important enough to be
taken into consideration here).
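
As a rough illustration of why the length prefix makes the delimiter
argument moot, here is a minimal Python sketch of reading and writing
a <decimal length>:<octets> string. The helper names are made up for
this example, and it deliberately ignores the advanced encoding and
display hints. The payload contains both a zero octet and a ':'
octet, and neither confuses the reader:

```python
def write_verbatim(octets: bytes) -> bytes:
    """Emit <decimal length>:<octets>."""
    return str(len(octets)).encode("ascii") + b":" + octets


def read_verbatim(data: bytes, pos: int = 0):
    """Parse one length-prefixed string; return (octets, next position)."""
    colon = data.index(b":", pos)           # end of the decimal length
    length = int(data[pos:colon].decode("ascii"))
    start = colon + 1
    end = start + length
    if end > len(data):
        raise ValueError("truncated string")
    return data[start:end], end             # the payload is never scanned


# 16-bit big-endian text "a:b": contains 0x00 octets and a ':' octet.
payload = "a:b".encode("utf-16-be")
wire = write_verbatim(payload)
octets, next_pos = read_verbatim(wire)
print(wire, octets == payload, next_pos)
```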

Therefore, there is *no* reason to favour UTF-8 in spki. Allowing
UTF-8 is ok, but requiring its use is evil.

> UCS-16	uses 16 bit even for US-ASCII but does not advantage US-ASCII
> 	over ISO Latin-1.
> UCS-32  uses 32 bit for everything, clearly a waste of bandwidth.

These encodings provide unicode support just as well as UTF-8, but
without the ugly and unnecessary transport encoding of UTF-8. I don't
know if 16 bits is enough, so it seems reasonable to offer both
encodings (as well as 8-bit subsets like latin1). If I have understood
these matters correctly, unicode is a 16-bit subset of the 32-bit
character iso-standard, and it just happens to include all the
characters which are currently defined.

> There are subchoices for UCS-16 and UCS-32 concerning the byte
> ordering (low endian vs. high endian).

I think we should standardize on a single byte order. My preference
is network byte order (big-endian), but that's not something I want
to go to war about.
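
For concreteness, a small Python sketch of what the two byte orders
look like for a fixed 16-bit encoding. The function name is
hypothetical, and the sketch assumes every character fits in 16 bits
(struct.pack raises an error otherwise):

```python
import struct


def encode_16bit(text: str, big_endian: bool = True) -> bytes:
    """Encode each character as one 16-bit unit in the chosen byte order."""
    fmt = ">H" if big_endian else "<H"      # ">" is network byte order
    return b"".join(struct.pack(fmt, ord(ch)) for ch in text)


text = "A\u00e5"                             # U+0041, U+00E5
print(encode_16bit(text, big_endian=True).hex())
print(encode_16bit(text, big_endian=False).hex())
```

The same two characters give different octet strings in the two
orders, which is why a reader has to know (or be told) which one is
in use.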

> Finally, I'd like to warn you for too light heartedly expanding the
> notion of attribute types to include all kinds of multimedial stuff
> using the MIME scheme. I admit that it is an intriguing idea, as it
> would allow you to include an image of a person in a SPKI
> certificate. However, it comes with a considerable cost.

This is a secondary concern for me, but what are the problems with
allowing the display type to be an arbitrary MIME type (possibly
excluding subtypes of multipart)?

As for the uniqueness problems, I think Carl commented on this already.
We have argued about uniqueness in S-expressions earlier (I wanted to
have a stricter definition which specified the order of attributes, no
redundant leading zeros on numbers etc), but I think he managed to
convince me that this is a problem that is best left unsolved. In
SPKI, an object (name, certificate, key, whatever) has many equivalent
representations as an S-expression. Each of those S-expressions has
exactly one representation in the canonical encoding, and that is all
that is needed.
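
A minimal Python sketch of that distinction, assuming an S-expression
is modelled as nested lists of octet strings (display hints are left
out, and the field names are made up for the example). The function
is deterministic, so each S-expression has exactly one canonical
encoding, but two S-expressions that describe the same object with
the attributes in a different order still encode differently:

```python
def canonical(sexp) -> bytes:
    """Canonical encoding: length-prefixed strings, parentheses, no whitespace."""
    if isinstance(sexp, bytes):
        return str(len(sexp)).encode("ascii") + b":" + sexp
    return b"(" + b"".join(canonical(item) for item in sexp) + b")"


# Two hypothetical, equivalent descriptions of one object.
first = [b"cert", [b"issuer", b"x"], [b"subject", b"y"]]
second = [b"cert", [b"subject", b"y"], [b"issuer", b"x"]]

print(canonical(first))
print(canonical(second))
print(canonical(first) == canonical(second))
```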

Therefore, a unique mapping name -> sexp is not something that is
needed or desirable for spki. So that is not a valid reason to
standardize on UTF-8 or any other <character set, encoding>.

Regards,
/Niels
