Base-64 proposal
Please ignore my last response to George Michaelson's questions. He
raised two good points, which I didn't think about adequately before
responding. Here is a better reply, and a slightly modified base64
proposal.
* The basic model for SPKI/SDSI is an ordinary 8-bit channel providing a
sequence of bytes (octets) that are then interpreted according to the
SPKI/SDSI rules to be parsed into lists, etc.
* In some cases it is desirable to code the elements of the 8-bit channel
with 6-bit (base-64) characters (hextets?) for protection against
mailer and channel damage. The usual encoding (RFC 1521) uses characters
A -- Z a -- z 0 -- 9 + /
to denote the hextets with decimal value 0--63, respectively.
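As a small illustration (not part of the proposal itself), the alphabet
above can be written down as a lookup table. The names ALPHABET and VALUE
are chosen here for the example only.

```python
import string

# Base-64 alphabet in the order listed above: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Map each character to the six-bit value (0..63) it denotes.
VALUE = {ch: i for i, ch in enumerate(ALPHABET)}

print(len(ALPHABET), VALUE["A"], VALUE["a"], VALUE["0"], VALUE["/"])
```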
* Thus, we propose an "augmented 8-bit channel" that allows one to drop
into six-bit mode temporarily. We propose that in an augmented 8-bit
channel, a left brace "{" signals dropping into 6-bit mode, and that
when in six-bit mode, a right brace "}" signals popping back out into
augmented 8-bit mode.
...8-bits...{...6-bits...}...8-bits...
The difference between ordinary 8-bit mode and augmented 8-bit mode is
that the left brace is used to signal the transition to 6-bit mode.
In augmented 8-bit mode a right brace is treated as an ordinary character.
* The result of processing a byte sequence that is in augmented 8-bit
mode is a byte sequence in ORDINARY eight-bit mode. That is, if
decoding the six-bit characters yields a left brace, that brace
does NOT recursively signal another embedded six-bit channel.
* How do you represent a left brace in augmented 8-bit mode?
The natural answer is as follows:
A left brace is octal 173 = binary 01111011
which breaks into hextets as 011110 11xxxx (x = doesn't matter)
Then 011110 encodes as "e"
and 11xxxx encodes as "w" (as one possibility: 110000 = 48)
Thus, by writing
{ew}
in the augmented 8-bit channel, you get a left brace into the output
ordinary 8-bit channel.
One could introduce other mechanisms for accomplishing this as well, but
this is sufficient. One natural alternative would be to let "\{" in the
augmented channel produce "{" in the output, and "\\" in the input
produce "\" in the output. But this requires two active characters,
left brace and backslash, which seems overkill. I suggest that we
not introduce redundant mechanisms.
(Note that my previous note about using #1:{ was just wrong.)
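The bit arithmetic for a single byte can be checked mechanically. The
sketch below is only an illustration, and the function name
first_two_hextets is made up for it. It splits one byte into its leading
hextet and lists every base-64 character whose top two bits match the
remaining two bits of the byte, the other four bits being "don't care".

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def first_two_hextets(byte):
    """Return (first character, all valid second characters) for one byte."""
    high = byte >> 2                    # top six bits of the byte
    low_prefix = (byte & 0b11) << 4     # bottom two bits, then four free bits
    candidates = [ALPHABET[low_prefix | pad] for pad in range(16)]
    return ALPHABET[high], "".join(candidates)

first, seconds = first_two_hextets(ord("{"))
print(first, seconds)
```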
* A reader for an augmented 8-bit channel is just a simple state machine
that remembers whether it is reading in 8-bit mode or 6-bit mode, and
(if in six-bit mode) what "left-over bits" there are. Each character
may cause a state change and may cause an 8-bit character to be output
for the output channel (which is the ordinary 8-bit channel). More
specifically, to process an input character x, do the following.
The state-machine has a left-over bits register that may hold up
to 12 bits, and a length-indicator for that register.
-- if in eight-bit mode:
     if x is not a left brace, output x and return.
     otherwise if x is left brace:
        switch to 6-bit mode, clear the left-over bits
        register, zero its length indicator, and return.
-- if in six-bit mode:
     if x is a right brace:
        switch to 8-bit mode and return.
     if x is not a base-64 character (alphanumeric or +/):
        ignore this character and return.
     Otherwise, append the six-bit value represented by x to
     the contents of the left-over-bits register, and increase
     the length indicator by six. If the left-over-bits register
     now has 8 or more bits, output the leftmost 8 of those bits,
     and delete them from the register, and subtract 8 from the
     length indicator.
Thus, the "get character" routine can be written so as to repeatedly
read characters from the augmented channel until an 8-bit character is
output for the ordinary channel.
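The state machine described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not a reference
implementation: it decodes a whole buffer at once rather than acting as a
character-at-a-time "get character" routine, and the name
decode_augmented is invented for the example.

```python
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
VALUE = {ord(ch): i for i, ch in enumerate(ALPHABET)}

def decode_augmented(data: bytes) -> bytes:
    """Decode an augmented 8-bit byte string into ordinary 8-bit bytes."""
    out = bytearray()
    six_bit_mode = False
    register = 0   # left-over bits, held as an integer
    length = 0     # number of valid bits in the register
    for x in data:
        if not six_bit_mode:
            if x == ord("{"):
                # Left brace: drop into 6-bit mode with an empty register.
                six_bit_mode = True
                register = 0
                length = 0
            else:
                out.append(x)   # includes "}", which is ordinary here
        elif x == ord("}"):
            six_bit_mode = False
        elif x in VALUE:
            register = (register << 6) | VALUE[x]
            length += 6
            if length >= 8:
                # Output the leftmost 8 bits and remove them.
                length -= 8
                out.append(register >> length)
                register &= (1 << length) - 1
        # Any other character in 6-bit mode is ignored.
    return bytes(out)

print(decode_augmented(b"x{aGk}y"))
```

Left-over bits that never fill a whole byte are simply dropped when the
right brace arrives, which is what makes the "don't care" bits harmless.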
* Note that you can't nest or recurse, since the output channel is an
ordinary 8-bit channel, not an augmented 8-bit channel. If one wants
to protect a byte-sequence from mutilation, it is silly and wasteful
to recursively encode stuff that is already 6-bit encoded. With the
current proposal, if you have for some reason a partially encoded
sequence in augmented 8-bit representation:
xxxx...xxx{YYYYYYYY}zzzzzzzzz
then the correct way to protect the entire thing is just to recode the
eight-bit portions (the xx's and zz's):
{XXX....XXX}{YYYYYYYY}{ZZZZ....ZZZZZ}
where the XX's are the 6-bit encoding of the xx's, and the ZZ's are the
six-bit encoding of the zz's.
Thus, it is possible to efficiently encode a sequence, even if portions
of it were already 6-bit encoded. You only need to know that your input
is already in augmented 8-bit form in order to know that you need to
watch out for already-encoded stuff, so that you can just copy it verbatim
to the result.
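A sketch of that re-encoding step is given below. It is an illustration
only, with several assumptions: the function name protect is invented,
Python's standard base64 module supplies the six-bit encoding, and the
trailing "=" padding is stripped because the reader described earlier
ignores non-base-64 characters anyway. Input with a left brace that is
never closed is rejected here, a choice the proposal does not address.

```python
import base64

def protect(augmented: bytes) -> bytes:
    """Six-bit encode the 8-bit portions; copy existing {...} groups verbatim."""
    out = bytearray()
    plain = bytearray()   # pending run of ordinary 8-bit characters

    def flush():
        # Emit the pending 8-bit run as one brace-delimited six-bit group.
        if plain:
            encoded = base64.b64encode(bytes(plain)).rstrip(b"=")
            out.extend(b"{" + encoded + b"}")
            plain.clear()

    i = 0
    while i < len(augmented):
        if augmented[i] == ord("{"):
            end = augmented.find(b"}", i)
            if end == -1:
                raise ValueError("unterminated six-bit group")
            flush()
            out.extend(augmented[i:end + 1])   # already encoded: copy as-is
            i = end + 1
        else:
            plain.append(augmented[i])
            i += 1
    flush()
    return bytes(out)

print(protect(b"abc{YWJj}def"))
```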
* The fact that non-base-64 characters are ignored in six-bit mode gives
one the possibility of getting "fragmentation", as noted in my earlier
note:
abc{
}def
produces
abcdef
when decoded.
* This proposal for augmented 8-bit channels is formally independent of
anything else in SPKI/SDSI. Indeed, it could be used to represent
any kind of data, not just SPKI/SDSI data.
Ron Rivest