[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Base-64 proposal
On Wed, 16 Apr 1997, Ron Rivest wrote:
> * How do you represent a left brace in augmented 8-bit mode?
> The natural answer is as follows:
> A left brace is octal 173 = binary 01111011
> which breaks into hextets as 011110 11xxxx (x = doesn't matter)
> Then 011110 encodes as "e"
> and 11xxxx encodes as "g" (as one possibility)
> Thus, by writing
> {eg}
> in the augmented 8-bit channel, you get a left brace into the output
> ordinary 8-bit channel.
Obviously, you could always use this form, but it's unpleasant to look at...
> One could introduce other mechanisms for accomplishing this as well, but
> this is sufficient. One natural alternative would be to let "\{" in the
> augmented channel produce "{" in the output, and "\\" in the input
> produce "\" in the output. But this requires two active characters
> left brace and backslash, which seems overkill. I suggest that we
> not introduce redundant mechanisms.
I have a better proposal.
Short description:
Simply allow "{" in 6-bit mode to output a "{", without changing the
shift register or any other state in the state machine decoder.
Long, verbose, detailed (and somewhat redundant!) description:
A change can instead be made to the 6-bit parser state machine, so that when
a "{" character is read, it is output verbatim. Any state (for 6-bit mode)
that can read a character should follow this rule. The active state should
not change as a result of passing the "{" character, and neither should the
shift register used for the 6->8 bit conversion. This change to the state
machine would be trivial to implement.
Unlike the suggestion of making "{{" magic (which required lookahead), this
proposal requires a "}" to return to augmented 8-bit mode. This is cleaner
and more orthogonal. Another suggestion was made to make "{{}" magic, which
is a bit more appealing (at least the {}'s would balance), but still requires
lookahead. Changing the 6-bit state machine requires no lookahead, trivial
code changes, and allows more readable encodings using 6-bit mode:
"xxx{{}yyy}zzz" (8-bit mode)
vs. "{XXXX{YYYY}}{ZZZZ}" (6-bit mode)
1 2 345 6
These 6 {}'s would have the following meanings:
1. Enter 6-bit mode.
2. Output "{" character.
3. Resume augmented 8-bit mode.
4. Output "}" character.
5. Enter 6-bit mode.
6. Resume augmented 8-bit mode.
Note that you always end up with balanced {}'s in the end, although they
won't match in exactly the same way as the original {}'s in pure 8-bit mode.
(It's close enough to make commands like "%" in vi useful for working with
these encodings.) If you use the "{eg}" 6-bit encoding for "{", then all the
matching "}" characters will be unbalanced. (That could be avoided by also
encoding the "}" in 6-bit mode, but then you still can't find the matching
one in an editor.)
The only danger to watch out for here is encoding a "{" character WITH other
6-bit characters at the same time. A proper encoder would need a more
complex state machine to avoid outputting the "{" before the full 8-bit
character preceding it is outputted in a 6-bit character form. Unless the
"{" is following an even multiple of three 8-bit characters just encoded,
the next 8-bit character following the "{" would need to be encoded BEFORE
outputting the "{" in the 6-bit output; this would "flush" the 6-bit encoding
of 8-bit character which preceded the "{". If a multiple of three 8-bit
characters were just encoded, then the "{" should not be delayed this way,
although it actually wouldn't affect the decoded string.
A lazy encoder could simply refuse to place a "{" in the middle of a 6-bit
string, in which case all it needs to do to encode a "{" is to return to
8-bit mode if necessary (possibly flushing a 6-bit character, and outputting
the "}" to terminate 6-bit mode) and immediately start 6-bit encoding again
and output the "{" character, so a lazy encoder would insert either "X}{{" or
"}{{" in the output. (Obviously, you're losing some encoding efficiency this
way -- 2 bytes to switch modes twice, and 0-4 bits in the 6-bit data stream.)
Of course, the laziest 6-bit encoder would remain in 6-bit mode throughout
and simply encode the {}'s in the 6-bit data stream. The downside here is
that you would lose the ability to use editor features to find the matching
{}'s in the 6-bit data. (You'd also lose two bits of encoding efficiency,
but that's probably not a big deal.)
This really doesn't have to be as complex as it appears. The decoder is very
trivial, and the simplest encoder would use "{{}" in 8-bit mode and encode
{}'s directly into 6-bit data streams. Fancier encoders could get a slightly
better encoding efficience over simplistic ones, but the difference isn't
very large, and the decoder won't care either way.
So, what does everyone think of this proposal?
Deven
References: