[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Base-64 proposal

On Wed, 16 Apr 1997, Ron Rivest wrote:

> * How do you represent a left brace in augmented 8-bit mode?
>   The natural answer is as follows:
> 	A left brace is octal 173 = binary 01111011
> 	which breaks into hextets as 011110 11xxxx  (x = doesn't matter)
> 	Then 011110 encodes as "e"
> 	and  11xxxx encodes as "g" (as one possibility)
>   Thus, by writing
> 	{eg}
>   in the augmented 8-bit channel, you get a left brace into the output
>   ordinary 8-bit channel.

Obviously, you could always use this form, but it's unpleasant to look at...

>   One could introduce other mechanisms for accomplishing this as well, but
>   this is sufficient.  One natural alternative would be to let "\{" in the
>   augmented channel produce "{" in the output, and "\\" in the input 
>   produce "\" in the output.  But this requires two active characters
>   left brace and backslash, which seems overkill.  I suggest that we
>   not introduce redundant mechanisms.

I have a better proposal.

Short description:

    Simply allow "{" in 6-bit mode to output a "{", without changing the
    shift register or any other state in the state machine decoder.

Long, verbose, detailed (and somewhat redundant!) description:

A change can instead be made to the 6-bit parser state machine, so that when
a "{" character is read, it is output verbatim.  Any state (for 6-bit mode)
that can read a character should follow this rule.  The active state should
not change as a result of passing the "{" character, and neither should the
shift register used for the 6->8 bit conversion.  This change to the state
machine would be trivial to implement.

Unlike the suggestion of making "{{" magic (which required lookahead), this
proposal requires a "}" to return to augmented 8-bit mode.  This is cleaner
and more orthogonal.  Another suggestion was made to make "{{}" magic, which
is a bit more appealing (at least the {}'s would balance), but still requires
lookahead.  Changing the 6-bit state machine requires no lookahead, trivial
code changes, and allows more readable encodings using 6-bit mode:

     "xxx{{}yyy}zzz" (8-bit mode)
 vs. "{XXXX{YYYY}}{ZZZZ}" (6-bit mode)
      1    2    345    6

These 6 {}'s would have the following meanings:

   1. Enter 6-bit mode.
   2. Output "{" character.
   3. Resume augmented 8-bit mode.
   4. Output "}" character.
   5. Enter 6-bit mode.
   6. Resume augmented 8-bit mode.

Note that you always end up with balanced {}'s in the end, although they
won't match in exactly the same way as the original {}'s in pure 8-bit mode.
(It's close enough to make commands like "%" in vi useful for working with
these encodings.)  If you use the "{eg}" 6-bit encoding for "{", then all the
matching "}" characters will be unbalanced.  (That could be avoided by also
encoding the "}" in 6-bit mode, but then you still can't find the matching
one in an editor.)

The only danger to watch out for here is encoding a "{" character WITH other
6-bit characters at the same time.  A proper encoder would need a more
complex state machine to avoid outputting the "{" before the full 8-bit
character preceding it is outputted in a 6-bit character form.  Unless the
"{" is following an even multiple of three 8-bit characters just encoded,
the next 8-bit character following the "{" would need to be encoded BEFORE
outputting the "{" in the 6-bit output; this would "flush" the 6-bit encoding
of 8-bit character which preceded the "{".  If a multiple of three 8-bit
characters were just encoded, then the "{" should not be delayed this way,
although it actually wouldn't affect the decoded string.

A lazy encoder could simply refuse to place a "{" in the middle of a 6-bit
string, in which case all it needs to do to encode a "{" is to return to
8-bit mode if necessary (possibly flushing a 6-bit character, and outputting
the "}" to terminate 6-bit mode) and immediately start 6-bit encoding again
and output the "{" character, so a lazy encoder would insert either "X}{{" or
"}{{" in the output.  (Obviously, you're losing some encoding efficiency this
way -- 2 bytes to switch modes twice, and 0-4 bits in the 6-bit data stream.)

Of course, the laziest 6-bit encoder would remain in 6-bit mode throughout
and simply encode the {}'s in the 6-bit data stream.  The downside here is
that you would lose the ability to use editor features to find the matching
{}'s in the 6-bit data.  (You'd also lose two bits of encoding efficiency,
but that's probably not a big deal.)

This really doesn't have to be as complex as it appears.  The decoder is very
trivial, and the simplest encoder would use "{{}" in 8-bit mode and encode
{}'s directly into 6-bit data streams.  Fancier encoders could get a slightly
better encoding efficience over simplistic ones, but the difference isn't
very large, and the decoder won't care either way.

So, what does everyone think of this proposal?