[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Base-64 proposal




Please ignore my last response to George Michaelson's questions.  He
raised two good points, which I didn't think about adequately before
responding.  Here is a better reply, and a slightly modified base64
proposal.

* The basic model for SPKI/SDSI is an ordinary 8-bit channel providing a 
  sequence of bytes (octets) that are then interpreted according to the 
  SPKI/SDSI rules to be parsed into lists, etc.

* In some cases it is desirable to code the elements of the 8-bit channel
  with 6-bit (base-64) characters (hextets?) for protection against
  mailer and channel damage.  The usual encoding (RFC 1521) uses characters
    	A -- Z   a -- z  0 -- 9  + /
  to denote the hextets with decimal value 0--63, respectively.

* Thus, we propose an "augmented 8-bit channel" that allows one to drop
  into six-bit mode temporarily.  We propose that in an augmented 8-bit
  channel, a left brace "{" signals dropping into 6-bit mode, and that
  when in six-bit mode, a right brace "}" signals popping back out into
  augmented 8-bit mode.  

	...8-bits...{...6-bits...}...8-bits...

  The difference between ordinary 8-bit mode and augmented 8-bit mode is
  that the left brace is used to signal the transition to 6-bit mode.
  In augmented 8-bit mode a right brace is treated as an ordinary character.

* The result of processing an byte sequence that is in augmented 8-bit
  mode is a byte sequence in ORDINARY eight-bit mode.  That is, if the
  result of decoding the six-bit bytes you get a left brace, that brace
  does NOT recursively signal another embedded six-bit channel.

* How do you represent a left brace in augmented 8-bit mode?
  The natural answer is as follows:
	A left brace is octal 173 = binary 01111011
	which breaks into hextets as 011110 11xxxx  (x = doesn't matter)
	Then 011110 encodes as "e"
	and  11xxxx encodes as "g" (as one possibility)
  Thus, by writing
	{eg}
  in the augmented 8-bit channel, you get a left brace into the output
  ordinary 8-bit channel.

  One could introduce other mechanisms for accomplishing this as well, but
  this is sufficient.  One natural alternative would be to let "\{" in the
  augmented channel produce "{" in the output, and "\\" in the input 
  produce "\" in the output.  But this requires two active characters
  left brace and backslash, which seems overkill.  I suggest that we
  not introduce redundant mechanisms.

  (Note that my previous note about using #1:{ was just wrong.)

* A reader for an augmented 8-bit channel is just a simple state machine
  that remembers whether it is reading in 8-bit mode or 6-bit mode, and
  (if in six-bit mode) what "left-over bits" there are.  Each character
  may cause a state change and may cause an 8-bit character to be output
  for the output channel (which is the ordinary 8-bit channel).  More
  specifically, to process an input character x, do the following.
  The state-machine has a left-over bits register that may hold up
  to 12 bits, and a length-indicator for that register.
	-- if in eight-bit mode:
		if x is not a left brace, output x and return.
		otherwise if x is left brace:
			switch to 6-bit mode, clear the left-over bits
			register, zero its length indicator, and return.
	-- if in six-bit mode:
		if x is a right brace:
			switch to 8-bit mode and return.
		if x is not a base-64 character (alphanumeric or +/)
			ignore this character and return.
		Otherwise, append the six-bit value represented by x to
	        the contents of the left-over-bits register, and increase
	        the length indicator by six.  If the left-over-bits register
		now has 8 or more bits, output the leftmost 8 of those bits,
	        and delete them from the register, and subtract 8 from the
	        length indicator.
  
  Thus, the "get character" routine can be written so as to repeatedly
  read characters from the augmented channel until an 8-bit character is
  output for the ordinary channel.  

* Note that you can't nest or recurse, since the output channel is an
  ordinary 8-bit channel, not an augmented 8-bit channel.  If one wants
  to protect a byte-sequence from mutilation, it is silly and wasteful
  to recursively encode stuff that is already 6-bit encoded.  With the
  current proposal, if you have for some reason a partially encoded
  sequence in augmented 8-bit representation:

	xxxx...xxx{YYYYYYYY}zzzzzzzzz

  then the correct way to protect the entire thing is then just to recode the
  eight-bit portions xx's and zz's:

	{XXX....XXX}{YYYYYYYY}{ZZZZ....ZZZZZ}

  where the XX's are the 6-bit encoding of the xx's, and the ZZ's are the
  six-bit encoding of the zz's. 
  
  Thus, it is possible to efficiently encode an sequence, even if portions
  of it were already 6-bit encoded.  You only need to know that your input
  is already in augmented 8-bit form in order to know that you need to
  watch out for already-encoded stuff, so that you can just copy it verbatim
  to the result.

* The fact that non-base-64 characters are ignored in six-bit mode gives
  one the possibility of getting "fragmentation", as noted in my earlier
  note:
	abc{
        }def
  produces
	abcdef
  when decoded.

* This proposal for augmented 8-bit channels is formally independent of
  anything else in SPKI/SDSI.  Indeed, it could be used to represent
  any kind of data, not just SPKI/SDSI data.

Ron Rivest



	        

		

Follow-Ups: