[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: key-ed MD5 again

An additional note on this subject from Richard Schroeppel
(rcs@cs.arizona.edu) follows.  As it is a rather detailed missive, let
me summarize the bottom line as a contribution to rough concensus in
favor of MD5(key,data,key).

    Date: Tue, 28 Mar 95 14:26:29 GMT
    From: "William Allen Simpson" <Bill.Simpson@um.cc.umich.edu>
    To: IPSEC@ans.net
    Subject: Re: key-ed MD5 again
    Status: R

    > From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
    > > Some significant analysts think that a single key at the beginning of
    > > MD5 does not provide enough key material when the text is long.  The
    > > MD5(key,MD5(text)) was suggested to improve the effect of the key in the
    > > final hash.
    > As MD5 is a chain of addition, a transitive 1-to-1 mapping, of
    > hashed values, it is unlikely that the initial scrambling effect
    > by the added hased key is weakened later.
    An excellent question?  The conclusion was passed to me word of mouth.
    But, looking at the algorithm, it seems to me that up to 4 bits of
    influence can be lost from the high end of the sum on each block from
    lost carry in the four registers.  As the text grows longer, less of the
    initial key effect is seen.

    It is that addition was used instead of xor that has this result.


The final few lines of the reference inplementation of MD5 are

  buf[0] += a;
  buf[1] += b;
  buf[2] += c;
  buf[3] += d;

Here buf[] is an array of 4 32bit words that represent the state of
MD5 before eating the present block (64 bytes) of information; and
a,b,c,d are 4 32bit words that are a very-complicated function of both
buf[] and the 64byte data block.

The fact that the combining operator used here is + rather than xor
doesn't cause any information loss.  The missing carries don't matter
in the least:  If we know the contents of buf[] after this code has
been executed, and the values of a,b,c,d, then we can perfectly well
determine the contents of buf[] prior to the code; in fact, we could do
       buf[0] -= a;  buf[1] -= b;  buf[2] -= c;  buf[3] -=d;
to recover the original four 32bit numbers in buf[].

Using an xor operator would be equivalent, from the perspective of
(no) information loss in these 4 lines of code.

The information loss in MD5(key, VeryLongText) is more subtle, and arguably
inconsequential.  Suppose we model the MD5 block-eating operation as a random
map of 128bits -> 128bits, with the particular details of the map determined
by the 64 byte data block that is consumed.  The range of such a random map
will typically include only 63% of the possible values; each range value has
a 1/e chance of not appearing.  About 37% of the range values have one
inverse, 18% have two inverses, etc.  Eating one data block has, in effect,
erased .66 bits of information from the original 128bit state, since the
range contains only 2^127.34 values.  This is no big deal, but clearly will
be a problem if multiple rounds of the MD5 operation continue to erase .66
bits on each round.  But this doesn't happen.  On the next round, since the
first-round range has diminished by a factor of .63, the percentage of
colliding values also drops:  After two random maps, the expected range size
is 47% of the initial range, which is larger than .63^2.

The recursion for the expected range size after N applications of
(independent) random maps is

                            - F(N)
  F(0) = 1;  F(N+1) = 1 - e

The first few values after F(0) = 1 are

 F(1) = .632,  F(2) = .469,  F(3) = .374,  F(4) = .312,  F(5) = .268 ...

It's easy to check that F(N) does approach 0 for large N, but extremely
slowly.  In fact,

  F(N) ~=  2 / (N + log terms).

In the case at hand, suppose the VeryLongText above is 1 MB.  This is
15,625 blocks of 64bytes; if we feed those to MD5 one block at a time,
the expected size of the range is diminished by a factor of 15625/2,
or about 8000.  In other words, we've lost 13 bits of the original 128
bits in the initial state of MD5.  If VeryLongText is 1 GB, the same
calculation shows we'll lose 23 bits.

Now suppose Shamir comes up with some discovery that, say, the high bit of
each word of the MD5 state is lost during the mapping process, so that
16 initial MD5 states map to one range state.  It almost doesn't matter!
The recursion above becomes

                            - 16 F(N)
  F(0) = 1;  F(N+1) = 1 - e

Although the information loss in the first round is dramatic, the
asymptotic formula for this revised F is just

  F(N) ~=  2 / (16 N + log terms).

So the amount of extra information loss from the initial state is 4 bits.

The formula MD5(Key, VeryLongText) means that Key is a 128bit initial
value put into the MD5 state.  We can achieve almost the same effect
by prefixing Key to VeryLongText and using regular MD5 with the standard
initial state.  If we also append Key, getting

  MD5( Key .conc. VeryLongText .conc. Key )

then we cancel out all of the intermediate information loss discussed
above, and also protect against some appending attacks.

Rich Schroeppel   rcs@cs.arizona.edu