[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: fast 386 DES code figures



> 	When four us did an implementation back in the late 70s, as
> grad students at MIT, we used tables to compose SP and E.  Later one
> of the grad students figured out how to E in a very efficient fashion,
> whic allowed us to go back to just an S&P composition, with 32-bit
> vs. 48 bit values.  I think that's the common approach today, with
> the asusmption that the space devoted to the tables is not a big deal
> on most modern machines.  Our fastest implementations were in line code,
> but instruction cacheing may argue for loops.

I had correspondence a while ago with a fellow who managed to beat my
assembler IDEA code with DES.  I was *very* impressed.  I can't find
that code around, but it was based on a very efficient E that let you
go back to a 32-bit implementation.

As usual, the S-boxes include the final P and are 32 bits wide, and
you use 12-bit inputs.  The idea was to scramble the bits into some
odd order such that you'd have the data in (say) ah, al, dh and dl, and
move pairs of bytes into bh and bl, mask off the two low bits
and the two high bits, XOR in some key material, and do a table
lookup.

The tables fit nicely into one 64K segment, so if you prepare the
key schedule with the high two bits set properly, after the XOR, you
are left with the address of the 32-bit word to accumulate.

The trick was that the S-boxes that you did the lookups in simultaneously
were NOT adjacent.  Let me see if I can recreate the technique...

When DES encoding, the E expansion duplicates half the bits.  (To be
precise, the ones congruent to 3 and 0 mod 4 in the original 32-bit word.)
In this modified form, the bits that appear twice are the middle 4
bits of each byte register.  Each byte appears once in each position
(high or low) of the lookup, when the high or low 2 bits are masked off,
so those bits are only used in one lookup.  So let's analyze what
requirements are needed for the shared bits.

A simple mapping, where

1ah2

Means that the high two bits of AH are the private bits of S-box 1 and
the low bits are the private bits of S-box 2.  From this, I'll derive
a schedule for the shared bits.

Assuming the registers are used in pairs in the order:

ah:al
al:dh
dh:dl
dl:ah

Then assigning 1ah2:3al4 can't work, because when that pair is loaded
and masked, (resulting in ah2:3al), the private bits of S-boxes 2 and 3
are visible, requiring that the shared bits that are visible are the
1/2, 2/3, 2/3 and 3/4 pairs.  But the two 2/3's need to be in different
positions, so they can be XORed with different key material.  No good!

The first schedule to consider is:

1ah2:4al3:5dh6:8dl7:1ah2

For the first pair, ah2:4al, the shared bits in ah and al must
include  1/2, 2/3, 3/4 and 4/5.

For the second pair, al3:5dh, the shared bits in al and dh must
include 2/3, 3/4, 4/5 and 5/6.

For the third pair, dh6:8dl, the shared bits in ah and al must
include  5/6, 6/7, 7/8, and 8/1.

For the second pair, dl7:1ah, the shared bits in al and dh must
include 6/7, 7/8, 8/1 and 1/2.

Thus, al must contain 2/3, 3/4 and 4/5.  H'm.  That's not going to work.

How about 2ah1:4al3:6dh5:8dl7:2ah1....

ah1:4al -> 8/1, 1/2, 3/4, 4/5
				al -> 3/4, ???
al3:6dh -> 2/3, 3/4, 5/6, 6/7

No, that won't work, either.  Each pair must share two things with the
next pair.  Okay, let's try:

1ah8:2al7:3dh6:4dl5:1ah8

ah8:2al -> 7/8, 8/1, 1/2, 2/3
				al -> 7/8, 2/3
al7:3dh -> 6/7, 7/8, 2/3, 3/4
				dh -> 6/7, 3/4
dh6:4dl -> 5/6, 6/7, 3/4, 4/5
				dl -> 5/6, 4/5
dl5:1ah -> 4/5, 5/6, 8/1, 1/2
				ah -> 8/1, 1/2
ah8:2al -> 7/8, 8/1, 1/2, 2/3

That schedule works.  The trick was arranging things to that consecutive
paris always had two pairs of numbers differing by 1.  Mostly, cyclic
works, but you have to end the cycle somewhere.  Another one that works is

8ah1:5al2:6dh3:7dl4:8ah1

ah1:5al -> 8/1, 1/2, 4/5, 5/6
				al -> 1/2, 5/6
al2:6dh -> 1/2, 2/3, 5/6, 6/7
				dh -> 2/3, 6/7
dh3:7dl -> 2/3, 3/4, 6/7, 7/8
				dl -> 3/4, 7/8
dl4:8ah -> 3/4, 4/5, 7/8, 8/1
				ah -> 4/5, 8/1
ah1:5al -> 8/1, 1/2, 4/5, 5/6

This is somewhat more symmetric.

Anyway, it appeared to produce blazing speed on a 16-bit 8086 with a
not-totally-unreasonable table requirement.  (48 bits of key per round,
64K of fixed s-box tables, and some tables to do the new, different,
IP and IP'.

I can dig up the impmentor from my e-mail files and ask for his code if
you'd like.  I think I still have it somewhere.
-- 
	-Colin