compression, responses to q's
Paul, thanks for the instruction on the SAID. I promise to do my required
reading before class next time. Here are some responses to your questions and
the followups. They reflect only my personal opinion and not the policies
or attitudes of IBM, etc. Also, I apologize if company references in this note
are in breach of netiquette. I don't know how to talk about the issues
meaningfully totally in the abstract.
Background
The abbreviation LZ77 refers to a class of algorithms set
forth in a 1977 Lempel-Ziv paper in IEEE Transactions on Information Theory.
The key concept is to use a "sliding window" of previously seen text as a
dictionary, where (optimally) one conceptually "looks ahead" in the input,
finds the longest string (sequence of bytes) matching it in the previously
seen data, and encodes the input by indicating the starting point and length
rather than literally. This type of algorithm is lossless, works on
data of different kinds, and adapts to changing data. Several vendors have
proprietary algorithms of the LZ77 type, including IBM, STAC, and Motorola
Codex. The name of IBM's version is ALDC, or Adaptive Lossless Data
Compression. The algorithm used in BSD Unix compress is a member
of another family, based upon a later paper, called LZ78. The LZ78 class
is somewhat more algorithmically complex, does not have as efficient an
implementation in hardware, performs no better in software, and has the same
kind of intellectual property law problems as the LZ77 class.
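To make the sliding-window idea concrete, here is a toy LZ77 codec in Python.
The token layout and parameter names are purely illustrative assumptions of
mine; they bear no relation to ALDC's (or any standard's) binary encoding
format.

```python
def lz77_encode(data, window=255, min_match=3):
    """Toy LZ77 encoder: emit ("copy", offset, length) tokens for
    repeats found in the sliding window, ("lit", byte) otherwise.
    Illustrative only -- not ALDC's encoding format."""
    i, out = 0, []
    while i < len(data):
        best_len, best_off = 0, 0
        # search the sliding window for the longest match at position i
        for j in range(max(0, i - window), i):
            k = 0
            # overlapping matches (j + k reaching i) are allowed, which
            # is how LZ77 encodes runs
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            out.append(("copy", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decode(tokens):
    """Rebuild the stream by replaying literals and back-references."""
    buf = bytearray()
    for t in tokens:
        if t[0] == "lit":
            buf.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):
                buf.append(buf[-off])  # copy from the history window
    return bytes(buf)
```

Note the asymmetry already visible in the toy version: encoding searches the
window, decoding only copies, which is why decompression is so much cheaper.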
Intellectual Property Issues and Licensing
To the best of my knowledge, when compression algorithms have been made part
of a standard, it is only the binary encoding formats that have been placed in
the public domain. For instance, ALDC is the Quarter Inch Tape standard
designated QIC-154, STAC is QIC-122, etc., and only the binary
encoding format definitions became public domain. Pseudocode is provided
to illustrate the operation of the codec. PC "Demo disks" are available
which perform the same data transformations.
You have all you need technically to write your own codec. There are
three good reasons I would not suggest it. #3: some of the state paths
an LZ77 codec can go through are very long, and the boundary conditions very
infrequent, so it is rather difficult to be absolutely certain your
code is correct. #2: our code is rather well optimized. And #1,
the top reason: patents have been granted for the best apparatus and
algorithms used in hardware and software, e.g. to search for the best
string matches, to use a bit in the compressed data stream to indicate
token type, etc. We do not own all of these patents. But if you license
our software or buy our hardware, we indemnify you against potential damages
resulting from patent infringement claims.
IBM is able to do that due not only to our own patents in the compression
area but to preexisting cross-license agreements with every major computer
company, which exist on account of our large portfolio including fundamental
computer patents, and specific cross-license agreements with companies which
do only compression. In a nutshell, IBM paid, and still pays, heavily to
be able to do business in compression, and to be able to indemnify
licensees of its algorithms.
Listing patent numbers for you would not help. Patent protection begins
when patents are applied for, not when they are granted. The patent office
is years behind, and the filings in the field are voluminous.
This leads to what has been called the submarine patent attack. Also,
although patents or claims may have been mistakenly granted where there
already exist conflicting patents or prior applications, or where there
was well-known prior art which was not disclosed in the
application, it takes tens of millions of dollars and years of time to
overturn one, and even then it's a crapshoot. So it is best to learn to live with them.
Software Availability and Contract Terms
IBM currently offers its ALDC software codecs in binary and source form
(at our discretion) at no charge under an NDA or CEA for bona fide R & D of
protocol and product development. We sell them under a software license
agreement for integration into a product, i.e. if you are using the code
under NDA you have to license it before the product is sold. For source
code the agreements provide that the user will not functionally modify
the algorithm, i.e. change the encoding format, but only make modifications
for integration or packaging purposes. The software license fees are
cheaper than for IDEA or RSA.
Effectiveness in Conjunction with IP
No problem. Anything with recent repeats of byte sequences in the data
stream is compressible. Incompressible data can expand by one part
in eight, worst case. Thus the protocol should be able to indicate
incompressible data and send the data without compression.
Pertaining to this point, Don Eastlake mentions the topic of how well
a specific algorithm works on short messages. There are two cases:
if the entire message is actually short, ALDC does relatively well, as it
is designed to ramp up to its asymptotic compression ratio limit rapidly.
(The rapidity with which the asymptote is reached is very data dependent.)
If the individual packets are just short segments of a long stream,
as a rule you would prefer to compress the data before segmenting it,
and decompress it after collecting it. (I would think this would also
be true for encryption and decryption.) However, if it is not possible
to collect reasonable-size buffers of data, you can use a codec which
supports segmented compression. This means that it can flush its output,
indicate an end of segment, and remember the context or "history" across
the segment boundaries.
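Segmented compression with history retained across boundaries can be
demonstrated with zlib's sync-flush mode, again standing in for ALDC:

```python
import zlib

# Compress a long stream in segments. Z_SYNC_FLUSH emits a
# byte-aligned packet boundary while keeping the history window,
# so later segments can still reference earlier data.
segments = [b"the quick brown fox ", b"the quick brown fox ", b"jumps over"]
c = zlib.compressobj()
packets = [c.compress(s) + c.flush(zlib.Z_SYNC_FLUSH) for s in segments]

# Each packet is fully decodable on arrival; because history carries
# across boundaries, the repeated second segment is just a short
# back-reference.
d = zlib.decompressobj()
recovered = b"".join(d.decompress(p) for p in packets)
```

The second packet comes out far smaller than the first, which is exactly the
benefit of remembering history across segment boundaries.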
Effectiveness in Conjunction with Encryption
There would be two concerns: one, that it would make the output too easy
to cryptanalyse, and the other (bearing on export restrictions) that it would
make it too hard by reducing the redundancy in the data, making a system
with ALDC and DES unexportable. As to the former, it is most unlikely that
reducing the information theoretic redundancy of the payload would make
an attack easier. As to the latter, after being called in to talk to the NSA,
we have not been prohibited from offering a product internationally using
ALDC followed by DES.
Easily and Effectively Implemented in Software
I covered part of this question above in "Intellectual property and
Licensing." ALDC software yields essentially the same speed and compression
ratio as STAC or BSD. There are also some tuning tradeoffs available
between compression ratio, speed, and memory usage in our software codecs.
Decompression is several times as fast as compression, much more so than
with LZ78. The asymmetry is nice for such applications as storage and
multicast communications systems, where you encode once, decode many times.
Easily and Effectively Implemented in Hardware
Exceedingly effectively. To me, this is why LZ77 is golden: the concept
looks rather brute force, and indeed for software the more intricate LZ78
class of algorithms superseded it for many years. About 13 years later,
we (and apparently some others) discovered an elegant hardware embodiment
which uses a CAM (content-addressable memory) to perform progressive string
matching in parallel, effectively removing what is the greatest computational
bottleneck. And our search is exhaustive (always returning the longest
possible string match), which permits better compression ratios with the same
history buffer length, in turn permitting smaller silicon implementations and
ramping up to the maximum compression ratio faster, as mentioned above.
Combined with careful pipelining of the encoder, etc., we get to one input
byte per clock cycle with very few gate delays (high-speed clock) using
very little silicon area. One datum per cycle at very few gate delays:
that's the speed of light for compression. Give me an order and I'll build
a 500 MByte/sec version in GaAs.
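The progressive-matching idea can be simulated in software: keep one "alive"
flag per history position and, as each lookahead byte arrives, retain only the
positions whose match still extends, all positions being tested in parallel in
the CAM. A toy sketch under my own naming, not IBM's actual design:

```python
def longest_match_cam(history, lookahead):
    """Simulate CAM-style progressive matching: every history
    position starts as a candidate; each lookahead byte prunes the
    set to positions whose match extends by that byte. Returns
    (match_length, start_position), exhaustive like the hardware."""
    alive = set(range(len(history)))
    best = (0, None)
    for k, b in enumerate(lookahead):
        # in hardware this pruning is one parallel CAM cycle
        alive = {p for p in alive
                 if p + k < len(history) and history[p + k] == b}
        if not alive:
            break
        best = (k + 1, min(alive))
    return best
```

Each input byte costs one pass over the candidate set; in the CAM that pass is
a single cycle, which is where the one-byte-per-clock throughput comes from.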
Integrity Checking
As with decryption, the ALDC decoders need a reliable data stream. No
forward error correction, etc., is provided by ALDC, because the additional
redundancy required would reduce compression, and the tradeoffs to be
made would be highly application specific. However, our hardware and
software codecs can check for and avoid the input buffer underrun and
output buffer overflow which could result from being fed corrupted
input data.
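To illustrate the kind of check meant here, a decoder's back-reference copy
can validate the offset and the output length before writing anything. A
hypothetical sketch (my names and token shape, not ALDC's actual format):

```python
def safe_copy(buf, off, length, max_out):
    """Bounds-checked back-reference copy for a toy LZ77 decoder:
    reject offsets outside the accumulated history and copies that
    would overflow the output buffer, both symptoms of corrupt input."""
    if off < 1 or off > len(buf):
        raise ValueError("offset outside history (corrupt input)")
    if len(buf) + length > max_out:
        raise ValueError("output buffer overflow (corrupt input)")
    for _ in range(length):
        buf.append(buf[-off])
    return buf
```

A decoder built this way fails cleanly on garbage instead of reading or
writing out of bounds.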
regards,
Oscar Strohacker
Advisory Engineer/Scientist
Data Compression Systems Architecture
IBM Microelectronics Division
11400 Burnet Road
Austin Texas 78758
o (512) 838-4077 f (512) 838-7004 Internet: stroh@vnet.ibm.com