
compression, responses to q's



Paul, thanks for the instruction on the SAID. I promise to do my required
reading before class next time.  Here are some responses to your questions and
the followups.  They reflect only my personal opinion and not the policies
or attitudes of IBM, etc.  Also, I apologize if company references in this note
are in breach of netiquette.  I don't know how to talk about the issues
meaningfully while keeping them entirely abstract.

Background

   The abbreviation LZ77 refers to a class of algorithms set
   forth in a 1977 Lempel-Ziv paper in IEEE Transactions on Information Theory.
   The key concept is to use a "sliding window" of previously seen text as a
   dictionary, where (optimally) one conceptually "looks ahead" in the input,
   finds the longest string (sequence of bytes) matching it in the previously
   seen data, and encodes the input by indicating the starting point and length
   rather than literally.  This type of algorithm is lossless, works on
   data of different kinds, and adapts to changing data.  Several vendors have
   proprietary algorithms of the LZ77 type, including IBM, STAC, and Motorola
   Codex.  The name of IBM's version is ALDC, or Adaptive Lossless Data
   Compression.   The algorithm used in BSD Unix compress is a member
   of another family, based upon a later paper, called LZ78.  The LZ78 class
   is somewhat more algorithmically complex, does not have as efficient an
   implementation in hardware, performs no better in software, and has the same
   kind of intellectual property law problems as the LZ77 class.
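
   To make the sliding-window idea concrete, here is a minimal sketch in C
   of the core match search.  It is my own illustration, not IBM pseudocode
   or any standard's reference code; the window size, minimum match length,
   and function names are arbitrary.

      #include <stddef.h>

      /* Illustrative constants only; a real codec fixes these as part of
       * its encoding format.                                              */
      #define WINDOW_SIZE 512   /* bytes of history searched for matches  */
      #define MIN_MATCH     2   /* shorter matches are sent as literals   */

      /* Find the longest match for in[pos..] that starts in the previous
       * WINDOW_SIZE bytes.  Returns the match length and writes the
       * backward offset; a result below MIN_MATCH means "emit a literal". */
      static size_t longest_match(const unsigned char *in, size_t pos,
                                  size_t len, size_t *offset)
      {
          size_t best_len = 0, best_off = 0;
          size_t start = (pos > WINDOW_SIZE) ? pos - WINDOW_SIZE : 0;

          for (size_t cand = start; cand < pos; cand++) {
              size_t l = 0;
              /* Overlap past pos is allowed: the decoder will already have
               * produced those bytes by the time it needs to copy them.   */
              while (pos + l < len && in[cand + l] == in[pos + l])
                  l++;
              if (l > best_len) {
                  best_len = l;
                  best_off = pos - cand;   /* distance back into the window */
              }
          }
          *offset = best_off;
          return best_len;
      }

   The encoder then emits either the (offset, length) pair or a literal byte
   and advances.  The brute-force scan above is only to show what "longest
   previous match" means; it is exactly the step that hashing replaces in
   software, or the CAM search described under the hardware question below.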


Intellectual Property Issues and Licensing

   To the best of my knowledge, when compression algorithms have been made part
   of a standard, it is the binary encoding formats that have been placed in
   the public domain.  For instance, ALDC is the Quarter Inch Tape standard
   designated QIC-154, STAC is QIC-122, etc., and only the binary
   encoding format definitions became public domain.  Pseudocode is provided
   to illustrate the operation of the codec.  PC "Demo disks" are available
   which perform the same data transformations.

   You have all you need technically to write your own codec.  There are
   three good reasons I would not suggest it.  #3: some of the state paths
   an LZ77 codec can go through are very long, and the boundary conditions
   very infrequent, so it is rather difficult to be absolutely certain your
   code is correct.  #2: our code is rather well optimized.  And #1, the
   top reason: patents have been granted for the best apparatus and
   algorithms used in hardware and software, e.g. to search for the best
   string matches, to use a bit in the compressed data stream to indicate
   token type, etc.  We do not own all of these patents.  But if you
   license our software or buy our hardware, we indemnify you against
   potential damages resulting from patent infringement claims.

   IBM is able to do that not only because of our own patents in the
   compression area but because of preexisting cross-license agreements with
   every major computer company, which exist on account of our large
   portfolio, including fundamental computer patents, and specific
   cross-license agreements with companies which do only compression.  In a
   nutshell, IBM paid, and still pays, heavily to be able to do business in
   compression, and to be able to indemnify licensees of its algorithms.

   Listing patent numbers for you would not help.  Patent protection begins
   when patents are applied for, not when they are granted.  The patent
   office is years behind, and the filings in the field are voluminous.
   This leads to what has been called the submarine patent attack.  Also,
   although patents or claims may have been mistakenly granted where there
   already exist conflicting patents or prior applications, or where there
   was well-known prior art which was not disclosed in the application, it
   takes tens of millions of dollars and years of time, and it is still a
   crapshoot, to overturn one.  So it is best to learn to live with them.


Software Availability and Contract Terms

   IBM currently offers its ALDC software codecs in binary and source form
   (at our discretion) at no charge under an NDA or CEA for bona fide R & D of
   protocol and product development. We sell them under a software license
   agreement for integration into a product, i.e. if you are using the code
   under NDA you have to license it before the product is sold.  For source
   code the agreements provide that the user will not functionally modify
   the algorithm, i.e. change the encoding format, but only make modifications
   for integration or packaging purposes.  The software license fees are
   cheaper than for IDEA or RSA.

Effectiveness in Conjunction with IP

   No problem.  Anything with recent repeats of byte sequences in the data
   stream is compressible.  Incompressible data can expand by one part
   in eight, worst case.  Thus the protocol should be able to flag
   incompressible data and send it without compression.
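
   (The one-part-in-eight figure is consistent with the token-type flag bit
   mentioned earlier: a literal byte costs nine output bits instead of
   eight, so all-literal data expands by 1/8.)  Here is a minimal sketch in
   C of the kind of fallback framing meant here, with a hypothetical
   one-byte block header; compress() is a stand-in name for the real codec,
   assumed to return 0 when its output would not fit.

      #include <stddef.h>
      #include <string.h>

      /* Hypothetical framing: one header byte per block saying whether the
       * payload is compressed (1) or stored raw (0).  compress() stands in
       * for the real codec and is assumed to return the compressed length,
       * or 0 if the output would not fit in out_max bytes.                */
      size_t compress(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_max);

      size_t frame_block(const unsigned char *in, size_t in_len,
                         unsigned char *out, size_t out_max)
      {
          size_t clen = compress(in, in_len, out + 1, out_max - 1);

          if (clen > 0 && clen < in_len) {
              out[0] = 1;                  /* compressed payload follows */
              return clen + 1;
          }
          out[0] = 0;                      /* incompressible: send raw   */
          memcpy(out + 1, in, in_len);     /* caller ensures out_max > in_len */
          return in_len + 1;
      }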

   Pertaining to this point, Don Eastlake mentions the topic of how well
   a specific algorithm works on short messages.  There are two issues:
   if the entire message is actually short ALDC does relatively well, as it
   is designed to ramp up to its asymptotic compression ratio limit rapidly.
   (The rapidity with which the asymptote is reached is very data dependent.)
   If the individual packets are just short segments of a long stream,
   as a rule you would prefer to compress the data before segmenting it,
   and decompress it after collecting it.  (I would think this would also
   be true for encryption and decryption.)  However, if it is not possible
   to collect reasonable-size buffers of data, you can use a codec which
   supports segmented compression.  This means that it can flush its
   output, indicate an end of segment, and remember the context or
   "history" across segment boundaries.
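
   A rough sketch of what a segmenting interface could look like (my own
   guess at the shape of such an API, not IBM's actual one):

      /* Hypothetical segmented-compression interface.  The context keeps
       * the sliding-window history and any partially assembled output
       * bits alive between calls, so packet N+1 can still reference
       * strings seen in packet N.                                        */
      typedef struct codec_ctx codec_ctx;

      codec_ctx *codec_open(void);
      void       codec_close(codec_ctx *c);

      /* Compress one packet's worth of data.  If end_of_segment is
       * nonzero, flush buffered bits, emit an end-of-segment marker, and
       * pad to a byte boundary -- but do not reset the history window.   */
      size_t codec_compress(codec_ctx *c,
                            const unsigned char *in,  size_t in_len,
                            unsigned char       *out, size_t out_max,
                            int end_of_segment);

   The point is that flushing is separated from resetting: each packet's
   compressed output ends on a byte boundary and can be handed to the
   decompressor as it arrives, while the compression ratio still benefits
   from everything sent earlier in the stream.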

Effectiveness in Conjunction with Encryption

   There would be two concerns: one that it would make the output too easy
   to cryptanalyse, and the other (bearing on export restrictions) that it
   would make it too hard by reducing the redundancy in the data, making a
   system with ALDC and DES unexportable.  As to the former, it is most
   unlikely that
   reducing the information theoretic redundancy of the payload would make
   an attack easier.  As to the latter, after being called to talk to the NSA
   we have not been prohibited from offering a product internationally using
   ALDC followed by DES.

Easily and Effectively Implemented in Software

   I covered part of this question above in "Intellectual property and
   Licensing."  ALDC software yields essentially the same speed and compression
   ratio as STAC or BSD.  There are also some tuning tradeoffs available
   between compression ratio, speed, and memory usage in our software codecs.
   Decompression is several times as fast as compression, much more so than
   with LZ78.  The asymmetry is nice for such applications as storage and
   multicast communications systems, where you encode once, decode many times.

Easily and Effectively Implemented in Hardware

   Exceedingly effectively.  To me, this is why LZ77 is golden: the concept
   looks rather brute force, and indeed for software the more intricate LZ78
   class of algorithms superseded it for many years.  About 13 years later,
   we (and apparently some others) discovered an elegant hardware embodiment
   which uses a CAM (content-addressable memory) to perform progressive
   string matching in parallel, effectively removing what is the greatest
   computational bottleneck.  And our search is exhaustive (always returning
   the longest possible string match), which permits better compression
   ratios with the same history buffer length, in turn permitting smaller
   silicon implementations and ramping up to the maximum compression ratio
   faster, as mentioned above.
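
   To give a feel for what the CAM buys, here is a toy software analogue in
   C for a 64-byte history, chosen so the match vector fits in one 64-bit
   word.  The names and the tiny window are purely illustrative.

      #include <stdint.h>

      /* Bit i of `alive` means "the string starting at history position i
       * still matches every lookahead byte seen so far".  Each call folds
       * in one more lookahead byte; when the result goes to zero, the
       * previous `alive` and match_len identify the longest match.        */
      static uint64_t extend_matches(const unsigned char history[64],
                                     unsigned char next_byte,
                                     uint64_t alive, int match_len)
      {
          uint64_t hits = 0;
          for (int i = 0; i + match_len < 64; i++)   /* one compare per CAM cell */
              if (history[i + match_len] == next_byte)
                  hits |= (uint64_t)1 << i;
          return alive & hits;        /* candidates surviving this byte */
      }

   In software the loop above still costs one compare per history position
   per input byte; in the hardware all of those compares happen in a single
   clock, so the exhaustive search comes essentially for free.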

   Combined with careful pipelining of the encoder, etc., we get to one
   input byte per clock cycle with very few gate delays (a high-speed clock)
   using very little silicon area.  One datum per cycle at very few gate
   delays, that's the speed of light for compression.  Give me an order and
   I'll build a 500 MByte/sec version in GaAs.

Integrity Checking

   As with decryption, the ALDC decoders need a reliable data stream.  No
   forward error correction, etc., is provided by ALDC, because the
   additional redundancy required would reduce compression, and the
   tradeoffs to be made would be highly application specific.  However, our
   hardware and software codecs do check for and avoid the input buffer
   underrun and output buffer overflow which could result from being
   provided with corrupted input data.
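
   As an illustration of the kind of check meant here (a generic sketch in
   C, not ALDC's actual decoder), a decoder's copy step can validate both
   the distance and the length it decodes from a possibly corrupted token
   before moving any bytes:

      #include <stddef.h>

      /* `dist` and `len` come from a decoded copy token; with corrupted
       * input either may be nonsense, so both are validated against the
       * amount of output produced so far and the space remaining.        */
      static int copy_token(unsigned char *out, size_t out_len,
                            size_t *out_pos, size_t dist, size_t len)
      {
          if (dist == 0 || dist > *out_pos)     /* reaches before start of output */
              return -1;
          if (len > out_len - *out_pos)         /* would overrun the output buffer */
              return -1;
          for (size_t i = 0; i < len; i++) {    /* byte-by-byte copy handles overlap */
              out[*out_pos] = out[*out_pos - dist];
              (*out_pos)++;
          }
          return 0;                             /* token applied safely */
      }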

                                    regards,

                                      Oscar Strohacker


Advisory Engineer/Scientist
Data Compression Systems Architecture
IBM Microelectronics Division
11400 Burnet Road
Austin Texas 78758

o (512) 838-4077      f (512) 838-7004         Internet: stroh @ vnet.ibm.com