6.02 - Huffman Encoding and LZW Compression
Lecture 2
Katrina LaCurts, lacurts@mit.edu

======================================================================
Overview
======================================================================

- This lecture: efficient ways to encode strings as bits
- Next few lectures: how to make those encodings resilient to errors

Our overall goal today is encoding symbols efficiently, where
efficiency is measured by how close our expected codeword length gets
to the entropy.

======================================================================
Background
======================================================================

----------------------------------------------------------------------
Shannon's information source
----------------------------------------------------------------------

The model we're working with is the one George discussed last week: a
stationary, discrete, memoryless, probabilistic source.

   [S] -> ... s_8, s_11, s_12, s_5, s_4, s_8, s_2, ...

To recap:
- Symbols are emitted sequentially in time
- The symbol at each time is chosen independently of the choices at
  other times
- Symbols are chosen from the same probability distribution (we will
  use the variables s_i and p_i: symbol s_i occurs with probability
  p_i)

The last two properties mean this is an "independent, identically
distributed" (i.i.d.) stream.

This is a fairly simple model, and you should be thinking about what
types of sources it captures. For now, let's see what we can do with
it.

You'll also remember from last week that, in our model, the source
then goes through an encoder:

   [S] -> ... s_8, s_11, ... -> [E] -> 100010

This converts each symbol s_i into a codeword made of binary digits.
The binary digits will then be sent over a binary channel:

   [S] -> ... s_8, s_11, ... -> [E] -> ... 100010 ... -> [C] ->

(More on this during the second part of the course.)

----------------------------------------------------------------------
Code trees
----------------------------------------------------------------------

Imagine we have the following source (these are symbols and their
probabilities):

   X, .4 ; Y, .2 ; Z, .3 ; W, .1

We ended last week with a discussion of the entropy of S, H(S):

   H(S) = average information (in binary information units) per symbol
        = the uncertainty about the source

For this source, H(S) = 1.84644.

To represent an encoding, we use code trees, as George showed last
week. Here is one example of a code tree for this source:

        .              encoding:
       / \             X: 00
      .   .            Y: 10
     /   / \           W: 110
    X   Y   .          Z: 111
           / \
          W   Z

   L = 2*.4 + 2*.2 + 3*.1 + 3*.3 = 2.4

(Symbols appear only on leaves, not internal nodes, because we need
codes to be unambiguous, or "prefix-free".)

We are interested in code trees that are efficient, i.e., that use as
few bits per codeword (on average) as possible. We will denote the
average codeword length as L, and note that:

   L = sum_i (p_i * l_i)

where l_i is the length of the codeword for symbol s_i.

Does this code seem sensible/efficient? No; at the very least, we
should move X up one level:

      .                encoding:
     / \               X: 0
    X   .              Y: 10
       / \             W: 110
      Y   .            Z: 111
         / \
        W   Z

   L = 1*.4 + 2*.2 + 3*.1 + 3*.3 = 2.0

What about now? Could we do anything else? Yes; Z is more probable
than Y, so Y should not have a shorter codeword than Z:

      .                encoding:
     / \               X: 0
    X   .              Y: 111
       / \             W: 110
      Z   .            Z: 10
         / \
        W   Y

   L = 1*.4 + 3*.2 + 3*.1 + 2*.3 = 1.9
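To check these numbers, here is a minimal Python sketch (the symbol
names and code lengths are just those read off the trees above) that
computes H(S) and the average codeword length L for each of the three
trees:

   # A minimal sketch: compute H(S) and L = sum_i p_i*l_i for the
   # X/Y/Z/W source and the three code trees above.
   import math

   probs = {"X": 0.4, "Y": 0.2, "Z": 0.3, "W": 0.1}

   def entropy(p):
       # H(S) = sum_i p_i * log2(1/p_i)
       return sum(pi * math.log2(1 / pi) for pi in p.values())

   def avg_length(p, lengths):
       # L = sum_i p_i * l_i
       return sum(p[s] * lengths[s] for s in p)

   trees = {
       "first":  {"X": 2, "Y": 2, "W": 3, "Z": 3},  # X=00, Y=10,  W=110, Z=111
       "second": {"X": 1, "Y": 2, "W": 3, "Z": 3},  # X=0,  Y=10,  W=110, Z=111
       "third":  {"X": 1, "Y": 3, "W": 3, "Z": 2},  # X=0,  Y=111, W=110, Z=10
   }

   print(f"H(S) = {entropy(probs):.5f}")                     # 1.84644
   for name, lengths in trees.items():
       print(f"{name}: L = {avg_length(probs, lengths):.1f}")  # 2.4, 2.0, 1.9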
======================================================================
Relationship between H and L
======================================================================

In this example, we saw that L got closer to H(S) each time there was
an improvement.

From recitation, you should've seen that L >= H, always. Which makes
sense: if H measures the uncertainty of the source, the encoding has
to resolve that uncertainty, and we can't do that with fewer than H
bits per symbol on average.

Our goal today is to develop codes that are efficient in terms of the
expected number of bits per symbol. The first question is, for any
given source of this type (i.i.d., single symbols, etc.), how close to
H can we get? We will show there is a code for which L < H+1, and thus
H <= L < H+1.

----------------------------------------------------------------------
Kraft's inequality
----------------------------------------------------------------------

To do this, we need a particular inequality:

   Kraft's inequality: for a full binary code tree with codeword
   lengths {l_i},

      sum_i (1/2^l_i) = 1

   If the tree is not full, then this sum is < 1.

The converse of Kraft's inequality is also true: for a set of lengths
{l_i} that satisfies the Kraft inequality, there exists a binary code
tree with these lengths.

We will not prove this, but the proofs of both are fairly
straightforward. The proof of the converse is actually by
construction, rather than just existence.

Fun fact: Kraft was an MIT grad student. Many of the people mentioned
today were MIT grad students.

----------------------------------------------------------------------
L < H+1
----------------------------------------------------------------------

We're going to construct a code such that L < H+1.

Let l_i = ceil(log(1/p_i)) (log base 2 throughout). With this
definition,

   l_i >= log(1/p_i) > l_i - 1

First question: do these lengths correspond to a valid code tree?
Since l_i >= log(1/p_i), we can do some algebra:

   2^l_i >= 2^log(1/p_i) = 1/p_i
   1/2^l_i <= p_i
   sum_i (1/2^l_i) <= sum_i p_i = 1

So these lengths satisfy Kraft's inequality, and thus such a code
exists.

Now we know that a valid code tree exists. So back to that inequality:

   l_i >= log(1/p_i) > l_i - 1
   p_i*l_i >= p_i*log(1/p_i) > p_i*(l_i - 1)

Sum everything up, and since sum_i (p_i*l_i) = L and
sum_i (p_i*log(1/p_i)) = H, we have:

   L >= H > L - sum_i p_i  ==>  L >= H > L - 1

Or L+1 >= H+1 > L, i.e., L < H+1.

So the code that gives each codeword length l_i = ceil(log(1/p_i))
satisfies H <= L < H+1 (remember: such a code exists by the converse
of Kraft's inequality).

======================================================================
Huffman Encoding
======================================================================

But this is not actually the best code. What is "best" here? The code
that has the minimum expected codeword length.

Restricted to symbol-by-symbol coding, with symbols drawn
independently from a fixed, known symbol probability distribution,
Huffman encoding produces a code tree with minimum expected length.
You can find the proof of optimality in the lecture handouts (chapter
3). Note that since Huffman is optimal, it's also true that
H <= L < H+1 for Huffman codes.

To do Huffman encoding, we start with the symbols and probabilities,
ordered from least probable to most probable:

   A, .1 ; E, .1 ; C, .2 ; D, .3 ; B, .3

We take the two smallest probabilities and combine them...

     .2        C, .2 ; D, .3 ; B, .3
    /  \
   A    E

...and repeat until the tree is complete:

       .4          D, .3 ; B, .3
      /  \
    .2    C
   /  \
  A    E

       .4              .6
      /  \            /  \
    .2    C          D    B
   /  \
  A    E

            1.0
           /    \
        .4        .6
       /  \      /  \
     .2    C    D    B
    /  \
   A    E

What if there were ties? We could break them arbitrarily; Huffman code
trees are not unique (but every Huffman tree for a source will have
the same L).
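Here is a minimal Python sketch of the construction just described:
repeatedly merge the two least probable nodes using a priority queue.
(The helper names are just for this sketch; tie-breaking is arbitrary,
so the exact tree may differ from the one drawn above, but L will be
the same.)

   # Sketch of Huffman construction: repeatedly merge the two least
   # probable nodes; every symbol in a merged node gets one bit deeper.
   import heapq

   def huffman_lengths(probs):
       # Returns {symbol: codeword length}. Heap entries are
       # (probability, tiebreak counter, {symbol: depth so far}).
       heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
       heapq.heapify(heap)
       counter = len(heap)
       while len(heap) > 1:
           p1, _, d1 = heapq.heappop(heap)      # least probable
           p2, _, d2 = heapq.heappop(heap)      # next least probable
           merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
           heapq.heappush(heap, (p1 + p2, counter, merged))
           counter += 1
       return heap[0][2]

   probs = {"A": 0.1, "E": 0.1, "C": 0.2, "D": 0.3, "B": 0.3}
   lengths = huffman_lengths(probs)
   L = sum(probs[s] * lengths[s] for s in probs)
   print(lengths)           # A and E get 3 bits; C, D, B get 2 bits
   print(f"L = {L:.1f}")    # 2.2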
Out of curiosity, how well did this do? H(S) is 2.17 for this source,
and L is 2.2. Not bad!

----------------------------------------------------------------------
Extension: encoding multiple symbols
----------------------------------------------------------------------

Why encode just one symbol at a time? Let's encode pairs of symbols.
This may be beneficial; in English, for example, we might want a
separate encoding for "qu".

P(s_i, s_j) = p_i * p_j, since symbols are chosen independently. The
entropy of this "super source" will be 2*H(S) (again, because of
independence), and the resulting code will satisfy:

   2H <= 2L < 2H + 1
    H <=  L <  H + 1/2

where L is still the expected length *per symbol* (not per symbol
pair).

We can extend this argument to encoding K symbols at once. You can
begin to see why encoding multiple symbols at once might be a good
idea (a short sketch at the end of the next section makes this
concrete).

======================================================================
Problems with our model
======================================================================

Before we talk about problems with our model, let's reflect: Huffman
encoding is kind of impressive! It's a surprisingly simple encoding
technique, and yet it is optimal under certain conditions. Also, a fun
fact: Huffman developed the code as part of a term paper at MIT, after
Shannon et al. had been working on the problem for a long time.

We've been working with this particular model, where probabilities are
known and symbols are i.i.d. What's wrong with that?

- Probabilities may not be known
- Probabilities may change with time
- The source may not generate i.i.d. symbols

Think about English: taking into account the individual symbol
probabilities, but not context, H = 4.177. But English has a ton of
context! For instance:

   Nothing can be said to be certain except death and ta___

The next letter is almost certainly x, and yet x has a very low
occurrence probability (about .0017).

What we really want to determine is the *average per-symbol entropy*
over long sequences; the entropy rate:

   _H_ = lim(k->inf) H(s_1, s_2, ..., s_k) / k

or the related (under certain conditions, identical)

   _H_' = lim(k->inf) H(s_k | s_{k-1}, s_{k-2}, ..., s_1)

Recent estimates of the English entropy rate are 1--1.5 bits/letter;
early estimates were .6--1.3 bits/letter. The (first-order) entropy of
English, however, is around 4.18.

Our goal is to get a code that gets close to _H_. This means that we
should be able to have a much smaller value of L (and thus a more
efficient code).
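Before moving on: to make the "encoding multiple symbols" extension
above concrete, here is a minimal sketch (reusing the A--E source and
the same merge-the-two-least-probable construction as before) that
builds a Huffman code over *pairs* of symbols and compares the
per-symbol L to the single-symbol value of 2.2:

   # Sketch: Huffman over pairs of symbols from the A-E source. The
   # per-symbol expected length should land between H (~2.171) and the
   # single-symbol Huffman value of 2.2.
   import heapq, itertools, math

   def huffman_lengths(probs):
       # Same construction as before: merge the two least probable nodes.
       heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
       heapq.heapify(heap)
       counter = len(heap)
       while len(heap) > 1:
           p1, _, d1 = heapq.heappop(heap)
           p2, _, d2 = heapq.heappop(heap)
           heapq.heappush(heap, (p1 + p2, counter,
                                 {s: d + 1 for s, d in {**d1, **d2}.items()}))
           counter += 1
       return heap[0][2]

   single = {"A": 0.1, "E": 0.1, "C": 0.2, "D": 0.3, "B": 0.3}
   # P(s_i, s_j) = p_i * p_j because the source is i.i.d.
   pairs = {a + b: single[a] * single[b]
            for a, b in itertools.product(single, repeat=2)}

   H  = sum(p * math.log2(1 / p) for p in single.values())
   L1 = sum(single[s] * l for s, l in huffman_lengths(single).items())
   L2 = sum(pairs[s] * l for s, l in huffman_lengths(pairs).items()) / 2

   print(f"H = {H:.3f}")                         # ~2.171
   print(f"L, one symbol at a time: {L1:.3f}")   # 2.2
   print(f"L per symbol, pairs:     {L2:.3f}")   # between H and 2.2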
======================================================================
LZW compression
======================================================================

What we will learn is Lempel-Ziv-Welch (LZW) encoding. LZW encoding
provides universal lossless compression of sequential (streaming) data
via adaptive variable-length coding. This means that we do *not* need
to know the probabilities up front. In some sense, the algorithm is
going to "figure out" which sequences are more frequent as it goes
along. Furthermore, it also works by encoding more than one symbol at
a time.

Theoretically: under appropriate assumptions on the source, LZW
asymptotically attains the lower bound _H_ on compression performance.

Fun facts:
- The algorithm was first developed by Ziv and Lempel, and later
  improved by Welch
- It is widely used, sometimes in combination with Huffman (gif, tiff,
  png, pdf, zip, gzip)
- The patents have expired. There was much confusion/distress
  regarding these patents.
- Ziv was also an MIT grad student during the "golden years" of
  information theory.

----------------------------------------------------------------------
Example
----------------------------------------------------------------------

For LZW, we'll learn by doing. We will compress the string

   abcabcabcabcabcabcabcabc

What is going to happen is that the encoder will keep a table. Each
index in the table will map to a *variable-length* string. What the
encoder sends to the decoder will be table indices, rather than
letters. These indices are sent as N-bit integers, and thus we can
hold 2^N of them (we'll talk about what happens once the table is full
later). Because we transmit indices rather than the string, and the
tables are usually shorter than the string, this algorithm is (almost
always) going to compress the string. Hence, "LZW compression", not
just "LZW encoding".

For this string, we will initialize the table to include the
characters a, b, and c. The encoder and decoder need to agree on what
this core dictionary is in order for the algorithm to work.

   1 = a
   2 = b
   3 = c

The encoder is greedy; it's going to find the longest possible match
in the string table before it makes a transmission. (--- denotes what
we've sent; ^ denotes where we're looking.)

                               match    add to table   send

  abcabcabcabcabcabcabcabc     a
  ^

  abcabcabcabcabcabcabcabc     ab       ab = 4       1 ("a")
   ^

  abcabcabcabcabcabcabcabc     bc       bc = 5       2 ("b")
  - ^

  abcabcabcabcabcabcabcabc     ca       ca = 6       3 ("c")
  -- ^

  abcabcabcabcabcabcabcabc     ab
  --- ^

  abcabcabcabcabcabcabcabc     abc      abc = 7      4 ("ab")
  ---  ^

  abcabcabcabcabcabcabcabc     ca
  ----- ^

  abcabcabcabcabcabcabcabc     cab      cab = 8      6 ("ca")
  -----  ^

  abcabcabcabcabcabcabcabc     bc
  ------- ^

  abcabcabcabcabcabcabcabc     bca      bca = 9      5 ("bc")
  -------  ^

  abcabcabcabcabcabcabcabc     ab
  --------- ^

  abcabcabcabcabcabcabcabc     abc
  ---------  ^

  abcabcabcabcabcabcabcabc     abca     abca = 10    7 ("abc")
  ---------   ^

  abcabcabcabcabcabcabcabc     ab
  ------------ ^

  abcabcabcabcabcabcabcabc     abc
  ------------  ^

  abcabcabcabcabcabcabcabc     abca
  ------------   ^

  abcabcabcabcabcabcabcabc     abcab    abcab = 11   10 ("abca")
  ------------    ^

Now, how does the decoder work? Starting from that agreed-upon core
dictionary, the decoder builds up a dictionary that mirrors the
sender's, but with a one-step delay. The dictionary itself is never
transmitted (pretty cool!). (Below, the ^'s mark the part of the
decoded output that forms the newest dictionary entry, which may still
be waiting on its final character.)

                               received   add to table

  a                            1
  ^

  ab                           2          ab = 4
  ^^

  abc                          3          bc = 5
   ^^

  abcab                        4          ca = 6
    ^^

  abcab
     ^^

  abcabca                      6          abc = 7
     ^^^

  abcabca
       ^^

  abcabcabc                    5          cab = 8
       ^^^

  abcabcabc
         ^^

  abcabcabcabc                 7          bca = 9
         ^^^

  abcabcabcabc
           ^^^

  abcabcabcabc                 10
           ^^^

Wait, what? We don't have 10 in our table. Did something go wrong?

Since the decoder is one step behind the encoder, the index we
received must be the one that the encoder just added to its table (we
haven't added it to ours yet).

Claim 1: The entry at index 10 is abc_.

This is easy to see: the string the encoder just sent us was "abc",
and whatever the encoder adds to the table is [whatever it just sent]
plus one additional character. We could also see it another way: "abc"
starts the string that we're about to add to our own table, which is
going to be the entry at index 10.

So now we can infer the next part of the string:

  abcabcabcabcabc_
              ^^^

Claim 2: The entry at index 10 is abca.

Even easier to see! Since we've inferred a few more characters of the
string, we can just look at what happens as we continue adding to our
own table. You should see that the next entry we add is abca = 10.
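Here is a minimal Python sketch of both ends of what we just traced
(the function names and the fixed a/b/c core dictionary are just for
this example). The decoder handles the missing-entry case above with
the standard trick: the unknown entry must be the previous output plus
its own first character. A real implementation would also pack the
indices into fixed-width N-bit integers and reset the table when it
fills, as discussed below.

   # Sketch of LZW on the example string. Core dictionary: a=1, b=2, c=3.
   def lzw_encode(s, alphabet="abc"):
       table = {ch: i + 1 for i, ch in enumerate(alphabet)}
       next_index = len(table) + 1
       out, current = [], ""
       for ch in s:
           if current + ch in table:
               current += ch                   # greedy: keep extending the match
           else:
               out.append(table[current])      # send the longest match found
               table[current + ch] = next_index
               next_index += 1
               current = ch
       if current:
           out.append(table[current])          # flush whatever is left
       return out

   def lzw_decode(indices, alphabet="abc"):
       table = {i + 1: ch for i, ch in enumerate(alphabet)}
       next_index = len(table) + 1
       prev = table[indices[0]]
       out = [prev]
       for idx in indices[1:]:
           if idx in table:
               entry = table[idx]
           else:
               # The index we don't have yet (the case discussed above):
               # it must be the previous output plus its first character.
               entry = prev + prev[0]
           out.append(entry)
           table[next_index] = prev + entry[0]  # stay one step behind the encoder
           next_index += 1
           prev = entry
       return "".join(out)

   msg = "abcabcabcabcabcabcabcabc"
   codes = lzw_encode(msg)
   print(codes)                    # starts 1, 2, 3, 4, 6, 5, 7, 10, ...
   assert lzw_decode(codes) == msg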
----------------------------------------------------------------------
Summary of LZW
----------------------------------------------------------------------

We can generalize this special case. Whenever it occurs, we are
dealing with a string of the form xSxSx, where x is some character and
S is an arbitrary -- perhaps null -- string. The missing entry will
always be xSx.

Now that we've done an example, here are some things to specifically
notice about LZW:

- LZW is universal; it doesn't need the source statistics in advance.
  It learns the source characteristics as it builds the dictionary.
- The string table is filled with sequences actually found in the
  message stream; no encodings are wasted on sequences not found in
  the input data.
- The amount of compression increases as the encoding progresses.
- Regular resetting of the dictionary when it gets too big allows
  adaptation to changing source characteristics. (Resetting is easy as
  long as the encoder and decoder agree on the fixed size; you just
  re-initialize the dictionary when it's full.)

In summary, LZW is a good example of compression/communication schemes
that "transmit the model" (with auxiliary information to run the
model) rather than "transmit the data".

There is a whole world of lossy compression, which we won't get into!

======================================================================
Summary
======================================================================

Today we saw how to encode data efficiently.

- Huffman encoding, for our i.i.d. source: H <= L < H+1
- LZW, for when our source is not independent: asymptotically
  approaches the entropy rate

Now that we can encode data, in the next few lectures we'll talk about
how to make that encoded data more robust to errors.