6.02 - Huffman Encoding and LZW Compression
Lecture 2
Katrina LaCurts, lacurts@mit.edu

======================================================================
Overview
======================================================================

- This lecture: efficient ways to encode strings as bits
- Next few lectures: how to make those encodings resilient to errors

Our overall goal today is encoding symbols efficiently, where
efficiency is measured by how close our expected codeword length gets
to the entropy.

======================================================================
Background
======================================================================

----------------------------------------------------------------------
Shannon's information source
----------------------------------------------------------------------

The model we're working with is the one George discussed last week: a
stationary, discrete, memoryless, probabilistic source.

   [S] -> ... s_8, s_11, s_12, s_5, s_4, s_8, s_2, ...

To recap:
- Symbols are emitted sequentially in time
- The symbol at each time is chosen independently of the choices at
  other times
- Symbols are chosen from the same probability distribution (we will
  use the variables s_i and p_i: symbol s_i occurs with probability
  p_i)

The last two properties mean this is an "independent, identically
distributed" (i.i.d.) stream.

This is a fairly simple model, and you should be thinking about what
types of sources it captures. For now, let's see what we can do with
it.

You'll also remember from last week that, in our model, the source
then goes through an encoder:

   [S] -> ... s_8, s_11, ... -> [E] -> 100010

This converts each symbol s_i into a codeword made of binary digits.
The binary digits will then be sent over a binary channel:

   [S] -> ... s_8, s_11, ... -> [E] -> ... 100010 ... -> [C] ->

(More on this during the second part of the course.)

----------------------------------------------------------------------
Code trees
----------------------------------------------------------------------

Imagine we have the following source (these are symbols and their
probabilities):

   X, .4 ; Y, .2 ; Z, .3 ; W, .1

We ended last week with a discussion of the entropy of S, H(S):

   H(S) = average information (in binary information units) per symbol
        = the uncertainty about the source

For this source, H(S) = 1.84644.

To represent an encoding, we use code trees, as George showed last
week. Here is one example of a code tree for this source:

        .              encoding:
       / \             X: 00
      .   .            Y: 10
     /   / \           W: 110
    X   Y   .          Z: 111
           / \
          W   Z

   L = 2*.4 + 2*.2 + 3*.1 + 3*.3 = 2.4

(Symbols appear only on leaves, not internal nodes, because we need
codes to be unambiguous, or "prefix-free".)

We are interested in code trees that are efficient, i.e., that use as
few bits per codeword (on average) as possible. We will denote the
average codeword length as L, and note that:

   L = sum_i (p_i * l_i)

where l_i is the length of the codeword for symbol s_i.

Does this code seem sensible/efficient? No; at the very least, we
should move X up one level:

      .                encoding:
     / \               X: 0
    X   .              Y: 10
       / \             W: 110
      Y   .            Z: 111
         / \
        W   Z

   L = 1*.4 + 2*.2 + 3*.1 + 3*.3 = 2.0

What about now? Could we do anything else? Yes; Z is more probable
than Y, so Y should not have a shorter codeword than Z:

      .                encoding:
     / \               X: 0
    X   .              Y: 111
       / \             W: 110
      Z   .            Z: 10
         / \
        W   Y

   L = 1*.4 + 3*.2 + 3*.1 + 2*.3 = 1.9
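To check these numbers, here is a minimal Python sketch (the symbol
names and code lengths are just those read off the trees above) that
computes H(S) and the average codeword length L for each of the three
trees:

   # A minimal sketch: compute H(S) and L = sum_i p_i*l_i for the
   # X/Y/Z/W source and the three code trees above.
   import math

   probs = {"X": 0.4, "Y": 0.2, "Z": 0.3, "W": 0.1}

   def entropy(p):
       # H(S) = sum_i p_i * log2(1/p_i)
       return sum(pi * math.log2(1 / pi) for pi in p.values())

   def avg_length(p, lengths):
       # L = sum_i p_i * l_i
       return sum(p[s] * lengths[s] for s in p)

   trees = {
       "first":  {"X": 2, "Y": 2, "W": 3, "Z": 3},  # X=00, Y=10,  W=110, Z=111
       "second": {"X": 1, "Y": 2, "W": 3, "Z": 3},  # X=0,  Y=10,  W=110, Z=111
       "third":  {"X": 1, "Y": 3, "W": 3, "Z": 2},  # X=0,  Y=111, W=110, Z=10
   }

   print(f"H(S) = {entropy(probs):.5f}")                     # 1.84644
   for name, lengths in trees.items():
       print(f"{name}: L = {avg_length(probs, lengths):.1f}")  # 2.4, 2.0, 1.9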
======================================================================
Relationship between H and L
======================================================================

In this example, we saw that L got closer to H(S) each time there was
an improvement.

From recitation, you should've seen that L >= H, always. Which makes
sense: if H measures the uncertainty of the source, the encoding has
to resolve that uncertainty, and we can't do that with fewer than H
bits per symbol on average.

Our goal today is to develop codes that are efficient in terms of the
expected number of bits per symbol. The first question is, for any
given source of this type (i.i.d., single symbols, etc.), how close to
H can we get? We will show there is a code for which L < H+1, and thus
H <= L < H+1.

----------------------------------------------------------------------
Kraft's inequality
----------------------------------------------------------------------

To do this, we need a particular inequality:

   Kraft's inequality: for a full binary code tree with codeword
   lengths {l_i},

      sum_i (1/2^l_i) = 1

   If the tree is not full, then this sum is < 1.

The converse of Kraft's inequality is also true: for a set of lengths
{l_i} that satisfies the Kraft inequality, there exists a binary code
tree with these lengths.

We will not prove this, but the proofs of both are fairly
straightforward. The proof of the converse is actually by
construction, rather than just existence.

Fun fact: Kraft was an MIT grad student. Many of the people mentioned
today were MIT grad students.

----------------------------------------------------------------------
L < H+1
----------------------------------------------------------------------

We're going to construct a code such that L < H+1.

Let l_i = ceil(log(1/p_i)) (log base 2 throughout). With this
definition,

   l_i >= log(1/p_i) > l_i - 1

First question: do these lengths correspond to a valid code tree?
Since l_i >= log(1/p_i), we can do some algebra:

   2^l_i >= 2^log(1/p_i) = 1/p_i
   1/2^l_i <= p_i
   sum_i (1/2^l_i) <= sum_i p_i = 1

So these lengths satisfy Kraft's inequality, and thus such a code
exists.

Now we know that a valid code tree exists. So back to that inequality:

   l_i >= log(1/p_i) > l_i - 1
   p_i*l_i >= p_i*log(1/p_i) > p_i*(l_i - 1)

Sum everything up, and since sum_i (p_i*l_i) = L and
sum_i (p_i*log(1/p_i)) = H, we have:

   L >= H > L - sum_i p_i  ==>  L >= H > L - 1

Or L+1 >= H+1 > L, i.e., L < H+1.

So the code that gives each codeword length l_i = ceil(log(1/p_i))
satisfies H <= L < H+1 (remember: such a code exists by the converse
of Kraft's inequality).

======================================================================
Huffman Encoding
======================================================================

But this is not actually the best code. What is "best" here? The code
that has the minimum expected codeword length.

Restricted to symbol-by-symbol coding, with symbols drawn
independently from a fixed, known symbol probability distribution,
Huffman encoding produces a code tree with minimum expected length.
You can find the proof of optimality in the lecture handouts (chapter
3). Note that since Huffman is optimal, it's also true that
H <= L < H+1 for Huffman codes.

To do Huffman encoding, we start with the symbols and probabilities,
ordered from least probable to most probable:

   A, .1 ; E, .1 ; C, .2 ; D, .3 ; B, .3

We take the two smallest probabilities and combine them...

     .2        C, .2 ; D, .3 ; B, .3
    /  \
   A    E

...and repeat until the tree is complete:

       .4          D, .3 ; B, .3
      /  \
    .2    C
   /  \
  A    E

       .4              .6
      /  \            /  \
    .2    C          D    B
   /  \
  A    E

            1.0
           /    \
        .4        .6
       /  \      /  \
     .2    C    D    B
    /  \
   A    E

What if there were ties? We could break them arbitrarily; Huffman code
trees are not unique (but every Huffman tree for a source will have
the same L).
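Here is a minimal Python sketch of the construction just described:
repeatedly merge the two least probable nodes using a priority queue.
(The helper names are just for this sketch; tie-breaking is arbitrary,
so the exact tree may differ from the one drawn above, but L will be
the same.)

   # Sketch of Huffman construction: repeatedly merge the two least
   # probable nodes; every symbol in a merged node gets one bit deeper.
   import heapq

   def huffman_lengths(probs):
       # Returns {symbol: codeword length}. Heap entries are
       # (probability, tiebreak counter, {symbol: depth so far}).
       heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
       heapq.heapify(heap)
       counter = len(heap)
       while len(heap) > 1:
           p1, _, d1 = heapq.heappop(heap)      # least probable
           p2, _, d2 = heapq.heappop(heap)      # next least probable
           merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
           heapq.heappush(heap, (p1 + p2, counter, merged))
           counter += 1
       return heap[0][2]

   probs = {"A": 0.1, "E": 0.1, "C": 0.2, "D": 0.3, "B": 0.3}
   lengths = huffman_lengths(probs)
   L = sum(probs[s] * lengths[s] for s in probs)
   print(lengths)           # A and E get 3 bits; C, D, B get 2 bits
   print(f"L = {L:.1f}")    # 2.2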
Out of curiosity, how well did this do? H(S) is 2.17 for this source,
and L is 2.2. Not bad!

----------------------------------------------------------------------
Extension: encoding multiple symbols
----------------------------------------------------------------------

Why encode just one symbol at a time? Let's encode pairs of symbols.
This may be beneficial; in English, for example, we might want a
separate encoding for "qu".

P(s_i, s_j) = p_i * p_j, since symbols are chosen independently. The
entropy of this "super source" will be 2*H(S) (again, because of
independence), and the resulting code will satisfy:

   2H <= 2L < 2H + 1
    H <=  L <  H + 1/2

where L is still the expected length *per symbol* (not per symbol
pair).

We can extend this argument to encoding K symbols at once. You can
begin to see why encoding multiple symbols at once might be a good
idea (a short sketch at the end of the next section makes this
concrete).

======================================================================
Problems with our model
======================================================================

Before we talk about problems with our model, let's reflect: Huffman
encoding is kind of impressive! It's a surprisingly simple encoding
technique, and yet it is optimal under certain conditions. Also, a fun
fact: Huffman developed the code as part of a term paper at MIT, after
Shannon et al. had been working on the problem for a long time.

We've been working with this particular model, where probabilities are
known and symbols are i.i.d. What's wrong with that?

- Probabilities may not be known
- Probabilities may change with time
- The source may not generate i.i.d. symbols

Think about English: taking into account the individual symbol
probabilities, but not context, H = 4.177. But English has a ton of
context! For instance:

   Nothing can be said to be certain except death and ta___

The next letter is almost certainly x, and yet x has a very low
occurrence probability (about .0017).

What we really want to determine is the *average per-symbol entropy*
over long sequences; the entropy rate:

   _H_ = lim(k->inf) H(s_1, s_2, ..., s_k) / k

or the related (under certain conditions, identical)

   _H_' = lim(k->inf) H(s_k | s_{k-1}, s_{k-2}, ..., s_1)

Recent estimates of the English entropy rate are 1--1.5 bits/letter;
early estimates were .6--1.3 bits/letter. The (first-order) entropy of
English, however, is around 4.18.

Our goal is to get a code that gets close to _H_. This means that we
should be able to have a much smaller value of L (and thus a more
efficient code).
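Before moving on: to make the "encoding multiple symbols" extension
above concrete, here is a minimal sketch (reusing the A--E source and
the same merge-the-two-least-probable construction as before) that
builds a Huffman code over *pairs* of symbols and compares the
per-symbol L to the single-symbol value of 2.2:

   # Sketch: Huffman over pairs of symbols from the A-E source. The
   # per-symbol expected length should land between H (~2.171) and the
   # single-symbol Huffman value of 2.2.
   import heapq, itertools, math

   def huffman_lengths(probs):
       # Same construction as before: merge the two least probable nodes.
       heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(probs.items())]
       heapq.heapify(heap)
       counter = len(heap)
       while len(heap) > 1:
           p1, _, d1 = heapq.heappop(heap)
           p2, _, d2 = heapq.heappop(heap)
           heapq.heappush(heap, (p1 + p2, counter,
                                 {s: d + 1 for s, d in {**d1, **d2}.items()}))
           counter += 1
       return heap[0][2]

   single = {"A": 0.1, "E": 0.1, "C": 0.2, "D": 0.3, "B": 0.3}
   # P(s_i, s_j) = p_i * p_j because the source is i.i.d.
   pairs = {a + b: single[a] * single[b]
            for a, b in itertools.product(single, repeat=2)}

   H  = sum(p * math.log2(1 / p) for p in single.values())
   L1 = sum(single[s] * l for s, l in huffman_lengths(single).items())
   L2 = sum(pairs[s] * l for s, l in huffman_lengths(pairs).items()) / 2

   print(f"H = {H:.3f}")                         # ~2.171
   print(f"L, one symbol at a time: {L1:.3f}")   # 2.2
   print(f"L per symbol, pairs:     {L2:.3f}")   # between H and 2.2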
======================================================================
LZW compression
======================================================================

What we will learn is Lempel-Ziv-Welch (LZW) encoding. LZW encoding
provides universal lossless compression of sequential (streaming) data
via adaptive variable-length coding. This means that we do *not* need
to know the probabilities up front. In some sense, the algorithm is
going to "figure out" which sequences are more frequent as it goes
along. Furthermore, it also works by encoding more than one symbol at
a time.

Theoretically: under appropriate assumptions on the source, LZW
asymptotically attains the lower bound _H_ on compression performance.

Fun facts:
- The algorithm was first developed by Ziv and Lempel, and later
  improved by Welch
- It is widely used, sometimes in combination with Huffman (gif, tiff,
  png, pdf, zip, gzip)
- The patents have expired. There was much confusion/distress
  regarding these patents.
- Ziv was also an MIT grad student during the "golden years" of
  information theory.

----------------------------------------------------------------------
Example
----------------------------------------------------------------------

For LZW, we'll learn by doing. We will compress the string

   abcabcabcabcabcabcabcabc

What is going to happen is that the encoder will keep a table. Each
index in the table will map to a *variable-length* string. What the
encoder sends to the decoder will be table indices, rather than
letters. These indices are sent as N-bit integers, and thus we can
hold 2^N of them (we'll talk about what happens once the table is full
later). Because we transmit indices rather than the string, and the
tables are usually shorter than the string, this algorithm is (almost
always) going to compress the string. Hence, "LZW compression", not
just "LZW encoding".

For this string, we will initialize the table to include the
characters a, b, and c. The encoder and decoder need to agree on what
this core dictionary is in order for the algorithm to work.

   1 = a
   2 = b
   3 = c

The encoder is greedy; it's going to find the longest possible match
in the string table before it makes a transmission. (--- denotes what
we've sent; ^ denotes where we're looking.)

                               match    add to table   send

  abcabcabcabcabcabcabcabc     a
  ^

  abcabcabcabcabcabcabcabc     ab       ab = 4       1 ("a")
   ^

  abcabcabcabcabcabcabcabc     bc       bc = 5       2 ("b")
  - ^

  abcabcabcabcabcabcabcabc     ca       ca = 6       3 ("c")
  -- ^

  abcabcabcabcabcabcabcabc     ab
  --- ^

  abcabcabcabcabcabcabcabc     abc      abc = 7      4 ("ab")
  ---  ^

  abcabcabcabcabcabcabcabc     ca
  ----- ^

  abcabcabcabcabcabcabcabc     cab      cab = 8      6 ("ca")
  -----  ^

  abcabcabcabcabcabcabcabc     bc
  ------- ^

  abcabcabcabcabcabcabcabc     bca      bca = 9      5 ("bc")
  -------  ^

  abcabcabcabcabcabcabcabc     ab
  --------- ^

  abcabcabcabcabcabcabcabc     abc
  ---------  ^

  abcabcabcabcabcabcabcabc     abca     abca = 10    7 ("abc")
  ---------   ^

  abcabcabcabcabcabcabcabc     ab
  ------------ ^

  abcabcabcabcabcabcabcabc     abc
  ------------  ^

  abcabcabcabcabcabcabcabc     abca
  ------------   ^

  abcabcabcabcabcabcabcabc     abcab    abcab = 11   10 ("abca")
  ------------    ^

Now, how does the decoder work? Starting from that agreed-upon core
dictionary, the decoder builds up a dictionary that mirrors the
sender's, but with a one-step delay. The dictionary itself is never
transmitted (pretty cool!). (Below, the ^'s mark the part of the
decoded output that forms the newest dictionary entry, which may still
be waiting on its final character.)

                               received   add to table

  a                            1
  ^

  ab                           2          ab = 4
  ^^

  abc                          3          bc = 5
   ^^

  abcab                        4          ca = 6
    ^^

  abcab
     ^^

  abcabca                      6          abc = 7
     ^^^

  abcabca
       ^^

  abcabcabc                    5          cab = 8
       ^^^

  abcabcabc
         ^^

  abcabcabcabc                 7          bca = 9
         ^^^

  abcabcabcabc
           ^^^

  abcabcabcabc                 10
           ^^^

Wait, what? We don't have 10 in our table. Did something go wrong?

Since the decoder is one step behind the encoder, the index we
received must be the one that the encoder just added to its table (we
haven't added it to ours yet).

Claim 1: The entry at index 10 is abc_.

This is easy to see: the string the encoder just sent us was "abc",
and whatever the encoder adds to the table is [whatever it just sent]
plus one additional character. We could also see it another way: "abc"
starts the string that we're about to add to our own table, which is
going to be the entry at index 10.

So now we can infer the next part of the string:

  abcabcabcabcabc_
              ^^^

Claim 2: The entry at index 10 is abca.

Even easier to see! Since we've inferred a few more characters of the
string, we can just look at what happens as we continue adding to our
own table. You should see that the next entry we add is abca = 10.
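Here is a minimal Python sketch of both ends of what we just traced
(the function names and the fixed a/b/c core dictionary are just for
this example). The decoder handles the missing-entry case above with
the standard trick: the unknown entry must be the previous output plus
its own first character. A real implementation would also pack the
indices into fixed-width N-bit integers and reset the table when it
fills, as discussed below.

   # Sketch of LZW on the example string. Core dictionary: a=1, b=2, c=3.
   def lzw_encode(s, alphabet="abc"):
       table = {ch: i + 1 for i, ch in enumerate(alphabet)}
       next_index = len(table) + 1
       out, current = [], ""
       for ch in s:
           if current + ch in table:
               current += ch                   # greedy: keep extending the match
           else:
               out.append(table[current])      # send the longest match found
               table[current + ch] = next_index
               next_index += 1
               current = ch
       if current:
           out.append(table[current])          # flush whatever is left
       return out

   def lzw_decode(indices, alphabet="abc"):
       table = {i + 1: ch for i, ch in enumerate(alphabet)}
       next_index = len(table) + 1
       prev = table[indices[0]]
       out = [prev]
       for idx in indices[1:]:
           if idx in table:
               entry = table[idx]
           else:
               # The index we don't have yet (the case discussed above):
               # it must be the previous output plus its first character.
               entry = prev + prev[0]
           out.append(entry)
           table[next_index] = prev + entry[0]  # stay one step behind the encoder
           next_index += 1
           prev = entry
       return "".join(out)

   msg = "abcabcabcabcabcabcabcabc"
   codes = lzw_encode(msg)
   print(codes)                    # starts 1, 2, 3, 4, 6, 5, 7, 10, ...
   assert lzw_decode(codes) == msg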
----------------------------------------------------------------------
Summary of LZW
----------------------------------------------------------------------

We can generalize this special case. Whenever it occurs, we are
dealing with a string of the form xSxSx, where x is some character and
S is an arbitrary -- perhaps null -- string. The missing entry will
always be xSx.

Now that we've done an example, here are some things to specifically
notice about LZW:

- LZW is universal; it doesn't need the source statistics in advance.
  It learns the source characteristics as it builds the dictionary.
- The string table is filled with sequences actually found in the
  message stream; no encodings are wasted on sequences not found in
  the input data.
- The amount of compression increases as the encoding progresses.
- Regular resetting of the dictionary when it gets too big allows
  adaptation to changing source characteristics. (Resetting is easy as
  long as the encoder and decoder agree on the fixed size; you just
  re-initialize the dictionary when it's full.)

In summary, LZW is a good example of compression/communication schemes
that "transmit the model" (with auxiliary information to run the
model) rather than "transmit the data".

There is a whole world of lossy compression, which we won't get into!

======================================================================
Summary
======================================================================

Today we saw how to encode data efficiently.

- Huffman encoding, for our i.i.d. source: H <= L < H+1
- LZW, for when our source is not independent: asymptotically
  approaches the entropy rate

Now that we can encode data, in the next few lectures we'll talk about
how to make that encoded data more robust to errors.