We will study several basic techniques, including run-length encoding,
Huffman coding, and frequency-domain encoding (the essential technique behind
JPEG). Any compression scheme requires a method for compression and a
complementary method for decompression. This pair of methods is called
a codec and is usually embodied in a piece of software.
You will need to know how to compute the amount of storage required for raw, uncompressed data. It's simple byte-counting stuff. For instance,
Query 1. Compute the storage required for a 500 x 400 image with 8 bits per pixel.
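Here is a minimal sketch of the byte-counting, assuming the image is stored with no header and no padding. The function name raw_storage_bytes is made up for these notes, and the example uses a different image from the one in the query, so the exercise is still yours.

```python
def raw_storage_bytes(width, height, bits_per_pixel):
    """Bytes needed for an uncompressed image, rounded up to whole bytes."""
    total_bits = width * height * bits_per_pixel
    return (total_bits + 7) // 8  # 8 bits per byte, round up


# A 640 x 480 image with 24 bits per pixel:
print(raw_storage_bytes(640, 480, 24))  # 921600
```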
I put in the spaces to make things easier to count. You could briefly describe the sequence as "eight 0, six 1, fourteen 0". This would be called "run length encoding" (RLE). Assume that instead of binary code, we're working in bytes. We could encode a complex sequence of bytes or characters using RLE, by the following technique. In this discussion, if we write [101] we mean the byte whose value is 101 decimal. But if we write a character like A or !, we mean the byte whose value is the ASCII value of that character.
Here's the Run Length Decoding algorithm
1) If you ever encounter the character "!", turn on special software which:
a) reads the next byte as an integer; call it N
b) puts N copies of the following byte into the
output stream.
Thus, the sequence ! [32] A
would emit 32 A's into the output stream, at a storage cost of only 3 bytes.
The above symbolism with square brackets is intended to convey the idea
that there are actually the following codes being stored:
| Hex | 21 | 20 | 41 |
| ASCII | ! | space | A |
| Decimal | you fill | in these | values please! |
Query 1: You fill in the decimal values, please (now!). You should be converting hex to decimal in your sleep, by this point. (And maybe decimal to hex in your nightmares?)
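The decoding rule above (see a "!", read N, emit N copies of the next byte) can be sketched in a few lines. This is only a sketch: the name rle_decode is made up, the count is taken literally as N, and it does nothing clever about a "!" that is meant as ordinary data.

```python
ESCAPE = ord("!")  # hex 21, the flag byte used in these notes


def rle_decode(data: bytes) -> bytes:
    """Expand every '! [N] X' triple into N copies of X; copy other bytes as-is."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESCAPE:
            if i + 2 >= len(data):
                raise ValueError("truncated escape sequence")
            count = data[i + 1]   # the byte after '!' is the run length N
            value = data[i + 2]   # the byte after that is what gets repeated
            out.extend(bytes([value]) * count)
            i += 3
        else:
            out.append(data[i])   # incoherent data passes straight through
            i += 1
    return bytes(out)


# The three stored bytes for ! [32] A, written in hex:
print(rle_decode(bytes([0x21, 0x20, 0x41])))  # 32 copies of A
```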
Query 2: What kind of data would achieve maximum compression with the above one-byte RLE technique? What would its compression ratio be? (That is, if I gave you M bytes of data, how much space S would it take, and what would the ratio M/S be?)
(Answer: 255/3 = 85:1.) The longest run you could represent would be given by the hex byte FF, and it would take 3 bytes to store this code. For instance, if we're storing 255 copies of ASCII A, the code would be ! [255] A or 21 FF 41.
Query 3: What kind of data would achieve MINIMUM compression with RLE? In fact is there data which you're better off not trying to compress with RLE at all?
Yeah, of course there is. It's data with runs of less than some minimum length. If you're using a one-byte run-length field as we are, the minimum run length that it pays to encode is 4: the code block costs 3 bytes, so a run of 3 or fewer saves nothing. I hope you see why. It is actually dumb to use the length field of '0' to signal that NO copies of something are coming, so we should use 0 to stand for 4 of 'em, and 1 to stand for 5 of 'em ... etc. This allows us to represent up to 259 repeated symbols (255 + 4) with one 3-byte code block.
Let's just call data with less-than-minimum runs, "incoherent data". If you're encountering incoherent data with no runs (or short runs) in it, you just dump the bytes right into the output stream. If you come to an acceptable run of some character C, you encode it with ! [n] C.
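The encoding side can be sketched the same way. Assumptions, loudly: the name rle_encode is made up; the count byte holds the literal run length (no "0 stands for 4" offset trick), so the longest run per code block is 255; and the sketch simply refuses input that contains a "!", since that problem is a separate question.

```python
ESCAPE = ord("!")
MIN_RUN = 4     # shorter runs are cheaper left alone
MAX_RUN = 255   # largest literal count that fits in one byte


def rle_encode(data: bytes) -> bytes:
    """Replace runs of MIN_RUN or more equal bytes with '! [n] C'."""
    if ESCAPE in data:
        raise ValueError("this sketch cannot handle a literal '!' in the input")
    out = bytearray()
    i = 0
    while i < len(data):
        # Measure the run starting at position i, capped at MAX_RUN.
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < MAX_RUN:
            run += 1
        if run >= MIN_RUN:
            out += bytes([ESCAPE, run, data[i]])   # an acceptable run
        else:
            out += data[i:i + run]                 # incoherent data: dump it
        i += run
    return bytes(out)


print(rle_encode(b"AAAAAAABCCCD"))  # b'!\x07ABCCCD'
```

Notice that the three C's are left alone: encoding them would cost 3 bytes and save nothing.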
Movin' on.... What's the minimum run length worth encoding, if we're using a two-byte run-length field? What's the maximum possible compression ratio with two-byte length fields?
Think
think
think
think
think
Well, let's see. Two bytes allow 65536 different values (0 through 65535). But the code takes up 4 bytes (one marker like !, two bytes for the length, and one byte for the what's-it-being-repeated). So it's not worth encoding anything shorter than 5 characters. If a length field of 0 stands for a run of 5, you can encode runs of up to 65535+5 = 65540, for a maximum compression ratio of 65540/4 = 16385:1. But we seldom get runs that long, so we seldom use two-byte length fields.
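If you'd rather let the machine do the counting, here is a small sketch of that arithmetic for a length field of any size. It assumes the offset trick (a length field of 0 stands for the minimum run), and the name rle_limits is made up. A field of k bytes holds 256^k values, numbered 0 through 256^k - 1.

```python
def rle_limits(length_bytes):
    """Return (minimum run worth encoding, longest run, best compression ratio)."""
    code_size = 1 + length_bytes + 1          # marker + length field + repeated byte
    min_run = code_size + 1                   # anything shorter saves nothing
    max_run = (256 ** length_bytes - 1) + min_run
    return min_run, max_run, max_run / code_size


for k in (1, 2):
    print(k, rle_limits(k))
```

For a one-byte field this gives a minimum run of 4 and a longest run of 259; for a two-byte field, a minimum of 5 and a longest run of 65540.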
Query 4: There's one problem with this technique.... what's that?
The "escape sequence" technique means that some specific character (or characters) is designated as special, and different from all others. When it is encountered, it's not treated like data, but like a flag. In the present case, the ! character is an escape character. (In fact the ESC character was originally put into ASCII to serve this purpose.)
But what if you want to transmit ! as part of your raw (incoherent) data stream? Won't it trigger escape sequences and generate all kinds of fake RLE?
Query 5: How do we solve the "transmit the escape character" problem?
I won't put the answer to that one here in these notes. You have to pay attention in class, after all. Or ask your friends.
Query 6: Using ! as the escape character, one-byte character counts and a minimum run-length of 4, run-length encode the following data. The blanks in the data are NOT to be encoded; they are simply inserted to make it easier for you to read and count characters.
0000 0001 1122 2222 2200 5555 5555 000
Why is the minimum run length chosen as 4?
The most frequent letter in English is E, so it gets the shortest symbol - just a "dot" (short sound). The next most frequent is T, so it gets a single dash (long sound). Letters like C are relatively infrequent, so they get a long symbol. Dah dit dah dit. Got it?
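Huffman coding follows a similar principle. You sort your symbols into two piles so that the sum of probabilities in each pile is as close as possible to equal. Then you assign 0 to the left pile and 1 to the right pile. (Strictly speaking, this top-down pile-splitting is the Shannon-Fano procedure; Huffman's own algorithm builds the same kind of tree from the bottom up, by repeatedly merging the two least probable symbols. For the examples in these notes, both give the same code lengths.)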
You repeat this process until a "tree" of code words is built. Here's the example for three symbols, with their probabilities.
A (0.5)
B (0.25)
C (0.25)
First we have A on the left, and B and C on the right. So we assign 0 as the symbol for A, and 1 to the pile on the right.
0:A 1:BC
Now we divide the right hand pile into equal probabilities, and so we must assign 10 to B and 11 to C.
Here's the resulting code alphabet:
0 A
10 B
11 C
Now let's try transmitting a string, to see if it can be decoded afterward.
CABBA
Query 7: Encode that string using the above alphabet, please!
and
we
waste
some
space
so
that
the
answer is offscreen!
tra
la
la
and
finally we reveal that the answer is
11010100
If you think like a robot, you immediately see that at the beginning, 11 must mean C. Then you encounter a 0 so you know it must be A.... etc.
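Here is the robot, more or less. This sketch hard-codes the three-symbol alphabet from above (the names CODE, encode and decode are made up) and decodes by collecting bits until they match a code word. That only works because no code word is the beginning of another one.

```python
CODE = {"A": "0", "B": "10", "C": "11"}


def encode(text):
    """Look each symbol up in the code table and glue the bits together."""
    return "".join(CODE[ch] for ch in text)


def decode(bits):
    """Read bits left to right, emitting a symbol as soon as a code word matches."""
    reverse = {word: sym for sym, word in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a code word")
    return "".join(out)


print(encode("CABBA"))     # 11010100
print(decode("11010100"))  # CABBA
```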
For
practice, we have the next query.
Query 8. Develop a Huffman code for an alphabet with four letters with the following probabilities:
A: 0.5
B: 0.25
C: 0.125
D: 0.125
The resulting lookup table is what the text refers to as a "unique replacement code" or a code table.
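For the curious, here is a sketch of how a program might build such a code table. It uses the standard bottom-up Huffman construction (keep merging the two least probable entries) rather than splitting piles from the top down, and the name huffman_code is made up. Which branch gets 0 and which gets 1 is arbitrary, so another program could produce different bits with the same lengths. The example is the three-symbol alphabet, so Query 8 is still yours to do by hand.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Build a {symbol: bit string} table from a {symbol: probability} dict."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)    # least probable entry
        p1, _, right = heapq.heappop(heap)   # next least probable entry
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]


print(huffman_code({"A": 0.5, "B": 0.25, "C": 0.25}))
# {'A': '0', 'B': '10', 'C': '11'}
```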
Moshell Coding.
The following technique was thought up specifically for this course. It somewhat resembles Huffman coding. Moshell Coding is based on a dictionary, in which English words have been given numbers. The most common words receive short numbers (like 1, 2, 3). Here is a fragment of the Moshell Code Dictionary. It is designed to give short codes to frequent words.
Of course I didn't really analyze a big bunch of text; I just approximated the process.
(I have tricked up the HTML so that the Moshell Code table should come up in a new window. I did that by including the phrase
target="newpage"
in the <a href= .... > anchor tag.)
So, the above paragraph ("The following technique") would be represented, using the Moshell Code, as follows (ignoring capital letters):
1.59.416.420.418.419.415....
This string of 24 characters (numbers and dots between them) represents
"The following technique was thought up specifically" which contained 51
characters, for a compression ratio of 51/24, or about 2.1.
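A sketch of the lookup, using only the seven dictionary entries that can be read off the coded string above. The names DICTIONARY and moshell_encode are made up, and a real dictionary would of course be far bigger.

```python
# Seven entries read off the example; everything else is missing.
DICTIONARY = {"the": 1, "following": 59, "technique": 416, "was": 420,
              "thought": 418, "up": 419, "specifically": 415}


def moshell_encode(text):
    """Replace each word by its dictionary number, with dots in between."""
    return ".".join(str(DICTIONARY[word]) for word in text.lower().split())


phrase = "The following technique was thought up specifically"
code = moshell_encode(phrase)
print(code)                     # 1.59.416.420.418.419.415
print(len(phrase), len(code))   # 51 24
print(len(phrase) / len(code))  # 2.125
```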
Gregory Bateson, my guru, said that information is "a difference that makes a difference". Claude Shannon invented information theory around the idea that the amount of information in a message is related to how surprised you are when you get it.
For instance, the information in the message 'the sun came up today' is pretty small. But the information in the message "you just won the lottery" is pretty large - because it's such an improbable event.
So, Huffman coding uses the idea that you achieve maximum efficiency by using short messages for highly probable events. The logical limit of this sequence is that we should just use 1 and 0 for two equally probable events - namely, the old 'binary search' game all over again: "does the symbol fall into the left or the right pile?"
We measure information in bits and we can sometimes see immediately how to represent it with binary digits. But sometimes we have to work at it.
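Shannon's measure of surprise can be written down directly: an event with probability p carries -log2(p) bits. The sketch below (function names made up) also computes the average surprise of a whole alphabet, which is usually called its entropy.

```python
import math


def surprise_bits(p):
    """Information, in bits, carried by an event of probability p."""
    return -math.log2(p)


def entropy_bits(probabilities):
    """Average information per symbol for the given probabilities."""
    return sum(p * surprise_bits(p) for p in probabilities if p > 0)


print(surprise_bits(0.5))               # 1.0
print(surprise_bits(0.25))              # 2.0
print(entropy_bits([0.5, 0.25, 0.25]))  # 1.5
```

For the three-symbol alphabet A (0.5), B (0.25), C (0.25), the code words 0, 10 and 11 have lengths of exactly 1, 2 and 2 bits, so the average code length is also 1.5 bits per symbol. You can't do better than that.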
In the same way, if you know that English is coming through the system and I send you the letter Q, you can predict with very high confidence that a U will be the next letter.
Data compression consists of squeezing the redundancy out of a signal.
LZW Compression is a sneaky technique which uses the document itself like a code table. As it reads the input, it builds a dictionary of the strings it has already seen. When a string like "the" keeps recurring in the input stream, the compressed data just gives the dictionary number of the earlier string instead of spelling it out again. LZW is the compression used in .GIF, and it is one of the options in .TIFF. It is lossless, meaning that you can get the exact same information back that you originally compressed.
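Here is a bare-bones sketch of textbook LZW, to show that the code table really does get rebuilt from the data on both ends and never has to be transmitted. It outputs a plain list of integers; real file formats such as GIF pack those codes into variable-width bit fields and add clear and end codes, none of which is attempted here. The function names are made up.

```python
def lzw_compress(data: bytes) -> list:
    """Emit one dictionary number per longest already-seen string."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                  # keep extending the match
        else:
            codes.append(table[current])
            table[candidate] = len(table)        # remember the new string
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list) -> bytes:
    """Rebuild the same table while decoding, one step behind the compressor."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):                 # string defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("bad LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)


text = b"the cat and the hat and the bat"
codes = lzw_compress(text)
print(len(text), len(codes))
print(lzw_decompress(codes) == text)  # True
```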
JPEG, however, uses a different bag of tricks. It includes both lossy (but highly efficient) and lossless
*****************
This section is an OPTIONAL EXTRA, for those of you who are really
"on top of this material". It will occur on the final exam
in the form of a BONUS QUESTION (one that provides a good deal of extra
credit if you work it out.) But if you skip it, it won't count against
you
******************
is the idea that an image can be composed of different fundamental repeating patterns, just as a sound can. The idea of "Fourier synthesis" lies behind digital music synthesizers, and also behind the Discrete Cosine Transform which is at the heart of JPEG.
DCT (Discrete Cosine Transform) is very important. So I'm going to teach
you about an easier-to-understand
"cousin" of DCT, called the Walsh transform. Then we'll talk about
DCT. We begin with the one dimensional
Walsh. Consider a stream of data (real numbers); perhaps it's a series
of samples of an audio signal.
You're familiar with the idea of building up binary numbers by adding
together base components; or with making
change by adding together pennies, nickels, dimes and quarters. (Why
do coins keep coming up in this course?)
For instance, to spell out the value "9" in binary, we need an eight
(1 0 0 0) and a one (0 0 0 1). Adding, we get
1 0 0 1.
The Walsh transform represents a series of numbers by using a base set
like this one (for series of 8 numbers)
B0 = 1 1 1 1 1 1 1 1
B1 = 1 1 1 1 1 1 1 0
B2 = 1 1 1 1 1 1 0 0
B3 = 1 1 1 1 1 0 0 0
B4 = 1 1 1 1 0 0 0 0
B5 = 1 1 1 0 0 0 1 1
B6 = 1 1 0 0 1 1 0 0
B7 = 1 0 1 0 1 0 1 0
We refer to these bases as B0, B1 ... B7. You can see that all of these
are "really" repeating series; it's just that
B0 through B4 haven't got room to be seen actually repeating. (B0 either
repeats with every character, or never
repeats, depending on your point of view. Its frequency is zero, anyhow,
so its wavelength must be infinite.)
DEMO: To represent the series 1 1 1 1 0 0 0 0, we would just say that it's the same as B4. Or, another way to say it, would be "no B0, and no B1, and no B2, ... and one B4, and no B5 .. etc." Or we could even write this stuff down as
0000 1000
Now you may ask, why bother encoding an 8 digit string with another 8 digit string? I will defer this answer for a little while.
PRACTICE: How could you combine two of these bases to produce 0000 1111? Hint: You can subtract one base from another one, like a "cookie cutter".
AND .. . the answer is ....
just below the screen, and it is ....
1 0 0 0 -1 0 0 0
What if we wanted to represent a series like S1 = 3 3 3 1 1 1 3 3? A
little experimentation would reveal
that S1 = B0 + 2*B5. In fact, you can represent ANY series of 8 numbers,
by some linear combination of these
bases. The coefficients might even be negative. You don't actually
need negative values, though, if you use a "wrap-around" or modulo-2 addition
system. Like this:
0+0=0
0+1=1
1+0=1
1+1=0 -- actually it would be 10 (two) but we
throw away the carry, so only the 0 is left.
In these simplified notes, we allow you to use negative values for coefficients.
Query 9: If I wanted to produce the series -1 1 -1 1 -1 1 -1 1, how would I do it? The answer is revealed below, but try it first!
B0 is called the "DC Component". DC means "direct current", which is
what you get out of a battery. It provides
a constant voltage (until the battery runs down). The rest are AC (alternating
current) components, with various
frequencies. The frequency of a signal is the reciprocal of its wavelength
(how long it takes to repeat). What's the
frequency of B7? B6? B5? See a pattern here?
Definitions: A signal S is a series of numbers S0, S1, S2... Sn. A transform
T is a series of coefficients T0,
T1..Tn. T is the Walsh Transform of S, if S = T0*B0 + T1*B1 + T2*B2
... + Tn*Bn.
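The definition can be tried out directly. This sketch types in the eight bases exactly as listed in these notes and adds them up with the given coefficients; the name synthesize is made up. It goes from T to S only. Finding T from S is the hard direction.

```python
BASES = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # B0
    [1, 1, 1, 1, 1, 1, 1, 0],  # B1
    [1, 1, 1, 1, 1, 1, 0, 0],  # B2
    [1, 1, 1, 1, 1, 0, 0, 0],  # B3
    [1, 1, 1, 1, 0, 0, 0, 0],  # B4
    [1, 1, 1, 0, 0, 0, 1, 1],  # B5
    [1, 1, 0, 0, 1, 1, 0, 0],  # B6
    [1, 0, 1, 0, 1, 0, 1, 0],  # B7
]


def synthesize(T):
    """Compute S = T0*B0 + T1*B1 + ... + T7*B7, sample by sample."""
    n = len(BASES[0])
    return [sum(t * base[k] for t, base in zip(T, BASES)) for k in range(n)]


print(synthesize([0, 0, 0, 0, 1, 0, 0, 0]))   # [1, 1, 1, 1, 0, 0, 0, 0]
print(synthesize([1, 0, 0, 0, -1, 0, 0, 0]))  # [0, 0, 0, 0, 1, 1, 1, 1]
print(synthesize([1, 0, 0, 0, 0, 2, 0, 0]))   # [3, 3, 3, 1, 1, 1, 3, 3]
```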
It would be extremely tedious to have to hunt around for the Walsh transform of some arbitrary signal, and I'll only have you do trivial ones - to get the idea. Turns out, though, that there's an elegant way to do it with MATRICES and VECTORS! -- but let's not, for now.
Spatial frequencies.
NOW we stop thinking about cycles per second (if you ever were), and start thinking about cycles per inch. What if the sample S consisted of pixel values along a scan line? We could still talk about its frequency. A pattern like 1 0 1 0 1 0 which repeats itself 50 times in 100 pixels, could be said to have a frequency of 1/2 cycle per pixel.
A pattern can have different frequencies in the X and Y directions.
What are spatial frequencies of this pattern?
1 1 0 0 1 1 0 0
1 1 0 0 1 1 0 0
1 1 0 0 1 1 0 0
0 0 1 1 0 0 1 1
0 0 1 1 0 0 1 1
0 0 1 1 0 0 1 1
1 1 0 0 1 1 0 0
1 1 0 0 1 1 0 0
In the horizontal direction, it repeats every 4 pixels. In the vertical direction, it repeats every 6 pixels. SO... we can consider the idea of the Walsh transform of an image, too! The Walsh transform of a one dimensional signal with N samples consisted of N coefficients. The transform of an n x n image will, by analogy, consist of an array of nxn coefficients.
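If you don't trust your eyes, here is a sketch that hunts for the repeat distance by brute force: try every shift p and see whether the samples line up with themselves. The name smallest_period is made up, and with only 8 samples a long period is supported by very little evidence, so treat the result as a hint rather than a proof.

```python
def smallest_period(seq):
    """Smallest shift p such that seq[i] == seq[i + p] wherever both exist."""
    n = len(seq)
    for p in range(1, n):
        if all(seq[i] == seq[i + p] for i in range(n - p)):
            return p
    return n  # no repeat visible within the samples we have


PATTERN = [
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
]

horizontal = smallest_period(PATTERN[0])                     # along one scan line
vertical = smallest_period([tuple(row) for row in PATTERN])  # down the rows
print(horizontal, vertical)  # 4 6
```

So the spatial frequencies are 1/4 cycle per pixel horizontally and 1/6 cycle per pixel vertically.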
The most important thing to know about such a transform is that now, the 'finest details' are all in the lower right corner of the transform. So, if we were to throw away some information, that's the part we would throw away. The picture (once decoded) would get fuzzier but it would still have its general structure intact. And that's exactly how JPEG works.
Normally, 8 x 8 blocks of pixels are encoded separately - rather than analyzing the whole image against some enormous basis set. It is not uncommon for a JPEG image to reduce a whole 8x8 array of 8 bit pixels, down to maybe six 8-bit numbers. That shrinks 64 bytes to 6, a reduction of about 90% (a compression ratio of roughly 10.7:1).
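To make "throw away the lower right corner" concrete, here is a toy sketch that keeps only the coefficients nearest the upper left corner of an 8x8 transform block and zeroes the rest. Assumptions, loudly: the triangular keep-zone (row + column below a cutoff), the name keep_low_frequencies and the made-up coefficient values are all mine. Real JPEG is subtler: it quantizes the coefficients and then entropy-codes them, rather than chopping off a fixed zone.

```python
def keep_low_frequencies(coeffs, cutoff):
    """Zero every coefficient whose row + column index is cutoff or more."""
    n = len(coeffs)
    return [[coeffs[r][c] if r + c < cutoff else 0 for c in range(n)]
            for r in range(n)]


# Made-up 8x8 block of nonzero coefficients, just to have something to count.
block = [[r * 8 + c + 1 for c in range(8)] for r in range(8)]

kept = keep_low_frequencies(block, 3)
survivors = sum(1 for row in kept for value in row if value != 0)
print(survivors)           # 6
print(1 - survivors / 64)  # 0.90625
```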
MPEG - Moving Pictures of Ponies and Butterflies and .... everything.
The text explains MPEG in very general terms. In more detail: the image is chopped up into 8x8 pixel blocks. Starting from an I-frame, the P frame is computed by taking the "real" image at the time of the P-frame you want to make. Each block is compared to all the adjacent blocks in the I-frame. The most similar one is chosen as its "ancestor", and a pointer is constructed to tell "where this block came from" in the I-frame.
For instance, if the camera motion was a pan to the right, then most of the blocks in P would come from a different block to the right of that location, in the I-frame. So, instead of storing the whole picture, MPEG will store an array of "where-I-came-from" vectors (X and Y offsets).
But since this isn't perfect, it also has to store some corrections to each block. These little 8x8 corrections get squeezed by JPEG-like techniques until they are VERY compact.
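The "where-I-came-from" search can be sketched as a brute-force block match. Assumptions, loudly: frames are plain lists of lists of gray values, similarity is measured by the sum of absolute differences, and the search simply tries every offset within a small window. The function names and the window parameter are made up, and real encoders use much smarter searches.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))


def best_motion_vector(i_frame, p_frame, x, y, size, search):
    """Find the offset (dx, dy) into the I-frame that best matches the
    P-frame block whose top-left corner is (x, y). Returns (cost, dx, dy)."""
    height, width = len(i_frame), len(i_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            sx, sy = x + dx, y + dy
            if 0 <= sx <= width - size and 0 <= sy <= height - size:
                cost = sad(p_frame, x, y, i_frame, sx, sy, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best


# Toy frames: the camera pans right by 2 pixels, so the picture slides left.
i_frame = [[row * 8 + col for col in range(8)] for row in range(8)]
p_frame = [[i_frame[row][min(col + 2, 7)] for col in range(8)] for row in range(8)]

print(best_motion_vector(i_frame, p_frame, 2, 2, 2, 3))  # (0, 2, 0)
```

The cost of 0 means this block needs no correction at all. A nonzero cost is what the stored corrections have to make up for.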
This idea of tracking motion blocks works for pans, zooms and also for objects moving in a scene. But what does it do at the edge of the screen where new material is being revealed?
Query 10: Temporal redundancy is the key idea in MPEG. What does
it mean?
An Extra Section (More about Walsh Transforms) for the Mathematically Literate
Since we don't have Linear Algebra as a prerequisite for this course, we will end the lecture here. But I'm including the following material for the mathematically literate who may have a burning desire to know more about Walsh Transforms.
To cut down on the typing, we'll use signals with four samples instead
of 8. Here's the basis set, as a matrix called M4. Note that the bases
B0, B1 etc. form columns in M4.
1 1 1 1
M4 = 1 1 1 0
1 1 0 1
1 0 0 0
Now, if I multiply a Walsh transform vector like T = (1 0 0 -2) by placing its column form on the right of M4, I should get the corresponding signal S = (-1 1 -1 1). I hope you got that result, too.
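To check the multiplication without a pencil, here is a sketch with the matrix typed in as a list of rows. The name mat_vec is made up, and no libraries are needed.

```python
M4 = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
]


def mat_vec(M, v):
    """Multiply matrix M by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


T = [1, 0, 0, -2]
print(mat_vec(M4, T))  # [-1, 1, -1, 1]
```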
This trick works if you know T and want S. But what if you have S and
want to find T? Well, if you remember your linear algebra (which very few
computer scientists seem to remember, if they ever had it back in high
school), you know that if S = M*T, then T = R*S where R is the "inverse"
of M. If you multiply M * R, you get the Identity matrix I (left-up to
right-down diagonal=1, all else=0). Not all matrices have inverses, but
fortunately the Walsh matrices do. Here's R4:
0 0 0 1
R4 = -1 1 1 -1
1 0 -1 0
1 -1 0 0
Query: Compute R4*S, where S = (-1 1 -1 1), and see if you get our alleged transform, which was T = (1 0 0 -2).
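After you've tried it by hand, a sketch like this one can confirm both claims: that M4 times R4 really is the identity, and that R4 recovers T from S. The helper names mat_mul and mat_vec are made up.

```python
M4 = [[1, 1, 1, 1],
      [1, 1, 1, 0],
      [1, 1, 0, 1],
      [1, 0, 0, 0]]

R4 = [[0, 0, 0, 1],
      [-1, 1, 1, -1],
      [1, 0, -1, 0],
      [1, -1, 0, 0]]


def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(M, v):
    """Multiply matrix M by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


for row in mat_mul(M4, R4):
    print(row)                       # the rows of the 4x4 identity matrix

print(mat_vec(R4, [-1, 1, -1, 1]))   # [1, 0, 0, -2]
```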
SUMMARY: If I gave you a series of numbers (of some length n which is a power of 2), you could compute its Walsh transform, and vice versa. The most important thing to note is that the transform is a frequency analysis of the signal. It can be read as saying that the signal has 1 volt of B0 (=DC) and -2 volts of B3 (which was the component with frequency 1/2.)
What do we mean "frequency 1/2"? Well, if we knew what units the time axis was in, we could calibrate that. Assume we're taking 1000 samples per second. So we see that B3's frequency is 1/2 cycle per millisecond, or 500 cycles/second (Hz).
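The calibration is one multiplication: cycles per sample times samples per second. A tiny sketch, with a made-up function name:

```python
def to_hertz(cycles_per_sample, samples_per_second):
    """Convert a frequency in cycles per sample into cycles per second."""
    return cycles_per_sample * samples_per_second


print(to_hertz(0.5, 1000))  # 500.0
```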
Now what about the DCT? It's the same deal, only with a different matrix.
The DCT matrix has somewhat better mathematical properties for image compression,
but it's less intuitive to explain than the Walsh is.