Run-Length Encoding - Almer S. Tigelaar

When going on holiday I always pack my clothes in vacuum bags. These nifty bags are made from plastic. Put some clothes in one end, seal the bag, then push out all the air of the main compartment, and your clothes are well: sealed. They take up much less space: physical space that is.

This brief article is not about physical compression, but about compressing data. Nevertheless, the goal and implications are somewhat comparable: data takes up less space in transit, but also: it costs more time to access it again. Before being able to use it, the data needs to be decompressed: same as taking those clothes out of the vacuum bag.

Compression aside, my black t-shirts necessarily always take up the same amount of physical space each. Whether I bring one or five, and no matter how much I squeeze out the air. However, data can actually be compressed in a way where five shirts take only the space of two. Confused yet? Let’s see how this works.

The Concept of Compressing Runs

How can we take some data and represent it in a smaller way? Let’s say we have the text “AAAABBBCC”. This consists of 9 characters, but some of them repeat. A much shorter way to write this is to take advantage of this repetition. Since we see four A’s followed by three B’s and two C’s we could write this as “4A3B2C”. This brings the required space down from 9 characters to only 6, thus achieving a 9 / 6 = 1.5x compression ratio.

This technique is called Run-Length Encoding (RLE). The AAAA is a ‘run’ (sequence) of four times A and we ‘encode’ that as 4A. However, there is a problem with this approach. Consider a text like “ABABAB”. This definitely has a pattern, but not one that RLE can exploit. In fact applying it just inflates the required space by a factor two!: “1A1B1A1B1A1B”.

A smarter way may be to use some special character to denote the start of a run, for example “X”. Consider the text “ABAAAAB”. Encoding using this approach gives us “ABX4AB”. The X acting as an ‘escape’ character. This takes up a bit of space, but saves us from having to encode runs of length one.

The “X” helps prevent wasting space, but RLE does not actually achieve any compression on locally heterogeneous data like the “ABABAB” example. Hence, this technique is only useful in contexts where longer repeated sequences are common. This is not something we often find in real text data, like this article. So, where has RLE been usefully applied?

A History of Reinvention

Run-Length Encoding was invented many times, likely due to its intuitive nature. Early mentions of the technique appear in papers from the fifties. Furthermore, it was applied to analog television signals in the sixties. This points us to one of the main applications of this technique: images.

On both Bulletin Board Systems and the early Internet, images exchanged were far less rich than the full colour renderings that are ubiquitous nowadays. The ‘pixel art’, that evokes nostalgic feelings nowadays, was really the only thing that actually existed for some time, due to limitations on both the storage and transmission of data.

Images with decreasing colour depth (number of colours, in this case greyscale). From left to right: 256 shades of grey, 8 shades and 4 shades respectively.

I remember the first hand-scanner we had. This device needed to be manually pulled at an even pace over whatever you wanted to scan. Not only was this quite impossible to get right, the resolution was a meager 320 x 200 by default. This is so small that you could fill a modern full high definition screen with 36(!) scanned photos. It could also only digitize into grayscale with a maximum of 256 shades of gray, the extremes being black and white. Contrast this with the images of today that by default contain over sixteen millions colours.

Wait? How did we get to sixteen million? An image in a computer is represented as a grid with cells called pixels, with each pixel having a colour. An easy way to think of this is as a large spreadsheet, where each cell in the sheet has a colour [2].

There are different ways to define colours. A common one called RGB defines for each pixel the level of Red, Green and Blue as a number between 0 and 255. That gives us 256 possibilities for each colour, which gives us $256^3 = 16.7$ million possible unique values for each pixel. This is referred to as 24-bit colour depth or “true colour”, and is ubiquitous on screens and for stored images nowadays.

Applications

The nice side effect of having fewer distinct colours in an image is that it is more likely that adjacent pixels have the exact same value: an ideal application for run-length encoding. Consider an image that starts with four white pixels (WWWW), followed by three black pixels (BBB), followed by two gray pixels (GG), we could encode that as 4W3B2G.

It was exactly this application that made RLE popular for images with few different colours. Early Windows users may recall that in Paint, the (in)famous desktop painting program, you could save your drawings as “Bitmap (RLE-Encoded)”, as well as Targa (TGA) and PiCture eXchange (PCX). All of these took advantage encoding runs of a small number of unique colours, and thus made the final image take up less space on disk. Which again: back then was a lot scarcer than it is nowadays.

Though RLE may feel rather ‘basic’, it is still used in many places where runs are common. Even if other more advanced compression techniques produce these runs. Examples are the popular JPEG image format and the Burrows-Wheeler Transform. For both the cases the data is first transformed using some other technique, namely entropy encoding for JPEG, and block-sorting for the Burrows-Wheeler Transform. Both subjects for an other posts.

A Run-Length Encoding Implementation

Here is naive implementation in Python. Creating a version that has uses the “X” escape character, as suggested earlier, is left as an exercise to the reader, including further run-time optimizations.

def rle_encode(text):
    last_character = None
    count = 0
    output = ""

    for character in text:
        if character == last_character:
            count += 1
        else:
            if last_character:
                output += "{}{}".format(count, last_character)
            last_character = character
            count = 1

    output += "{}{}".format(count, last_character)
    return output

Conclusion

Run-Length Encoding is one of the most basic and intuitive data compression techniques. Despite its seemingly simple premise, it has earned a lasting spot in various places in the data compression landscape. If you have used any computing device, you most certainly leveraged it.