The Machine Learning Myth

“I was recently at a demonstration of walking robots, and you know what? The organizers had spent a good deal of time preparing everything the day before. However, in the evening the cleaners had come to polish the floor. When the robots started, during the real demonstration the next morning, they had a lot of difficulty getting up. Amazingly, after half a minute or so, they walked perfectly on the polished surface. The robots could really think and reason, I am absolutely sure of it!”

Somehow the conversation had ended up here. I stared with a blank look at the lady who was trying to convince me of the robot’s self-awareness. I was trying to figure out how to tell her that she was a little ‘off’ in her assessment.

As science-fiction writer Arthur C. Clarke said: any sufficiently advanced technology is indistinguishable from magic. However, conveying this to someone with little knowledge of ‘the magic’, other than that gleaned from popular fiction, is hard. Despite trying several angles, I was unable to convince this lady that what she had seen the robots do had nothing to do with actual consciousness.

Machine learning is all the rage these days, demand for data scientists has risen to similar levels as that for software engineers in the late nineties. Jobs in this field are among the best paying relative to the number of years of working experience. Is machine learning magic, mundane or perhaps somewhere in between? Let’s take a look.

A Thinking Machine

Austria, 1770. The court of Maria Theresa buzzes. The chandeliers mounted on the ceiling of the throne room cast long shadows on the carpeted floor. Flocks of nobles arrive in anticipation of the demonstration about to take place. After a long wait, a large cabinet is wheeled in, overshadowed by something attached to it. It looks like a human-sized doll. Its arms lean over the top of the cabinet. In between those puppet arms is a chess board.

The Mechanical Turk by Wolfgang von Kempelen

The cabinet is accompanied by its maker, Wolfgang van Kempelen. He opens the cabinet doors which reveals cogs, levers and other machinery. After a dramatic pause he reveals the true name of this device: the Turk. He explains it is an automated chess player and invites Empress Maria Theresa to play. The crowd sneers and giggles. However, scorn turns to fascination as Maria Theresa’s opening chess move is countered by the mechanical hand of the Turk by a clever counter move.

To anyone in the audience the Turk looked like an actual thinking machine. It would move its arm just like people do, it took time between moves to think just like people do, it even corrected invalid moves of the opponent by reverting the faulty moves, just like people do. Was the Turk, so far ahead of its time, really the first thinking machine?

Unfortunately: no, the Turk was a hoax, an elaborate illusion. Inside the cabinet a real person was hidden. A skilled chess player, who controlled the arms of the doll-like figure. The Turk shows that people can see and have an understanding of an effect, but fail to correctly infer its cause. The Turk’s chess skills were not due to its mechanics. Instead they were the result of clever concealment of ‘mundane’ human intellect.

The Birth of Artificial Intelligence

It would take until the early 1950’s before a group of researchers at Darthmouth started the field of artificial intelligence. They believed that if the process of learning something can be precisely described in sufficiently small steps, a machine should be able to execute these steps as well as a human can. Building on existing ideas that emerged in the preceding years, the group set out to lay some of the groundwork for breakthroughs to come in the ensuing two decades. Those breakthroughs did indeed come in the form of path finding, natural language understanding and even mechanical robots.

Arthur Samuel

Around the same time, Arthur Samuel of IBM wrote a computer program that could play checkers. A computer-based checkers opponent had been developed before. However, Samuel’s program could do something unique: adapt based on what it had seen before. He roughly did this by making the program store the moves that led to games that were won in the past, and then replaying those moves at appropriate situations in the current game. Samuel referred to this self-adapting process as machine learning.

What then really is the difference between artificial intelligence and machine learning? Machine learning is best thought of as a more practically oriented sub field of artificial intelligence. With mathematics at its core, it could be viewed as a form of automated applied statistics. At the core of machine learning is finding patterns in data and exploiting those to automate specific tasks. Tasks like finding people in a picture, recognizing your voice or predicting the weather in your neighbourhood.

In contrast, at the core of artificial intelligence is the question of how to make entities that can perceive the environment, plan and take action in that environment to reach goals, and learn from the relation between actions and outcomes. These entities need not be physical or sentient, though that is often the way they are portrayed in (science) fiction. Artificial intelligence intersects with other fields like psychology and philosophy, as discussed next.

Philosophical Intermezzo: Turing and the Chinese room

Say, a machine can convince a real person into thinking that it – the machine – is a human being. By this very act of persuasion, you could say the machine is at least as cognitively able as that human. This famous claim was made by mathematician Alan Turing in the early fifties.

Turing’s claim just did not sit well with philosopher John Searle. He proposed the following thought experiment: imagine Robert, an average Joe who speaks only English. Let’s put Robert in a lighted room with a book and two small openings for a couple of hours. The first opening is for people to put in slips of paper with questions in Chinese: the input. The second opening is to deposit the answers to these questions written on new slips, also in Chinese: the output.

Searle’s Chinese Room

Robert does not know Chinese at all. To help him he has a huge book in this ‘Chinese’ room. In this book he can look up what symbols he needs to write on the output slips, given the symbols he sees on the input slips. Searle argued that no matter how many slips Robert processes, he will never truly understand Chinese. After all, he is only translating input questions to output answers without understanding the meaning of either. The book he uses also does not ‘understand’ anything, as it contains just a set of rules for Robert to follow. So, this Chinese room as a whole can answer questions. However, none of its components actually understands or can reason about Chinese!

Replace Robert in the story with a computer, and you get a feeling for what Searle tries to point out. Consider that while a translation algorithm may be able to translate one language to the other, being able to do that is insufficient for really understanding the languages themselves.

The root of Searle’s argument is that knowing how to transform information is not the same as actually understanding it. Taking that one step further: in contrast with Turing, Searle’s claim is that merely being able to function on the outside like a human being is not enough for actually being one.

The lady I talked with about the walking robots had a belief. Namely, that the robots were conscious based on their adaptive response to the polished floor. We could say the robots were able to ‘fool’ her into this. Her reasoning is valid under Turing’s claim: from her perspective the robots functioned like a human. However, it is invalid under Searle’s, as his claim implies ‘being fooled’ is simply not enough to prove consciousness.

As you let this sink in, let’s get back to something more practical that shows the strength of machine learning.

Getting Practical with Spam

In the early years of this century spam emails were on the rise. There was no defense against the onslaught of emails. So too thought Paul Graham, a computer scientist, whose mailbox was overflowing like everyone else’s. He approached it like an engineer: by writing a program. His program looked at the email’s text, and filtered those that met certain criteria. This is similar to making filter rules to sort emails into different folders, something you have likely set up for your own mailbox.

Graham spent half a year manually coding rules for detecting spam. He found himself in an addictive arms race with his spammers, trying to outsmart each other. One day he figured he should look at this problem in a different way: using statistics. Instead of coding manual rules, he simply labeled each email as spam or not spam. This resulted in two labeled sets of emails, one consisting of genuine mails, the other of only spam emails.

He analyzed the sets and found that spam emails contained many typical words, like ‘promotion’, ‘republic’, ‘investment’, and so forth. Graham no longer had to write rules manually. Instead he approached this with machine learning. Let’s get some intuition for how he did that.

Identifying Spam Automagically

Imagine that you want to train someone to recognize spam emails. Your first step is showing the person many examples of emails that are labeled genuine and that are labeled spam. That is: for each of these emails the participant is told explicitly whether each is genuine or spam. After this training phase, you put the person to work to classify new unlabeled emails. The person thus assigns each new email a label: spam or genuine. He or she does this based on the resemblance of the new email with ones seen during the training phase.

Replace the person in the previous paragraph with a machine learning model and you have a conceptual understanding of how machine learning works. Graham took both email sets he created: one consisting of spam emails, the other of genuine ones. He then used these to train a machine learning model that looks at how often words occur in text. After training he used the model to classify new incoming mails as being either spam or genuine. The model would for example mark an email as spam if the word ‘promotion’ would appear in it often. A relation it ‘learned’ from the labeled emails. This approach made his hand-crafted rules obsolete.

Your mailbox still works with this basic approach. By explicitly labeling spam messages as spam you are effectively participating in training a machine learning model that can recognize spam. The more examples it has, the better it will become. This type of application goes to the core of what machine learning can do: find patterns in data and bind some kind of consequence to the presence, or absence, of a pattern.

This example also reveals the difference between software engineering and data science. A software engineer builds a computer program explicitly by coding rules of the form: if I see this then do that. Much like Graham tried to initially combat spam. In contrast, a data scientist collects a large amount of things to see, and a large amount of subsequent things to do, and then tries to infer the rules using a machine learning method. This results in a model: essentially an automatically written computer program.

Software Engineering versus Machine Learning

If you would like a deeper conceptual understanding and don’t shy away from something a bit more abstract: let’s dive a little bit deeper into the difference between software engineering and machine learning. If you don’t: feel free to skip to the conclusion.

As a simple intuition you can think of the difference between software engineering and machine learning as the difference between writing a function explicitly and inferring a function from data implicitly. As a minimal contrived example: imagine you have to write a function f that adds two numbers. You could write it like this in an imperative language:

function f(a, b):
    c = a + b
    return c

This is the case were you explicitly code a rule. Contrast this with the case where you no longer write f yourself. Instead you train a machine learning model that produces the function f based on many examples of inputs a and b and output c.

train([(a, b, c), (a, b, c), (...)]) -> f

That is effectively what machine learning boils down to: training a model is like writing a function implicitly by inferring it from the data. After this f can be used on new previously unseen (a, b) inputs. If it has seen enough input examples, it will perform addition on the unseen (a, b) inputs. This is exactly what we want. However, consider what happens if the model would be fed only one training sample: input (2, 2, 4). Since 2 * 2 = 4 and 2 + 2 = 4, it might yield a function that multiplies its inputs instead of adding them!

There are roughly two types of functions that you can generate, that correspond to two types of tasks. The first one, called regression, returns some continuous value, as in the addition example above. The second one, called classification, returns a value from a limited set of options like in the spam example. The simplest set of options being: ‘true’ or ‘false’.

Effectively we are giving up precise control over the function’s exact definition. Instead we move to the level of specifying what should come out given what we put in. What we gain from this is the ability to infer much more complex functions than we could reasonably write ourselves. This is what makes machine learning a powerful approach for many problems. Nevertheless, every convenience comes at a cost and machine learning is no exception.

Limitations

The main challenge for using a machine learning approach is that large amounts of data are required to train models, which can be costly to obtain. Luckily, recent years have seen an explosion in available data. More people are producing content than ever, and wearable electronics yield vast streams of data. All this available data makes it easier to train useful models, at least for some cases.

A second caveat is that training models on large amounts of data requires a lot of processing power. Fortunately, rapid advances in graphics hardware have led to orders of magnitude faster training of complex models. Additionally, these hardware resources can be easily accessed through cloud services, making things much easier.

A third downside is that, depending on the machine learning method, what happens inside of the model created can become opaque. What does the generated function do to get from input to output? It is important to understand the resulting model, to ensure it performs sanely on a wide range of inputs. Tweaking the model is more an art than a science, and the risk of harmful hidden biases in models is certainly realistic.

Applying machine learning is no replacement for software engineering, but rather an augmentation for specific challenges. Many problems are far more cost effective to approach by writing a simple set of explicitly written rules instead of endlessly tinkering with a machine learning model. Machine learning practitioners are best of not only knowing the mechanics of each specific method, but also whether machine learning is appropriate to use at all.

Conclusion

Many modern inventions would look like magic to people from one or two centuries ago. Yet, knowing how these really work shatters this illusion. A huge increase in the use of machine learning to solve a wide range of problems has taken place in the last decade. People unfamiliar with the underlying techniques often both under and overestimate the potential and the limitations of machine learning methods. This leads to unrealistic expectations, fears and confusion. From apocalyptic prophecies to shining utopias: machine learning myths abound where we are better served staying grounded in reality.

This is not helped by naming. Many people associate the term ‘learning’ with the way humans learn, and ‘intelligence’ with the way people think. Though sometimes a useful conceptual analog, it is quite different from what these methods currently actually do. A better name would have been data-based function generation. However, that admittedly sounds much less sexy.

Nevertheless, machine learning at its core is not much more than generating functions based on input and, usually, output data. With this approach it can deliver fantastic results on narrowly defined problems. This makes it an important and evolving tool, but like a hammer for a carpenter: it is really just a tool. A tool that augments, rather than replaces, software engineering. Like a hammer is limited by laws of physics, machine learning is fundamentally limited by laws of mathematics. It is no magic black box, nor can it solve all problems. However, it does offer a way forward for creating solutions to specific real-world challenges that were previously elusive. In the end it is neither magic nor mundane, but is grounded in ways that you now have a better understanding of.

Resources

  1. Singh, S. (2002). Turk’s Gambit.
  2. Rusell, S. & Norvig, P. (2003). Artificial Intelligence.
  3. Halevy, A. & Norvig, P. & Pereira, F. (2010). The Unreasonable Effectiveness of Data.
  4. Lewis, T. G. & Denning, P. J. (2018). Learning Machine Learning, CACM 61:12.
  5. Graham, P. (2002). A Plan for Spam.
  6. O’Neil, C. (2016). Weapons of Math Destruction.
  7. Stack Overflow Developer Survey (2019).

Impact

“How can I save what I have made?”, this is the question a young girl asked me in her school’s class recently. In the past hour she had created her own website. It was about her and the books that she loved. Standing in front of me with her hands clamped around the loan laptop, she wanted to know how she could save her website. Her protectiveness revealed that she did not do this because she wanted to show it off, but because she wanted to continue to build it at home.

Primary schools are not my day-to-day work environment. That day I participated in a voluntary event, organized by my work, which had the goal of making societal impact. Among the various available options, I had chosen to learn children in disadvantaged neighbourhoods about computer programming. One of these assignments was making a website.

The girl in question on the left

The girl that came to me with the question about how to save her website was one of a group of four that I was actively helping that day. All the girls in that small group were eager to learn, but she took the assignments quite seriously. She had the potential, the motivation and the attention span. This experience got me thinking: what would her chances be of actually making a career in IT?

There is a huge demand for skilled IT personnel in the Netherlands: forty-two percent of employers in IT struggles to find people, compared to only eighteen percent for general employment [1]. Given this demand, it should be easy for her to launch a career in IT, right?

Challenges

Many people want equal career opportunities for everyone, but the sad truth is that we do not live in a society that is meritocratic enough to meet that ideal. I see three major challenges for this girl: where she lives, the ethnical group she is part of and her gender.

Firstly, she lives in a disadvantaged neighbourhood. Sadly, the statistics about this are not at all uplifting [2]. Children that grow up in these neighbourhoods score lower on tests, experience more health problems and more psychological problems, and their poor socio-economic status tends to extend across generations. The only real exception to this is kids that have a resilient personality. The effects on them are smaller, but they also move out of such neighbourhoods much earlier in life [3, 4]. Nevertheless, even if she has a resilient personality, the fact remains that her neighbourhood provides less access to modern technology.

Secondly, even if she would go into IT and become, say, a software developer, she is part of an ethnic minority. The unfortunate fact is that minorities, also in the Netherlands, have a harder time finding a job and holding on to it: eight percent of highly educated minorities have no job, where this is only about three percent for the majority group: a telling sign. Even disregarding all this, the question remains: would she end up in IT at all?

The Gender Trap

When I was choosing a school to go to learn the craft of software development, in the late nineties, I attended various information events. I distinctly remember one such event. As almost everyone there, I too was accompanied by a parent: my mother. At the end of the presentation, she was the one parent in the room that dared ask the question: “are there any girls studying here?” The unsatisfying answer after a confused stare and some silence was: “no, there aren’t any, this is a technical school.”

This brings me to my third point: has that imbalance improved at all since that time? We would expect perhaps that half of the people working in IT is female, and the other half is male. However, that is not the case looking at present day workforce numbers. The country that seems to do this best is Australia with nearly three out of ten IT workers being female, whereas the Netherlands scores poorly with a score of only half that: less than two out of ten [5].

Is that due to a difference in capability? Are males ‘naturally’ better at these more technical subjects than females? There indeed is a bit of a developmental difference between boys and girls, but it may not be what you think.

Let’s take the most abstract subject as example: math. It may appear that boys are better at math, but that is not really true. Averaging out over their entire childhood into adolescence: boys and girls really are equally good at math. However, girls tend to develop language skills faster and earlier. Since boys lag behind in that development, it appears that they are better at math. Don’t be fooled by this deception: girls are not worse at math, they are better at language [6].

This difference between the genders disappears throughout adolescence, but by that time, paths have already been chosen. This is a sad consequence of our education systems pressuring children to make life affecting choices early on, which is not helped by reinforced stereotypes through gender specific toys and activities.

Parents

This brings me to another question: to what extent can parents influence their children? Specifically, can average parents, by parenting, influence their children’s personality and intelligence? How much do you think that influence is? Think about it for a minute.

This may come as a shock, but the effect of parents on their children’s personality and intelligence does not exist. To better understand this: consider that if anything parents do affect their children in any systematic way, then children growing up with the same parents will turn out to be more similar than children growing up with different parents, and that is simply not the case [7].

So, what does influence a child’s personality and intelligence? Half of that is genes, the other half is experiences that are unique to them. Those experiences are shaped largely by their peers. To some extent we all already know this: whether adolescents smoke, get into problem with the law, or commit serious crimes depends far more on what their peers do, then on what their parents do [8].

Don’t misunderstand me. Parents can inflict serious harm on their children in myriad ways, leaving scars for a lifetime, but that’s not what the average parent does. On the flip side, parents can also provide their children with the opportunity to acquire skills and knowledge, for example through reading, by playing a musical instrument or even encouraging them to explore computer programming. Hence, they can influence the conditions for them to get the right kinds of unique experiences.

Conclusion

Despite the odds being stacked against the girl that asked me how she could save her website. Looking backwards, why was that volunteering day so important? Was I really there to teach her how to do computer programming? No, not really, I was there to provide her with a unique experience that is rare to have in her regular day-to-day environment. It is my hope that his has cemented a positive association with IT in her, which may just tip the balance the right way.

As a final word I think that these kinds of volunteering efforts are important. Whether you work in IT or elsewhere, please consider giving others unique inspirational experiences to save, cherish and build upon. Because despite her odds, what she experienced may in the end actually make the difference for both her and her friends.

Sources

  1. Kalkhoven, F. (2018). ICT Beroepen Factsheet.
  2. Broeke, A. ten (2010). Opgroeien in een slechte wijk.
  3. Mayer & Jencks (1989). Growing Up in Poor Neigborhoods.
  4. Achterhuis, J. (2015). Jongeren in Achterstandswijk.
  5. Honeypot (2018). Women in Tech Index 2018.
  6. Oakley, B. (2018). Make your Daughter Practice Math.
  7. Turkenheimer, E. (2000). Three Laws of Behavior Genetics and What They Mean.
  8. Pinker, S. (2002). The Blank Slate.

Representing Data: Bits and Bytes

If you use a computer regularly, you know that every document you write, every music file you listen to and every photo you see is stored as a file. You may also know that a file consists of bytes and that bytes consist of smaller building blocks called bits. But, what are bits and bytes really? How do they work, why do they work that way, and how does storing them create text, sound and images? To find out we have to start with something very basic: counting.

Decimal Counting

Consider what happens if you count using ten fingers. Once you reach ten you have no more fingers left and thus, when you reach eleven, you remember one times ten and keep up one finger. This same thing happens over and over again as you keep counting upwards. This is also reflected in the way that we count when we write down numbers. Though in that case we use ten symbols instead of ten fingers: 0 up to including 9. Once we reach nine and want to express ten, we no longer have any symbols. So, we remember 1, which for writing means: we shift the number 1 to the left, and start over again with a 0, giving us 10. The difference with finger counting is that for zero we use no fingers at all, whereas when writing numbers we use the symbol 0 to denote zero.

Counting Conventions

The way we write down numbers is just a convention. The number system that we use most often is the decimal numeral system, because it has, as discussed, ten symbols that we enumerate. Consider that there is nothing stopping us from defining a counting system with only five symbols: 0 up to including 4. In this case we will run out of symbols when we want to express five, and we will have to do the same thing we did before: shift 1 to the left and then start over again. This gives us (again) 10. But since we now count with five symbols, this means that the two symbols 10 actually represent the value five. If you find this confusing, try counting from zero to five using your fingers on only one hand and use the first finger for the zero. Notice that reaching five forces you to remember one and continue counting by resetting your hand to one finger again for zero.

Binary Counting

The ten-symbol decimal counting system is just one of an infinite number of possible counting systems. However, there are only several such systems in common usage. Namely, the octal system that uses eight symbols, the hexadecimal system which uses sixteen symbols (letters A through F are used to denote 10 up to 16) and also the binary system, which uses only two symbols: 0 and 1. Having only two symbols is similar to counting with only two fingers. So, how would we count from zero to three using only two symbols? Zero would just be 0, one would just be 1. For two it becomes more complicated. Since we have now run out of symbols we need to shift 1 left and start over with zero, giving us 10. Finally, for three: since we still have a next symbol for the rightmost position, we only have to replace the 0 with a 1 to express three in binary, giving us 11. If we now want to go up to four, we can not increase the existing symbols anymore. Hence, we have to set them to 0, and add a new position with a 1, giving us 100. In the table below we show the values for counting up to including five. If this is not immediately clear: do not worry, a more intuitive explanation follows.

DecimalBinary
00
11
210
311
4100
5101

Bits

The binary system brings us to our first definition: the binary digit, commonly shortened to ‘bit’. One bit is thus simply a single-digit binary number. Glancing at the table above we see that higher decimal numbers need more bits to be expressed in binary notation. Specifically, to store the number two or three we need two bits, and to store the number four or five we need three bits. The key insight is that by concatenating bits, that themselves can only take two values, we can represent any whole number, no matter how small or large, by using a sufficient amount of bits.

there are only 10 kinds of people in the world: those who understand binary, and those who do notnumeral base joke

An intuitive way to think of bits is as switches that turn on or off values. Take a look at the table below. The first row contains numbers, the second row contains a 0 if the number is ‘off’ and a 1 if the number is ‘on’. Try to work out the decimal value by summing the values of the first row for all the ‘on’ switches.

Solution (click to expand)

128 * 0 + 64 * 1 + 32 * 0 + 16 * 0 + 8 * 1 + 4 * 0 + 2 * 1 + 1 * 0 = 64 + 8 + 2 = 74

1286432168421
01001010

 

The second row represents a number (01001010) written in binary notation. The first row consists of powers of two if you consider them from right to left. In fact we can simply rewrite the first row of the table as follows:

2^72^62^52^42^32^22^12^0
01001010

Why powers of two? Well, since we have two symbols: 0 and 1. Starting at the right, each step towards the left increases the value by one power of two. If we would have three symbols, each position step would represent a power of three. Consider what happens when we have not two, but ten symbols, as in the discussed decimal system: each step towards the left is then an increase of a power of ten. An example:

10^3 = 100010^2 = 10010^1 = 1010^0 = 1
1100

 

If we try to express this as a decimal number we get 1000 + 100 = 1100, which in fact is exactly the same as the on/off switches concatenated (1100), as each switch represents a power of ten.

Bitwise Operations

Back to bits: it may seem as though storing numbers in binary is inefficient, as we need more bits to store higher numbers. However, the number of bits required does not increase linearly, instead it decreases exponentially. To see this, let us write the decimal number ten in binary: 1010. This requires four bits, now a thousand: 11 1110 1000, requires only ten positions despite being a magnitude hundred larger than ten. This in another key insight: we can represent very large numbers with relatively few bits.

You can think of a bit as the smallest possible piece of information. Many things map nicely on the two states that a single bit can have. For example: a logic statement can be either true or false. However, by adding additional bits we can express more states than just two and perform actual arithmetic with them. If we want to add the numbers 1 and 2 in binary we can simply do this by turning on all the switches that are ‘on’ in the representation of both 1 and 2 of the result. If we look at 1 and 2 in the table below, we see that to form 3, the sum of these numbers, we can look at the two rows above 3 and set the bit to 1 if either the bit in the row of 1 OR the row of 2 is one. This is why this is called an OR operation.

2^32^22^12^0
1 =0001
2 =0010
3 =0011

 

There are in fact four of these logical operations we can do on bits:

  • OR : turns on the switch in the result if either of the two input switches are on.
  • AND : turns on the switch in the result if both of the two input switches are on.
  • XOR : turns on the switch in the result if either of the two input switches are on, but not both. This is called an eXclusive-OR.
  • NOT : takes one input row and inverts all the switches, turning all zeros to 1 and all ones to 0.

With these four operations we can implement common mathematical operations on whole numbers. A hardware implementation of one of these operations is called a logic gate and your computer has many of them: billions.

Bytes & Text

Now we have some feeling for bits, let us turn to bytes. Computers only work with these binary numbers expressed as groups of bits. So, how do we get from groups of bits to stuff such as text, audio and images? The answer: mappings. You are probably familiar with many systems in the real world where a number ‘maps’ to something else. For example: a telephone number maps to a person or a company, a postal code maps to some area within a city, and in some restaurants a number maps to a specific dish. The same mechanism of mapping a number to something with a specific semantic interpretation is used in computers.

Characters

For storing text on a computer we need to map the numbers to the alphabet. There a twenty-six letters in the alphabet. However, if we need to express both upper and lowercase letters, we would need to double that amount: fifty-two. How many bits would that take? The closest power of two is 64 which is 2^6, thus we need at least 6 bits. However, we also need to express many other symbols. Take a look at your keyboard and you will find that it contains at least a 100 or so keys, and some combinations of keys can produce yet other symbols. Though, historically six bits were indeed used to store text, over time many more symbols were added. This eventually led to an 7-bit standard known as ASCII, commonly extended with 1 extra bit for additional language-specific symbols. The modern day successor of this is Unicode which can use up to 32 bits per character, allowing for many more symbols. Yet, the first 7-bits of Unicode still map to the same symbols as ASCII.

Processing Units

Early microprocessors were designed to operate on eight bits at a time. Early personal computers used eight bits as well. Since these groups of eight bits form a natural unit we refer to them as a byte. A byte can take 2^8 = 256 distinct numeric values, which in practice means 0 up to including 255. Half a byte, four bits, is sometimes referred to as a nibble, which is a tong-in-the-cheek reference to the larger ‘bite’.

A byte is thus just a group of eight bits. Any meaning that a byte has really depends on what we are expressing. In a text file the bytes are characters, in a music file they are audio samples and in a digital photo they represent the colours of a specific part of the image.

Images

Image files are quite intuitive to understand. If you take a photo and put a grid over it, you can give each small square in the grid a specific colour. Zoom very far into a digital photo, and you will see this grid, the small squares are often referred to as pixels: a portmanteau of ‘picture element’. If a digital photo has more pixels, it contains more information and we say it has a higher resolution.

Colour

The colours of pixels can be encoded in bits. Indeed, colours are often expressed in the Red Green Blue (RGB) format. In a commonly used variant of this format each colour is one byte and can take a value from 0 to 255 that maps to the intensity of the colour. So, for each square in the grid we use three bytes: twenty-four bits. For example the colour (R=0, G=0, B=0) would represent pure white and (R=255, G=255, B=255) is pure black as in these cases all colours are mixed evenly. However, (R=255, B=0, G=0) would be bright red, and so on.

Hexadecimal

If you have ever edited a web page and had to change a colour, you probably noticed that colours can be expressed as a hashtags followed by some combination of characters and numbers, for example: \#\textrm{FF0000}. This takes us back to the discussion of counting systems. The hexadecimal system is another way to represent numbers. For this system we count 0 through 9 and then A through F, giving us 16 possibilities in total. Perhaps you already see what those symbols behind the hashtag mean.

Let us look at \#\textrm{FF0000}. This actually consists of three groups of two symbols. The first group is FF, we know that F maps to 15 in decimal, so the hexadecimal value of FF is 15 * 16^1 + 15 * 16^0 = 255. The other two groups are 0. These three groups in this notation are the three colours! The first group is red, with value 255 (\#\boxed{\mathbf{FF}}0000), the second is green with a value of 0, and the last is blue, also with value 0 (\#\textrm{FF}00\boxed{\mathbf{00}}). Hence, again we have bright red (R=255, G=0, B=0), but now expressed in hexadecimal as \#\textrm{FF0000}.

Practical Considerations

Modern images take up a lot of storage space when we would just store them in RGB format. A modern camera easily snaps photos at 16 million pixels, for each of these pixels we need to store three bytes. So that’s over 48 million bytes (megabytes) for one photo. Storing video is similarly challenging, as that is essentially a collection of still images in sequence, typically at least twenty-four per second. Fortunately, digital compression techniques exists that can make images and video files much smaller. These techniques take advantage of repetition and patterns within and across images and remove small differences that can not easily be seen by the human eye.

Sound & Samples

How sound is represented digitally is a bit more complex than text and images. To get a better impression of this, imagine you are sitting on a swing. You are swinging right to left from my perspective. Let’s say that I want to take pictures, so I can later play them back as a filmstrip. A choice I have to make is how many pictures I take relative to the time you are swinging.

If it takes you ten seconds to swing back and forth, and I only take a picture once every ten seconds, all pictures would have you frozen in the exact same place, since I am missing the the 9.99 seconds where you are actually swinging. If I take a picture every second, I’d see you swing, but it would look a bit choppy. Taking pictures in more rapid succession would fix this and yield smooth motion. However, the minimum amount of pictures I’d need to snap to be able to see you move is actually longer than one second: half the time it takes for a swing, which would be every 5 seconds. I’d catch you at times 0, 5 and 10, respectively for ‘up to the left’, ‘centered’, and ‘up to the right’, and so forth. Differently worded: I’d need to snap pictures twice as fast as the rate at which you are swinging.

Sampling

The process of determining how many pictures to take in an audio context is called sampling, and instead of pictures we record the amplitude of the audio signal at specific points in time. The signal consists of a mix of sound frequencies that relate to the pitch of what you are hearing. The speed of the swing is comparable with the highest possible frequency of the audio signal. Frequencies are expressed in Hertz, 1 Hertz = 1 swing per second.

If you imagine many people sitting on swings next to you, going back and forth at different speeds, we’d need to snap pictures so we can catch the fastest one: snapping twice as fast as that speed. This corresponds to the highest frequency for audio: we need to record a certain minimum amount of samples in order to reconstruct the original recording. If our sampling rate is too low, we will not be able to record sounds with a higher frequency. Like with the swings: we need to sample at least twice the rate of the highest frequency we want to be able to record. This is one of the main reasons that telephone audio sounds rather poor: the sampling rate is low: 8000 samples are taken each second, limiting the range of the actual audio to 4000 Hertz. Human hearing can distinguish sounds from 1 up to 20 000 Hertz. This is the reason Compact Disc audio sounds so good: it takes 44 100 samples per second, enabling an actual range up to 22 050 Hertz.

Resolution

This does not get us the actual bytes yet that we need for audio, which takes us to the other part of representing audio digitally: how precise a value we are going to record for each sample. The more bits we use for a sample, the more accurately we can model the original signal. If we would only use one bit we can only record either no sound, or the loudest possible sound: it’s on or off. With two bits we can record four levels, etc. Compact Disc quality audio records samples with 16 bit resolution, giving 2^{16} = 65536 possible levels.

Practical Considerations

Like with video files, sound files also quickly grow large. In Compact Disc quality we need to store two bytes for every sample of which there are many thousands every second. Fortunately, as with video there are compression techniques to reduce the sizes of such files, the most famous of which is undoubtedly mp3 which takes advantage of differences that the human ear can not easily distinguish.

Conclusion

In this post you have learned about counting in the familiar decimal system, but also in the likely less familiar binary counting system. You have also seen how a number raised to a certain exponent relates to the counting system: the decimal system uses ten as the base number and the binary system uses two. These two possibilities act like an on/off switch, and each of these switches is referred to as a bit. You have an understanding of bitwise operations that can be performed to implement basic arithmetic. Finally, we have seen how groups of bits form bytes, and how bytes are often mapped to various things such as characters in text, colours in images and samples in sound. I hope this gives you a better feeling of how the basic primitives of modern computing, bits and bytes, relate to the things you see and read on screens and hear through speakers and headphones.