Passwords and Security

After a long journey he was nearly there. In the distance there was the outline of the city wall. Moments later he approached the city gate.
“Halt!”, shouted a heavily armed guard.
He had grown used to this ritual, so he went through the motions.
“What is the pass word?”, the guard asked.
He spoke the phrase he had memorized. The guard nodded, lowered his hands from his weapon, and stepped aside to allow him entry.

The above is how I imagine passwords came into common usage long ago. Passwords are not very practical in the above scenario, which is probably why we now have passports: literally a document to pass through some port, such as a city gate or a border. Checks at the border can also be done using fingerprints. If the guard took fingerprints and quickly compared them to a set of known prints, he could determine whether to let you pass based on a matching print.

Consider what these three things fundamentally represent:

  1. A password is something that you know: you need to memorize it.
  2. A passport is something that you have: you need to take it with you.
  3. A fingerprint is something that you are: you always have it with you.

Most security systems combine at least two of these three factors:

Access to your bank transactions requires two things. Firstly, your debit card: something that you have. Secondly, your Personal Identification Number (PIN): something that you know. Entering a modern house also requires two things: the keys to your door and the access code to disable the alarm, which again combines something that you have with something that you know. Finally, entering a foreign country may even combine all three ingredients: a border guard may ask why you are entering the country and where you will be staying, he will ask for your passport and may scan your fingerprints.

Where am I going with this? Good security systems combine at least two of the three factors above. Think about how you access all your on-line accounts like Google, Facebook and LinkedIn. Do you use a password? Is that the only thing that you use to gain access? The answer to that is likely yes, and that is not a good thing.

Of the three fundamental ingredients above, the password (something you memorize) is likely also the easiest to bypass. Not so much because of technical issues, although those do occur, but because of completely understandable human limitations.

The problem with passwords is that a complex password is hard to remember, and a simple password is easy to guess. Most people err on the side of making their passwords too simple. Why are such simple passwords so weak? For that we have to do some calculations.

Let us assume that you pick a single number between 1 and 10 as password. Let me think: you likely picked either a seven or a three, am I right? Even if I am not, people prefer some numbers over others, and that is exactly the root of the problem. Consider that with a single digit password I would need to guess only ten times and then I would certainly be right. If I can make my guesses a bit smarter – starting with the digits that are more often chosen – I may be able to guess ninety percent of the single digit passwords with only five tries.

Obviously we need something a little longer. A four-digit password has 10^4 = 10000 possible combinations, which is already much harder to guess. This is in fact the search space of the famous PIN codes. Some banks allow their customers to choose their own four-digit code, which is a bad idea. Four digits are, from a memorization point of view, ideal for representing a birth date or some other significant date. Consider that many such dates start with either 19 or 20, and we are left with only two digits to guess: 10^2 = 100 is a much smaller space of possibilities.

Digits are often not the only characters allowed in a password: letters are usually permitted as well. This seems sound, since the twenty-six letters of the alphabet give us an additional fifty-two possibilities (letters can be either lower- or uppercase), yielding (10+52)^4 = 14776336 possible passwords of length four. If we add in special characters this number grows even larger.
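To make these numbers concrete, here is a small back-of-the-envelope sketch; it is purely illustrative and simply reproduces the arithmetic above.

    # Search-space sizes for the examples above: four digits versus four
    # alphanumeric characters (10 digits plus 26 lowercase and 26 uppercase letters).
    digits = 10
    letters = 52

    print(digits ** 4)              # 10000 possible four-digit PINs
    print((digits + letters) ** 4)  # 14776336 possible four-character passwords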

Adding extra symbols (digits, letters, other characters) to the possible password range may seem like a good idea. However, just as we saw with numbers: if the patterns are predictable they are easy to guess. Consider that if we make a word of two characters in English there are a limited number of actually valid words: ‘of’, ‘it’ and ‘to’ are all valid. In contrast ‘tj’, ‘gh’ and ‘lq’ are not valid words. Sequences of letters that are not words are difficult to remember. Hence, people rarely use them. This leads to predictable passwords that usually consist of nouns combined with predictable number sequences: ‘Ghost2012’, ‘lipgloss’ and even ‘password’.

Indeed, the top five passwords are: ‘123456’, ‘password’, ‘12345’, ‘12345678’ and ‘qwerty’. Fortunately, only a small fraction of people actually use these passwords. Still, if you were to guess someone’s password using the ten most popular passwords, you would succeed in about sixteen out of every one thousand tries, which, while not spectacular, is still ridiculously high.

A thousand tries may seem like a lot, and it is if you had to type all those passwords yourself. However, this can be automated quite easily. Trying all possible passwords is called ‘brute-forcing’. A modern computer can easily do this at a rate of five thousand guesses per second. Using some statistical insights, such as those mentioned above, this process can be made highly effective. In fact most passwords under ten characters can be easily broken in several hours using off-the-shelf computer hardware.
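To get a feeling for how length drives up the guessing effort, here is a minimal sketch. It assumes the exhaustive worst case at the five-thousand-guesses-per-second rate mentioned above; real attackers use statistical shortcuts and far faster offline hardware, so treat the absolute numbers as illustrative only.

    # Worst-case exhaustive guessing time over 62 symbols (digits plus letters)
    # at an assumed rate of five thousand guesses per second.
    symbols = 62
    rate = 5_000  # guesses per second (assumption from the text)

    for length in (4, 6, 8, 10):
        seconds = symbols ** length / rate
        print(f"{length} characters: about {seconds / 3600:,.0f} hours")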

I hope it is clear by now that using only a password that you can memorize to secure your on-line accounts is a bad idea. So, how can we improve this?

There are at least three things that you can quite easily do with respect to passwords alone:

  1. Generate passwords instead of making them up yourself. No offense, but a password randomly generated by a computer is almost certainly better than anything you can think of yourself (a small sketch follows this list).
  2. Use long passwords. As we have seen, the length of a password is an easy way to increase the difficulty of guessing it. A minimal password consists of ten characters, but as computing power increases, this may rapidly become too short. A password of twelve characters is a more realistic minimum nowadays, and sixteen to thirty-two characters is a safe range.
  3. Use a different password for each service that you use. This way, when one account is breached, you do not get a domino effect.
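For the first two points, here is a minimal sketch of what generating a long random password could look like, assuming Python’s standard secrets module; the sixteen-character length and the character set are just example choices.

    # Generate a random sixteen-character password from letters, digits and
    # punctuation, using a cryptographically secure random source.
    import secrets
    import string

    alphabet = string.ascii_letters + string.digits + string.punctuation

    def generate_password(length: int = 16) -> str:
        return "".join(secrets.choice(alphabet) for _ in range(length))

    print(generate_password())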

Using a very long password is one of the few cases where you can get away with choosing your own. Consider that a long sentence as password is quite hard to guess: there are so many possible sentences! Even though a completely random password of the same length is harder to guess, this matters less if the password is sufficiently long.

If you are not into long passwords, then the best solution is to use a password manager of some sort. KeePass and LastPass are popular solutions that are easy to use. There are two caveats to these services:

  1. They usually use one strong ‘master’ password, which gives access to all the site-specific passwords. This is a single point of failure in some sense, and can also lead to a domino effect, but it is not a major problem if you have a sufficiently strong master password combined with two-factor authentication: more on that later.
  2. Some of these services may store your passwords ‘in the cloud’ in encrypted form. Understandably not everyone is okay with that. Fortunately, there are also variants which store your passwords locally on your own machine.

Using a password manager may feel like ‘writing down your password on a piece of paper’. In a sense this is true, but a strong password written down on a piece of paper that you keep in a safe place is much better than a weak password that you have memorized. The same applies to password managers: the benefits outweigh the risks.

Improvements to your password do not address the most pressing concern. Remember that most systems combine at least two of the three factors: something you know, something you have and something you are. A password is still only one of those ingredients. Hence, where possible, you should add another one of these ingredients.

Almost all major on-line service providers – Microsoft, Google, Facebook, Yahoo, et cetera – offer some form of two-factor authentication. One popular mechanism, called TOTP, consists of codes that are generated using an app on your phone. How does this work? You scan a QR image on the screen once, and a security app uses the data in this image to generate access codes that change every thirty seconds. You can set things up so that you are asked for a code only once a month on computers that you regularly use. So the effort is minimal and the security benefit is huge: in addition to guessing your password an attacker would have to gain access to your phone, which is way more difficult.
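For the curious, the sketch below shows roughly how such a code is derived. It is a minimal illustration of the TOTP scheme; a real implementation also decodes the base32 secret from the QR image, allows for clock drift and compares codes in constant time.

    # Minimal TOTP sketch: the phone app and the server derive the same
    # six-digit code from a shared secret and the current thirty-second window.
    import hashlib, hmac, struct, time

    def totp(secret: bytes, period: int = 30, digits: int = 6) -> str:
        counter = int(time.time()) // period            # current time window
        msg = struct.pack(">Q", counter)                # 8-byte big-endian counter
        digest = hmac.new(secret, msg, hashlib.sha1).digest()
        offset = digest[-1] & 0x0F                      # dynamic truncation
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    print(totp(b"shared-secret-from-the-qr-image"))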

Some other services may rely on sending you an SMS with a code, or an e-mail with a clickable link. This is a bit less secure, but still way better than only using a password, and thus certainly worth it. If you use a password manager, then securing it with some type of two-factor authentication is an absolute must.

Say that you want to secure some other service X that does not offer two-factor authentication. What to do? Well, the service may offer logging in via OpenID. This means that you can log in to the service using one of your main on-line accounts, like Google or Facebook. If you have secured that on-line account by enabling two-factor authentication, then transitively the account of service X is now also protected by two-factor authentication.

To wrap up: I recommend that you:

  1. Always use two-factor authentication wherever it is offered.
  2. Always construct sufficiently long passwords.
  3. Seriously consider using a password manager.

After a long journey the data packet, the first in a long data stream, was nearly there. Residing inside the last switch, in the distance there was the faint hum of a server. Moments later the packet had entered the server system. The server unwrapped the data packet and found a password inside. But it knew the password was not enough. The server generated the code it was expecting. It unwrapped the next packet in the stream and found the exact same code it had generated just a moment ago. It allowed the rest of the stream of packets to enter.

Renewed Keyboard Joy: Dvorak

Typing: you do it every day, nearly unconsciously. You think of what you want to appear on the screen. This is followed by some rattling sound and the next instant it is there. The blinking cursor stares at you as if to encourage you to keep going. Handwriting feels mostly like a thing of the past, since typing is so much faster for you, likely two or three times as fast. So, what would it be like if you were stripped of this ‘magical’ ability to type?

If you are like me, you probably learned how to type all by yourself. I never took a touch typing class, since it seemed like a waste of time. After all: I could already type, so why take a course to learn something I could already do?

Many self-learned typists adopt a hunt-and-peck style, meaning they need to look at the keyboard to find the keys. Usually this is done with only two fingers, since using more fingers obscures the view of the keyboard, making it harder to ‘hunt’. I did not adopt this style, but rather used the three-finger approach: both hands hover over the keyboard and type using the three strongest fingers: the thumb, index finger and middle finger. Occasionally I used the ring finger as well, though not consistently. Observing my typing style, I noticed that my hands positioned themselves in anticipation of the next key to strike. This all went seamlessly, achieving speeds of about eighty-five to a hundred words per minute, which is not bad at all.

Though my self-learned typing style worked for me, I did try to switch to touch typing several times. Particularly because my hands would feel strained after intense typing sessions. However, switching never worked out. I would intensely concentrate for one day, keeping my fingers on the QWERTY home row of ‘ASDF-JKL;’, touch typing as one should. Nevertheless, the next day the years of acquired muscle memory would take over: I would be thrown back to my ‘own’ style. My hands seemed to have no incentive to touch type, even though I really wanted to consciously. Had I only taken that typing class when I had the chance, then I would be better off today, or … perhaps not?

The famous QWERTY layout, referring to the six top left keys on most standard keyboards, is not the only way to arrange the keys. Firstly, there are many small variations such as AZERTY, common in Belgium, and QWERTZ, common in Germany. Secondly, there are alternative keyboard layouts such as Colemak, Workman and Dvorak. Of these alternatives, Dvorak has been around the longest, since the 1930s, and is also an official ANSI standard. The story behind QWERTY and Dvorak, both developed for typewriters, is interesting in its own right and explained very well in the Dvorak zine.

The standardized simplified Dvorak layout is much less random than the QWERTY layout: it notably places the vowels on the left side of the keyboard and often-used consonants on the right:

[Figure: The simplified Dvorak layout]

Several years ago I tried switching to Dvorak cold turkey. I relabeled all my keys and forced myself to type using the Dvorak layout. It was a disaster. I would constantly hit the wrong keys and my typing slowed nearly to a grinding halt. I would spend fifteen minutes typing an e-mail that I could previously write in under a minute. Frustrated, I stopped after three days.

Fast forward to several months ago. I caught a bit of a summer flu and although I was recovering I could not really think straight. Since learning a new keyboard layout is rather mechanical and repetitious in nature, I figured the timing was right to have another stab at this. My main motivation was to increase typing comfort and reduce hand fatigue. Secondary motivations included a load balancing better suited to my hands, reducing the number of typing errors and being able to reach a higher sustained typing speed. Finally, I also picked this up as a challenge: it is good to force your brain to rewire things every once in a while. I had wanted to switch layouts for these reasons for quite a while, and this time I decided I would go about it the ‘right’ way.

Firstly, I had to choose a layout. Hence, I determined the following criteria:

  1. Since my left hand is a bit weaker I should opt for a right hand dominant layout, meaning one that utilizes the right hand to control more keys than the left in terms of both count and striking frequency.
  2. The layout should differ sufficiently from QWERTY, as to prevent me from relapsing into my ‘own’ typing style.
  3. As I do a fair bit of software development, the layout should be programming friendly.

Based on these criteria I chose the Programmer Dvorak layout. This layout is similar to simplified Dvorak, but has a different number row. It looks like this:

[Figure: The Programmer Dvorak layout]

The main difference between this Dvorak layout and the simplified layout shown previously is that the number row is entirely different. Instead of numbers, the keys on the number row contain many characters that are often used in source code, such as parentheses and curly braces. To enter numbers the shift key needs to be pressed. This sounds cumbersome, but it makes sense if you count how many times you actually enter numbers using the number row. The numeric pad on the keyboard is much better suited to batch entry of numbers.

Awkwardly, the numbers are not laid out in a linear progression. Rather, the odd numbers appear on the left side and the even numbers on the right. This can be quite confusing at first, but interestingly it was also how the numbers were arranged on the original, non-simplified, version of Dvorak. So there is some statistical basis for doing so.

If you are considering alternative keyboard layouts you should know that Dvorak and Colemak are the two most popular ones. Dvorak is said to ‘alternate’, as the left and right hand mostly alternate when pressing keys, whereas Colemak is said to ‘roll’, because adjacent fingers mostly strike keys in succession. One of the main reasons that Colemak is preferred by some is that it does not radically change the location of most keys with respect to QWERTY and, as a result, keeps several common keyboard shortcuts, particularly those for copy, cut and paste, in the same positions. This means that those shortcuts can still be operated with one hand. As I am an Emacs user, used to typing three or four key chords to do comparatively trivial things – more on that later – this was not really an argument for me.

I also read that the way in which you more easily roll your fingers can help with making the choice between Dvorak and Colemak. I think this is conjecture and I have no good rational explanation for it, but perhaps it helps you: tap your fingers in sequence on a flat surface. First from the outside in, striking the surface with your pinky first and rolling off until you end with your thumb. Then do it from the inside out, striking with your thumb first and rolling back to your pinky. If the inward roll feels more natural then Dvorak is likely a better choice for you, whereas if the outward roll feels better, Colemak may be the better choice. Again, this is conjecture; interpret it as you wish.

Whichever alternative layout you choose: anything other than QWERTY, or a close variant thereof, will generally be an improvement in terms of typing effort. Dvorak cuts effort by about a third with respect to QWERTY. This means that entering a hundred characters using QWERTY feels the same as entering about sixty-six characters in Dvorak in terms of the strain on your hands. If your job requires typing all day, that difference is huge. Even more so if you factor in that the number of typing errors is usually halved when you use an alternative layout, due to the more sensible and less error-prone arrangement of the keys. Most alternative layouts are as good as Dvorak or better, depending on the characteristics of the text that you type. Different layouts can be easily compared here.

Now that I had chosen a layout, it was time to practice, so I set some simple rules:

  1. Practice the new layout daily for at least half an hour using on-line training tools.
  2. Do not switch layouts completely, rather keep using QWERTY as primary layout until you are confident you can switch effectively.
  3. Train on all three different keyboards that you regularly use. Do not buy any new physical keyboard, do not relabel keys, but simply switch between layouts in software.
  4. Focus on accuracy and not on speed.

Before starting I measured my raw QWERTY typing speed, which hovered around ninety words per minute sustained and about a hundred words per minute as top speed. Unfortunately, raw typing speed is a bit of a deceptive measure, as it does not factor in errors. Hitting backspace and then retyping what you intended to type contributes to your overall speed, yet it does not contribute at all to your effectiveness. So it is the effective typing speed which is of interest: how fast you type what you actually intended to type. Effective typing speed is a reasonable proxy for typing proficiency. My effective QWERTY typing speed was a bit lower than the raw speed, by about five to ten percent. This gives a sustained speed of eighty to eighty-five words per minute and a top speed of around ninety-five words per minute.
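As a toy illustration of the difference between the two measures (the error-overhead figure below is an assumed value, not a measurement):

    # Toy model: corrections eat into the raw speed, so effective speed is
    # roughly the raw speed minus the fraction of typing spent on errors.
    raw_wpm = 90            # measured raw sustained speed (words per minute)
    error_overhead = 0.07   # assumed fraction of typing lost to errors and fixes

    effective_wpm = raw_wpm * (1 - error_overhead)
    print(round(effective_wpm))   # ~84 words per minute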

As I started with my daily Dvorak training sessions, I also started seeing a decrease in my effective QWERTY typing speed. My fingers started tripping up over simple words and key combinations, even though I still used my ‘own’ typing style for QWERTY, and touch typed only in Dvorak. The effect was subtle, but noticeable, lowering my effective QWERTY speed by about ten to fifteen percent. I deemed this acceptable, so I persevered, but it does show that using two keyboard layouts definitely messes up muscle memory. I think this effect can be mitigated to some extent by using specific layouts on specific keyboards, but I did not test this, as I would be breaking my own rules.

The first sessions in Dvorak were slow, with effective speeds of about five to ten words per minute. In fact the first days were highly demotivating; it felt like learning to walk or ride a bike from scratch again. I started out with my fingers on the home row and consciously moved my fingers into position. That process took a lot of concentration; you can think of it as talking by spelling out each word. Furthermore, every time I hit a wrong key, my muscle memory would stare me in the face full of tears and proclaim it had triggered the right motion. It did … just not for this new layout I was learning.

So, what did I use to train? I started out using a site called 10fastfingers, but I found it a bit cumbersome and it did not have a lot of variance. In the end, I can really recommend only two sites, namely learn.dvorak.nl and keybr.com. The latter has the nice property that it adapts the lessons to your proficiency level and is quite effective for improving weak keys. /r/dvorak is also good for inspiration and tips.

Some other basic tips: start typing e-mails and chats with your new layout before making a complete switch, as it will give you some training in thinking and typing, rather than just copying text. Furthermore, switching the keyboard layout of your smartphone may help as well: not for efficiency, as Dvorak is really a two-handed layout, but for memorization. Dvorak is not really designed for phones and other layouts may be better suited; I have not looked deeply into this, as I generally dislike using phones for entering text, so it does not seem worth the trouble of optimizing. I do not recommend switching the keys on your computer keyboard, or relabeling them, as doing so will tempt you to look at the keyboard as you type, which will slow you down. It is better to type ‘blind’.

It took some discipline to keep at it the first few days, but after about a week or two I was able to type at an average speed of about twenty-five words per minute. Still not even a third of my original QWERTY speed, but there was definitely improvement. After this there was a bit of a plateau. I spent more time on the combinations and key sequences that were problematic, which helped. Six weeks in I was able to type with an average speed of around forty words per minute. Since this was half of my QWERTY speed, I deemed it was time to switch to Programmer Dvorak completely.

In contrast with my previous attempt several years ago, this time the switch was not a frustrating experience. The rate of learning increased as my muscle memory no longer had to deal with two layouts. Typing became increasingly unconscious. The only things that remained difficult were special characters and numbers, for the sole reason that these do not appear often and thus learning them is slower.

Currently I am about ten weeks in. I did not use the same training tools during that entire time, but I do have data from the last eight weeks. Let us first take a look at the average typing speed:

[Figure: Average smoothed typing speed]

The graph shows two lines spanning a time of eight weeks, a green one which shows the raw speed and a purple one that shows the effective speed. You can see that both speeds go up over time and the lines are converging, which implies the error rate is going down. My average speed is currently around seventy words per minute, which is close to my original QWERTY speed.

We can also look at the non-smoothed data, which gives a feeling for the top speed. In the second graph, shown below, we see that the top speed is about a hundred words per minute, which is actually about the same as my QWERTY top speed.

[Figure: Raw typing speed]

There is still quite a bit of variation, as is to be expected: not every character sequence can be entered at a high speed and some keys have a higher error rate than others. Most errors are mechanical in nature, which means: simply hitting the wrong key. This is particularly prevalent when the same finger needs to move to press a subsequent key; for example, for the word ‘pike’ one finger needs to move three times to hit the first three letters. More generally, my slowest keys are the Q, J and Z and the keys with the highest error rate are the K, X and Z. Luckily these are not high-frequency keys, and they are also underrepresented during training, so over time the errors will likely decrease and the speed will increase for these keys.

With respect to my original goals: firstly, I can say that typing in Dvorak is more comfortable than QWERTY, particularly at higher speeds my fingers feel much less jumbled up. The hand alternation is very pleasant, though it took some time for my hands to get synchronized. Secondly, in terms of speed: after about ten weeks I am very close to my QWERTY speed, which is great. It shows that switching layouts is possible, even though it takes effort and discipline to do so. It was frustrating at first, but I feel that it was a good opportunity to purge many bad typing habits that had accumulated over the years.

There are also some downsides. The main one is that typing QWERTY is slow for me now, and that will likely continue to deteriorate. I do not see this as a major issue, as I do about ninety-nine percent of my typing on my own machines. For the other one percent, it is possible to switch layouts on each and every computer out there. Some people may dislike the moving of keyboard shortcuts, and that can really be an issue, but for the most part it is just a matter of getting used to it. As an Emacs user, I took the opportunity to switch to the ergoemacs layout, which I can recommend. It significantly reduces the number and length of chords: combinations of keys that need to be pressed together, and is also more compatible with more broadly adopted shortcuts.

Do I recommend that you switch to Dvorak, or another alternative layout? That really depends on how frequently you type. If you type rarely, switching may not be worth the effort. However, if you have to type a lot every day then I think it is worth it purely for the increase in typing comfort. The only argument against this is if you often need to switch computers and you cannot easily change the keyboard layout on those machines.

Dvorak definitely feels a lot more natural than QWERTY, and so will most other more optimal layouts. I am relieved I never took a touch typing course. It would have taken much more effort to unlearn touch typing QWERTY if I had. Thanks to not doing that I have been able to learn and become proficient using a layout suited for my hands in just ten weeks. So, if you type frequently, are willing to make the jump and have enough discipline to get through the initially steep learning curve, then I can definitely recommend it. Even just for the challenge.

The Origins of Copyright

[Figure: A printing press]

The most profound characteristic of the networked era we live in today is that ‘ordinary’ people can easily create and distribute their own content. Sites like YouTube and Vimeo are more than Internet versions of America’s Funniest Home Videos. Indeed, they enable budding filmmakers to showcase their works, singers to attract their own following, and artists to use the screens in people’s homes as their canvas. Most importantly, this direct form of broadcasting eliminates the need for a slew of middlemen, moderation and the accompanying politics. It does not end there, writers can publish their novels as e-books, journalists can report the news by using web logs, and everyone can share their day-to-day activities via social networks like Facebook, Twitter or Google+.

Never before have there existed so many options to create and distribute one’s own content. Not all of it is of high quality, nor does it really need to be, as the ability to publish taps into the basic human needs of self-expression and sharing. However, there are limits, because whatever anyone creates, there is one thing that protects every creative expression: copyright.

The origin of copyright stems from a very old invention: the printing press. Prior to this, literacy rates were fairly low and reproducing texts was labor intensive: the manual letter-by-letter copying of texts was performed by educated monks in monasteries. The mechanical printing press changed everything, as it made it significantly easier to produce exact duplicates of a text. And this is exactly where the tension between authors of original works, publishers/distributors and consumers originated.

An unprotected work, one that is in the public domain, can be reproduced without the original author’s consent. Indeed, one may even take a work in the public domain, change nothing and republish it under one’s own credentials. There are no limits to how works in the public domain can be used and there is no legal protection for them. Luckily, a work is not in the public domain by default. Anything you create is at least protected by copyright automatically, with some minor exceptions. However, as we will see later, this is also a double-edged sword.

As Europe became more literate, the demand for books increased and concerns grew over the monopoly of the large printing companies. After all, they could republish any work without consent, and profit solely from the act of printing without compensating the original author. To remedy this, copying restrictions were introduced, first through self-regulation at the industry level, and later through government-enforced copyright law.

The duration of copyright is limited, bearing some resemblance to the patent system. After you have created an original work, you have a set time to profit from it by displaying, selling or transmitting it. You also have the exclusive right to produce derivative works and, naturally, to copy it. The (expected) return from all these rights is intended to cover the cost, in your own labor and other resources, of producing the work. This incentive may be an explanation for the proliferation of creative works over the last couple of centuries. After the copyright expires, your work enters the public domain. So, for how long is a work covered by copyright? Initially, this was about 14 to 28 years, but most countries now use the lifespan of the author plus 50 to 70 (!) years.

The expiration of copyright and the existence of works in the public domain give rise to a somewhat odd dichotomy. One can produce a derivative of a work in the public domain, and then produce a new original copyright-protected work from it. Do you know of anyone who has done this? I bet you do, since one of the most famous examples is Walt Disney.

This brief clip shows that Disney relied on many existing works. He created original derivative works by making modern adaptations of them. In some sense this is exactly how human culture works: you build upon the works of others. Whether those derivative works should then be fiercely protected by a profit-driven entertainment company is another interesting debate.

In some way copyright seems entirely fair. After all, it provides an incentive for creating new works, and enables the author to make a living. However, there are some consequences of copyright law which have nothing to do with such lofty goals. These affect both the ownership and duration of copyright.

It is common to transfer ownership from the original author to some other entity, usually a publishing company. For example, many scientists have to transfer copyright of the final manuscript version of a publication to the publishing entity, like ACM or Springer. As scientific articles are usually paywalled, this leads to the odd construction that an author may have to pay to be able to view his original work in its final published form, although commonly, and commonsensically, such access is provided free of charge. Open source projects are another example: some require the author to sign over copyright to the project itself. One reason to do this is to prevent the situation where all copyright holders have to agree to a future license change; another is a stronger legal position. These examples show that copyright does not necessarily remain with the original creators, but rather is transferred based on either direct monetary gain or legal convenience.

I think the legal convenience argument is somewhat weak: indeed, it can be better to have a slew of copyright holders involved, so that when a license change is proposed they can all democratically vote on it. With respect to monetary gain, science is the odd one out. Scientists produce their articles, and then publish them for free at conferences or in journals. Yet, the publishers are the only ones profiting in this system. Indeed, through scientific grants, the taxpayer indirectly funds those large publishing companies when scientists publish their works. However, this is just one route through which the public sponsors those companies: universities pay licensing costs for access to repositories of publications and printed journals, which is also public money. While the publishing companies do add value, by editorial means, providing infrastructure and promotion as well as actual printed works, it is doubtful whether such a cozy money-sandwich is justified for this rather limited contribution. Contrast this with the early days of printing, where the publishers were also responsible for the actual printing and brought to the table considerable knowledge and craftsmanship concerning typesetting and replication. These tasks have quietly shifted to dedicated printing companies and the authors themselves, weakening the role of traditional publishers.

The situation is different outside of science. Most artists get some form of compensation when they sign over their copyright. This can either be some fixed amount or a share of the profit. However, the lion’s share still goes to the publishers and distributors themselves. This also raises the interesting question: why should copyright last longer than the author’s lifetime? Who owns the rights, and the profits, after the author has died? In some cases these may be the heirs of the author; for example, the Tolkien Estate holds the rights to The Lord of the Rings, which makes some sense. However, this is not always the case. For example, who owns the music of the Beatles (two of whom are still alive)? That would be Sony Music. So, they can profit from the Beatles’ catalog, consisting of over 250 songs, until probably the end of this century. Wait a minute, I thought the point of copyright was to fairly compensate the authors, so that they could make a living from what they had created? In fairness, the Beatles and their estates still receive some royalties (though most profits now go to Sony Music). However, the effort it took to create those original works has already been compensated many times over by this point, and will have been even more so by the turn of the century.

This clip illustrates some of the points made thus far:

So far, we have learned about the origin of copyright and seen the deviation between the original goal of copyright: protecting the individual authors, and the present-day situation: publishers and large corporations owning vast amounts of intellectual property. However, computers and the Internet are changing the game. In a follow-up article, to appear soon, we will look at copyright in the digital age.

Designing a Thesis

A while ago I spent quite some time researching the best options for designing my thesis. I used ideas from various sources, and in this brief article I will explain some of the choices I made, which will hopefully be useful for those who still need to complete their own thesis. Many of these tips are part of the excellent classic thesis style.

1) Tools
Before you start writing, you should pick your tools. I used LyX, a LaTeX front-end, to typeset my thesis. I have been using it for years, also for my publications, and have grown used to it. It’s stable, and easy to use for beginners. Unlike with plain LaTeX, you don’t have to spend a lot of time memorizing arcane commands, which really is unnecessary anyway in a time where graphical user interfaces dominate. Of course, you still have the power of LaTeX underneath, which is nice, especially for more sophisticated typesetting tweaks.

2) Fonts
With regard to the document content, one of the first choices that you should make is that of the fonts you want to use. Although a particular font is never really right or wrong, LaTeX shields you from making really bad choices here, unlike for example Microsoft Word. There are three categories of fonts you will need to choose: a serif font, a sans-serif font and a mono-space font.

[Figure: Font samples]

The serif font, sometimes termed roman font, is the most important. You probably remember the lined paper on which you learned to write. Those lines were not only there to force you to write on them, but also to guide your eyes. A serif font has subtle strokes on each character: when you view a page with serif text from a distance, and squint your eyes a bit, you will see that these strokes form ‘virtual’ lines as well. Hence, serifs aid reading by preventing blocks of text from looking ‘wobbly’. This is the reason why most running text is usually set in a serif font. However, times are changing, and more blocks of text are being set in sans-serif these days.

The sans-serif font is, as the name implies, without serifs. Such a font is often used in glossy magazines, and has a cleaner, less cluttered, look. This also makes it more suitable for computer screens, because these typically have a lower display resolution, which can make serifs look ugly. In a thesis, it is used primarily for chapter and section titles. Alternatively, you can choose to use a serif font with ‘small caps’ for titles. This gives a more classical look, whereas sans-serif fonts give your text a more modern feel. Sans-serifs should generally not be used for large blocks of text, unless you really want that and know what you are doing.

The mono-space, or typewriter, font is normally used in places where each character needs to take up the same amount of space, for readability. The best example is a listing of computer code. However, mono-spaced text is not as comfortable to read as text set in a serif or sans-serif font. Since the space each character occupies needs to be equal, visual readability aids, like ligatures and kerning, can not be used. Use mono-spaced text conservatively.

My choices were Palatino as serif, using small caps for titles, and Bera Mono for mono-space. A good overview of fonts for LaTeX can be found here. Make sure that, besides the font sizes, you choose other settings optimally for the fonts you pick, for example: for Palatino a slightly higher line-spread is better for readability, and Bera Mono needs to be scaled down in order to properly complement Palatino. Also, for the PDF output consider using microtype, which allows you to fine-tune settings such as protrusion and expansion.
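As an illustration, here is a minimal LaTeX preamble sketch for roughly these choices; the package names, scaling factor and line spread shown are typical example values rather than a prescription, so adjust them to taste.

    % Palatino for running text (with real small caps), Bera Mono scaled down
    % to match, a slightly looser line spread, and microtype for the PDF output.
    \usepackage[sc]{mathpazo}
    \usepackage[scaled=0.85]{beramono}
    \usepackage[T1]{fontenc}
    \linespread{1.05}
    \usepackage{microtype}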

As a final font tip: try to avoid using bold text where you can, particularly in running text, as this draws unnecessary visual attention in printed matter. Bold text in print is the typographic equivalent of a ‘blinking’ element on a web page: annoying. If you want to emphasize something use italic instead. Bold text is okay for titles, but I’d avoid it even there if possible.

3) Page Lay-out
Consult with your printer to see what type of output they want to have. It is common in the Netherlands to print a thesis in B5. While standard B5 is 176mm x 250mm, some printers use variants of B5 with slightly different dimensions. Since you are making a book: make sure to select a double-sided lay-out and ask what the binding correction should be: this is an offset that pushes the center of the page content slightly to the left for left pages, and slightly to the right for right pages, which results in optically centered pages after they are bound. Also double-check the page margins. LaTeX chooses very generous margins by default, which you may want to reduce in order to use the available space more effectively.
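A sketch of such a set-up, assuming the geometry package; the measurements below are placeholders that you should replace with whatever your printer specifies.

    % Double-sided B5 lay-out with a binding correction that pushes the page
    % content away from the spine.
    \documentclass[twoside]{book}
    \usepackage[b5paper, bindingoffset=7mm,
                inner=20mm, outer=20mm,
                top=25mm, bottom=25mm]{geometry}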

Another point of attention with respect to page lay-out is where in the text a figure or table is mentioned, and where it really ends up in the document. LaTeX has a number of placement rules for this, which can be overridden. A good automatic result is usually obtained by placing the text that refers to a figure or table directly ‘below’ it in your TeX file, but in some cases you may need to override this placement. Keep in mind that you are working with a double-sided lay-out, which gives a bit more placement freedom, as readers always see two facing pages of your document at the same time.

Try to keep the number of color pages you have as low as you can, as these are expensive when you print your thesis. Restrict it to graphs or pages where a strong visual aesthetic matters.

4) Table of Contents
As a general rule: do not include more than two ‘levels’ in your table of contents: chapter and section. So, no subsections or subsubsections. Besides, if you need numbered subsubsections, you may want to consider restructuring your text entirely: perhaps the parent section should be a separate chapter instead.

There is a fair number of people that align all the titles to the left and the page numbers to the right. It doesn’t really make sense to do this (what are you going to do: add up the page numbers? really?). The page numbers are there to help the reader, and hence should be placed directly behind the titles. This also removes the need for the visual horror of thick or dotted horizontal lines in the table of contents. Finally, consider adding the bibliography as an item in the table of contents, since this is quite an important part of a scientific work.

[Figure: Table of contents example]
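In LaTeX terms this boils down to something like the sketch below; the tocbibind package is just one way to get the bibliography listed, other routes exist.

    % Show only chapters and sections in the table of contents, and list the
    % bibliography itself as an entry.
    \setcounter{tocdepth}{1}          % 1 = chapters and sections only
    \usepackage[nottoc]{tocbibind}    % add the bibliography, but not the ToC itself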

5) Chapter Openings
The convention is to put chapter opening pages on the right side of your book. Take special care of the opening page of each chapter. For LaTeX, there are many packages that can help you make these look more visually appealing, such as fncychap. The main rule is: keep it simple, less is more. Some people use a separate page for the start of new chapters, with only the heading, which can also be quite visually pleasing.

6) Tables
This is probably the most often abused visual element in any document. There are two important things to keep in mind. Firstly, use tables only for listing things structurally, that’s what they are for. In all other cases: use figures. Secondly, please do not use vertical lines in your table; if you feel you need vertical lines: it’s not a table, it’s a figure. As a small visual test: create a table in your favorite word processor or spreadsheet and experiment with how it looks with only horizontal lines versus both horizontal and vertical lines. You will find that using only horizontal lines makes the table easier to read. Even when using only horizontal lines: use as few as possible, and focus instead on properly aligning the data that you are presenting, which alleviates the need for lines in many cases. As a general rule you should use a line above and below the first row, and below the last row. The outside lines may be slightly bolder than the other lines. Tables may look better when they span the entire page width, but this depends on the content.

[Figure: Table examples]
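As a minimal LaTeX example of such a table, assuming the booktabs package (which provides properly weighted horizontal rules); the rows and numbers are placeholder data.

    % Horizontal rules only: heavier rules at the top and bottom, a lighter
    % one below the header row, and no vertical lines at all.
    % Requires \usepackage{booktabs} in the preamble.
    \begin{tabular}{lrr}
      \toprule
      Method   & Precision & Recall \\
      \midrule
      Baseline & 0.72      & 0.65   \\
      Proposed & 0.81      & 0.74   \\
      \bottomrule
    \end{tabular}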

7) Figures
When you include any figure in your document, really any figure, use a scalable (vector) format where possible. In LaTeX the most obvious choice for this is Encapsulated PostScript (EPS) files. If you must include a non-scalable (raster) image, avoid lossy formats like JPEG (use PNG instead). Also: include high resolution images. The reason for all this? Many theses include low resolution non-scalable graphics. Unfortunately, this looks horrible when printed: blocky and pixelated. Either your graphics need to have a higher resolution than the resolution used for the print (typically 300ppi), or you need to use scalable (vector) graphics (which look optimal regardless of the printer’s ppi). A nice, but costly, way to convert raster to vector is to take the raster image and draw over it with a vector graphics tool.

8) Captions
Whether you use hanging or non-hanging captions for your tables and figures is a personal choice, same goes for bold caption text or not (I’d personally try to avoid that). By convention, captions for tables are always placed above the table, and captions for figures are always placed below the figure. Try to keep your table captions as short as possible (avoid multi-line captions if you can). It’s visually nicer to have more elaborate text below the element you are presenting. Hence, for figures this type of text can go directly in the caption.

I hope these tips will help you design a better thesis in the short-run and help you produce more visually appealing texts in the future.

How long would it take to read Wikipedia?

Wikipedia has become the de facto encyclopedia on the Internet. A traditional encyclopedia spans many textbook volumes which would take any normal person ages to read. Few people would likely engage in such an endeavor. However, since Wikipedia is readily accessible: should you take up the challenge?

Wikipedia is continuously being changed and updated. Consequently, reading all of it would take an infinite amount of time: by the time you had finished reading, you would have to go back to re-read the changed articles. Hence, we need to change our initial question to: how long would it take to read a snapshot of today’s Wikipedia? But we aren’t there yet: there’s one more thing left to specify. Since Wikipedia is multi-lingual, we have to pick a language. While some articles are translated from English into various languages, there are quite a few that have no English counterpart and are language, culture or even region specific. Let’s focus only on the English portion of Wikipedia.

The English Wikipedia consists of about 4 million articles, counting only the content pages. That’s roughly an estimated 3000 volumes of Encyclopædia Britannica. It’s heavily consulted, as this collection of articles gets viewed almost 3000 times every second. However, let’s not get distracted: our goal is to find out how long it would take to read Wikipedia. For this we first have to find out how fast people read. Although this greatly varies from person to person, and some Wikipedia articles may be more difficult to read than others, we will have to pick a reasonable value. It turns out that the average American adult can read about 300 words per minute.

So, how many words does Wikipedia consist of? Although no recent exact numbers are available, a quick extrapolation reveals that all articles combined form about 2500 million words, that’s about 625 words per article on average. Each article would take a little over two minutes to read for an average adult. Reading all of Wikipedia would take about 140 000 hours, which is 5800 days, or almost sixteen years. That’s assuming that you’re reading 24 hours a day! Reading a standard eight-hour workday seems more reasonable. In that case it would take you thrice as long: 48 years, but you’d still have time to do other things, and most importantly: time to sleep.
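The arithmetic behind these numbers, as a small sketch; the inputs are simply the rough figures quoted above.

    # Rough reading-time estimate for the English Wikipedia.
    words = 2_500_000_000   # about 2500 million words in total
    speed = 300             # words per minute for an average adult reader

    hours = words / speed / 60
    print(round(hours))                 # ~138889 hours
    print(round(hours / 24 / 365, 1))   # ~15.9 years, reading around the clock
    print(round(hours / 8 / 365, 1))    # ~47.6 years at eight hours per day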

Thus, right now it would (still) be possible to read Wikipedia in a lifetime: if you start reading when you turn 18, you would reach about pension age (66) by the time you finished. Nevertheless, Wikipedia continues to grow, albeit at a slower rate than it used to: the word count is increasing by about two percent per month. Hence, if you still want to actually do this: I suggest you start reading now. However, perhaps your time is better spent some other way 🙂

Source: Wikipedia Statistics