Sep 21, 2010 Science
I recently came across a query on a music software mailing list about bit depth: what exactly it means, and why you would choose anything other than 16 bits (given that CDs are mastered at 44.1kHz, 16 bit). I wrote a long, disorganized reply there, and then it occurred to me that I should take those remarks and bundle something up for you folks here.
The first important thing is to understand what bit depth means. Bit depth and sampling rate are actually two related terms, on two different axes, and as such, they’re best expressed with a graph, but before the graph will be meaningful, we should talk a little bit about digital recording.
Sound in the real world is analog. What analog means, in terms of analog vs. digital, is that it’s continuous. If you take a ruler, it has a bunch of markings on it. Let’s say it’s marked only in centimetres, so it has markings at 1cm, 2cm, 3cm, etc. The ruler doesn’t cease to exist between those markings, however, nor does real-world distance; there’s still a defined stretch of ruler between them. Now, you could make more markings so that you had a 1.5cm marking. But if you zoomed in between the 1cm and the 1.5cm markings, there’s still ruler between them. You could make more markings, and if you zoomed in again, there would be ruler between them, and so on and so on down the line. Eventually, since a ruler is a physical object made of atoms, you’d get to the point where you’d have discrete atoms. But there would still be distance between the atoms. And that’s where analog vs. digital comes into play. Digital is like the ruler — if you zoom down deeply enough, there’s a point where there is space between the units. Analog is like the distance — no matter how far you zoom in, there’s always more distance in between. I hope that makes some sense.
Now, sound is a form of energy, which gets transmitted to your eardrums. (It moves as a wave of pressure: imagine a string moving back and forth, pushing the air back and forth as it moves, those compressions and rarefactions being transmitted through the air to your eardrum.) When we measure sound, it really has only one dimension — the amount of energy. All of the things we hear in a sound — its colour, timbre, frequency, tone, etc. — are built out of that one energy measure and how it changes over time.
So when we make a digital recording, we can plot a representation of a given sound as a measure of its energy over time. The most common representation puts energy level on the vertical axis and time on the horizontal axis. A typical sound displayed this way looks something like this:
For our purposes, however, I’ll be dealing with a much simpler graph. All the same principles apply to both. We’re going to talk about a sound graph that looks like this:
Now, I said that digital recording was like the ruler, where eventually you reach the atomic level and there’s a discrete bit of ruler, and then empty space before the next bit. That’s exactly what digital means — discrete elements. If I have a digital scale that represents only whole numbers, then there is a value for 1, 2, 3, 4, 5, but no value for anything in between them. So if I were to try to plot the graph above in a digital system, it might look something like this:
(Pedant note: Both the graph above and this one are fundamentally digital because the computers and monitors we’re using to make them are digital. Unless I go over to your house and draw a line on your monitor with a Sharpie, I can’t really show an analog graph to you. A certain suspension of disbelief is required here.)
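(If it helps to see that rounding in code, here’s a tiny Python sketch of the same idea. The sine wave and the 0-to-5 whole-number scale are arbitrary stand-ins I picked for illustration, not anything taken from the graphs.)

```python
import math

# A smoothly varying "analog" value, forced onto a whole-number scale:
# round() discards everything between the integer steps, which is
# exactly where the stair-stepping in the graph comes from.
for i in range(9):
    t = i / 8.0                                     # time from 0.0 to 1.0
    analog = 2.5 * math.sin(2 * math.pi * t) + 2.5  # smooth value, 0 to 5
    digital = round(analog)                         # nearest whole number
    print(f"t={t:.3f}  analog={analog:.3f}  digital={digital}")
```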
You can’t have that smooth line from our original, because those smooth, continuous values between each step don’t exist. Now, this might seem like a pretty poor way to represent sound, and many audiophiles would agree with you on that, but for the most part there seems to be a point at which we can’t really tell the difference between the straight line and the stair-stepped line, and for most of us, the industry standard known commonly as “CD Quality” meets that criterion. CD Quality is actually 44.1kHz, 16 bit, stereo. What does that mean in terms of our graph? Well, the first part is easy — 44.1kHz means that the graph is divided into 44,100 “samples” (discrete measurements) per second. That’s the horizontal axis, and I can’t draw you a picture of it, because 44,100 columns is probably well over 40 times the number of pixels across your browser’s document window. It’s quite fine. Now, “stereo” is also easy to explain — it just means that there are two separate 44.1kHz, 16 bit recordings on the disc, one for the left speaker and one for the right speaker.
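(For a sense of how much data that is, here’s the back-of-the-envelope arithmetic in Python:)

```python
SAMPLE_RATE = 44_100  # samples per second per channel
BIT_DEPTH = 16        # bits per sample
CHANNELS = 2          # stereo: left and right

bytes_per_second = SAMPLE_RATE * BIT_DEPTH * CHANNELS // 8
print(f"{bytes_per_second:,} bytes per second")                  # 176,400
print(f"{bytes_per_second * 60 / 1_000_000:.1f} MB per minute")  # ~10.6
```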
To understand what 16 bit means, you need to know a little about binary numbers, which is how computers store things. Binary numbers are numbers broken down into just ones and zeros, as I’m sure all of you know — you’ve seen the infamous “10100101001010101101001010101…” representation of binary in jokes about robots and whatever. The bit depth controls how many ones and zeros are used to represent each of the samples mentioned above. So the phrase “16 bit” means that sixteen digits, each of which can be a one or a zero, are used to represent each sample; i.e., at each moment in time we measure the amount of energy in the sound and store it as a sixteen-digit binary value.
What’s the implication of that? Well, with a single binary digit, you can store two values, 0 or 1. This lets you count from 0 to 1, obviously. With two binary digits, you can store four values, 00, 01, 10, 11. This lets us count to 3, since we can assign these representations the meanings 0, 1, 2 and 3. Each time we add a digit, the number of values that we can store doubles (which makes sense, since we take all the values we could previously represent, and for each we add the option of having a 1 or a 0 prefixed to it).
That may not sound like a lot, but the progression moves pretty fast. The number of possible values at each step of the way (subtract one to get the maximum number we can represent, given that we’re counting from zero) is:
1 bit: 2
2 bit: 4
3 bit: 8
4 bit: 16
5 bit: 32
6 bit: 64
7 bit: 128
8 bit: 256
9 bit: 512
10 bit: 1024
11 bit: 2048
12 bit: 4096
13 bit: 8192
14 bit: 16384
15 bit: 32768
16 bit: 65536
(It’s worth noting that in actual audio practice, these aren’t typically used to denote, say, 0 to 65,535 in the 16 bit case, but rather -32,768 through +32,767.)
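Rather than memorizing the table, you can compute it; the doubling falls straight out of 2 ** bits. A quick Python sketch, including the signed convention just mentioned:

```python
# Values representable at each bit depth, plus the signed range
# convention commonly used for audio samples.
for bits in (1, 2, 3, 4, 8, 16, 20, 24):
    values = 2 ** bits
    lo, hi = -(values // 2), values // 2 - 1
    print(f"{bits:2d} bit: {values:>12,} values  (signed: {lo:,} to {hi:,})")
```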
So with 44.1kHz, 16 bit, stereo sound, or “CD quality”, to display a graph of one second of sound, you’d need two pictures (one per stereo channel), each of which would have 44,100 dots horizontally and 65,536 (see the table above) dots vertically. On the monitor that I’m typing on here, which has a horizontal dot resolution of about 107dpi and a vertical resolution of about 114dpi, that pair of pictures would (in total) measure almost 69 feet wide and 48 feet tall. So it’s a really fine measurement of sound. Most importantly, it’s a fine enough measurement that, just as we don’t see the ruler as a collection of discrete objects, we don’t hear the sound as a collection of discrete values and moments in time — instead we hear an indistinguishable representation of the original continuous sound.
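(If you’d like to check my arithmetic, here it is in Python; the dpi figures are just those of my monitor, as quoted above:)

```python
SAMPLES = 44_100         # one second of samples, horizontally
LEVELS = 2 ** 16         # 65,536 possible values, vertically
H_DPI, V_DPI = 107, 114  # horizontal and vertical dots per inch

width_ft = 2 * SAMPLES / H_DPI / 12   # two pictures side by side, in feet
height_ft = LEVELS / V_DPI / 12
print(f"{width_ft:.1f} ft wide by {height_ft:.1f} ft tall")  # 68.7 by 47.9
```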
Okay, that’s all the background material! Fortunately, from here it becomes easy to demonstrate the answer to the question: if your goal is to render sound in 16 bits, why would you ever work with the common higher bit depths (20 and 24)? (By extension this can be used to answer the related question of why you’d use higher sampling rates like 48kHz, 96kHz or 192kHz, although those also involve a little delving into Nyquist limits as a side topic, which explains why the possible rates in each scale are so different.)
Well, the answer to why you’d use 20 bit or 24 bit is that we tend to manipulate sound. If your goal were just to make a recording and press it, completely unprocessed, onto a CD, there would be no reason to use anything but 16 bit. In actuality, though, most of us run our audio through the wringer and back before it gets even close to a CD, and that poses a problem. This problem is fortunately very easy to illustrate with a couple of pictures.
First, let’s look at a graph of our line as presented in 2 bit sound (over one second at 8 Hz):
Now, let’s bump up the quality of our graph to 3 bit sound (over one second at 8 Hz):
It looks a lot better, right? And it will sound better, accordingly. But we’re talking here about a sound that moves smoothly from the minimum possible loudness (presumably silence) to the maximum possible loudness that our system can record without distortion. Realistically, a lot of sounds don’t neatly maximize their loudness that way. Let’s suppose that the sound had been half as loud. Here’s our 3 bit graph:
Now, suppose that we get that sound, and as is not uncommon, we immediately normalize it (which alters the sound file to scale all values such that the maximum value is at the maximum possible level). The result looks like this:
Notice anything about it? Yes, while it may not look exactly like the 2 bit graph above, it’s more or less functionally equivalent. That’s because the quieter recording used less of our available dynamic range, and thus was represented by fewer possible values.
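To make “fewer possible values” concrete, here’s a toy version of the whole experiment in Python. The quantize and normalize functions are stand-ins I wrote for this post, not anything from a real editor:

```python
def quantize(samples, bits):
    """Snap each 0.0-1.0 sample to the nearest of 2**bits levels."""
    levels = 2 ** bits - 1
    return [round(s * levels) / levels for s in samples]

def normalize(samples):
    """Scale so the loudest sample sits at the maximum possible level."""
    peak = max(samples)
    return [s / peak for s in samples]

ramp = [i / 8 for i in range(9)]  # our smooth line, from 0.0 up to 1.0
half = [s / 2 for s in ramp]      # the same sound at half volume

quiet = quantize(half, 3)             # recorded quietly at 3 bits...
loud = quantize(normalize(quiet), 3)  # ...then normalized
print(len(set(quiet)), len(set(loud)))  # 5 5: only 5 of 8 levels survive
```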
Now, if we’d recorded the half-volume sound initially in 4-bit audio, we would have gotten this:
Then when we normalized it, we would have gotten this:
While this may look slightly different from our 3 bit graph, it’s more or less functionally equivalent. If we then export it to 3 bit media, we have the best possible recording of it, despite the fact that we had to normalize it. In fact, if we thereafter “downsampled” it to 3 bit audio for output, the graph would be identical to our 3 bit, 8 Hz line plot above.
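In code, with the same toy helpers as before (redefined so this snippet stands alone), the extra bit makes the round trip lossless on this example:

```python
def quantize(samples, bits):
    levels = 2 ** bits - 1
    return [round(s * levels) / levels for s in samples]

def normalize(samples):
    peak = max(samples)
    return [s / peak for s in samples]

ramp = [i / 8 for i in range(9)]  # the full-volume line
half = [s / 2 for s in ramp]      # recorded at half volume

# Record at 4 bits, normalize, then reduce to 3 bits for output:
via_4bit = quantize(normalize(quantize(half, 4)), 3)
direct = quantize(ramp, 3)        # the ideal: 3 bits at full volume
print(via_4bit == direct)         # True, at least on this toy data
```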
Normalization is the easiest manipulation to explain the effects of graphically, but of course every processing step you take (and I take a lot) risks losing some data, so the more “extra” data you had to begin with, the more you can lose without degrading the apparent quality on the target media. And remember that while 20 bit or 24 bit may not sound like a lot of extra leeway for a project targeting a 16 bit destination, each added bit doubles the number of values available, so four or eight extra bits is a lot of room for mangling. Also, a good processing algorithm will take steps to lessen the damage — for example, by interpolating or smoothing values where gaps have been introduced.
Is there a downside, or should you just pick the highest bit depth you possibly can at all times?
Well, the more bits in a chunk of sound, the more horsepower it takes to process, the more memory it takes to hold the sound in active memory, and the larger the file when the sound is written to disk. So there’s a cost, and if you use large quantities of sound on a modest computer, that may be a significant issue for you. Also, you’ll want to be sure that all of the apps you use to process the sound can handle the format you chose, and 44.1kHz, 16 bit, stereo is handled by just about everything under the sun. Lastly, the fact that every processing step involves possible data loss is a double-edged sword, and while most apps will do their best to give you the best rendering at all times, if you really do plan to record direct to CD without any manipulation at all, then it’s quite possible that you’ll get the most accurate rendition by using one format end-to-end, which would be 44.1kHz, 16 bit, stereo.
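(To put rough numbers on the file-size cost, under the same assumptions as before:)

```python
SAMPLE_RATE, CHANNELS, SECONDS = 44_100, 2, 60

for bits in (16, 24, 32):
    mb = SAMPLE_RATE * CHANNELS * SECONDS * bits / 8 / 1_000_000
    print(f"{bits} bit: {mb:4.1f} MB per minute of stereo audio")
# 16 bit: 10.6 MB, 24 bit: 15.9 MB, 32 bit: 21.2 MB
```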
(As an aside, while many synthesizers, DAWs and other tools have in the past advertised having x bit internal processing, where x could be anywhere from 20 to 64, these days most software uses 32 or 64 bit internal math with floating point numbers, which work differently from the integer samples described here. A good tool will know when to use each format and will switch between them as needed. Also, playback settings only affect what gets sent to your speakers, not what winds up in the sound file. As such, in the modern era this post mostly applies to making choices about recording external audio sources.)
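(To illustrate the difference, in the common convention rather than any particular tool’s internals: integer samples have fixed-size steps, so quiet sounds get fewer of them, while a float keeps roughly the same number of significant digits at any volume:)

```python
import struct

x = 0.001234567  # a very quiet sample, well below full scale

# As a 16 bit integer it lands on step 40 of 32,767, so most of
# its precision is simply gone.
as_int16 = round(x * 32767)
print(as_int16, as_int16 / 32767)  # 40 0.001220...

# As a 32 bit float the mantissa still carries about seven
# significant digits, no matter how quiet the sample is.
as_float32 = struct.unpack("f", struct.pack("f", x))[0]
print(as_float32)                  # 0.001234567...
```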