Intuitions in Machine Learning: Understanding ROC Curves

Have you tried to teach yourself Machine Learning online, and got overwhelmed by the first flood of mathematical terms and equations?

Screen Shot 2017-04-15 at 2.21.05 PM
What you get when you look up ROC curves and Confusion Matrices

My goal in these articles is to give you an intuition for the math, by connecting the math to your personal experience. I’ll cover Machine Learning topics through concrete, accessible examples.

I’ll start the series by covering a simple topic: ROC curves*. In this article, I’ll explain what they are, how they work, and why and when to use them. In order to explain ROC curves we’ll also cover these related concepts: Confusion Matrices, sensitivity, and specificity.

(*ROC is short for “Receiver Operating Characteristic”. This term is a holdover from when it was used in Electrical Engineering, and you don’t need to understand it to use it now.)

The Prospector

Let’s start by creating an imagined scenario. You’re a gold prospector back in the Old West. These were entrepreneurs who explored the California frontier looking for gold during the gold rush.

This is you.

One morning you come upon a new field that no prospector has been in before. Scattered around the landscape you find a bunch of objects; some of them are rocks, some are gold. Your job is to collect as much of the gold as you can.

Your job is made tricky because it’s not easy to tell what’s gold and what’s not. Here are the types of objects you might find:

The four types of objects you might find in the field.
  1. Gold nuggets: Gold nuggets look like gold, and are, in fact, gold.
  2. Ore: These look like rocks, but they have gold hidden inside them. You may overlook these during your search.
  3. Fool’s Gold: Fool’s Gold (pyrite) is a mineral that looks a bit like gold, but it’s actually not worth much.
  4. Rocks: These are worthless rocks that, conveniently, also look like rocks.

Your job is to pick out the gold from the worthless rocks. You realize from the start that your decisions are sometimes going to be wrong. For instance, you might think an object is gold when it isn’t (fool’s gold), or you might guess some object is worthless when in fact it’s gold (ore).

Understanding the Confusion Matrix

Let’s look back at the Confusion Matrix in the image above and see if we can make sense of it. Using the four types of objects we’re going to find, we can start to better understand the boxes in the matrix. Starting with the group at the top left, let’s replace some of the labels in the boxes to give us a better sense of what they mean.

TrueFalse (1)
Top left box of the Confusion Matrix

Looking down the left hand side, we see that the rows are labeled “condition positive” and “condition negative”. These labels tell us what the objects actually are, regardless of what we think they are. In our scenario, “positive” means they’re gold, where “gold” includes both nuggets and ore. “Negative” means they’re worthless, which includes both fool’s gold and plain old rocks.

Let’s replace these labels so they make more sense:


The columns at the top are labeled “prediction positive” and “prediction negative”. As a prospector walking around the field, you make a guess as to what each object is. If you guess “positive”, in other words “gold”, then you would pick it up and put it in your wheelbarrow. If you guess “negative” i.e. “worthless”, you’ll leave it on the ground.

So let’s replace these labels with something that’s easier to understand:

TrueFalse (3).png

With these labels in place, we can fill in the rest of the boxes so they are more intuitive:

TrueFalse (4)

Let’s go through each of these:

  1. True Positives: These are the gold nuggets. As a prospector, you would pick up a nugget believing it was gold. When brought it home and checked it, you find that you were right. I tend to attach a feeling with each group to help me remember them. The feeling I attach with this group is: Win!
  2. False Positives: Fool’s gold (pyrite). You thought these items were gold, so you put them in your wheelbarrow, but it turned out they were worthless. The feeling associated here is disappointment, or a let-down.
  3. False Negatives: The ore. You thought these were worthless rocks, so you left them behind. It turned out there was gold hidden inside them. You wish you’d picked them up too. The feeling here is regret.
  4. True Negatives: Plain old rocks. You left them behind thinking they were worthless, and they were. The feeling I associate with this is apathy. You might also feel relief; if you were second-guessing yourself because you may have missed out on some gold, you might be relieved to know you didn’t.

Now we’ve got a better understanding of what’s going on, let’s go back and try to decipher our ROC curve.

Sensitivity and Fall Out

I still have no idea what’s going on here:


Let’s look at some of the words on the graph to see if we can figure them out. I see the word sensitivity down one side, and specificity (or 1 – specificity) along the bottom. Did I see these words anywhere before?


I saw them in the Confusion Matrix. The matrix says that sensitivity is the true positives divided by condition positives. This is the gold I have in my wheelbarrow, divided by all the gold I could have picked up.

Sensitivity :   My gold 
               All gold

To put it another way: “What percentage (%) of all the gold ended up in my wheelbarrow?”

canstockphoto28268917Sensitivity is a number between zero and one. Zero means I got 0% of all the gold, and one means I got 100%. Ideally my sensitivity should be 100%.

Why is this called “sensitivity”? I like to imagine it’s like the sensitivity of those metal detectors some people use on the beach to find valuable stuff. The more sensitive the metal detector is, the more gold it’ll find. 100% sensitivity means it’ll find 100% of the gold.

OK, so we’re done! We don’t need any other measurements, right? So long as I try to get 100% of the gold, I should achieve my goals?

Well, not exactly, because I can think of a way to get 100% of the gold that is not particularly smart.

What if I picked up every object that I saw? If I pick up everything that’s not bolted to the ground, I’m bound to get all the gold in the process. Granted, I’ll also pick up a lot of worthless stuff. Remember, my goal is to pick up only the gold, and leave the worthless rocks behind.

So I need to be more specific about what objects I pick up.

Sensitivity (1).png

Let’s look for the other word we saw, specificity. Back in the big matrix, we see that specificity is the true negatives divided by the condition negatives. The true negatives are the rocks that we left behind, and the condition negatives are all the worthless stuff we could have found.

Specificity :  rocks I left behind
               all worthless stuff

In other words: How much of the worthless stuff did I leave behind?

Why is specificity important?

As a prospector, you’re judged by how well you can separate the gold from the worthless rock. To be a great prospector you should be able to pick up all the gold, and at the same time leave the worthless rocks on the ground. This is why you want to have a high specificity as well as a high sensitivity.

The graph above was labeled 1 – specificity. This would be 100% of the worthless stuff minus the worthless stuff that I left behind. Since all the worthless stuff either got left behind, or ended up in my wheelbarrow, then 1 – specificity represents the worthless stuff I picked up and put in my wheelbarrow. This is a little confusing though.

Screen Shot 2017-05-01 at 9.16.28 PM
How much of the garbage did you take home?

If I look through the Confusion Matrix, I can find another word that also means “the worthless stuff that I put in my wheelbarrow”. It’s called fallout.

Fallout: How much of the worthless stuff did I pick up and put in my wheelbarrow?

Ideally this amount is 0%, or none of the worthless stuff. These two parts of the worthless pile add up to 100%, so when fallout is 0% then specificity is 100%. The higher the fallout, the lower the specificity, and vice versa.

At this point I still had difficulty wrapping my mind around these two ideas, specificity and fallout. It helped to think of them as the difference between a shotgun and a sniper’s rifle.

Screen Shot 2017-05-01 at 9.19.38 PM.pngA shotgun tends to spray pellets in a wide area in front of the gun. You’ll hit your target, but you’re also going to hit a lot of things around it that you didn’t intend to hit. The collateral damage, or fallout, of a shotgun is high. With a sniper’s rifle, you only hit the thing you intended to hit. It’s targeted; in other words, it’s specific.

How we make decisions

Let’s go back to our prospector. As a prospector, every time you see an object on the ground, you have to decide if you want to put it in your wheelbarrow or leave it on the ground. You’re going to make some mistakes, and you want your decision to be right as often as possible.

So how do you make your decisions? Something about the rock should tell you if it’s gold or not, even if your system isn’t perfect. Screen Shot 2017-04-10 at 7.24.51 PM

What’s the best way to tell if something is gold or not? Is it the size, the shape, the colour?


Let’s make a guess: let’s base it on “shininess”. I predict that the shinier something is, the more likely it is to be gold.

A shininess spectrum. Objects on the left are shinier than those on the right; but that does not necessarily mean they are gold.

This is how we’re going to start. We realize that some shiny things are actually fool’s gold (pyrite), and some of the real gold will look dull (the ore). Overall, we’re hoping that more of the shiny stuff is gold, and more of the dull-looking stuff is plain old rock.

In the image below I’ve split the spectrum into two parts. The yellow half represents the gold. Most of it is on the shiny end. The grey part is the worthless rock, which tends to be at the dull end.

Shininess (1)

You won’t know what the actual split between the gold and worthless stuff is until you bring your wheelbarrow home and check. You’re hoping it’ll be something like the split above. When you bring everything home, you’ll find out how much fool’s gold you picked up. You’ll also hear from other prospectors that dig in the same field if you missed any of the real gold because you thought it looked like rock.

The image above shows about a 50-50 split between the gold and not-gold. This makes things simpler for our example, though it probably won’t be like that in real life. It turns out that most of the stuff we talk about below still holds even if the split is, say 10%-90%.

Decision #2: How shiny?

Now that we’ve picked “shininess” as the basis for our decisions, how shiny does something have to be for you to pick it up? Let’s pick a shininess “threshold”, a minimum shininess. We’ll put everything that’s shinier than a certain minimum into our wheelbarrow.

Our shininess threshold might depend on how big our wheelbarrow is. If we have only a small wheelbarrow, we have to be more picky and only collect the shiniest stuff we see. On the other hand, if we have a few dump trucks, we might lower our shininess minimum, collect things that are a bit duller, then sort it all out when we get home. We’re OK gathering a lot of worthless stuff too (fallout) as long as we didn’t miss the opportunity to get more of the gold (sensitivity).

I want to focus on this point above. The threshold you chose is based entirely on your own preference. In our case we based the decision partly on the size of our wheelbarrow, but we could have hired more wheelbarrows if we thought it would be worth it. We also assume that it’s OK to pick up some fool’s gold if it means we get more real gold in the process.

Sometimes this decision is a bit more complicated than our simple one. If, instead of being a prospector, you were a doctor trying to decide whether or not to tell someone they have a terminal illness, you might ask yourself some tougher questions. Is it worse to tell someone who doesn’t have cancer that they do and that they are going to die, or to tell someone who does have cancer that they’re healthy and have nothing to worry about? The answers to these questions are left to our individual judgment. The bottom line: you can move the threshold around as you see fit.

Back to prospecting. The red line in the image below shows how shiny something has to be before we decide to pick it up.

The threshold of shininess. You take everything on the left and leave behind everything on the right.

Since we assumed more of the gold is on the shinier end of the spectrum, you’d get a high percentage of all the gold; more than 50% of it. This gives you a high sensitivity. You’ll get a smaller percentage of the worthless stuff, which gives you a low fallout.

Threshold (1)
The difference between what we take and what we leave behind. Most of the grey stuff got left behind, and we picked up a sizeable chunk of the gold.

Remember that sensitivity and fallout measure the percentage of the total gold and the rocks that we picked up.

Graph it out

Let’s see what happens when we move the shininess threshold around. As I move it around, I’ll draw both the sensitivity and the fallout on two axes of the same graph.

ROC Curve

To help with this, I redrew the ROC graph we saw before and made it easier to understand. I replaced ‘1 – specificity’ with fallout, which is the same thing, and instead of writing numbers from 0 to 1 along the sides, I wrote percentages from 0% to 100%.

A simple way to think about this graph is that every time you pick a threshold, the more of the gold you collected, the higher up on the graph your point is. The more of the worthless stuff you collected, the more to the right our point is.

Let’s pick four example thresholds along the spectrum, and for each we’ll draw points on the graph showing us where the sensitivity and fallout would be.ROC Curve (1).png

Let’s look at each of the points:

  1. Point (1) is what happens when you’re really picky; only the shiniest of the shiny things are good enough. Our shininess threshold is so high that we only get a small portion of the total gold, and almost none of the garbage. When we plot that point on the graph, both sensitivity and fallout are close to 0%.
  2. As we lower our threshold to (2) we start collecting more of the gold and more of the worthless rocks. Since most of the gold is on the shiny end of the spectrum, we end up with a lot of the gold (about 50% of it), and somewhat less of the rocks (about 10%). So our sensitivity is about 50%, and our fallout is about 10%.
  3. By the time we lower our threshold to (3), we’ve collected the majority of the gold (about 85%), but since our standards have gotten really low, we’ve also picked up a sizeable chunk of the worthless rocks. We’d need a few dump trucks to carry all the stuff off the field. We’re collecting objects that are a little dull, just hoping that we don’t miss out on any of the gold.
  4. Finally, when we get to (4), we’re collecting anything and everything we can pick up. As a prospector you aren’t really making any decisions here; everything you find is considered “shiny enough” to put on the truck. In reality this is either impossible, impractical or just foolish. By gathering everything there is to collect, you end up with 100% of the gold, but you also get 100% of the garbage.

When we plot everything out, the graph looks like a slightly curved line that starts from the bottom left corner and ends at the top right. Nice, we’ve drawn our first ROC curve.

A common pattern in life

You can find the pattern above all around you in life. For example, whenever you explore new songs on the radio, you’re hoping you’ll find songs that you love. Some people listen to a broad range of music, including music outside their favourite genres. They end up with a large list of songs they love, but they also spend some time sitting through songs they don’t like. Other people only listen to a few saved songs within their narrow niche. They only have a handful songs in their playlist, which they probably enjoy a lot, but they don’t get to find new songs very often. The same thing happens with movies we watch, clothes we buy, food we eat, websites we bookmark, or anything else we do often and where we may not always make the right decision.

Finding something good is often a result of luck. It’s a guess-and-check system. But is there a way to predict in advance the kinds of songs we will enjoy, and take the guesswork out of finding new music?

We instinctively learn to spot features about a band or a song that may tell us beforehand if we will like their music. For instance, the title of the song, the name of the band, the album art, and recommendations from friends who like the same music we do, all give us clues before we even listen to a song. Discovering new songs is less of a roll of the dice, and becomes more of a personally tailored experience. Spotify and other online music streaming sites use programs to recommend new music to listeners that they are likely to enjoy.

What ROC Curves are for

This is where we discover what ROC curves are for. ROC curves let us know if the features we’re using to make our decisions are good ones. In the case of our prospector the feature was “shininess”. We saw how increasing and decreasing our shininess threshold changed both the sensitivity and fallout. If we move the threshold from highest to lowest, and we notice that the “wins” are growing faster than the “disappointments”, it probably means more of the gold is on the left side of the spectrum than on the right.

We can understand this even better by asking two questions:

  1. What would the best possible scenario look like?
  2. What would the worst possible scenario look like?

Best possible scenario

What would our graph look like if shininess were the perfect way to separate gold from rocks? In other words, everything above a certain shininess level is real gold, and everything below is not. Our spectrum would look something like this:

ROC Perfect.png
The “perfect” scenario. Here, all that glitters is actually gold.

What would the ROC curve in this case look like?

Again, let’s start with the highest possible threshold, where we only collect the top-most shiny objects. These will all be gold. As we move our threshold down to include less shiny stuff, we would still get nothing but gold (see the graph below). Our sensitivity is increasing from 0% to 100%, and there is no fallout at all. This would continue until we hit the threshold of shininess that separates gold from rocks, at point (2). At this point we have collected all of the gold. ROC Curve (2).png

After this, if we collect any duller objects, we would only be adding rocks to our pile, so our fallout would start to increase from 0% to 100%. This would keep going until we had collected everything at (3). At this point we’ll have all of the gold and all of the rocks.

Worst (random) case

Suppose we based our decision on a feature that has nothing to do with whether or not an object is gold. Say we picked objects based on how much they look like our grandmother, or based on how close the nearest snail is.

In this case, we’re making blind stabs in the dark. We’re not being specific at all. It’s as if we put all the objects into a big bag, reached inside without looking and picked them out one at a time. The chances that we’ll pick gold are about the same as how much of the bag is actually gold. If 20% of the bag is gold, then our odds are 20% that we’ll get gold. Our spectrum would look something like this:

ROC Random.png
The “random” decision scenario. Gold is spread evenly across the spectrum.

Here, no matter where we put out threshold, the percentage of the gold on the left of the threshold the same as the percentage of the worthless stuff on the left of that same threshold.

Another way of understanding this is that if we reached into our bag, pulled out objects at random, and put the gold and the rocks into two piles, the percentage of the total gold we’ll get from the bag would be about the same as the percentage of the total rocks. Being more or less “picky” doesn’t improve our odds. In this case the sensitivity and the fallout will stay equal.

ROC Curve (3).png

The graph in the random scenario looks like a straight line starting from the bottom left and ending at the top right. The fallout is always equal to the sensitivity.

What are we trying to find?

Our goal is for the sensitivity to increase faster than the fallout. Our odds of picking gold should be higher the more picky we are. If it turns out we collect gold about as often as luck, then the feature we’re basing our decisions on is not a good one.

AUC (1)If we look at the three graphs we’ve drawn so far, we can see a pattern. The curves for good features tend to be closer to the left and the top of the graph. This makes the pink area under them bigger.

In fact, the Area Under the Curve (AUC) is a useful way to determine if the feature we’re basing our decisions on (shininess or anything else) is a good one.

Calculating areas is something a computer can do for us. By running a bunch of different features automatically through this test, we can find the best feature or combination of features to base our decisions on. In each case, we start from the highest threshold, slowly decrease it, plot the curve, and measure the area under it. A computer can quickly pick the best feature, or combination of features, to predict if something is gold.

Of course we have to run a lot of experiments. This is made easier if you have a lot of data already in hand. ROC curves can help you make sense of data you already have, finding events or features that are good predictors for things you want to find. If we have a database of rock samples that records their shape, weight, colour, shininess, etc, as well as whether or not they are gold, a computer could automatically figure out which of those features is the best predictor of real gold.

How is this connected to Machine Learning?

When we base our decisions on some feature, we are using a simple model. You’ll hear the term “model” used in Machine Learning a lot. A model is a set of criteria or a process we use to make decisions. Models in Machine Learning are often more complicated than the one in our example, but they all serve the same purpose.

Playing the role of the prospector, your decision-making went something like this:

(See an object)
"If the object is shinier than my threshold, pick it up. 
Otherwise, leave it on the floor"

This is your simple mental model. You may change it depending on how useful it is to you. You might add more features, such as it’s weight, or it’s weight relative to it’s size (density). All of these then become part of your model.

The benefit of models is that they can be used in future decisions. If you found that shininess, or some other group of features, is a good way to sort the gold from the rocks, you might tell your close friends to use that same model to make their decisions. You’d also use the same model yourself the next time you were out in the field.

Updating our model to make better decisions is the “learning” part of Machine Learning. In Machine Learning, programs automatically run through a lot of data and find useful relationships to build useful models. ROC curves help us decide if the model is a good one, or if it needs to be updated. How Machine Learning algorithms actually create these models will be the subject of another post.

Extra: Other Useful Terms

The Confusion Matrix has a lot of terms in it. Here are some of the other terms and what they mean for our prospector.

Miss rate

Miss rate is the counterpart to sensitivity. Together, both terms help you understand how much of the total gold you got, and how much you left behind.

Miss Rate :  gold I left behind
                 all gold

How much of the gold did I leave behind?

Miss rate and sensitivity add up to 100% of all the gold.

False omission rate and negative predictive rate

These two terms help you understand what the landscape you left behind looks like, i.e. how much of it is gold, and how much of it is worthless.

False Omission Rate :     gold I left behind   
                       all things I left behind

How much of the stuff I left behind was actually gold?

Negative Predictive Rate :    rocks I left behind   
                            all things I left behind

How much of the stuff I left behind was I right to leave behind?

Together, false omission rate and negative predictive rate add up to 100% of the stuff that was left behind.

Precision and false discovery rate

These two terms help you understand how much of the stuff in your wheelbarrow is gold, and how much are worthless.

Precision :     gold in my wheelbarrow   
             all things in my wheelbarrow

How much of my wheelbarrow is actually gold?

False Discovery Rate :  worthless stuff in my wheelbarrow
                          all things in my wheelbarrow

 How much of the stuff in my wheelbarrow is worthless?

Precision and false discovery rate add up to 100% of the wheelbarrow.


10 thoughts on “Intuitions in Machine Learning: Understanding ROC Curves

  1. Anonymous

    this is the clearest explanation I have found on the internet!
    Thank you for giving such intuitive example and make everything clear. :-)

      1. Leo

        Yes!!! If only all the tutorials and textbooks are as intuitive as yours! Hope you can write more materials like this! As a future business-background practitioner, I believe intuitions are much more useful than the algorithm itself!

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s