If you’ve read a few blogs or articles on machine learning, data, analytics, or any field related to statistics, you may have come across the terms “frequentist” and “Bayesian”. These terms refer to two different approaches to statistics that, on the surface, seem to be in conflict with one another.
Even after doing a bit of research, it may not be clear to you what the difference is. It wasn’t clear to me. I started to suspect it’s just another mathematicians’ nerd-war over nit-picky details.
Turns out, the difference between the two approaches goes deep, and reflects the assumptions behind the decisions we make in everyday life. I’ll try to explain this difference with an example that you might be familiar with.
The normal die
(Note: “die” singular -> “dice” plural)
Suppose you rolled a normal six-sided die. How likely are you to roll a 2?
Most of us would answer this question by looking at the number of sides on the die (in this case, six) and assume it has an equal chance of turning up any of the sides. There are six sides, one of the sides is a 2, so roughly one out of every six times you rolled you’d get a 2.
This thinking seems intuitively correct. If I had to bet money on these dice rolls, I’d have a rough idea of how much money I’d lose or win over time based on the number of sides the die had. If 5 of my friends and I were betting on the outcome of the roll, and we each had to pick a side to bet on, unless I were superstitious (or suspected my friends of cheating) I’d have no practical reason to pick one number over any of the others.
The faulty die
What if I told you now that when the die was being made in the factory, it was made faulty? Not intentionally, mind you; these aren’t cheater’s dice. They’re just weird, irregular.
Now, if I asked you what the odds are that any roll would turn up a 2, what would your answer be?
This question feels harder to answer than the first one. The die may be weighted to one side or another, so some sides may never turn up, and others may turn up more than half the time. The assumptions we made when talking about the normal die no longer apply.
If I pushed you for an answer, the only way you could guess is by rolling the die a few (hundred?) times, and keeping a record of how often each number comes up. Then you could give me a good sense of how likely I am to roll a 2.
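Here is a minimal sketch of that record-keeping in Python. Since I can't hand you a real faulty die, the rolls are simulated, and the weights in `ASSUMED_WEIGHTS` are made up purely for illustration (they make a 2 three times as likely as any other face). The function name `estimate_probabilities` is my own.

```python
import random
from collections import Counter

def estimate_probabilities(rolls):
    """Return the observed share of each face (1-6) in a list of rolls."""
    counts = Counter(rolls)
    total = len(rolls)
    return {face: counts[face] / total for face in range(1, 7)}

# Hypothetical weights for a faulty die: face 2 is three times as likely
# as any other face. A real faulty die would have unknown weights.
ASSUMED_WEIGHTS = [1, 3, 1, 1, 1, 1]

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = rng.choices(range(1, 7), weights=ASSUMED_WEIGHTS, k=600)

estimates = estimate_probabilities(rolls)
for face, share in estimates.items():
    print(face, round(share, 3))
```

With a few hundred rolls, the share for face 2 should land somewhere near 3/8 rather than 1/6, which is the kind of evidence you would need before answering the question.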
The catch
Here’s the catch. If I gave you a die, how could you know if it was faulty or not?
In other words, given any unknown die, how could you decide if it was well made, or if it was imbalanced?
There’s no easy way to answer this. Our common intuition is to roll the die a few hundred times and keep a record of how many times each number is rolled. From the results we can see if the die is weighted towards one number or another.
But there’s a problem with that approach. Even if you roll a perfectly balanced die you won’t roll all 6 numbers evenly. You could even roll five 6s in a row, unlikely as that is.
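For a sense of scale, here is the arithmetic for five 6s in a row on a fair die, assuming each roll is independent of the others. It is rare, but it is not zero.

```python
from fractions import Fraction

# Chance of one specific face on a fair six-sided die.
p_six = Fraction(1, 6)

# Independent rolls multiply, so five 6s in a row is (1/6)^5.
p_five_sixes = p_six ** 5

print(p_five_sixes)         # 1/7776
print(float(p_five_sixes))  # roughly 0.00013
```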
I have a die next to me. These are my first 10 rolls:
1, 5, 2, 1, 3, 5, 6, 6, 5, 3
I didn’t roll any 4s, and I rolled 5 three times. You can try this experiment yourself at this link.
Some would argue that ten rolls isn’t enough to decide if a die is faulty. You’d need at least a few dozen rolls, if not more, to feel confident.
But no matter how many times you roll the die, you will likely never get a perfect 1/6th split. It may get very close, but it won’t be exact. This is a natural result of the fact that the world has randomness in it. Given that’s the case, how can you know if these variations were because the die was faulty, or if it was due to luck?
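One common way statisticians put a number on "how far from even is this?" is a chi-square goodness-of-fit statistic. This is a technique I'm bringing in for illustration, not something the argument depends on. The sketch below applies it to my ten rolls from earlier. The threshold of 11.07 is the standard 5% critical value for 5 degrees of freedom, and with only ten rolls the approximation behind it is shaky, so treat the result as a rough guide.

```python
from collections import Counter

def chi_square_statistic(rolls, sides=6):
    """Sum of (observed - expected)^2 / expected over all faces,
    where a fair die is expected to show each face equally often."""
    counts = Counter(rolls)
    expected = len(rolls) / sides
    return sum(
        (counts[face] - expected) ** 2 / expected
        for face in range(1, sides + 1)
    )

rolls = [1, 5, 2, 1, 3, 5, 6, 6, 5, 3]
statistic = chi_square_statistic(rolls)

# Standard 5% critical value for a chi-square with 5 degrees of freedom.
CRITICAL_5_PERCENT = 11.07

print(round(statistic, 2))
print("looks suspicious" if statistic > CRITICAL_5_PERCENT else "consistent with luck")
```

For these ten rolls the statistic sits well below the threshold, so the missing 4s and extra 5s are the kind of unevenness a fair die produces all the time. Note what the test does not do: it never tells you the die is fair, only that these rolls give no strong reason to think otherwise.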
Even a tiny margin, say 1%, can make a difference over the long run. For example, a gambler who is playing with weighted dice, where the number 5 is slightly more likely to be rolled than other numbers, could use that knowledge to her advantage and make a profit. Casinos rely on the same kind of small edge, which, in the long run, adds up to massive (and legal) profits.
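To see how a small edge adds up, here is a rough sketch. The setup is my own assumption: an even-money bet (win or lose one unit) where the favoured side wins 50.5% of the time, which gives a 1% edge per bet. Real casino games have different payouts and odds.

```python
def expected_profit(p_win, stake, n_bets):
    """Expected total profit from n_bets even-money bets of `stake` each."""
    edge = p_win - (1 - p_win)  # average profit per unit staked
    return n_bets * stake * edge

# Hypothetical numbers: a 50.5% chance of winning, one unit per bet.
for n_bets in (100, 10_000, 100_000):
    print(n_bets, round(expected_profit(0.505, 1, n_bets), 2))
```

Over a hundred bets the expected gain is a single unit, easily drowned out by luck. Over a hundred thousand bets it is around a thousand units, and luck matters far less.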
Frequentists and Bayesians
So we’re stuck in a dilemma. There is no real way to differentiate between results that are caused by randomness and results that are actually a consequence of how the die was made.
In cases like these, we tend to go with our common sense. Consider the following two scenarios:
If you get a die out of a brand new board-game box, you have no reason to believe it is fraudulent or faulty. If you rolled one a few times, and 6s came up more often than 3s, you’d attribute it to luck. This is the Bayesian approach, which says that you can guess beforehand what the outcomes (“probabilities”) are going to be based on certain rational criteria and experiences. In this case, new dice always roll each of the six numbers evenly, with a bit of randomness thrown in to spice things up. It’s only when you get new information, such as news of a fault in the factory, that you readjust your predictions.
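One standard way to write this down is to treat your prior belief as a pile of imaginary past rolls and add the real rolls on top (a Dirichlet prior, in the jargon). This is a sketch under my own assumptions: the value of `PRIOR_STRENGTH`, 100 imaginary rolls per face, is an arbitrary stand-in for "I strongly trust new dice", and the real rolls are my ten from earlier.

```python
from collections import Counter

def posterior_mean(prior_counts, rolls):
    """Combine prior pseudo-counts with observed rolls and return
    the updated estimate of each face's probability."""
    counts = Counter(rolls)
    total = sum(prior_counts.values()) + len(rolls)
    return {
        face: (prior_counts[face] + counts[face]) / total
        for face in prior_counts
    }

# Hypothetical prior: as if we had already seen 100 rolls of each face.
PRIOR_STRENGTH = 100
prior = {face: PRIOR_STRENGTH for face in range(1, 7)}

rolls = [1, 5, 2, 1, 3, 5, 6, 6, 5, 3]
posterior = posterior_mean(prior, rolls)

for face, p in posterior.items():
    print(face, round(p, 3))
```

With a prior this strong, ten uneven rolls barely move the estimates away from 1/6. Lower `PRIOR_STRENGTH` (say, to 1) and the same ten rolls shift the estimates much more, which is what you'd want if you had less reason to trust the die.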
On the other hand, if you hang out with gamblers and magicians, and one of them handed you a die, you’d probably want to check it beforehand. If you rolled it a few times, and 6s came up more often than 3s, you might accuse the person who handed it to you of cheating. This is closer to the frequentist’s position; it says that every new die is a new experiment, and must be experimented with on its own terms. Even after the experiment, the frequentist will only talk about the history the die has shown thus far, and leaves open the possibility that it will change in the future.
Ultimately, there is no universal way to know what caused the dice to roll one way or another, so neither approach is applicable in all cases. It’s up to you to decide in a given situation which seems most appropriate.
When this difference matters
If you had to guess which of your country’s political parties was going to win the next election, how would you make that guess?
One approach is to go by past success. If a particular fringe party has never won, it seems reasonable to assume that they won’t win this time either. This is the Bayesian approach; if every new election is like rolling a new die, you have no reason to suspect any of them is faulty until new information (like a shift in political zeitgeist) is added to the mix.
Consider, however, that there haven’t been that many elections in any given country’s history. The political parties that have won may have done the equivalent of rolling 5 sixes in a row, which is unlikely but not impossible. Your decision that a certain fringe party is unlikely to win is a judgement call you make based on past experiences, and a gut feeling.
Alternatively, you may decide to poll pedestrians on the street and generalize your sample to the entire population. This is the frequentist approach. Every new election, like every new die from a gambler, has to be experimented with separately, to see what the outcome is.
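As a rough sketch of what generalizing from a sample looks like, here is the usual textbook calculation of a 95% margin of error for a poll. The poll numbers (540 supporters out of 1,000 people asked) are invented, and the formula assumes a simple random sample, which pedestrians on one street almost certainly are not.

```python
import math

def poll_interval(supporters, sample_size, z=1.96):
    """Return the sample share and an approximate 95% interval around it."""
    p_hat = supporters / sample_size
    standard_error = math.sqrt(p_hat * (1 - p_hat) / sample_size)
    margin = z * standard_error
    return p_hat, p_hat - margin, p_hat + margin

# Hypothetical poll: 540 of 1,000 respondents back the party.
share, low, high = poll_interval(540, 1000)
print(round(share, 3), round(low, 3), round(high, 3))
```

The interval describes only this sample and this election. Nothing about past elections enters the calculation, which is the point of the frequentist stance.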
You could even imagine yourself taking both approaches in different contexts, and that is the point. Though the two approaches contradict each other in some ways, in other ways they can be seen as complementary. They serve different use cases. You have to decide in each case if you’re going to start from prior probabilities, or if you’re going to do a tally of the current use case, and only go by the result.