What’s All the Buzz About Bayes?

Pop quiz: recite the definition of a p-value. Have an answer? An appropriate definition goes something along these lines:

Assuming the null hypothesis is true, the p-value is the probability that, over many repeated samples, we would obtain data as extreme as or more extreme than what we actually observed.
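To make that definition more concrete, here is a minimal simulation sketch in Python (not from the original article; the coin-flip setup and every number in it are assumed purely for illustration). Under the null hypothesis of a fair coin, the p-value is approximately the fraction of repeated samples that come out at least as extreme as the observed result:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observation: 60 heads in 100 flips of a coin.
n_flips = 100
observed_heads = 60

# Simulate many repeated samples under the null hypothesis (a fair coin).
n_samples = 100_000
null_heads = rng.binomial(n=n_flips, p=0.5, size=n_samples)

# One-sided p-value: fraction of null samples at least as extreme as observed.
p_value = np.mean(null_heads >= observed_heads)
print(f"Approximate one-sided p-value: {p_value:.3f}")  # roughly 0.03
```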

P-values have come under heavy fire lately, with many criticisms levied at how they are used and interpreted.

Much of this criticism has come from proponents of Bayesian statistics, a branch of statistics that approaches classic hypothesis testing with an entirely different method. If this is your first time hearing of Bayesian statistics, you’re not alone.

I finished over half of my BS in psychology without ever hearing the word Bayesian. I wish I had been given the opportunity to learn about it earlier. This article will introduce the basics of Bayesian statistics and explain how it differs from the kind of frequentist statistics taught in most undergraduate courses.

Reverend Thomas Bayes (1701–1761) is the man who discovered the mathematics underlying this alternative method. Bayes developed a very important theorem about probability that mathematicians have creatively decided to call Bayes’ Theorem. It follows the form: P(A|B) = P(B|A)P(A) / P(B).

In simpler terms, this means that the probability of some event A, given that some event B has already occurred, equals the probability of B given A, multiplied by the probability of A, divided by the probability of B.
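As a quick sketch of the arithmetic (the probabilities below are made up purely to illustrate the formula):

```python
# Hypothetical probabilities, chosen only to illustrate Bayes' theorem.
p_b_given_a = 0.8   # P(B|A)
p_a = 0.1           # P(A)
p_b = 0.2           # P(B)

p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' theorem
print(f"P(A|B) = {p_a_given_b:.2f}")   # 0.40
```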

This theorem is incredibly important to the foundation of mathematical statistics. Around the early to mid-20th century, however, the theorem was put to use in a model of hypothesis testing. This model has the following form:

P(H|E) = P(E|H)P(H) / P(E) (Bayes’ theorem for hypothesis testing)

H represents the hypothesis we wish to examine.

E represents evidence for that hypothesis (presumably evidence we obtain through experiment and/or observation). 

P(H|E) is called the posterior. This value gives us the probability of our hypothesis given our obtained evidence. In a sense, it shows the degree to which your hypothesis is confirmed by your evidence. This is the unknown value that we ultimately wish to obtain.

P(E|H) is called the likelihood. This gives us the probability that we would find our evidence given that our stated hypothesis is correct. In other words, it is the probability that our evidence is compatible with our hypothesis. There are different techniques for obtaining the likelihood, a couple of which will be touched on later in the article.

P(H) is called the prior. This is how probable our hypothesis is before any evidence is taken into account. The prior is infamous for being hard to define properly. How could you know any prior information about a hypothesis you haven’t even tested? Without going into too much detail (because one could go on indefinitely about Bayesian priors), just know that this value has to be assumed. There are methods that can pick out better prior probabilities amongst worse ones, such as examining the results of previously conducted experiments. The prior developed from past experiments, however, will still be arbitrary to a degree.

P(E) is called the marginal likelihood/probability, or often just the marginal. This value represents the probability of our evidence averaged over all possible hypotheses. Most of the time, we do not actually need a direct value for the marginal, as it can often be cancelled out of the equation.
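To see how these pieces fit together, here is a minimal sketch (all probabilities are made up for illustration) in which the marginal is expanded over the hypothesis and its negation, P(E) = P(E|H)P(H) + P(E|not H)P(not H), and the posterior then follows from a single division:

```python
# All probabilities below are hypothetical, chosen only for illustration.
p_h = 0.01              # prior: P(H)
p_e_given_h = 0.70      # likelihood: P(E|H)
p_e_given_not_h = 0.05  # P(E | not H)

# Marginal: probability of the evidence across the hypotheses considered.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior: Bayes' theorem for hypothesis testing.
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(H|E) = {p_h_given_e:.3f}")  # about 0.124
```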

For a practical example, imagine that one day while strolling around your house, you notice that your fine china display has been smashed into pieces. You suspect that a burglar broke into your home and destroyed your precious dishware. You figure doing a Bayesian hypothesis test could help you get to the bottom of what occurred.

We denote the burglar breaking in as our H and the smashed china as our E.

We want to figure out the posterior probability that a burglar broke into your home, given that all of your china has been destroyed. To do this, we need to fill in the other pieces of the equation.

On the right side, we first look at the likelihood. This is the probability that we would find the smashed china given that a burglar did break into the house. One possible way of coming up with a value of this probability might be to look at crime statistics. How many cases of burglary end up with their victims’ china destroyed as well? 

Next we need to find the prior, which in this case would be the probability that someone would break into your house before any evidence is taken into account. One way to find this value in this example would be to look at the rates of burglary in your area.

The last part of the equation to deduce would be the marginal; however, we do not actually have to calculate it at all if we introduce a second hypothesis.

Let’s say that you are now considering the possibility that the ruined china wasn’t caused by a burglar, but instead was smashed when you tripped and fell into it the night before. We will now label the burglar hypothesis as H1 and the tripping-and-falling hypothesis as H2. With two hypotheses, we can write down two different forms of Bayes’ theorem, do some fancy math, and get the following equation:

P(E|H1) / P(E|H2) = [P(H1|E) / P(H2|E)] / [P(H1) / P(H2)] (an equation for the Bayes factor)

The “fancy math” here is just dividing Bayes’ theorem for H1 by Bayes’ theorem for H2. Because both hypotheses share the same marginal P(E), it cancels, leaving a ratio of two likelihoods on the left side equal to the posterior odds divided by the prior odds. This new ratio will prove to be incredibly helpful; in fact, we call it the Bayes factor.

The Bayes factor tells us how strongly the evidence favours one hypothesis over the other. If the likelihood of H1 is greater than the likelihood of H2, the Bayes factor will be greater than 1. If the likelihood of H2 is larger, then the Bayes factor will be less than 1. In either case, the Bayes factor lets us know which hypothesis carries more weight in light of the evidence. For our example, let’s say we have obtained a Bayes factor of 0.01. This would mean that the hypothesis of tripping and falling is a much stronger explanation of our observed evidence, the smashed china, than the hypothesis of a burglar breaking in.
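Sticking with the example, here is a minimal sketch of how the Bayes factor and posterior odds could be computed. All of the probabilities are invented for illustration; in practice they would come from something like crime statistics and local burglary rates:

```python
# Made-up probabilities purely for illustration.
p_e_given_h1 = 0.006   # likelihood of smashed china given a burglary (H1)
p_e_given_h2 = 0.60    # likelihood of smashed china given you tripped (H2)
p_h1 = 0.001           # prior probability of a burglary
p_h2 = 0.002           # prior probability of tripping into the display

bayes_factor = p_e_given_h1 / p_e_given_h2   # likelihood ratio
prior_odds = p_h1 / p_h2
posterior_odds = bayes_factor * prior_odds   # P(H1|E) / P(H2|E)

print(f"Bayes factor: {bayes_factor:.2f}")      # 0.01, favouring H2 (tripping)
print(f"Posterior odds: {posterior_odds:.3f}")  # 0.005
```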

This is Bayesian hypothesis testing in a nutshell. Bayes factors serve a similar function to the p-values used in frequentist statistics, but one large difference is their interpretation. The p-value has a complicated interpretation that requires assuming the null hypothesis is true and examining the frequency of outcomes in many hypothetical repeated samples. The Bayes factor, in contrast, is much more intuitive: it directly compares how well two competing hypotheses account for the observed evidence.

Okay, one might be thinking, but which one should we use, frequentist or Bayesian statistics? Well, the lucky answer is that, in many cases, it doesn’t matter! Much of the time, where one has used a frequentist method, an appropriate Bayesian method would have given the same result, and vice versa. You can think of both styles of statistics as two different techniques for the same end goal.

This brings us to a big issue: why aren’t we taught Bayesian statistics in undergraduate psychology? Part of it goes back to the dogmatism of certain renowned statisticians of the 20th century. Some big-name figures in statistics, Fisher, Pearson, and others, were not fond of Bayesian methods compared to the frequentist alternatives. Universities followed their authority by leaving Bayesian statistics out of the curriculum, and that has largely continued to be the case today.

Another reason Bayesian statistics isn’t taught that often at the undergraduate level is the difficulty of its calculations. A large portion of Bayesian methods cannot be carried out without serious computational power, and the computer technology to provide it has only become widely available in the past couple of decades.

The most popular technique for approximating the posterior in Bayes’ theorem is called Markov chain Monte Carlo, or MCMC for short. This technique uses software simulations that step through tens of thousands of candidate probability states, which would be hard to teach in undergraduate statistics courses.
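To give a flavour of what MCMC involves, here is a bare-bones Metropolis sampler, one specific MCMC algorithm, applied to estimating a coin’s bias. The data, the uniform prior, and the proposal width are all assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 60 heads out of 100 flips; uniform prior on the bias theta.
heads, flips = 60, 100

def log_posterior(theta):
    """Unnormalised log posterior: log likelihood plus a flat (uniform) log prior."""
    if not 0 < theta < 1:
        return -np.inf
    return heads * np.log(theta) + (flips - heads) * np.log(1 - theta)

# Metropolis algorithm: propose a nearby value, then accept or reject it at random.
n_steps = 20_000
samples = np.empty(n_steps)
theta = 0.5  # starting value
for i in range(n_steps):
    proposal = theta + rng.normal(scale=0.05)  # assumed proposal width
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples[i] = theta

# Discard early "burn-in" samples and summarise the posterior.
posterior_draws = samples[2_000:]
print(f"Posterior mean of theta: {posterior_draws.mean():.3f}")  # near 0.60
```

Notice that the marginal P(E) never appears; MCMC only needs the posterior up to a constant, which is a big part of its appeal.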

Other statistical tests, such as a two-sample t-test or a one-way analysis of variance (ANOVA), may feel easier to teach to undergraduates because they can be calculated without the need for complicated software. Still, there are some Bayesian methods that require less computation and would be feasible to teach undergraduates in addition to quizzing them on p-values. Time will tell just how far this “Bayesian Revolution” spreads.

Bayesian statistics is a far deeper topic than this article can feasibly go into. I hope it will inspire some readers to look further into these ideas. Or at least, I hope they will be able to take away that an entirely separate statistical technique is out there waiting for them to learn.
