Early Access

1.2 A model of a murder

Searching carefully around the library, Dr Bayes spots a bullet lodged in the book case. “Hmm, interesting”, he says, “I think this could be an important clue”.

So it seems that the murder weapon was the revolver, not the dagger. Our intuition is that this new evidence points more strongly towards Major Grey than it does to Miss Auburn, since the Major, with his military background, is more likely to have experience with a revolver than Miss Auburn. But how can we use this information?

A convenient way to think about the probabilities we have looked at so far is as a description of the process by which we believe the murder took place, taking account of the various sources of uncertainty. So, in this process, we first pick the murderer with the help of Figure 1.1. This shows that there is a 30% chance of choosing Major Grey and a 70% chance of choosing Miss Auburn. Let us suppose that Miss Auburn was the murderer. We can then refer to Figure 1.3 to pick which weapon she used. There is a 20% chance that she would have used the revolver and an 80% chance that she would have used the dagger. Let’s consider the event of Miss Auburn picking the revolver. The probability of choosing Miss Auburn and the revolver is therefore 70% \times 20% = 14%. This is the joint probability of choosing Auburn and revolver. If we repeat this exercise for the other three combinations of murderer and weapon we obtain the joint probability distribution over the two random variables, which we can illustrate pictorially as seen in Figure 1.4.

Figure 1.4Representation of the joint probabilities for the two random variables murderer and weapon.

Figure 1.5 below shows how this joint distribution was constructed from the previous distributions we have defined. We have taken the left-hand slice of the P(\var{murderer}) square corresponding to Major Grey, and divided it vertically in proportion to the two regions of the conditional probability square for Grey. Likewise, we have taken the right-hand slice of the P(\var{murderer}) square corresponding Miss Auburn, and divided it vertically in proportion to the two regions of the conditional probability square for Auburn.

= \times P(\var{weapon}, \var{murderer})   P(\var{murderer})   P(\var{weapon} | \var{murderer})
Figure 1.5The joint distribution for our two-variable model, shown as a product of two factors.

We denote this joint probability distribution by P(\var{weapon}, \var{murderer}), which should be read as “the probability of weapon and murderer”. In general, the joint distribution of two random variables A and B can be written P(\var{A}, \var{B}) and specifies the probability for each possible combination of settings of A and B. Because probabilities must sum to one, we have

 \sum_{\var{A}} \sum_{\var{B}} P(\var{A}, \var{B}) = 1. \label{eq:joint-normalized}

Here the notation \sum_A denote a sum over all possible states of the random variable A, and likewise for B. This corresponds to the total area of the square in Figure 1.4 being 1, and arises because we assume the world consists of one, and only one, combination of murderer and weapon. Picking a point randomly in this new square corresponds to sampling from the joint probability distribution.

Probabilistic models

We can now introduce the central concept of this book, the probabilistic model. A probabilistic model consists of:

Once we have a probabilistic model, we can reason about the variables it contains, make predictions, learn about the values of some random variables given the values of others, and in general, answer any possible question that can be stated in terms of the random variables included in the model. This makes a probabilistic model an incredibly powerful tool for doing machine learning.

We can think of a probabilistic model as a set of assumptions we are making about the problem we are trying to solve, where any assumptions involving uncertainty are expressed using probabilities. The best way to understand how this is done, and how the model can be used to reason and make predictions, is by looking at example models. In this chapter, we give the example of a probabilistic model of a murder. In later chapters, we shall build a variety of more complex models for other applications. All the machine learning applications in this book will be solved solely through the use of probabilistic models.

Two rules for working with probabilistic models

So for our murder mystery, we have a probabilistic model with two variables murderer and weapon where the joint probability distribution over those variables is the one shown in Figure 1.4. To use this model, we now need to introduce two key rules which allow us to manipulate the probability distributions in a model.

From the discussion above, we see that the joint probability distribution for our model is obtained by taking the probability distribution over murderer and multiplying by the conditional distribution of weapon. This can be written in the form

 P(\var{weapon}, \var{murderer}) = P(\var{murderer}) P(\var{weapon} | \var{murderer}). \label{eq:murderer-product-rule}

Equation (1.15) is an example of a very important result called the product rule of probability. The product rule says that the joint distribution of A and B can be written as the product of the distribution over A and the conditional distribution of B conditioned on the value of A, in the form

 P(\var{A}, \var{B}) = P(\var{A}) P(\var{B} | \var{A}). \label{eq:product-rule}

Now suppose we sum up the values in the two left-hand regions of Figure 1.4 corresponding to Major Grey. Their total area is 0.3, as we expect because we know that the probability of Grey being the murderer is 0.3. The sum is over the different possibilities for the choice of weapon, so we can express this in the form

 \sum_{\var{weapon}} P(\var{weapon}, \var{murderer} = \state{Grey}) = P(\var{murderer} = \state{Grey}). \label{eq:sum-rule-Grey}

Similarly, the entries in the second column, corresponding to the murderer being Miss Auburn, must add up to 0.7. Combining these together we can write

 \sum_{\var{weapon}} P(\var{weapon}, \var{murderer}) = P(\var{murderer}). \label{eq:murderer-sum-rule}

This is an example of the sum rule of probability, which says that the probability distribution over a random variable A is obtained by summing the joint distribution P(\var{A}, \var{B}) over all values of B

 P(\var{A}) = \sum_{\var{B}} P(\var{A}, \var{B}). \label{eq:sum-rule}

In this context, the distribution P(\var{A}) is known as the marginal distribution for A and the act of summing out B is called marginalisation. We can equally apply the sum rule to marginalise over the murderer to find the probability that each of the weapons was used, irrespective of who used them. If we sum the areas of the top two regions of Figure 1.4 we see that the probability of the weapon being the revolver was 0.27 + 0.14 = 0.41, or 41%. Similarly if we add up the areas of the bottom two regions we see that the probability that the weapon was the dagger is 0.03 + 0.56 = 0.59 or 59%. The two marginal probabilities then add up to 1, as expected since the weapon must have been either the revolver or the dagger.

The sum and product rules are very general. They apply not just when A and B are binary random variables, but also when they are multi-state random variables, and even when they are continuous (in which case the sums are replaced by integrations). Furthermore, A and B could each represent sets of several random variables. For example, if \var{B} \equiv \{ \var{C}, \var{D} \}, then from the product rule (1.16) we have

 P(\var{A}, \var{C}, \var{D}) = P(\var{A}) P(\var{C}, \var{D} | \var{A}) \label{eq:product-rule-3-variables}

and similarly the sum rule (1.19) gives

 P(\var{A}) =\sum_\var{C} \sum_\var{D} P(\var{A}, \var{C}, \var{D}). \label{eq:sum-rule-3-variables}

The last result is particularly useful since it shows that we can find the marginal distribution for a particular random variable in a joint distribution by summing over all the other random variables, no matter how many there are.

Together, the product rule and sum rule provide the two key results that we will need throughout the book in order to manipulate and calculate with probabilities. It is remarkable that the rich and powerful complexity of probabilistic modelling is all founded on these two simple rules.

Inference using the joint distribution

We now have the tools that we need to incorporate the fact that the weapon was the revolver. Intuitively, we expect that this should increase the probability that Grey was the murderer but to confirm this we need to calculate that updated probability. The process of computing revised probability distributions after we have observed the values of some the random variables, is called inference. Inference is the cornerstone of model-based machine learning – it can be used for reasoning about a model, learning from data, making predictions with a model – in fact any machine learning task can be achieved using inference.

We can do inference in our model using the joint probability distribution shown in Figure 1.4. Our model says that, before we observe which weapon was used to commit the crime, all points within this square are equally likely. However, we now know that the weapon was the revolver. We can therefore rule out the two lower regions which correspond to the weapon being the dagger, as illustrated in Figure 1.6.

Figure 1.6This shows the joint distribution from Figure 1.4 in which the regions corresponding to the dagger have been eliminated.

Because all points in the remaining two regions are equally likely, we see that the probability of the murderer being Major Grey is given by the fraction of the remaining area given by the grey box on the left.  P(\var{murderer} = \state{Grey}| \var{weapon} = \state{revolver}) = \frac{0.27}{0.27 + 0.14} \simeq 0.66 in other words a 66% probability. This is significantly higher than the 30% probability we had before observing that the weapon used was the revolver. We see that our intuition is therefore correct and it now looks more likely that Grey is the murderer rather than Auburn. The probability that we assigned to Grey being the murderer before seeing the evidence of the bullet is sometimes called the prior probability (or just the prior), while the revised probability after seeing the new evidence is called the posterior probability (or just the posterior).

The probability that Miss Auburn is the murderer is similarly given by  P(\var{murderer} = \state{Auburn}| \var{weapon} = \state{revolver}) = \frac{0.14}{0.27 + 0.14} \simeq 0.34. Because the murderer is either Grey or Auburn these two probabilities again sum to 1. We can capture this pictorially by re-scaling the regions in Figure 1.6 to give the diagram shown in Figure 1.7.

Figure 1.7Representation of the posterior probabilities that Grey or Auburn was the murderer, given that the weapon is the revolver.

We have seen that, as new data, or evidence, is collected we can use the product and sum rules to revise the probabilities to reflect changing levels of uncertainty. The system can be viewed as having learned from that data.

So, after all this hard work, have we finally solved our murder mystery? Well, given the evidence so far it appears that Grey is more likely to be the murderer, but the probability of his guilt currently stands at 66% which feels too small for a conviction. But how high a probability would we need? To find an answer we turn to William Blackstone’s principle of 1765:

“Better that ten guilty persons escape than one innocent suffer.”

We therefore need a probability of guilt for our murderer which exceeds \frac{10}{10+1} \approx 91\%. To achieve this level of proof we will need to gather more evidence from the crime scene, and to make a corresponding extension to our model in order to incorporate this new evidence. We’ll look at how to do this in the next section.

Self assessment 1.2

The following exercises will help embed the concepts you have learned in this section. It may help to refer back to the text or to the concept summary below.

  1. Check for yourself that the joint probabilities for the four areas in Figure 1.4 are correct and confirm that their total is 1. Use this figure to compute the posterior probability over murderer, if the murder weapon had been the dagger rather than the revolver.
  2. Choose one of the following scenarios (continued from the previous self assessment) or choose your own scenario
    1. Whether you are late for work, depending on whether or not traffic is bad.
    2. Whether a user replies to an email, depending on whether or not he knows the sender.
    3. Whether it will rain on a particular day, depending on whether or not it rained on the previous day.
    For your selected scenario, pick a suitable prior probability for the conditioning variable (for example, whether the traffic is bad, whether the user knows the sender, whether it rained the previous day). Recall the conditional probability table that you estimated in the previous self assessment. Using the prior and this conditional distribution, use the product rule to calculate the joint distribution over the two variables in the scenario. Draw this joint distribution pictorially, like the example of Figure 1.4. Make sure you label each area with the probability value, and that these values all add up to 1.
  3. Now assume that you know the value of the conditioned variable, for example, assume that you are late for work on a particular day. Now compute the posterior probability of the conditioning variable, for example, the probability that the traffic was bad on that day. You can achieve this using your diagram from the previous question, by crossing out the areas that don’t apply and finding the fraction of the remaining area where the conditioning event happened.
  4. For your joint probability distribution, write a program to print out 1,000 joint samples of both variables. Compute the fraction of samples that have each possible pair of values. Check that this is close to your joint probability table. Now change the program to only print out those samples which are consistent with your known value from the previous question (for example, samples where you are late for work). What fraction of these samples have each possible pair of values now? How does this compare to your answer to the previous question?

Review of concepts introduced on this page

joint probabilityA probability distribution over multiple variables which gives the probability of the variables jointly taking a particular configuration of values. For example, P(\var{A},\var{B},\var{C}) is a joint distribution over the random variables A, B, and C.

probabilistic modelA set of random variables combined with a joint distribution that assigns a probability to every configuration of these variables.

The model is often represented using a graph, such as a factor graph, making it a graphical model.

product rule of probabilityThe rule that the joint distribution of A and B can be written as the product of the distribution over A and the conditional distribution of B conditioned on the value of A, in the form

 P(\var{A}, \var{B}) = P(\var{A}) P(\var{B} | \var{A}).

sum rule of probabilityThe rule that the probability distribution over a random variable A is obtained by summing the joint distribution P(\var{A}, \var{B}) over all values of B

 P(\var{A}) = \sum_{\var{B}} P(\var{A}, \var{B}).

marginal distributionThe distribution over a random variable computed by using the sum rule to sum a joint distribution over all other variables in the distribution.

marginalisationThe process of summing a joint distribution to compute a marginal distribution.

inferenceThe process of computing probability distributions over certain specified random variables, usually after observing the value of some other variables in the model.

prior probabilityThe probability distribution over a random variable prior to seeing any data. Careful choice of prior distributions is an important part of model design.

posterior probabilityThe updated probability distribution over a random variable after some data has been taken into account. The aim of inference is to compute posterior probability distributions over variables of interest.