## 2.2 Testing out the model

Having constructed a model, the first thing to do is to test it out with some simple example data to check that its behaviour is reasonable. Suppose a candidate knows C# but not SQL – we would expect them to get the first question right and the other two questions wrong. So let’s test the model out for this case and see what skills it infers for that pattern of answers. For convenience, we’ll use isCorrect to refer to the array [isCorrect1,isCorrect2,isCorrect3] and so we will consider the case where isCorrect is [true,false,false].

We want to infer the probability that the candidate has the csharp and sql skills, given this particular set of isCorrect values. This is an example of an inference query which is defined by the variables we want to infer (csharp, sql) and the variables we are conditioning on (isCorrect) along with their observed values. Because this example is quite small, we can work out the answer to this inference query by hand.

### Doing inference by hand

Inference deep-dive

In this optional section, we perform inference in our model manually by marginalising the joint distribution. If you want to focus on modelling, feel free to skip this section.

As we saw in Chapter 1, we can perform inference by marginalising the joint distribution (summing it over all the variables except the one we are interested in) while fixing the values of any observed variables. For the three-question factor graph, we wrote down the joint probability distribution in equation (2.4). Here it is again:

Before we start this inference calculation, we need to show how to compute a product of distributions. Suppose for some variable x, we know that:

This may look odd since we normally only associate one distribution with a variable but, as we’ll see, products of distributions arise frequently when performing inference. Evaluating this expression for the two values of x gives:

Since we know that $P(\var{x}=\state{true})$ and $P(\var{x}=\state{false})$ must add up to one, we can divide these values by $0.32+0.12=0.44$ to get:

This calculation may feel familiar to you – it is very similar to the inference calculations that we performed in Chapter 1.

In general, if want to multiply two Bernoulli distributions, we can use the rule that:

If, say, the second distribution is uniform (b=0.5), the result of the product is $\mathrm{Bernoulli}(\var{x};a)$. In other words, the distribution $\mathrm{Bernoulli}(\var{x};a)$ is unchanged by multiplying by a uniform distribution. In general, multiplying any distribution by the uniform distribution leaves it unchanged.

Armed with the ability to multiply distributions, we can now compute the probability that our example candidate has the csharp skill. The precise probability we want to compute is $P(\var{csharp} | \var{isCorrect}=[\state{T},\state{F},\state{F}])$, where we have abbreviated true to T and false to F. As before, we can compute this by marginalising the joint distribution and fixing the observed values:

As we saw in Chapter 1, we use the proportional sign $\propto$ because we do not care about the scaling of the right hand side, only the ratio of its value when csharp is true to the value when csharp is false.

Now we put in the full expression for the joint probability from (2.5) and fix the values of all the observed variables. We can ignore the Bernoulli(0.5) terms since, as we just learned, multiplying a distribution by a uniform distribution leaves it unchanged. So the right hand side of (2.10) becomes

Terms inside of each summation $\sum$ that do not mention the variable being summed over can be moved outside of the summation, because they have the same value for each term being summed. You can also think of this as moving the summation signs in towards the right:

If you look at the first term here, you’ll see that it is a function of csharp only, since isCorrect1 is observed to be true. When csharp is true, this term has the value 0.9. When csharp is false, this term has the value 0.2. Since we only care about the relative sizes of these two numbers, we can replace this term by a Bernoulli term where the probability of true is $\frac{0.9}{0.9+0.2} = 0.818$ and the probability of false is therefore 1-0.818=0.182. Note that this has preserved the true/false ratio $0.818/0.182 = 0.9/0.2$.

Similarly the second AddNoise term has value 0.1 when sql is true and the value 0.8 when sql is false, so can be replaced by a Bernoulli term where the probability of true is $\frac{0.1}{0.1+0.8} = 0.111$. The final AddNoise term can also be replaced, giving:

For the deterministic And factor, we need to consider the four cases where the factor is not zero (which we saw in Panel 2.1) and plug in the Bernoulli(0.111) distributions for sql and hasSkills in each case: Table 2.3Evaluation of the last three terms in (2.13). Each row of the table corresponds to one of the four cases where the And factor is 1 (rather than 0). The first three columns give the values of csharp, sql and hasSkills, which is just the truth table for AND. The next two columns give the corresponding values of the Bernoulli distributions for sql and hasSkills and the final column multiplies these together.

Looking at Table 2.3, we can see that when csharp is true, either both sql and hasSkills are false (with probability 0.790) or both sql and hasSkills are true (with probability 0.012). The sum of these is 0.802. When csharp is false, the corresponding sum is $0.790+0.099=0.889$. So we can replace the last three terms by a Bernoulli term with parameter $\frac{0.802}{0.802+0.889} = 0.474$.

Now we have a product of Bernoulli distributions, so we can use (2.9) to multiply them together. When csharp is true, this product has value $0.818\times0.474 = 0.388$. When csharp is false, the value is $(1-0.818)\times(1-0.474) = 0.096$. Therefore, the product of these two distributions is a Bernoulli whose parameter is $\frac{0.388}{0.388+0.096}$:

So we have calculated that the posterior probability that our candidate has the csharp skill to be 80.2%. If we work through a similar calculation for the sql skill, we find the posterior probability is 3.4%. Together these probabilities say that the candidate is likely to know C# but is unlikely to know SQL, which seems like a very reasonable inference given that the candidate only got the C# question right.

### Doing inference by passing messages on the graph

Inference deep-dive

Doing inference calculations manually takes a long time and it is easy to make mistakes. Instead, we can do the same calculation mechanically by using a message passing algorithm. This works by passing messages along the edges of the factor graph, where a message is a probability distribution over the variable that the edge is connected to. We will see that using a message passing algorithm lets us do the inference calculations automatically – a huge advantage of the model-based approach!

In this optional section, we show how inference can be performed using a message passing algorithm called belief propagation. If you want to focus on modelling, feel free to skip this section.

Let us redo the inference calculation for the csharp skill using message passing – we’ll describe the message passing process for this example first, and then look at the general form later on. The first step in the manual calculation was to fix the values of the observed variables. Using message passing, this corresponds to each observed variable sending out a message which is a point mass distribution at the observed value. In our case, each isCorrect variable sends the point mass Bernoulli(0) if it is observed to be false or the point mass Bernoulli(1) if it is observed to be true. This means that the three messages sent are as shown in Figure 2.6. Figure 2.6The messages sent from the observed variable nodes, which are shown shaded and labelled with their observed values. The message on any edge is a distribution over the variable that the edge is connected to. For example, the left hand $\dist{Bern}(1)$ is short for $\dist{Bernoulli}(\var{isCorrect1};1)$.

These point mass messages then arrive at the AddNoise factor nodes. The outgoing messages at each factor node can be computed separately as follows:

• The message up from the first AddNoise factor to csharp can be computed by writing $\mathrm{AddNoise}(\var{isCorrect1=\state{T}} | \var{csharp})$ as a Bernoulli distribution over csharp. As we saw in the last section, the parameter of the Bernoulli is $p=\frac{0.9}{0.9+0.2}=0.818$, so the upward message is $\dist{Bernoulli}(0.818)$.
• The message up from the second AddNoise factor to sql can be computed by writing $\mathrm{AddNoise}(\var{isCorrect2=\state{F}} | \var{sql})$ as a Bernoulli distribution over sql. The parameter of the Bernoulli is $p=\frac{0.1}{0.1+0.8}=0.111$, so the upward message is $\dist{Bernoulli}(0.111)$.
• The message up from the third AddNoise factor to hasSkills is the same as the second message, since it is computed for the same factor with the same incoming message. Hence, the third upward message is also $Bernoulli(0.111)$. Figure 2.7Outgoing messages from the AddNoise factor nodes.

Note that these three messages are exactly the three Bernoulli distributions we saw in in (2.13). Rather than working on the entire joint distribution, we have broken down the calculation into simple, repeatable message computations at the nodes in the factor graph.

The messages down from the Bernoulli(0.5) prior factors are just the prior distributions themselves: Figure 2.8Messages from the Bernoulli prior factor nodes.

The outgoing message for any variable node is the product of the incoming messages on the other edges connected to that node. For the sql variable node we now have incoming messages on two edges, which means we can compute the outgoing message towards the And factor. This is Bernoulli(0.111) since the upward message is unchanged by multiplying by the uniform downward message Bernoulli(0.5). The hasSkills variable node is even simpler: since there is only one incoming message, the outgoing message is just a copy of it. Figure 2.9Messages out of the sql and hasSkills variable nodes.

Finally, we can compute the outgoing message from the And factor to the csharp variable. This is computed by multiplying the incoming messages by the factor function and summing over all variables other than the one being sent to (so we sum over sql and hasSkills):

The summation gives the message Bernoulli(0.474), as we saw in equation (2.14). Figure 2.10The final message toward the csharp variable node.

We now have all three incoming messages at the csharp variable node, which means we are ready to compute its posterior marginal. This is achieved by multiplying together the three messages – this is the calculation we performed in equation (2.14) and hence gives the same result Bernoulli(0.802) or 80.2%.

To compute the marginal for sql, we can re-use most of the messages we just calculated and so only need to compute two additional messages (shown in Figure 2.11). The first message, from csharp to the And factor, is the product of Bernoulli(0.818) and the uniform distribution Bernoulli(0.5), so the result is also Bernoulli(0.818).

The second message is from the And factor to sql. Again, we compute it by multiplying the incoming messages by the factor function and summing over all variables other than the one being sent to (so we sum over csharp and hasSkills):

The summation gives the message Bernoulli(0.221), so the two new messages we have computed are those shown in Figure 2.11. Figure 2.11Additional messages needed to compute the marginal for the sql variable.

Multiplying this message into sql with the upward message from the AddNoise factor gives $\dist{Bernoulli}(0.111)\times\dist{Bernoulli}(0.221)\propto\dist{Bernoulli}(0.034)$ or 3.4%, the same result as before. Note that again we have ignored the uniform $\dist{Bernoulli}(0.5)$ message from the prior, since multiplying by a uniform distribution has no effect.

The message passing procedure we just saw arises from applying an algorithm called belief propagation [Pearl, 1988; Lauritzen and Spiegelhalter, 1988]. In belief propagation, messages are computed in one of three ways, depending on whether the message is coming from a factor node, an observed variable node or an unobserved variable node. The full algorithm is given in Algorithm 2.1. The derivation of this algorithm can be found in Bishop .

Algorithm 2.1Belief Propagation
factor graph, list of target variables to compute marginal distributions for.
marginal distributions for target variables.
node in the factor graph
edge connected to the node
If all needed incoming messages are available send the appropriate outgoing message below:
- Variable node message: the product of all messages received on the other edges;
- Factor node message: the product of all messages received on the other edges, multiplied by the factor function and summed over all variables except the one being sent to;
- Observed node message: a point mass at the observed value;
Compute marginal distributions as the product of all incoming messages at each target variable node.

### Using belief propagation to test out the model

The belief propagation algorithm allows us to do inference calculations entirely automatically for a given factor graph. This means that it is possible to completely automate the process of answering an inference query without writing any code or doing any hand calculation!

Using belief propagation, we can test out our model fully by automatically inferring the marginal distributions for the skills for every possible configuration of correct and incorrect answers. The results of doing this are shown in Table 2.4. Table 2.4The posterior probabilities for the csharp and sql variables for all possible configurations of isCorrect. As before, the blue bars give a visual representation of the inferred probabilities.

Inspecting this table, we can see that the results appear to be sensible – the probability of having the csharp skill is generally higher when the candidate got the first question correct and similarly the probability of having the sql skill is generally higher when the candidate got the second question correct. Also, both probabilities are higher when the third question is correct rather than incorrect.

Interestingly, the probability of having the sql skill is actually lower when only the first question is correct, than where the candidate got all the questions wrong (first and second rows of Table 2.4). This makes sense because getting the first question right means the candidate probably has the csharp skill, which makes it even more likely that the explanation for getting the third question wrong is that they didn’t have the sql skill. This is an example of the kind of subtle reasoning which model-based machine learning can achieve, which can give it an advantage over simpler approaches. For example, if we just used the number of questions needing a particular skill that a person got right as an indicator of that skill, we would be ignoring potentially useful information coming from the other questions. In contrast, using a suitable model, we have exploited the fact that getting a csharp question right can actually decrease the probability of having the sql skill.

Self assessment 2.2

The following exercises will help embed the concepts you have learned in this section. It may help to refer back to the text or to the concept summary below.

1. Compute the product of the following pairs of Bernoulli distributions
1. $\mathrm{Bernoulli}(\var{x};0.3) \times \mathrm{Bernoulli}(\var{x};0.9)$
2. $\mathrm{Bernoulli}(\var{x};0.5) \times \mathrm{Bernoulli}(\var{x};0.2)$
3. $\mathrm{Bernoulli}(\var{x};0.5) \times \mathrm{Bernoulli}(\var{x};0.3)$
4. $\mathrm{Bernoulli}(\var{x};1.0) \times \mathrm{Bernoulli}(\var{x};0.2)$
5. $\mathrm{Bernoulli}(\var{x};1.0) \times \mathrm{Bernoulli}(\var{x};0.3)$
Why can we not compute $\mathrm{Bernoulli}(\var{x};1.0) \times \mathrm{Bernoulli}(\var{x};0.0)$?

2. Write a program (or create a spreadsheet) to print out pairs of samples from two Bernoulli distributions with different parameters $a$ and $b$. Now filter the output of the program to only show samples pairs which have the same value (i.e. where both samples are true or where both are false). Print out the fraction of these samples which are true. This process corresponds to multiplying the two Bernoulli distributions together and so the resulting fraction should be close to the value given by equation (2.9).

Use your program to (approximately) verify your answers to the previous question. What does your program do when $a=0.0$ and $b=1.0$?

3. Manually compute the posterior probability for the sql skill, as we did for the csharp skill in section 2.2, and show that it comes to 3.4%.
4. Build this model in Infer.NET and reproduce the results in Table 2.4. For examples of how to construct a conditional probability table, it may be useful to refer to the wet grass/sprinkler/rain example in this thread. You will also need to use the Infer.NET & operator for the And factor. This exercise demonstrates how inference calculations can be performed completely automatically given a model definition.

inference queryA query which defines an inference calculation to be done on a probabilistic model. It consists of the set of variables whose values we know (along with those values) and another set of variables that we wish to infer posterior distributions for. An example of an inference query is if we may know that the variable weapon takes the value revolver and wish to infer the posterior distribution over the variable murderer.

product of distributionsAn operation which multiplies two (or more) probability distributions and then normalizes the result to sum to 1, giving a new probability distribution. This operation should not be confused with multiplying two different random variables together (which may happen using a deterministic factor in a model). Instead, a product of distributions involves two distributions over the same random variable. Products of distributions are used frequently during inference to combine multiple pieces of uncertain information about a particular variable which have come from different sources.

message passing algorithmAn algorithm for doing inference calculations by passing messages over the edges of a graphical model, such as a factor graph. The messages are probability distributions over the variable that the edge is connected to. Belief propagation is a commonly used message passing algorithm.

belief propagationA message passing algorithm for computing posterior marginal distributions over variables in a factor graph. Belief propagation uses two different messages computations, one for messages from factors to variables and one for messages from variables to factors. Observed variables send point mass messages. See Algorithm 2.1.

Next: Loopiness
References

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco.

[Lauritzen and Spiegelhalter, 1988] Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society, Series B, 50(2):157–224.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.