3.3 A solution: expectation propagation
In this optional section, we introduce the approximate inference technique of expectation propagation, which we will use extensively in this book. If you want to focus on modelling, feel free to skip this section.
We have seen that belief propagation allows us to calculate the exact marginal posterior distribution for the variable Jskill in the model described by Figure 3.10. Whereas the prior distribution for Jskill is a Gaussian described by two parameters, the posterior distribution requires four parameters and is non-Gaussian. To solve the problem of the proliferation of parameters we now need a way to approximate this true posterior by a distribution having a fixed number of parameters, and for this we choose the Gaussian. The posterior distribution will then have the same functional form as the prior, mimicking the behaviour of a conjugate prior. If we can achieve this we will be able to treat the resulting approximate posterior distribution as the prior distribution for the next game, and so the skill for each player will at all times be represented by a Gaussian distribution governed by just two parameters.
The first question is how to approximate a non-Gaussian distribution by a Gaussian. A simple solution is to find the mean and the variance of the non-Gaussian distribution and then to choose as our approximation a Gaussian having the same mean and variance. This turns out to be a sensible approximation, which can be derived formally by optimizing a measure of the dissimilarity of two probability distributions [Bishop, 2006; Minka, 2005].
We might be tempted then just to approximate the exact posterior distribution for Jskill by a Gaussian. Although this will work satisfactorily for the factor graph of Figure 3.10 it will break down again as we go to more complex factor graphs (such as those we will encounter later in this chapter). Messages with simple functional forms tend to become more complex as a result of passing through factors. As we extend our model to larger and more sophisticated graphs we quickly arrive at situations where messages cannot be evaluated exactly. Such problems can be avoided by making our approximations locally at each factor node, so that all messages have the desired distribution type. This ensures that factors can be composed together into arbitrary graphs, as long as each factor is capable of sending approximate messages to all neighbouring variable nodes using the appropriate types of distribution. Returning to Figure 3.12, which is reproduced in Figure 3.21 for convenience, we see that message (6) was the first message which we encountered which was non-Gaussian.
Our goal is therefore to approximate message (6) by a Gaussian, thereby ensuring that all subsequent messages will also be Gaussian distributions.
While this seems like a desirable goal, there also seems to be a significant obstacle if we consider the exact form of message (6) as seen in Figure 3.17. This message does not look at all Gaussian, and in fact its mean and variance are not well defined (they are both infinite). The key to finding a sensible Gaussian approximation is to notice that the approximate version of message (6) will subsequently be passed through the graph as modified forms of messages (7) and (8) and will then be multiplied by the downward message (9) in order to determine the (approximate) posterior distribution of Jskill. Our goal will therefore be to make the Gaussian approximation to message (6) over Jperf be most accurate in those regions which are considered more probable by the information coming from other parts of the graph. As we have just discussed, however, we need to keep our approximation local to the region of the graph where the message is generated. Message (6) is sent to the node Jperf and so we can choose our approximation so as to maximize the accuracy of the marginal distribution of Jperf. This is obtained by multiplying message (6) by the downward message on the same edge in the graph, which can be evaluated as shown in Figure 3.22. Note that these same messages are needed to find the posterior marginal for Fskill, so there is no additional overhead introduced by evaluating them.
Here denotes the exact message (6) as seen previous in Figure 3.20, denotes the downward ‘context’ message, and denotes our desired Gaussian approximation to message . These messages are all just functions of the variable Jperf. We have already seen that we cannot simply approximate message by a Gaussian since message has infinite mean and variance. Instead we make a Gaussian approximation for the marginal distribution of Jperf. The exact marginal is given by the product of incoming messages . We therefore define our approximate message to be such that the product of the messages and gives a marginal distribution for Jperf which is a best Gaussian approximation to the true marginal, so that
Here Proj() denotes ‘projection’ and represents the process of replacing a non-Gaussian distribution with a Gaussian having the same mean and variance. This can be viewed as projecting the exact message onto the ‘nearest’ message within the family of Gaussian distributions. Dividing both sides by we then obtain
We therefore find a Gaussian approximation to the exact message (6) as follows. First we compute the exact outgoing message (6) as before. This is shown in blue in Figure 3.24.
Then we multiply this by the incoming message context message which is shown in red in Figure 3.24. This gives a distribution, shown in green in Figure 3.24 which is non-Gaussian but which is localised and therefore has finite mean and variance and so can be approximated by a Gaussian. This curve is repeated in Figure 3.25 which also shows the Gaussian distribution which has the same mean and variance.
Finally, we divide this Gaussian distribution by the incoming context message to generate our approximate outgoing message. Because both of these functions are Gaussian, and because the ratio of two Gaussians is itself a Gaussian [Bishop, 2006], the resulting outgoing message will be Gaussian, which was our original goal. For our specific example, this message is a Gaussian with mean and standard deviation . The computation of the approximate message is summarised in Figure 3.26.
We see that overall we multiplied by the incoming context message, then made the Gaussian approximation, then finally divided out the context message again. The evidence provided by the incoming message is therefore used only to determine the region over which the Gaussian approximation should be accurate, but is not directly incorporated into the approximated message. If we happened to have a conjugate distribution, then the projection operation would be unnecessary and the context message would have no effect.
This approach of using an incoming message to provide the context in which to approximate the corresponding outgoing message is known as expectation propagation (or EP). We see that the approximation is being made locally at the factor node, and in a way that is independent of the structure of the remainder of the graph. This technique can therefore be applied, without modification, to arbitrarily structured graphs as long as each factor is consistently sending and receiving messages with the required distribution types, in this case Gaussians. The expectation propagation algorithm is summarised in Algorithm 3.1.
- Variable node message: the product of all messages received on the other edges;
- Factor node message: Compute the belief propagation message (see Algorithm 2.1). Multiply by the context message (the message coming towards the factor on this edge). Project into the desired distribution type for this edge using moment matching. Divide out the context message.
- Observed node message: a point mass at the observed value;
Now that we have found a suitable Gaussian approximation to the outgoing message (6) we can continue to pass messages along the graph to give the corresponding approximate message (7) as shown in Figure 3.27.
The downward message (9) is unchanged, and so we can finally compute the Gaussian approximation to the posterior distribution of Jskill as the product of two Gaussians, which gives the final result of a Gaussian with mean 140.1 and standard deviation 28.5.
Applying expectation propagation
Let’s see what happens to the skill distributions when we apply expectation propagation to our game of Halo between Jill and Fred. First we suppose that Jill is the winner of the game. In Figure 3.29 we see the prior and posterior distributions of skill for Jill and Fred.
Because Jill is the winner, the mean of the skill distribution for Jill increases, while the mean of the skill distribution for Fred decreases. The change in mean is much larger for Fred than for Jill due to the greater certainty in Jskill compared to that for Fskill. This is a crucial difference compared to Elo, where the changes in skill estimates are equal and opposite for the two players, and arises because we are modelling the uncertainty in each player’s skill rather than using a point estimate as in Elo. We also see from Figure 3.29 that the standard deviation for both distributions decreases as a result of this game. This is because we have learned something about the skills of both players and therefore the degree of uncertainty for both of them is reduced.
Alternatively, if Fred were to have won the game, we have the results shown in Figure 3.30,
This result is much more surprising, since we believed Jill to be the stronger player. Intuitively we would expect the adjustments of the skill distributions therefore to be greater, and this indeed is born out. We see that the shift in the means of the distributions is larger than in Figure 3.29. In fact the change in the mean of the distribution of Fskill is so large that it is now greater than the mean of Jskill. Again, the standard deviations of both distributions have decreased reflecting a reduction in uncertainty due to the incorporation of new evidence.
Because the skill updates in TrueSkill model depend on the variance of the skill distribution, TrueSkill is able to make relatively large changes to the distributions of new players. Furthermore, this happens automatically as a consequence of running inference in our model. By contrast, the updates in Elo are governed by the update parameter . In practice, this problem is addressed in Elo by altering the value of the update parameter according to the number of games played. For instance, FIDE (the World Chess Federation) uses for new players (this was increased from the previous value of 25 in July 2011 to accelerate the rating changes for early players). Once a player has played 30 games, this switches to as long as their rating remains below 2,400, and becomes once players have achieved a rating of 2,400. By tracking uncertainty in a model-based approach, we can avoid the need for such ad-hoc parameter changes.
So far in this section we have developed a probabilistic model for a single game of Halo between Jill and Fred. In practice we will have a large pool of players, and individual games will take place between pairs of players from within that pool. When we try to assess the skill of a player we potentially have available the results of all the games ever played by that player against a range of different opponents. We might also have available the results of all the games played by those opponents, many of which might involve yet other players, and so on. In principle, all of this information is relevant and could help us to assess the original player’s skill. Furthermore, every time there is a new game outcome we could include this additional information and update the skill of the player even if they themselves haven’t played any new games. This new information could be relevant even if it involves a game between other players since it could influence the assessment of their skills, and hence the relative skill of our player.
We could in principle handle this by constructing a very large factor graph expressing all of the games played so far. Each player would have a single variable representing their skill value, but multiple variables (one for each game they have played) representing their performances on each of the games. This would be a complex graph with multiple loops, and we could run loopy belief propagation, using the expectation propagation approximation, to keep the messages within the Gaussian family of distributions, until a suitable convergence criterion is satisfied. This would give a posterior skill distribution for each of the players, which takes account of all of the games played. If a new game is then played we would start again with a new, larger factor graph and re-run inference in order to obtain the new posterior distributions of every player.
In practice, an approach which takes account of all previous game outcomes in order to update the skills of all the players would be completely infeasible. Instead we can use an approximate inference approach known as online learning (sometimes called filtering) in which each player’s skill distribution gets updated only when a game outcome is obtained which involves that player. We therefore need only store the mean and variance of the Gaussian skill distribution for each player. When a player plays a new game, we run inference using this current Gaussian skill distribution as the prior, and the resulting posterior distribution is then stored and forms the prior for the next game. Each single game is therefore described by a graph of the form shown in Figure 3.10.
This particular form of online inference algorithm, based on local projection onto the Gaussian distribution in which each data point (i.e. game outcome) is used only once, is also known as Gaussian Density Filtering [Maybeck, 1982; Opper, 1998]. It can be viewed as a special case of expectation propagation in which a specific choice is made for the message-passing schedule: namely that messages are only passed forwards in time from older games to newer games, but never in the reverse direction.
It is worth noting that, if we consider the full factor graph describing all games played so far, then the order in which those games had been played would have been irrelevant. When doing online learning, however, the ordering becomes significant and can influence the assessed skills. We have to live with this, however, as only online learning would be feasible in a practical system.
We can illustrate the behaviour of online learning in our model using data taken from the game Halo 2 on Xbox Live. We use a data set involving 1,650 players which contains the outcomes of 5,915 games. Each game is a head-to-head contest in which a pair of players play against each other. Figure 3.31 shows how the skill distributions for two of the top players varies as a function of the number of game outcomes played by each of the two players.
We see that the initial skill distributions are the same, because all player skills have the same prior distribution before any games are played. As an increasing number of games are played, we see that the standard deviation of the skill distributions decreases. This reduction in uncertainty as a result of observing the outcome of games is the effect we saw earlier in Figure 3.29 and Figure 3.30.
The model we have constructed in this section represents a single game between two players. However, many games on Xbox Live have a more elaborate structure, and so we turn next to a number of model extensions which take account of these additional complexities.
The following exercises will help embed the concepts you have learned in this section. It may help to refer back to the text or to the concept summary below.
- Reproduce Figure 3.24 by evaluating the (red) Gaussian context message and the (blue) exact CumGauss message at Jperf values of -50, -49 … 0,1,2 … 299,300. Plot the two lines you get with Jperf on the x axis and the evaluated messages on the y axis. You will need to rescale the CumGauss message to get it to fit (remember that the scale of this message does not matter since it is an improper distribution). To get the (green) product message corresponding to the exact marginal for Jperf, first multiply your two messages together at each Jperf value. Then rescale the result so that the area under the line is 1 (you can achieve this roughly by rescaling to make the sum of the value at each point equal to 1). Plot this result as a third line on your axes.
- Compute the mean and standard deviation of the exact marginal product message that you just computed. The mean can be well approximated by summing the product of the message at each point times the Jperf value at each point. The variance (which is the square of the standard deviation) can be approximated similarly using the mnemonic “the mean of the square minus the square of the mean” (see Wikipedia). First, you need to compute the “mean of the square” which can be approximated by sum of the product of the message at each point times the square of the Jperf value at each point. Then subtract off “the square of the mean” which refers to the mean you just computed. This gives the variance, which you can take the square root of to get the standard deviation. You have now computed the mean and standard deviation of the Gaussian approximation to the marginal for Jperf. You can check your result against the Gaussian in Figure 3.25.
- Finally, we need to divide this Gaussian distribution (whose mean and standard deviation you just found in the previous exercise) by the Gaussian context message. You can refer to [Bishop, 2006] for how to do this. You can check your result against message (6) in Figure 3.27. Congratulations! You have now successfully calculated an expectation propagation message!
- Now we can use Infer.NET to do the expectation propagation calculations for us. Implement the Trueskill model in Infer.NET, setting the skill distributions for Jill and Fred to the ones used in this section (this How to guide might help). Compute the posterior marginal distributions for Jill and Fred for the two outcomes where Jill wins the game and where Fred wins the game. Plot your results and check them against Figure 3.29 and Figure 3.30.
expectation propagationAn approximate message-passing algorithm that extends belief propagation by allowing messages to be approximated by the closest distribution in a particular family, such as a Gaussian distribution. This approximation is done either to ensure that the inference algorithm remains tractable or to speed up the inference process. See Algorithm 3.1.
online learningAn approach to machine learning in which data points are considered one at a time, with model parameter distributions updated after each data point.
[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
[Minka, 2005] Minka, T. (2005). Divergence measures and message passing. Technical report, Microsoft Research.
[Herbrich et al., 2007] Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill(TM): A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press.
[Moser, 2010] Moser, J. (2010). The Math behind TrueSkill.
[Maybeck, 1982] Maybeck, P. S. (1982). Stochastic models, estimation, and control. In Volume 2, volume 141, Part 2 of Mathematics in Science and Engineering, chapter Chapter 12 Nonlinear estimation, pages 212–271. Elsevier.
[Opper, 1998] Opper, M. (1998). A Bayesian approach to on-line learning. In Saad, D., editor, On-line Learning in Neural Networks, chapter A Bayesian Approach to On-line Learning, pages 363–378. Cambridge University Press, New York, NY, USA.