## 3.2 Inferring the players’ skills

The discussion so far has assumed that we know the skills of Jill and Fred. These skills were then used to compute the probability that each of the players would be the winner. In practice, we have to reason backwards: we observe the outcome of the game and we wish to use this information to learn something about the skill values of the players. We therefore turn to the problem of assessing player skill.

The original Xbox was launched in 2001, and was followed a year later by an online gaming service Xbox Live. This required a matchmaking feature which in turn needed a way of determining the relative skills of players. To address this requirement, the Xbox online service adopted a widely used algorithm called Elo [Elo, 1978], which is used (with various modifications) by most of the major international chess organizations as well as by representative bodies of many other sports. The Elo algorithm is summarized in Panel 3.2.

The Elo algorithm is widely used in chess.

Panel 3.2The Elo Algorithm
The skill of a player is an uncertain quantity and so we should model this using a probability distribution. However, it is instructive to see first what happens if we simply treat the skill as a number whose value gets updated when we learn the outcome of a game. We will see how far we can get, and what sort of problems we encounter, and this will help us understand some of the benefits of using a model-based approach.

Given the outcome of a game it would seem reasonable to increase the skill value for the winner and decrease the skill value for the loser. What is less clear, however, is how big an adjustment we should make. Intuitively we can reason as follows. Suppose that Jill is the winner of the game. If Jill’s skill is significantly higher than Fred’s, then it is unsurprising that Jill should be the winner, and so the change in skill values should be relatively small. If the skills are similar then a larger change would be justified. However, if Jill’s skill is significantly less than Fred’s then the game outcome is very surprising. The outcome suggests that our current assessments of the skill values are not very accurate, and therefore that we should make a much larger adjustment in skill values. Put concisely, the degree of surprise gives an indication of how big a change in skill values should be made.

Let us define the winner of a game to have a score of 1 and the loser to have a score of 0. If we now imagine Jill and Fred playing lots of games against each other then the average score for Jill after playing lots of games against Fred is the same as the probability that she will win. This can be obtained from equation (3.10), and is given by [Elo, 1978]:

where, as before, $\var{perfSD}$ is an arbitrary constant which sets the scale of the skill variables. We refer to this average score as the expected score for Jill. Similarly, the expected score for Fred is $(1 - \var{Jscore})$. We can then define the degree of surprise as the difference between the actual score of a player against a particular opponent and the expected score for that game. Thus, if it turns out that Jill is the winner, then the surprise for Jill is $(1 - \var{Jscore})$ while for Fred it is $0 - (1 - \var{Jscore})$. The change in skill value can then be made proportional to this degree of surprise, so that
Here $K$ is an arbitrary coefficient which determines how much the players skills change as a result of each game.

Elo has some limitations which are relevant to our matchmaking problem. For instance, it does not apply to team games, or to games involving more than two players. Furthermore, because Elo is an algorithm, rather than a model, it is not immediately obvious how it should be modified in order to overcome such limitations. By contrast, in our probabilistic modelling approach, when the assumptions of the model are not satisfied, we change the assumptions and then construct the corresponding modified model. We shall see how to do this in practice in section 3.4.

One merit of the Elo system is that it is relatively easy for players to compute the skill updates for themselves. In Xbox Live, however, the skill calculations are automated – and so more complex but more accurate methods become possible. Furthermore, the Elo algorithm does not handle more than two players, nor does it apply to games played by teams of players, and so it was not directly applicable to many of the games on Xbox Live. There was a significant motivation, therefore, to find improved methods to assess player skills which are able to give accurate skill values after relatively few games have been played. When the successor games console the Xbox 360 was launched in 2005 it was decided to replace the Elo algorithm with an approach built on a probabilistic model of the skill assessment problem, known as the TrueSkill® system [Herbrich et al., 2007]. TrueSkill has been used as the skill rating and matchmaking system on Xbox Live continuously since 2005, processing the outcomes of millions of games per day. As well as overcoming the above limitations of the Elo system, the modelling of uncertainty in the players’ skills in TrueSkill leads to a significant improvement in the efficiency with which skill is assessed [Herbrich et al., 2007]. In the remainder of this chapter we explain how the TrueSkill model was constructed. For more details of the mathematics behind TrueSkill see [Moser, 2010].

### A probabilistic model: TrueSkill

We have already noted that skill is an uncertain quantity, and should therefore be included in the model as a random variable. We need to define a suitable prior distribution for this variable. This distribution captures our prior knowledge about a player’s skill before they have played any games. Since we know very little about a player before they play any games, this distribution needs to be broad and cover the full range of skills that a new player might have. Because skill is a continuous variable we can once again use a Gaussian distribution to define this prior. This represents a modification to our first modelling assumption, which becomes:

1. Each player has a skill value, represented by a continuous variable with a broad prior distribution.

For a new player, this distribution will represent our (significant) uncertainty in their skill value. Again, we make this modelling assumption precise through the choice of a Gaussian distribution. Once a new player has played a game, we aim to use the outcome of the game to infer the updated skill distribution for the player (and also for any other players in the game). This involves solving a probabilistic inference problem to calculate the posterior distribution of each player’s skill, taking account of the new information provided by the result of the game. Although the prior distribution is Gaussian, the corresponding posterior distribution may not be Gaussian. However, we shall see that there is considerable benefit in approximating the exact posterior distribution by a Gaussian. This Gaussian posterior will act as the effective prior distribution for the next game played by that player. We will discuss this idea in greater detail later in the chapter.

For the moment, let us return to our game of Halo between Jill and Fred. We will suppose that some games have already been played, and that the uncertainty in the skills of our two players are represented by Gaussian distributions. Earlier in this chapter we used skill values for Jill and Fred of 15 and 12.5 respectively, which are quite small in relation to the standard deviation in performance of 5, and therefore help to illustrate the influence of these quantities on the probability of a win for Jill. For the remainder of the chapter we shall use values which are more realistic for a pair of real players and so we assume the mean for Jill is 120 while that for Fred 100. We will also suppose that standard deviation of the skill distribution for Jill is 40, while for Fred it is 5. This would typically arise if Jill is a relatively new player and so there is a lot of uncertainty in her skill whereas Fred is a more established player whose skill is more precisely known. We must therefore extend our model for a specific game of Halo by introducing two more uncertain variables Jskill and Fskill: the skills of Jill and Fred. Each of these variables has its own Gaussian distribution and therefore its own factor in the factor graph. The factor graph for our extended model is shown in Figure 3.10.

At this point it is convenient to summarise the assumptions that are encoded in the model represented in Figure 3.10. They are all shown together in Figure 3.11.

1. Each player has a skill value, represented by a continuous variable with a broad Gaussian distribution.
2. Each player has a performance value for each game, which varies from game to game such that the average value is equal to the skill of that player. The variation in performance, which is the same for all players, is symmetrically distributed around the mean value and is more likely to be close to the mean than to be far from the mean.
3. The player with the higher performance value wins the game.
Figure 3.11The three assumptions encoded in our model.

Having stated our modelling assumptions explicitly, it is worth taking a moment to review them. Assumption 1 says that a player’s ability at a particular type of game can be expressed as a single continuous variable. This seems reasonable for most situations, but we could imagine a more complex description of a player’s abilities which, for example, distinguishes between their skill in attack and their skill at defence. This might be important in team games (discussed later) where a strong team may require a balance of players with strong attack skills and those with good defensive skills. We also assumed a Gaussian prior for the skill variable. This is the simplest probabilistic model we could have for a continuous skill variable, and it brings some nice analytical and engineering properties. However, if we looked at the skills of a large population of players we might find a rather non-Gaussian distribution of skills, for example, new players may often have low skill but, if they have played a similar game before, may occasionally have a high skill.

Similarly, Assumption 2 considers a single performance variable and again assumes it has a Gaussian distribution. It might well be the case that players can sometimes have a seriously ‘off’ day when their performance is way below their skill value, while it would be very unlikely for a player to perform dramatically higher than their skill value. This suggests that the performance distribution might be non-symmetrical. Another aspect that could be improved is the assumption that the variance is the same for all players – it is likely that some players are more consistent than others and so would have correspondingly lower variance.

Finally, Assumption 3 says that the game outcome is determined purely by the performance values. If we had introduced multiple variables to characterize the skill of a player, there would presumably each have a corresponding performance variable (such as how the player performed in attack or defence), and we would need to define how these would be combined in order to determine the outcome of a game.

### Inference in the TrueSkill model

Inference deep-dive

In this optional section, we see how to do exact inference in the model as defined so far, and then we see why exact inference is not useable in practice. If you want to focus on modelling, feel free to skip this section.

Now that we have the factor graph describing our model, we can set the variable Jwins according to the observed game outcome and run inference in order to compute the marginal posterior distributions of the skill variables Jskill and Fskill. The graph has a tree structure (there are no loops) and so we have already seen in Chapter 2 that we can solve this problem using belief propagation.

Consider the evaluation of the posterior distribution for Jskill in the case where Jill won the game (Jwins is true). Using the belief propagation algorithm we have to evaluate the messages shown in Figure 3.12.

Message (1) is just given by the Gaussian factor itself. Similarly, message (2) is just the product of all incoming messages on other edges of the Fskill node, and since there is only one incoming message this is just copied to the output message. These first two messages are summarized in Figure 3.13.

Next we have to compute message (3) in Figure 3.12. The belief propagation algorithm tells us to multiply the incoming message (2) by the Gaussian factor and then sum over the variable Fskill. In this case the summation becomes an integration because Fskill is a continuous variable. We can gain some insight into this step by once again considering a generative viewpoint based on sampling. Imagine that we draw samples from the Gaussian distribution over Fskill. Each sample is a specific value of Fskill and forms the mean of a Gaussian distribution over Fperf. In Figure 3.14a we consider three samples of Fskill and plot the corresponding distributions over Fperf. To compute the desired outgoing message we then average these samples, giving the result shown in Figure 3.14b. This represents an approximation to the marginalization over Fskill, and would become exact if we considered an infinite number of samples instead of just three.

The sampling approximation becomes more accurate as we increase the number of samples, as shown in Figure 3.14c and Figure 3.14d. In this final figure the resulting distribution looks almost Gaussian. This is not a coincidence, and in fact the calculation of the outgoing message can be worked out exactly (see Equation (2.115) in Bishop [2006]) with the result that the outgoing message is also a Gaussian whose mean is the mean of the distribution of Fskill and whose variance is the sum of the variances of Fskill and Fperf: $5^2 + 5^2$. This process of ‘smearing’ out one Gaussian using the other Gaussian is an example of a mathematical operation called convolution. Message (4) is just a copy of message (3) as there is only one incoming message to the Fperf node. Since we observe that Jwins is true, message (5) is a Bernoulli point mass at the value true. These three messages are illustrated in Figure 3.15.

To compute message (6) we take the GreaterThan factor, multiply by the two incoming messages (4) and (5), and then sum over Jwins and integrate over Fperf. Since we are considering the case where Jill is the winner, message (5) evaluates to zero for Jwins = false and one for Jwins = true. The sum over Jwins then multiplies by zero those regions of the performance space which correspond to Fred being the winner. This is illustrated in Figure 3.16 which shows the result of multiplying the GreaterThan factor by the two incoming messages and then summing over Jwins.

Finally we have to integrate over Fperf in order to generate the outgoing message (6) which itself is a function of Jperf. This message is shown in Figure 3.17.

The reader should take a moment to confirm that shape of this function is what would be expected from integrating Figure 3.16 over the variable Fperf. Mathematically, for each value of Jperf a truncated Gaussian distribution is being integrated, this is equivalent to the evaluation of the cumulative Gaussian that we introduced back in equation (3.10), and so this message can be evaluated analytically (indeed, this is how Figure 3.17 was plotted).

We can also understand this message intuitively. Suppose we knew that Fperf was exactly 100, given that Jill won, this tells us that Jill’s performance Jperf must be some number greater than 100. This would mean a message in the form of a step, with the value zero below 100 and some positive constant above it. Since, we don’t know that Fperf is exactly 100, but only know that it is likely to be near 100, this smears out the step into the smooth function of Figure 3.17.

Message (7) is just a copy of message (6) since there is only one incoming message to the Jperf node. These messages are illustrated in Figure 3.18

Message (8) is found by multiplying the Gaussian factor describing the performance variability by the incoming message (7) and then integrating over Jperf. This again is a convolution, and has an exact solution again in the form of a cumulative Gaussian. Effectively it is a blurred version of the incoming cumulative Gaussian message which has been smeared out by the variance of the Gaussian performance factor. Finally, message (9) is the Gaussian distribution for the skill prior. These messages are summarized in Figure 3.19.

To obtain the marginal distribution of Jskill we then multiply messages (8) and (9). Because this is the product of a Gaussian and a cumulative Gaussian the result is a bump-like distribution which is not symmetrical, and therefore is not a Gaussian. These messages, and the resulting marginal distribution for Jskill, are shown in Figure 3.20.

### A problem with using exact inference

We seem to have solved the problem of finding the posterior distribution for Jskill. We can also pass messages in the opposite direction around the graph to obtain the corresponding posterior distribution for Fskill. These posterior distributions can be expressed exactly as the product of a Gaussian and a cumulative Gaussian. However, there is a major problem which becomes apparent if we imagine, say, Jill going on to play another game with a new player. Before the game with Fred, our uncertainty in the value of Jskill was expressed as a Gaussian distribution, which has two parameters (the mean and the variance). After she has played against Fred the corresponding posterior distribution is expressed as the product of a Gaussian and a cumulative Gaussian and therefore has four parameters, where the two additional parameters come from the cumulative Gaussian. Now suppose that Jill plays another game of Halo against Alice. We can again represent this by a factor graph similar to Figure 3.10, except that the factor describing our current uncertainty in Jskill is now the posterior distribution resulting from the game against Fred. When we run inference in this new graph, to take account of the outcome of the game against Alice, the new posterior marginal will be given by the product of messages into the node Jskill and will therefore consist of the original product of a Gaussian and a cumulative Gaussian times another cumulative Gaussian resulting from the game with Alice. This gives a posterior distribution for Jskill which now has six parameters. Each time Jill plays a game of Halo the distribution over her skill value requires two additional parameters to represent it, and the number of parameters continues to grow as she plays more and more games. This is not a practical approach to use in an engineering application.

Notice that this problem would not arise if the posterior distribution for the variable of interest had the same form as the prior distribution. In some probabilistic models we are able to choose a form of prior distribution, known as a conjugate prior, such that the posterior ends up having the same form as the prior. For example, suppose we have a model containing a random variable x for the probability-of-true parameter of a Bernoulli distribution. The conjugate prior for x is then the beta distribution, as explained in more detail in Panel 3.3. In our TrueSkill model, the presence of the GreaterThan factor, means that the Gaussian is not a conjugate distribution for Jskill. We must therefore introduce some form of approximation. In the next section, we will describe a powerful algorithm that extends belief propagation by allowing messages to be approximated even when they are not conjugate. This algorithm will not only solve the inference problem with this model, but turns out to be applicable to a wide variety of other probabilistic models as well.

Self assessment 3.2

The following exercises will help embed the concepts you have learned in this section. It may help to refer back to the text or to the concept summary below.

1. Reproduce Figure 3.14 by plotting the average of $K$ Gaussian distributions each with standard deviation is 5 and with mean is given by a sample from a $\dist{Gaussian}(100,5^2)$. Do this for $K=3$,$K=6$ and $K=100$.
2. Referring to Panel 3.3, use Bayes’ theorem to compute the posterior distribution over $x$ (the probability of clicking on the buy button) given that $N=20$ people do click on the button but $M=100$ people do not click on it. Assume a $\dist{Beta}(1,1)$ prior distribution. Notice that this is a conjugate prior and so the posterior distribution is also a beta distribution.
3. [Advanced] Show that the convolution of two Gaussian distributions is also a Gaussian distribution whose variance is the sum of the variances of the two original distributions. Section 2.3.3 in Bishop [2006] should help.

convolutionThe convolution of a function $f$ with a function $g$ measures the overlap between $f$ and a version of $g$ which is translated by an amount $a$. It is expressed as a function of $a$.

conjugate distributionFor a given likelihood function, a prior distribution is said to be conjugate if the corresponding posterior distribution has the same functional form as the prior.

References

[Elo, 1978] Elo, A. E. (1978). The Rating of Chessplayers, Past and Present. Arco Pub., New York.

[Herbrich et al., 2007] Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill(TM): A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems 20, pages 569–576. MIT Press.

[Moser, 2010] Moser, J. (2010). The Math behind TrueSkill.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.