This season Soccernomics has been posting weekly predictions for games across 25 European leagues – you can see the predictions here. This is the work of Guy Wilkinson, who completed his PhD at the University of Michigan in 2017 and is now an assistant professor at the University of Stirling.
This post is about the methodology behind the research, and how it relates to the evolving field of soccer analytics. If you are interested in the full details of how Guy’s predictions are generated, you can read the relevant chapter from his PhD here.
The basic idea of the model is that every player on the field contributes to the result, and that therefore an index of individual ability can be constructed by crediting players based on the results of each game played. That’s like saying a drug should be rated according to its outcomes for the patients who take it. The efficacy of a drug is measured by the sum of outcomes for individual patients who take it; the efficacy of a player is the sum of outcomes for games in which they played.
Once you know the efficacy of each player, you can predict the outcome of the next game they are expected to play in: if the sum of efficacies for team A is greater than that of team B, then team A is predicted to win.
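The prediction rule described above can be sketched in a few lines. The player names and efficacy numbers below are made-up illustrations, not real estimates from the model:

```python
# Minimal sketch of the prediction rule: sum each team's player efficacies
# and pick the side with the larger total. Values here are hypothetical.

def predict(team_a, team_b, efficacy):
    """Predict the result by comparing summed player efficacies."""
    score_a = sum(efficacy[p] for p in team_a)
    score_b = sum(efficacy[p] for p in team_b)
    if score_a > score_b:
        return "A wins"
    if score_b > score_a:
        return "B wins"
    return "draw"

efficacy = {"p1": 0.8, "p2": 0.5, "p3": 0.9,
            "q1": 0.4, "q2": 0.7, "q3": 0.6}
print(predict(["p1", "p2", "p3"], ["q1", "q2", "q3"], efficacy))
# team A totals 2.2 against team B's 1.7, so A is predicted to win
```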
It’s a simple enough idea; the difficulty lies in:
- collecting the data on games, players and results
- estimating the model
The first step is an exercise in data-scraping – all this material is now on the web. The second stage is difficult because of the scale of the problem. To estimate the model which best describes the contribution of each player, you need to run an optimization routine. This is easy with present-day computing power if you have a few thousand players and a few hundred games, but Guy was working with over 66,000 games and 133,000 team line-ups for these games, covering the 25 leagues over the last decade.
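One common way to set up this kind of estimation – an illustration, not Guy’s actual procedure – is to treat each player as a column of a design matrix, marked +1 when they appear in the home line-up and −1 in the away line-up, and regress the goal difference on that matrix. A ridge penalty keeps the large, sparse problem well behaved. A toy version with synthetic data:

```python
# Hypothetical sketch: estimate one "efficacy" coefficient per player by
# ridge regression of goal difference on player appearance indicators.
import numpy as np

n_players, n_games = 6, 8
rng = np.random.default_rng(0)

# Design matrix: +1 for home-side players, -1 for away-side players.
X = np.zeros((n_games, n_players))
for g in range(n_games):
    players = rng.permutation(n_players)
    X[g, players[:3]] = 1.0    # home line-up
    X[g, players[3:]] = -1.0   # away line-up

# Synthetic "true" skills and observed goal differences.
true_skill = rng.normal(size=n_players)
y = X @ true_skill + rng.normal(scale=0.5, size=n_games)

lam = 1.0  # ridge penalty, keeps X'X invertible even for sparse data
efficacy = np.linalg.solve(X.T @ X + lam * np.eye(n_players), X.T @ y)
print(efficacy.shape)  # one estimate per player
```

At the real scale of the problem – 133,000 line-ups rather than eight – a sparse or iterative solver would replace the dense `np.linalg.solve` call, but the structure of the problem is the same.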
The details of the estimation method are to be found in the paper (also a shout-out to Professor Eric Schwartz, who gave Guy a lot of help with the process). Once the individual player estimates are generated, forecasts can be produced using the assumption that next week’s line-up will be the same as this week’s.
How good are the results? You can see for yourself, but the percentage of correctly predicted results is typically in the range of 40-50%. To put this in context, if you picked the results at random you would get the correct result (win, draw, loss) 33% of the time, while the bookmakers, whose predictions are the best available (otherwise they would go out of business), tend to get it right about 55% of the time. So the model is somewhere in the middle.
Clearly the model could get better, but it would already be competitive with most alternative models. A model that gets the results right 50% of the time instead of 45% of the time is better, but in betting markets I doubt this margin would be enough to make large profits once the bookmaker’s margin and taxes are taken into account.
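A back-of-the-envelope calculation shows why the bookmaker’s margin eats into a modest accuracy edge. The odds below are hypothetical round numbers, not quotes from any actual bookmaker:

```python
# Illustration: on a 50/50 proposition, fair decimal odds would be 2.0,
# but a bookmaker's margin shaves the price to something like 1.90.
fair_odds = 2.0
offered_odds = 1.90

# Win rate needed just to break even at the offered price.
break_even = 1 / offered_odds
print(round(break_even, 3))  # 0.526 -- already above a 50% hit rate
```

So even a bettor who is right half the time on even-money propositions loses money at these prices, before any tax is considered.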
It’s also not difficult to see how it could be improved. At the moment it relies only on the names of the players on the field. It doesn’t distinguish any individual characteristics, and it doesn’t weight more recent games more highly. It doesn’t attempt to predict which players will appear in the next game. And there are a whole host of other technicalities in the estimation procedure which might yield small improvements.
But this is a very different type of modeling from what we see in the soccer analytics world today. That work seems largely focused on trying to make predictions based on individual actions on the field. The question that seldom seems to be answered in that work is “how well does the modeling predict the outcome of games?” That, I think, has to be the ultimate yardstick for the usefulness of soccer analytics.