Soccer Analytics: Science or Alchemy?

February 8, 2019

by Stefan Szymanski

It is obvious to anyone with a passing interest in the game that the analysis of player and team performance in soccer is in the middle of a revolution. Soccer analytics, which one might broadly define as the application of statistical methods to player and team data, barely existed ten years ago. A Google Scholar search for the term from that era turns up only a few papers, mostly concerning player physiology and issues such as the measurement of VO2 max, or the processing of substances, such as creatine, by the soccer player’s body. What we mean today by the term goes far beyond this. Depending on who you talk to, it is about the application of big data to soccer tactics, using machine learning to analyze performance in real time, player tracking, ball tracking and the quantification of everything that happens on the field. A Google Scholar search today yields hundreds of results, while dozens, possibly hundreds, of sites have sprouted on the web, promising to unlock the benefits of soccer analytics. When Simon Kuper and I wrote Soccernomics in 2008, we did not mention the subject. By the time of our third edition, in 2014, we had added an entire chapter on it.

Soccer analytics is part of the “big data” revolution. This might be thought of as two things:

The ability to collect and record digitally very large numbers of events surrounding activities of interest.
The ability to process this data on powerful computers. “Process” in this context means nothing more than the application of basic arithmetic to the data (adding, subtracting, multiplying and dividing), but the ability to process data on a vast scale means that more complex mathematical concepts (which nonetheless rely on basic arithmetic) can be implemented.

The applications of big data range from the search for cures for cancer based on targeting human genes to creating a system that can safely operate a self-driving car, developing more accurate weather forecasting models, building marketing strategies on the basis of real-time Twitter activity, and targeting voters to swing elections. It’s a real thing, and it is changing our lives in many ways, many of which we may not even know about.

While soccer analysts had long talked about the potential gains to be made by analyzing data, the ready availability of the tools has turned the potential into a reality. It is a new way of thinking about soccer. To give three examples:

Decroos et al. (2018) use play-by-play event data for individual games to estimate the probability that any given event (such as a pass) will lead to a goal for either the home team or the away team, based on every game played in the top 5 European leagues over 3 seasons. Although they do not say so, this must amount to several million events. Once the probabilities have been estimated, an index can be constructed to measure each player’s contribution in a game, and across a season.
Hobbs et al. (2018) used ten-second film segments from Premier League games in one season to classify game states and identify the transitions which lead to counter-attacks, enabling them to measure the effectiveness of different teams. The data with which they’re working must contain more than 200,000 segments, each one classified according to the position of the players.
Hubáček et al. (2018) used data on the results of over 200,000 games played across 52 leagues since 2000 to predict the outcomes of future games. Their paper was the winner of the 2017 Soccer Prediction Challenge to develop machine learning models applicable to soccer data. The ranked probability score for their model was around 0.2, compared to 0.16 for the bookmakers (a lower score implies a higher degree of accuracy).
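
For readers unfamiliar with the ranked probability score mentioned in the last example, the following is a minimal Python sketch of the standard definition for a single match, assuming the three outcomes are ordered as home win, draw, away win. The function name and the example forecast are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence


def ranked_probability_score(probs: Sequence[float], outcome: int) -> float:
    """Ranked probability score for one match.

    probs   -- forecast probabilities in a fixed order,
               assumed here to be (home win, draw, away win)
    outcome -- index of the result that actually occurred
    """
    n = len(probs)
    cum_forecast = 0.0   # cumulative forecast probability
    cum_observed = 0.0   # cumulative observed outcome (0 or 1)
    total = 0.0
    # Compare the cumulative distributions at every category but the last,
    # where both cumulative sums equal 1 by construction.
    for i in range(n - 1):
        cum_forecast += probs[i]
        if i == outcome:
            cum_observed = 1.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (n - 1)


# Hypothetical forecast: 50% home win, 30% draw, 20% away win.
# If the home team wins, the score is roughly 0.145.
example = ranked_probability_score([0.5, 0.3, 0.2], outcome=0)
```

Because the score compares cumulative probabilities, a forecast that puts weight on a draw is penalized less when the home team wins than one that puts the same weight on an away win. A season-level figure such as those quoted above would be the average of this score over all matches.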

The first thing to say is that these developments are fascinating. Bringing new techniques and ideas to old problems must be a very good thing. But this leads obviously to two questions: (a) what went before? and (b) how is this innovative research an improvement on what went before?

Statistical modeling in soccer, like other sports, has some history. Most of that research was focused on analyzing results. Statistical models were developed and tested to predict game outcomes, and these models were then benchmarked against bookmaker odds. These models relied on regression methods familiar in econometrics to fit a model to the data, which in data science is now called “training” the model on the data. The anthropomorphism here can be a little confusing if you’re not familiar with what is being done. The old methods were optimization methods: getting the computer to solve a maximization problem, which as a byproduct reveals the sensitivity of the variables to each other.

“Machine Learning” does the same thing: the computer solves an optimization problem, only now based on millions of observations rather than a few hundred. Moreover, machine learning doesn’t tie you down to one statistical model; it optimizes among models. While these are important advances, it’s also important not to lose sight of the fact that the same fundamental principles apply. I don’t know if humans can be thought of as massive optimization solvers (though I doubt it), but the word “learning” in machine learning should not lead us to think that machines are doing anything more than optimization routines.

In the past, the models were constrained by the availability of data and the ability to process it, but in the soccer prediction world, the models worked quite well. The main outcome of this research was that recent results, suitably weighted, were fairly good predictors of future results. This should not be a surprise: a soccer team is a fairly stable entity, at least over a short period of time. Weather is a good analogy: weather forecasting works fairly well over a five-day period, but anything longer than that and it becomes unreliable because so many other things can change. Soccer teams on average are fairly stable over a period of ten games or so, but over longer periods many more things can change.

But the best predictors have always been the bookmakers. This should also not surprise anyone: if you can consistently beat a bookmaker then sooner or later you will drive them out of business. A bookmaker who sets out to balance its books is the perfect example of the wisdom of the crowd. By making the odds inversely proportional to the amount staked on each outcome, the bookie pays out the same whichever outcome occurs and so, allowing for the profit margin on each bet, cannot lose. If enough people with a diverse knowledge set are betting, the evidence is that the predictions implied by these odds are unlikely to be beaten. In fact, bookies don’t always balance the books; they sometimes take a position themselves when they adjust the odds, usually safe in the knowledge that they are paying far more attention to the relevant information than anyone else.
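
The arithmetic of a balanced book can be illustrated with a short Python sketch. The stakes and the 5% margin below are made-up numbers chosen only for illustration, and the function name is hypothetical; real bookmakers set odds in more elaborate ways.

```python
def balanced_book_odds(stakes: dict, margin: float) -> dict:
    """Decimal odds that make the total payout identical for every outcome.

    stakes -- total amount staked on each outcome (hypothetical figures)
    margin -- share of the pool the bookmaker keeps, e.g. 0.05 for 5%
    """
    pool = sum(stakes.values())
    payout_pool = pool * (1.0 - margin)
    # Odds are inversely proportional to the amount staked on each outcome.
    return {outcome: payout_pool / staked for outcome, staked in stakes.items()}


# Illustrative stakes, not real data.
stakes = {"home": 5000.0, "draw": 3000.0, "away": 2000.0}
odds = balanced_book_odds(stakes, margin=0.05)

# Whichever outcome occurs, the bookmaker pays out the same total amount.
payouts = {outcome: stakes[outcome] * odds[outcome] for outcome in stakes}

# Stripping out the margin, the odds imply probabilities equal to the
# share of money staked on each outcome: the "crowd" forecast.
inverse = {outcome: 1.0 / o for outcome, o in odds.items()}
implied = {outcome: v / sum(inverse.values()) for outcome, v in inverse.items()}
```

With these numbers the bookmaker takes in 10,000 and pays out 9,500 regardless of the result, and the implied probabilities are 0.5, 0.3 and 0.2, which is simply how the bettors divided their money.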

Any statistician planning to turn their skill into profit also faces an uneven playing field in the form of tax, which usually means you have to make a return of 10% or so on your bets just to break even. Many people say they can beat the bookies; few provide substantial evidence to prove it. Researchers who have published their results often say they get close.

So the real test for the new soccer analytics is whether it can generate predictions better than the betting odds. And the answer, thus far, seems to be “no”. There are many sites that now offer predictions based on big data models (FiveThirtyEight runs a nice model with a full explanation), but I’ve yet to see one that systematically outperforms the bookmakers. Betfair, the betting exchange, has published its own model generated by its data science team, and will sell you data to build your own model.

Are these models better than the models we were able to build in the past? This is harder to answer, but I have some doubts. I have long argued that the wage bill of soccer clubs is the best forecaster of results – because players are assets traded in a market where there are many buyers and sellers, and there is little hidden information. The problem with using this data to make predictions is that it can only be obtained from financial statements, which are typically only available in the season following the one in which games were played, and are in any case only available for a limited subset of countries.

However, the valuations posted by the website transfermarkt.de (or .com, .co.uk, etc.) are relatively reliable indicators of wages paid. I ran a regression of these valuations against the wage data for English clubs over five years and found a correlation coefficient of over +0.95. The transfermarkt valuations (henceforth TM), as I understand it, are crowdsourced opinions of fans, and they seem remarkably accurate. I used the TM values to predict outcomes of Premier League games this season. To be specific, I used an ordered logit regression to predict win, draw and loss probabilities based only on the identity of the home team and the ratio of TM values for the two teams. You can download the code in R and the datafiles below.

soccerfiles
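
To make the shape of such a model concrete, here is a minimal Python sketch of how an ordered logit turns a ratio of squad values into win, draw and loss probabilities. It covers only the prediction step, not the estimation, and it is not the downloadable R code. The coefficient and cutpoints are hypothetical placeholders rather than fitted values, and the use of the logarithm of the ratio is an assumption made here for symmetry.

```python
import math


def logistic(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def ordered_logit_probs(tm_home: float, tm_away: float,
                        beta: float, cut_low: float, cut_high: float) -> dict:
    """Home win / draw / home loss probabilities from an ordered logit.

    tm_home, tm_away  -- squad valuations; home team in the numerator
    beta              -- slope on the log value ratio (hypothetical)
    cut_low, cut_high -- cutpoints separating loss|draw and draw|win
                         (hypothetical); their asymmetry around zero
                         captures home advantage
    """
    index = beta * math.log(tm_home / tm_away)  # log ratio is an assumption
    p_loss = logistic(cut_low - index)
    p_not_win = logistic(cut_high - index)
    return {
        "home_win": 1.0 - p_not_win,
        "draw": p_not_win - p_loss,
        "home_loss": p_loss,
    }


# Placeholder parameters, NOT estimates from the article's data.
BETA, CUT_LOW, CUT_HIGH = 1.0, -1.0, 0.2

even_match = ordered_logit_probs(100.0, 100.0, BETA, CUT_LOW, CUT_HIGH)
strong_home = ordered_logit_probs(400.0, 100.0, BETA, CUT_LOW, CUT_HIGH)
```

With these placeholder cutpoints, two teams of equal value still give the home side a higher win probability than loss probability, which is how home advantage shows up in a model where the home team always sits in the numerator.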

The results were very close to the forecast probabilities implied by the odds given by the bookmaker Bet365 (the data is available at the website football-data.co.uk). The Brier score, which measures the accuracy of predictions, was 0.589 for my model and 0.575 for Bet365, almost identical. Lower scores are better, so my model can’t beat the bookies either, but I think it would be competitive with any model produced by big data. Remember, my model is based on two pieces of information only (the identity of the home team and the TM values), and the estimates were based (“trained”) on data for just over 3,000 games played, which doesn’t qualify as “big data”.
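
A minimal Python sketch of this kind of comparison follows. It assumes the multi-category form of the Brier score (squared errors summed over the three outcomes, then averaged over games), under which a forecast of one third for each outcome scores about 0.667; that form is an assumption here, though it is consistent with the magnitudes quoted above. The function names and the sample odds are illustrative.

```python
from typing import Sequence


def implied_probabilities(decimal_odds: Sequence[float]) -> list:
    """Convert decimal odds to probabilities by normalizing away the margin."""
    inverse = [1.0 / o for o in decimal_odds]
    total = sum(inverse)  # exceeds 1 by the bookmaker's margin
    return [v / total for v in inverse]


def brier_score(forecasts: Sequence[Sequence[float]],
                outcomes: Sequence[int]) -> float:
    """Multi-category Brier score averaged over games.

    forecasts -- one probability vector per game, e.g. (home, draw, away)
    outcomes  -- index of the result that occurred in each game
    """
    total = 0.0
    for probs, result in zip(forecasts, outcomes):
        for i, p in enumerate(probs):
            observed = 1.0 if i == result else 0.0
            total += (p - observed) ** 2
    return total / len(forecasts)


# Made-up odds for a single game, for illustration only.
bookie_forecast = implied_probabilities([1.9, 3.8, 3.8])
uninformed = brier_score([(1 / 3, 1 / 3, 1 / 3)], [0])  # about 0.667
```

Scoring a model and a bookmaker on the same set of games with `brier_score` gives directly comparable numbers, with the uninformed benchmark as a useful reference point.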

This, I think, represents a big challenge for big data. If simple models can generate results which are close to the bookmakers, and big data can’t beat the bookmakers, then there is very little space to add value. One response would be to say that I have placed excessive emphasis on prediction as a measure of value- in fact, more or less the only measure of value. Proponents of big data might say that their role is to provide insights to coaches about player performance, team tactics, and so on.

This is where the question raised by my title becomes relevant. First, a little history of science. The scientific revolution in Europe is usually dated to the sixteenth and seventeenth centuries and is typically associated with the work of Copernicus, Galileo, Descartes, Boyle, Newton and a number of others. These individuals challenged a worldview based on principles derived from Aristotle which, until then, were considered unchallengeable. They were responsible for developing the scientific method, which rejected the notion of any absolute principle in favor of the notion of evidence and, ultimately, prediction. The scientific revolution succeeded because its predictions work. It also facilitated the industrial revolution, which enabled us to use fundamental insights to develop machines to replace human labor, to study the nature of matter and transform life through increasing the food supply and identifying life-saving drugs.

Many of these advances have to do with chemistry, which has a different history from physics or astronomy. Europeans and others had been fiddling around with substances for hundreds of years (think Chinese gunpowder) and engaged in an empirical procedure with almost no basis in theory. The theory that they had, derived from Paracelsus, identified three life forces (body, soul and spirit) with three “elements” (salt, sulphur and mercury). Alchemists sought to identify the mysteries of life and create gold from dross; they did not care to explain their methods in too much detail.

Alchemy, however, played a very important role in the development of modern science. In their quest to create gold, alchemists endlessly combined substances and in the process discovered many interesting results. Their model of experimentation was the basis on which modern experimental methods developed. Isaac Newton himself, the man who did more than almost anybody to usher in the scientific age, was a committed alchemist, and his writings on the subject far exceed the number of pages in the Principia Mathematica.

Alchemy was superseded by modern chemistry, in the end, not because of any particular fault in its experimental methods, but because of its rigid adherence to a metaphysical theory which was lent little or no support by the data. Chemistry has advanced mainly through the support of a theory, atomic theory, which generates predictions that are supported by the data. We did not get from alchemy to chemistry by going from a bad theory to no theory, but by going from a bad theory to a better one.

This, I think, is the problem with a lot of the soccer analytics I have read: little or no attention is paid to the development of a theory. The hallmark of a theory is that it can generate predictions, and this, in my view, is how soccer analytics will advance. Much of soccer analytics as it is presently practiced deserves to be called scientific and I do not wish to cast aspersions on any of the examples I have cited above. But I do believe there are some problems, and I worry that some of what is reported as complex science is tinged with residues of alchemical thought. I can summarize my concerns in these four points:

Theories are useful because they help us to identify causal effects. Big data approaches in soccer seem largely to avoid theorizing, which risks reducing the analysis simply to the search for correlations. I don’t think any science can prosper if straightjacketed in this way.
In natural science causal effects are usually tested using controlled experiments. Social science relies on observational data, which makes the identification of causality much more difficult. There have been huge advances in our understanding of how to identify causality through statistical methods over the last three decades, but as yet I’ve seen little recognition of the issue in soccer analytics papers.
There is a financial profit to be made in soccer analytics. Mostly this is not about beating the bookies but advising clubs on identification of strategies, playing talent and so on. This world closely resembles the world of alchemy: it is secretive and given to obscure utterings. Results are announced but not explained, success is claimed but not proven. This is perhaps inevitable as long as the potential for profit exists. Similar non-scientific analyses exist for the stock market, together with promises of untold returns. The only cure for this is a healthy skepticism.
In the end, I believe, soccer analytics will be judged on its capacity to predict. I have outlined some of the challenges to developing predictions in relation to game results, but there are potentially other areas where soccer analytics can contribute, involving outcomes related to specific on-field events.

Some of what passes for soccer analytics seems, at least to me, to be little better than alchemy. But the opportunities presented by big data in soccer, as in other sports, are very great.

References

Decroos, T., Bransen, L., Van Haaren, J. and Davis, J., 2018. Actions Speak Louder Than Goals: Valuing Player Actions in Soccer. arXiv preprint arXiv:1802.07127.

Hobbs, J., Power, P., Sha, L. and Lucey, P., 2018. Quantifying the value of transitions in soccer via spatiotemporal trajectory clustering. MIT Sloan Sports Analytics Conference.

Hubáček, O., Šourek, G. and Železný, F., 2018. Learning to predict soccer results from relational data with gradient boosted trees. Machine Learning, pp.1-19.

5 Comments

César March 9, 2019 at 11:08 pm

Excellent. I will use this text in my class about predictions at Brasilia University. I teach Valuation in accounting courses.

opap stoixima February 27, 2020 at 3:00 am

Hello. Can you please explain what you mean by “the identity of the home team”? I downloaded the files but still not sure of what you mean
- Stefan Szymanski February 27, 2020 at 10:23 am
  
  That just means that in the regression the home team player values are always in the numerator and the away team always in the denominator. This means that the constant in the regression reflects the home advantage bias.

David Lepetit July 12, 2020 at 11:43 am

Interesting article, as always. But I disagree with the idea that Analytics will be judged on their ability to predict: going back to the Roots of Analytics, data can’t predict but help to prescript, and that’s a huge difference.
- Stefan Szymanski July 12, 2020 at 12:21 pm
  
  Interesting point, but I think prescription entails prediction. I can’t rationally prescribe a course of action unless I believe I can predict the consequence of this action. I think the problem with a lot of prescriptive advice is that it skates over this necessary step.
