Roger Federer is arguably one of the best tennis players of our time. His intense athleticism and ability to do the impossible on the court have been compared to a religious experience and even led to the coining of terms like “a Federer Moment.” His status among the greats can be quantified through a science of sorts, and illustrated with data and statistics. All eyes are on Federer during this year’s U.S. Open – can he win an eighteenth Grand Slam after a three-year drought? Pete Sampras seems to think so.
One thing is for sure: Federer has consistently defied expectations ever since he went pro in 1998. Data (and tennis) geek that I am, I recently set out to gauge his career performance relative to reasonable expectations. By leveraging some of the same predictive modeling tools and concepts that we use at my company Infer, I analyzed a dataset on professional tennis matches from January 3, 2000 to November 29, 2009 to determine precisely when Federer reached his prime, and to try to quantify how eerily close to perfection he came during that time.
Based on his annual win rates, Federer’s best years were 2004 – 2006, when he won 93%, 95% and 95% of his matches respectively (70, 80 and 90 wins, with only 5, 4 and 5 losses). Those stats are just incredible, and no other player in this dataset posted those kinds of numbers. In 2004, Federer had 23 straight wins at one point. The following year, he reached 34 straight wins, and in 2006, 27 straight wins.
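For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the yearly win rates from the win and loss counts quoted above. The longest_streak helper is a hypothetical illustration of how a streak could be counted from an ordered list of results; it is not the code behind the numbers in this post.

```python
# Win and loss counts for 2004, 2005 and 2006, as quoted in the text.
seasons = {2004: (70, 5), 2005: (80, 4), 2006: (90, 5)}


def win_rate(wins, losses):
    """Return the fraction of matches won."""
    return wins / (wins + losses)


def longest_streak(results):
    """Return the longest run of consecutive wins.

    `results` is a list of booleans in match order (True = win).
    Hypothetical helper, shown only to illustrate the idea.
    """
    best = current = 0
    for won in results:
        current = current + 1 if won else 0
        best = max(best, current)
    return best


# Win rates rounded to whole percentages, keyed by year.
rates = {year: round(100 * win_rate(w, l)) for year, (w, l) in seasons.items()}
print(rates)
```

The rounded rates come out to 93, 95 and 95, matching the percentages above.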
His longest streak spanning two calendar years was 39 wins – and not just any 39 wins, but dominating wins starting with the 2006 U.S. Open. Here I’ve shown matches from that streak in a heatmap (ordered by time from left to right), with Federer’s opponent’s ATP rank below each. Highlighted in green are Federer’s straight-set victories, where he didn’t lose even one set:
Of his losses during those years, five were to Nadal (who was the consistent #2 in the rankings and potentially the best clay player ever… but that’s a post for another day). Seven of Federer’s losses were in best-of-three matches, as opposed to best-of-five matches, which one could argue would have given him more time to go the distance and come back to win. Four of his 2004 – 2006 losses were closely contested, containing sets with seven or more games.
Now, perhaps the more fun part. I also built a very simple probability-based predictive model to analyze Federer’s match history prior to August 30, 2006. Here are his win rates by opponent ATP rank (grouped in buckets) prior to 8/30/2006:
This shows the difference in skill and difficulty by rank. For example, it was 30% harder for Federer to beat a top-ten player than a player ranked 26-50. It’s interesting to see how smoothly his win rates increased as his opponents’ rank numbers went up (that is, as the opposition got weaker). Federer was amazingly consistent. His AUC (one of the metrics data scientists use to determine if predictive models are accurate) by reverse rank and targeting wins is very good.
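To make the AUC idea concrete, here is a rough sketch of one way to compute it. I am assuming that "reverse rank" means scoring each match by the opponent's rank number, so a weaker opponent implies a higher predicted chance of a win. The function name and the sample matches are made up for illustration and are not taken from the dataset.

```python
def auc_by_rank(matches):
    """Pairwise AUC using the opponent's rank number as the score.

    `matches` is a list of (opponent_rank, won) pairs. The result is the
    probability that a randomly chosen win came against a higher rank
    number (a weaker opponent) than a randomly chosen loss, with ties
    counted as half.
    """
    wins = [rank for rank, won in matches if won]
    losses = [rank for rank, won in matches if not won]
    if not wins or not losses:
        raise ValueError("need at least one win and one loss")
    total = 0.0
    for w in wins:
        for l in losses:
            if w > l:
                total += 1.0
            elif w == l:
                total += 0.5
    return total / (len(wins) * len(losses))


# Made-up sample: (opponent ATP rank, did the player win?)
sample = [(2, False), (5, True), (8, False), (30, True), (60, True), (120, True)]
print(auc_by_rank(sample))
```

An AUC of 0.5 would mean rank tells us nothing about the outcome, while 1.0 would mean every loss came against a better-ranked opponent than every win.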
By using these win rates as probabilities, we can compute how many wins Federer should have expected in the subsequent years, and compare that to his actual win streak to gauge just how crazy his performance really was in his prime. Looking at the ranks of his opponents in those 39 matches, probability tells us Federer realistically should have won 28 of those matches. Still a ridiculously high number, and consistent with his previous win streaks, but in hindsight that’s about 30% lower than his actual performance over the following six months.
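The expected-wins calculation is just a sum of per-match win probabilities. Here is a minimal sketch of that idea. The bucket boundaries and win rates below are placeholder values I made up for illustration (they are not the rates from my chart), and the function names are hypothetical.

```python
import math

# Placeholder win rates by opponent rank bucket: (upper rank bound, win rate).
# These numbers are illustrative assumptions, not the values from the dataset.
BUCKET_WIN_RATES = [(10, 0.65), (25, 0.75), (50, 0.85), (100, 0.90)]
DEFAULT_WIN_RATE = 0.95  # assumed rate for opponents ranked outside the top 100


def win_probability(opponent_rank):
    """Look up the historical win rate for the opponent's rank bucket."""
    for upper_bound, rate in BUCKET_WIN_RATES:
        if opponent_rank <= upper_bound:
            return rate
    return DEFAULT_WIN_RATE


def expected_wins(opponent_ranks):
    """Expected number of wins: the sum of the per-match win probabilities."""
    return sum(win_probability(rank) for rank in opponent_ranks)


def probability_of_sweep(opponent_ranks):
    """Chance of winning every match, treating matches as independent."""
    return math.prod(win_probability(rank) for rank in opponent_ranks)


# Made-up opponent ranks for a short run of matches.
ranks = [5, 20, 40, 80, 150]
print(expected_wins(ranks), probability_of_sweep(ranks))
```

Feeding in the actual opponent ranks from the 39-match streak, along with the real bucket rates, is how you would arrive at an expected win count to compare against what Federer actually did. The sweep probability shows why a long unbeaten run is so unlikely even for a heavy favorite: it shrinks with every match added.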
Basic data science and statistics here show us just how improbable Federer’s performance was in those years. Even with an advanced model and feature set, I seriously doubt that anyone would have predicted 39 wins based on data prior to 8/30/2006. We might get close if we continually re-train the model after each win with recency-based features that attempt to weigh the importance of the running streak going into the next match, but my hunch says even those predictions would still be off by a good margin.
It’s just difficult for models to predict outcomes like these because of a number of factors, including of course randomness and luck. Real-world predictions in sports, stocks, etc. are challenging even with modeling techniques and data access improving every day. Although Kaggle competitions have spawned amazing modeling work and extraordinary performance results, as have Nate Silver’s election predictions, it’s important to recognize the limitations of data science.
Despite all the hype, data scientists need to embrace restraint and simplicity in considering where models can truly add value and how they can be improved. This is key to managing expectations as well as avoiding wasted effort on over-optimizing and overfitting models. In this case, I’m pretty sure that what the analysis means is that when Roger Federer is playing, you should step back from the computer and just embrace the experience.