If last year was considered the “Year of Analytics,” this year would have to be quite the opposite, right? Los Angeles, owners of the 2nd best CF in the league, missed the playoffs, while Calgary, 26th in the league, clinched. The NY Rangers won the Presidents’ Trophy in convincing fashion, yet they finished 19th in the league in CF. In fact, of the top 5 teams in the NHL standings, the best CF performer among them, St. Louis, finished a rather unimpressive 12th in CF. If you also take into consideration that the difference between 7th and 15th place was 4 points whilst the difference between 5th and 7th was 6 points, the validity of Corsi as a predictor of success is called further into question.
Luck, Randomness, Correlation and Cause
Many hardcore Corsi advocates will point to “luck” or “randomness” as the reason why the Kings and Flames finished as they did and you know what… they’re correct, statistically speaking. However, let’s define what “luck” and “randomness” truly mean here, “statistically speaking.”
When a conversation is carried out and the term “luck” is used we normally associate the word with its textbook traditional definition and context.
Luck: the force that seems to operate for good or ill in a person’s life, as in shaping circumstances, events, or opportunities:
“Let’s hope the Maple Leafs get some luck and find a franchise player in the 2015 NHL entry draft.”
Before we can define ‘luck’ in a non-traditional statistical sense of the word, we need to first understand what ‘correlation’ means. You may have heard ‘correlation’ thrown around social media or in previous articles that you’ve read. I know I’ve seen it plenty of times and more often than not, it is used incorrectly.
Correlation: is measured by what is known as the correlation coefficient, which ranges between -1 and +1. Perfect positive correlation (a correlation coefficient of +1) implies that as one hockey stat (e.g. CF) moves, either up or down, the other stat (e.g. Wins) will move in lockstep, in the same direction. Alternatively, perfect negative correlation (a coefficient of -1) means that if one stat moves in either direction, the other stat will move in the opposite direction. If the correlation is 0, the stats are said to have no correlation; there is no linear relationship between their movements. (Source: Investopedia, with hockey references edited in)
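For anyone who wants to see the coefficient computed rather than just described, here is a minimal sketch in Python. The numbers are made up for five hypothetical teams and are not real NHL data; the helper name `pearson_r` is my own, and the only thing the sketch shows is how the coefficient lands at +1, -1, or somewhere in between.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up shot-attempt differentials (CF) for five hypothetical teams.
cf = [-200, -100, 0, 100, 200]

# Points that move in lockstep with CF: perfect positive correlation.
lockstep = [70 + 0.1 * x for x in cf]
# Points that move exactly opposite to CF: perfect negative correlation.
opposite = [90 - 0.1 * x for x in cf]
# Points that are only loosely related to CF: imperfect correlation.
mixed = [88, 80, 95, 85, 97]

for label, pts in [("lockstep", lockstep), ("opposite", opposite), ("mixed", mixed)]:
    r = pearson_r(cf, pts)
    # r squared is the share of variation in points that a straight line explains.
    print(f"{label}: r = {r:.2f}, r squared = {r * r:.2f}")
```

Note that the coefficient and its square are two different numbers; only the square can be read as a share of variation explained.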
When hockey stats are not perfectly correlated, reversion to the mean (regression) occurs. For example, the chart above looks at the relationship between Pts (team total points) and CF (team shot attempts for minus shot attempts against) for the most recently completed NHL season. Each point on the chart represents an NHL team, and its position is determined by where that team’s CF and Pts intersect. The line that runs through the scattered points represents the linear relationship (reversion/regression line) between CF and Pts. Numerically speaking, the correlation coefficient was calculated to be 0.56, which is commonly read as “56% of a team’s point total can be explained by its CF.” (Strictly speaking, the share of variation explained is the square of the coefficient, 0.56² ≈ 0.31, or about 31%.) To the untrained eye, someone might mistakenly interpret this in the following way:
“Wow, 56% of a team’s point total can be explained by a team’s CF. If a team wants to improve their position in the standings they should focus on improving CF. Hence, if a team focused their efforts on pursuing players with a positive CF, that team would in turn improve overall team CF and as a result improve their position in the standings. As such, anything that falls above or below this linear relationship can most likely be attributable to luck/randomness, which over time will eventually regress back towards the mean. The data clearly shows this.” – Ms. Informed
If you feel the sequence of events in the above statement is correct, then I’m sorry, but your feelings of logic are quite misguided. Unfortunately, the above statement implies causation NOT correlation and that implication, simply put, is WRONG. It also implies that a player’s Corsi is transferable from one team to another, which is also quite wrong. For the record, CORSI does not have a causal relationship with Pts/winning.
Causality: also referred to as causation, is the relation between an event (the cause) and a second event (the effect), where the first event is understood to be responsible for the second.
We know causality fails here because statistics, through a numerical process known as a Granger Causality test, can actually quantify and test for it by checking whether past values of one series help predict future values of another. The fact of the matter is that research to date has not produced any hockey stat that has passed a Granger Causality test, which brings me once again to reversion to the mean. Far too often the term reversion has been inappropriately defined and/or utilized in hockey analytics. It’s to the point where the meaning has been so grossly perverted that I’m sure many of you are quite confounded by it altogether. Let’s cut through the bullshit and get to the bottom of ‘reversion’ once and for all.
“Any time results from period to period aren’t perfectly correlated, you will have reversion to the mean. Saying it differently, any time luck contributes to outcomes, you will have reversion to the mean. This is a statistical point that our minds grapple with.
Reversion to the mean creates some illusions that trip us up. One is the illusion of causality. The trick is you don’t need causality to explain reversion to the mean, it simply happens when results are not perfectly correlated. A famous example is the stature of fathers and sons. Tall fathers have tall sons, but the sons have heights that are closer to the average of all sons than their fathers do. Likewise, short fathers have short sons, but again the sons have stature closer to average than that of their fathers. Few people are surprised when they hear this.
But since reversion to the mean simply reflects results that are not perfectly correlated, the arrow of time doesn’t matter. So tall sons have tall fathers, but the height of the fathers is closer to the average height of all fathers. It is abundantly clear that sons can’t cause fathers, but the statement of reversion to the mean is still true.
I guess the main point is that there is nothing so special about reversion to the mean, but our minds are quick to create a story that reflects some causality.” – Michael Mauboussin (Author: The Success Equation: Untangling Skill and Luck in Business, Sports, and Investing)
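The fathers-and-sons point can be checked with a quick simulation. This is only a sketch under my own assumptions: heights are in centimetres, the population average of 175 and the spreads are invented, and father and son heights are both built from a shared family component plus independent noise, so neither one causes the other in the model.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
AVERAGE = 175.0  # assumed average height in cm (illustrative only)

# Each pair shares a family component; the rest is independent noise.
# That makes father and son heights correlated, but not perfectly.
pairs = []
for _ in range(20000):
    shared = random.gauss(0, 5)
    father = AVERAGE + shared + random.gauss(0, 5)
    son = AVERAGE + shared + random.gauss(0, 5)
    pairs.append((father, son))

def mean(values):
    return sum(values) / len(values)

# Forward in time: pick tall fathers, then look at their sons.
tall_father_pairs = [(f, s) for f, s in pairs if f > 185]
tf_father_mean = mean([f for f, _ in tall_father_pairs])
tf_son_mean = mean([s for _, s in tall_father_pairs])

# Backward in time: pick tall sons, then look at their fathers.
tall_son_pairs = [(f, s) for f, s in pairs if s > 185]
ts_son_mean = mean([s for _, s in tall_son_pairs])
ts_father_mean = mean([f for f, _ in tall_son_pairs])

print(f"tall fathers average {tf_father_mean:.1f}, their sons {tf_son_mean:.1f}")
print(f"tall sons average {ts_son_mean:.1f}, their fathers {ts_father_mean:.1f}")
```

In both directions the second group is still taller than average but sits closer to 175 than the group that was selected, which is all that reversion to the mean amounts to.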
The above quote is an excellent explanation of how correlation and causation are very different and distinct animals. To sum up, causation requires correlation to be true, but correlation does not require causation. In fact, more often than not correlation without causation is better described as coincidence rather than anything meaningful or predictive in nature.
So back to the question at hand, what is ‘luck’ as defined by statistics?
Luck: or ‘randomness’ is defined as any value that lies outside a statistically expected outcome. The statistically expected outcome is determined numerically by analyzing a historical data set of observed outcomes.
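To make that definition concrete, here is a minimal sketch that fits a straight line through made-up CF and points figures for five hypothetical teams and then measures how far each team sits from the line. The team names and numbers are invented for illustration; the gap between actual and expected points is what gets labelled ‘luck’ in the statistical sense.

```python
# Hypothetical teams: (CF, points). Made-up numbers, not real NHL data.
teams = {
    "Team A": (-200, 88),
    "Team B": (-100, 80),
    "Team C": (0, 95),
    "Team D": (100, 85),
    "Team E": (200, 97),
}

cf = [c for c, _ in teams.values()]
pts = [p for _, p in teams.values()]
n = len(teams)
mean_cf = sum(cf) / n
mean_pts = sum(pts) / n

# Ordinary least-squares line: expected points = intercept + slope * CF.
slope = sum((x - mean_cf) * (y - mean_pts) for x, y in zip(cf, pts)) / sum(
    (x - mean_cf) ** 2 for x in cf
)
intercept = mean_pts - slope * mean_cf

# Statistical "luck" is the residual: actual points minus expected points.
residuals = {
    name: p - (intercept + slope * c) for name, (c, p) in teams.items()
}

for name, gap in residuals.items():
    print(f"{name}: {gap:+.1f} points relative to the line")
```

Every team ends up with a non-zero gap, and the gaps cancel out across the league, simply because that is how a least-squares line is constructed.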
So yes, when analytics tells us that the Kings’ and Flames’ seasons were lucky events, statistically speaking that is correct, because that is how an outcome that falls above or below the reversion line is defined. In fact, every point on the above chart is lucky to some degree. This brings us to the ultimate conundrum in hockey analytics.
Does luck in the traditional sense of the word mean the same as luck in the statistical sense of the word as it pertains to hockey analysis?
If the data displayed strong correlation and tested positive for causality, then ‘luck’ would in fact share a very similar meaning both traditionally and statistically. In other words, any unexpected outcomes could be attributed to some outside unexplainable force. For example, luck could be an unbelievably hot/cold goaltender, a puck with eyes or a blindfold, or a lack/plague of injuries. However, when causality is not present, as is the case with CORSI’s relationship with winning, a lot more context is needed in order for ‘luck’ to actually be luck in the traditional sense and not just simply a meaningless consequence of imperfect correlation.
Identifying The Assumptions Behind Corsi as a Predictor
An education in economics required that I learn, understand and perform data analysis using a ‘Statistical Hypothesis Test’. This continued into my career where I was also required to research and use models that were largely based on varying methods of data analysis. Basically, I’ve read a shit load of university assigned textbooks and academically published papers related to the subject of data analysis and despite all of that, the most profound and useful piece of information I ever received about the subject came from a former colleague/mentor I worked with.
“I don’t care how much data you have or how sophisticated your methods of data analysis are. The results of your analysis are only as strong as the assumptions that shape it.” – A. Mentor
The hypothesis is that CORSI or ‘shot attempt differential’ is the primary indicator/driver of a team’s success. The better a team’s shot attempt differential, the more points a team will achieve. Therefore, if shot attempt differential alone is the primary driving force behind winning, then it also assumes that all shot attempts are equal in value. That assumption is built upon another assumption: that over time and with larger samples, enough data will be compiled to average out any “random” effects associated with any high or low shooting %/save % that would imply these shot attempts aren’t equal.
These assumptions are irrational for a number of logically rational reasons:
- It assumes that a shot travelling at 50km/hr is the same as a shot travelling at 150km/hr.
- It assumes a shot taken by Ovechkin has the same accuracy as a shot taken by Clarkson.
- It assumes a shot with a body in the shooting lane is the same as a one-timer that has forced the goalie to move laterally.
- It assumes player skill is primarily a function of the number of shots attempted.
- It assumes all goalies are equal in skill since shots are the only controllable variable.
- It assumes 25 shots recorded by Arizona is equivalent to 25 shots recorded by Anaheim.
- It assumes 25 shots against Edmonton is equivalent to 25 shots against Montreal.
For obvious reasons, the above assumptions are sufficiently lacking in rational logic. Some in the analytics community have acknowledged the error in these assumptions and have attempted to address them with certain adjustments, such as a shot location subset within the data. This does help somewhat, but it is still critically flawed by all of the assumptions listed above. Others will argue that a large enough sample mitigates the effect of sh% or sv% since over time they will tend to revert to the mean. Again, I refer you to my earlier arguments about misinterpreting reversion. I won’t bother further entertaining the assumption that all shots are equal in nature because for lack of a better word, it’s just plain ignorant. As A. Mentor said earlier, an analysis is only as strong as the assumptions it’s based on and these assumptions are weak at best.
So, if we know that:
- Rationally speaking, all shots are in fact not equal,
- Statistically speaking, a causal relationship does not exist between CORSI and winning, and
- Practically speaking, the best performing teams in the league have at best middle-of-the-road CORSI stats,
We can deduce that CF is a weak predictor of success. However, I will throw CF this small bone. It appears to be far better at identifying the worst teams in the league than it is at predicting the best. This leads to other interesting possibilities, all of which argue against CORSI as a strong predictor of success, and none of which I will entertain further at this time.
Are you Biased? Am I?
My rationale for debunking CORSI as a predictor of performance is often met with the accusation that I suffer from cognitive bias and in all likelihood, there is a great deal of merit to that argument.
Cognitive biases: are tendencies to think in certain ways that can lead to systematic deviations from a standard of rationality or good judgment, and are often studied in psychology and behavioral economics.
Specifically, people who watch games or video and observe data using their ‘eyes’ as a receptor for analysis suffer from a cognitive bias known as confirmation bias.
Confirmation Bias: The tendency to search for, interpret, focus on and remember information in a way that confirms one’s preconceptions.
Ironically, the majority of data collected across the NHL is done via human observation. So essentially, the data that analytics itself uses suffers from confirmation bias.
Moreover, there are times when analytics makes comparisons between several players indicating superior or inferior performance, while at the same time ignoring many other players of relevance in order to support a preconception. I won’t bother to bring up examples as I’m sure most of you have had daily exposure to it.
Pot. Kettle. Black.
As hard as it may be to believe (sarcasm), analytics suffers from some other cognitive biases:
- Bias Blind Spot: The tendency to see oneself as less biased than other people, or to be able to identify more cognitive biases in others than in oneself. Anyone who has been accused of biased analysis can easily relate to this one. Shit, I might even be guilty of it at this very moment.
- Survivorship Bias: seeing the winners and trying to “learn” from them, while forgetting/ignoring the huge number of losers. Corsi advocates are guilty of this bias quite often in that they tend to dismiss outliers as ‘random events’ instead of seeing them as something that should be analyzed and learned from a great deal further. For an in-depth look at survivorship bias and randomness you may want to read Nassim Nicholas Taleb’s “Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets”.
Is it really that hard to believe that CORSI is at best just another evaluation tool that needs a great deal of context?
A reality of statistics is that many things in life just can’t be statistically predicted. The more dynamic an event is, the more likely it follows a statistically unpredictable path or random walk. Hockey is far more dynamic than baseball or basketball and as such, it is much harder to predict.
It’s easier to measure relative skill when the same players play each other the majority of the time and the opposite is true when this control is not available. The main reason for hockey’s increased dynamism over other pro sports involves player contributions. The rules of a hockey game require 12 players to be iced during live play and each team is allotted 20 players for a game. Save for the backup goalie, each player contributes a good portion of TOI to each game. Add to that the fact you can change players on the fly during live play and the permutations in terms of player matchups are enormous. Even if you were to group matchups using individual player time stamps to control for who plays whom during a game (WOWY analysis), the large number of data subsets created would just cause sample size problems. Baseball and basketball are less exposed to this type of dynamism as the same players tend to play the majority of a game against each other. As such, baseball and basketball are far more cooperative statistically speaking.
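A bit of back-of-the-envelope arithmetic gives a feel for how large that matchup space is. The sketch below assumes each team dresses 18 skaters and ices 5 of them at even strength, and it ignores positions and set line combinations entirely, so treat the result as a theoretical ceiling rather than what coaches actually deploy.

```python
from math import comb

SKATERS_DRESSED = 18  # assumed: 20 dressed players minus 2 goalies
SKATERS_ON_ICE = 5    # assumed: even-strength play

# Number of distinct five-skater groups one team could ice.
units_per_team = comb(SKATERS_DRESSED, SKATERS_ON_ICE)

# Any group from one team can face any group from the other.
possible_matchups = units_per_team ** 2

print(units_per_team)     # 8568
print(possible_matchups)  # 73410624
```

Even if only a tiny fraction of those five-on-five combinations ever see the ice together, splitting a season of shot attempts across them leaves very little data in any one bucket.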
It may appear to some of you that I am arguing against the use of analytics and that people should revert back to traditional methods of analysis, i.e. “watch the games.” This couldn’t be further from my true intention. Hockey analytics has produced some very revealing evaluation tools that are, in my opinion, essential in evaluation, development, preparation and forming coaching strategy. These tools in conjunction with traditional stats and video analysis can become a very powerful means of evaluation. It would be a huge disadvantage to ignore analytics as part of any team analysis, just as it would be a disadvantage to ignore the traditional method of analysis using video.
What this article was intent on accomplishing was to set the record straight on the many misconceptions about hockey analytics and CORSI in particular. So if in the future you have some ‘elitist know-it-all’ accuse you of being a “flat-earther,” a “mouth breather,” a “paste eater,” (BTW, Elmer’s is my favorite) or an “idiot” because of your apparent lack of understanding when it comes to math and analytics, you are more than welcome to refer them to this article just after you tell them where to stick their heads.