NBA Game Model
Predicting NBA games using a statistical model with data from NBA’s API
There are all sorts of predictions for the outcome of NBA games, including those from experts, Vegas, and advanced statistical simulated models (such as the one from 538). In this exercise, I attempt to build a simple statistical model to predict the outcome of regular season games in the NBA.
Data Source
To start off this project of creating a model to predict the outcome of games, I pulled the 2014-2015 NBA season from NBA.com’s API. This sample of 1,230 games was used for building of the model as well as the validation through cross validation. The code used for creating this schedule is here.
Data Exploration
Picking the right variables is essential to improving the model’s accuracy and resiliency. The prominent variables for predicting a game’s outcome are the “Four Factors.” Within a game, four factors are calculated per team:
- Shooting: eFG% = (FGM + 0.5*3PM) / FGA
- Turnovers: TOV% = Turnovers / (FGA + 0.44*FTA + Turnovers)
- Rebounding: ORB% = DRB / (Opp ORB + DRB)
- Free Throws: FTFGA = FT/FGA
First, I wanted to see how these factors correlated with the game’s outcome. Using the NBA games listed above, I calculated the four factors for each game and looked at the results. For each factor, the y-axis has the game outcome: 1 for wins and 0 for losses. The red line connects the mean factor values for wins and losses, showing the directionality of the factor:
All four factors appear to have the right directionality with respect to the game outcome, so now it’s time to plug these variables into a logistic regression model. After excluding 7 outliers based on the plots above, the model achieved a 10 fold cross-validated accuracy of 88%. Not bad for just using 8 variables (each team has its Four Factors). The caveat is that these factors can’t be calculated until the game is over, meaning that the in-game factors don’t offer any predictive power. For the game outcome model, we have to get predictions before the game starts, so we can’t use the in-game Four Factors. We can, however, use a team’s Four Factors in past games.
Variable Selection
To use a team’s Four Factors in past games as a variable, I needed to decide how many past games to consider. On one hand, picking a lot of past games would help reduce the variance of this variable for a team. But a high number of past games would also restrict the eligible games for the model to predict as the first few games of the season would not have enough past games. On the other hand, picking a smaller number of past games could capture short-term swings in a team’s performance and allow the model to predict more games earlier in the season. To determine the exact number, I tested these variables when cross-validating the model.
I also added a few other variables that seemed to make intuitive sense. :
- Back to Back Games: Teams often are tired in their second game of a back-to-back schedule, causing them to struggle
- Win % for last 5 and 15 games: This metric doesn’t add a ton of information over the Four Factors, but its inclusion might be able to reveal a team’s instincts and willingness to win
- Dunk Score: As mentioned in this post, the Dunk Score has some predictive power in predicting game outcomes
Next, I used all of these variables with two types of models: a Support Vector Machine (SVM) model and a Random Forest model. For classification problems with lots of variables, these models perform well. Using 10-fold cross validation, I acheived the following accuracies:
Only past 5 game variables:
SVM:
0.642045454545
{'kernel': 'poly', 'C': 5, 'degree': 1}
Random Forest:
0.632575757576
{'n_estimators': 30, 'criterion': 'entropy'}
Only past 15 game variables:
SVM:
0.660984848485
{'kernel': 'linear', 'C': 1, 'degree': 1}
Random Forest:
0.655303030303
{'n_estimators': 50, 'criterion': 'gini'}
Including all past games within the season:
SVM:
0.673295454545
{'kernel': 'rbf', 'C': 10, 'degree': 1}
Random Forest:
0.648674242424
{'n_estimators': 40, 'criterion': 'gini'}
All of the variables:
SVM:
0.683712121212
{'kernel': 'linear', 'C': 0.5, 'degree': 1}
Random Forest
0.666666666667
{'n_estimators': 40, 'criterion': 'entropy'}
Based on these results, it looks like including a variety of past n games produces the most accurate model. Most of this accuracy appears to be driven by the variables that include all past games. In fact, the model based on those variables alone nearly achieved the same accuracy as the model with every variable.