Introduction

I will be analyzing from my observational data set that I have collected over time from MLB regular seasons 2000-2016. In order to improve my data analysis I excluded pitchers, and players with less than 1000 at bat, also I chose the line up position with the most recurrent position in the past 16 seasons. The line up position is crucial for predicting. It is well known that players from 3 to 6 find more runners in score position than any other line up position.

The result will help scouts and team members to relocate players and predict rbi.

Knowing the data set

Variables Description
player_id Player id
player_name Player name
at_bat Number of at bat
rbi_average Total rbi divided by at bat
hit_average Total hits divided by at bat
line_average Total lines divided by at bat
ground_average Total grounds divided by at bat
fly_average Total flies divided by at bat
strikeout_average Total strike outs divided by at bat
lineup_position Most recurrent line up position
path <- "https://raw.githubusercontent.com/luije87/Big-Data-Analysis/master/project.csv"
data <- read.csv(path)
mlb_data <- data[ , c("rbi_average", "hit_average", "line_average", "ground_average", "lineup_position")]
attach(mlb_data)
pairs(mlb_data, pch=16, col="blue")

This pairs graph show us the correlation between variables. What I expected :

Our model will use the following relation:

Rbi = (\(\beta_0\) + \(\beta_1\) * hit_average + \(\beta_2\) * line_average + \(\beta_3\) * ground_average + \(\beta_4\) * lineup_position) * AtBat

mlb_model <- lm(rbi_average ~ hit_average + line_average + ground_average + format(lineup_position), data = mlb_data)
summary(mlb_data)
##   rbi_average      hit_average      line_average    ground_average  
##  Min.   :0.0584   Min.   :0.1947   Min.   :0.1161   Min.   :0.1943  
##  1st Qu.:0.1022   1st Qu.:0.2488   1st Qu.:0.1643   1st Qu.:0.3208  
##  Median :0.1219   Median :0.2621   Median :0.1809   Median :0.3575  
##  Mean   :0.1231   Mean   :0.2635   Mean   :0.1814   Mean   :0.3639  
##  3rd Qu.:0.1412   3rd Qu.:0.2797   3rd Qu.:0.1971   3rd Qu.:0.4056  
##  Max.   :0.2092   Max.   :0.3209   Max.   :0.2533   Max.   :0.5813  
##  lineup_position
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :4.071  
##  3rd Qu.:6.000  
##  Max.   :9.000
summary(mlb_model)
## 
## Call:
## lm(formula = rbi_average ~ hit_average + line_average + ground_average + 
##     format(lineup_position), data = mlb_data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.029749 -0.008996 -0.001147  0.007998  0.047274 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               0.065343   0.012480   5.236 3.04e-07 ***
## hit_average               0.660817   0.060822  10.865  < 2e-16 ***
## line_average             -0.219223   0.034286  -6.394 5.94e-10 ***
## ground_average           -0.261231   0.016442 -15.888  < 2e-16 ***
## format(lineup_position)2  0.014618   0.002614   5.591 4.94e-08 ***
## format(lineup_position)3  0.032195   0.003063  10.510  < 2e-16 ***
## format(lineup_position)4  0.042626   0.003268  13.044  < 2e-16 ***
## format(lineup_position)5  0.030952   0.002977  10.398  < 2e-16 ***
## format(lineup_position)6  0.015821   0.003328   4.754 3.06e-06 ***
## format(lineup_position)7  0.021401   0.004046   5.290 2.32e-07 ***
## format(lineup_position)8  0.017106   0.003018   5.669 3.29e-08 ***
## format(lineup_position)9  0.005194   0.004172   1.245    0.214    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01373 on 310 degrees of freedom
## Multiple R-squared:  0.7828, Adjusted R-squared:  0.7751 
## F-statistic: 101.6 on 11 and 310 DF,  p-value: < 2.2e-16

The p-values for the selected variables are significants and Adjust R Square is 77%, meaning the model explain 77% the variability of the response around its means.

par(mfrow=c(2,2))
plot(mlb_model, which = 1:4)

Residual vs Fitted

Observed values are not far away from 0, which means the model fitted well:

residual = observed y - model-predicted y, values should be around 0.

We have three residuals that are away from 0; 67, 83 and 117

Lets predict one of those data points (67), Wilson Ramos, 35 RBI in 2017:

predict(mlb_model, list(line_average = 0.14903, ground_average = 0.4735, hit_average = 0.25961, lineup_position = 7)) * 280
##        1 
## 28.54172

The prediction for 2017 season is not that close.

Normal Q-Q

Observations lie well along the 45-degree line, so we can asume that normality holds here.

Season (2017)

I picked 4 players with different positions in the line up and applied the model:

Yuliesky Gourriel

Y. Gourriel

Y. Gourriel

At Bat : 529

RBI : 75

Most recurrent line up position : 7

Variables Values Average
Hits 158 0.29867
Lines Drive 88 0.16635
Grounds 218 0.41209
predict(mlb_model, list(line_average = 0.16635, ground_average = 0.41209, hit_average = 0.29867, lineup_position = 7)) * 529
##        1 
## 74.05549

Jose Abreu

J. Abreu

J. Abreu

At Bat : 621

RBI : 102

Most recurrent line up position : 5

Variables Values Average
Hits 189 0.30434
Lines Drive 92 0.14814
Grounds 229 0.36876
predict(mlb_model, list(line_average = 0.14814, ground_average = 0.36876, hit_average = 0.30434, lineup_position = 5)) * 621
##        1 
## 104.7012

Jose Altuve

J. Altuve

J. Altuve

At Bat : 590

RBI : 81

Most recurrent line up position : 1

Variables Values Average
Hits 204 0.32033
Lines Drive 102 0.17288
Grounds 236 0.4
predict(mlb_model, list(line_average = 0.17288, ground_average = 0.4, hit_average = 0.32033, lineup_position = 1)) * 590
##       1 
## 79.4323

Evan Longoria

E. Longoria

E. Longoria

At Bat : 613

RBI : 86

Most recurrent line up position : 3

Variables Values Average
Hits 160 0.26101
Lines Drive 102 0.16630
Grounds 236 0.36541
predict(mlb_model, list(line_average = 0.16630, ground_average = 0.36541, hit_average = 0.26101, lineup_position = 3)) * 613
##        1 
## 84.65801

Conclusion

The predictions for players selected were close to their RBI in 2017. The model works as expected. The baseball game has a lot of stats that can influence our results. I learned we cannot exclude variables we know are important like lineup_position in this case which improved my predictions notably.

I will keep working on this model to improve it to help coaches to find the perfect fit in their line up.

References

Fan Graph - https://www.fangraphs.com

ESPN - http://www.espn.com/

Retrosheet - http://www.retrosheet.org/

Baseball Reference - https://www.baseball-reference.com/