Chapter 9: Predictive Modeling & Machine Learning

Prediction is the ultimate goal of most baseball analytics work. While describing what happened and explaining why it happened are valuable, the questions that drive decisions require looking forward: Will this player perform well next season? What's our probability of winning this game? Should we expect this hitter's performance to regress or improve? This chapter introduces predictive modeling and machine learning methods, showing how to build, evaluate, and apply these models to baseball questions.


9.1 Introduction to Prediction in Baseball

9.1.1 Why Predict?

Prediction in baseball serves three primary purposes, each critical to different aspects of baseball operations:

Player Valuation: Before signing a free agent or making a trade, teams need to estimate future performance. A 28-year-old shortstop hitting .290 with 25 home runs might command a $200 million contract, but only if we expect similar production over the contract's duration. Prediction systems project player performance by analyzing historical patterns, aging curves, component skills, and contextual factors. These projections directly inform financial decisions worth hundreds of millions of dollars.

In-Game Decisions: During games, managers constantly make decisions with uncertain outcomes. Should we intentionally walk this hitter? Should we bring in our closer in the eighth inning? Should we attempt a stolen base? Win probability models estimate the likelihood of winning given the current game state, allowing managers to evaluate which decision maximizes win expectancy. These models have revolutionized late-game strategy.

Roster Construction: Building a competitive team requires predicting not just individual player performance but how players fit together. Will this platoon split maximize value? How much depth do we need to account for injuries? Which minor league prospects will develop into major league contributors? Prediction models help teams optimize roster construction within budget constraints and organizational goals.

9.1.2 The Prediction Landscape

Baseball prediction tasks vary widely in difficulty, data requirements, and methods. Understanding this landscape helps you choose appropriate approaches:

| Task | Primary Inputs | Outputs | Difficulty | Key Challenges |
|------|----------------|---------|------------|----------------|
| Season Projections | Historical stats, age, league | WAR, AVG, HR, ERA | Medium | Aging curves, injuries, luck vs skill |
| Hit/Out Prediction | Pitch type, location, count, batter stats | Binary (hit/out) | Medium | Class imbalance, context variation |
| Pitch Type Classification | Velocity, spin, movement | Categorical (pitch type) | Low-Medium | Overlapping characteristics |
| Win Probability | Score, inning, baserunners, outs | Probability (0-1) | Medium | Rare situations, changing strategy |
| Expected Stats (xBA, xwOBA) | Exit velocity, launch angle | Continuous (batting avg) | Low-Medium | Park effects, defense positioning |
| Contract Value | Age, position, projected WAR | Dollars | High | Market inefficiencies, team needs |
| Injury Risk | Biomechanics, workload, history | Probability, days lost | High | Small samples, privacy constraints |
| Prospect Success | Minor league stats, tools, age | Major league outcome | Very High | Different competition level, development variability |

The "Low" to "Very High" difficulty ratings reflect both technical modeling challenges and fundamental predictability constraints. Some tasks like pitch classification work well because the signal is strong and data abundant. Others like prospect projection remain difficult because talent development involves many unmeasured factors and substantial randomness.

9.1.3 Prediction Philosophy

Several principles guide effective prediction in baseball:

Embrace Uncertainty: Baseball is inherently random. Even the best hitters make outs 65% of the time. Predictions should express uncertainty through confidence intervals or probability distributions rather than point estimates. Saying a player will hit "25-30 home runs with 80% confidence" is more honest and useful than saying "27.4 home runs."

Validate Rigorously: Baseball provides abundant historical data for testing predictions. Always evaluate models on held-out data that the model never saw during training. If your model predicts 2023 performance, it should be trained only on data through 2022. This prevents overfitting—models that memorize historical patterns but fail on new data.
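
To make this concrete, here is a minimal Python sketch of a season-based split; the data frame and its columns (season, age, woba_prev, woba) are invented purely for illustration:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data frame: one row per player-season (all values made up)
seasons = pd.DataFrame({
    'season':    [2020, 2021, 2021, 2022, 2022, 2023, 2023],
    'age':       [26, 27, 29, 28, 30, 29, 31],
    'woba_prev': [.330, .345, .310, .350, .305, .340, .300],
    'woba':      [.340, .350, .300, .345, .310, .335, .295],
})

# Train only on seasons strictly before the one we want to predict
train = seasons[seasons['season'] <= 2022]
test = seasons[seasons['season'] == 2023]

model = LinearRegression().fit(train[['age', 'woba_prev']], train['woba'])
print(model.predict(test[['age', 'woba_prev']]))  # evaluated on unseen 2023 rows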

Start Simple: Begin with straightforward models you can understand and interpret. Linear regression often performs nearly as well as complex neural networks for many baseball tasks while being far easier to explain and debug. Add complexity only when simpler models prove insufficient.

Domain Knowledge Matters: The best prediction models incorporate baseball expertise. A purely statistical model might overlook important factors like a pitcher recovering from injury or a hitter changing his swing mechanics. Combining baseball knowledge with statistical rigor produces better predictions than either alone.


9.2 Regression Fundamentals

Regression forms the foundation of predictive modeling. At its core, regression finds relationships between predictor variables (features) and an outcome variable (target), then uses those relationships to make predictions.

9.2.1 Simple Linear Regression

Simple linear regression models the relationship between one predictor and one outcome using a straight line. The equation is:

Y = β₀ + β₁X + ε

Where:


  • Y is the outcome we're predicting

  • X is the predictor variable

  • β₀ is the intercept (expected Y when X = 0)

  • β₁ is the slope (change in Y per unit change in X)

  • ε is the error term (what the model can't explain)

Let's predict pitcher ERA from FIP (Fielding Independent Pitching). FIP estimates what a pitcher's ERA "should be" based only on outcomes the pitcher controls: strikeouts, walks, and home runs. If FIP truly captures pitcher skill, it should predict future ERA.

R Implementation

library(tidyverse)
library(Lahman)
library(broom)

# Get pitching data from 2022 (to predict 2023)
pitching_2022 <- Pitching %>%
  filter(yearID == 2022, IPouts >= 150) %>%  # Min 50 IP
  mutate(
    IP = IPouts / 3,
    ERA = (ER / IP) * 9,
    FIP = ((13*HR + 3*BB - 2*SO) / IP) + 3.2  # simplified FIP (HBP omitted); constant ~3.2
  ) %>%
  select(playerID, ERA_2022 = ERA, FIP_2022 = FIP)

# Get 2023 ERA (our target to predict)
pitching_2023 <- Pitching %>%
  filter(yearID == 2023, IPouts >= 150) %>%
  mutate(
    IP = IPouts / 3,
    ERA = (ER / IP) * 9
  ) %>%
  select(playerID, ERA_2023 = ERA)

# Combine
pitching_data <- inner_join(pitching_2022, pitching_2023, by = "playerID")

# Fit simple linear regression
model_simple <- lm(ERA_2023 ~ FIP_2022, data = pitching_data)

# View results
summary(model_simple)

Output Interpretation:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.2847     0.4123   3.116  0.00239 **
FIP_2022     0.6789     0.1067   6.364 8.74e-09 ***

Residual standard error: 0.856 on 98 degrees of freedom
Multiple R-squared:  0.292,	Adjusted R-squared:  0.285

The coefficient of 0.68 on FIP_2022 means for every 1.00 increase in 2022 FIP, we expect 2023 ERA to increase by 0.68 runs. The R² of 0.292 indicates FIP explains about 29% of the variance in next year's ERA—meaningful but far from perfect. The remaining 71% reflects luck, injuries, aging, changes in supporting cast, and other factors.

Python Implementation

import pandas as pd
import numpy as np
from pybaseball import pitching_stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

# Get pitching data for 2022 and 2023
pitching_2022 = pitching_stats(2022, qual=50)  # Min 50 IP
pitching_2023 = pitching_stats(2023, qual=50)

# Select and rename columns
pred_2022 = pitching_2022[['IDfg', 'FIP']].rename(columns={'FIP': 'FIP_2022'})
target_2023 = pitching_2023[['IDfg', 'ERA']].rename(columns={'ERA': 'ERA_2023'})

# Merge datasets
data = pd.merge(pred_2022, target_2023, on='IDfg')
data = data.dropna()

# Prepare features and target
X = data[['FIP_2022']].values
y = data['ERA_2023'].values

# Fit linear regression
model = LinearRegression()
model.fit(X, y)

# Print coefficients
print(f"Intercept: {model.intercept_:.3f}")
print(f"Coefficient: {model.coef_[0]:.3f}")

# Calculate R² and RMSE
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)
rmse = np.sqrt(mean_squared_error(y, y_pred))

print(f"R²: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")

# Visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Actual')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('2022 FIP')
plt.ylabel('2023 ERA')
plt.title('Predicting 2023 ERA from 2022 FIP')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

This produces similar results to the R version, showing FIP's moderate predictive power for future ERA.
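
Point predictions hide uncertainty. As a sketch of how you might attach a 95% prediction interval to the simple FIP-to-ERA model, you can refit the same regression with statsmodels (this assumes the data frame built in the block above; the 3.50 FIP query value is arbitrary):

import pandas as pd
import statsmodels.api as sm

# Refit the same simple regression with statsmodels to get interval estimates
X_sm = sm.add_constant(data[['FIP_2022']])
ols_fit = sm.OLS(data['ERA_2023'], X_sm).fit()

# 95% prediction interval for a pitcher with a 3.50 FIP in 2022
new_X = sm.add_constant(pd.DataFrame({'FIP_2022': [3.50]}), has_constant='add')
pred = ols_fit.get_prediction(new_X)
print(pred.summary_frame(alpha=0.05)[['mean', 'obs_ci_lower', 'obs_ci_upper']])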

9.2.2 Multiple Regression

Simple linear regression uses one predictor, but baseball performance involves many factors. Multiple regression extends the concept to multiple predictors:

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ + ε

Let's predict WAR (Wins Above Replacement) from its component parts: batting runs, baserunning runs, fielding runs, and positional adjustment.

R Implementation

library(tidyverse)
library(baseballr)
library(broom)

# Note: This uses FanGraphs data (requires baseballr package)
# Alternatively, construct from available data

# Create synthetic example data for demonstration
set.seed(42)
n <- 200

war_data <- tibble(
  player_id = 1:n,
  batting_runs = rnorm(n, 15, 20),
  baserunning_runs = rnorm(n, 0, 3),
  fielding_runs = rnorm(n, 0, 8),
  positional_adj = sample(c(-10, -5, 0, 2.5, 7.5), n, replace = TRUE)
) %>%
  mutate(
    # WAR is approximately these components divided by runs per win (~10)
    WAR = (batting_runs + baserunning_runs + fielding_runs +
           positional_adj + 20) / 10 + rnorm(n, 0, 0.5)
  )

# Fit multiple regression
model_multi <- lm(WAR ~ batting_runs + baserunning_runs +
                  fielding_runs + positional_adj,
                  data = war_data)

# Tidy output
tidy(model_multi)
glance(model_multi)

# Example prediction for new player
new_player <- tibble(
  batting_runs = 25,
  baserunning_runs = 2,
  fielding_runs = 5,
  positional_adj = 7.5
)

predict(model_multi, newdata = new_player, interval = "prediction")

Output:

# A tibble: 5 × 5
  term              estimate std.error statistic  p.value
  <chr>                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)         2.04      0.0352     58.0  1.22e-129
2 batting_runs        0.100     0.00247    40.4  9.95e-103
3 baserunning_runs    0.0991    0.0165      6.01 9.25e-  9
4 fielding_runs       0.0987    0.00620    15.9  1.01e- 35
5 positional_adj      0.100     0.00885    11.3  3.27e- 23

Multiple R-squared: 0.967

The coefficients near 0.10 make sense: WAR uses approximately 10 runs per win, so each component contributes about 0.10 WAR per run. The high R² (0.967) is expected here because the synthetic WAR was built directly from these components plus a little noise, but the lesson carries over to real data: WAR is, by definition, largely a sum of these measured components.

Python Implementation

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import statsmodels.api as sm

# Create synthetic data
np.random.seed(42)
n = 200

data = pd.DataFrame({
    'batting_runs': np.random.normal(15, 20, n),
    'baserunning_runs': np.random.normal(0, 3, n),
    'fielding_runs': np.random.normal(0, 8, n),
    'positional_adj': np.random.choice([-10, -5, 0, 2.5, 7.5], n)
})

# Calculate WAR from components
data['WAR'] = ((data['batting_runs'] + data['baserunning_runs'] +
                data['fielding_runs'] + data['positional_adj'] + 20) / 10 +
               np.random.normal(0, 0.5, n))

# Prepare features and target
X = data[['batting_runs', 'baserunning_runs', 'fielding_runs', 'positional_adj']]
y = data['WAR']

# Fit using sklearn
model = LinearRegression()
model.fit(X, y)

print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature}: {coef:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

# Get detailed statistics using statsmodels
X_with_const = sm.add_constant(X)
model_stats = sm.OLS(y, X_with_const).fit()
print("\nModel Summary:")
print(model_stats.summary())

# Make prediction
new_player = pd.DataFrame({
    'batting_runs': [25],
    'baserunning_runs': [2],
    'fielding_runs': [5],
    'positional_adj': [7.5]
})

predicted_war = model.predict(new_player)
print(f"\nPredicted WAR for new player: {predicted_war[0]:.2f}")

9.2.3 Regression Diagnostics

Regression models make assumptions that should be verified. Key diagnostics include:

Residuals Plot: Residuals (actual - predicted) should show no pattern when plotted against predicted values. Patterns indicate the model is missing something.

Normal Q-Q Plot: Residuals should follow a normal distribution. Deviations at the tails suggest outliers or model misspecification.

Scale-Location Plot: Checks for heteroscedasticity (non-constant variance). If residual spread increases with predicted values, transformations may help.

Leverage Plot: Identifies influential observations that strongly affect the model. High leverage points deserve investigation.

R Diagnostic Plots

# Using our previous ERA prediction model
par(mfrow = c(2, 2))  # 2x2 layout
plot(model_simple)
par(mfrow = c(1, 1))  # Reset

# Or using ggplot2 for prettier diagnostics
library(ggfortify)
autoplot(model_simple, which = 1:6, ncol = 2, label.size = 3)

Python Diagnostic Plots

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Get predictions and residuals (model, X, y from the multiple regression example above)
y_pred = model.predict(X)
residuals = y - y_pred

# Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Residuals vs Fitted
axes[0, 0].scatter(y_pred, residuals, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--')
axes[0, 0].set_xlabel('Fitted values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted')

# Normal Q-Q Plot
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot')

# Scale-Location Plot
standardized_residuals = np.sqrt(np.abs(residuals / np.std(residuals)))
axes[1, 0].scatter(y_pred, standardized_residuals, alpha=0.5)
axes[1, 0].set_xlabel('Fitted values')
axes[1, 0].set_ylabel('√|Standardized residuals|')
axes[1, 0].set_title('Scale-Location')

# Residuals Histogram
axes[1, 1].hist(residuals, bins=30, edgecolor='black')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Residuals Distribution')

plt.tight_layout()
plt.show()

What to Look For:


  • Residuals plot should show random scatter around zero (no curve, no funnel)

  • Q-Q plot points should follow the diagonal line closely

  • Scale-location should show roughly constant spread

  • Histogram should be approximately bell-shaped

Violations suggest model improvements: transforming variables, adding interaction terms, or using different modeling approaches.
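
For example, a curved residuals-vs-fitted plot often calls for a polynomial or interaction term. A minimal sketch on invented data (the launch-angle/wOBA relationship below is synthetic, purely to show the mechanics):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic frame with a curved relationship between launch angle and wOBA
rng = np.random.default_rng(0)
df = pd.DataFrame({'launch_angle': rng.normal(12, 18, 300)})
df['woba'] = 0.40 - 0.0003 * (df['launch_angle'] - 15) ** 2 + rng.normal(0, 0.03, 300)

# A purely linear fit leaves a curved residual pattern ...
linear_fit = smf.ols('woba ~ launch_angle', data=df).fit()

# ... which adding a squared term largely removes
quad_fit = smf.ols('woba ~ launch_angle + I(launch_angle**2)', data=df).fit()
print(f"Linear R²: {linear_fit.rsquared:.3f}  Quadratic R²: {quad_fit.rsquared:.3f}")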


9.3 Introduction to tidymodels (R) / scikit-learn (Python)

Modern machine learning requires systematic workflows that separate training and testing data, preprocess features consistently, and evaluate models rigorously. Two frameworks have emerged as standards: tidymodels for R and scikit-learn for Python.

9.3.1 The Machine Learning Workflow

Every supervised learning project follows similar steps:

  1. Split Data: Separate into training (build model) and testing (evaluate performance)
  2. Preprocessing/Feature Engineering: Transform variables, handle missing data, create new features
  3. Specify Model: Choose algorithm and set parameters
  4. Fit Model: Train on training data
  5. Evaluate: Assess performance on test data
  6. Iterate: Refine based on results

This structure prevents a critical mistake: using the same data for training and evaluation. If you test on data the model saw during training, you'll overestimate performance. The model might memorize training data (overfitting) but fail on new data.
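
The sketch below illustrates the trap on synthetic data: with purely random labels, an unconstrained decision tree scores perfectly on its own training data yet does no better than a coin flip on held-out data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # 5 meaningless features
y = rng.integers(0, 2, size=500)       # labels with no real signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unlimited depth
print(f"Train accuracy: {tree.score(X_tr, y_tr):.2f}")   # ~1.00 (memorized)
print(f"Test accuracy:  {tree.score(X_te, y_te):.2f}")   # ~0.50 (no real skill)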

9.3.2 tidymodels Workflow (R)

tidymodels provides a consistent interface for machine learning in R, integrating seamlessly with tidyverse packages.

Complete Workflow Example: Predicting Hit/Out

library(tidymodels)
library(tidyverse)

# Create synthetic at-bat data
set.seed(123)
n_atbats <- 1000

atbat_data <- tibble(
  exit_velocity = rnorm(n_atbats, 89, 8),
  launch_angle = rnorm(n_atbats, 12, 18),
  sprint_speed = rnorm(n_atbats, 27, 1.5),
  hit_distance = exit_velocity * 3 + abs(launch_angle) * 2 + rnorm(n_atbats, 0, 20)
) %>%
  mutate(
    # Hit probability increases with exit velocity and optimal launch angle
    hit_prob = plogis(
      -3 +
      0.08 * exit_velocity -
      0.0002 * (launch_angle - 15)^2 +  # penalized away from ~15 degree launch angle
      0.1 * sprint_speed
    ),
    is_hit = rbinom(n_atbats, 1, hit_prob),
    is_hit = factor(is_hit, labels = c("out", "hit"))
  )

# Step 1: Split Data (75% training, 25% testing)
set.seed(456)
data_split <- initial_split(atbat_data, prop = 0.75, strata = is_hit)
train_data <- training(data_split)
test_data <- testing(data_split)

# Step 2: Create Recipe (preprocessing)
hit_recipe <- recipe(is_hit ~ exit_velocity + launch_angle + sprint_speed,
                     data = train_data) %>%
  step_normalize(all_numeric_predictors())  # Standardize numeric features

# Step 3: Specify Model
logistic_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Step 4: Create Workflow (combines recipe + model)
hit_workflow <- workflow() %>%
  add_recipe(hit_recipe) %>%
  add_model(logistic_model)

# Step 5: Fit Model
hit_fit <- hit_workflow %>%
  fit(data = train_data)

# Step 6: Evaluate on Test Data
# Get predictions
test_predictions <- predict(hit_fit, test_data) %>%
  bind_cols(predict(hit_fit, test_data, type = "prob")) %>%
  bind_cols(test_data)

# Calculate metrics
test_metrics <- test_predictions %>%
  metrics(truth = is_hit, estimate = .pred_class, .pred_hit)

print(test_metrics)

# Confusion matrix
test_predictions %>%
  conf_mat(truth = is_hit, estimate = .pred_class)

# ROC curve ("hit" is the second level of the is_hit factor)
test_predictions %>%
  roc_curve(truth = is_hit, .pred_hit, event_level = "second") %>%
  autoplot()

Key tidymodels Concepts:


  • initial_split(): Divides data into train/test

  • recipe(): Defines preprocessing steps

  • Model specification separates algorithm from engine

  • workflow(): Bundles preprocessing and model

  • metrics(): Calculates multiple evaluation metrics at once

9.3.3 scikit-learn Workflow (Python)

scikit-learn provides a unified API for machine learning in Python, with consistent .fit() and .predict() methods across all models.

Complete Workflow Example: Predicting Hit/Out

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                            f1_score, roc_auc_score, confusion_matrix,
                            classification_report, roc_curve)
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import seaborn as sns

# Create synthetic at-bat data
np.random.seed(123)
n_atbats = 1000

data = pd.DataFrame({
    'exit_velocity': np.random.normal(89, 8, n_atbats),
    'launch_angle': np.random.normal(12, 18, n_atbats),
    'sprint_speed': np.random.normal(27, 1.5, n_atbats)
})

data['hit_distance'] = (data['exit_velocity'] * 3 +
                        np.abs(data['launch_angle']) * 2 +
                        np.random.normal(0, 20, n_atbats))

# Calculate hit probability
def logistic(x):
    return 1 / (1 + np.exp(-x))

data['hit_prob'] = logistic(
    -3 +
    0.08 * data['exit_velocity'] -
    0.0002 * (data['launch_angle'] - 15)**2 +  # penalized away from ~15 degree launch angle
    0.1 * data['sprint_speed']
)

data['is_hit'] = np.random.binomial(1, data['hit_prob'])

# Step 1: Split Data
X = data[['exit_velocity', 'launch_angle', 'sprint_speed']]
y = data['is_hit']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=456, stratify=y
)

# Step 2-4: Create Pipeline (preprocessing + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# Step 5: Fit Model
pipeline.fit(X_train, y_train)

# Step 6: Evaluate
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

# Print metrics
print("Classification Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
print(f"ROC AUC:   {roc_auc_score(y_test, y_pred_proba):.3f}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Out', 'Hit']))

# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc_score(y_test, y_pred_proba):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Hit/Out Prediction')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Key scikit-learn Concepts:


  • train_test_split(): Divides data

  • Pipeline: Chains preprocessing and modeling steps

  • .fit(): Trains model

  • .predict(): Makes predictions

  • .predict_proba(): Returns probabilities

  • Consistent metrics interface


9.4 Classification Models

Classification predicts categorical outcomes: hit or out, swing or take, fastball or breaking ball. Several algorithms excel at different types of classification problems.

9.4.1 Logistic Regression

Despite its name, logistic regression is a classification algorithm. It estimates the probability of an outcome using the logistic function:

P(Y=1) = 1 / (1 + e^-(β₀ + β₁X₁ + ... + βₙXₙ))

This squashes any linear combination of predictors into a probability between 0 and 1.
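
A quick numeric illustration of the squashing, using made-up (not fitted) coefficients for exit velocity and launch angle:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to (0, 1)
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients for exit velocity (mph) and launch angle (deg)
b0, b_ev, b_la = -14.0, 0.13, 0.05

for ev, la in [(85, 5), (95, 20), (105, 28)]:
    z = b0 + b_ev * ev + b_la * la          # linear score
    print(f"EV {ev}, LA {la}: P(hit) = {sigmoid(z):.2f}")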

Predicting Hit/Out with Launch Conditions

Building on our previous example, let's examine how launch conditions predict hits:

R Implementation

library(tidymodels)
library(tidyverse)

# Using previous atbat_data
# Fit logistic regression
logistic_spec <- logistic_reg() %>%
  set_engine("glm")

logistic_fit <- logistic_spec %>%
  fit(is_hit ~ exit_velocity + launch_angle + sprint_speed,
      data = train_data)

# View coefficients
tidy(logistic_fit)

# Prediction surface visualization
# Create grid of exit velocity and launch angle values
pred_grid <- expand_grid(
  exit_velocity = seq(75, 105, length.out = 50),
  launch_angle = seq(-20, 50, length.out = 50),
  sprint_speed = 27  # Hold constant at average
)

# Get predictions
pred_grid <- pred_grid %>%
  bind_cols(predict(logistic_fit, pred_grid, type = "prob"))

# Plot
ggplot(pred_grid, aes(x = exit_velocity, y = launch_angle, fill = .pred_hit)) +
  geom_tile() +
  scale_fill_viridis_c(name = "Hit Probability", limits = c(0, 1)) +
  labs(
    title = "Hit Probability by Exit Velocity and Launch Angle",
    x = "Exit Velocity (mph)",
    y = "Launch Angle (degrees)"
  ) +
  theme_minimal() +
  coord_fixed(ratio = 1)

Python Implementation

from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import numpy as np

# Fit logistic regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Print coefficients
print("Coefficients:")
for feature, coef in zip(X.columns, log_reg.coef_[0]):
    print(f"  {feature}: {coef:.4f}")
print(f"Intercept: {log_reg.intercept_[0]:.4f}")

# Create prediction surface
ev_range = np.linspace(75, 105, 50)
la_range = np.linspace(-20, 50, 50)
ev_grid, la_grid = np.meshgrid(ev_range, la_range)

# Create feature array (exit_velocity, launch_angle, sprint_speed=27)
grid_features = np.c_[
    ev_grid.ravel(),
    la_grid.ravel(),
    np.full(ev_grid.ravel().shape, 27)
]

# Get predictions
hit_probs = log_reg.predict_proba(grid_features)[:, 1].reshape(ev_grid.shape)

# Plot
plt.figure(figsize=(10, 8))
contour = plt.contourf(ev_grid, la_grid, hit_probs, levels=20, cmap='viridis')
plt.colorbar(contour, label='Hit Probability')
plt.xlabel('Exit Velocity (mph)')
plt.ylabel('Launch Angle (degrees)')
plt.title('Hit Probability by Exit Velocity and Launch Angle')
plt.tight_layout()
plt.show()

This visualization shows the "barrel zone"—the combination of high exit velocity and optimal launch angle that maximizes hit probability.

9.4.2 Decision Trees

Decision trees split data based on feature values, creating a tree-like structure of decisions. They're intuitive, handle non-linear relationships, and require minimal preprocessing.

Pitch Type Classification

Let's classify pitches as fastball or breaking ball based on velocity and spin:

R Implementation

library(tidymodels)
library(rpart.plot)

# Create synthetic pitch data
set.seed(789)
n_pitches <- 800

pitch_data <- tibble(
  velocity = c(
    rnorm(400, 94, 2.5),  # Fastballs
    rnorm(400, 82, 3)     # Breaking balls
  ),
  spin_rate = c(
    rnorm(400, 2300, 200),  # Fastballs
    rnorm(400, 2650, 250)   # Breaking balls
  ),
  vertical_break = c(
    rnorm(400, 15, 2),    # Fastballs (less drop)
    rnorm(400, -5, 3)     # Breaking balls (more drop)
  ),
  pitch_type = factor(rep(c("Fastball", "Breaking"), each = 400))
)

# Split data
pitch_split <- initial_split(pitch_data, prop = 0.75, strata = pitch_type)
pitch_train <- training(pitch_split)
pitch_test <- testing(pitch_split)

# Specify decision tree
tree_spec <- decision_tree(
  cost_complexity = 0.01,
  tree_depth = 5
) %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Fit model
tree_fit <- tree_spec %>%
  fit(pitch_type ~ velocity + spin_rate + vertical_break,
      data = pitch_train)

# Visualize tree
tree_fit$fit %>%
  rpart.plot(roundint = FALSE, extra = 2)

# Evaluate
pitch_predictions <- predict(tree_fit, pitch_test) %>%
  bind_cols(pitch_test)

pitch_predictions %>%
  conf_mat(truth = pitch_type, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

# Metrics
pitch_predictions %>%
  metrics(truth = pitch_type, estimate = .pred_class)

Python Implementation

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Create synthetic pitch data
np.random.seed(789)
n_pitches = 800

pitch_data = pd.DataFrame({
    'velocity': np.concatenate([
        np.random.normal(94, 2.5, 400),  # Fastballs
        np.random.normal(82, 3, 400)      # Breaking balls
    ]),
    'spin_rate': np.concatenate([
        np.random.normal(2300, 200, 400),  # Fastballs
        np.random.normal(2650, 250, 400)   # Breaking balls
    ]),
    'vertical_break': np.concatenate([
        np.random.normal(15, 2, 400),      # Fastballs
        np.random.normal(-5, 3, 400)       # Breaking balls
    ]),
    'pitch_type': ['Fastball']*400 + ['Breaking']*400
})

# Split data
X = pitch_data[['velocity', 'spin_rate', 'vertical_break']]
y = pitch_data['pitch_type']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit decision tree
tree_clf = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    random_state=42
)
tree_clf.fit(X_train, y_train)

# Visualize tree
plt.figure(figsize=(20, 10))
plot_tree(tree_clf,
          feature_names=X.columns,
          class_names=['Breaking', 'Fastball'],
          filled=True,
          rounded=True,
          fontsize=10)
plt.title('Decision Tree for Pitch Classification')
plt.tight_layout()
plt.show()

# Evaluate
y_pred = tree_clf.predict(X_test)

print("Classification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Breaking', 'Fastball'],
            yticklabels=['Breaking', 'Fastball'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix: Pitch Classification')
plt.tight_layout()
plt.show()

Decision trees are interpretable—you can follow the splits—but prone to overfitting. They perform best when ensemble methods combine many trees.
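
One way to see the overfitting tendency is to sweep the maximum depth and compare training and test accuracy, as in this rough sketch using the pitch-data split above:

from sklearn.tree import DecisionTreeClassifier

# Deeper trees fit the training data better and better, while test accuracy
# flattens out (or drops) once the tree starts memorizing noise
for depth in [1, 2, 3, 5, 10, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print(f"depth={str(depth):>4}  "
          f"train acc={clf.score(X_train, y_train):.3f}  "
          f"test acc={clf.score(X_test, y_test):.3f}")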

9.4.3 Random Forests

Random forests build many decision trees, each trained on a random subset of data and features, then aggregate their predictions. This reduces overfitting while maintaining trees' ability to capture complex patterns.

R Implementation

library(tidymodels)
library(ranger)  # Fast random forest implementation

# Specify random forest
rf_spec <- rand_forest(
  mtry = 2,        # Number of variables to sample at each split
  trees = 500,     # Number of trees
  min_n = 5        # Minimum observations in terminal nodes
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Create workflow
rf_workflow <- workflow() %>%
  add_formula(pitch_type ~ velocity + spin_rate + vertical_break) %>%
  add_model(rf_spec)

# Fit model
rf_fit <- rf_workflow %>%
  fit(data = pitch_train)

# Predictions
rf_predictions <- predict(rf_fit, pitch_test) %>%
  bind_cols(predict(rf_fit, pitch_test, type = "prob")) %>%
  bind_cols(pitch_test)

# Metrics
rf_predictions %>%
  metrics(truth = pitch_type, estimate = .pred_class)

# ROC-AUC ("Fastball" is the second level of the pitch_type factor)
rf_predictions %>%
  roc_auc(truth = pitch_type, .pred_Fastball, event_level = "second")

# Variable importance
rf_fit %>%
  extract_fit_parsnip() %>%
  vip::vip()

Python Implementation

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Fit random forest
rf_clf = RandomForestClassifier(
    n_estimators=500,
    max_features=2,
    min_samples_split=10,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

rf_clf.fit(X_train, y_train)

# Predictions
y_pred = rf_clf.predict(X_test)
y_pred_proba = rf_clf.predict_proba(X_test)

# Metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))

# ROC-AUC (for binary classification, convert to numeric)
y_test_binary = (y_test == 'Fastball').astype(int)
roc_auc = roc_auc_score(y_test_binary, y_pred_proba[:, 1])
print(f"\nROC-AUC: {roc_auc:.3f}")

# Feature importance
importances = rf_clf.feature_importances_
features = X.columns

plt.figure(figsize=(8, 6))
plt.barh(features, importances)
plt.xlabel('Importance')
plt.title('Feature Importance in Random Forest')
plt.tight_layout()
plt.show()

# Feature importance with error bars
# Calculate standard deviation of importance across trees
std = np.std([tree.feature_importances_ for tree in rf_clf.estimators_], axis=0)

plt.figure(figsize=(8, 6))
plt.barh(features, importances, xerr=std)
plt.xlabel('Importance')
plt.title('Feature Importance (with std across trees)')
plt.tight_layout()
plt.show()

Random forests typically outperform single decision trees and require less tuning than many algorithms. They're a strong default choice for classification tasks.
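
A quick way to check that claim on your own data is cross-validation. A sketch comparing a single tree to a forest on the pitch features X and labels y defined above:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# 5-fold cross-validated accuracy for a single tree vs. a random forest
for name, clf in [("single tree", DecisionTreeClassifier(random_state=42)),
                  ("random forest", RandomForestClassifier(n_estimators=500, random_state=42))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name:>13}: {scores.mean():.3f} ± {scores.std():.3f}")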


9.5 Advanced Models

Modern machine learning offers powerful algorithms for complex patterns. Two stand out for baseball applications: gradient boosting (especially XGBoost) and neural networks.

9.5.1 XGBoost

XGBoost (Extreme Gradient Boosting) builds trees sequentially, each new tree correcting the errors of the ones before it. It has dominated Kaggle competitions and performs exceptionally well on structured, tabular data like baseball statistics.
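
To make the sequential error-correction idea concrete, here is a tiny from-scratch sketch of boosting depth-1 regression trees on a toy target (illustrative only; XGBoost adds regularized objectives, shrinkage on leaf weights, and much more):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(70, 110, size=(400, 1))                       # toy feature: exit velocity
y = 0.01 * (X[:, 0] - 70) ** 1.5 + rng.normal(0, 0.2, 400)    # noisy toy target

learning_rate = 0.1
pred = np.full_like(y, y.mean())    # start from the overall mean
trees = []

for _ in range(100):
    residual = y - pred                                   # what previous trees missed
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    pred += learning_rate * stump.predict(X)              # small corrective step
    trees.append(stump)

print(f"MSE after boosting: {np.mean((y - pred) ** 2):.4f} "
      f"(baseline: {np.var(y):.4f})")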

Predicting Hit/Out with XGBoost

R Implementation

library(tidymodels)
library(xgboost)

# Specify XGBoost model
xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = 6,
  min_n = 5,
  loss_reduction = 0,
  sample_size = 0.8,
  mtry = 2,            # number of predictors sampled per split (3 available)
  learn_rate = 0.01
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# Create recipe
xgb_recipe <- recipe(is_hit ~ exit_velocity + launch_angle + sprint_speed,
                     data = train_data) %>%
  step_normalize(all_numeric_predictors())

# Workflow
xgb_workflow <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)

# Fit
xgb_fit <- xgb_workflow %>%
  fit(data = train_data)

# Evaluate
xgb_predictions <- predict(xgb_fit, test_data) %>%
  bind_cols(predict(xgb_fit, test_data, type = "prob")) %>%
  bind_cols(test_data)

# Metrics
xgb_metrics <- xgb_predictions %>%
  metrics(truth = is_hit, estimate = .pred_class, .pred_hit)

print(xgb_metrics)

# Feature importance
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip::vip(geom = "col")

Python Implementation

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt

# Fit XGBoost on the hit/out data
# (X_train, X_test, y_train, y_test: the 0/1-label split from Section 9.3.3)
xgb_clf = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.01,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss'
)

xgb_clf.fit(X_train, y_train)

# Predictions
y_pred = xgb_clf.predict(X_test)
y_pred_proba = xgb_clf.predict_proba(X_test)[:, 1]

# Metrics
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Out', 'Hit']))
print(f"\nROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Feature importance
from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(8, 6))
plot_importance(xgb_clf, ax=ax)
plt.title('XGBoost Feature Importance')
plt.tight_layout()
plt.show()

# Learning curves: track log loss on train and test sets as trees are added
xgb_clf_eval = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.01,
    random_state=42,
    eval_metric='logloss'  # set here; recent xgboost versions reject it in fit()
)

xgb_clf_eval.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False
)

# Plot learning curves
results = xgb_clf_eval.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
ax.set_ylabel('Log Loss')
ax.set_xlabel('Number of Trees')
ax.set_title('XGBoost Learning Curves')
plt.tight_layout()
plt.show()

XGBoost excels when you have many features and complex interactions. It's particularly effective for player projection systems that incorporate dozens of variables.
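
When you do have many features, hyperparameters matter. Here is a sketch of a small grid search over two of them with scikit-learn's GridSearchCV, using the hit/out split from Section 9.3.3; the grid values are just examples:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 6],
    'learning_rate': [0.01, 0.1],
}

search = GridSearchCV(
    XGBClassifier(n_estimators=300, eval_metric='logloss', random_state=42),
    param_grid,
    scoring='roc_auc',
    cv=5,
    n_jobs=-1
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV ROC-AUC: {search.best_score_:.3f}")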

9.5.2 Neural Networks

Neural networks learn representations by passing data through layers of interconnected nodes (neurons). They excel at finding complex patterns but require more data and tuning than tree-based models.

Simple Neural Network for Hit Prediction

R Implementation (keras/tensorflow)

library(keras)
library(tensorflow)

# Prepare data (neural networks need numeric matrices)
X_train_nn <- train_data %>%
  select(exit_velocity, launch_angle, sprint_speed) %>%
  as.matrix()

y_train_nn <- train_data %>%
  pull(is_hit) %>%
  as.numeric() - 1  # Convert to 0/1

X_test_nn <- test_data %>%
  select(exit_velocity, launch_angle, sprint_speed) %>%
  as.matrix()

y_test_nn <- test_data %>%
  pull(is_hit) %>%
  as.numeric() - 1

# Normalize features
mean_vals <- colMeans(X_train_nn)
sd_vals <- apply(X_train_nn, 2, sd)

X_train_scaled <- scale(X_train_nn, center = mean_vals, scale = sd_vals)
X_test_scaled <- scale(X_test_nn, center = mean_vals, scale = sd_vals)

# Build neural network
model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = 3) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

# Compile model
model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

# Train model
history <- model %>% fit(
  x = X_train_scaled,
  y = y_train_nn,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.2,
  verbose = 0
)

# Plot training history
plot(history)

# Evaluate on test set
test_metrics <- model %>% evaluate(X_test_scaled, y_test_nn, verbose = 0)
cat(sprintf("Test Accuracy: %.3f\n", test_metrics[["accuracy"]]))

# Predictions
predictions <- model %>% predict(X_test_scaled)
pred_classes <- ifelse(predictions > 0.5, 1, 0)

# Confusion matrix
table(Predicted = pred_classes, Actual = y_test_nn)

Python Implementation

import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score

# Prepare data (hit/out split from Section 9.3.3; labels are already 0/1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build neural network
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(3,)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile model
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Train model
history = model.fit(
    X_train_scaled, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# Plot training history
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Model Loss')

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Model Accuracy')

plt.tight_layout()
plt.show()

# Evaluate on test set
test_loss, test_acc = model.evaluate(X_test_scaled, y_test, verbose=0)
print(f"Test Accuracy: {test_acc:.3f}")

# Predictions
y_pred_proba = model.predict(X_test_scaled).flatten()
y_pred = (y_pred_proba > 0.5).astype(int)

# Metrics
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Out', 'Hit']))
print(f"\nROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")

Neural networks often don't outperform gradient boosting on small-to-medium tabular datasets typical in baseball analytics. They shine with very large datasets or when working with images (e.g., classifying pitches from video).
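Since both models score the same held-out at-bats, a quick side-by-side comparison is straightforward. Below is a minimal sketch, assuming the fitted xgb_clf and Keras model objects (and X_test, X_test_scaled, y_test) from the examples above are still in scope:

from sklearn.metrics import roc_auc_score

# Probabilities from each model on the same test set
xgb_proba = xgb_clf.predict_proba(X_test)[:, 1]
nn_proba = model.predict(X_test_scaled, verbose=0).flatten()

print(f"XGBoost ROC-AUC:        {roc_auc_score(y_test, xgb_proba):.3f}")
print(f"Neural network ROC-AUC: {roc_auc_score(y_test, nn_proba):.3f}")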


9.6 Cross-Validation and Tuning

Testing on a single train/test split can be misleading—you might get lucky (or unlucky) with how the data splits. Cross-validation provides more robust estimates of model performance.

9.6.1 K-Fold Cross-Validation

K-fold CV divides data into K folds. The model trains on K-1 folds and tests on the remaining fold, repeating K times so each fold serves as the test set once. Final performance is the average across all K iterations.

R Implementation

library(tidymodels)

# Create 10-fold cross-validation
set.seed(234)
folds <- vfold_cv(train_data, v = 10, strata = is_hit)

# Fit model on all folds
rf_cv <- rf_workflow %>%
  fit_resamples(
    resamples = folds,
    metrics = metric_set(accuracy, roc_auc, precision, recall),
    control = control_resamples(save_pred = TRUE)
  )

# Collect metrics across folds
cv_metrics <- collect_metrics(rf_cv)
print(cv_metrics)

# View metrics by fold
cv_metrics_by_fold <- rf_cv %>%
  unnest(.metrics)

# Plot accuracy across folds
cv_metrics_by_fold %>%
  filter(.metric == "accuracy") %>%
  ggplot(aes(x = id, y = .estimate)) +
  geom_point(size = 3) +
  geom_hline(yintercept = mean(cv_metrics_by_fold %>%
               filter(.metric == "accuracy") %>%
               pull(.estimate)),
             linetype = "dashed", color = "red") +
  labs(
    title = "Accuracy Across Cross-Validation Folds",
    x = "Fold",
    y = "Accuracy"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Confusion matrix across all predictions
cv_predictions <- collect_predictions(rf_cv)
cv_predictions %>%
  conf_mat(truth = is_hit, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Python Implementation

from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

# Create random forest
rf_clf = RandomForestClassifier(n_estimators=500, random_state=42)

# 10-fold stratified cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Define scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Perform cross-validation
cv_results = cross_validate(
    rf_clf, X_train, y_train,
    cv=cv,
    scoring=scoring,
    return_train_score=True
)

# Print mean and std of metrics
print("Cross-Validation Results:")
for metric in scoring:
    test_scores = cv_results[f'test_{metric}']
    print(f"{metric.upper()}: {test_scores.mean():.3f} (+/- {test_scores.std():.3f})")

# Plot metrics across folds
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, metric in enumerate(scoring):
    test_scores = cv_results[f'test_{metric}']
    train_scores = cv_results[f'train_{metric}']

    axes[idx].plot(range(1, 11), train_scores, 'o-', label='Train', alpha=0.7)
    axes[idx].plot(range(1, 11), test_scores, 'o-', label='Test', alpha=0.7)
    axes[idx].axhline(test_scores.mean(), linestyle='--', color='red', alpha=0.5)
    axes[idx].set_xlabel('Fold')
    axes[idx].set_ylabel(metric.upper())
    axes[idx].set_title(f'{metric.upper()} Across Folds')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
plt.show()

Cross-validation gives you more confidence in your performance estimate: if the metrics are consistent across folds, the estimate is stable rather than an artifact of one lucky split.

9.6.2 Hyperparameter Tuning

Most algorithms have hyperparameters—settings you choose before training that affect performance. Examples include:


  • Random Forest: number of trees, max depth, min samples per leaf
  • XGBoost: learning rate, max depth, number of estimators
  • Neural Networks: number of layers, neurons per layer, learning rate

Grid search and random search systematically try different hyperparameter combinations to find the best settings.

R Implementation (Grid Search with tune)

library(tidymodels)

# Specify model with tuning parameters
rf_tune_spec <- rand_forest(
  mtry = tune(),      # Number of predictors sampled at each split
  trees = 500,
  min_n = tune()      # Minimum observations in terminal nodes
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Create workflow
rf_tune_workflow <- workflow() %>%
  add_formula(pitch_type ~ velocity + spin_rate + vertical_break) %>%
  add_model(rf_tune_spec)

# Create tuning grid
rf_grid <- grid_regular(
  mtry(range = c(1, 3)),
  min_n(range = c(2, 20)),
  levels = 5
)

# Create cross-validation folds
set.seed(345)
pitch_folds <- vfold_cv(pitch_train, v = 5, strata = pitch_type)

# Tune model
rf_tuning <- rf_tune_workflow %>%
  tune_grid(
    resamples = pitch_folds,
    grid = rf_grid,
    metrics = metric_set(accuracy, roc_auc)
  )

# View results
collect_metrics(rf_tuning)

# Best parameters
best_params <- select_best(rf_tuning, metric = "roc_auc")
print(best_params)

# Finalize workflow with best parameters
final_rf_workflow <- rf_tune_workflow %>%
  finalize_workflow(best_params)

# Fit final model on all training data
final_rf_fit <- final_rf_workflow %>%
  fit(data = pitch_train)

# Evaluate on test data
final_predictions <- predict(final_rf_fit, pitch_test) %>%
  bind_cols(predict(final_rf_fit, pitch_test, type = "prob")) %>%
  bind_cols(pitch_test)

final_predictions %>%
  metrics(truth = pitch_type, estimate = .pred_class, .pred_Fastball)

# Visualize tuning results
autoplot(rf_tuning)

Python Implementation (RandomizedSearchCV)

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
import numpy as np

# Define parameter distributions
param_distributions = {
    'n_estimators': [100, 300, 500, 700],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None]
}

# Create base model
rf_base = RandomForestClassifier(random_state=42)

# Create RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_distributions,
    n_iter=50,  # Number of parameter combinations to try
    cv=5,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

# Fit
random_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:")
print(random_search.best_params_)

print(f"\nBest Cross-Validation ROC-AUC: {random_search.best_score_:.3f}")

# Evaluate best model on test set
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_proba = best_model.predict_proba(X_test)[:, 1]

test_roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test Set ROC-AUC: {test_roc_auc:.3f}")

# Visualize parameter importance
cv_results = pd.DataFrame(random_search.cv_results_)

# Plot top parameters
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

params_to_plot = ['param_n_estimators', 'param_max_depth',
                  'param_min_samples_split', 'param_min_samples_leaf']
titles = ['Number of Trees', 'Max Depth', 'Min Samples Split', 'Min Samples Leaf']

for ax, param, title in zip(axes.ravel(), params_to_plot, titles):
    # Filter out None values for plotting
    plot_data = cv_results[[param, 'mean_test_score']].copy()
    plot_data = plot_data[plot_data[param].notna()]

    ax.scatter(plot_data[param], plot_data['mean_test_score'], alpha=0.5)
    ax.set_xlabel(title)
    ax.set_ylabel('Mean CV ROC-AUC')
    ax.set_title(f'Effect of {title}')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Tuning usually improves performance, but be careful not to overfit the validation data: repeatedly selecting hyperparameters on the same folds makes the reported scores optimistic. Reserve separate test data for the final evaluation.
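One safeguard is nested cross-validation, where an outer loop scores the entire tuning procedure on folds it never saw during the search. A minimal sketch, reusing the random_search object and training data from above (note that this multiplies the compute cost):

from sklearn.model_selection import cross_val_score

# Outer 5-fold CV around the inner RandomizedSearchCV; each outer fold is scored
# with hyperparameters tuned only on the remaining folds, so the estimate is honest
nested_scores = cross_val_score(
    random_search, X_train, y_train,
    cv=5, scoring='roc_auc', n_jobs=-1
)

print(f"Nested CV ROC-AUC: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")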


9.7 Clustering for Player Comparison

Clustering groups similar observations together without predefined labels. In baseball, clustering finds similar players—useful for scouting comparisons, replacement identification, and market analysis.

9.7.1 K-Means Clustering

K-means partitions data into K clusters by minimizing within-cluster variance. Each cluster has a centroid (mean), and observations belong to their nearest centroid.

R Implementation

library(tidyverse)
library(cluster)
library(factoextra)

# Create synthetic player data
set.seed(456)
n_players <- 200

player_stats <- tibble(
  player_id = 1:n_players,
  HR_per_600 = rnorm(n_players, 20, 8),
  BB_pct = rnorm(n_players, 9, 3),
  K_pct = rnorm(n_players, 22, 5),
  avg_exit_velo = rnorm(n_players, 89, 4),
  sprint_speed = rnorm(n_players, 27, 1.5)
)

# Standardize features (important for clustering!)
player_stats_scaled <- player_stats %>%
  select(-player_id) %>%
  scale()

# Determine optimal number of clusters (elbow method)
fviz_nbclust(player_stats_scaled, kmeans, method = "wss") +
  labs(title = "Elbow Method for Optimal K")

# Silhouette method
fviz_nbclust(player_stats_scaled, kmeans, method = "silhouette") +
  labs(title = "Silhouette Method for Optimal K")

# Fit k-means with k=5
set.seed(789)
kmeans_result <- kmeans(player_stats_scaled, centers = 5, nstart = 25)

# Add cluster assignments to data
player_stats <- player_stats %>%
  mutate(cluster = factor(kmeans_result$cluster))

# Visualize clusters (first 2 principal components)
fviz_cluster(kmeans_result, data = player_stats_scaled,
             geom = "point",
             ellipse.type = "convex",
             ggtheme = theme_minimal(),
             main = "Player Clusters")

# Cluster characteristics
cluster_summary <- player_stats %>%
  group_by(cluster) %>%
  summarise(
    n_players = n(),
    avg_HR = mean(HR_per_600),
    avg_BB_pct = mean(BB_pct),
    avg_K_pct = mean(K_pct),
    avg_EV = mean(avg_exit_velo),
    avg_speed = mean(sprint_speed)
  )

print(cluster_summary)

# Label clusters based on characteristics
cluster_labels <- case_when(
  cluster_summary$avg_HR > 25 ~ "Power Hitters",
  cluster_summary$avg_speed > 28 ~ "Speed Players",
  cluster_summary$avg_BB_pct > 10 ~ "Patient Hitters",
  cluster_summary$avg_K_pct < 20 ~ "Contact Hitters",
  TRUE ~ "Balanced Players"
)

Python Implementation

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Create synthetic player data
np.random.seed(456)
n_players = 200

player_stats = pd.DataFrame({
    'player_id': range(1, n_players + 1),
    'HR_per_600': np.random.normal(20, 8, n_players),
    'BB_pct': np.random.normal(9, 3, n_players),
    'K_pct': np.random.normal(22, 5, n_players),
    'avg_exit_velo': np.random.normal(89, 4, n_players),
    'sprint_speed': np.random.normal(27, 1.5, n_players)
})

# Standardize features
scaler = StandardScaler()
features = player_stats.drop('player_id', axis=1)
features_scaled = scaler.fit_transform(features)

# Determine optimal k using elbow method
inertias = []
silhouette_scores = []
K_range = range(2, 11)

from sklearn.metrics import silhouette_score

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(features_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(features_scaled, kmeans.labels_))

# Plot elbow curve and silhouette scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (Within-Cluster Sum of Squares)')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)

axes[1].plot(K_range, silhouette_scores, 'ro-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Method')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Fit k-means with k=5
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
player_stats['cluster'] = kmeans.fit_predict(features_scaled)

# Visualize with PCA
pca = PCA(n_components=2)
features_pca = pca.fit_transform(features_scaled)

plt.figure(figsize=(10, 8))
scatter = plt.scatter(features_pca[:, 0], features_pca[:, 1],
                     c=player_stats['cluster'], cmap='viridis',
                     s=50, alpha=0.6)
plt.colorbar(scatter, label='Cluster')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Player Clusters (PCA Visualization)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Cluster characteristics
cluster_summary = player_stats.groupby('cluster').agg({
    'player_id': 'count',
    'HR_per_600': 'mean',
    'BB_pct': 'mean',
    'K_pct': 'mean',
    'avg_exit_velo': 'mean',
    'sprint_speed': 'mean'
}).round(2)

cluster_summary.columns = ['n_players', 'avg_HR', 'avg_BB_pct',
                           'avg_K_pct', 'avg_EV', 'avg_speed']
print("Cluster Summary:")
print(cluster_summary)

# Heatmap of cluster characteristics
plt.figure(figsize=(10, 6))
sns.heatmap(cluster_summary.drop('n_players', axis=1).T,
            annot=True, fmt='.1f', cmap='RdYlGn', center=0,
            cbar_kws={'label': 'Value'})
plt.xlabel('Cluster')
plt.ylabel('Feature')
plt.title('Cluster Characteristics Heatmap')
plt.tight_layout()
plt.show()

9.7.2 Finding Player Comps

Once you have clusters, finding similar players is straightforward—look within the same cluster or calculate distances between players.

R Implementation

# Function to find similar players
find_player_comps <- function(player_idx, player_data, features_scaled, n_comps = 5) {
  # Calculate Euclidean distance from target player to all others
  target_features <- features_scaled[player_idx, ]

  distances <- apply(features_scaled, 1, function(x) {
    sqrt(sum((x - target_features)^2))
  })

  # Get indices of most similar players (excluding the player itself)
  similar_indices <- order(distances)[2:(n_comps + 1)]

  # Return similar players with distances
  player_data %>%
    slice(similar_indices) %>%
    mutate(distance = distances[similar_indices]) %>%
    arrange(distance)
}

# Example: Find comps for player 50
player_50_comps <- find_player_comps(50, player_stats, player_stats_scaled, n_comps = 5)

cat("Player 50 stats:\n")
player_stats %>% filter(player_id == 50) %>% print()

cat("\nMost similar players:\n")
print(player_50_comps)

Python Implementation

from sklearn.metrics.pairwise import euclidean_distances

def find_player_comps(player_idx, player_data, features_scaled, n_comps=5):
    """Find most similar players based on Euclidean distance."""

    # Calculate distances from target player to all others
    target_features = features_scaled[player_idx].reshape(1, -1)
    distances = euclidean_distances(target_features, features_scaled)[0]

    # Get indices of most similar players (excluding the player itself)
    similar_indices = np.argsort(distances)[1:n_comps+1]

    # Return similar players with distances
    comps = player_data.iloc[similar_indices].copy()
    comps['distance'] = distances[similar_indices]

    return comps.sort_values('distance')

# Example: Find comps for player 50
player_50_stats = player_stats.iloc[49]  # 0-indexed
player_50_comps = find_player_comps(49, player_stats, features_scaled, n_comps=5)

print("Player 50 stats:")
print(player_50_stats)
print("\nMost similar players:")
print(player_50_comps)

Player comparisons help scouts communicate ("he's like a young Freddie Freeman"), identify replacement candidates, and estimate market value based on similar players' contracts.
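The section above mentioned that you can also restrict comparisons to a player's own cluster. Here is a minimal sketch of that variant in Python, reusing player_stats and features_scaled from earlier (the find_cluster_comps helper is illustrative, not part of the pipeline built above):

def find_cluster_comps(player_idx, player_data, features_scaled, n_comps=5):
    """Like find_player_comps, but only considers players in the same cluster."""
    target_cluster = player_data.iloc[player_idx]['cluster']

    # Candidates in the same cluster, excluding the target player
    same_cluster = (player_data['cluster'] == target_cluster).to_numpy()
    same_cluster[player_idx] = False
    candidate_idx = np.where(same_cluster)[0]

    # Distances from the target player to the candidates only
    target = features_scaled[player_idx].reshape(1, -1)
    dists = euclidean_distances(target, features_scaled[candidate_idx])[0]

    order = np.argsort(dists)[:n_comps]
    comps = player_data.iloc[candidate_idx[order]].copy()
    comps['distance'] = dists[order]
    return comps

print(find_cluster_comps(49, player_stats, features_scaled, n_comps=5))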


9.8 Baseball-Specific Prediction Tasks

Let's apply our machine learning knowledge to three important baseball prediction problems: projecting player WAR, estimating in-game win probability, and calculating expected statistics.

9.8.1 WAR Projection (Marcel-Style)

The Marcel projection system, created by Tom Tango, is elegantly simple: weight the last three years (most recent weighted heaviest), regress toward league average based on playing time, and age adjust. We'll implement a simplified version.

R Implementation

library(tidyverse)

# Simplified Marcel projection function
marcel_projection <- function(stats_3yr, stats_2yr, stats_1yr, age, pa_3yr, pa_2yr, pa_1yr) {
  # Weights for recent years (5:4:3 ratio)
  w1 <- 5  # Most recent year
  w2 <- 4  # Two years ago
  w3 <- 3  # Three years ago

  # Weight recent years
  weighted_avg <- (stats_1yr * w1 + stats_2yr * w2 + stats_3yr * w3) / (w1 + w2 + w3)

  # Total PA for regression
  total_pa <- pa_1yr * w1 + pa_2yr * w2 + pa_3yr * w3

  # Regression to mean (more regression with fewer PA)
  # League average WAR per 600 PA ≈ 2.0
  league_avg_rate <- 2.0 / 600
  regression_pa <- 1200  # Amount of "fake" league average PA to add

  regressed <- (weighted_avg * total_pa + league_avg_rate * regression_pa) /
               (total_pa + regression_pa)

  # Age adjustment (multiplicative, roughly Marcel-style: ~0.3% per year
  # away from an assumed peak age of 27)
  age_multiplier <- 1 + (27 - age) * 0.003

  # Final projection (assuming 600 PA)
  projection <- regressed * age_multiplier * 600

  return(projection)
}

# Example: Project Mike Trout's WAR
# (Using hypothetical WAR-per-PA rates, matching the units the function expects)
trout_projection <- marcel_projection(
  stats_3yr = 6.7 / 490,  # 2021: 6.7 WAR in 490 PA
  stats_2yr = 6.3 / 426,  # 2022: 6.3 WAR in 426 PA
  stats_1yr = 5.0 / 455,  # 2023: 5.0 WAR in 455 PA
  age = 32,
  pa_3yr = 490,
  pa_2yr = 426,
  pa_1yr = 455
)

cat(sprintf("Mike Trout 2024 WAR Projection: %.1f\n", trout_projection))

# Vectorized version for multiple players
project_war_marcel <- function(player_data) {
  player_data %>%
    mutate(
      weighted_rate = (war_rate_1yr * 5 + war_rate_2yr * 4 + war_rate_3yr * 3) / 12,
      total_pa = pa_1yr * 5 + pa_2yr * 4 + pa_3yr * 3,
      regressed_rate = (weighted_rate * total_pa + (2/600) * 1200) / (total_pa + 1200),
      age_mult = 1 + (27 - age) * 0.003,
      projected_war = regressed_rate * age_mult * projected_pa
    )
}

Python Implementation

import pandas as pd
import numpy as np

def marcel_projection(stats_3yr, stats_2yr, stats_1yr, age,
                     pa_3yr, pa_2yr, pa_1yr, projected_pa=600):
    """
    Simplified Marcel projection for WAR.

    Parameters:
    - stats_*yr: WAR per PA for each year
    - age: Player's age in projection year
    - pa_*yr: Plate appearances in each year
    - projected_pa: PA to project for (default 600)

    Returns:
    - Projected WAR for upcoming season
    """
    # Weights (5:4:3 for most recent to oldest)
    w1, w2, w3 = 5, 4, 3

    # Weighted average
    weighted_avg = (stats_1yr * w1 + stats_2yr * w2 + stats_3yr * w3) / (w1 + w2 + w3)

    # Total weighted PA
    total_pa = pa_1yr * w1 + pa_2yr * w2 + pa_3yr * w3

    # Regression to mean
    league_avg_rate = 2.0 / 600
    regression_pa = 1200

    regressed = ((weighted_avg * total_pa + league_avg_rate * regression_pa) /
                 (total_pa + regression_pa))

    # Age adjustment (multiplicative, roughly Marcel-style: ~0.3% per year
    # away from an assumed peak age of 27)
    age_multiplier = 1 + (27 - age) * 0.003

    # Final projection
    projection = regressed * age_multiplier * projected_pa

    return projection

# Example: Mike Trout
trout_proj = marcel_projection(
    stats_3yr=6.7/490,  # 2021
    stats_2yr=6.3/426,  # 2022
    stats_1yr=5.0/455,  # 2023
    age=32,
    pa_3yr=490,
    pa_2yr=426,
    pa_1yr=455,
    projected_pa=600
)

print(f"Mike Trout 2024 WAR Projection: {trout_proj:.1f}")

# Vectorized version
def project_war_marcel_batch(df, projected_pa=600):
    """Project WAR for multiple players."""

    df = df.copy()

    # Weighted average
    df['weighted_rate'] = ((df['war_rate_1yr'] * 5 +
                           df['war_rate_2yr'] * 4 +
                           df['war_rate_3yr'] * 3) / 12)

    # Total weighted PA
    df['total_pa'] = df['pa_1yr'] * 5 + df['pa_2yr'] * 4 + df['pa_3yr'] * 3

    # Regression
    league_avg_rate = 2.0 / 600
    regression_pa = 1200
    df['regressed_rate'] = ((df['weighted_rate'] * df['total_pa'] +
                            league_avg_rate * regression_pa) /
                           (df['total_pa'] + regression_pa))

    # Age adjustment (multiplicative, as in the single-player function above)
    df['age_mult'] = 1 + (27 - df['age']) * 0.003

    # Projection
    df['projected_war'] = df['regressed_rate'] * df['age_mult'] * projected_pa

    return df

Real projection systems incorporate many more factors (batted ball data, platoon splits, park factors), but Marcel's simplicity and effectiveness make it a strong baseline.
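As a quick usage sketch with entirely made-up inputs, the batch version expects one row per player with the rate, plate-appearance, and age columns referenced above:

# Hypothetical three-player input (invented numbers, column names as assumed above)
players = pd.DataFrame({
    'player': ['A', 'B', 'C'],
    'war_rate_1yr': [4.5/600, 2.1/480, 0.8/350],
    'war_rate_2yr': [5.0/620, 1.5/510, 1.2/400],
    'war_rate_3yr': [3.8/590, 2.4/530, 0.5/300],
    'pa_1yr': [600, 480, 350],
    'pa_2yr': [620, 510, 400],
    'pa_3yr': [590, 530, 300],
    'age': [27, 31, 24]
})

projections = project_war_marcel_batch(players, projected_pa=600)
print(projections[['player', 'projected_war']].round(1))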

9.8.2 Win Probability Model

Win probability estimates the chance each team wins given the current game state: score, inning, outs, baserunners. Modern models use historical data to calculate these probabilities.

Simplified Win Probability Function

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Simulate historical game states and outcomes
np.random.seed(42)
n_states = 10000

def simulate_game_states():
    """Create synthetic game state data."""

    data = pd.DataFrame({
        'inning': np.random.randint(1, 10, n_states),
        'outs': np.random.randint(0, 3, n_states),
        'run_diff': np.random.randint(-5, 6, n_states),  # Home team perspective
        'on_1b': np.random.choice([0, 1], n_states),
        'on_2b': np.random.choice([0, 1], n_states),
        'on_3b': np.random.choice([0, 1], n_states)
    })

    # Win probability increases with run differential and later innings
    base_prob = 1 / (1 + np.exp(-data['run_diff']))
    inning_factor = (data['inning'] - 5) / 10  # Later innings = more certain

    # Adjust based on inning
    win_prob = base_prob + inning_factor * (base_prob - 0.5)
    win_prob = np.clip(win_prob, 0.01, 0.99)

    # Simulate outcomes
    data['home_won'] = np.random.binomial(1, win_prob)

    return data

# Create data
game_data = simulate_game_states()

# Features and target
X = game_data[['inning', 'outs', 'run_diff', 'on_1b', 'on_2b', 'on_3b']]
y = game_data['home_won']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train gradient boosting model
wp_model = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)

wp_model.fit(X_train, y_train)

# Function to get win probability for any game state
def get_win_probability(inning, outs, run_diff, on_1b=0, on_2b=0, on_3b=0):
    """Calculate win probability for current game state."""

    state = pd.DataFrame({
        'inning': [inning],
        'outs': [outs],
        'run_diff': [run_diff],
        'on_1b': [on_1b],
        'on_2b': [on_2b],
        'on_3b': [on_3b]
    })

    win_prob = wp_model.predict_proba(state)[0, 1]
    return win_prob

# Example scenarios
print("Win Probability Examples (Home Team):")
print(f"Bottom 9th, tied, bases loaded, 1 out: {get_win_probability(9, 1, 0, 1, 1, 1):.1%}")
print(f"Top 7th, down 2 runs, nobody on, 0 outs: {get_win_probability(7, 0, -2):.1%}")
print(f"Bottom 5th, up 3 runs, nobody on, 2 outs: {get_win_probability(5, 2, 3):.1%}")

Real win probability models use much larger datasets (every game state from decades of games) and incorporate team quality, pitcher/batter matchups, and situational factors.
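A common application of a win probability model is Win Probability Added (WPA): the change in win probability from the state before a plate appearance to the state after it. A minimal sketch using the toy get_win_probability function above, with an invented play:

# Hypothetical play: 8th inning, home team down 1, runner on first, 1 out;
# the batter doubles, leaving runners on second and third with 1 out
wp_before = get_win_probability(inning=8, outs=1, run_diff=-1, on_1b=1)
wp_after = get_win_probability(inning=8, outs=1, run_diff=-1, on_2b=1, on_3b=1)

wpa = wp_after - wp_before
print(f"WP before: {wp_before:.1%}, WP after: {wp_after:.1%}, WPA: {wpa:+.1%}")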

9.8.3 Expected Stats Model

Expected statistics (xBA, xwOBA) estimate what a player's outcomes should be based on batted ball quality, independent of defensive positioning and luck.

Simplified xBA Model

import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Simulate batted ball data
np.random.seed(42)
n_batted_balls = 5000

def simulate_batted_balls():
    """Create synthetic batted ball data."""

    data = pd.DataFrame({
        'exit_velocity': np.random.normal(89, 8, n_batted_balls),
        'launch_angle': np.random.normal(12, 18, n_batted_balls),
        'spray_angle': np.random.normal(0, 30, n_batted_balls)  # -45 to 45 degrees
    })

    # Hit probability based on launch conditions
    # Barrel zone: 95+ mph EV, 15-35° LA
    ev_factor = (data['exit_velocity'] - 70) / 30  # Normalize
    la_optimal = 25  # Optimal launch angle
    la_factor = 1 - (np.abs(data['launch_angle'] - la_optimal) / 50)

    hit_prob = 1 / (1 + np.exp(-(ev_factor + la_factor - 1)))
    hit_prob = np.clip(hit_prob, 0.05, 0.95)

    data['is_hit'] = np.random.binomial(1, hit_prob)

    return data

# Create data
batted_ball_data = simulate_batted_balls()

# Features and target
X = batted_ball_data[['exit_velocity', 'launch_angle', 'spray_angle']]
y = batted_ball_data['is_hit']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
xba_model = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    random_state=42
)

xba_model.fit(X_train, y_train)

# Function to calculate xBA for a player
def calculate_xba(player_batted_balls):
    """Calculate expected batting average for a player."""

    X_player = player_batted_balls[['exit_velocity', 'launch_angle', 'spray_angle']]
    hit_probs = xba_model.predict_proba(X_player)[:, 1]

    xba = hit_probs.mean()
    return xba

# Example: Calculate xBA for a subset of balls
player_sample = batted_ball_data.sample(100, random_state=42)
player_xba = calculate_xba(player_sample)
player_actual_ba = player_sample['is_hit'].mean()

print(f"Player Actual BA: {player_actual_ba:.3f}")
print(f"Player xBA: {player_xba:.3f}")
print(f"Difference (luck): {player_actual_ba - player_xba:+.3f}")

Expected stats help identify players who've been lucky or unlucky and predict regression to "true" performance levels.
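A short sketch of how that comparison is typically run, assigning the simulated batted balls to hypothetical players (the batter IDs here are invented purely for illustration):

# Tag each simulated batted ball with one of 20 hypothetical batters
bb = batted_ball_data.copy()
bb['batter_id'] = np.random.randint(1, 21, len(bb))
bb['xba_per_ball'] = xba_model.predict_proba(
    bb[['exit_velocity', 'launch_angle', 'spray_angle']]
)[:, 1]

# Compare actual BA to xBA for each hypothetical batter
player_luck = (
    bb.groupby('batter_id')
      .agg(ba=('is_hit', 'mean'), xba=('xba_per_ball', 'mean'), n=('is_hit', 'size'))
      .assign(ba_minus_xba=lambda d: d['ba'] - d['xba'])
      .sort_values('ba_minus_xba')
)

print("Largest underperformers (BA well below xBA):")
print(player_luck.head(3).round(3))
print("\nLargest overperformers (BA well above xBA):")
print(player_luck.tail(3).round(3))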


9.9 Model Evaluation

Evaluating model performance requires appropriate metrics for the task. Different metrics highlight different aspects of model quality.

9.9.1 Regression Metrics

For continuous outcomes (WAR, ERA, exit velocity), use these metrics:

RMSE (Root Mean Squared Error): Average prediction error, in the same units as the target. Lower is better. Penalizes large errors more heavily than small ones.

RMSE = √(Σ(actual - predicted)² / n)

MAE (Mean Absolute Error): Average absolute prediction error. Less sensitive to outliers than RMSE.

MAE = Σ|actual - predicted| / n

R² (R-squared): Proportion of variance explained (higher is better). An R² of 0.7 means the model explains 70% of the variance in the outcome. Its maximum is 1, and it can even turn negative on held-out data when the model predicts worse than simply using the mean.

R² = 1 - (Σ(actual - predicted)² / Σ(actual - mean)²)
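To make the formulas concrete, here is a minimal NumPy check on a tiny invented sample of four ERA values; the sklearn functions used below compute the same quantities:

import numpy as np

# Made-up actual and predicted ERAs
actual = np.array([3.50, 4.20, 2.95, 5.10])
predicted = np.array([3.80, 4.00, 3.40, 4.60])

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
mae = np.mean(np.abs(actual - predicted))
r2 = 1 - np.sum((actual - predicted) ** 2) / np.sum((actual - actual.mean()) ** 2)

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}, R²: {r2:.3f}")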

R Implementation

library(yardstick)
library(tidymodels)

# Using predictions from earlier model
regression_metrics <- test_predictions %>%
  metrics(truth = ERA_2023, estimate = .pred)

print(regression_metrics)

# Custom metric set
custom_metrics <- metric_set(rmse, mae, rsq, mape)

test_predictions %>%
  custom_metrics(truth = ERA_2023, estimate = .pred)

# Visualize prediction errors
test_predictions %>%
  ggplot(aes(x = ERA_2023, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Predicted vs Actual ERA",
    x = "Actual 2023 ERA",
    y = "Predicted 2023 ERA"
  ) +
  theme_minimal() +
  coord_equal()

Python Implementation

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import numpy as np

# Calculate metrics
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Regression Metrics:")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R²:   {r2:.3f}")

# Visualize predictions vs actuals
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()],
         [y_test.min(), y_test.max()],
         'r--', linewidth=2, label='Perfect Prediction')
plt.xlabel('Actual ERA')
plt.ylabel('Predicted ERA')
plt.title(f'Predicted vs Actual ERA\nRMSE: {rmse:.3f}, R²: {r2:.3f}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.tight_layout()
plt.show()

# Residual plot
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted ERA')
plt.ylabel('Residuals (Actual - Predicted)')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

9.9.2 Classification Metrics

For binary outcomes (hit/out, win/loss), classification metrics assess different aspects of performance:

Accuracy: Proportion of correct predictions. Simple but misleading with class imbalance.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: Of predicted positives, what proportion are truly positive? Important when false positives are costly.

Precision = TP / (TP + FP)

Recall (Sensitivity): Of actual positives, what proportion did we catch? Important when false negatives are costly.

Recall = TP / (TP + FN)

F1 Score: Harmonic mean of precision and recall. Balances both.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

AUC-ROC: Area under the ROC curve. Measures how well the model ranks positive cases above negative ones across all classification thresholds. A value of 0.5 corresponds to random guessing; 1.0 is a perfect ranking.
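
As a sanity check on these formulas, the sketch below computes accuracy, precision, recall, and F1 directly from confusion-matrix counts on a tiny, made-up set of hit/out predictions; AUC-ROC needs predicted probabilities rather than hard labels, so it is taken from scikit-learn:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = hit, 0 = out), hard predictions, and probabilities
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])
y_prob = np.array([0.81, 0.20, 0.55, 0.70, 0.10, 0.45, 0.30, 0.25])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
auc = roc_auc_score(y_true, y_prob)

print(f"Acc {accuracy:.2f}, Prec {precision:.2f}, Rec {recall:.2f}, F1 {f1:.2f}, AUC {auc:.2f}")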

R Implementation

library(yardstick)

# Classification metrics
class_metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)

test_predictions %>%
  class_metrics(truth = is_hit, estimate = .pred_class, .pred_hit)

# Confusion matrix
test_predictions %>%
  conf_mat(truth = is_hit, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

# ROC curve
test_predictions %>%
  roc_curve(truth = is_hit, .pred_hit) %>%
  autoplot() +
  labs(title = "ROC Curve: Hit/Out Classification")

# Precision-Recall curve (better for imbalanced classes)
test_predictions %>%
  pr_curve(truth = is_hit, .pred_hit) %>%
  autoplot()

Python Implementation

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                            f1_score, roc_auc_score, roc_curve,
                            precision_recall_curve, confusion_matrix)
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc_roc = roc_auc_score(y_test, y_pred_proba)

print("Classification Metrics:")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"AUC-ROC:   {auc_roc:.3f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Out', 'Hit'],
            yticklabels=['Out', 'Hit'])
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc_roc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Precision-Recall Curve
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Choose metrics based on your problem. For imbalanced classes (e.g., home runs vs. non-home runs), AUC-ROC and precision-recall curves are more informative than accuracy.
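
To see why, consider a toy illustration: if only about 4% of simulated batted balls are home runs, a "model" that never predicts a home run scores roughly 96% accuracy while providing no discrimination at all, which AUC makes obvious. This is a simulated sketch, not real Statcast rates:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
n = 10_000

# Simulated imbalanced outcome: roughly 4% home runs
y = rng.binomial(1, 0.04, n)

# "Model" that always predicts the majority class (no home run)
always_no_hr = np.zeros(n, dtype=int)
print(f"Accuracy of always-predict-no-HR: {accuracy_score(y, always_no_hr):.3f}")

# AUC evaluates ranking; a constant score cannot rank positives above negatives
constant_score = np.full(n, 0.04)
print(f"AUC of a constant score: {roc_auc_score(y, constant_score):.3f}")  # 0.5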

9.9.3 Feature Importance

Understanding which features drive predictions helps validate models and generate insights.

R Implementation

library(vip)

# Random forest feature importance
rf_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10) +
  labs(title = "Top 10 Most Important Features")

# XGBoost feature importance (multiple types)
xgb_fit %>%
  extract_fit_parsnip() %>%
  vip(num_features = 10, geom = "col") +
  labs(title = "XGBoost Feature Importance (Gain)")

Python Implementation

import matplotlib.pyplot as plt
import pandas as pd

# Random Forest feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_clf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

# XGBoost feature importance (multiple types)
from xgboost import plot_importance

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

plot_importance(xgb_clf, importance_type='weight', ax=axes[0])
axes[0].set_title('Feature Importance (Weight)')

plot_importance(xgb_clf, importance_type='gain', ax=axes[1])
axes[1].set_title('Feature Importance (Gain)')

plot_importance(xgb_clf, importance_type='cover', ax=axes[2])
axes[2].set_title('Feature Importance (Cover)')

plt.tight_layout()
plt.show()

Feature importance reveals what the model learns. If exit velocity dominates while launch angle contributes little, you've learned something about what drives hits in your data.
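
Impurity-based importances from tree ensembles can overstate correlated or high-cardinality features, so it is worth cross-checking them with permutation importance, which measures how much a chosen metric degrades when a feature's values are shuffled. A minimal sketch, assuming the xba_model, X_test, and y_test objects from the expected batting average example earlier in this chapter are still in memory:

from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in AUC; bigger drop = more important
perm = permutation_importance(
    xba_model, X_test, y_test,
    n_repeats=10, random_state=42, scoring='roc_auc'
)

for name, mean_imp, std_imp in zip(X_test.columns, perm.importances_mean, perm.importances_std):
    print(f"{name:15s} {mean_imp:.4f} +/- {std_imp:.4f}")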


9.10 Interactive Model Exploration

While static visualizations communicate model results effectively, interactive visualizations enable deeper exploration and understanding of predictive models. Interactive tools allow stakeholders to manipulate parameters, compare model performance across different metrics, and explore prediction confidence in real-time. This section demonstrates how to build interactive visualizations for model exploration using Plotly, which works seamlessly in both R and Python.

Interactive model exploration serves several purposes: it helps data scientists diagnose model behavior and identify weaknesses, enables non-technical stakeholders to understand model predictions and limitations, supports decision-making by allowing users to explore scenarios, and facilitates model comparison across different algorithms and hyperparameters. By making models interactive rather than presenting static results, you transform passive consumers of analytics into active explorers who develop intuition about model behavior.

9.10.1 Interactive ROC Curve Comparison

Receiver Operating Characteristic (ROC) curves visualize classification model performance across all possible decision thresholds. Comparing ROC curves for multiple models helps identify which algorithm performs best. An interactive ROC curve allows users to hover over points to see exact threshold values, zoom into specific regions of the curve, and toggle models on and off for clearer comparison.

R Implementation

library(tidymodels)
library(plotly)
library(tidyverse)

# Assume we have three fitted models: logistic, random forest, and XGBoost
# Each produces predictions on the same test set

# Create synthetic test predictions for demonstration
set.seed(123)
n_test <- 250

test_results <- tibble(
  truth = factor(sample(c("hit", "out"), n_test,
                       replace = TRUE, prob = c(0.3, 0.7))),

  # Logistic regression predictions (moderate performance)
  prob_logistic = plogis(rnorm(n_test,
                               mean = ifelse(truth == "hit", 0.5, -0.5),
                               sd = 1.2)),

  # Random forest predictions (better performance)
  prob_rf = plogis(rnorm(n_test,
                         mean = ifelse(truth == "hit", 1.0, -0.8),
                         sd = 1.0)),

  # XGBoost predictions (best performance)
  prob_xgb = plogis(rnorm(n_test,
                          mean = ifelse(truth == "hit", 1.2, -1.0),
                          sd = 0.9))
)

# Calculate ROC curves for each model
roc_logistic <- test_results %>%
  roc_curve(truth = truth, prob_logistic, event_level = "first") %>%
  mutate(model = "Logistic Regression")

roc_rf <- test_results %>%
  roc_curve(truth = truth, prob_rf, event_level = "first") %>%
  mutate(model = "Random Forest")

roc_xgb <- test_results %>%
  roc_curve(truth = truth, prob_xgb, event_level = "first") %>%
  mutate(model = "XGBoost")

# Combine ROC curves
roc_combined <- bind_rows(roc_logistic, roc_rf, roc_xgb)

# Calculate AUC for each model
auc_scores <- tibble(
  model = c("Logistic Regression", "Random Forest", "XGBoost"),
  auc = c(
    test_results %>% roc_auc(truth, prob_logistic, event_level = "first") %>% pull(.estimate),
    test_results %>% roc_auc(truth, prob_rf, event_level = "first") %>% pull(.estimate),
    test_results %>% roc_auc(truth, prob_xgb, event_level = "first") %>% pull(.estimate)
  )
) %>%
  mutate(label = sprintf("%s (AUC = %.3f)", model, auc))

# Join AUC labels to ROC data
roc_combined <- roc_combined %>%
  left_join(auc_scores, by = "model") %>%
  mutate(
    hover_text = sprintf(
      "Model: %s<br>Threshold: %.3f<br>Sensitivity: %.3f<br>Specificity: %.3f",
      model, .threshold, sensitivity, specificity
    )
  )

# Create interactive plot
roc_plot <- plot_ly() %>%
  add_lines(
    data = roc_combined %>% filter(model == "Logistic Regression"),
    x = ~(1 - specificity),
    y = ~sensitivity,
    name = ~unique(label),
    text = ~hover_text,
    hoverinfo = "text",
    line = list(color = "#E74C3C", width = 2.5)
  ) %>%
  add_lines(
    data = roc_combined %>% filter(model == "Random Forest"),
    x = ~(1 - specificity),
    y = ~sensitivity,
    name = ~unique(label),
    text = ~hover_text,
    hoverinfo = "text",
    line = list(color = "#3498DB", width = 2.5)
  ) %>%
  add_lines(
    data = roc_combined %>% filter(model == "XGBoost"),
    x = ~(1 - specificity),
    y = ~sensitivity,
    name = ~unique(label),
    text = ~hover_text,
    hoverinfo = "text",
    line = list(color = "#2ECC71", width = 2.5)
  ) %>%
  add_segments(
    x = 0, xend = 1, y = 0, yend = 1,
    line = list(color = "gray", dash = "dash", width = 1.5),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  layout(
    title = list(
      text = "ROC Curve Comparison: Hit/Out Prediction Models",
      font = list(size = 16, family = "Arial")
    ),
    xaxis = list(
      title = "False Positive Rate (1 - Specificity)",
      range = c(0, 1),
      gridcolor = "#E5E5E5"
    ),
    yaxis = list(
      title = "True Positive Rate (Sensitivity)",
      range = c(0, 1),
      gridcolor = "#E5E5E5"
    ),
    hovermode = "closest",
    legend = list(
      x = 0.6,
      y = 0.2,
      bgcolor = "rgba(255, 255, 255, 0.8)",
      bordercolor = "#CCCCCC",
      borderwidth = 1
    ),
    plot_bgcolor = "#FAFAFA",
    paper_bgcolor = "white"
  )

# Display the interactive plot
roc_plot

# To save as HTML file:
# htmlwidgets::saveWidget(roc_plot, "roc_comparison.html")

Python Implementation

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from sklearn.metrics import roc_curve, auc
from scipy.special import expit  # logistic function

# Create synthetic test predictions
np.random.seed(123)
n_test = 250

# Generate true labels
y_true = np.random.choice(['hit', 'out'], n_test, p=[0.3, 0.7])
y_true_binary = (y_true == 'hit').astype(int)

# Generate predictions for three models
# Logistic regression (moderate performance)
prob_logistic = expit(np.random.normal(
    loc=np.where(y_true_binary == 1, 0.5, -0.5),
    scale=1.2
))

# Random forest (better performance)
prob_rf = expit(np.random.normal(
    loc=np.where(y_true_binary == 1, 1.0, -0.8),
    scale=1.0
))

# XGBoost (best performance)
prob_xgb = expit(np.random.normal(
    loc=np.where(y_true_binary == 1, 1.2, -1.0),
    scale=0.9
))

# Calculate ROC curves and AUC scores
models = {
    'Logistic Regression': prob_logistic,
    'Random Forest': prob_rf,
    'XGBoost': prob_xgb
}

fig = go.Figure()

colors = {
    'Logistic Regression': '#E74C3C',
    'Random Forest': '#3498DB',
    'XGBoost': '#2ECC71'
}

for model_name, y_pred_prob in models.items():
    fpr, tpr, thresholds = roc_curve(y_true_binary, y_pred_prob)
    roc_auc = auc(fpr, tpr)

    # Create hover text
    hover_text = [
        f"Model: {model_name}<br>"
        f"Threshold: {thresh:.3f}<br>"
        f"Sensitivity: {tp:.3f}<br>"
        f"Specificity: {1-fp:.3f}"
        for thresh, tp, fp in zip(thresholds, tpr, fpr)
    ]

    fig.add_trace(go.Scatter(
        x=fpr,
        y=tpr,
        mode='lines',
        name=f'{model_name} (AUC = {roc_auc:.3f})',
        line=dict(color=colors[model_name], width=2.5),
        hovertext=hover_text,
        hoverinfo='text'
    ))

# Add diagonal reference line (random classifier)
fig.add_trace(go.Scatter(
    x=[0, 1],
    y=[0, 1],
    mode='lines',
    name='Random Classifier',
    line=dict(color='gray', dash='dash', width=1.5),
    showlegend=False,
    hoverinfo='skip'
))

# Update layout
fig.update_layout(
    title=dict(
        text='ROC Curve Comparison: Hit/Out Prediction Models',
        font=dict(size=16, family='Arial')
    ),
    xaxis=dict(
        title='False Positive Rate (1 - Specificity)',
        range=[0, 1],
        gridcolor='#E5E5E5'
    ),
    yaxis=dict(
        title='True Positive Rate (Sensitivity)',
        range=[0, 1],
        gridcolor='#E5E5E5'
    ),
    hovermode='closest',
    legend=dict(
        x=0.6,
        y=0.2,
        bgcolor='rgba(255, 255, 255, 0.8)',
        bordercolor='#CCCCCC',
        borderwidth=1
    ),
    plot_bgcolor='#FAFAFA',
    paper_bgcolor='white',
    width=800,
    height=600
)

# Display the interactive plot
fig.show()

# To save as HTML file:
# fig.write_html("roc_comparison.html")

Interpretation: The interactive ROC curve reveals several insights. XGBoost achieves the highest AUC (area under the curve), indicating superior overall performance across all thresholds. Random Forest outperforms logistic regression but falls short of XGBoost. Hovering over specific points shows the trade-off at each threshold: higher sensitivity comes at the cost of lower specificity. For baseball applications, you might prioritize sensitivity (catching all potential hits) or specificity (avoiding false alarms) depending on the decision context.
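
If a single operating threshold is needed rather than the full curve, one simple heuristic is Youden's J statistic, which selects the threshold maximizing sensitivity + specificity − 1. A short sketch, reusing the y_true_binary labels and the simulated prob_xgb probabilities from the Python example above:

import numpy as np
from sklearn.metrics import roc_curve

# Pick an operating threshold for the XGBoost-style probabilities above
fpr, tpr, thresholds = roc_curve(y_true_binary, prob_xgb)

youden_j = tpr - fpr                      # equals sensitivity + specificity - 1
best_idx = np.argmax(youden_j)

print(f"Best threshold by Youden's J: {thresholds[best_idx]:.3f}")
print(f"Sensitivity: {tpr[best_idx]:.3f}, Specificity: {1 - fpr[best_idx]:.3f}")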

9.10.2 Interactive Feature Importance Chart

Feature importance shows which variables drive model predictions. An interactive chart enables users to sort features by importance, filter to top N features, and see exact importance values. This is particularly valuable when models use dozens of features.

R Implementation

library(plotly)
library(tidyverse)

# Create sample feature importance data
# In practice, extract from fitted random forest or XGBoost model
feature_importance <- tibble(
  feature = c(
    "exit_velocity", "launch_angle", "barrel_percentage",
    "hard_hit_percentage", "sweet_spot_percentage", "sprint_speed",
    "strikeout_rate", "walk_rate", "avg_launch_angle", "max_exit_velo",
    "contact_percentage", "zone_contact", "chase_rate", "swing_rate",
    "whiff_rate"
  ),
  importance = c(
    0.285, 0.198, 0.142, 0.115, 0.089, 0.067,
    0.043, 0.028, 0.019, 0.015, 0.012, 0.009, 0.006, 0.004, 0.003
  ),
  category = c(
    "Contact Quality", "Contact Quality", "Contact Quality",
    "Contact Quality", "Contact Quality", "Speed",
    "Plate Discipline", "Plate Discipline", "Contact Quality", "Contact Quality",
    "Plate Discipline", "Plate Discipline", "Plate Discipline", "Plate Discipline",
    "Plate Discipline"
  )
) %>%
  arrange(desc(importance)) %>%
  mutate(
    importance_pct = importance * 100,
    cumulative_pct = cumsum(importance_pct),
    hover_text = sprintf(
      "<b>%s</b><br>Category: %s<br>Importance: %.1f%%<br>Cumulative: %.1f%%",
      feature, category, importance_pct, cumulative_pct
    )
  )

# Create color mapping for categories
category_colors <- c(
  "Contact Quality" = "#E74C3C",
  "Plate Discipline" = "#3498DB",
  "Speed" = "#2ECC71"
)

# Create interactive bar chart
importance_plot <- plot_ly(
  data = feature_importance,
  x = ~importance_pct,
  y = ~reorder(feature, importance_pct),
  type = "bar",
  orientation = "h",
  text = ~hover_text,
  hoverinfo = "text",
  marker = list(
    color = ~category_colors[category],
    line = list(color = "rgba(0, 0, 0, 0.3)", width = 1)
  )
) %>%
  layout(
    title = list(
      text = "Feature Importance: Hit Probability Model",
      font = list(size = 16, family = "Arial")
    ),
    xaxis = list(
      title = "Importance (%)",
      gridcolor = "#E5E5E5"
    ),
    yaxis = list(
      title = "",
      categoryorder = "total ascending"
    ),
    hovermode = "closest",
    plot_bgcolor = "#FAFAFA",
    paper_bgcolor = "white",
    showlegend = FALSE,
    margin = list(l = 150)
  ) %>%
  add_annotations(
    x = feature_importance$importance_pct + 1.5,
    y = reorder(feature_importance$feature, feature_importance$importance_pct),
    text = sprintf("%.1f%%", feature_importance$importance_pct),
    xanchor = "left",
    showarrow = FALSE,
    font = list(size = 10, color = "#333333")
  )

importance_plot

Python Implementation

import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Create sample feature importance data
feature_data = {
    'feature': [
        'exit_velocity', 'launch_angle', 'barrel_percentage',
        'hard_hit_percentage', 'sweet_spot_percentage', 'sprint_speed',
        'strikeout_rate', 'walk_rate', 'avg_launch_angle', 'max_exit_velo',
        'contact_percentage', 'zone_contact', 'chase_rate', 'swing_rate',
        'whiff_rate'
    ],
    'importance': [
        0.285, 0.198, 0.142, 0.115, 0.089, 0.067,
        0.043, 0.028, 0.019, 0.015, 0.012, 0.009, 0.006, 0.004, 0.003
    ],
    'category': [
        'Contact Quality', 'Contact Quality', 'Contact Quality',
        'Contact Quality', 'Contact Quality', 'Speed',
        'Plate Discipline', 'Plate Discipline', 'Contact Quality', 'Contact Quality',
        'Plate Discipline', 'Plate Discipline', 'Plate Discipline', 'Plate Discipline',
        'Plate Discipline'
    ]
}

df_importance = pd.DataFrame(feature_data)
df_importance = df_importance.sort_values('importance', ascending=False)
df_importance['importance_pct'] = df_importance['importance'] * 100
df_importance['cumulative_pct'] = df_importance['importance_pct'].cumsum()

# Create color mapping
category_colors = {
    'Contact Quality': '#E74C3C',
    'Plate Discipline': '#3498DB',
    'Speed': '#2ECC71'
}

df_importance['color'] = df_importance['category'].map(category_colors)

# Create hover text
df_importance['hover_text'] = df_importance.apply(
    lambda row: f"<b>{row['feature']}</b><br>"
                f"Category: {row['category']}<br>"
                f"Importance: {row['importance_pct']:.1f}%<br>"
                f"Cumulative: {row['cumulative_pct']:.1f}%",
    axis=1
)

# Sort for plotting
df_plot = df_importance.sort_values('importance_pct')

# Create interactive bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    x=df_plot['importance_pct'],
    y=df_plot['feature'],
    orientation='h',
    text=df_plot['importance_pct'].apply(lambda x: f'{x:.1f}%'),
    textposition='outside',
    textfont=dict(size=10, color='#333333'),
    hovertext=df_plot['hover_text'],
    hoverinfo='text',
    marker=dict(
        color=df_plot['color'],
        line=dict(color='rgba(0, 0, 0, 0.3)', width=1)
    ),
    showlegend=False
))

fig.update_layout(
    title=dict(
        text='Feature Importance: Hit Probability Model',
        font=dict(size=16, family='Arial')
    ),
    xaxis=dict(
        title='Importance (%)',
        gridcolor='#E5E5E5'
    ),
    yaxis=dict(
        title='',
        categoryorder='total ascending'
    ),
    hovermode='closest',
    plot_bgcolor='#FAFAFA',
    paper_bgcolor='white',
    width=900,
    height=600,
    margin=dict(l=150, r=100)
)

fig.show()

Interpretation: Exit velocity dominates feature importance at 28.5%, making it by far the most influential predictor of hit probability. Contact quality features (exit velocity, launch angle, barrel percentage, hard hit percentage, sweet spot percentage) collectively account for over 80% of total importance. Plate discipline metrics contribute relatively little to predicting individual batted ball outcomes, though they matter greatly for overall offensive performance. The cumulative importance curve shows that the top 5 features capture approximately 83% of model signal.
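
The cumulative percentages computed above also support a quick feature-screening rule: keep the smallest set of top features whose importances sum past a chosen cutoff. A minimal sketch, reusing the df_importance frame from the Python example and an arbitrary 90% cutoff:

# Number of top features needed to cover ~90% of total importance
cutoff = 90.0
n_keep = int((df_importance['cumulative_pct'] < cutoff).sum()) + 1
selected = df_importance['feature'].head(n_keep).tolist()

print(f"Top {n_keep} features cover ~{cutoff:.0f}% of importance:")
print(selected)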

9.10.3 Prediction Confidence Interval Visualization

Communicating prediction uncertainty is crucial for responsible model deployment. Confidence intervals show the range of plausible predictions, helping users understand when predictions are reliable versus uncertain. Interactive confidence interval plots allow users to explore uncertainty for individual players or situations.

R Implementation

library(plotly)
library(tidyverse)

# Generate sample player projections with uncertainty
set.seed(42)
n_players <- 20

player_projections <- tibble(
  player = sprintf("Player %02d", 1:n_players),
  projected_war = rnorm(n_players, mean = 3.5, sd = 1.5),
  standard_error = runif(n_players, min = 0.4, max = 1.2)
) %>%
  mutate(
    # Calculate 90% confidence intervals
    ci_lower = projected_war - (1.645 * standard_error),
    ci_upper = projected_war + (1.645 * standard_error),
    ci_width = ci_upper - ci_lower,

    # Color by projection tier
    tier = case_when(
      projected_war >= 5.0 ~ "Elite (5+ WAR)",
      projected_war >= 3.0 ~ "Above Average (3-5 WAR)",
      projected_war >= 1.5 ~ "Average (1.5-3 WAR)",
      TRUE ~ "Below Average (<1.5 WAR)"
    ),

    hover_text = sprintf(
      "<b>%s</b><br>Projected WAR: %.2f<br>90%% CI: [%.2f, %.2f]<br>Uncertainty: ±%.2f",
      player, projected_war, ci_lower, ci_upper, standard_error
    )
  ) %>%
  arrange(desc(projected_war))

# Define color scheme
tier_colors <- c(
  "Elite (5+ WAR)" = "#27AE60",
  "Above Average (3-5 WAR)" = "#3498DB",
  "Average (1.5-3 WAR)" = "#F39C12",
  "Below Average (<1.5 WAR)" = "#E74C3C"
)

# Create interactive confidence interval plot
ci_plot <- plot_ly() %>%

  # Add confidence intervals as error bars
  add_trace(
    data = player_projections,
    x = ~projected_war,
    y = ~reorder(player, projected_war),
    error_x = list(
      type = "data",
      symmetric = FALSE,
      array = ~ci_upper - projected_war,
      arrayminus = ~projected_war - ci_lower,
      color = "rgba(100, 100, 100, 0.4)",
      thickness = 2,
      width = 6
    ),
    type = "scatter",
    mode = "markers",
    marker = list(
      size = 10,
      color = ~tier_colors[tier],
      line = list(color = "white", width = 1)
    ),
    text = ~hover_text,
    hoverinfo = "text",
    showlegend = FALSE
  ) %>%

  # Add vertical reference lines for WAR tiers
  add_segments(
    x = 5.0, xend = 5.0, y = 0, yend = n_players + 1,
    line = list(color = "#27AE60", dash = "dot", width = 1),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  add_segments(
    x = 3.0, xend = 3.0, y = 0, yend = n_players + 1,
    line = list(color = "#3498DB", dash = "dot", width = 1),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%
  add_segments(
    x = 1.5, xend = 1.5, y = 0, yend = n_players + 1,
    line = list(color = "#F39C12", dash = "dot", width = 1),
    showlegend = FALSE,
    hoverinfo = "skip"
  ) %>%

  layout(
    title = list(
      text = "2025 WAR Projections with 90% Confidence Intervals",
      font = list(size = 16, family = "Arial")
    ),
    xaxis = list(
      title = "Projected WAR",
      gridcolor = "#E5E5E5",
      range = c(-1, 8)
    ),
    yaxis = list(
      title = "",
      categoryorder = "array",
      categoryarray = ~reorder(player_projections$player, player_projections$projected_war)
    ),
    hovermode = "closest",
    plot_bgcolor = "#FAFAFA",
    paper_bgcolor = "white",
    margin = list(l = 100)
  ) %>%

  # Add annotations for WAR tier thresholds
  add_annotations(
    x = c(5.0, 3.0, 1.5),
    y = rep(n_players + 0.5, 3),
    text = c("Elite", "Above Avg", "Average"),
    showarrow = FALSE,
    font = list(size = 9, color = "#666666"),
    yshift = 10
  )

ci_plot

Python Implementation

import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Generate sample player projections with uncertainty
np.random.seed(42)
n_players = 20

df_proj = pd.DataFrame({
    'player': [f'Player {i:02d}' for i in range(1, n_players + 1)],
    'projected_war': np.random.normal(3.5, 1.5, n_players),
    'standard_error': np.random.uniform(0.4, 1.2, n_players)
})

# Calculate confidence intervals
df_proj['ci_lower'] = df_proj['projected_war'] - (1.645 * df_proj['standard_error'])
df_proj['ci_upper'] = df_proj['projected_war'] + (1.645 * df_proj['standard_error'])
df_proj['ci_width'] = df_proj['ci_upper'] - df_proj['ci_lower']

# Assign tiers
def assign_tier(war):
    if war >= 5.0:
        return 'Elite (5+ WAR)'
    elif war >= 3.0:
        return 'Above Average (3-5 WAR)'
    elif war >= 1.5:
        return 'Average (1.5-3 WAR)'
    else:
        return 'Below Average (<1.5 WAR)'

df_proj['tier'] = df_proj['projected_war'].apply(assign_tier)

# Create hover text
df_proj['hover_text'] = df_proj.apply(
    lambda row: f"<b>{row['player']}</b><br>"
                f"Projected WAR: {row['projected_war']:.2f}<br>"
                f"90% CI: [{row['ci_lower']:.2f}, {row['ci_upper']:.2f}]<br>"
                f"Uncertainty: ±{row['standard_error']:.2f}",
    axis=1
)

# Sort by projected WAR
df_proj = df_proj.sort_values('projected_war', ascending=True)

# Define colors
tier_colors = {
    'Elite (5+ WAR)': '#27AE60',
    'Above Average (3-5 WAR)': '#3498DB',
    'Average (1.5-3 WAR)': '#F39C12',
    'Below Average (<1.5 WAR)': '#E74C3C'
}

df_proj['color'] = df_proj['tier'].map(tier_colors)

# Create figure
fig = go.Figure()

# Add confidence intervals with error bars
fig.add_trace(go.Scatter(
    x=df_proj['projected_war'],
    y=df_proj['player'],
    error_x=dict(
        type='data',
        symmetric=False,
        array=df_proj['ci_upper'] - df_proj['projected_war'],
        arrayminus=df_proj['projected_war'] - df_proj['ci_lower'],
        color='rgba(100, 100, 100, 0.4)',
        thickness=2,
        width=6
    ),
    mode='markers',
    marker=dict(
        size=10,
        color=df_proj['color'],
        line=dict(color='white', width=1)
    ),
    text=df_proj['hover_text'],
    hoverinfo='text',
    showlegend=False
))

# Add vertical reference lines for WAR tiers
for war_threshold, color in [(5.0, '#27AE60'), (3.0, '#3498DB'), (1.5, '#F39C12')]:
    fig.add_vline(
        x=war_threshold,
        line=dict(color=color, dash='dot', width=1),
        annotation_text='',
        annotation_position='top'
    )

# Update layout
fig.update_layout(
    title=dict(
        text='2025 WAR Projections with 90% Confidence Intervals',
        font=dict(size=16, family='Arial')
    ),
    xaxis=dict(
        title='Projected WAR',
        gridcolor='#E5E5E5',
        range=[-1, 8]
    ),
    yaxis=dict(
        title='',
        categoryorder='array',
        categoryarray=df_proj['player'].tolist()
    ),
    hovermode='closest',
    plot_bgcolor='#FAFAFA',
    paper_bgcolor='white',
    width=900,
    height=700,
    margin=dict(l=100)
)

# Add annotations for tier thresholds
for x, text in [(5.0, 'Elite'), (3.0, 'Above Avg'), (1.5, 'Average')]:
    fig.add_annotation(
        x=x,
        y=1.05,
        yref='paper',
        text=text,
        showarrow=False,
        font=dict(size=9, color='#666666')
    )

fig.show()

Interpretation: Confidence intervals reveal varying degrees of certainty in player projections. Players with narrow intervals (small error bars) have more predictable outcomes based on strong track records or consistent performance patterns. Wide intervals indicate high uncertainty, often occurring for young players with limited data, players returning from injury, or those who changed teams or roles. The visualization makes clear that the player with the highest point projection is not automatically the best target: a player projected slightly lower but with a much narrower interval may offer a more reliable return. Decision-makers can use these uncertainty ranges to evaluate risk-reward trade-offs: a high-risk, high-reward player with a wide confidence interval versus a safer bet with a narrower range.
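
One way to turn these intervals into a decision input is to convert each projection and its standard error into the probability of clearing a benchmark, assuming roughly normal projection errors. The sketch below reuses the df_proj frame from the Python example with an arbitrary 3-WAR benchmark:

from scipy.stats import norm

# P(actual WAR > 3.0) under a normal approximation around each projection
benchmark = 3.0
df_proj['p_above_3war'] = 1 - norm.cdf(
    benchmark, loc=df_proj['projected_war'], scale=df_proj['standard_error']
)

print(df_proj[['player', 'projected_war', 'standard_error', 'p_above_3war']]
      .sort_values('p_above_3war', ascending=False)
      .head(5)
      .to_string(index=False))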

Interactive model exploration transforms static model results into dynamic decision-support tools. ROC curve comparisons enable algorithm selection based on performance criteria. Feature importance charts communicate which factors drive predictions, building stakeholder trust in model logic. Confidence interval visualizations quantify prediction uncertainty, supporting risk-aware decision-making. Together, these interactive tools make predictive models more transparent, interpretable, and actionable for baseball operations personnel who may lack deep statistical training but need to make informed decisions based on model outputs.


9.11 Exercises

Exercise 1: Pitcher Performance Prediction

Build a model to predict pitcher ERA in 2024 using 2023 data:

a) Load pitcher statistics for 2023 (minimum 50 IP)
b) Create features: FIP, K/9, BB/9, HR/9, GB%, and age
c) Split into train/test (75/25)
d) Fit three models: linear regression, random forest, and XGBoost
e) Compare RMSE and R² across models
f) Identify which features are most important
g) For pitchers with ERA > 4.50, calculate how many the model correctly identified as above-average risk

Bonus: Add previous year (2022) metrics and see if multi-year data improves predictions.

Exercise 2: Hit Type Classification

Classify batted balls as ground balls, line drives, fly balls, or pop-ups using Statcast data:

a) Download Statcast data for a full month (any month in 2024)
b) Create labels based on launch angle:

  • Ground ball: < 10°
  • Line drive: 10-25°
  • Fly ball: 25-50°
  • Pop-up: > 50°

c) Use exit velocity, launch angle, and spray angle as features
d) Build a random forest classifier
e) Evaluate using accuracy and confusion matrix
f) Which hit types does the model confuse most often?
g) Plot decision boundaries in exit velocity vs. launch angle space

Bonus: Can you improve performance by adding pitcher handedness and batter handedness?

Exercise 3: Stolen Base Success Prediction

Predict whether stolen base attempts succeed:

a) Simulate or acquire stolen base attempt data with features:

  • Runner sprint speed
  • Pitcher time to plate
  • Catcher pop time
  • Lead distance
  • Base being stolen (2nd vs 3rd)

b) Build logistic regression and XGBoost classifiers
c) Compare performance using ROC-AUC
d) At what probability threshold would you recommend attempting a steal?
e) Calculate expected value: (success_prob × value_of_success) − (failure_prob × cost_of_caught_stealing)

Bonus: Incorporate game situation (score differential, inning, outs) and determine when stealing becomes more or less valuable.

Exercise 4: Player Clustering and Archetypes

Identify distinct player archetypes using clustering:

a) Gather batting statistics for all qualified hitters in 2024:

  • HR rate, BB%, K%, ISO, Sprint Speed, Hard Hit%, Barrel%

b) Standardize all features
c) Use elbow method and silhouette analysis to choose optimal K
d) Fit K-means clustering
e) Characterize each cluster with mean statistics
f) Assign descriptive names to clusters (e.g., "Power Sluggers", "Contact Specialists")
g) Build a function that takes a player name and returns their 5 most similar players
h) Visualize clusters using PCA or t-SNE

Bonus: Try hierarchical clustering and compare the resulting player groups. Do the same archetypes emerge?


This chapter introduced machine learning fundamentals and applied them to baseball prediction problems. You've learned regression for continuous outcomes, classification for categorical predictions, and clustering for finding patterns. You understand the machine learning workflow: split data, preprocess features, fit models, and evaluate rigorously. You can now build projection systems, win probability models, and expected statistics calculators.

The next chapter will explore advanced topics in baseball analytics, including survival analysis for injury prediction, causal inference for evaluating organizational decisions, and optimization methods for lineup construction and roster management. The machine learning foundation you've built here will support these more sophisticated analyses.

Chapter Summary

In this chapter, you learned about predictive modeling & machine learning. Key topics covered:

  • Introduction to Prediction in Baseball
  • Regression Fundamentals
  • Introduction to tidymodels (R) / scikit-learn (Python)
  • Classification Models
  • Advanced Models
  • Cross-Validation and Tuning