Chapter 17: Advanced Statistical Methods

Baseball analytics has evolved far beyond batting averages and ERAs. Modern analysts employ sophisticated statistical techniques borrowed from fields ranging from economics to epidemiology to machine learning. These advanced methods help us understand complex phenomena: How should we update beliefs about a rookie's true talent after limited data? How do players at different positions relate hierarchically? What causes changes in performance—aging, injury, or random variation?


17.1 Bayesian Methods

Bayesian statistics offers a fundamentally different approach to inference than the frequentist methods taught in most introductory statistics courses. Rather than treating parameters as fixed unknowns, Bayesian methods treat them as random variables with probability distributions that we update as we observe data.

The Bayesian Framework

Bayes' theorem, in its most general form, states:

P(θ|D) = P(D|θ) × P(θ) / P(D)

Where:


  • θ represents parameters (e.g., a player's true batting average)

  • D represents observed data (e.g., 30 hits in 100 at-bats)

  • P(θ|D) is the posterior: our belief about θ after seeing data

  • P(D|θ) is the likelihood: probability of observing the data given θ

  • P(θ) is the prior: our belief about θ before seeing data

  • P(D) is the marginal likelihood (often treated as a normalizing constant)

For baseball analytics, this translates naturally: When a rookie goes 3-for-10 in his first games, what's his true talent? The frequentist would say .300 (3/10). The Bayesian incorporates prior knowledge—most players hit around .250—and produces a more conservative estimate that blends the observed data with historical patterns.
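
A minimal sketch of that blend, assuming a hypothetical Beta(50, 150) prior (mean .250, with a strength of 200 pseudo-at-bats chosen purely for illustration):

# Hypothetical Beta(50, 150) prior: mean 50/200 = .250, strength 200 pseudo-AB
prior_alpha, prior_beta = 50, 150

hits, at_bats = 3, 10

# Conjugate update: add hits to alpha, outs to beta
post_alpha = prior_alpha + hits
post_beta = prior_beta + (at_bats - hits)

print(f"Frequentist estimate: {hits / at_bats:.3f}")
print(f"Bayesian estimate:    {post_alpha / (post_alpha + post_beta):.3f}")

With so little data, the Bayesian estimate barely moves off the prior mean; the conjugate Beta update used here is developed fully in the beta-binomial section below.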

Bayesian Updating for Player Performance

Consider a pitcher who allows 2 earned runs in 7 innings pitched. His ERA is 2.57. Should we believe he's genuinely a sub-3.00 ERA pitcher? Bayesian analysis helps us answer this by incorporating both the observed performance and our prior beliefs.

R Implementation:

library(tidyverse)

# Bayesian estimation of true ERA
bayesian_era_estimate <- function(observed_er, observed_ip,
                                  prior_mean = 4.00, prior_sd = 0.80) {
  # Convert to ERs per 9 innings
  observed_era <- (observed_er / observed_ip) * 9

  # Prior distribution: ERA ~ Normal(prior_mean, prior_sd^2)
  prior_precision <- 1 / prior_sd^2

  # Likelihood precision (uncertainty decreases with sample size)
  # We model ERA uncertainty as proportional to 1/sqrt(IP)
  likelihood_sd <- prior_sd / sqrt(observed_ip / 10)
  likelihood_precision <- 1 / likelihood_sd^2

  # Posterior calculation (conjugate normal model)
  posterior_precision <- prior_precision + likelihood_precision
  posterior_mean <- (prior_precision * prior_mean +
                    likelihood_precision * observed_era) / posterior_precision
  posterior_sd <- sqrt(1 / posterior_precision)

  return(list(
    observed_era = observed_era,
    posterior_mean = posterior_mean,
    posterior_sd = posterior_sd,
    credible_interval_95 = c(
      posterior_mean - 1.96 * posterior_sd,
      posterior_mean + 1.96 * posterior_sd
    )
  ))
}

# Example: Rookie pitcher with 7 IP, 2 ER
rookie_result <- bayesian_era_estimate(observed_er = 2, observed_ip = 7)

cat("Observed ERA:", round(rookie_result$observed_era, 2), "\n")
cat("Bayesian Estimate:", round(rookie_result$posterior_mean, 2), "\n")
cat("95% Credible Interval:",
    round(rookie_result$credible_interval_95, 2), "\n")

# Compare with veteran (100 IP, same 2.57 ERA)
# Note: 28.6 ER is fractional, chosen only so the observed ERA matches 2.57 exactly
veteran_result <- bayesian_era_estimate(observed_er = 28.6, observed_ip = 100)

cat("\nVeteran with same ERA over 100 IP:\n")
cat("Posterior Mean:", round(veteran_result$posterior_mean, 2), "\n")
cat("95% Credible Interval:",
    round(veteran_result$credible_interval_95, 2), "\n")

# Visualization
eras <- seq(1, 7, 0.01)
rookie_posterior <- dnorm(eras, rookie_result$posterior_mean,
                          rookie_result$posterior_sd)
veteran_posterior <- dnorm(eras, veteran_result$posterior_mean,
                           veteran_result$posterior_sd)

plot_data <- data.frame(
  ERA = rep(eras, 2),
  Density = c(rookie_posterior, veteran_posterior),
  Player = rep(c("Rookie (7 IP)", "Veteran (100 IP)"), each = length(eras))
)

ggplot(plot_data, aes(x = ERA, y = Density, color = Player)) +
  geom_line(size = 1.2) +
  geom_vline(xintercept = 2.57, linetype = "dashed", alpha = 0.5) +
  labs(title = "Bayesian ERA Estimates: Sample Size Matters",
       subtitle = "Both pitchers have observed ERA of 2.57",
       x = "True ERA",
       y = "Posterior Density",
       color = "Sample Size") +
  theme_minimal() +
  scale_color_manual(values = c("#E03A3E", "#4A90E2"))

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

def bayesian_era_estimate(observed_er, observed_ip,
                         prior_mean=4.00, prior_sd=0.80):
    """
    Bayesian estimation of pitcher's true ERA
    """
    # Convert to ERs per 9 innings
    observed_era = (observed_er / observed_ip) * 9

    # Prior distribution: ERA ~ Normal(prior_mean, prior_sd^2)
    prior_precision = 1 / prior_sd**2

    # Likelihood precision (uncertainty decreases with sample size)
    likelihood_sd = prior_sd / np.sqrt(observed_ip / 10)
    likelihood_precision = 1 / likelihood_sd**2

    # Posterior calculation (conjugate normal model)
    posterior_precision = prior_precision + likelihood_precision
    posterior_mean = (prior_precision * prior_mean +
                     likelihood_precision * observed_era) / posterior_precision
    posterior_sd = np.sqrt(1 / posterior_precision)

    return {
        'observed_era': observed_era,
        'posterior_mean': posterior_mean,
        'posterior_sd': posterior_sd,
        'credible_interval_95': (
            posterior_mean - 1.96 * posterior_sd,
            posterior_mean + 1.96 * posterior_sd
        )
    }

# Example: Rookie pitcher with 7 IP, 2 ER
rookie_result = bayesian_era_estimate(observed_er=2, observed_ip=7)

print(f"Observed ERA: {rookie_result['observed_era']:.2f}")
print(f"Bayesian Estimate: {rookie_result['posterior_mean']:.2f}")
print(f"95% Credible Interval: "
      f"({rookie_result['credible_interval_95'][0]:.2f}, "
      f"{rookie_result['credible_interval_95'][1]:.2f})")

# Compare with veteran (100 IP, same 2.57 ERA)
# Note: 28.6 ER is fractional, chosen only so the observed ERA matches 2.57 exactly
veteran_result = bayesian_era_estimate(observed_er=28.6, observed_ip=100)

print(f"\nVeteran with same ERA over 100 IP:")
print(f"Posterior Mean: {veteran_result['posterior_mean']:.2f}")
print(f"95% Credible Interval: "
      f"({veteran_result['credible_interval_95'][0]:.2f}, "
      f"{veteran_result['credible_interval_95'][1]:.2f})")

# Visualization
eras = np.linspace(1, 7, 600)
rookie_posterior = stats.norm.pdf(eras, rookie_result['posterior_mean'],
                                  rookie_result['posterior_sd'])
veteran_posterior = stats.norm.pdf(eras, veteran_result['posterior_mean'],
                                   veteran_result['posterior_sd'])

plt.figure(figsize=(10, 6))
plt.plot(eras, rookie_posterior, label='Rookie (7 IP)',
         color='#E03A3E', linewidth=2)
plt.plot(eras, veteran_posterior, label='Veteran (100 IP)',
         color='#4A90E2', linewidth=2)
plt.axvline(x=2.57, color='gray', linestyle='--', alpha=0.5,
            label='Observed ERA')
plt.xlabel('True ERA', fontsize=12)
plt.ylabel('Posterior Density', fontsize=12)
plt.title('Bayesian ERA Estimates: Sample Size Matters\n' +
          'Both pitchers have observed ERA of 2.57',
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

The key insight: With only 7 innings, we remain highly uncertain about the rookie's true talent. The Bayesian estimate pulls his ERA toward the league average (the prior mean), producing an estimate around 3.4. The veteran with 100 innings gets an estimate much closer to his observed 2.57 ERA (about 2.7) because we have more evidence.

Beta-Binomial Model for Batting Average

The beta-binomial model is particularly elegant for estimating batting averages. Batting can be modeled as a binomial process (hit or no-hit), and the beta distribution serves as a conjugate prior, making calculations straightforward.
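
With a Beta(α, β) prior and a player who records h hits in n at-bats, conjugacy gives the posterior in closed form, with no numerical integration required:

θ | data ~ Beta(α + h, β + (n − h))

This is exactly the update the implementations below perform.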

R Implementation:

# Beta-binomial model for batting average
estimate_batting_average <- function(hits, at_bats,
                                    prior_alpha = 82, prior_beta = 238) {
  # Prior parameters correspond to roughly .255 average
  # (alpha / (alpha + beta) = 82/320 ≈ 0.256)

  # Posterior parameters (conjugate update)
  posterior_alpha <- prior_alpha + hits
  posterior_beta <- prior_beta + (at_bats - hits)

  # Posterior mean
  posterior_mean <- posterior_alpha / (posterior_alpha + posterior_beta)

  # Credible interval
  ci_lower <- qbeta(0.025, posterior_alpha, posterior_beta)
  ci_upper <- qbeta(0.975, posterior_alpha, posterior_beta)

  return(list(
    observed_avg = hits / at_bats,
    posterior_mean = posterior_mean,
    credible_interval = c(ci_lower, ci_upper),
    alpha = posterior_alpha,
    beta = posterior_beta
  ))
}

# Example: Hot rookie start (15 hits in 40 AB = .375)
hot_start <- estimate_batting_average(hits = 15, at_bats = 40)

cat("Observed Average:", round(hot_start$observed_avg, 3), "\n")
cat("Bayesian Estimate:", round(hot_start$posterior_mean, 3), "\n")
cat("95% Credible Interval: [",
    round(hot_start$credible_interval[1], 3), ",",
    round(hot_start$credible_interval[2], 3), "]\n")

# Simulate the posterior distribution
set.seed(42)
posterior_samples <- rbeta(10000, hot_start$alpha, hot_start$beta)

# What's the probability this player is truly above .300?
prob_above_300 <- mean(posterior_samples > 0.300)
cat("Probability true average > .300:", round(prob_above_300, 3), "\n")

# Visualization
avg_range <- seq(0.15, 0.45, 0.001)
prior_density <- dbeta(avg_range, 82, 238)
posterior_density <- dbeta(avg_range, hot_start$alpha, hot_start$beta)

plot_data <- data.frame(
  Average = rep(avg_range, 2),
  Density = c(prior_density, posterior_density),
  Distribution = rep(c("Prior", "Posterior"), each = length(avg_range))
)

ggplot(plot_data, aes(x = Average, y = Density, color = Distribution)) +
  geom_line(size = 1.2) +
  geom_vline(xintercept = hot_start$observed_avg,
             linetype = "dashed", color = "black") +
  annotate("text", x = hot_start$observed_avg + 0.02, y = max(posterior_density) * 0.8,
           label = "Observed:\n.375", hjust = 0) +
  labs(title = "Bayesian Batting Average Estimation",
       subtitle = "Player goes 15-for-40 (.375) in first 40 AB",
       x = "Batting Average",
       y = "Density") +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.001)) +
  theme_minimal() +
  scale_color_manual(values = c("Prior" = "gray60", "Posterior" = "#E03A3E"))

Python Implementation:

from scipy.stats import beta

def estimate_batting_average(hits, at_bats, prior_alpha=82, prior_beta=238):
    """
    Beta-binomial model for batting average estimation
    """
    # Posterior parameters (conjugate update)
    posterior_alpha = prior_alpha + hits
    posterior_beta = prior_beta + (at_bats - hits)

    # Posterior mean
    posterior_mean = posterior_alpha / (posterior_alpha + posterior_beta)

    # Credible interval
    ci_lower = beta.ppf(0.025, posterior_alpha, posterior_beta)
    ci_upper = beta.ppf(0.975, posterior_alpha, posterior_beta)

    return {
        'observed_avg': hits / at_bats,
        'posterior_mean': posterior_mean,
        'credible_interval': (ci_lower, ci_upper),
        'alpha': posterior_alpha,
        'beta': posterior_beta
    }

# Example: Hot rookie start (15 hits in 40 AB = .375)
hot_start = estimate_batting_average(hits=15, at_bats=40)

print(f"Observed Average: {hot_start['observed_avg']:.3f}")
print(f"Bayesian Estimate: {hot_start['posterior_mean']:.3f}")
print(f"95% Credible Interval: "
      f"[{hot_start['credible_interval'][0]:.3f}, "
      f"{hot_start['credible_interval'][1]:.3f}]")

# Simulate the posterior distribution
np.random.seed(42)
posterior_samples = np.random.beta(hot_start['alpha'], hot_start['beta'], 10000)

# What's the probability this player is truly above .300?
prob_above_300 = np.mean(posterior_samples > 0.300)
print(f"Probability true average > .300: {prob_above_300:.3f}")

# Visualization
avg_range = np.linspace(0.15, 0.45, 1000)
prior_density = beta.pdf(avg_range, 82, 238)
posterior_density = beta.pdf(avg_range, hot_start['alpha'], hot_start['beta'])

plt.figure(figsize=(10, 6))
plt.plot(avg_range, prior_density, label='Prior',
         color='gray', linewidth=2)
plt.plot(avg_range, posterior_density, label='Posterior',
         color='#E03A3E', linewidth=2)
plt.axvline(x=hot_start['observed_avg'], color='black',
            linestyle='--', alpha=0.7)
plt.text(hot_start['observed_avg'] + 0.01, max(posterior_density) * 0.8,
         f"Observed:\n{hot_start['observed_avg']:.3f}", fontsize=10)
plt.xlabel('Batting Average', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.title('Bayesian Batting Average Estimation\n' +
          'Player goes 15-for-40 (.375) in first 40 AB',
          fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This analysis reveals that despite the .375 observed average, the player's true talent is most likely around .270, better than average but far from .375. There's only about a 10% chance his true talent exceeds .300. This demonstrates the value of Bayesian thinking: we temper our reaction to small samples by incorporating broader context.
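
To see how that tempering weakens as evidence accumulates, the short sketch below holds the observed average fixed at .375 under the same Beta(82, 238) prior while the number of at-bats grows:

# Posterior mean approaches the observed .375 as at-bats accumulate
prior_alpha, prior_beta = 82, 238

for at_bats in [40, 150, 600, 2400]:
    hits = 0.375 * at_bats  # fractional hits are fine for this illustration
    post_alpha = prior_alpha + hits
    post_beta = prior_beta + (at_bats - hits)
    post_mean = post_alpha / (post_alpha + post_beta)
    print(f"{at_bats:>5} AB at .375 -> posterior mean {post_mean:.3f}")

With 2,400 at-bats the estimate sits near .360; the prior never fully disappears, but it matters far less.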

17.2 Hierarchical Models

Hierarchical (or multilevel) models recognize that baseball data has natural groupings: players within teams, seasons within careers, plate appearances within games. These models allow us to share information across groups while still estimating individual effects.

The Hierarchical Structure

Consider estimating each team's true winning talent. A naive approach estimates each team independently. A hierarchical model recognizes that all teams come from the same population of MLB teams and should be pulled toward the overall league average, with the amount of pulling depending on sample size.

The hierarchy:


  1. League level: Overall distribution of team talent

  2. Team level: Each team's true talent drawn from the league distribution

  3. Game level: Observed wins and losses

Mathematical Framework:

Level 1 (Observations): Wins_i ~ Binomial(n_i, p_i)
Level 2 (Teams): p_i ~ Beta(α, β)
Level 3 (Hyperpriors): α, β ~ some prior distribution
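
The implementations below sidestep the Level 3 hyperpriors: α and β are fixed at 81 each, a prior centered at .500 with the strength of 162 pseudo-games. A fully Bayesian treatment would place priors on α and β and fit everything jointly (for example, in Stan or PyMC). A lighter-weight empirical-Bayes alternative, sketched here in Python with made-up inputs, estimates α and β from the spread of observed win percentages by the method of moments:

import numpy as np

def fit_beta_moments(win_pcts):
    """Method-of-moments estimate of Beta(alpha, beta) hyperparameters.
    An empirical-Bayes shortcut: it ignores binomial sampling noise,
    so it somewhat overstates the spread of true talent."""
    m = np.mean(win_pcts)
    v = np.var(win_pcts)
    # Beta mean m = a/(a+b); variance v = m(1-m)/(a+b+1)
    strength = m * (1 - m) / v - 1  # estimated a + b
    return m * strength, (1 - m) * strength

# Illustrative (made-up) win percentages
win_pcts = np.array([0.580, 0.540, 0.512, 0.494, 0.463, 0.420])
alpha_hat, beta_hat = fit_beta_moments(win_pcts)
print(f"alpha = {alpha_hat:.1f}, beta = {beta_hat:.1f}")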

Application: Estimating Team True Talent

We'll build a hierarchical model to estimate team true talent, accounting for the fact that teams vary but share common structure.

R Implementation:

library(tidyverse)

# Simulate some team data (in practice, use real data)
set.seed(42)
n_teams <- 30
games_played <- 162

# True talents vary around .500 with SD of about .050
true_talents <- rnorm(n_teams, mean = 0.500, sd = 0.050)
true_talents <- pmax(0.3, pmin(0.7, true_talents))  # Constrain to reasonable range

# Simulate season outcomes
team_data <- data.frame(
  team_id = 1:n_teams,
  games = games_played,
  wins = rbinom(n_teams, games_played, true_talents),
  true_talent = true_talents
)

team_data <- team_data %>%
  mutate(observed_pct = wins / games)

# Non-hierarchical estimates (just observed win pct)
team_data$non_hierarchical_est <- team_data$observed_pct

# Hierarchical Bayesian estimation using beta-binomial
hierarchical_estimate <- function(wins, games, global_alpha = 81, global_beta = 81) {
  # Beta(81, 81) prior: centered at .500 with the strength of 162 pseudo-games
  # Hierarchical shrinkage: posterior parameters
  posterior_alpha <- global_alpha + wins
  posterior_beta <- global_beta + (games - wins)

  # Posterior mean
  est <- posterior_alpha / (posterior_alpha + posterior_beta)
  return(est)
}

team_data$hierarchical_est <- hierarchical_estimate(
  team_data$wins,
  team_data$games
)

# Compare estimates
comparison <- team_data %>%
  select(team_id, true_talent, observed_pct,
         non_hierarchical_est, hierarchical_est) %>%
  mutate(
    non_hier_error = abs(non_hierarchical_est - true_talent),
    hier_error = abs(hierarchical_est - true_talent)
  )

cat("Mean Absolute Error:\n")
cat("Non-hierarchical:", mean(comparison$non_hier_error), "\n")
cat("Hierarchical:", mean(comparison$hier_error), "\n")

# Visualization: Shrinkage toward mean
ggplot(team_data, aes(x = observed_pct, y = hierarchical_est)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed",
              color = "gray50") +
  geom_point(aes(color = abs(observed_pct - 0.500)), size = 3) +
  scale_color_gradient(low = "#4A90E2", high = "#E03A3E",
                       name = "Distance from .500") +
  labs(title = "Hierarchical Shrinkage of Team Win Percentages",
       subtitle = "Extreme records shrink more toward league average",
       x = "Observed Win Percentage",
       y = "Hierarchical Estimate") +
  coord_fixed() +
  theme_minimal()

# Second visualization: comparing estimation errors
comparison_long <- comparison %>%
  select(team_id, non_hier_error, hier_error) %>%
  pivot_longer(cols = c(non_hier_error, hier_error),
               names_to = "Method",
               values_to = "Error") %>%
  mutate(Method = recode(Method,
                        non_hier_error = "Non-Hierarchical",
                        hier_error = "Hierarchical"))

ggplot(comparison_long, aes(x = Method, y = Error, fill = Method)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(width = 0.2, alpha = 0.3) +
  labs(title = "Hierarchical Models Reduce Estimation Error",
       subtitle = "Absolute error in estimating true team talent",
       y = "Absolute Error",
       x = "") +
  scale_fill_manual(values = c("Non-Hierarchical" = "gray60",
                               "Hierarchical" = "#4A90E2")) +
  theme_minimal() +
  theme(legend.position = "none")

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate team data
np.random.seed(42)
n_teams = 30
games_played = 162

# True talents vary around .500 with SD of about .050
true_talents = np.random.normal(0.500, 0.050, n_teams)
true_talents = np.clip(true_talents, 0.3, 0.7)

# Simulate season outcomes
wins = np.random.binomial(games_played, true_talents)

team_data = pd.DataFrame({
    'team_id': range(1, n_teams + 1),
    'games': games_played,
    'wins': wins,
    'true_talent': true_talents
})

team_data['observed_pct'] = team_data['wins'] / team_data['games']

def hierarchical_estimate(wins, games, global_alpha=81, global_beta=81):
    """
    Hierarchical Bayesian estimation using beta-binomial.
    The Beta(81, 81) prior is centered at .500 with the
    strength of 162 pseudo-games.
    """
    posterior_alpha = global_alpha + wins
    posterior_beta = global_beta + (games - wins)

    est = posterior_alpha / (posterior_alpha + posterior_beta)
    return est

team_data['non_hierarchical_est'] = team_data['observed_pct']
team_data['hierarchical_est'] = hierarchical_estimate(
    team_data['wins'].values,
    team_data['games'].values
)

# Compare estimates
team_data['non_hier_error'] = abs(
    team_data['non_hierarchical_est'] - team_data['true_talent']
)
team_data['hier_error'] = abs(
    team_data['hierarchical_est'] - team_data['true_talent']
)

print("Mean Absolute Error:")
print(f"Non-hierarchical: {team_data['non_hier_error'].mean():.5f}")
print(f"Hierarchical: {team_data['hier_error'].mean():.5f}")

# Visualization: Shrinkage toward mean
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Shrinkage
scatter = axes[0].scatter(team_data['observed_pct'],
                         team_data['hierarchical_est'],
                         c=abs(team_data['observed_pct'] - 0.500),
                         cmap='coolwarm', s=100, alpha=0.7)
axes[0].plot([0.3, 0.7], [0.3, 0.7], 'k--', alpha=0.5, label='No shrinkage')
axes[0].set_xlabel('Observed Win Percentage', fontsize=12)
axes[0].set_ylabel('Hierarchical Estimate', fontsize=12)
axes[0].set_title('Hierarchical Shrinkage of Team Win Percentages\n' +
                  'Extreme records shrink more toward league average',
                  fontsize=13, fontweight='bold')
axes[0].set_aspect('equal')
plt.colorbar(scatter, ax=axes[0], label='Distance from .500')
axes[0].grid(alpha=0.3)

# Plot 2: Error comparison
error_data = pd.DataFrame({
    'Non-Hierarchical': team_data['non_hier_error'],
    'Hierarchical': team_data['hier_error']
})

bp = axes[1].boxplot([error_data['Non-Hierarchical'],
                       error_data['Hierarchical']],
                      labels=['Non-Hierarchical', 'Hierarchical'],
                      patch_artist=True)
bp['boxes'][0].set_facecolor('gray')
bp['boxes'][1].set_facecolor('#4A90E2')

axes[1].set_ylabel('Absolute Error', fontsize=12)
axes[1].set_title('Hierarchical Models Reduce Estimation Error\n' +
                  'Absolute error in estimating true team talent',
                  fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

The hierarchical model reduces estimation error by recognizing that extreme performances (very high or low win percentages) likely include luck. The model shrinks estimates toward the league mean, with more shrinkage for more extreme observations.

Player-Within-Team Hierarchical Model

A more complex application: estimating player performance while accounting for team effects. Players on good teams may benefit from better lineups; accounting for team quality improves player estimates.

R Implementation:

# Simplified player-within-team model
library(lme4)

# Simulate player data
set.seed(123)
n_teams <- 30
players_per_team <- 10

# Create team effects
team_quality <- rnorm(n_teams, mean = 0, sd = 0.3)

# Create player data
player_data <- expand.grid(
  team_id = 1:n_teams,
  player_num = 1:players_per_team
) %>%
  mutate(
    player_id = row_number(),
    team_effect = team_quality[team_id],
    # Player true talent
    player_talent = rnorm(n(), mean = 0, sd = 0.5),
    # Observed performance = team + player + noise
    true_ops = 0.750 + team_effect + player_talent,
    # Add observation noise
    observed_ops = true_ops + rnorm(n(), mean = 0, sd = 0.050)
  )

# Fit hierarchical model: random intercept for each team
model <- lmer(observed_ops ~ 1 + (1|team_id), data = player_data)

# Extract predictions
player_data$hierarchical_pred <- predict(model)

# Compare to non-hierarchical (just observed)
player_data$non_hier_pred <- player_data$observed_ops

# Calculate errors
player_data <- player_data %>%
  mutate(
    hier_error = abs(hierarchical_pred - true_ops),
    non_hier_error = abs(non_hier_pred - true_ops)
  )

cat("Mean Absolute Error:\n")
cat("Non-hierarchical:", mean(player_data$non_hier_error), "\n")
cat("Hierarchical:", mean(player_data$hier_error), "\n")

# Examine team effects
team_effects <- ranef(model)$team_id
team_effects$team_id <- 1:n_teams
colnames(team_effects)[1] <- "estimated_effect"
team_effects$true_effect <- team_quality

ggplot(team_effects, aes(x = true_effect, y = estimated_effect)) +
  geom_point(size = 3, alpha = 0.7, color = "#4A90E2") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Hierarchical Model Recovers Team Effects",
       x = "True Team Effect",
       y = "Estimated Team Effect") +
  theme_minimal()

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.regression.mixed_linear_model import MixedLM

# Simulate player data
np.random.seed(123)
n_teams = 30
players_per_team = 10

# Create team effects
team_quality = np.random.normal(0, 0.3, n_teams)

# Create player data
teams = np.repeat(range(n_teams), players_per_team)
players = np.tile(range(players_per_team), n_teams)

player_data = pd.DataFrame({
    'team_id': teams,
    'player_num': players,
    'player_id': range(n_teams * players_per_team)
})

player_data['team_effect'] = team_quality[player_data['team_id']]
player_data['player_talent'] = np.random.normal(0, 0.5, len(player_data))
player_data['true_ops'] = (0.750 + player_data['team_effect'] +
                            player_data['player_talent'])
player_data['observed_ops'] = (player_data['true_ops'] +
                                np.random.normal(0, 0.050, len(player_data)))

# Fit hierarchical model
model = MixedLM.from_formula('observed_ops ~ 1',
                             groups=player_data['team_id'],
                             data=player_data)
result = model.fit()

# Extract predictions
player_data['hierarchical_pred'] = result.fittedvalues
player_data['non_hier_pred'] = player_data['observed_ops']

# Calculate errors
player_data['hier_error'] = abs(
    player_data['hierarchical_pred'] - player_data['true_ops']
)
player_data['non_hier_error'] = abs(
    player_data['non_hier_pred'] - player_data['true_ops']
)

print("Mean Absolute Error:")
print(f"Non-hierarchical: {player_data['non_hier_error'].mean():.5f}")
print(f"Hierarchical: {player_data['hier_error'].mean():.5f}")

# Extract and visualize team effects
team_effects = result.random_effects
estimated_effects = [team_effects[i]['Group'] for i in range(n_teams)]

plt.figure(figsize=(8, 6))
plt.scatter(team_quality, estimated_effects, s=80, alpha=0.7, color='#4A90E2')
plt.plot([-1, 1], [-1, 1], 'k--', alpha=0.5)
plt.xlabel('True Team Effect', fontsize=12)
plt.ylabel('Estimated Team Effect', fontsize=12)
plt.title('Hierarchical Model Recovers Team Effects',
          fontsize=14, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Hierarchical models are particularly valuable when sample sizes vary across groups. A rookie with 50 plate appearances gets more shrinkage toward his team's average than a veteran with 500 PAs.
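
The amount of shrinkage has a simple closed form in the beta-binomial setup: the posterior mean is a weighted average of the observed rate and the prior mean, with weight n / (n + α + β) on the data. A quick sketch (the prior strength of 300 pseudo-observations is an assumption for illustration):

# posterior_mean = w * observed_rate + (1 - w) * prior_mean,
# where w = n / (n + alpha + beta)
prior_strength = 300  # assumed alpha + beta of the prior

for n in [50, 500]:
    w = n / (n + prior_strength)
    print(f"{n:>3} PA: weight on data = {w:.2f}, weight on prior = {1 - w:.2f}")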

17.3 Time Series Analysis

Baseball generates temporal data: game-by-game performance, pitch-by-pitch velocity changes, season-to-season aging curves. Time series methods help us model trends, detect changepoints, and forecast future performance.

Components of Time Series

Any time series can be decomposed into the following components (a short sketch of such a decomposition follows the list):


  • Trend: Long-term increase or decrease

  • Seasonality: Regular patterns (within-season hot/cold streaks)

  • Cycles: Longer irregular patterns

  • Noise: Random variation
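
Here is a minimal decomposition sketch on simulated game-by-game data; the trend slope, the 20-game hot/cold rhythm, and the noise level are all assumed values chosen for illustration:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Simulated game-level wOBA: a slow decline (trend), a recurring ~20-game
# hot/cold rhythm (a stand-in for within-season "seasonality"), plus noise
rng = np.random.default_rng(0)
games = np.arange(162)
woba = (0.340 - 0.0002 * games                    # trend
        + 0.015 * np.sin(2 * np.pi * games / 20)  # periodic component
        + rng.normal(0, 0.010, size=162))         # noise

# Additive decomposition with an assumed period of 20 games
result = seasonal_decompose(pd.Series(woba), model='additive', period=20)
print(result.trend.dropna().round(3).head())
print(result.seasonal.round(3).head())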

Modeling Pitcher Velocity Decline

Pitcher fastball velocity typically declines over a career. Let's model this using time series methods.

R Implementation:

library(tidyverse)
library(mgcv)  # For generalized additive models

# Simulate pitcher velocity data over career
set.seed(456)
ages <- seq(22, 35, by = 0.5)
n_obs <- length(ages)

# True velocity follows aging curve
true_velocity <- 95 - 0.3 * (ages - 22) - 0.05 * (ages - 22)^2

# Add random variation and measurement error
observed_velocity <- true_velocity + rnorm(n_obs, 0, 0.8)

pitcher_data <- data.frame(
  age = ages,
  velocity = observed_velocity,
  season = floor(ages - 22) + 1
)

# Fit various time series models

# 1. Simple linear trend
linear_model <- lm(velocity ~ age, data = pitcher_data)

# 2. Quadratic trend (captures acceleration of decline)
quad_model <- lm(velocity ~ age + I(age^2), data = pitcher_data)

# 3. Smoothing spline (flexible non-parametric trend)
gam_model <- gam(velocity ~ s(age), data = pitcher_data)

# 4. LOESS smoother
loess_model <- loess(velocity ~ age, data = pitcher_data, span = 0.3)

# Generate predictions
pitcher_data$linear_pred <- predict(linear_model)
pitcher_data$quad_pred <- predict(quad_model)
pitcher_data$gam_pred <- predict(gam_model)
pitcher_data$loess_pred <- predict(loess_model)

# Visualize all models
ggplot(pitcher_data, aes(x = age)) +
  geom_point(aes(y = velocity), alpha = 0.6, size = 2) +
  geom_line(aes(y = linear_pred, color = "Linear"), size = 1) +
  geom_line(aes(y = quad_pred, color = "Quadratic"), size = 1) +
  geom_line(aes(y = gam_pred, color = "GAM"), size = 1) +
  geom_line(aes(y = loess_pred, color = "LOESS"), size = 1) +
  scale_color_manual(values = c("Linear" = "blue",
                               "Quadratic" = "red",
                               "GAM" = "green",
                               "LOESS" = "purple")) +
  labs(title = "Modeling Pitcher Velocity Decline Over Career",
       subtitle = "Comparing different trend estimation methods",
       x = "Age",
       y = "Fastball Velocity (mph)",
       color = "Model") +
  theme_minimal()

# Forecast future velocity
future_ages <- data.frame(age = seq(35.5, 38, by = 0.5))
future_pred <- predict(quad_model, newdata = future_ages,
                      interval = "prediction", level = 0.95)

forecast_data <- cbind(future_ages, future_pred)

ggplot() +
  geom_point(data = pitcher_data, aes(x = age, y = velocity),
             alpha = 0.6, size = 2) +
  geom_line(data = pitcher_data, aes(x = age, y = quad_pred),
            color = "red", size = 1) +
  geom_line(data = forecast_data, aes(x = age, y = fit),
            color = "red", size = 1, linetype = "dashed") +
  geom_ribbon(data = forecast_data,
              aes(x = age, ymin = lwr, ymax = upr),
              alpha = 0.2, fill = "red") +
  geom_vline(xintercept = 35, linetype = "dotted", alpha = 0.5) +
  annotate("text", x = 36.5, y = 95,
           label = "Forecast\nRegion", hjust = 0.5) +
  labs(title = "Forecasting Future Velocity Decline",
       subtitle = "Quadratic model with 95% prediction intervals",
       x = "Age",
       y = "Fastball Velocity (mph)") +
  theme_minimal()

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import UnivariateSpline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Simulate pitcher velocity data
np.random.seed(456)
ages = np.arange(22, 35.5, 0.5)
n_obs = len(ages)

# True velocity follows aging curve
true_velocity = 95 - 0.3 * (ages - 22) - 0.05 * (ages - 22)**2
observed_velocity = true_velocity + np.random.normal(0, 0.8, n_obs)

pitcher_data = pd.DataFrame({
    'age': ages,
    'velocity': observed_velocity
})

# Fit various models

# 1. Linear model
linear_model = LinearRegression()
linear_model.fit(pitcher_data[['age']], pitcher_data['velocity'])
pitcher_data['linear_pred'] = linear_model.predict(pitcher_data[['age']])

# 2. Quadratic model
quad_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
quad_model.fit(pitcher_data[['age']], pitcher_data['velocity'])
pitcher_data['quad_pred'] = quad_model.predict(pitcher_data[['age']])

# 3. Smoothing spline
spline = UnivariateSpline(pitcher_data['age'], pitcher_data['velocity'],
                          s=5, k=3)
pitcher_data['spline_pred'] = spline(pitcher_data['age'])

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Model comparison
axes[0].scatter(pitcher_data['age'], pitcher_data['velocity'],
                alpha=0.6, s=50, label='Observed')
axes[0].plot(pitcher_data['age'], pitcher_data['linear_pred'],
             label='Linear', linewidth=2)
axes[0].plot(pitcher_data['age'], pitcher_data['quad_pred'],
             label='Quadratic', linewidth=2)
axes[0].plot(pitcher_data['age'], pitcher_data['spline_pred'],
             label='Spline', linewidth=2)
axes[0].set_xlabel('Age', fontsize=12)
axes[0].set_ylabel('Fastball Velocity (mph)', fontsize=12)
axes[0].set_title('Modeling Pitcher Velocity Decline Over Career\n' +
                  'Comparing different trend estimation methods',
                  fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Forecast
future_ages = np.arange(35.5, 38.5, 0.5).reshape(-1, 1)
future_pred = quad_model.predict(future_ages)

# Calculate approximate prediction interval (simplified)
residuals = pitcher_data['velocity'] - pitcher_data['quad_pred']
pred_std = np.std(residuals)
future_lower = future_pred - 1.96 * pred_std
future_upper = future_pred + 1.96 * pred_std

axes[1].scatter(pitcher_data['age'], pitcher_data['velocity'],
                alpha=0.6, s=50, label='Observed')
axes[1].plot(pitcher_data['age'], pitcher_data['quad_pred'],
             color='red', linewidth=2, label='Fitted')
axes[1].plot(future_ages, future_pred, color='red',
             linewidth=2, linestyle='--', label='Forecast')
axes[1].fill_between(future_ages.flatten(), future_lower, future_upper,
                     alpha=0.2, color='red', label='95% PI')
axes[1].axvline(x=35, color='gray', linestyle=':', alpha=0.5)
axes[1].text(36.5, 95, 'Forecast\nRegion', ha='center', fontsize=10)
axes[1].set_xlabel('Age', fontsize=12)
axes[1].set_ylabel('Fastball Velocity (mph)', fontsize=12)
axes[1].set_title('Forecasting Future Velocity Decline\n' +
                  'Quadratic model with 95% prediction intervals',
                  fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

Detecting Changepoints in Performance

Sometimes player performance changes abruptly due to injury, mechanical adjustments, or other factors. Changepoint detection identifies when these shifts occur.

R Implementation:

library(changepoint)

# Simulate hitter with performance change mid-season
set.seed(789)
games_before <- 80
games_after <- 82

# Pre-change: .280 hitter (roughly 1 hit per 3.57 AB)
ba_before <- 0.280
ab_before <- rep(4, games_before)
hits_before <- rbinom(games_before, ab_before, ba_before)

# Post-change: .320 hitter (improvement after mechanical adjustment)
ba_after <- 0.320
ab_after <- rep(4, games_after)
hits_after <- rbinom(games_after, ab_after, ba_after)

# Combine
game_data <- data.frame(
  game = 1:(games_before + games_after),
  hits = c(hits_before, hits_after),
  at_bats = c(ab_before, ab_after)
) %>%
  mutate(game_ba = hits / at_bats)

# Calculate rolling average
window <- 15
game_data$rolling_avg <- zoo::rollmean(game_data$game_ba, k = window,
                                       fill = NA, align = "right")

# Changepoint detection
# Use PELT algorithm on rolling average
cp_result <- cpt.mean(na.omit(game_data$rolling_avg), method = "PELT")
changepoints <- cpts(cp_result)

# Shift the detected index to account for the rolling-window offset
detected_changepoint <- changepoints[1] + window - 1

# Visualization
ggplot(game_data, aes(x = game, y = game_ba)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_line(aes(y = rolling_avg), color = "#4A90E2", size = 1) +
  geom_vline(xintercept = games_before, linetype = "dashed",
             color = "red", size = 1, alpha = 0.7) +
  geom_vline(xintercept = detected_changepoint, linetype = "dotted",
             color = "darkred", size = 1) +
  annotate("text", x = games_before - 10, y = 0.45,
           label = "True\nChangepoint", hjust = 1, color = "red") +
  annotate("text", x = actual_changepoint + 10, y = 0.45,
           label = "Detected\nChangepoint", hjust = 0, color = "darkred") +
  labs(title = "Detecting Performance Changepoints",
       subtitle = paste0("Hitter improves from .280 to .320 at game ",
                        games_before),
       x = "Game Number",
       y = "Batting Average",
       caption = "Blue line shows 15-game rolling average") +
  theme_minimal()

# Statistical test: did performance change?
before_segment <- game_data$game_ba[1:games_before]
after_segment <- game_data$game_ba[(games_before+1):nrow(game_data)]

t_test <- t.test(after_segment, before_segment, alternative = "greater")
cat("\nT-test for performance improvement:\n")
cat("Before mean:", mean(before_segment), "\n")
cat("After mean:", mean(after_segment), "\n")
cat("P-value:", t_test$p.value, "\n")

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import ruptures as rpt  # Changepoint detection library

# Simulate hitter with performance change mid-season
np.random.seed(789)
games_before = 80
games_after = 82

# Pre-change: .280 hitter
ba_before = 0.280
ab_before = np.repeat(4, games_before)
hits_before = np.random.binomial(ab_before, ba_before)

# Post-change: .320 hitter
ba_after = 0.320
ab_after = np.repeat(4, games_after)
hits_after = np.random.binomial(ab_after, ba_after)

# Combine
game_data = pd.DataFrame({
    'game': range(1, games_before + games_after + 1),
    'hits': np.concatenate([hits_before, hits_after]),
    'at_bats': np.concatenate([ab_before, ab_after])
})

game_data['game_ba'] = game_data['hits'] / game_data['at_bats']

# Calculate rolling average
window = 15
game_data['rolling_avg'] = game_data['game_ba'].rolling(
    window=window, min_periods=1
).mean()

# Changepoint detection using ruptures
signal = game_data['rolling_avg'].bfill().values  # backfill any leading NaNs
algo = rpt.Pelt(model="rbf").fit(signal)
changepoints = algo.predict(pen=1)

# Get first changepoint
detected_cp = changepoints[0] if changepoints[0] < len(signal) else None

# Visualization
plt.figure(figsize=(12, 6))
plt.scatter(game_data['game'], game_data['game_ba'],
           alpha=0.3, s=20, label='Game BA')
plt.plot(game_data['game'], game_data['rolling_avg'],
        color='#4A90E2', linewidth=2, label='15-game rolling avg')
plt.axvline(x=games_before, color='red', linestyle='--',
           linewidth=2, alpha=0.7, label='True changepoint')
if detected_cp:
    plt.axvline(x=detected_cp, color='darkred', linestyle=':',
               linewidth=2, label='Detected changepoint')

plt.xlabel('Game Number', fontsize=12)
plt.ylabel('Batting Average', fontsize=12)
plt.title(f'Detecting Performance Changepoints\n' +
         f'Hitter improves from .280 to .320 at game {games_before}',
         fontsize=14, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Statistical test (one-sided, matching the R version: did performance improve?)
before_segment = game_data['game_ba'][:games_before]
after_segment = game_data['game_ba'][games_before:]

t_stat, p_value = stats.ttest_ind(after_segment, before_segment,
                                  alternative='greater')

print("\nT-test for performance improvement:")
print(f"Before mean: {before_segment.mean():.3f}")
print(f"After mean: {after_segment.mean():.3f}")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")

Changepoint detection is valuable for identifying when pitchers lose velocity due to injury, when hitters make productive swing changes, or when teams change strategic approaches.
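
As a minimal sketch of the injury-detection use case, the snippet below runs the same ruptures PELT routine on a simulated velocity series with an abrupt 2 mph drop. The series, the "l2" mean-shift cost, and the penalty value are illustrative assumptions, not real data or tuned settings.

import numpy as np
import ruptures as rpt

# Simulated fastball velocities: 60 starts near 95 mph, then a 2 mph
# drop (e.g., pitching through an injury) over the final 40 starts
np.random.seed(42)
velo = np.concatenate([
    np.random.normal(95.0, 0.6, 60),
    np.random.normal(93.0, 0.6, 40)
])

# PELT with an "l2" (mean-shift) cost; pen sets how much evidence is
# needed before declaring a new segment
algo = rpt.Pelt(model="l2").fit(velo)
breaks = algo.predict(pen=10)  # the last element is always len(velo)

print("Detected changepoint indices:", breaks[:-1])
for start, end in zip([0] + breaks[:-1], breaks):
    print(f"Starts {start + 1}-{end}: mean velocity "
          f"{velo[start:end].mean():.1f} mph")

The same pattern applies to real data: feed in per-start average fastball velocity as the signal and inspect where the detected segment means shift.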


17.4 Survival Analysis

Survival analysis, originally developed for medical research, studies time-until-event data. In baseball, we can apply it to career length, time until injury, or time until players reach certain milestones.

Kaplan-Meier Survival Curves

The Kaplan-Meier estimator produces survival curves showing the probability of "surviving" (continuing career, remaining healthy, etc.) past various time points.
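
Formally, with dᵢ events (careers ending) at time tᵢ and nᵢ players still at risk just before tᵢ, the Kaplan-Meier estimate of the survival function is the running product:

Ŝ(t) = ∏ (1 − dᵢ/nᵢ), taken over all event times tᵢ ≤ t

Censored observations (players still active) leave the risk set without counting as events, which is exactly how the implementations below treat active careers.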

R Implementation:

library(survival)
library(survminer)
library(tidyverse)

# Simulate player career data
set.seed(321)
n_players <- 200

# Different player types have different career trajectories
player_data <- data.frame(
  player_id = 1:n_players,
  player_type = sample(c("Star", "Regular", "Fringe"), n_players,
                      prob = c(0.15, 0.35, 0.50), replace = TRUE)
)

# Simulate career length based on type
# Stars: mean 12 years, regulars: mean 7 years, fringe: mean 3 years
career_lengths <- sapply(player_data$player_type, function(type) {
  if (type == "Star") {
    rweibull(1, shape = 3, scale = 12)
  } else if (type == "Regular") {
    rweibull(1, shape = 2, scale = 7)
  } else {
    rweibull(1, shape = 1.5, scale = 3)
  }
})

# Some careers are censored (player still active)
# More likely for recent debuts and better players
censored <- rbinom(n_players, 1, 0.2)
censored[player_data$player_type == "Star"] <- rbinom(
  sum(player_data$player_type == "Star"), 1, 0.4
)

player_data$career_years <- pmax(1, round(career_lengths))  # at least 1 year
player_data$active <- censored
player_data$career_years[player_data$active == 1] <-
  pmin(player_data$career_years[player_data$active == 1], 10)

# Create survival object
surv_obj <- Surv(time = player_data$career_years,
                 event = 1 - player_data$active)

# Fit Kaplan-Meier model by player type
km_fit <- survfit(surv_obj ~ player_type, data = player_data)

# Plot survival curves
ggsurvplot(km_fit,
          data = player_data,
          conf.int = TRUE,
          pval = TRUE,
          risk.table = TRUE,
          xlab = "Years in MLB",
          ylab = "Career Survival Probability",
          title = "MLB Career Length by Player Type",
          legend.title = "Player Type",
          legend.labs = c("Fringe", "Regular", "Star"),
          palette = c("#E03A3E", "#4A90E2", "#2ECC71"))

# Calculate median survival time by type
median_survival <- summary(km_fit)$table[, "median"]
cat("\nMedian Career Length by Player Type:\n")
print(median_survival)

# Cox proportional hazards model: estimate hazard ratios

cox_model <- coxph(surv_obj ~ player_type, data = player_data)
summary(cox_model)

# Visualization of hazard ratios
hazard_ratios <- exp(coef(cox_model))
conf_int <- exp(confint(cox_model))

hr_data <- data.frame(
  type = c("Regular vs Fringe", "Star vs Fringe"),
  hr = hazard_ratios,
  lower = conf_int[, 1],
  upper = conf_int[, 2]
)

ggplot(hr_data, aes(x = type, y = hr)) +
  geom_point(size = 4, color = "#4A90E2") +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, size = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(title = "Hazard Ratios for Career Ending",
       subtitle = "Lower values = longer careers (relative to fringe players)",
       x = "",
       y = "Hazard Ratio",
       caption = "95% confidence intervals shown") +
  theme_minimal()

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

# Simulate player career data
np.random.seed(321)
n_players = 200

player_types = np.random.choice(['Star', 'Regular', 'Fringe'],
                                n_players,
                                p=[0.15, 0.35, 0.50])

# Simulate career length based on type
career_lengths = []
for ptype in player_types:
    if ptype == 'Star':
        length = np.random.weibull(3) * 12
    elif ptype == 'Regular':
        length = np.random.weibull(2) * 7
    else:
        length = np.random.weibull(1.5) * 3
    career_lengths.append(max(1, round(length)))

# Some careers are censored (player still active)
censored = np.random.binomial(1, 0.2, n_players)
censored[player_types == 'Star'] = np.random.binomial(
    1, 0.4, sum(player_types == 'Star')
)

player_data = pd.DataFrame({
    'player_id': range(1, n_players + 1),
    'player_type': player_types,
    'career_years': career_lengths,
    'active': censored
})

player_data.loc[player_data['active'] == 1, 'career_years'] = \
    player_data.loc[player_data['active'] == 1, 'career_years'].clip(upper=10)

# Fit Kaplan-Meier by player type
fig, ax = plt.subplots(figsize=(10, 6))

kmf = KaplanMeierFitter()
colors = {'Fringe': '#E03A3E', 'Regular': '#4A90E2', 'Star': '#2ECC71'}

for ptype in ['Fringe', 'Regular', 'Star']:
    mask = player_data['player_type'] == ptype
    kmf.fit(player_data.loc[mask, 'career_years'],
            event_observed=1 - player_data.loc[mask, 'active'],
            label=ptype)
    kmf.plot_survival_function(ax=ax, ci_show=True, color=colors[ptype])

ax.set_xlabel('Years in MLB', fontsize=12)
ax.set_ylabel('Career Survival Probability', fontsize=12)
ax.set_title('MLB Career Length by Player Type',
             fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate median survival time
print("\nMedian Career Length by Player Type:")
for ptype in ['Fringe', 'Regular', 'Star']:
    mask = player_data['player_type'] == ptype
    kmf.fit(player_data.loc[mask, 'career_years'],
            event_observed=1 - player_data.loc[mask, 'active'])
    print(f"{ptype}: {kmf.median_survival_time_:.1f} years")

# Log-rank test for difference between groups
results = multivariate_logrank_test(
    player_data['career_years'],
    player_data['player_type'],
    1 - player_data['active']
)
print(f"\nLog-rank test p-value: {results.p_value:.4f}")

# Cox proportional hazards model
# The event of interest is a career ENDING, so invert the 'active' flag
player_data_cox = pd.get_dummies(player_data, columns=['player_type'],
                                  drop_first=True)
player_data_cox['event'] = 1 - player_data_cox['active']

cph = CoxPHFitter()
cph.fit(player_data_cox[['career_years', 'event',
                         'player_type_Regular', 'player_type_Star']],
        duration_col='career_years',
        event_col='event',
        show_progress=False)

print("\nCox Proportional Hazards Model:")
print(cph.summary)

# Plot hazard ratios
cph.plot()
plt.title('Hazard Ratios for Career Ending\n' +
         'Lower values = longer careers (relative to fringe players)',
         fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Survival analysis reveals that star players have dramatically longer careers than fringe players—not surprising, but quantifiable. The Cox model estimates hazard ratios: stars have approximately 60-70% lower risk of career ending in any given year compared to fringe players.
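
The Cox model behind those hazard ratios assumes each player's risk of career ending is a shared baseline hazard scaled by covariates:

h(t|x) = h₀(t) × exp(βx)

so exp(β) is the hazard ratio. A hazard ratio of 0.35 for stars, for example, would mean a star's career is 65% less likely to end in any given year than a fringe player's, at every career length.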

Application: Time to First Major League Home Run

We can also apply survival analysis to milestone achievement.

R Implementation:

# Simulate rookie hitter data
set.seed(654)
n_rookies <- 150

rookie_data <- data.frame(
  player_id = 1:n_rookies,
  power_rating = rnorm(n_rookies, mean = 100, sd = 15),
  pa_until_hr = NA,
  hit_hr = 0
)

# Simulate plate appearances until first HR (or censoring)
max_pa <- 400
for (i in 1:n_rookies) {
  # HR rate depends on power rating
  hr_prob <- plogis((rookie_data$power_rating[i] - 100) / 20) * 0.05

  for (pa in 1:max_pa) {
    if (runif(1) < hr_prob) {
      rookie_data$pa_until_hr[i] <- pa
      rookie_data$hit_hr[i] <- 1
      break
    }
  }

  # If no HR by max_pa, censor
  if (rookie_data$hit_hr[i] == 0) {
    rookie_data$pa_until_hr[i] <- max_pa
  }
}

# Create power categories
rookie_data$power_category <- cut(rookie_data$power_rating,
                                  breaks = c(-Inf, 90, 110, Inf),
                                  labels = c("Low", "Average", "High"))

# Fit survival model
surv_obj_hr <- Surv(time = rookie_data$pa_until_hr,
                    event = rookie_data$hit_hr)

km_fit_hr <- survfit(surv_obj_hr ~ power_category, data = rookie_data)

# Plot
ggsurvplot(km_fit_hr,
          data = rookie_data,
          conf.int = TRUE,
          xlab = "Plate Appearances",
          ylab = "Probability of No HR Yet",
          title = "Time to First Career Home Run by Power Rating",
          legend.title = "Power Rating",
          palette = c("#E03A3E", "#FFD700", "#2ECC71"))

Python Implementation:

# Simulate rookie hitter data
np.random.seed(654)
n_rookies = 150

power_ratings = np.random.normal(100, 15, n_rookies)
pa_until_hr = np.zeros(n_rookies)
hit_hr = np.zeros(n_rookies)

max_pa = 400
for i in range(n_rookies):
    # HR rate depends on power rating
    hr_prob = 1 / (1 + np.exp(-(power_ratings[i] - 100) / 20)) * 0.05

    for pa in range(1, max_pa + 1):
        if np.random.random() < hr_prob:
            pa_until_hr[i] = pa
            hit_hr[i] = 1
            break

    if hit_hr[i] == 0:
        pa_until_hr[i] = max_pa

rookie_data = pd.DataFrame({
    'player_id': range(1, n_rookies + 1),
    'power_rating': power_ratings,
    'pa_until_hr': pa_until_hr,
    'hit_hr': hit_hr
})

# Create power categories
rookie_data['power_category'] = pd.cut(
    rookie_data['power_rating'],
    bins=[-np.inf, 90, 110, np.inf],
    labels=['Low', 'Average', 'High']
)

# Fit and plot by power category
fig, ax = plt.subplots(figsize=(10, 6))

kmf = KaplanMeierFitter()
colors_power = {'Low': '#E03A3E', 'Average': '#FFD700', 'High': '#2ECC71'}

for category in ['Low', 'Average', 'High']:
    mask = rookie_data['power_category'] == category
    kmf.fit(rookie_data.loc[mask, 'pa_until_hr'],
            event_observed=rookie_data.loc[mask, 'hit_hr'],
            label=category)
    kmf.plot_survival_function(ax=ax, ci_show=True, color=colors_power[category])

ax.set_xlabel('Plate Appearances', fontsize=12)
ax.set_ylabel('Probability of No HR Yet', fontsize=12)
ax.set_title('Time to First Career Home Run by Power Rating',
             fontsize=14, fontweight='bold')
ax.legend(title='Power Rating')
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This analysis shows how quickly players with different power levels hit their first career home run—valuable for evaluating rookie power potential.
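
To put a single number on each curve, the median time-to-event can be read directly off the fitted Kaplan-Meier objects. A short continuation of the Python example above (reusing the kmf and rookie_data objects already defined):

# Median plate appearances until first career HR, by power group
for category in ['Low', 'Average', 'High']:
    mask = rookie_data['power_category'] == category
    kmf.fit(rookie_data.loc[mask, 'pa_until_hr'],
            event_observed=rookie_data.loc[mask, 'hit_hr'])
    print(f"{category}: median {kmf.median_survival_time_:.0f} PA to first HR")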


17.5 Causal Inference

Correlation doesn't imply causation—every analyst knows this. But how do we establish causation in baseball contexts where randomized experiments are rare or impossible? Causal inference provides methods for estimating causal effects from observational data.

The Problem of Confounding

Suppose we observe that players who participate in intensive offseason training programs improve more than those who don't. Does training cause improvement? Or do players who are already improving (perhaps due to youth or high motivation) self-select into training programs?

This is confounding: a third variable (motivation/youth) affects both treatment (training participation) and outcome (improvement), creating spurious correlation.
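
A minimal numeric sketch makes the bias concrete (Python; the numbers are hypothetical, and the true training effect is set to zero by construction):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: motivated players both train more often AND
# improve more, while the true causal effect of training is zero
motivated = rng.binomial(1, 0.5, n)
training = rng.binomial(1, np.where(motivated == 1, 0.8, 0.2))
improvement = 0.8 * motivated + rng.normal(0, 0.5, n)

df = pd.DataFrame({'motivated': motivated, 'training': training,
                   'improvement': improvement})

# Naive comparison mixes the training effect with the motivation effect
naive = df.groupby('training')['improvement'].mean().diff().iloc[-1]

# Comparing within motivation strata recovers the (zero) true effect
cell_means = df.groupby(['motivated', 'training'])['improvement'].mean()
within = cell_means.groupby(level='motivated').diff().dropna().mean()

print(f"Naive 'training effect':          {naive:+.3f}")  # spurious, ~+0.5
print(f"Within-stratum (adjusted) effect: {within:+.3f}")  # ~0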

Propensity Score Matching

One approach to causal inference is propensity score matching: match treated and control units that have similar probabilities of receiving treatment based on observed covariates.
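
Formally, the propensity score is the probability of treatment given observed covariates:

e(x) = P(treatment = 1 | X = x)

Matching units with similar e(x) balances the distribution of those covariates across the two groups, mimicking a randomized comparison, though only for confounders that were actually measured.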

R Implementation:

library(MatchIt)
library(tidyverse)

# Simulate player development data
set.seed(987)
n_players <- 300

player_dev <- data.frame(
  player_id = 1:n_players,
  age = sample(22:28, n_players, replace = TRUE),
  prior_war = rnorm(n_players, mean = 2.0, sd = 1.5),
  motivation = rnorm(n_players, mean = 0, sd = 1)
) %>%
  mutate(
    prior_war = pmax(0, prior_war),
    # Treatment assignment (offseason training) depends on age and motivation
    # Younger, more motivated players more likely to train
    training_prob = plogis(-0.5 * (age - 25) + 0.6 * motivation),
    training = rbinom(n(), 1, training_prob),
    # Outcome (WAR improvement) depends on age, prior performance, motivation,
    # AND training (true causal effect = +0.5 WAR)
    war_improvement = 1.5 - 0.2 * (age - 25) +
                      0.3 * motivation - 0.1 * prior_war +
                      0.5 * training +  # TRUE CAUSAL EFFECT
                      rnorm(n(), 0, 0.8)
  )

# Naive analysis: just compare training vs no training
naive_estimate <- player_dev %>%
  group_by(training) %>%
  summarize(mean_improvement = mean(war_improvement), .groups = "drop")

cat("Naive comparison:\n")
print(naive_estimate)
cat("Naive estimate of training effect:",
    diff(naive_estimate$mean_improvement), "\n\n")

# Propensity score matching
match_formula <- training ~ age + prior_war + motivation
match_result <- matchit(match_formula, data = player_dev,
                       method = "nearest", ratio = 1)

# Get matched data
matched_data <- match.data(match_result)

# Estimate treatment effect in matched sample
matched_estimate <- matched_data %>%
  group_by(training) %>%
  summarize(mean_improvement = mean(war_improvement), .groups = "drop")

cat("Matched comparison:\n")
print(matched_estimate)
cat("Matched estimate of training effect:",
    diff(matched_estimate$mean_improvement), "\n")
cat("True effect: 0.5 WAR\n\n")

# Visualize matching balance
plot(match_result, type = "jitter", interactive = FALSE)

# Check covariate balance before and after matching
summary(match_result)

# Estimate with regression adjustment on matched data
lm_matched <- lm(war_improvement ~ training + age + prior_war + motivation,
                data = matched_data)
cat("Regression on matched data:\n")
summary(lm_matched)

# Visualize propensity score overlap
pscore_data <- data.frame(
  pscore = match_result$distance,
  training = player_dev$training,
  matched = player_dev$player_id %in% matched_data$player_id
)

ggplot(pscore_data, aes(x = pscore, fill = factor(training))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  facet_wrap(~matched, labeller = labeller(matched = c(
    "FALSE" = "Unmatched Units",
    "TRUE" = "Matched Units"
  ))) +
  scale_fill_manual(values = c("0" = "#E03A3E", "1" = "#4A90E2"),
                   labels = c("No Training", "Training"),
                   name = "") +
  labs(title = "Propensity Score Distribution Before and After Matching",
       x = "Propensity Score (Probability of Receiving Training)",
       y = "Count") +
  theme_minimal()

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from scipy import stats

# Simulate player development data
np.random.seed(987)
n_players = 300

ages = np.random.choice(range(22, 29), n_players)
prior_war = np.maximum(0, np.random.normal(2.0, 1.5, n_players))
motivation = np.random.normal(0, 1, n_players)

# Treatment assignment
training_prob = 1 / (1 + np.exp(-(-0.5 * (ages - 25) + 0.6 * motivation)))
training = np.random.binomial(1, training_prob)

# Outcome
war_improvement = (1.5 - 0.2 * (ages - 25) + 0.3 * motivation -
                  0.1 * prior_war + 0.5 * training +
                  np.random.normal(0, 0.8, n_players))

player_dev = pd.DataFrame({
    'player_id': range(1, n_players + 1),
    'age': ages,
    'prior_war': prior_war,
    'motivation': motivation,
    'training': training,
    'war_improvement': war_improvement
})

# Naive analysis
naive_estimate = player_dev.groupby('training')['war_improvement'].mean()
print("Naive comparison:")
print(naive_estimate)
print(f"Naive estimate of training effect: {naive_estimate[1] - naive_estimate[0]:.3f}\n")

# Propensity score matching
X = player_dev[['age', 'prior_war', 'motivation']]
y = player_dev['training']

# Fit logistic regression to estimate propensity scores
ps_model = LogisticRegression(random_state=42)
ps_model.fit(X, y)
player_dev['pscore'] = ps_model.predict_proba(X)[:, 1]

# Match treated units to control units with nearest propensity scores
treated = player_dev[player_dev['training'] == 1]
control = player_dev[player_dev['training'] == 0]

# Use nearest neighbors matching
nn = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn.fit(control[['pscore']])

distances, indices = nn.kneighbors(treated[['pscore']])

# Create matched dataset (matching with replacement: a control can be the
# nearest neighbor of several treated players, so keep duplicated rows
# rather than collapsing them)
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control], ignore_index=True)

# Estimate treatment effect in matched sample
matched_estimate = matched_data.groupby('training')['war_improvement'].mean()
print("Matched comparison:")
print(matched_estimate)
print(f"Matched estimate of training effect: {matched_estimate[1] - matched_estimate[0]:.3f}")
print("True effect: 0.500\n")

# Visualize propensity scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before matching
axes[0].hist(player_dev[player_dev['training'] == 0]['pscore'],
            bins=30, alpha=0.6, label='No Training', color='#E03A3E')
axes[0].hist(player_dev[player_dev['training'] == 1]['pscore'],
            bins=30, alpha=0.6, label='Training', color='#4A90E2')
axes[0].set_xlabel('Propensity Score', fontsize=11)
axes[0].set_ylabel('Count', fontsize=11)
axes[0].set_title('Before Matching', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# After matching
axes[1].hist(matched_data[matched_data['training'] == 0]['pscore'],
            bins=30, alpha=0.6, label='No Training', color='#E03A3E')
axes[1].hist(matched_data[matched_data['training'] == 1]['pscore'],
            bins=30, alpha=0.6, label='Training', color='#4A90E2')
axes[1].set_xlabel('Propensity Score', fontsize=11)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].set_title('After Matching', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.suptitle('Propensity Score Distribution Before and After Matching',
            fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# T-test on matched sample
t_stat, p_value = stats.ttest_ind(
    matched_data[matched_data['training'] == 1]['war_improvement'],
    matched_data[matched_data['training'] == 0]['war_improvement']
)
print(f"\nT-test on matched sample:")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")

Propensity score matching yields an estimate (around 0.45-0.55 WAR) much closer to the true causal effect (0.50 WAR) than the naive comparison, which is biased upward because motivated players both train more often and improve more.

Difference-in-Differences

When we have data before and after a treatment for both treated and control groups, difference-in-differences (DiD) isolates the causal effect by comparing changes over time.
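
In its simplest two-group, two-period form, the estimator is:

DiD = (Ȳ_treated,after − Ȳ_treated,before) − (Ȳ_control,after − Ȳ_control,before)

the change in the treated group minus the change in the control group, which nets out any trend shared by both.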

Example: Did the 2023 pitch clock reduce game times?

R Implementation:

# Simulate game time data for 2022 and 2023
set.seed(456)
n_games_per_year <- 2430  # 162 games × 30 teams / 2 (each game counted once)

game_times <- data.frame(
  game_id = 1:(n_games_per_year * 2),
  year = rep(c(2022, 2023), each = n_games_per_year),
  # 2022: no clock, 2023: clock implemented
  clock = rep(c(0, 1), each = n_games_per_year)
) %>%
  mutate(
    # Base game time: 180 minutes
    # Trend: games getting longer over time (+2 min per year)
    # Clock effect: -15 minutes (TRUE CAUSAL EFFECT)
    game_time = 180 + 2 * (year - 2022) - 15 * clock +
                rnorm(n(), 0, 10)
  )

# Calculate means by year
means_by_year <- game_times %>%
  group_by(year, clock) %>%
  summarize(mean_time = mean(game_time), .groups = "drop")

# Simple before-after comparison (WRONG - ignores time trend)
before_after <- diff(means_by_year$mean_time)
cat("Before-after comparison:", before_after, "minutes\n")
cat("This is biased because it ignores the underlying time trend!\n\n")

# Difference-in-differences estimate
# If we had a control group (hypothetical league without clock):
# Simulate control league
control_games <- data.frame(
  game_id = 1:(n_games_per_year * 2),
  year = rep(c(2022, 2023), each = n_games_per_year),
  clock = 0,  # Never implemented clock
  league = "Control"
) %>%
  mutate(
    # Same time trend, no clock effect
    game_time = 180 + 2 * (year - 2022) + rnorm(n(), 0, 10)
  )

# Add treatment league
treatment_games <- game_times %>%
  mutate(league = "Treatment")

# Combine
did_data <- bind_rows(control_games, treatment_games)

# Calculate DiD estimate
did_means <- did_data %>%
  group_by(league, year) %>%
  summarize(mean_time = mean(game_time), .groups = "drop")

# Manual DiD calculation
control_change <- did_means %>%
  filter(league == "Control") %>%
  pull(mean_time) %>%
  diff()

treatment_change <- did_means %>%
  filter(league == "Treatment") %>%
  pull(mean_time) %>%
  diff()

did_estimate <- treatment_change - control_change
cat("Difference-in-Differences estimate:", did_estimate, "minutes\n")
cat("True effect: -15 minutes\n\n")

# Regression approach to DiD
did_model <- lm(game_time ~ factor(year) + factor(league) +
                factor(year):factor(league),
               data = did_data)
summary(did_model)

# Visualization
ggplot(did_means, aes(x = year, y = mean_time, color = league, group = league)) +
  geom_line(size = 1.5) +
  geom_point(size = 4) +
  geom_vline(xintercept = 2022.5, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 2022.5, y = 195, label = "Clock\nImplemented",
           hjust = -0.1, size = 4) +
  scale_color_manual(values = c("Control" = "#E03A3E",
                               "Treatment" = "#4A90E2")) +
  labs(title = "Difference-in-Differences: Pitch Clock Effect on Game Time",
       subtitle = "Treatment league implements clock in 2023, control doesn't",
       x = "Year",
       y = "Average Game Time (minutes)",
       color = "League") +
  theme_minimal()

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

# Simulate game time data
np.random.seed(456)
n_games_per_year = 2430

years = np.repeat([2022, 2023], n_games_per_year)
clock = np.repeat([0, 1], n_games_per_year)

game_times = 180 + 2 * (years - 2022) - 15 * clock + np.random.normal(0, 10, len(years))

game_data = pd.DataFrame({
    'game_id': range(1, len(years) + 1),
    'year': years,
    'clock': clock,
    'game_time': game_times,
    'league': 'Treatment'
})

# Calculate means
means_by_year = game_data.groupby(['year', 'clock'])['game_time'].mean().reset_index()

# Simple before-after
before_after = means_by_year['game_time'].diff().iloc[-1]
print(f"Before-after comparison: {before_after:.2f} minutes")
print("This is biased because it ignores the underlying time trend!\n")

# Create control league data
years_control = np.repeat([2022, 2023], n_games_per_year)
game_times_control = 180 + 2 * (years_control - 2022) + np.random.normal(0, 10, len(years_control))

control_data = pd.DataFrame({
    'game_id': range(1, len(years_control) + 1),
    'year': years_control,
    'clock': 0,
    'game_time': game_times_control,
    'league': 'Control'
})

# Combine datasets
did_data = pd.concat([game_data, control_data], ignore_index=True)

# Calculate DiD estimate
did_means = did_data.groupby(['league', 'year'])['game_time'].mean().reset_index()

control_change = did_means[did_means['league'] == 'Control']['game_time'].diff().iloc[-1]
treatment_change = did_means[did_means['league'] == 'Treatment']['game_time'].diff().iloc[-1]

did_estimate = treatment_change - control_change
print(f"Difference-in-Differences estimate: {did_estimate:.2f} minutes")
print("True effect: -15.00 minutes\n")

# Regression approach: the post × treated interaction is the DiD estimate
did_data['post'] = (did_data['year'] == 2023).astype(int)
did_data['treated'] = (did_data['league'] == 'Treatment').astype(int)

model = ols('game_time ~ post * treated', data=did_data).fit()
print("DiD Regression Results:")
print(model.summary())

# Visualization
plt.figure(figsize=(10, 6))
for league in ['Control', 'Treatment']:
    league_data = did_means[did_means['league'] == league]
    color = '#E03A3E' if league == 'Control' else '#4A90E2'
    plt.plot(league_data['year'], league_data['game_time'],
            marker='o', markersize=10, linewidth=2.5,
            label=league, color=color)

plt.axvline(x=2022.5, color='gray', linestyle='--', alpha=0.5)
plt.text(2022.55, 195, 'Clock\nImplemented', fontsize=11)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Game Time (minutes)', fontsize=12)
plt.title('Difference-in-Differences: Pitch Clock Effect on Game Time\n' +
         'Treatment league implements clock in 2023, control doesn\'t',
         fontsize=13, fontweight='bold')
plt.legend(title='League', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

The DiD approach recovers the causal effect by differencing out the underlying time trend that would have occurred even without the pitch clock. This hinges on the parallel-trends assumption: absent the treatment, game times in the two leagues would have moved in parallel. That holds by construction in this simulation, but with real data it must be argued for, typically by showing that the groups trended together in the pre-treatment years.

R
library(MatchIt)
library(tidyverse)

# Simulate player development data
set.seed(987)
n_players <- 300

player_dev <- data.frame(
  player_id = 1:n_players,
  age = sample(22:28, n_players, replace = TRUE),
  prior_war = rnorm(n_players, mean = 2.0, sd = 1.5),
  motivation = rnorm(n_players, mean = 0, sd = 1)
) %>%
  mutate(
    prior_war = pmax(0, prior_war),
    # Treatment assignment (offseason training) depends on age and motivation
    # Younger, more motivated players more likely to train
    training_prob = plogis(-0.5 * (age - 25) + 0.6 * motivation),
    training = rbinom(n(), 1, training_prob),
    # Outcome (WAR improvement) depends on age, prior performance, motivation,
    # AND training (true causal effect = +0.5 WAR)
    war_improvement = 1.5 - 0.2 * (age - 25) +
                      0.3 * motivation - 0.1 * prior_war +
                      0.5 * training +  # TRUE CAUSAL EFFECT
                      rnorm(n(), 0, 0.8)
  )

# Naive analysis: just compare training vs no training
naive_estimate <- player_dev %>%
  group_by(training) %>%
  summarize(mean_improvement = mean(war_improvement), .groups = "drop")

cat("Naive comparison:\n")
print(naive_estimate)
cat("Naive estimate of training effect:",
    diff(naive_estimate$mean_improvement), "\n\n")

# Propensity score matching
match_formula <- training ~ age + prior_war + motivation
match_result <- matchit(match_formula, data = player_dev,
                       method = "nearest", ratio = 1)

# Get matched data
matched_data <- match.data(match_result)

# Estimate treatment effect in matched sample
matched_estimate <- matched_data %>%
  group_by(training) %>%
  summarize(mean_improvement = mean(war_improvement), .groups = "drop")

cat("Matched comparison:\n")
print(matched_estimate)
cat("Matched estimate of training effect:",
    diff(matched_estimate$mean_improvement), "\n")
cat("True effect: 0.5 WAR\n\n")

# Visualize matching balance
plot(match_result, type = "jitter", interactive = FALSE)

# Check covariate balance before and after matching
summary(match_result)

# Estimate with regression adjustment on matched data
lm_matched <- lm(war_improvement ~ training + age + prior_war + motivation,
                data = matched_data)
cat("Regression on matched data:\n")
summary(lm_matched)

# Visualize propensity score overlap
pscore_data <- data.frame(
  pscore = match_result$distance,
  training = player_dev$training,
  matched = player_dev$player_id %in% matched_data$player_id
)

ggplot(pscore_data, aes(x = pscore, fill = factor(training))) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  facet_wrap(~matched, labeller = labeller(matched = c(
    "FALSE" = "Unmatched Units",
    "TRUE" = "Matched Units"
  ))) +
  scale_fill_manual(values = c("0" = "#E03A3E", "1" = "#4A90E2"),
                   labels = c("No Training", "Training"),
                   name = "") +
  labs(title = "Propensity Score Distribution Before and After Matching",
       x = "Propensity Score (Probability of Receiving Training)",
       y = "Count") +
  theme_minimal()
R
# Simulate game time data for 2022 and 2023
set.seed(456)
n_games_per_year <- 2430  # 162 games × 15 teams (simplified)

game_times <- data.frame(
  game_id = 1:(n_games_per_year * 2),
  year = rep(c(2022, 2023), each = n_games_per_year),
  # 2022: no clock, 2023: clock implemented
  clock = rep(c(0, 1), each = n_games_per_year)
) %>%
  mutate(
    # Base game time: 180 minutes
    # Trend: games getting longer over time (+2 min per year)
    # Clock effect: -15 minutes (TRUE CAUSAL EFFECT)
    game_time = 180 + 2 * (year - 2022) - 15 * clock +
                rnorm(n(), 0, 10)
  )

# Calculate means by year
means_by_year <- game_times %>%
  group_by(year, clock) %>%
  summarize(mean_time = mean(game_time), .groups = "drop")

# Simple before-after comparison (WRONG - ignores time trend)
before_after <- diff(means_by_year$mean_time)
cat("Before-after comparison:", before_after, "minutes\n")
cat("This is biased because it ignores the underlying time trend!\n\n")

# Difference-in-differences estimate
# If we had a control group (hypothetical league without clock):
# Simulate control league
control_games <- data.frame(
  game_id = 1:(n_games_per_year * 2),
  year = rep(c(2022, 2023), each = n_games_per_year),
  clock = 0,  # Never implemented clock
  league = "Control"
) %>%
  mutate(
    # Same time trend, no clock effect
    game_time = 180 + 2 * (year - 2022) + rnorm(n(), 0, 10)
  )

# Add treatment league
treatment_games <- game_times %>%
  mutate(league = "Treatment")

# Combine
did_data <- bind_rows(control_games, treatment_games)

# Calculate DiD estimate
did_means <- did_data %>%
  group_by(league, year) %>%
  summarize(mean_time = mean(game_time), .groups = "drop")

# Manual DiD calculation
control_change <- did_means %>%
  filter(league == "Control") %>%
  pull(mean_time) %>%
  diff()

treatment_change <- did_means %>%
  filter(league == "Treatment") %>%
  pull(mean_time) %>%
  diff()

did_estimate <- treatment_change - control_change
cat("Difference-in-Differences estimate:", did_estimate, "minutes\n")
cat("True effect: -15 minutes\n\n")

# Regression approach to DiD
did_model <- lm(game_time ~ factor(year) + factor(league) +
                factor(year):factor(league),
               data = did_data)
summary(did_model)

# Visualization
ggplot(did_means, aes(x = year, y = mean_time, color = league, group = league)) +
  geom_line(size = 1.5) +
  geom_point(size = 4) +
  geom_vline(xintercept = 2022.5, linetype = "dashed", alpha = 0.5) +
  annotate("text", x = 2022.5, y = 195, label = "Clock\nImplemented",
           hjust = -0.1, size = 4) +
  scale_color_manual(values = c("Control" = "#E03A3E",
                               "Treatment" = "#4A90E2")) +
  labs(title = "Difference-in-Differences: Pitch Clock Effect on Game Time",
       subtitle = "Treatment league implements clock in 2023, control doesn't",
       x = "Year",
       y = "Average Game Time (minutes)",
       color = "League") +
  theme_minimal()
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from scipy import stats

# Simulate player development data
np.random.seed(987)
n_players = 300

ages = np.random.choice(range(22, 29), n_players)
prior_war = np.maximum(0, np.random.normal(2.0, 1.5, n_players))
motivation = np.random.normal(0, 1, n_players)

# Treatment assignment
training_prob = 1 / (1 + np.exp(-(-0.5 * (ages - 25) + 0.6 * motivation)))
training = np.random.binomial(1, training_prob)

# Outcome
war_improvement = (1.5 - 0.2 * (ages - 25) + 0.3 * motivation -
                  0.1 * prior_war + 0.5 * training +
                  np.random.normal(0, 0.8, n_players))

player_dev = pd.DataFrame({
    'player_id': range(1, n_players + 1),
    'age': ages,
    'prior_war': prior_war,
    'motivation': motivation,
    'training': training,
    'war_improvement': war_improvement
})

# Naive analysis
naive_estimate = player_dev.groupby('training')['war_improvement'].mean()
print("Naive comparison:")
print(naive_estimate)
print(f"Naive estimate of training effect: {naive_estimate[1] - naive_estimate[0]:.3f}\n")

# Propensity score matching
X = player_dev[['age', 'prior_war', 'motivation']]
y = player_dev['training']

# Fit logistic regression to estimate propensity scores
ps_model = LogisticRegression(random_state=42)
ps_model.fit(X, y)
player_dev['pscore'] = ps_model.predict_proba(X)[:, 1]

# Match treated units to control units with nearest propensity scores
treated = player_dev[player_dev['training'] == 1]
control = player_dev[player_dev['training'] == 0]

# Use nearest neighbors matching
nn = NearestNeighbors(n_neighbors=1, algorithm='auto')
nn.fit(control[['pscore']])

distances, indices = nn.kneighbors(treated[['pscore']])

# Create matched dataset
matched_control_ids = control.iloc[indices.flatten()]['player_id'].values
matched_treated_ids = treated['player_id'].values

matched_data = player_dev[
    player_dev['player_id'].isin(matched_treated_ids) |
    player_dev['player_id'].isin(matched_control_ids)
]

# Estimate treatment effect in matched sample
matched_estimate = matched_data.groupby('training')['war_improvement'].mean()
print("Matched comparison:")
print(matched_estimate)
print(f"Matched estimate of training effect: {matched_estimate[1] - matched_estimate[0]:.3f}")
print("True effect: 0.500\n")

# Visualize propensity scores
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before matching
axes[0].hist(player_dev[player_dev['training'] == 0]['pscore'],
            bins=30, alpha=0.6, label='No Training', color='#E03A3E')
axes[0].hist(player_dev[player_dev['training'] == 1]['pscore'],
            bins=30, alpha=0.6, label='Training', color='#4A90E2')
axes[0].set_xlabel('Propensity Score', fontsize=11)
axes[0].set_ylabel('Count', fontsize=11)
axes[0].set_title('Before Matching', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# After matching
axes[1].hist(matched_data[matched_data['training'] == 0]['pscore'],
            bins=30, alpha=0.6, label='No Training', color='#E03A3E')
axes[1].hist(matched_data[matched_data['training'] == 1]['pscore'],
            bins=30, alpha=0.6, label='Training', color='#4A90E2')
axes[1].set_xlabel('Propensity Score', fontsize=11)
axes[1].set_ylabel('Count', fontsize=11)
axes[1].set_title('After Matching', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.suptitle('Propensity Score Distribution Before and After Matching',
            fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# T-test on matched sample
t_stat, p_value = stats.ttest_ind(
    matched_data[matched_data['training'] == 1]['war_improvement'],
    matched_data[matched_data['training'] == 0]['war_improvement']
)
print(f"\nT-test on matched sample:")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.4f}")
Python Implementation (Difference-in-Differences):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

# Simulate game time data
np.random.seed(456)
n_games_per_year = 2430

years = np.repeat([2022, 2023], n_games_per_year)
clock = np.repeat([0, 1], n_games_per_year)

game_times = 180 + 2 * (years - 2022) - 15 * clock + np.random.normal(0, 10, len(years))

game_data = pd.DataFrame({
    'game_id': range(1, len(years) + 1),
    'year': years,
    'clock': clock,
    'game_time': game_times,
    'league': 'Treatment'
})

# Calculate means
means_by_year = game_data.groupby(['year', 'clock'])['game_time'].mean().reset_index()

# Simple before-after
before_after = means_by_year['game_time'].diff().iloc[-1]
print(f"Before-after comparison: {before_after:.2f} minutes")
print("This is biased because it ignores the underlying time trend!\n")

# Create control league data
years_control = np.repeat([2022, 2023], n_games_per_year)
game_times_control = 180 + 2 * (years_control - 2022) + np.random.normal(0, 10, len(years_control))

control_data = pd.DataFrame({
    'game_id': range(1, len(years_control) + 1),
    'year': years_control,
    'clock': 0,
    'game_time': game_times_control,
    'league': 'Control'
})

# Combine datasets
did_data = pd.concat([game_data, control_data], ignore_index=True)

# Calculate DiD estimate
did_means = did_data.groupby(['league', 'year'])['game_time'].mean().reset_index()

control_change = did_means[did_means['league'] == 'Control']['game_time'].diff().iloc[-1]
treatment_change = did_means[did_means['league'] == 'Treatment']['game_time'].diff().iloc[-1]

did_estimate = treatment_change - control_change
print(f"Difference-in-Differences estimate: {did_estimate:.2f} minutes")
print("True effect: -15.00 minutes\n")

# Regression approach: the coefficient on the post:treated interaction
# is the DiD estimate
did_data['post'] = (did_data['year'] == 2023).astype(int)
did_data['treated'] = (did_data['league'] == 'Treatment').astype(int)

model = ols('game_time ~ post + treated + post:treated',
            data=did_data).fit()
print("DiD Regression Results:")
print(model.summary())

# Visualization
plt.figure(figsize=(10, 6))
for league in ['Control', 'Treatment']:
    league_data = did_means[did_means['league'] == league]
    color = '#E03A3E' if league == 'Control' else '#4A90E2'
    plt.plot(league_data['year'], league_data['game_time'],
            marker='o', markersize=10, linewidth=2.5,
            label=league, color=color)

plt.axvline(x=2022.5, color='gray', linestyle='--', alpha=0.5)
plt.text(2022.55, 195, 'Clock\nImplemented', fontsize=11)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Game Time (minutes)', fontsize=12)
plt.title('Difference-in-Differences: Pitch Clock Effect on Game Time\n' +
         'Treatment league implements clock in 2023, control doesn\'t',
         fontsize=13, fontweight='bold')
plt.legend(title='League', fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

17.6 Monte Carlo Simulation

Monte Carlo simulation uses repeated random sampling to understand probabilistic systems. In baseball, we can simulate thousands of seasons, playoff series, or player careers to estimate probabilities and quantify uncertainty.

Simulating a Playoff Series

What's the probability that a team with 55% win probability in each game wins a best-of-seven series?

R Implementation:

# Monte Carlo simulation of playoff series
simulate_series <- function(p_win, n_games = 7, n_sims = 10000) {
  wins_needed <- ceiling(n_games / 2)

  results <- replicate(n_sims, {
    team_a_wins <- 0
    team_b_wins <- 0

    while (team_a_wins < wins_needed && team_b_wins < wins_needed) {
      if (runif(1) < p_win) {
        team_a_wins <- team_a_wins + 1
      } else {
        team_b_wins <- team_b_wins + 1
      }
    }

    return(team_a_wins > team_b_wins)
  })

  return(mean(results))
}

# Run simulation for various win probabilities
set.seed(123)
win_probs <- seq(0.50, 0.70, by = 0.05)
series_probs <- sapply(win_probs, function(p) simulate_series(p))

results_df <- data.frame(
  game_win_prob = win_probs,
  series_win_prob = series_probs
)

# Analytical solution for comparison: winning a best-of-7 is equivalent to
# winning at least 4 of 7 games, because playing out any remaining games
# cannot change which team reached 4 wins first
analytical_prob <- function(p, n = 7) {
  wins_needed <- ceiling(n / 2)
  w <- wins_needed:n
  sum(choose(n, w) * p^w * (1 - p)^(n - w))
}

results_df$analytical = sapply(win_probs, analytical_prob)

# Visualization
ggplot(results_df, aes(x = game_win_prob)) +
  geom_line(aes(y = series_win_prob, color = "Simulation"),
            size = 1.5) +
  geom_line(aes(y = analytical, color = "Analytical"),
            size = 1, linetype = "dashed") +
  geom_point(aes(y = series_win_prob), size = 3) +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("Simulation" = "#4A90E2",
                               "Analytical" = "#E03A3E")) +
  labs(title = "Playoff Series Win Probability (Best of 7)",
       subtitle = "Monte Carlo simulation vs analytical solution",
       x = "Single Game Win Probability",
       y = "Series Win Probability",
       color = "Method") +
  theme_minimal()

# Key insight (match on a tolerance: values produced by seq() may not
# exactly equal the float literals 0.55 and 0.60)
cat("\nKey Insights:\n")
cat("55% game winner has",
    round(results_df$series_win_prob[abs(results_df$game_win_prob - 0.55) < 1e-8] * 100, 1),
    "% chance to win series\n")
cat("60% game winner has",
    round(results_df$series_win_prob[abs(results_df$game_win_prob - 0.60) < 1e-8] * 100, 1),
    "% chance to win series\n")

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.special import comb

def simulate_series(p_win, n_games=7, n_sims=10000):
    """
    Monte Carlo simulation of playoff series
    """
    wins_needed = int(np.ceil(n_games / 2))

    series_wins = 0
    for _ in range(n_sims):
        team_a_wins = 0
        team_b_wins = 0

        while team_a_wins < wins_needed and team_b_wins < wins_needed:
            if np.random.random() < p_win:
                team_a_wins += 1
            else:
                team_b_wins += 1

        if team_a_wins > team_b_wins:
            series_wins += 1

    return series_wins / n_sims

# Run simulation for various win probabilities
np.random.seed(123)
win_probs = np.arange(0.50, 0.71, 0.05)
series_probs = [simulate_series(p) for p in win_probs]

results_df = pd.DataFrame({
    'game_win_prob': win_probs,
    'series_win_prob': series_probs
})

# Analytical solution: winning a best-of-7 equals winning at least 4 of 7 games
def analytical_prob(p, n=7):
    wins_needed = int(np.ceil(n / 2))
    prob = 0
    for w in range(wins_needed, n + 1):
        prob += comb(n, w, exact=True) * (p ** w) * ((1 - p) ** (n - w))
    return prob

results_df['analytical'] = [analytical_prob(p) for p in win_probs]

# Visualization
plt.figure(figsize=(10, 6))
plt.plot(results_df['game_win_prob'], results_df['series_win_prob'],
        'o-', linewidth=2.5, markersize=8, label='Simulation', color='#4A90E2')
plt.plot(results_df['game_win_prob'], results_df['analytical'],
        '--', linewidth=2, label='Analytical', color='#E03A3E')
plt.xlabel('Single Game Win Probability', fontsize=12)
plt.ylabel('Series Win Probability', fontsize=12)
plt.title('Playoff Series Win Probability (Best of 7)\n' +
         'Monte Carlo simulation vs analytical solution',
         fontsize=14, fontweight='bold')
plt.gca().xaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
plt.gca().yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# Key insights (use np.isclose: values from np.arange may not exactly
# equal the float literals 0.55 and 0.60)
prob_55 = results_df.loc[np.isclose(results_df['game_win_prob'], 0.55),
                         'series_win_prob'].values[0]
prob_60 = results_df.loc[np.isclose(results_df['game_win_prob'], 0.60),
                         'series_win_prob'].values[0]

print("\nKey Insights:")
print(f"55% game winner has {prob_55*100:.1f}% chance to win series")
print(f"60% game winner has {prob_60*100:.1f}% chance to win series")

Even a team with a 60% win probability in each game wins the series only about 71% of the time, demonstrating that playoff outcomes involve substantial randomness.
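
The 71% figure can be verified exactly. As a quick check, the sketch below (Python, using scipy.stats.binom) applies the same identity the analytical functions above rely on: winning a best-of-7 is equivalent to winning at least 4 of 7 games.

from scipy.stats import binom

# Exact series-win probability: the binomial upper tail P(X >= 4), X ~ Bin(7, p)
p_series = 1 - binom.cdf(3, 7, 0.60)
print(f"Exact series-win probability at p = 0.60: {p_series:.3f}")  # ~0.710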

Simulating Season Outcomes with Uncertainty

We can simulate full seasons incorporating uncertainty in team quality estimates.

R Implementation:

# Simulate a season accounting for uncertainty in true talent
simulate_season <- function(team_talents, n_games = 162, n_sims = 1000) {
  n_teams <- length(team_talents)

  # Store results
  final_records <- matrix(0, nrow = n_sims, ncol = n_teams)

  for (sim in 1:n_sims) {
    # Each team plays n_games
    for (team in 1:n_teams) {
      wins <- rbinom(1, n_games, team_talents[team])
      final_records[sim, team] <- wins
    }
  }

  return(final_records)
}

# Simulate a division with 5 teams
set.seed(789)
team_names <- c("Team A", "Team B", "Team C", "Team D", "Team E")
true_talents <- c(0.580, 0.550, 0.500, 0.475, 0.420)

# Run simulation
season_results <- simulate_season(true_talents, n_sims = 10000)
colnames(season_results) <- team_names

# Calculate playoff probabilities (assume top 2 teams make playoffs)
# Rank teams within each simulated season (1 = most wins); computing the
# rank matrix once avoids repeating the work for every team
ranks <- apply(season_results, 1, function(x) rank(-x))

playoff_probs <- sapply(seq_along(team_names), function(i) {
  mean(ranks[i, ] <= 2)  # how often does team i finish in the top 2?
})

prob_df <- data.frame(
  team = team_names,
  true_talent = true_talents,
  playoff_prob = playoff_probs
)

# Visualize
ggplot(prob_df, aes(x = true_talent, y = playoff_prob)) +
  geom_point(size = 4, color = "#4A90E2") +
  geom_text(aes(label = team), hjust = -0.2, vjust = 0) +
  scale_x_continuous(labels = scales::percent, limits = c(0.40, 0.60)) +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Playoff Probability vs True Talent",
       subtitle = "10,000 simulated 162-game seasons",
       x = "True Win Percentage",
       y = "Probability of Top-2 Finish (Playoff Berth)") +
  theme_minimal()

# Distribution of wins for best team
team_a_wins <- season_results[, "Team A"]
wins_dist <- data.frame(wins = team_a_wins)

ggplot(wins_dist, aes(x = wins)) +
  geom_histogram(binwidth = 2, fill = "#4A90E2", alpha = 0.7, color = "white") +
  geom_vline(xintercept = 0.580 * 162, color = "red",
             linetype = "dashed", size = 1) +
  annotate("text", x = 0.580 * 162 + 3, y = 500,
           label = "Expected\nWins (94)", hjust = 0, color = "red") +
  labs(title = "Distribution of Win Totals for .580 Team",
       subtitle = "10,000 simulated seasons",
       x = "Wins",
       y = "Frequency") +
  theme_minimal()

# Calculate probability of different win ranges
cat("\nWin Distribution for .580 Team (Team A):\n")
cat("90-94 wins:", mean(team_a_wins >= 90 & team_a_wins <= 94), "\n")
cat("95-99 wins:", mean(team_a_wins >= 95 & team_a_wins <= 99), "\n")
cat("100+ wins:", mean(team_a_wins >= 100), "\n")
cat("Below 90 wins:", mean(team_a_wins < 90), "\n")

Python Implementation:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def simulate_season(team_talents, n_games=162, n_sims=1000):
    """
    Simulate season outcomes accounting for uncertainty
    """
    n_teams = len(team_talents)
    final_records = np.zeros((n_sims, n_teams))

    for sim in range(n_sims):
        for team_idx in range(n_teams):
            wins = np.random.binomial(n_games, team_talents[team_idx])
            final_records[sim, team_idx] = wins

    return final_records

# Simulate a division with 5 teams
np.random.seed(789)
team_names = ["Team A", "Team B", "Team C", "Team D", "Team E"]
true_talents = np.array([0.580, 0.550, 0.500, 0.475, 0.420])

# Run simulation
season_results = simulate_season(true_talents, n_sims=10000)

# Calculate playoff probabilities (top 2 teams make playoffs)
# argsort(-wins) orders team indices from most to fewest wins in each season
order = np.argsort(-season_results, axis=1)
playoff_probs = [np.mean(np.any(order[:, :2] == i, axis=1))
                 for i in range(len(team_names))]

prob_df = pd.DataFrame({
    'team': team_names,
    'true_talent': true_talents,
    'playoff_prob': playoff_probs
})

# Visualization 1: Playoff probability vs talent
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].scatter(prob_df['true_talent'], prob_df['playoff_prob'],
               s=150, color='#4A90E2')
for idx, row in prob_df.iterrows():
    axes[0].text(row['true_talent'] + 0.005, row['playoff_prob'],
                row['team'], fontsize=10)
axes[0].set_xlabel('True Win Percentage', fontsize=12)
axes[0].set_ylabel('Probability of Top-2 Finish (Playoff Berth)', fontsize=12)
axes[0].set_title('Playoff Probability vs True Talent\n10,000 simulated 162-game seasons',
                 fontsize=13, fontweight='bold')
axes[0].xaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
axes[0].yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
axes[0].grid(alpha=0.3)

# Visualization 2: Win distribution for best team
team_a_wins = season_results[:, 0]
axes[1].hist(team_a_wins, bins=30, color='#4A90E2', alpha=0.7, edgecolor='white')
axes[1].axvline(x=0.580 * 162, color='red', linestyle='--', linewidth=2)
axes[1].text(0.580 * 162 + 1, 500, 'Expected\nWins (94)',
            color='red', fontsize=10)
axes[1].set_xlabel('Wins', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Win Totals for .580 Team\n10,000 simulated seasons',
                 fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate probability of different win ranges
print("\nWin Distribution for .580 Team (Team A):")
print(f"90-94 wins: {np.mean((team_a_wins >= 90) & (team_a_wins <= 94)):.3f}")
print(f"95-99 wins: {np.mean((team_a_wins >= 95) & (team_a_wins <= 99)):.3f}")
print(f"100+ wins: {np.mean(team_a_wins >= 100):.3f}")
print(f"Below 90 wins: {np.mean(team_a_wins < 90):.3f}")

This simulation reveals that even the best team (.580 true talent) is not assured a playoff berth, and it finishes below 90 wins in roughly a quarter of seasons despite being genuinely excellent.
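
Note that the simulation above holds each team's true talent fixed, so the only randomness is binomial game-to-game variation. To incorporate uncertainty in the talent estimates themselves, as the subsection's framing suggests, draw each team's talent from a distribution before simulating its season. A minimal sketch in Python, assuming a hypothetical Normal uncertainty of 0.025 around each estimate:

import numpy as np

rng = np.random.default_rng(789)
n_sims, n_games = 10000, 162

talent_means = np.array([0.580, 0.550, 0.500, 0.475, 0.420])
talent_sds = np.full(5, 0.025)  # assumed estimation uncertainty, not fit to data

# Draw a fresh talent vector for every simulated season, then simulate games
talents = rng.normal(talent_means, talent_sds, size=(n_sims, 5)).clip(0.0, 1.0)
wins = rng.binomial(n_games, talents)

# Talent uncertainty widens the win distribution beyond binomial noise alone
fixed_wins = rng.binomial(n_games, talent_means[0], n_sims)
print(f"SD of Team A wins, fixed talent:     {fixed_wins.std():.1f}")
print(f"SD of Team A wins, uncertain talent: {wins[:, 0].std():.1f}")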

17.7 Exercises

Exercise 17.1: Bayesian Rookie Evaluation

A rookie pitcher has thrown 25 innings with a 2.52 ERA. Using Bayesian methods:

  1. Calculate a Bayesian estimate of his true ERA using a prior of N(4.00, 0.80^2)
  2. Construct a 95% credible interval
  3. What's the probability his true ERA is below 3.50?
  4. How many innings would he need to pitch at 2.52 ERA before we're 90% confident his true ERA is below 3.50?

Implement your solution in R or Python and visualize the posterior distribution.

Exercise 17.2: Hierarchical Team-Player Model

Using data for a season (real or simulated):

  1. Fit a hierarchical model where player OPS varies by player and team
  2. Extract team-level random effects
  3. Compare team effects to actual team winning percentages
  4. Identify players whose performance is most "above" or "below" what the hierarchical model predicts based on their team
  5. Discuss: What does it mean when a player substantially outperforms his hierarchical prediction?

Exercise 17.3: Detecting Velocity Changes

Create or obtain pitch-by-pitch velocity data for a starting pitcher over a season:

  1. Plot velocity over time (by game or pitch number)
  2. Fit a smooth trend to identify any systematic velocity decline
  3. Use changepoint detection to identify if there was an abrupt velocity drop (potentially indicating injury)
  4. Calculate the probability that any detected changepoint is real vs noise
  5. If you detect a changepoint, investigate whether the pitcher's effectiveness (measured by wOBA allowed or K-BB%) changed simultaneously

Exercise 17.4: Causal Effect of Designated Hitter

Using propensity score matching:

  1. Compare the career length of pitchers in AL vs NL (before universal DH in 2022)
  2. Account for confounding variables: pitcher quality (career ERA+), usage pattern (starter vs reliever), age at debut
  3. Estimate the causal effect of playing in the AL (with DH) vs NL (pitchers bat) on pitcher career length
  4. Discuss: Why might the DH extend (or shorten) pitcher careers? What are alternative explanations for any observed difference?

You've now explored six advanced statistical techniques with particular application to baseball analytics. Bayesian methods help us update beliefs rationally in the face of limited data. Hierarchical models share information across groups to improve estimates. Time series methods model temporal trends and detect change points. Survival analysis quantifies time-to-event processes. Causal inference attempts to establish causation from observational data. Monte Carlo simulation quantifies uncertainty and probabilities in complex systems.

These methods represent the frontier of baseball analytics. While traditional statistics remain valuable for description and basic inference, these advanced techniques enable analysts to answer more sophisticated questions: not just "what happened?" but "what's the player's true talent?", "what caused this change?", "what will happen?", and "how certain should we be?"

As you apply these methods to real baseball questions, remember that sophisticated techniques don't guarantee correct answers. Always question your assumptions: Are your priors reasonable? Are your groups truly comparable? Is your model appropriate for your data structure? The best analysts combine advanced statistical methods with deep baseball knowledge and healthy skepticism about their own conclusions.

In the next chapter, we'll explore machine learning methods that build on these statistical foundations to create predictive models for player performance, game outcomes, and strategic decisions.


Chapter Summary

In this chapter, you learned about advanced statistical methods. Key topics covered:

  • Bayesian Methods
  • Hierarchical Models
  • Time Series Analysis
  • Survival Analysis
  • Causal Inference
  • Monte Carlo Simulation