Chapter 18: Career Analytics & Hall of Fame Analysis

18.1 Measuring Career Value (WAR, JAWS)

Wins Above Replacement (WAR)

WAR is the most comprehensive single metric for measuring a player's total value. It combines batting, baserunning, fielding, and positional adjustments into one number representing wins contributed above a replacement-level player.

Key Components:

Batting Runs: Offensive value above average

Baserunning Runs: Stolen base and baserunning value

Fielding Runs: Defensive value above average

Positional Adjustment: Credit for playing demanding positions

Replacement Level: Baseline for comparison (typically .294 winning percentage)

The two primary WAR implementations are:

bWAR (Baseball-Reference): Uses Defensive Runs Saved (DRS)

fWAR (FanGraphs): Uses Ultimate Zone Rating (UZR)

Career WAR Analysis

# R: Analyzing Career WAR
library(tidyverse)
library(Lahman)

# Top 50 Position Players by Career WAR (using batting stats as proxy)
career_war <- Batting %>%
  group_by(playerID) %>%
  summarise(
    seasons = n(),
    games = sum(G, na.rm = TRUE),
    pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
    hr = sum(HR, na.rm = TRUE),
    sb = sum(SB, na.rm = TRUE),
    avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    obp = sum(H + BB + HBP, na.rm = TRUE) /
          sum(AB + BB + HBP + SF, na.rm = TRUE),
    slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
           2 * sum(`2B`, na.rm = TRUE) +
           3 * sum(`3B`, na.rm = TRUE) +
           4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
  ) %>%
  filter(pa >= 3000) %>%
  mutate(
    ops = obp + slg,
    # Simplified WAR estimate
    estimated_war = (ops - 0.710) * pa / 20
  )

# Get player names
player_names <- People %>%
  select(playerID, nameFirst, nameLast, debut, finalGame)

career_leaders <- career_war %>%
  left_join(player_names, by = "playerID") %>%
  arrange(desc(estimated_war)) %>%
  head(20) %>%
  mutate(
    name = paste(nameFirst, nameLast),
    war_per_season = estimated_war / seasons
  ) %>%
  select(name, seasons, games, pa, hr, avg, ops, estimated_war, war_per_season)

print(career_leaders)

# Visualize career WAR distribution
ggplot(career_war %>% filter(estimated_war > 0),
       aes(x = estimated_war)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 50, linetype = "dashed", color = "red", size = 1) +
  annotate("text", x = 52, y = 100, label = "HOF Threshold (~50 WAR)",
           hjust = 0, color = "red") +
  labs(title = "Distribution of Career WAR",
       subtitle = "Players with 3000+ Plate Appearances",
       x = "Career WAR (Estimated)",
       y = "Number of Players") +
  theme_minimal()

# Python: Career WAR Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import batting_stats, playerid_lookup, statcast_batter

# Get career statistics for recent players
def calculate_career_war(start_year=1995, end_year=2023):
    """Calculate career statistics and estimated WAR"""
    all_stats = []

    for year in range(start_year, end_year + 1):
        try:
            stats = batting_stats(year)
            stats['Season'] = year
            all_stats.append(stats)
        except:
            continue

    career_df = pd.concat(all_stats, ignore_index=True)

    # Aggregate by player
    career_totals = career_df.groupby('Name').agg({
        'Season': 'count',
        'G': 'sum',
        'PA': 'sum',
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'BB': 'sum',
        'SO': 'sum',
        'WAR': 'sum',  # If available in data
        'AVG': 'mean',
        'OBP': 'mean',
        'SLG': 'mean'
    }).reset_index()

    career_totals.columns = ['Name', 'Seasons', 'G', 'PA', 'AB', 'H',
                             'HR', 'BB', 'SO', 'Career_WAR', 'AVG', 'OBP', 'SLG']

    # Filter for qualified players
    career_totals = career_totals[career_totals['PA'] >= 3000].copy()

    # Calculate rate stats
    career_totals['WAR_per_Season'] = career_totals['Career_WAR'] / career_totals['Seasons']
    career_totals['WAR_per_162'] = career_totals['Career_WAR'] / career_totals['G'] * 162

    return career_totals.sort_values('Career_WAR', ascending=False)

# Example with historical HOF data
hof_players = pd.DataFrame({
    'Name': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
             'Ted Williams', 'Mike Trout', 'Albert Pujols', 'Derek Jeter'],
    'Career_WAR': [183.1, 156.1, 162.8, 143.0, 121.9, 85.5, 99.9, 72.4],
    'Seasons': [22, 23, 22, 23, 19, 13, 22, 20],
    'Peak_WAR_7yr': [71.0, 72.9, 72.8, 59.6, 63.3, 56.8, 56.4, 38.5],
    'HOF': [1, 1, 0, 1, 1, 0, 0, 1]
})

hof_players['WAR_per_Season'] = hof_players['Career_WAR'] / hof_players['Seasons']
hof_players['JAWS'] = (hof_players['Career_WAR'] + hof_players['Peak_WAR_7yr']) / 2

print("Hall of Fame Career Value Comparison:")
print(hof_players.to_string(index=False))

# Visualize WAR components
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Career WAR vs Peak WAR
axes[0].scatter(hof_players['Career_WAR'], hof_players['Peak_WAR_7yr'],
               c=hof_players['HOF'], cmap='RdYlGn', s=200, alpha=0.6)
for idx, row in hof_players.iterrows():
    axes[0].annotate(row['Name'], (row['Career_WAR'], row['Peak_WAR_7yr']),
                    fontsize=8, ha='right')
axes[0].set_xlabel('Career WAR')
axes[0].set_ylabel('Peak WAR (7-year)')
axes[0].set_title('Career vs Peak Value')
axes[0].grid(True, alpha=0.3)

# WAR per Season
axes[1].barh(hof_players['Name'], hof_players['WAR_per_Season'],
            color=['green' if x == 1 else 'gray' for x in hof_players['HOF']])
axes[1].set_xlabel('WAR per Season')
axes[1].set_title('Average Annual Value')
axes[1].axvline(x=5.0, color='red', linestyle='--', label='All-Star Level')
axes[1].legend()

plt.tight_layout()
plt.savefig('career_war_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

JAWS (Jaffe WAR Score System)

Developed by Jay Jaffe, JAWS balances career value with peak performance to evaluate Hall of Fame worthiness. It averages a player's career WAR with their best 7-year peak WAR.

$$\text{JAWS} = \frac{\text{Career WAR} + \text{Peak WAR (7-year)}}{2}$$

Why JAWS Matters:

Prevents compilers (long but mediocre careers) from dominating

Rewards sustained excellence

Position-specific comparisons

Objective HOF standard

# R: JAWS Calculation
library(zoo)

# Function to calculate peak 7-year WAR
calculate_peak_war <- function(yearly_war, window = 7) {
  if (length(yearly_war) < window) {
    return(sum(yearly_war, na.rm = TRUE))
  }
  return(max(rollapply(yearly_war, width = window, FUN = sum,
                       align = "left", partial = TRUE)))
}

# Simulate career trajectories for HOF analysis
set.seed(123)
simulate_careers <- function(n = 1000) {
  careers <- lapply(1:n, function(i) {
    # Random career length (10-25 years)
    years <- sample(10:25, 1)

    # Peak around year 5-7
    peak_year <- sample(5:7, 1)

    # Generate WAR trajectory
    war <- numeric(years)
    for (j in 1:years) {
      if (j < peak_year) {
        # Rising phase
        war[j] <- rnorm(1, mean = 2 + j * 0.8, sd = 1.5)
      } else if (j <= peak_year + 3) {
        # Peak phase
        war[j] <- rnorm(1, mean = 6, sd = 1.5)
      } else {
        # Decline phase
        decline_factor <- (j - peak_year - 3) * 0.5
        war[j] <- rnorm(1, mean = max(1, 5 - decline_factor), sd = 1.5)
      }
      war[j] <- max(0, war[j])  # No negative WAR
    }

    career_war <- sum(war)
    peak_war <- calculate_peak_war(war, 7)
    jaws <- (career_war + peak_war) / 2

    data.frame(
      career_id = i,
      years = years,
      career_war = career_war,
      peak_war = peak_war,
      jaws = jaws
    )
  })

  bind_rows(careers)
}

career_sims <- simulate_careers(1000)

# HOF threshold analysis
hof_threshold <- 50  # Typical HOF threshold
career_sims <- career_sims %>%
  mutate(
    hof_worthy = career_war >= hof_threshold,
    jaws_worthy = jaws >= (hof_threshold * 0.75)
  )

# Compare criteria
criteria_comparison <- career_sims %>%
  summarise(
    total = n(),
    war_hof = sum(hof_worthy),
    jaws_hof = sum(jaws_worthy),
    both = sum(hof_worthy & jaws_worthy)
  )

print("HOF Criteria Comparison:")
print(criteria_comparison)

# Visualize career vs peak
ggplot(career_sims, aes(x = career_war, y = peak_war)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 50, linetype = "dashed", color = "darkgreen") +
  geom_hline(yintercept = 35, linetype = "dashed", color = "darkgreen") +
  annotate("rect", xmin = 50, xmax = Inf, ymin = 35, ymax = Inf,
           alpha = 0.2, fill = "green") +
  labs(title = "Career WAR vs Peak WAR (7-year)",
       subtitle = "Green zone indicates typical HOF territory",
       x = "Career WAR",
       y = "Peak 7-Year WAR") +
  theme_minimal()

Position-Specific JAWS Standards

# Python: Position-Specific JAWS Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Historical HOF standards by position (approximate)
position_standards = pd.DataFrame({
    'Position': ['C', '1B', '2B', '3B', 'SS', 'LF', 'CF', 'RF', 'DH'],
    'HOF_Avg_WAR': [53.5, 66.7, 69.2, 67.1, 67.4, 64.9, 69.7, 70.3, 27.4],
    'HOF_Avg_Peak': [33.4, 42.3, 42.7, 42.1, 42.8, 41.5, 43.5, 43.6, 18.8],
    'HOF_Members': [17, 21, 19, 15, 24, 20, 19, 26, 4]
})

position_standards['HOF_Avg_JAWS'] = (
    (position_standards['HOF_Avg_WAR'] + position_standards['HOF_Avg_Peak']) / 2
)

print("Hall of Fame Standards by Position:")
print(position_standards.to_string(index=False))

# Active players vs HOF standards
active_players = pd.DataFrame({
    'Name': ['Mike Trout', 'Mookie Betts', 'Francisco Lindor',
             'Nolan Arenado', 'Jose Ramirez', 'Aaron Judge'],
    'Position': ['CF', 'RF', 'SS', '3B', '3B', 'RF'],
    'Age': [32, 31, 30, 33, 31, 32],
    'Career_WAR': [85.5, 68.2, 51.3, 49.8, 45.6, 42.1],
    'Peak_7yr_WAR': [56.8, 48.3, 38.7, 35.2, 38.9, 35.4]
})

active_players['JAWS'] = (
    (active_players['Career_WAR'] + active_players['Peak_7yr_WAR']) / 2
)

# Merge with standards
active_comparison = active_players.merge(
    position_standards[['Position', 'HOF_Avg_WAR', 'HOF_Avg_JAWS']],
    on='Position'
)

active_comparison['WAR_pct_of_standard'] = (
    active_comparison['Career_WAR'] / active_comparison['HOF_Avg_WAR'] * 100
)
active_comparison['JAWS_pct_of_standard'] = (
    active_comparison['JAWS'] / active_comparison['HOF_Avg_JAWS'] * 100
)

print("\nActive Players vs HOF Standards:")
print(active_comparison[['Name', 'Position', 'Career_WAR', 'HOF_Avg_WAR',
                         'WAR_pct_of_standard', 'JAWS_pct_of_standard']].to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Position standards
axes[0].barh(position_standards['Position'], position_standards['HOF_Avg_JAWS'],
            color='steelblue', alpha=0.7)
axes[0].set_xlabel('JAWS')
axes[0].set_title('Average JAWS by Position (HOF Members)')
axes[0].grid(True, alpha=0.3, axis='x')

# Active players comparison
colors = ['green' if x >= 100 else 'orange' if x >= 75 else 'red'
          for x in active_comparison['JAWS_pct_of_standard']]
axes[1].barh(active_comparison['Name'],
            active_comparison['JAWS_pct_of_standard'],
            color=colors, alpha=0.7)
axes[1].axvline(x=100, color='black', linestyle='--', label='HOF Standard')
axes[1].set_xlabel('% of HOF Standard (JAWS)')
axes[1].set_title('Active Players vs Position Standards')
axes[1].legend()

plt.tight_layout()
plt.savefig('jaws_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# R: Analyzing Career WAR
library(tidyverse)
library(Lahman)

# Top 50 Position Players by Career WAR (using batting stats as proxy)
career_war <- Batting %>%
  group_by(playerID) %>%
  summarise(
    seasons = n(),
    games = sum(G, na.rm = TRUE),
    pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
    hr = sum(HR, na.rm = TRUE),
    sb = sum(SB, na.rm = TRUE),
    avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    obp = sum(H + BB + HBP, na.rm = TRUE) /
          sum(AB + BB + HBP + SF, na.rm = TRUE),
    slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
           2 * sum(`2B`, na.rm = TRUE) +
           3 * sum(`3B`, na.rm = TRUE) +
           4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
  ) %>%
  filter(pa >= 3000) %>%
  mutate(
    ops = obp + slg,
    # Simplified WAR estimate
    estimated_war = (ops - 0.710) * pa / 20
  )

# Get player names
player_names <- People %>%
  select(playerID, nameFirst, nameLast, debut, finalGame)

career_leaders <- career_war %>%
  left_join(player_names, by = "playerID") %>%
  arrange(desc(estimated_war)) %>%
  head(20) %>%
  mutate(
    name = paste(nameFirst, nameLast),
    war_per_season = estimated_war / seasons
  ) %>%
  select(name, seasons, games, pa, hr, avg, ops, estimated_war, war_per_season)

print(career_leaders)

# Visualize career WAR distribution
ggplot(career_war %>% filter(estimated_war > 0),
       aes(x = estimated_war)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 50, linetype = "dashed", color = "red", size = 1) +
  annotate("text", x = 52, y = 100, label = "HOF Threshold (~50 WAR)",
           hjust = 0, color = "red") +
  labs(title = "Distribution of Career WAR",
       subtitle = "Players with 3000+ Plate Appearances",
       x = "Career WAR (Estimated)",
       y = "Number of Players") +
  theme_minimal()

# R: JAWS Calculation
library(zoo)

# Function to calculate peak 7-year WAR
calculate_peak_war <- function(yearly_war, window = 7) {
  if (length(yearly_war) < window) {
    return(sum(yearly_war, na.rm = TRUE))
  }
  return(max(rollapply(yearly_war, width = window, FUN = sum,
                       align = "left", partial = TRUE)))
}

# Simulate career trajectories for HOF analysis
set.seed(123)
simulate_careers <- function(n = 1000) {
  careers <- lapply(1:n, function(i) {
    # Random career length (10-25 years)
    years <- sample(10:25, 1)

    # Peak around year 5-7
    peak_year <- sample(5:7, 1)

    # Generate WAR trajectory
    war <- numeric(years)
    for (j in 1:years) {
      if (j < peak_year) {
        # Rising phase
        war[j] <- rnorm(1, mean = 2 + j * 0.8, sd = 1.5)
      } else if (j <= peak_year + 3) {
        # Peak phase
        war[j] <- rnorm(1, mean = 6, sd = 1.5)
      } else {
        # Decline phase
        decline_factor <- (j - peak_year - 3) * 0.5
        war[j] <- rnorm(1, mean = max(1, 5 - decline_factor), sd = 1.5)
      }
      war[j] <- max(0, war[j])  # No negative WAR
    }

    career_war <- sum(war)
    peak_war <- calculate_peak_war(war, 7)
    jaws <- (career_war + peak_war) / 2

    data.frame(
      career_id = i,
      years = years,
      career_war = career_war,
      peak_war = peak_war,
      jaws = jaws
    )
  })

  bind_rows(careers)
}

career_sims <- simulate_careers(1000)

# HOF threshold analysis
hof_threshold <- 50  # Typical HOF threshold
career_sims <- career_sims %>%
  mutate(
    hof_worthy = career_war >= hof_threshold,
    jaws_worthy = jaws >= (hof_threshold * 0.75)
  )

# Compare criteria
criteria_comparison <- career_sims %>%
  summarise(
    total = n(),
    war_hof = sum(hof_worthy),
    jaws_hof = sum(jaws_worthy),
    both = sum(hof_worthy & jaws_worthy)
  )

print("HOF Criteria Comparison:")
print(criteria_comparison)

# Visualize career vs peak
ggplot(career_sims, aes(x = career_war, y = peak_war)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 50, linetype = "dashed", color = "darkgreen") +
  geom_hline(yintercept = 35, linetype = "dashed", color = "darkgreen") +
  annotate("rect", xmin = 50, xmax = Inf, ymin = 35, ymax = Inf,
           alpha = 0.2, fill = "green") +
  labs(title = "Career WAR vs Peak WAR (7-year)",
       subtitle = "Green zone indicates typical HOF territory",
       x = "Career WAR",
       y = "Peak 7-Year WAR") +
  theme_minimal()

Python

# Python: Career WAR Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import batting_stats, playerid_lookup, statcast_batter

# Get career statistics for recent players
def calculate_career_war(start_year=1995, end_year=2023):
    """Calculate career statistics and estimated WAR"""
    all_stats = []

    for year in range(start_year, end_year + 1):
        try:
            stats = batting_stats(year)
            stats['Season'] = year
            all_stats.append(stats)
        except:
            continue

    career_df = pd.concat(all_stats, ignore_index=True)

    # Aggregate by player
    career_totals = career_df.groupby('Name').agg({
        'Season': 'count',
        'G': 'sum',
        'PA': 'sum',
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'BB': 'sum',
        'SO': 'sum',
        'WAR': 'sum',  # If available in data
        'AVG': 'mean',
        'OBP': 'mean',
        'SLG': 'mean'
    }).reset_index()

    career_totals.columns = ['Name', 'Seasons', 'G', 'PA', 'AB', 'H',
                             'HR', 'BB', 'SO', 'Career_WAR', 'AVG', 'OBP', 'SLG']

    # Filter for qualified players
    career_totals = career_totals[career_totals['PA'] >= 3000].copy()

    # Calculate rate stats
    career_totals['WAR_per_Season'] = career_totals['Career_WAR'] / career_totals['Seasons']
    career_totals['WAR_per_162'] = career_totals['Career_WAR'] / career_totals['G'] * 162

    return career_totals.sort_values('Career_WAR', ascending=False)

# Example with historical HOF data
hof_players = pd.DataFrame({
    'Name': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
             'Ted Williams', 'Mike Trout', 'Albert Pujols', 'Derek Jeter'],
    'Career_WAR': [183.1, 156.1, 162.8, 143.0, 121.9, 85.5, 99.9, 72.4],
    'Seasons': [22, 23, 22, 23, 19, 13, 22, 20],
    'Peak_WAR_7yr': [71.0, 72.9, 72.8, 59.6, 63.3, 56.8, 56.4, 38.5],
    'HOF': [1, 1, 0, 1, 1, 0, 0, 1]
})

hof_players['WAR_per_Season'] = hof_players['Career_WAR'] / hof_players['Seasons']
hof_players['JAWS'] = (hof_players['Career_WAR'] + hof_players['Peak_WAR_7yr']) / 2

print("Hall of Fame Career Value Comparison:")
print(hof_players.to_string(index=False))

# Visualize WAR components
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Career WAR vs Peak WAR
axes[0].scatter(hof_players['Career_WAR'], hof_players['Peak_WAR_7yr'],
               c=hof_players['HOF'], cmap='RdYlGn', s=200, alpha=0.6)
for idx, row in hof_players.iterrows():
    axes[0].annotate(row['Name'], (row['Career_WAR'], row['Peak_WAR_7yr']),
                    fontsize=8, ha='right')
axes[0].set_xlabel('Career WAR')
axes[0].set_ylabel('Peak WAR (7-year)')
axes[0].set_title('Career vs Peak Value')
axes[0].grid(True, alpha=0.3)

# WAR per Season
axes[1].barh(hof_players['Name'], hof_players['WAR_per_Season'],
            color=['green' if x == 1 else 'gray' for x in hof_players['HOF']])
axes[1].set_xlabel('WAR per Season')
axes[1].set_title('Average Annual Value')
axes[1].axvline(x=5.0, color='red', linestyle='--', label='All-Star Level')
axes[1].legend()

plt.tight_layout()
plt.savefig('career_war_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

Python

# Python: Position-Specific JAWS Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Historical HOF standards by position (approximate)
position_standards = pd.DataFrame({
    'Position': ['C', '1B', '2B', '3B', 'SS', 'LF', 'CF', 'RF', 'DH'],
    'HOF_Avg_WAR': [53.5, 66.7, 69.2, 67.1, 67.4, 64.9, 69.7, 70.3, 27.4],
    'HOF_Avg_Peak': [33.4, 42.3, 42.7, 42.1, 42.8, 41.5, 43.5, 43.6, 18.8],
    'HOF_Members': [17, 21, 19, 15, 24, 20, 19, 26, 4]
})

position_standards['HOF_Avg_JAWS'] = (
    (position_standards['HOF_Avg_WAR'] + position_standards['HOF_Avg_Peak']) / 2
)

print("Hall of Fame Standards by Position:")
print(position_standards.to_string(index=False))

# Active players vs HOF standards
active_players = pd.DataFrame({
    'Name': ['Mike Trout', 'Mookie Betts', 'Francisco Lindor',
             'Nolan Arenado', 'Jose Ramirez', 'Aaron Judge'],
    'Position': ['CF', 'RF', 'SS', '3B', '3B', 'RF'],
    'Age': [32, 31, 30, 33, 31, 32],
    'Career_WAR': [85.5, 68.2, 51.3, 49.8, 45.6, 42.1],
    'Peak_7yr_WAR': [56.8, 48.3, 38.7, 35.2, 38.9, 35.4]
})

active_players['JAWS'] = (
    (active_players['Career_WAR'] + active_players['Peak_7yr_WAR']) / 2
)

# Merge with standards
active_comparison = active_players.merge(
    position_standards[['Position', 'HOF_Avg_WAR', 'HOF_Avg_JAWS']],
    on='Position'
)

active_comparison['WAR_pct_of_standard'] = (
    active_comparison['Career_WAR'] / active_comparison['HOF_Avg_WAR'] * 100
)
active_comparison['JAWS_pct_of_standard'] = (
    active_comparison['JAWS'] / active_comparison['HOF_Avg_JAWS'] * 100
)

print("\nActive Players vs HOF Standards:")
print(active_comparison[['Name', 'Position', 'Career_WAR', 'HOF_Avg_WAR',
                         'WAR_pct_of_standard', 'JAWS_pct_of_standard']].to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Position standards
axes[0].barh(position_standards['Position'], position_standards['HOF_Avg_JAWS'],
            color='steelblue', alpha=0.7)
axes[0].set_xlabel('JAWS')
axes[0].set_title('Average JAWS by Position (HOF Members)')
axes[0].grid(True, alpha=0.3, axis='x')

# Active players comparison
colors = ['green' if x >= 100 else 'orange' if x >= 75 else 'red'
          for x in active_comparison['JAWS_pct_of_standard']]
axes[1].barh(active_comparison['Name'],
            active_comparison['JAWS_pct_of_standard'],
            color=colors, alpha=0.7)
axes[1].axvline(x=100, color='black', linestyle='--', label='HOF Standard')
axes[1].set_xlabel('% of HOF Standard (JAWS)')
axes[1].set_title('Active Players vs Position Standards')
axes[1].legend()

plt.tight_layout()
plt.savefig('jaws_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

18.2 HOF Prediction Models

Feature Engineering for HOF Prediction

The Hall of Fame election depends on both statistical achievement and narrative factors. Effective prediction models combine:

Statistical Features:

Career counting stats (hits, home runs, RBI)

Rate statistics (batting average, OPS+)

Advanced metrics (WAR, JAWS)

Awards (MVP, All-Star selections)

Peak performance metrics

Contextual Features:

Position played

Era adjustments

Postseason performance

Team success

Media coverage and reputation

# R: HOF Prediction Model
library(randomForest)
library(caret)
library(pROC)

# Create HOF prediction dataset
create_hof_dataset <- function() {
  # Get batting data
  batting_career <- Batting %>%
    filter(yearID >= 1950) %>%
    group_by(playerID) %>%
    summarise(
      seasons = n(),
      games = sum(G),
      pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
      hits = sum(H, na.rm = TRUE),
      doubles = sum(`2B`, na.rm = TRUE),
      triples = sum(`3B`, na.rm = TRUE),
      hr = sum(HR, na.rm = TRUE),
      rbi = sum(RBI, na.rm = TRUE),
      sb = sum(SB, na.rm = TRUE),
      bb = sum(BB, na.rm = TRUE),
      so = sum(SO, na.rm = TRUE),
      avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
      obp = sum(H + BB + HBP, na.rm = TRUE) /
            sum(AB + BB + HBP + SF, na.rm = TRUE),
      slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
             2 * sum(`2B`, na.rm = TRUE) +
             3 * sum(`3B`, na.rm = TRUE) +
             4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
    ) %>%
    filter(pa >= 3000)

  # Add All-Star selections
  allstar_counts <- AllstarFull %>%
    group_by(playerID) %>%
    summarise(allstar_games = n())

  # Add Awards
  mvp_counts <- AwardsPlayers %>%
    filter(awardID == "MVP") %>%
    group_by(playerID) %>%
    summarise(mvp_awards = n())

  # HOF status
  hof_status <- HallOfFame %>%
    filter(inducted == "Y") %>%
    distinct(playerID, .keep_all = TRUE) %>%
    select(playerID, inducted)

  # Combine datasets
  hof_data <- batting_career %>%
    left_join(allstar_counts, by = "playerID") %>%
    left_join(mvp_counts, by = "playerID") %>%
    left_join(hof_status, by = "playerID") %>%
    mutate(
      allstar_games = ifelse(is.na(allstar_games), 0, allstar_games),
      mvp_awards = ifelse(is.na(mvp_awards), 0, mvp_awards),
      hof = ifelse(!is.na(inducted), 1, 0),
      ops = obp + slg,
      estimated_war = (ops - 0.710) * pa / 20,
      war_per_season = estimated_war / seasons
    ) %>%
    select(-playerID, -inducted) %>%
    filter(!is.na(avg), !is.na(obp), !is.na(slg))

  return(hof_data)
}

hof_data <- create_hof_dataset()

# Split data
set.seed(42)
train_index <- createDataPartition(hof_data$hof, p = 0.8, list = FALSE)
train_data <- hof_data[train_index, ]
test_data <- hof_data[-train_index, ]

# Train Random Forest model
rf_model <- randomForest(
  factor(hof) ~ seasons + games + hits + hr + rbi + sb +
    avg + obp + slg + ops + estimated_war +
    allstar_games + mvp_awards,
  data = train_data,
  ntree = 500,
  importance = TRUE
)

# Predictions
predictions <- predict(rf_model, test_data, type = "prob")[, 2]
pred_class <- ifelse(predictions > 0.5, 1, 0)

# Evaluate model
confusion_matrix <- confusionMatrix(
  factor(pred_class),
  factor(test_data$hof)
)

print("HOF Prediction Model Performance:")
print(confusion_matrix)

# Feature importance
importance_df <- data.frame(
  feature = rownames(importance(rf_model)),
  importance = importance(rf_model)[, "MeanDecreaseGini"]
) %>%
  arrange(desc(importance))

print("\nFeature Importance:")
print(head(importance_df, 10))

# ROC curve
roc_obj <- roc(test_data$hof, predictions)
auc_value <- auc(roc_obj)

plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc_value, 3), ")"),
     col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")

# Variable importance plot
varImpPlot(rf_model, main = "Feature Importance for HOF Prediction")

Logistic Regression Approach

# Python: Logistic Regression HOF Model
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Create synthetic HOF dataset (in practice, use real data)
np.random.seed(42)

def generate_hof_data(n_samples=500):
    """Generate synthetic HOF candidate data"""
    data = {
        'seasons': np.random.randint(10, 25, n_samples),
        'hits': np.random.randint(500, 3500, n_samples),
        'hr': np.random.randint(50, 750, n_samples),
        'rbi': np.random.randint(200, 2300, n_samples),
        'avg': np.random.uniform(0.240, 0.340, n_samples),
        'war': np.random.uniform(10, 150, n_samples),
        'allstar': np.random.randint(0, 15, n_samples),
        'mvp': np.random.randint(0, 3, n_samples),
        'gold_glove': np.random.randint(0, 12, n_samples)
    }

    df = pd.DataFrame(data)

    # Calculate derivative features
    df['avg_war_per_season'] = df['war'] / df['seasons']
    df['hr_per_season'] = df['hr'] / df['seasons']
    df['peak_indicator'] = (df['avg_war_per_season'] > 5).astype(int)

    # Generate HOF status based on features with noise
    hof_score = (
        0.3 * (df['war'] - df['war'].mean()) / df['war'].std() +
        0.2 * (df['hits'] - df['hits'].mean()) / df['hits'].std() +
        0.15 * (df['hr'] - df['hr'].mean()) / df['hr'].std() +
        0.15 * df['allstar'] / 15 +
        0.2 * df['mvp'] / 3 +
        np.random.normal(0, 0.3, n_samples)
    )

    df['hof'] = (hof_score > 0.5).astype(int)

    return df

# Generate data
hof_df = generate_hof_data(500)

print("Dataset Overview:")
print(hof_df.head(10))
print(f"\nHOF Inductees: {hof_df['hof'].sum()} ({hof_df['hof'].mean()*100:.1f}%)")

# Prepare features and target
feature_cols = ['seasons', 'hits', 'hr', 'rbi', 'avg', 'war',
                'allstar', 'mvp', 'gold_glove', 'avg_war_per_season',
                'hr_per_season', 'peak_indicator']
X = hof_df[feature_cols]
y = hof_df['hof']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(random_state=42, max_iter=1000, C=0.1)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("\nModel Performance:")
print(classification_report(y_test, y_pred, target_names=['Not HOF', 'HOF']))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Feature coefficients
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': log_reg.coef_[0]
}).sort_values('Coefficient', ascending=False)

print("\nFeature Coefficients (Standardized):")
print(coef_df.to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xlabel('Predicted')

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, 'b-', label=f'ROC (AUC = {auc_score:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Feature Importance
axes[1, 0].barh(coef_df['Feature'], coef_df['Coefficient'],
               color=['green' if x > 0 else 'red' for x in coef_df['Coefficient']])
axes[1, 0].set_xlabel('Coefficient')
axes[1, 0].set_title('Feature Importance')
axes[1, 0].axvline(x=0, color='black', linestyle='-', linewidth=0.5)

# Predicted Probability Distribution
axes[1, 1].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5,
               label='Not HOF', color='red')
axes[1, 1].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5,
               label='HOF', color='green')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Predicted Probability Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('hof_prediction_model.png', dpi=300, bbox_inches='tight')
plt.show()

# Cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-Validation AUC Scores: {cv_scores}")
print(f"Mean AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

# R: HOF Prediction Model
library(randomForest)
library(caret)
library(pROC)

# Create HOF prediction dataset
create_hof_dataset <- function() {
  # Get batting data
  batting_career <- Batting %>%
    filter(yearID >= 1950) %>%
    group_by(playerID) %>%
    summarise(
      seasons = n(),
      games = sum(G),
      pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
      hits = sum(H, na.rm = TRUE),
      doubles = sum(`2B`, na.rm = TRUE),
      triples = sum(`3B`, na.rm = TRUE),
      hr = sum(HR, na.rm = TRUE),
      rbi = sum(RBI, na.rm = TRUE),
      sb = sum(SB, na.rm = TRUE),
      bb = sum(BB, na.rm = TRUE),
      so = sum(SO, na.rm = TRUE),
      avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
      obp = sum(H + BB + HBP, na.rm = TRUE) /
            sum(AB + BB + HBP + SF, na.rm = TRUE),
      slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
             2 * sum(`2B`, na.rm = TRUE) +
             3 * sum(`3B`, na.rm = TRUE) +
             4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
    ) %>%
    filter(pa >= 3000)

  # Add All-Star selections
  allstar_counts <- AllstarFull %>%
    group_by(playerID) %>%
    summarise(allstar_games = n())

  # Add Awards
  mvp_counts <- AwardsPlayers %>%
    filter(awardID == "MVP") %>%
    group_by(playerID) %>%
    summarise(mvp_awards = n())

  # HOF status
  hof_status <- HallOfFame %>%
    filter(inducted == "Y") %>%
    distinct(playerID, .keep_all = TRUE) %>%
    select(playerID, inducted)

  # Combine datasets
  hof_data <- batting_career %>%
    left_join(allstar_counts, by = "playerID") %>%
    left_join(mvp_counts, by = "playerID") %>%
    left_join(hof_status, by = "playerID") %>%
    mutate(
      allstar_games = ifelse(is.na(allstar_games), 0, allstar_games),
      mvp_awards = ifelse(is.na(mvp_awards), 0, mvp_awards),
      hof = ifelse(!is.na(inducted), 1, 0),
      ops = obp + slg,
      estimated_war = (ops - 0.710) * pa / 20,
      war_per_season = estimated_war / seasons
    ) %>%
    select(-playerID, -inducted) %>%
    filter(!is.na(avg), !is.na(obp), !is.na(slg))

  return(hof_data)
}

hof_data <- create_hof_dataset()

# Split data
set.seed(42)
train_index <- createDataPartition(hof_data$hof, p = 0.8, list = FALSE)
train_data <- hof_data[train_index, ]
test_data <- hof_data[-train_index, ]

# Train Random Forest model
rf_model <- randomForest(
  factor(hof) ~ seasons + games + hits + hr + rbi + sb +
    avg + obp + slg + ops + estimated_war +
    allstar_games + mvp_awards,
  data = train_data,
  ntree = 500,
  importance = TRUE
)

# Predictions
predictions <- predict(rf_model, test_data, type = "prob")[, 2]
pred_class <- ifelse(predictions > 0.5, 1, 0)

# Evaluate model
confusion_matrix <- confusionMatrix(
  factor(pred_class),
  factor(test_data$hof)
)

print("HOF Prediction Model Performance:")
print(confusion_matrix)

# Feature importance
importance_df <- data.frame(
  feature = rownames(importance(rf_model)),
  importance = importance(rf_model)[, "MeanDecreaseGini"]
) %>%
  arrange(desc(importance))

print("\nFeature Importance:")
print(head(importance_df, 10))

# ROC curve
roc_obj <- roc(test_data$hof, predictions)
auc_value <- auc(roc_obj)

plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc_value, 3), ")"),
     col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")

# Variable importance plot
varImpPlot(rf_model, main = "Feature Importance for HOF Prediction")

Python

# Python: Logistic Regression HOF Model
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns

# Create synthetic HOF dataset (in practice, use real data)
np.random.seed(42)

def generate_hof_data(n_samples=500):
    """Generate synthetic HOF candidate data"""
    data = {
        'seasons': np.random.randint(10, 25, n_samples),
        'hits': np.random.randint(500, 3500, n_samples),
        'hr': np.random.randint(50, 750, n_samples),
        'rbi': np.random.randint(200, 2300, n_samples),
        'avg': np.random.uniform(0.240, 0.340, n_samples),
        'war': np.random.uniform(10, 150, n_samples),
        'allstar': np.random.randint(0, 15, n_samples),
        'mvp': np.random.randint(0, 3, n_samples),
        'gold_glove': np.random.randint(0, 12, n_samples)
    }

    df = pd.DataFrame(data)

    # Calculate derivative features
    df['avg_war_per_season'] = df['war'] / df['seasons']
    df['hr_per_season'] = df['hr'] / df['seasons']
    df['peak_indicator'] = (df['avg_war_per_season'] > 5).astype(int)

    # Generate HOF status based on features with noise
    hof_score = (
        0.3 * (df['war'] - df['war'].mean()) / df['war'].std() +
        0.2 * (df['hits'] - df['hits'].mean()) / df['hits'].std() +
        0.15 * (df['hr'] - df['hr'].mean()) / df['hr'].std() +
        0.15 * df['allstar'] / 15 +
        0.2 * df['mvp'] / 3 +
        np.random.normal(0, 0.3, n_samples)
    )

    df['hof'] = (hof_score > 0.5).astype(int)

    return df

# Generate data
hof_df = generate_hof_data(500)

print("Dataset Overview:")
print(hof_df.head(10))
print(f"\nHOF Inductees: {hof_df['hof'].sum()} ({hof_df['hof'].mean()*100:.1f}%)")

# Prepare features and target
feature_cols = ['seasons', 'hits', 'hr', 'rbi', 'avg', 'war',
                'allstar', 'mvp', 'gold_glove', 'avg_war_per_season',
                'hr_per_season', 'peak_indicator']
X = hof_df[feature_cols]
y = hof_df['hof']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(random_state=42, max_iter=1000, C=0.1)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("\nModel Performance:")
print(classification_report(y_test, y_pred, target_names=['Not HOF', 'HOF']))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Feature coefficients
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': log_reg.coef_[0]
}).sort_values('Coefficient', ascending=False)

print("\nFeature Coefficients (Standardized):")
print(coef_df.to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xlabel('Predicted')

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, 'b-', label=f'ROC (AUC = {auc_score:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Feature Importance
axes[1, 0].barh(coef_df['Feature'], coef_df['Coefficient'],
               color=['green' if x > 0 else 'red' for x in coef_df['Coefficient']])
axes[1, 0].set_xlabel('Coefficient')
axes[1, 0].set_title('Feature Importance')
axes[1, 0].axvline(x=0, color='black', linestyle='-', linewidth=0.5)

# Predicted Probability Distribution
axes[1, 1].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5,
               label='Not HOF', color='red')
axes[1, 1].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5,
               label='HOF', color='green')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Predicted Probability Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('hof_prediction_model.png', dpi=300, bbox_inches='tight')
plt.show()

# Cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-Validation AUC Scores: {cv_scores}")
print(f"Mean AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

18.3 Award Prediction

MVP Award Prediction

MVP voting is influenced by both performance and narrative. Key predictive factors include:

WAR: Single best predictor
Team Performance: Playoff teams heavily favored
Counting Stats: HR, RBI still matter to voters
Position: Premium positions get bonus consideration
Media Narrative: Difficult to quantify but impactful

# R: MVP Award Prediction
library(nnet)
library(MASS)

# Create MVP prediction dataset (simplified example)
set.seed(123)
mvp_data <- data.frame(
  player = paste("Player", 1:200),
  war = rnorm(200, mean = 5, sd = 2),
  hr = rpois(200, lambda = 25),
  rbi = rpois(200, lambda = 90),
  avg = rnorm(200, mean = 0.275, sd = 0.030),
  team_wins = rpois(200, lambda = 81),
  playoff_team = rbinom(200, 1, 0.33)
) %>%
  mutate(
    war = pmax(0, war),
    avg = pmin(0.350, pmax(0.200, avg)),
    # MVP probability influenced by multiple factors
    mvp_score = 0.35 * (war / 10) +
                0.20 * (hr / 50) +
                0.15 * (rbi / 120) +
                0.10 * ((avg - 0.250) / 0.050) +
                0.20 * playoff_team +
                rnorm(200, 0, 0.15)
  ) %>%
  arrange(desc(mvp_score)) %>%
  mutate(
    mvp_rank = row_number(),
    mvp_top3 = ifelse(mvp_rank <= 3, 1, 0),
    mvp_winner = ifelse(mvp_rank == 1, 1, 0)
  )

# Multinomial logistic regression for MVP voting
mvp_data$mvp_category <- cut(mvp_data$mvp_rank,
                              breaks = c(0, 1, 3, 10, Inf),
                              labels = c("Winner", "Top3", "Top10", "Other"))

mvp_model <- multinom(
  mvp_category ~ war + hr + rbi + avg + team_wins + playoff_team,
  data = mvp_data
)

summary(mvp_model)

# Predict MVP likelihood
mvp_predictions <- predict(mvp_model, mvp_data, type = "probs")
mvp_data$pred_winner_prob <- mvp_predictions[, "Winner"]

# Top MVP candidates
top_mvp_candidates <- mvp_data %>%
  arrange(desc(pred_winner_prob)) %>%
  head(10) %>%
  select(player, war, hr, rbi, avg, team_wins, playoff_team,
         mvp_rank, pred_winner_prob)

print("Top MVP Candidates:")
print(top_mvp_candidates)

# Feature impact analysis
mvp_linear <- lm(mvp_score ~ war + hr + rbi + avg + playoff_team,
                 data = mvp_data)

print("\nFeature Impact on MVP Score:")
print(summary(mvp_linear)$coefficients)

# Visualization
par(mfrow = c(2, 2))

# WAR vs MVP Score
plot(mvp_data$war, mvp_data$mvp_score,
     xlab = "WAR", ylab = "MVP Score",
     main = "WAR vs MVP Score",
     col = ifelse(mvp_data$playoff_team == 1, "darkgreen", "gray"),
     pch = 19)
legend("topleft", legend = c("Playoff Team", "Non-Playoff"),
       col = c("darkgreen", "gray"), pch = 19)

# Home Runs vs MVP Score
plot(mvp_data$hr, mvp_data$mvp_score,
     xlab = "Home Runs", ylab = "MVP Score",
     main = "Home Runs vs MVP Score",
     col = "steelblue", pch = 19)

# Team Wins distribution
boxplot(team_wins ~ mvp_category, data = mvp_data,
        xlab = "MVP Category", ylab = "Team Wins",
        main = "Team Wins by MVP Category",
        col = "lightblue")

# Predicted probability distribution
hist(mvp_data$pred_winner_prob, breaks = 30,
     xlab = "Predicted MVP Winner Probability",
     main = "Distribution of MVP Probabilities",
     col = "steelblue")

Cy Young Award Prediction

# Python: Cy Young Prediction for Pitchers
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Generate synthetic Cy Young candidate data
np.random.seed(42)

def generate_cy_young_data(n_samples=300):
    """Generate synthetic pitcher data for Cy Young prediction"""
    data = {
        'wins': np.random.randint(5, 25, n_samples),
        'losses': np.random.randint(2, 15, n_samples),
        'era': np.random.uniform(2.0, 5.5, n_samples),
        'innings': np.random.uniform(100, 250, n_samples),
        'strikeouts': np.random.randint(100, 350, n_samples),
        'whip': np.random.uniform(0.90, 1.60, n_samples),
        'war': np.random.uniform(1, 10, n_samples),
        'saves': np.random.randint(0, 50, n_samples),
        'team_wins': np.random.randint(60, 105, n_samples)
    }

    df = pd.DataFrame(data)

    # Calculate derivative features
    df['win_pct'] = df['wins'] / (df['wins'] + df['losses'])
    df['k_per_9'] = (df['strikeouts'] / df['innings']) * 9
    df['k_per_bb'] = df['strikeouts'] / (df['strikeouts'] * 0.3)  # Estimate BB
    df['is_closer'] = (df['saves'] > 20).astype(int)

    # Generate Cy Young winner based on features
    cy_score = (
        0.25 * (10 - df['era']) / 3 +  # Lower ERA better
        0.25 * df['war'] / 10 +
        0.15 * df['wins'] / 25 +
        0.15 * df['strikeouts'] / 350 +
        0.10 * (1.5 - df['whip']) / 0.6 +  # Lower WHIP better
        0.10 * (df['team_wins'] - 60) / 45 +
        np.random.normal(0, 0.2, n_samples)
    )

    df['cy_young_winner'] = (cy_score > np.percentile(cy_score, 97)).astype(int)

    return df

# Generate data
cy_young_df = generate_cy_young_data(300)

print("Cy Young Dataset Overview:")
print(cy_young_df.describe())
print(f"\nCy Young Winners: {cy_young_df['cy_young_winner'].sum()}")

# Prepare features
feature_cols = ['wins', 'era', 'innings', 'strikeouts', 'whip', 'war',
                'saves', 'team_wins', 'win_pct', 'k_per_9', 'is_closer']
X = cy_young_df[feature_cols]
y = cy_young_df['cy_young_winner']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_model.fit(X_train_scaled, y_train)

# Predictions
y_pred = gb_model.predict(X_test_scaled)
y_pred_proba = gb_model.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("\nCy Young Prediction Model Performance:")
print(classification_report(y_test, y_pred, target_names=['Not Winner', 'Winner']))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance.to_string(index=False))

# Predict on all data for ranking
all_predictions = gb_model.predict_proba(scaler.transform(X))[:, 1]
cy_young_df['cy_young_prob'] = all_predictions
cy_young_df['cy_young_rank'] = cy_young_df['cy_young_prob'].rank(ascending=False)

# Top candidates
top_candidates = cy_young_df.nsmallest(10, 'cy_young_rank')[
    ['wins', 'era', 'strikeouts', 'war', 'cy_young_prob', 'cy_young_winner']
]

print("\nTop 10 Cy Young Candidates:")
print(top_candidates.to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Feature Importance
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Importance'],
               color='steelblue')
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Feature Importance for Cy Young Prediction')

# ERA vs WAR with winner highlighting
scatter = axes[0, 1].scatter(cy_young_df['era'], cy_young_df['war'],
                            c=cy_young_df['cy_young_prob'],
                            cmap='RdYlGn', s=50, alpha=0.6)
axes[0, 1].set_xlabel('ERA')
axes[0, 1].set_ylabel('WAR')
axes[0, 1].set_title('ERA vs WAR (colored by Cy Young probability)')
plt.colorbar(scatter, ax=axes[0, 1])
axes[0, 1].invert_xaxis()  # Lower ERA is better

# Wins distribution
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==0]['wins'],
               bins=15, alpha=0.5, label='Not Winner', color='gray')
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==1]['wins'],
               bins=15, alpha=0.5, label='Winner', color='gold')
axes[1, 0].set_xlabel('Wins')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Win Distribution by Cy Young Status')
axes[1, 0].legend()

# Predicted probability vs actual
axes[1, 1].scatter(range(len(y_test)), y_pred_proba,
                  c=y_test, cmap='RdYlGn', s=50, alpha=0.6)
axes[1, 1].axhline(y=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Sample Index')
axes[1, 1].set_ylabel('Predicted Probability')
axes[1, 1].set_title('Predicted Cy Young Probabilities (Test Set)')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('cy_young_prediction.png', dpi=300, bbox_inches='tight')
plt.show()

# R: MVP Award Prediction
library(nnet)
library(MASS)

# Create MVP prediction dataset (simplified example)
set.seed(123)
mvp_data <- data.frame(
  player = paste("Player", 1:200),
  war = rnorm(200, mean = 5, sd = 2),
  hr = rpois(200, lambda = 25),
  rbi = rpois(200, lambda = 90),
  avg = rnorm(200, mean = 0.275, sd = 0.030),
  team_wins = rpois(200, lambda = 81),
  playoff_team = rbinom(200, 1, 0.33)
) %>%
  mutate(
    war = pmax(0, war),
    avg = pmin(0.350, pmax(0.200, avg)),
    # MVP probability influenced by multiple factors
    mvp_score = 0.35 * (war / 10) +
                0.20 * (hr / 50) +
                0.15 * (rbi / 120) +
                0.10 * ((avg - 0.250) / 0.050) +
                0.20 * playoff_team +
                rnorm(200, 0, 0.15)
  ) %>%
  arrange(desc(mvp_score)) %>%
  mutate(
    mvp_rank = row_number(),
    mvp_top3 = ifelse(mvp_rank <= 3, 1, 0),
    mvp_winner = ifelse(mvp_rank == 1, 1, 0)
  )

# Multinomial logistic regression for MVP voting
mvp_data$mvp_category <- cut(mvp_data$mvp_rank,
                              breaks = c(0, 1, 3, 10, Inf),
                              labels = c("Winner", "Top3", "Top10", "Other"))

mvp_model <- multinom(
  mvp_category ~ war + hr + rbi + avg + team_wins + playoff_team,
  data = mvp_data
)

summary(mvp_model)

# Predict MVP likelihood
mvp_predictions <- predict(mvp_model, mvp_data, type = "probs")
mvp_data$pred_winner_prob <- mvp_predictions[, "Winner"]

# Top MVP candidates
top_mvp_candidates <- mvp_data %>%
  arrange(desc(pred_winner_prob)) %>%
  head(10) %>%
  select(player, war, hr, rbi, avg, team_wins, playoff_team,
         mvp_rank, pred_winner_prob)

print("Top MVP Candidates:")
print(top_mvp_candidates)

# Feature impact analysis
mvp_linear <- lm(mvp_score ~ war + hr + rbi + avg + playoff_team,
                 data = mvp_data)

print("\nFeature Impact on MVP Score:")
print(summary(mvp_linear)$coefficients)

# Visualization
par(mfrow = c(2, 2))

# WAR vs MVP Score
plot(mvp_data$war, mvp_data$mvp_score,
     xlab = "WAR", ylab = "MVP Score",
     main = "WAR vs MVP Score",
     col = ifelse(mvp_data$playoff_team == 1, "darkgreen", "gray"),
     pch = 19)
legend("topleft", legend = c("Playoff Team", "Non-Playoff"),
       col = c("darkgreen", "gray"), pch = 19)

# Home Runs vs MVP Score
plot(mvp_data$hr, mvp_data$mvp_score,
     xlab = "Home Runs", ylab = "MVP Score",
     main = "Home Runs vs MVP Score",
     col = "steelblue", pch = 19)

# Team Wins distribution
boxplot(team_wins ~ mvp_category, data = mvp_data,
        xlab = "MVP Category", ylab = "Team Wins",
        main = "Team Wins by MVP Category",
        col = "lightblue")

# Predicted probability distribution
hist(mvp_data$pred_winner_prob, breaks = 30,
     xlab = "Predicted MVP Winner Probability",
     main = "Distribution of MVP Probabilities",
     col = "steelblue")

Python

# Python: Cy Young Prediction for Pitchers
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns

# Generate synthetic Cy Young candidate data
np.random.seed(42)

def generate_cy_young_data(n_samples=300):
    """Generate synthetic pitcher data for Cy Young prediction"""
    data = {
        'wins': np.random.randint(5, 25, n_samples),
        'losses': np.random.randint(2, 15, n_samples),
        'era': np.random.uniform(2.0, 5.5, n_samples),
        'innings': np.random.uniform(100, 250, n_samples),
        'strikeouts': np.random.randint(100, 350, n_samples),
        'whip': np.random.uniform(0.90, 1.60, n_samples),
        'war': np.random.uniform(1, 10, n_samples),
        'saves': np.random.randint(0, 50, n_samples),
        'team_wins': np.random.randint(60, 105, n_samples)
    }

    df = pd.DataFrame(data)

    # Calculate derivative features
    df['win_pct'] = df['wins'] / (df['wins'] + df['losses'])
    df['k_per_9'] = (df['strikeouts'] / df['innings']) * 9
    df['k_per_bb'] = df['strikeouts'] / (df['strikeouts'] * 0.3)  # Estimate BB
    df['is_closer'] = (df['saves'] > 20).astype(int)

    # Generate Cy Young winner based on features
    cy_score = (
        0.25 * (10 - df['era']) / 3 +  # Lower ERA better
        0.25 * df['war'] / 10 +
        0.15 * df['wins'] / 25 +
        0.15 * df['strikeouts'] / 350 +
        0.10 * (1.5 - df['whip']) / 0.6 +  # Lower WHIP better
        0.10 * (df['team_wins'] - 60) / 45 +
        np.random.normal(0, 0.2, n_samples)
    )

    df['cy_young_winner'] = (cy_score > np.percentile(cy_score, 97)).astype(int)

    return df

# Generate data
cy_young_df = generate_cy_young_data(300)

print("Cy Young Dataset Overview:")
print(cy_young_df.describe())
print(f"\nCy Young Winners: {cy_young_df['cy_young_winner'].sum()}")

# Prepare features
feature_cols = ['wins', 'era', 'innings', 'strikeouts', 'whip', 'war',
                'saves', 'team_wins', 'win_pct', 'k_per_9', 'is_closer']
X = cy_young_df[feature_cols]
y = cy_young_df['cy_young_winner']

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_model.fit(X_train_scaled, y_train)

# Predictions
y_pred = gb_model.predict(X_test_scaled)
y_pred_proba = gb_model.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("\nCy Young Prediction Model Performance:")
print(classification_report(y_test, y_pred, target_names=['Not Winner', 'Winner']))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(feature_importance.to_string(index=False))

# Predict on all data for ranking
all_predictions = gb_model.predict_proba(scaler.transform(X))[:, 1]
cy_young_df['cy_young_prob'] = all_predictions
cy_young_df['cy_young_rank'] = cy_young_df['cy_young_prob'].rank(ascending=False)

# Top candidates
top_candidates = cy_young_df.nsmallest(10, 'cy_young_rank')[
    ['wins', 'era', 'strikeouts', 'war', 'cy_young_prob', 'cy_young_winner']
]

print("\nTop 10 Cy Young Candidates:")
print(top_candidates.to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Feature Importance
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Importance'],
               color='steelblue')
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Feature Importance for Cy Young Prediction')

# ERA vs WAR with winner highlighting
scatter = axes[0, 1].scatter(cy_young_df['era'], cy_young_df['war'],
                            c=cy_young_df['cy_young_prob'],
                            cmap='RdYlGn', s=50, alpha=0.6)
axes[0, 1].set_xlabel('ERA')
axes[0, 1].set_ylabel('WAR')
axes[0, 1].set_title('ERA vs WAR (colored by Cy Young probability)')
plt.colorbar(scatter, ax=axes[0, 1])
axes[0, 1].invert_xaxis()  # Lower ERA is better

# Wins distribution
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==0]['wins'],
               bins=15, alpha=0.5, label='Not Winner', color='gray')
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==1]['wins'],
               bins=15, alpha=0.5, label='Winner', color='gold')
axes[1, 0].set_xlabel('Wins')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Win Distribution by Cy Young Status')
axes[1, 0].legend()

# Predicted probability vs actual
axes[1, 1].scatter(range(len(y_test)), y_pred_proba,
                  c=y_test, cmap='RdYlGn', s=50, alpha=0.6)
axes[1, 1].axhline(y=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Sample Index')
axes[1, 1].set_ylabel('Predicted Probability')
axes[1, 1].set_title('Predicted Cy Young Probabilities (Test Set)')
axes[1, 1].legend()

plt.tight_layout()
plt.savefig('cy_young_prediction.png', dpi=300, bbox_inches='tight')
plt.show()

18.4 Aging Curves

Aging curves reveal how player performance changes over time, helping teams make personnel decisions and predict future performance.

Delta Method for Aging Curves

The delta method tracks year-over-year changes for the same players, controlling for selection bias.

# R: Aging Curves Analysis
library(mgcv)  # For smoothing

# Create aging curve dataset
create_aging_data <- function() {
  batting_yearly <- Batting %>%
    filter(yearID >= 1980, AB >= 300) %>%
    left_join(People %>% select(playerID, birthYear, birthMonth),
              by = "playerID") %>%
    mutate(
      age = yearID - birthYear -
            ifelse(is.na(birthMonth) | birthMonth > 6, 1, 0),
      pa = AB + BB + HBP + SF + SH,
      obp = (H + BB + HBP) / (AB + BB + HBP + SF),
      slg = (H - `2B` - `3B` - HR + 2*`2B` + 3*`3B` + 4*HR) / AB,
      ops = obp + slg,
      iso = slg - (H / AB)
    ) %>%
    filter(age >= 20, age <= 40, !is.na(ops)) %>%
    select(playerID, yearID, age, pa, obp, slg, ops, iso, HR, SB)

  return(batting_yearly)
}

aging_data <- create_aging_data()

# Calculate average performance by age
age_performance <- aging_data %>%
  group_by(age) %>%
  summarise(
    n_players = n(),
    avg_ops = mean(ops, na.rm = TRUE),
    avg_iso = mean(iso, na.rm = TRUE),
    avg_hr = mean(HR, na.rm = TRUE),
    avg_sb = mean(SB, na.rm = TRUE),
    se_ops = sd(ops, na.rm = TRUE) / sqrt(n_players)
  )

print("Performance by Age:")
print(age_performance)

# Fit smooth aging curve using GAM
gam_model <- gam(ops ~ s(age, bs = "cs"), data = aging_data)

# Predict smooth curve
age_range <- data.frame(age = 20:40)
age_range$predicted_ops <- predict(gam_model, age_range)

# Find peak age
peak_age <- age_range$age[which.max(age_range$predicted_ops)]
peak_ops <- max(age_range$predicted_ops)

print(paste("Peak performance age:", peak_age))
print(paste("Peak OPS:", round(peak_ops, 3)))

# Visualize aging curve
ggplot() +
  geom_point(data = age_performance, aes(x = age, y = avg_ops),
             size = 2, alpha = 0.6) +
  geom_errorbar(data = age_performance,
                aes(x = age, ymin = avg_ops - se_ops, ymax = avg_ops + se_ops),
                width = 0.3, alpha = 0.4) +
  geom_line(data = age_range, aes(x = age, y = predicted_ops),
            color = "red", size = 1.2) +
  geom_vline(xintercept = peak_age, linetype = "dashed", color = "blue") +
  annotate("text", x = peak_age + 1, y = min(age_performance$avg_ops),
           label = paste("Peak:", peak_age), color = "blue") +
  labs(title = "Baseball Aging Curve (OPS)",
       subtitle = "Smoothed curve shows typical performance trajectory",
       x = "Age",
       y = "OPS") +
  theme_minimal()

# Delta method: year-over-year change
delta_data <- aging_data %>%
  arrange(playerID, yearID) %>%
  group_by(playerID) %>%
  mutate(
    next_age = lead(age),
    next_ops = lead(ops),
    age_diff = next_age - age,
    ops_change = next_ops - ops
  ) %>%
  filter(age_diff == 1) %>%  # Only consecutive years
  ungroup()

# Average delta by age
age_delta <- delta_data %>%
  group_by(age) %>%
  summarise(
    n = n(),
    avg_delta = mean(ops_change, na.rm = TRUE),
    se_delta = sd(ops_change, na.rm = TRUE) / sqrt(n)
  )

print("\nYear-over-Year Change in OPS:")
print(age_delta)

# Plot delta curve
ggplot(age_delta, aes(x = age, y = avg_delta)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  geom_line(color = "darkred", size = 1.2) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = avg_delta - se_delta, ymax = avg_delta + se_delta),
                width = 0.3, alpha = 0.5) +
  labs(title = "Year-over-Year Change in OPS by Age",
       subtitle = "Delta method controls for selection bias",
       x = "Age",
       y = "Change in OPS from Previous Year") +
  theme_minimal()

Position-Specific and Skill-Specific Aging

# Python: Skill-Specific Aging Curves
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import UnivariateSpline

# Generate synthetic career trajectory data
np.random.seed(42)

def generate_career_data(n_players=200):
    """Generate synthetic career data with realistic aging patterns"""
    careers = []

    for player_id in range(n_players):
        # Random career span
        debut_age = np.random.randint(20, 25)
        career_length = np.random.randint(8, 18)

        # Player quality (talent level)
        talent = np.random.normal(100, 15)

        for year_in_career in range(career_length):
            age = debut_age + year_in_career

            # Different aging patterns for different skills
            # Power peaks later and declines faster
            power_modifier = 1.0 + (age - 27) * 0.02 if age < 27 else 1.0 - (age - 27) * 0.04
            power = max(0, talent * 0.8 * power_modifier + np.random.normal(0, 5))

            # Speed declines earlier and more gradually
            speed_modifier = 1.0 + (age - 25) * 0.01 if age < 25 else 1.0 - (age - 25) * 0.06
            speed = max(0, talent * 1.2 * speed_modifier + np.random.normal(0, 5))

            # Contact skills more stable
            contact_modifier = 1.0 + (age - 29) * 0.01 if age < 29 else 1.0 - (age - 29) * 0.02
            contact = max(0, talent * contact_modifier + np.random.normal(0, 5))

            careers.append({
                'player_id': player_id,
                'age': age,
                'power': power,
                'speed': speed,
                'contact': contact,
                'overall': (power + speed + contact) / 3
            })

    return pd.DataFrame(careers)

# Generate data
career_df = generate_career_data(200)

print("Career Data Sample:")
print(career_df.head(20))

# Calculate average by age for each skill
aging_curves = career_df.groupby('age').agg({
    'power': ['mean', 'std', 'count'],
    'speed': ['mean', 'std', 'count'],
    'contact': ['mean', 'std', 'count'],
    'overall': ['mean', 'std', 'count']
}).reset_index()

aging_curves.columns = ['age', 'power_mean', 'power_std', 'power_n',
                        'speed_mean', 'speed_std', 'speed_n',
                        'contact_mean', 'contact_std', 'contact_n',
                        'overall_mean', 'overall_std', 'overall_n']

print("\nAging Curves by Skill:")
print(aging_curves)

# Smooth curves using splines
ages = aging_curves['age'].values
skills = ['power', 'speed', 'contact', 'overall']
smoothed = {}

for skill in skills:
    values = aging_curves[f'{skill}_mean'].values
    spline = UnivariateSpline(ages, values, s=50)
    smoothed[skill] = spline(ages)

    # Find peak
    peak_idx = np.argmax(smoothed[skill])
    peak_age = ages[peak_idx]
    print(f"{skill.capitalize()} peaks at age {peak_age}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Skill-specific aging curves
ax = axes[0, 0]
for skill in ['power', 'speed', 'contact']:
    ax.plot(ages, smoothed[skill], marker='o', label=skill.capitalize(), linewidth=2)
ax.set_xlabel('Age')
ax.set_ylabel('Skill Level')
ax.set_title('Skill-Specific Aging Curves')
ax.legend()
ax.grid(True, alpha=0.3)

# Decline rates by age
ax = axes[0, 1]
for skill in ['power', 'speed', 'contact']:
    values = smoothed[skill]
    peak_value = np.max(values)
    decline_pct = ((values - peak_value) / peak_value) * 100
    ax.plot(ages, decline_pct, marker='o', label=skill.capitalize(), linewidth=2)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('% Change from Peak')
ax.set_title('Performance Decline from Peak')
ax.legend()
ax.grid(True, alpha=0.3)

# Distribution by age group
ax = axes[1, 0]
age_groups = pd.cut(career_df['age'], bins=[19, 25, 29, 33, 41],
                    labels=['20-25', '26-29', '30-33', '34+'])
career_df['age_group'] = age_groups

age_group_means = career_df.groupby('age_group')[['power', 'speed', 'contact']].mean()
age_group_means.plot(kind='bar', ax=ax, color=['#e74c3c', '#3498db', '#2ecc71'])
ax.set_xlabel('Age Group')
ax.set_ylabel('Skill Level')
ax.set_title('Average Skills by Age Group')
ax.legend(title='Skill')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

# Year-over-year change
ax = axes[1, 1]
yoy_changes = career_df.sort_values(['player_id', 'age']).groupby('player_id').apply(
    lambda x: x[['age', 'overall']].assign(
        overall_change=x['overall'].diff()
    )
).reset_index(drop=True)

yoy_by_age = yoy_changes.dropna().groupby('age')['overall_change'].agg(['mean', 'std'])
ax.plot(yoy_by_age.index, yoy_by_age['mean'], 'o-', color='darkred', linewidth=2)
ax.fill_between(yoy_by_age.index,
                yoy_by_age['mean'] - yoy_by_age['std'],
                yoy_by_age['mean'] + yoy_by_age['std'],
                alpha=0.3, color='red')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('Year-over-Year Change')
ax.set_title('Annual Performance Change')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('skill_aging_curves.png', dpi=300, bbox_inches='tight')
plt.show()

# Statistical summary of decline rates
print("\n=== Decline Rate Analysis ===")
for skill in ['power', 'speed', 'contact']:
    peak_idx = np.argmax(smoothed[skill])
    peak_age = ages[peak_idx]
    peak_value = smoothed[skill][peak_idx]

    # Age 35 value (typical late career)
    age_35_idx = np.where(ages == 35)[0]
    if len(age_35_idx) > 0:
        age_35_value = smoothed[skill][age_35_idx[0]]
        decline = ((age_35_value - peak_value) / peak_value) * 100
        print(f"{skill.capitalize()}: Peak at {peak_age}, decline to age 35: {decline:.1f}%")

# R: Aging Curves Analysis
library(mgcv)  # For smoothing

# Create aging curve dataset
create_aging_data <- function() {
  batting_yearly <- Batting %>%
    filter(yearID >= 1980, AB >= 300) %>%
    left_join(People %>% select(playerID, birthYear, birthMonth),
              by = "playerID") %>%
    mutate(
      age = yearID - birthYear -
            ifelse(is.na(birthMonth) | birthMonth > 6, 1, 0),
      pa = AB + BB + HBP + SF + SH,
      obp = (H + BB + HBP) / (AB + BB + HBP + SF),
      slg = (H - `2B` - `3B` - HR + 2*`2B` + 3*`3B` + 4*HR) / AB,
      ops = obp + slg,
      iso = slg - (H / AB)
    ) %>%
    filter(age >= 20, age <= 40, !is.na(ops)) %>%
    select(playerID, yearID, age, pa, obp, slg, ops, iso, HR, SB)

  return(batting_yearly)
}

aging_data <- create_aging_data()

# Calculate average performance by age
age_performance <- aging_data %>%
  group_by(age) %>%
  summarise(
    n_players = n(),
    avg_ops = mean(ops, na.rm = TRUE),
    avg_iso = mean(iso, na.rm = TRUE),
    avg_hr = mean(HR, na.rm = TRUE),
    avg_sb = mean(SB, na.rm = TRUE),
    se_ops = sd(ops, na.rm = TRUE) / sqrt(n_players)
  )

print("Performance by Age:")
print(age_performance)

# Fit smooth aging curve using GAM
gam_model <- gam(ops ~ s(age, bs = "cs"), data = aging_data)

# Predict smooth curve
age_range <- data.frame(age = 20:40)
age_range$predicted_ops <- predict(gam_model, age_range)

# Find peak age
peak_age <- age_range$age[which.max(age_range$predicted_ops)]
peak_ops <- max(age_range$predicted_ops)

print(paste("Peak performance age:", peak_age))
print(paste("Peak OPS:", round(peak_ops, 3)))

# Visualize aging curve
ggplot() +
  geom_point(data = age_performance, aes(x = age, y = avg_ops),
             size = 2, alpha = 0.6) +
  geom_errorbar(data = age_performance,
                aes(x = age, ymin = avg_ops - se_ops, ymax = avg_ops + se_ops),
                width = 0.3, alpha = 0.4) +
  geom_line(data = age_range, aes(x = age, y = predicted_ops),
            color = "red", size = 1.2) +
  geom_vline(xintercept = peak_age, linetype = "dashed", color = "blue") +
  annotate("text", x = peak_age + 1, y = min(age_performance$avg_ops),
           label = paste("Peak:", peak_age), color = "blue") +
  labs(title = "Baseball Aging Curve (OPS)",
       subtitle = "Smoothed curve shows typical performance trajectory",
       x = "Age",
       y = "OPS") +
  theme_minimal()

# Delta method: year-over-year change
delta_data <- aging_data %>%
  arrange(playerID, yearID) %>%
  group_by(playerID) %>%
  mutate(
    next_age = lead(age),
    next_ops = lead(ops),
    age_diff = next_age - age,
    ops_change = next_ops - ops
  ) %>%
  filter(age_diff == 1) %>%  # Only consecutive years
  ungroup()

# Average delta by age
age_delta <- delta_data %>%
  group_by(age) %>%
  summarise(
    n = n(),
    avg_delta = mean(ops_change, na.rm = TRUE),
    se_delta = sd(ops_change, na.rm = TRUE) / sqrt(n)
  )

print("\nYear-over-Year Change in OPS:")
print(age_delta)

# Plot delta curve
ggplot(age_delta, aes(x = age, y = avg_delta)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  geom_line(color = "darkred", size = 1.2) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = avg_delta - se_delta, ymax = avg_delta + se_delta),
                width = 0.3, alpha = 0.5) +
  labs(title = "Year-over-Year Change in OPS by Age",
       subtitle = "Delta method controls for selection bias",
       x = "Age",
       y = "Change in OPS from Previous Year") +
  theme_minimal()

Python

# Python: Skill-Specific Aging Curves
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import UnivariateSpline

# Generate synthetic career trajectory data
np.random.seed(42)

def generate_career_data(n_players=200):
    """Generate synthetic career data with realistic aging patterns"""
    careers = []

    for player_id in range(n_players):
        # Random career span
        debut_age = np.random.randint(20, 25)
        career_length = np.random.randint(8, 18)

        # Player quality (talent level)
        talent = np.random.normal(100, 15)

        for year_in_career in range(career_length):
            age = debut_age + year_in_career

            # Different aging patterns for different skills
            # Power peaks later and declines faster
            power_modifier = 1.0 + (age - 27) * 0.02 if age < 27 else 1.0 - (age - 27) * 0.04
            power = max(0, talent * 0.8 * power_modifier + np.random.normal(0, 5))

            # Speed declines earlier and more gradually
            speed_modifier = 1.0 + (age - 25) * 0.01 if age < 25 else 1.0 - (age - 25) * 0.06
            speed = max(0, talent * 1.2 * speed_modifier + np.random.normal(0, 5))

            # Contact skills more stable
            contact_modifier = 1.0 + (age - 29) * 0.01 if age < 29 else 1.0 - (age - 29) * 0.02
            contact = max(0, talent * contact_modifier + np.random.normal(0, 5))

            careers.append({
                'player_id': player_id,
                'age': age,
                'power': power,
                'speed': speed,
                'contact': contact,
                'overall': (power + speed + contact) / 3
            })

    return pd.DataFrame(careers)

# Generate data
career_df = generate_career_data(200)

print("Career Data Sample:")
print(career_df.head(20))

# Calculate average by age for each skill
aging_curves = career_df.groupby('age').agg({
    'power': ['mean', 'std', 'count'],
    'speed': ['mean', 'std', 'count'],
    'contact': ['mean', 'std', 'count'],
    'overall': ['mean', 'std', 'count']
}).reset_index()

aging_curves.columns = ['age', 'power_mean', 'power_std', 'power_n',
                        'speed_mean', 'speed_std', 'speed_n',
                        'contact_mean', 'contact_std', 'contact_n',
                        'overall_mean', 'overall_std', 'overall_n']

print("\nAging Curves by Skill:")
print(aging_curves)

# Smooth curves using splines
ages = aging_curves['age'].values
skills = ['power', 'speed', 'contact', 'overall']
smoothed = {}

for skill in skills:
    values = aging_curves[f'{skill}_mean'].values
    spline = UnivariateSpline(ages, values, s=50)
    smoothed[skill] = spline(ages)

    # Find peak
    peak_idx = np.argmax(smoothed[skill])
    peak_age = ages[peak_idx]
    print(f"{skill.capitalize()} peaks at age {peak_age}")

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Skill-specific aging curves
ax = axes[0, 0]
for skill in ['power', 'speed', 'contact']:
    ax.plot(ages, smoothed[skill], marker='o', label=skill.capitalize(), linewidth=2)
ax.set_xlabel('Age')
ax.set_ylabel('Skill Level')
ax.set_title('Skill-Specific Aging Curves')
ax.legend()
ax.grid(True, alpha=0.3)

# Decline rates by age
ax = axes[0, 1]
for skill in ['power', 'speed', 'contact']:
    values = smoothed[skill]
    peak_value = np.max(values)
    decline_pct = ((values - peak_value) / peak_value) * 100
    ax.plot(ages, decline_pct, marker='o', label=skill.capitalize(), linewidth=2)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('% Change from Peak')
ax.set_title('Performance Decline from Peak')
ax.legend()
ax.grid(True, alpha=0.3)

# Distribution by age group
ax = axes[1, 0]
age_groups = pd.cut(career_df['age'], bins=[19, 25, 29, 33, 41],
                    labels=['20-25', '26-29', '30-33', '34+'])
career_df['age_group'] = age_groups

age_group_means = career_df.groupby('age_group')[['power', 'speed', 'contact']].mean()
age_group_means.plot(kind='bar', ax=ax, color=['#e74c3c', '#3498db', '#2ecc71'])
ax.set_xlabel('Age Group')
ax.set_ylabel('Skill Level')
ax.set_title('Average Skills by Age Group')
ax.legend(title='Skill')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

# Year-over-year change
ax = axes[1, 1]
yoy_changes = career_df.sort_values(['player_id', 'age']).groupby('player_id').apply(
    lambda x: x[['age', 'overall']].assign(
        overall_change=x['overall'].diff()
    )
).reset_index(drop=True)

yoy_by_age = yoy_changes.dropna().groupby('age')['overall_change'].agg(['mean', 'std'])
ax.plot(yoy_by_age.index, yoy_by_age['mean'], 'o-', color='darkred', linewidth=2)
ax.fill_between(yoy_by_age.index,
                yoy_by_age['mean'] - yoy_by_age['std'],
                yoy_by_age['mean'] + yoy_by_age['std'],
                alpha=0.3, color='red')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('Year-over-Year Change')
ax.set_title('Annual Performance Change')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('skill_aging_curves.png', dpi=300, bbox_inches='tight')
plt.show()

# Statistical summary of decline rates
print("\n=== Decline Rate Analysis ===")
for skill in ['power', 'speed', 'contact']:
    peak_idx = np.argmax(smoothed[skill])
    peak_age = ages[peak_idx]
    peak_value = smoothed[skill][peak_idx]

    # Age 35 value (typical late career)
    age_35_idx = np.where(ages == 35)[0]
    if len(age_35_idx) > 0:
        age_35_value = smoothed[skill][age_35_idx[0]]
        decline = ((age_35_value - peak_value) / peak_value) * 100
        print(f"{skill.capitalize()}: Peak at {peak_age}, decline to age 35: {decline:.1f}%")

18.5 What-If Analysis

What-if analysis explores hypothetical scenarios: career-ending injuries, steroid era adjustments, or different team contexts.

Injury Impact Analysis

# R: Career Projection with Injury Scenarios
library(forecast)

# Create career projection function
project_career <- function(age_values, war_values, project_to_age = 40) {
  career_df <- data.frame(age = age_values, war = war_values)

  # Fit aging curve model
  model <- loess(war ~ age, data = career_df, span = 0.75)

  # Project remaining years
  future_ages <- (max(age_values) + 1):project_to_age
  future_ages <- future_ages[future_ages <= project_to_age]

  if (length(future_ages) > 0) {
    projected_war <- predict(model, newdata = data.frame(age = future_ages))
    # Apply decline factor
    decline_factor <- seq(0.95, 0.5, length.out = length(future_ages))
    projected_war <- projected_war * decline_factor
    projected_war <- pmax(0, projected_war)  # No negative WAR
  } else {
    projected_war <- numeric(0)
    future_ages <- integer(0)
  }

  list(
    actual = career_df,
    projected = data.frame(age = future_ages, war = projected_war),
    total_war = sum(war_values) + sum(projected_war)
  )
}

# Example: Mike Trout career projection
trout_actual <- data.frame(
  age = 20:31,
  war = c(2.1, 10.2, 10.5, 9.2, 10.0, 10.6, 8.2, 8.6, 8.0, 4.5, 3.5, 1.8)
)

# Scenario 1: No injury (baseline)
scenario_healthy <- project_career(trout_actual$age, trout_actual$war)

# Scenario 2: Injury at age 32 (miss full season)
trout_injury <- trout_actual
scenario_injury <- project_career(trout_injury$age, trout_injury$war)
scenario_injury$total_war <- scenario_injury$total_war - 6  # Lost prime season

# Scenario 3: Career-ending injury at 32
scenario_career_end <- list(
  actual = trout_actual,
  projected = data.frame(age = integer(0), war = numeric(0)),
  total_war = sum(trout_actual$war)
)

# Compare scenarios
comparison <- data.frame(
  scenario = c("Healthy Career", "One Year Injury", "Career Ending"),
  projected_total_war = c(
    scenario_healthy$total_war,
    scenario_injury$total_war,
    scenario_career_end$total_war
  ),
  hof_probability = c(0.99, 0.90, 0.75)
)

print("Career Projection Scenarios:")
print(comparison)

# Visualize scenarios
plot_data <- rbind(
  data.frame(scenario = "Actual", scenario_healthy$actual),
  data.frame(scenario = "Healthy Projection", scenario_healthy$projected),
  data.frame(scenario = "Injury Projection", scenario_injury$projected)
)

ggplot(plot_data, aes(x = age, y = war, color = scenario, linetype = scenario)) +
  geom_line(size = 1.2) +
  geom_point(data = plot_data[plot_data$scenario == "Actual", ], size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("Actual" = "black",
                                "Healthy Projection" = "darkgreen",
                                "Injury Projection" = "red")) +
  scale_linetype_manual(values = c("Actual" = "solid",
                                   "Healthy Projection" = "dashed",
                                   "Injury Projection" = "dotted")) +
  labs(title = "Career Projection: Injury Impact Analysis",
       subtitle = "Comparing healthy vs injury scenarios",
       x = "Age",
       y = "WAR",
       color = "Scenario",
       linetype = "Scenario") +
  theme_minimal() +
  theme(legend.position = "bottom")

Era Adjustment Analysis

# Python: Era-Adjusted Performance Comparison
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Historical context adjustments
era_factors = pd.DataFrame({
    'era': ['Dead Ball (1901-1919)', 'Live Ball (1920-1941)',
            'Integration (1942-1960)', 'Expansion (1961-1976)',
            'Free Agency (1977-1993)', 'Steroid (1994-2005)',
            'Post-Steroid (2006-2015)', 'Modern (2016-2023)'],
    'start_year': [1901, 1920, 1942, 1961, 1977, 1994, 2006, 2016],
    'end_year': [1919, 1941, 1960, 1976, 1993, 2005, 2015, 2023],
    'avg_runs_per_game': [3.8, 4.8, 4.3, 4.0, 4.3, 5.0, 4.6, 4.4],
    'avg_hr_per_game': [0.15, 0.5, 0.9, 0.95, 0.85, 1.1, 1.05, 1.25],
    'offensive_adjustment': [1.20, 1.05, 1.10, 1.15, 1.10, 0.95, 1.05, 1.00]
})

print("Era Adjustment Factors:")
print(era_factors[['era', 'offensive_adjustment']])

# Compare players across eras
historical_seasons = pd.DataFrame({
    'player': ['Babe Ruth', 'Ted Williams', 'Willie Mays', 'Mike Schmidt',
               'Barry Bonds', 'Albert Pujols', 'Mike Trout'],
    'peak_season': [1921, 1941, 1965, 1980, 2001, 2008, 2016],
    'hr': [59, 37, 52, 48, 73, 37, 29],
    'avg': [.378, .406, .317, .286, .328, .357, .315],
    'ops': [1.379, 1.287, 1.053, .974, 1.379, 1.114, 1.004],
    'war': [12.9, 10.4, 11.1, 8.5, 12.5, 9.6, 10.5]
})

# Add era information
def assign_era(year):
    for _, row in era_factors.iterrows():
        if row['start_year'] <= year <= row['end_year']:
            return row['era'], row['offensive_adjustment']
    return 'Unknown', 1.0

historical_seasons['era'] = historical_seasons['peak_season'].apply(
    lambda x: assign_era(x)[0]
)
historical_seasons['era_adjustment'] = historical_seasons['peak_season'].apply(
    lambda x: assign_era(x)[1]
)

# Calculate era-adjusted stats
historical_seasons['adjusted_ops'] = (
    historical_seasons['ops'] * historical_seasons['era_adjustment']
)
historical_seasons['adjusted_war'] = (
    historical_seasons['war'] * historical_seasons['era_adjustment']
)

print("\nEra-Adjusted Performance Comparison:")
print(historical_seasons[['player', 'era', 'ops', 'adjusted_ops',
                         'war', 'adjusted_war']].to_string(index=False))

# Rank players by adjusted performance
historical_seasons_ranked = historical_seasons.sort_values(
    'adjusted_war', ascending=False
)

print("\nRanking by Era-Adjusted WAR:")
for idx, row in historical_seasons_ranked.iterrows():
    print(f"{row['player']:20s} {row['adjusted_war']:5.1f} ({row['era']})")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Raw vs Adjusted OPS
ax = axes[0, 0]
x = np.arange(len(historical_seasons))
width = 0.35
ax.bar(x - width/2, historical_seasons['ops'], width, label='Raw OPS', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_ops'], width,
       label='Adjusted OPS', alpha=0.7)
ax.set_ylabel('OPS')
ax.set_title('Raw vs Era-Adjusted OPS')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Raw vs Adjusted WAR
ax = axes[0, 1]
ax.bar(x - width/2, historical_seasons['war'], width, label='Raw WAR', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_war'], width,
       label='Adjusted WAR', alpha=0.7)
ax.set_ylabel('WAR')
ax.set_title('Raw vs Era-Adjusted WAR')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Era offensive levels
ax = axes[1, 0]
ax.plot(era_factors['start_year'], era_factors['avg_runs_per_game'],
        'o-', label='Runs/Game', linewidth=2)
ax2 = ax.twinx()
ax2.plot(era_factors['start_year'], era_factors['avg_hr_per_game'],
         's-', color='orange', label='HR/Game', linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('Runs per Game', color='blue')
ax2.set_ylabel('Home Runs per Game', color='orange')
ax.set_title('Offensive Environment by Era')
ax.tick_params(axis='y', labelcolor='blue')
ax2.tick_params(axis='y', labelcolor='orange')
ax.grid(True, alpha=0.3)

# Adjustment factor timeline
ax = axes[1, 1]
ax.step(era_factors['start_year'], era_factors['offensive_adjustment'],
        where='post', linewidth=2, color='darkred')
ax.axhline(y=1.0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Year')
ax.set_ylabel('Adjustment Factor')
ax.set_title('Era Adjustment Factors Over Time')
ax.set_ylim([0.9, 1.25])
ax.grid(True, alpha=0.3)

# Add era labels
for _, row in era_factors.iterrows():
    ax.text(row['start_year'], row['offensive_adjustment'] + 0.02,
           row['era'].split('(')[0].strip(), rotation=45,
           fontsize=7, ha='left')

plt.tight_layout()
plt.savefig('era_adjustments.png', dpi=300, bbox_inches='tight')
plt.show()

# R: Career Projection with Injury Scenarios
library(forecast)

# Create career projection function
project_career <- function(age_values, war_values, project_to_age = 40) {
  career_df <- data.frame(age = age_values, war = war_values)

  # Fit aging curve model
  model <- loess(war ~ age, data = career_df, span = 0.75)

  # Project remaining years
  future_ages <- (max(age_values) + 1):project_to_age
  future_ages <- future_ages[future_ages <= project_to_age]

  if (length(future_ages) > 0) {
    projected_war <- predict(model, newdata = data.frame(age = future_ages))
    # Apply decline factor
    decline_factor <- seq(0.95, 0.5, length.out = length(future_ages))
    projected_war <- projected_war * decline_factor
    projected_war <- pmax(0, projected_war)  # No negative WAR
  } else {
    projected_war <- numeric(0)
    future_ages <- integer(0)
  }

  list(
    actual = career_df,
    projected = data.frame(age = future_ages, war = projected_war),
    total_war = sum(war_values) + sum(projected_war)
  )
}

# Example: Mike Trout career projection
trout_actual <- data.frame(
  age = 20:31,
  war = c(2.1, 10.2, 10.5, 9.2, 10.0, 10.6, 8.2, 8.6, 8.0, 4.5, 3.5, 1.8)
)

# Scenario 1: No injury (baseline)
scenario_healthy <- project_career(trout_actual$age, trout_actual$war)

# Scenario 2: Injury at age 32 (miss full season)
trout_injury <- trout_actual
scenario_injury <- project_career(trout_injury$age, trout_injury$war)
scenario_injury$total_war <- scenario_injury$total_war - 6  # Lost prime season

# Scenario 3: Career-ending injury at 32
scenario_career_end <- list(
  actual = trout_actual,
  projected = data.frame(age = integer(0), war = numeric(0)),
  total_war = sum(trout_actual$war)
)

# Compare scenarios
comparison <- data.frame(
  scenario = c("Healthy Career", "One Year Injury", "Career Ending"),
  projected_total_war = c(
    scenario_healthy$total_war,
    scenario_injury$total_war,
    scenario_career_end$total_war
  ),
  hof_probability = c(0.99, 0.90, 0.75)
)

print("Career Projection Scenarios:")
print(comparison)

# Visualize scenarios
plot_data <- rbind(
  data.frame(scenario = "Actual", scenario_healthy$actual),
  data.frame(scenario = "Healthy Projection", scenario_healthy$projected),
  data.frame(scenario = "Injury Projection", scenario_injury$projected)
)

ggplot(plot_data, aes(x = age, y = war, color = scenario, linetype = scenario)) +
  geom_line(size = 1.2) +
  geom_point(data = plot_data[plot_data$scenario == "Actual", ], size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("Actual" = "black",
                                "Healthy Projection" = "darkgreen",
                                "Injury Projection" = "red")) +
  scale_linetype_manual(values = c("Actual" = "solid",
                                   "Healthy Projection" = "dashed",
                                   "Injury Projection" = "dotted")) +
  labs(title = "Career Projection: Injury Impact Analysis",
       subtitle = "Comparing healthy vs injury scenarios",
       x = "Age",
       y = "WAR",
       color = "Scenario",
       linetype = "Scenario") +
  theme_minimal() +
  theme(legend.position = "bottom")

Python

# Python: Era-Adjusted Performance Comparison
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Historical context adjustments
era_factors = pd.DataFrame({
    'era': ['Dead Ball (1901-1919)', 'Live Ball (1920-1941)',
            'Integration (1942-1960)', 'Expansion (1961-1976)',
            'Free Agency (1977-1993)', 'Steroid (1994-2005)',
            'Post-Steroid (2006-2015)', 'Modern (2016-2023)'],
    'start_year': [1901, 1920, 1942, 1961, 1977, 1994, 2006, 2016],
    'end_year': [1919, 1941, 1960, 1976, 1993, 2005, 2015, 2023],
    'avg_runs_per_game': [3.8, 4.8, 4.3, 4.0, 4.3, 5.0, 4.6, 4.4],
    'avg_hr_per_game': [0.15, 0.5, 0.9, 0.95, 0.85, 1.1, 1.05, 1.25],
    'offensive_adjustment': [1.20, 1.05, 1.10, 1.15, 1.10, 0.95, 1.05, 1.00]
})

print("Era Adjustment Factors:")
print(era_factors[['era', 'offensive_adjustment']])

# Compare players across eras
historical_seasons = pd.DataFrame({
    'player': ['Babe Ruth', 'Ted Williams', 'Willie Mays', 'Mike Schmidt',
               'Barry Bonds', 'Albert Pujols', 'Mike Trout'],
    'peak_season': [1921, 1941, 1965, 1980, 2001, 2008, 2016],
    'hr': [59, 37, 52, 48, 73, 37, 29],
    'avg': [.378, .406, .317, .286, .328, .357, .315],
    'ops': [1.379, 1.287, 1.053, .974, 1.379, 1.114, 1.004],
    'war': [12.9, 10.4, 11.1, 8.5, 12.5, 9.6, 10.5]
})

# Add era information
def assign_era(year):
    for _, row in era_factors.iterrows():
        if row['start_year'] <= year <= row['end_year']:
            return row['era'], row['offensive_adjustment']
    return 'Unknown', 1.0

historical_seasons['era'] = historical_seasons['peak_season'].apply(
    lambda x: assign_era(x)[0]
)
historical_seasons['era_adjustment'] = historical_seasons['peak_season'].apply(
    lambda x: assign_era(x)[1]
)

# Calculate era-adjusted stats
historical_seasons['adjusted_ops'] = (
    historical_seasons['ops'] * historical_seasons['era_adjustment']
)
historical_seasons['adjusted_war'] = (
    historical_seasons['war'] * historical_seasons['era_adjustment']
)

print("\nEra-Adjusted Performance Comparison:")
print(historical_seasons[['player', 'era', 'ops', 'adjusted_ops',
                         'war', 'adjusted_war']].to_string(index=False))

# Rank players by adjusted performance
historical_seasons_ranked = historical_seasons.sort_values(
    'adjusted_war', ascending=False
)

print("\nRanking by Era-Adjusted WAR:")
for idx, row in historical_seasons_ranked.iterrows():
    print(f"{row['player']:20s} {row['adjusted_war']:5.1f} ({row['era']})")

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Raw vs Adjusted OPS
ax = axes[0, 0]
x = np.arange(len(historical_seasons))
width = 0.35
ax.bar(x - width/2, historical_seasons['ops'], width, label='Raw OPS', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_ops'], width,
       label='Adjusted OPS', alpha=0.7)
ax.set_ylabel('OPS')
ax.set_title('Raw vs Era-Adjusted OPS')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Raw vs Adjusted WAR
ax = axes[0, 1]
ax.bar(x - width/2, historical_seasons['war'], width, label='Raw WAR', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_war'], width,
       label='Adjusted WAR', alpha=0.7)
ax.set_ylabel('WAR')
ax.set_title('Raw vs Era-Adjusted WAR')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Era offensive levels
ax = axes[1, 0]
ax.plot(era_factors['start_year'], era_factors['avg_runs_per_game'],
        'o-', label='Runs/Game', linewidth=2)
ax2 = ax.twinx()
ax2.plot(era_factors['start_year'], era_factors['avg_hr_per_game'],
         's-', color='orange', label='HR/Game', linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('Runs per Game', color='blue')
ax2.set_ylabel('Home Runs per Game', color='orange')
ax.set_title('Offensive Environment by Era')
ax.tick_params(axis='y', labelcolor='blue')
ax2.tick_params(axis='y', labelcolor='orange')
ax.grid(True, alpha=0.3)

# Adjustment factor timeline
ax = axes[1, 1]
ax.step(era_factors['start_year'], era_factors['offensive_adjustment'],
        where='post', linewidth=2, color='darkred')
ax.axhline(y=1.0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Year')
ax.set_ylabel('Adjustment Factor')
ax.set_title('Era Adjustment Factors Over Time')
ax.set_ylim([0.9, 1.25])
ax.grid(True, alpha=0.3)

# Add era labels
for _, row in era_factors.iterrows():
    ax.text(row['start_year'], row['offensive_adjustment'] + 0.02,
           row['era'].split('(')[0].strip(), rotation=45,
           fontsize=7, ha='left')

plt.tight_layout()
plt.savefig('era_adjustments.png', dpi=300, bbox_inches='tight')
plt.show()

18.6 GOAT Debate

The "Greatest of All Time" debate is baseball's most entertaining analytical exercise, combining statistics, context, and personal values.

Multi-Dimensional GOAT Analysis

# R: Comprehensive GOAT Framework
library(FactoMineR)
library(factoextra)

# Create comprehensive player comparison
goat_candidates <- data.frame(
  player = c("Babe Ruth", "Willie Mays", "Barry Bonds", "Hank Aaron",
             "Ted Williams", "Mickey Mantle", "Stan Musial", "Ty Cobb",
             "Honus Wagner", "Rogers Hornsby", "Mike Trout"),
  # Career value
  career_war = c(183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 128.0, 151.4, 130.8, 127.0, 85.5),
  peak_war_7yr = c(71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 57.7, 62.1, 57.4, 66.3, 56.8),
  # Counting stats (normalized to per 162 games)
  hr_per_162 = c(46.3, 35.2, 37.1, 37.2, 37.0, 33.8, 27.3, 4.0, 6.8, 22.8, 37.0),
  avg = c(.342, .301, .298, .305, .344, .298, .331, .366, .328, .358, .303),
  ops = c(1.164, .941, 1.051, .928, 1.116, .977, .976, .945, .858, 1.010, .988),
  # Peak seasons
  mvps = c(1, 2, 7, 1, 2, 3, 3, 1, 0, 2, 3),
  allstar = c(2, 24, 14, 25, 19, 20, 24, 0, 0, 0, 11),
  # Postseason
  ws_titles = c(7, 1, 0, 1, 0, 7, 3, 0, 1, 0, 0),
  # Longevity
  seasons = c(22, 23, 22, 23, 19, 18, 22, 24, 21, 23, 13),
  # Era context (offensive environment, 1.0 = neutral)
  era_context = c(1.15, 1.05, 0.90, 1.05, 1.10, 1.00, 1.05, 1.20, 1.15, 1.10, 1.00)
)

# Calculate composite scores
goat_candidates <- goat_candidates %>%
  mutate(
    jaws = (career_war + peak_war_7yr) / 2,
    era_adj_war = career_war * era_context,
    peak_score = (peak_war_7yr / 75) * 100,
    counting_score = (hr_per_162 / 50 + (avg - .250) / .100 +
                     (ops - .700) / .500) / 3 * 100,
    awards_score = (mvps * 10 + allstar * 2) / 2,
    longevity_score = pmin(100, seasons * 4),
    # Overall GOAT score (weighted combination)
    goat_score = 0.40 * era_adj_war / 2 +
                 0.25 * peak_score +
                 0.15 * counting_score +
                 0.10 * awards_score +
                 0.10 * longevity_score
  )

# Rank players
goat_rankings <- goat_candidates %>%
  arrange(desc(goat_score)) %>%
  mutate(rank = row_number())

print("GOAT Rankings:")
print(goat_rankings %>%
      select(rank, player, career_war, jaws, goat_score) %>%
      head(10))

# Principal Component Analysis for visualization
pca_data <- goat_candidates %>%
  select(career_war, peak_war_7yr, ops, mvps, allstar, seasons)

pca_result <- PCA(pca_data, graph = FALSE)

# Visualize PCA
fviz_pca_biplot(pca_result,
                label = "var",
                habillage = factor(goat_candidates$player),
                addEllipses = FALSE,
                col.var = "darkred",
                repel = TRUE) +
  labs(title = "GOAT Candidates - Principal Component Analysis",
       subtitle = "Multi-dimensional comparison of greatest players") +
  theme_minimal()

# Create radar chart comparison (top 5)
library(fmsb)

top5 <- goat_rankings %>% head(5)

# Normalize scores to 0-100 scale
radar_data <- top5 %>%
  mutate(
    war_norm = (career_war / max(career_war)) * 100,
    peak_norm = (peak_war_7yr / max(peak_war_7yr)) * 100,
    rate_norm = ((ops - min(ops)) / (max(ops) - min(ops))) * 100,
    awards_norm = (awards_score / max(awards_score)) * 100,
    longevity_norm = (longevity_score / max(longevity_score)) * 100
  ) %>%
  select(player, war_norm, peak_norm, rate_norm, awards_norm, longevity_norm)

# Prepare data for radar chart
radar_matrix <- as.data.frame(t(radar_data[, -1]))
colnames(radar_matrix) <- radar_data$player
radar_matrix <- rbind(rep(100, 5), rep(0, 5), radar_matrix)

# Plot radar chart
colors_border <- c("red", "blue", "green", "orange", "purple")
colors_in <- alpha(colors_border, 0.3)

par(mar = c(1, 1, 3, 1))
radarchart(radar_matrix,
           axistype = 1,
           pcol = colors_border,
           pfcol = colors_in,
           plwd = 2,
           plty = 1,
           cglcol = "grey",
           cglty = 1,
           axislabcol = "grey",
           cglwd = 0.8,
           vlcex = 0.8,
           title = "Top 5 GOAT Candidates - Radar Comparison")

legend("topright", legend = colnames(radar_matrix),
       col = colors_border, lty = 1, lwd = 2, cex = 0.8)

Context-Specific GOAT Rankings

# Python: Situational GOAT Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Create comprehensive dataset
goat_data = pd.DataFrame({
    'player': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
               'Ted Williams', 'Mickey Mantle', 'Mike Trout', 'Stan Musial'],
    'career_war': [183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 85.5, 128.0],
    'peak_war': [71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 56.8, 57.7],
    'ops_plus': [207, 156, 182, 155, 190, 172, 176, 159],
    'war_per_season': [8.3, 6.8, 7.4, 6.2, 6.4, 6.1, 6.6, 5.8],
    'mvps': [1, 2, 7, 1, 2, 3, 3, 3],
    'championships': [7, 1, 0, 1, 0, 7, 0, 3],
    'iconic_moments': [10, 8, 5, 7, 6, 9, 4, 6],  # Subjective
    'cultural_impact': [10, 9, 5, 8, 8, 9, 7, 7]  # Subjective
})

# Different weighting schemes for different perspectives
weighting_schemes = {
    'Pure Statistics': {
        'career_war': 0.40,
        'peak_war': 0.30,
        'ops_plus': 0.20,
        'war_per_season': 0.10,
        'mvps': 0.0,
        'championships': 0.0,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Peak Performance': {
        'career_war': 0.20,
        'peak_war': 0.40,
        'ops_plus': 0.15,
        'war_per_season': 0.25,
        'mvps': 0.0,
        'championships': 0.0,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Career Achievement': {
        'career_war': 0.35,
        'peak_war': 0.15,
        'ops_plus': 0.15,
        'war_per_season': 0.10,
        'mvps': 0.15,
        'championships': 0.10,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Legacy & Impact': {
        'career_war': 0.20,
        'peak_war': 0.15,
        'ops_plus': 0.10,
        'war_per_season': 0.05,
        'mvps': 0.10,
        'championships': 0.15,
        'iconic_moments': 0.15,
        'cultural_impact': 0.10
    },
    'Balanced': {
        'career_war': 0.25,
        'peak_war': 0.20,
        'ops_plus': 0.15,
        'war_per_season': 0.15,
        'mvps': 0.10,
        'championships': 0.08,
        'iconic_moments': 0.04,
        'cultural_impact': 0.03
    }
}

# Normalize all metrics to 0-100 scale
metrics = ['career_war', 'peak_war', 'ops_plus', 'war_per_season',
           'mvps', 'championships', 'iconic_moments', 'cultural_impact']

goat_normalized = goat_data.copy()
for metric in metrics:
    goat_normalized[f'{metric}_norm'] = (
        (goat_data[metric] - goat_data[metric].min()) /
        (goat_data[metric].max() - goat_data[metric].min()) * 100
    )

# Calculate scores for each weighting scheme
results = []

for scheme_name, weights in weighting_schemes.items():
    goat_normalized[f'score_{scheme_name}'] = 0

    for metric, weight in weights.items():
        if weight > 0:
            goat_normalized[f'score_{scheme_name}'] += (
                goat_normalized[f'{metric}_norm'] * weight
            )

    # Rank players
    scheme_rankings = goat_normalized[['player', f'score_{scheme_name}']].copy()
    scheme_rankings = scheme_rankings.sort_values(
        f'score_{scheme_name}', ascending=False
    )
    scheme_rankings['rank'] = range(1, len(scheme_rankings) + 1)
    scheme_rankings['scheme'] = scheme_name

    results.append(scheme_rankings)

# Combine all rankings
all_rankings = pd.concat(results)

print("GOAT Rankings by Perspective:")
print("=" * 80)

for scheme in weighting_schemes.keys():
    scheme_data = all_rankings[all_rankings['scheme'] == scheme]
    print(f"\n{scheme}:")
    print(scheme_data[['rank', 'player', f'score_{scheme}']].to_string(index=False))

# Visualizations
fig = plt.figure(figsize=(16, 10))

# Ranking heatmap
ax1 = plt.subplot(2, 2, 1)
ranking_pivot = all_rankings.pivot(index='player', columns='scheme', values='rank')
ranking_pivot = ranking_pivot[list(weighting_schemes.keys())]  # Order columns
sns.heatmap(ranking_pivot, annot=True, fmt='g', cmap='RdYlGn_r',
            cbar_kws={'label': 'Rank'}, ax=ax1)
ax1.set_title('Player Rankings by Perspective')
ax1.set_ylabel('')

# Score heatmap
ax2 = plt.subplot(2, 2, 2)
score_cols = [f'score_{scheme}' for scheme in weighting_schemes.keys()]
score_data = goat_normalized[['player'] + score_cols].set_index('player')
score_data.columns = list(weighting_schemes.keys())
sns.heatmap(score_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax2)
ax2.set_title('GOAT Scores by Perspective')
ax2.set_ylabel('')

# Ranking stability
ax3 = plt.subplot(2, 2, 3)
rank_variance = ranking_pivot.var(axis=1).sort_values()
colors = ['green' if x < 2 else 'orange' if x < 5 else 'red' for x in rank_variance]
ax3.barh(rank_variance.index, rank_variance.values, color=colors, alpha=0.7)
ax3.set_xlabel('Variance in Rankings')
ax3.set_title('Ranking Stability Across Perspectives')
ax3.axvline(x=2, color='green', linestyle='--', alpha=0.5, label='Consistent')
ax3.axvline(x=5, color='red', linestyle='--', alpha=0.5, label='Variable')
ax3.legend()

# Average ranking
ax4 = plt.subplot(2, 2, 4)
avg_ranking = ranking_pivot.mean(axis=1).sort_values()
bars = ax4.barh(avg_ranking.index, avg_ranking.values, alpha=0.7)
ax4.set_xlabel('Average Rank')
ax4.set_title('Consensus GOAT Rankings')
ax4.invert_xaxis()
ax4.set_xlim(ax4.get_xlim()[1], 0.5)  # Start from 0.5 to show #1 clearly

# Color gradient for bars
colors_gradient = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(bars)))
for bar, color in zip(bars, colors_gradient):
    bar.set_color(color)

plt.tight_layout()
plt.savefig('goat_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Summary statistics
print("\n" + "=" * 80)
print("CONSENSUS RANKINGS (Average across all perspectives):")
print("=" * 80)
consensus = ranking_pivot.mean(axis=1).sort_values()
for rank, (player, avg_rank) in enumerate(consensus.items(), 1):
    std = ranking_pivot.loc[player].std()
    print(f"{rank}. {player:20s} Avg Rank: {avg_rank:.2f} (σ = {std:.2f})")

# R: Comprehensive GOAT Framework
library(FactoMineR)
library(factoextra)

# Create comprehensive player comparison
goat_candidates <- data.frame(
  player = c("Babe Ruth", "Willie Mays", "Barry Bonds", "Hank Aaron",
             "Ted Williams", "Mickey Mantle", "Stan Musial", "Ty Cobb",
             "Honus Wagner", "Rogers Hornsby", "Mike Trout"),
  # Career value
  career_war = c(183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 128.0, 151.4, 130.8, 127.0, 85.5),
  peak_war_7yr = c(71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 57.7, 62.1, 57.4, 66.3, 56.8),
  # Counting stats (normalized to per 162 games)
  hr_per_162 = c(46.3, 35.2, 37.1, 37.2, 37.0, 33.8, 27.3, 4.0, 6.8, 22.8, 37.0),
  avg = c(.342, .301, .298, .305, .344, .298, .331, .366, .328, .358, .303),
  ops = c(1.164, .941, 1.051, .928, 1.116, .977, .976, .945, .858, 1.010, .988),
  # Peak seasons
  mvps = c(1, 2, 7, 1, 2, 3, 3, 1, 0, 2, 3),
  allstar = c(2, 24, 14, 25, 19, 20, 24, 0, 0, 0, 11),
  # Postseason
  ws_titles = c(7, 1, 0, 1, 0, 7, 3, 0, 1, 0, 0),
  # Longevity
  seasons = c(22, 23, 22, 23, 19, 18, 22, 24, 21, 23, 13),
  # Era context (offensive environment, 1.0 = neutral)
  era_context = c(1.15, 1.05, 0.90, 1.05, 1.10, 1.00, 1.05, 1.20, 1.15, 1.10, 1.00)
)

# Calculate composite scores
goat_candidates <- goat_candidates %>%
  mutate(
    jaws = (career_war + peak_war_7yr) / 2,
    era_adj_war = career_war * era_context,
    peak_score = (peak_war_7yr / 75) * 100,
    counting_score = (hr_per_162 / 50 + (avg - .250) / .100 +
                     (ops - .700) / .500) / 3 * 100,
    awards_score = (mvps * 10 + allstar * 2) / 2,
    longevity_score = pmin(100, seasons * 4),
    # Overall GOAT score (weighted combination)
    goat_score = 0.40 * era_adj_war / 2 +
                 0.25 * peak_score +
                 0.15 * counting_score +
                 0.10 * awards_score +
                 0.10 * longevity_score
  )

# Rank players
goat_rankings <- goat_candidates %>%
  arrange(desc(goat_score)) %>%
  mutate(rank = row_number())

print("GOAT Rankings:")
print(goat_rankings %>%
      select(rank, player, career_war, jaws, goat_score) %>%
      head(10))

# Principal Component Analysis for visualization
pca_data <- goat_candidates %>%
  select(career_war, peak_war_7yr, ops, mvps, allstar, seasons)

pca_result <- PCA(pca_data, graph = FALSE)

# Visualize PCA
fviz_pca_biplot(pca_result,
                label = "var",
                habillage = factor(goat_candidates$player),
                addEllipses = FALSE,
                col.var = "darkred",
                repel = TRUE) +
  labs(title = "GOAT Candidates - Principal Component Analysis",
       subtitle = "Multi-dimensional comparison of greatest players") +
  theme_minimal()

# Create radar chart comparison (top 5)
library(fmsb)

top5 <- goat_rankings %>% head(5)

# Normalize scores to 0-100 scale
radar_data <- top5 %>%
  mutate(
    war_norm = (career_war / max(career_war)) * 100,
    peak_norm = (peak_war_7yr / max(peak_war_7yr)) * 100,
    rate_norm = ((ops - min(ops)) / (max(ops) - min(ops))) * 100,
    awards_norm = (awards_score / max(awards_score)) * 100,
    longevity_norm = (longevity_score / max(longevity_score)) * 100
  ) %>%
  select(player, war_norm, peak_norm, rate_norm, awards_norm, longevity_norm)

# Prepare data for radar chart
radar_matrix <- as.data.frame(t(radar_data[, -1]))
colnames(radar_matrix) <- radar_data$player
radar_matrix <- rbind(rep(100, 5), rep(0, 5), radar_matrix)

# Plot radar chart
colors_border <- c("red", "blue", "green", "orange", "purple")
colors_in <- alpha(colors_border, 0.3)

par(mar = c(1, 1, 3, 1))
radarchart(radar_matrix,
           axistype = 1,
           pcol = colors_border,
           pfcol = colors_in,
           plwd = 2,
           plty = 1,
           cglcol = "grey",
           cglty = 1,
           axislabcol = "grey",
           cglwd = 0.8,
           vlcex = 0.8,
           title = "Top 5 GOAT Candidates - Radar Comparison")

legend("topright", legend = colnames(radar_matrix),
       col = colors_border, lty = 1, lwd = 2, cex = 0.8)

Python

# Python: Situational GOAT Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Create comprehensive dataset
goat_data = pd.DataFrame({
    'player': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
               'Ted Williams', 'Mickey Mantle', 'Mike Trout', 'Stan Musial'],
    'career_war': [183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 85.5, 128.0],
    'peak_war': [71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 56.8, 57.7],
    'ops_plus': [207, 156, 182, 155, 190, 172, 176, 159],
    'war_per_season': [8.3, 6.8, 7.4, 6.2, 6.4, 6.1, 6.6, 5.8],
    'mvps': [1, 2, 7, 1, 2, 3, 3, 3],
    'championships': [7, 1, 0, 1, 0, 7, 0, 3],
    'iconic_moments': [10, 8, 5, 7, 6, 9, 4, 6],  # Subjective
    'cultural_impact': [10, 9, 5, 8, 8, 9, 7, 7]  # Subjective
})

# Different weighting schemes for different perspectives
weighting_schemes = {
    'Pure Statistics': {
        'career_war': 0.40,
        'peak_war': 0.30,
        'ops_plus': 0.20,
        'war_per_season': 0.10,
        'mvps': 0.0,
        'championships': 0.0,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Peak Performance': {
        'career_war': 0.20,
        'peak_war': 0.40,
        'ops_plus': 0.15,
        'war_per_season': 0.25,
        'mvps': 0.0,
        'championships': 0.0,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Career Achievement': {
        'career_war': 0.35,
        'peak_war': 0.15,
        'ops_plus': 0.15,
        'war_per_season': 0.10,
        'mvps': 0.15,
        'championships': 0.10,
        'iconic_moments': 0.0,
        'cultural_impact': 0.0
    },
    'Legacy & Impact': {
        'career_war': 0.20,
        'peak_war': 0.15,
        'ops_plus': 0.10,
        'war_per_season': 0.05,
        'mvps': 0.10,
        'championships': 0.15,
        'iconic_moments': 0.15,
        'cultural_impact': 0.10
    },
    'Balanced': {
        'career_war': 0.25,
        'peak_war': 0.20,
        'ops_plus': 0.15,
        'war_per_season': 0.15,
        'mvps': 0.10,
        'championships': 0.08,
        'iconic_moments': 0.04,
        'cultural_impact': 0.03
    }
}

# Normalize all metrics to 0-100 scale
metrics = ['career_war', 'peak_war', 'ops_plus', 'war_per_season',
           'mvps', 'championships', 'iconic_moments', 'cultural_impact']

goat_normalized = goat_data.copy()
for metric in metrics:
    goat_normalized[f'{metric}_norm'] = (
        (goat_data[metric] - goat_data[metric].min()) /
        (goat_data[metric].max() - goat_data[metric].min()) * 100
    )

# Calculate scores for each weighting scheme
results = []

for scheme_name, weights in weighting_schemes.items():
    goat_normalized[f'score_{scheme_name}'] = 0

    for metric, weight in weights.items():
        if weight > 0:
            goat_normalized[f'score_{scheme_name}'] += (
                goat_normalized[f'{metric}_norm'] * weight
            )

    # Rank players
    scheme_rankings = goat_normalized[['player', f'score_{scheme_name}']].copy()
    scheme_rankings = scheme_rankings.sort_values(
        f'score_{scheme_name}', ascending=False
    )
    scheme_rankings['rank'] = range(1, len(scheme_rankings) + 1)
    scheme_rankings['scheme'] = scheme_name

    results.append(scheme_rankings)

# Combine all rankings
all_rankings = pd.concat(results)

print("GOAT Rankings by Perspective:")
print("=" * 80)

for scheme in weighting_schemes.keys():
    scheme_data = all_rankings[all_rankings['scheme'] == scheme]
    print(f"\n{scheme}:")
    print(scheme_data[['rank', 'player', f'score_{scheme}']].to_string(index=False))

# Visualizations
fig = plt.figure(figsize=(16, 10))

# Ranking heatmap
ax1 = plt.subplot(2, 2, 1)
ranking_pivot = all_rankings.pivot(index='player', columns='scheme', values='rank')
ranking_pivot = ranking_pivot[list(weighting_schemes.keys())]  # Order columns
sns.heatmap(ranking_pivot, annot=True, fmt='g', cmap='RdYlGn_r',
            cbar_kws={'label': 'Rank'}, ax=ax1)
ax1.set_title('Player Rankings by Perspective')
ax1.set_ylabel('')

# Score heatmap
ax2 = plt.subplot(2, 2, 2)
score_cols = [f'score_{scheme}' for scheme in weighting_schemes.keys()]
score_data = goat_normalized[['player'] + score_cols].set_index('player')
score_data.columns = list(weighting_schemes.keys())
sns.heatmap(score_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax2)
ax2.set_title('GOAT Scores by Perspective')
ax2.set_ylabel('')

# Ranking stability
ax3 = plt.subplot(2, 2, 3)
rank_variance = ranking_pivot.var(axis=1).sort_values()
colors = ['green' if x < 2 else 'orange' if x < 5 else 'red' for x in rank_variance]
ax3.barh(rank_variance.index, rank_variance.values, color=colors, alpha=0.7)
ax3.set_xlabel('Variance in Rankings')
ax3.set_title('Ranking Stability Across Perspectives')
ax3.axvline(x=2, color='green', linestyle='--', alpha=0.5, label='Consistent')
ax3.axvline(x=5, color='red', linestyle='--', alpha=0.5, label='Variable')
ax3.legend()

# Average ranking
ax4 = plt.subplot(2, 2, 4)
avg_ranking = ranking_pivot.mean(axis=1).sort_values()
bars = ax4.barh(avg_ranking.index, avg_ranking.values, alpha=0.7)
ax4.set_xlabel('Average Rank')
ax4.set_title('Consensus GOAT Rankings')
ax4.invert_xaxis()
ax4.set_xlim(ax4.get_xlim()[1], 0.5)  # Start from 0.5 to show #1 clearly

# Color gradient for bars
colors_gradient = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(bars)))
for bar, color in zip(bars, colors_gradient):
    bar.set_color(color)

plt.tight_layout()
plt.savefig('goat_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Summary statistics
print("\n" + "=" * 80)
print("CONSENSUS RANKINGS (Average across all perspectives):")
print("=" * 80)
consensus = ranking_pivot.mean(axis=1).sort_values()
for rank, (player, avg_rank) in enumerate(consensus.items(), 1):
    std = ranking_pivot.loc[player].std()
    print(f"{rank}. {player:20s} Avg Rank: {avg_rank:.2f} (σ = {std:.2f})")

18.7 Exercises

Exercise 1: Career Value Calculator

Build a comprehensive career value calculator that computes WAR, JAWS, and position-adjusted metrics for any player.

Tasks:

Create a function that calculates career WAR from basic statistics

Implement the JAWS calculation with a rolling 7-year window

Add position-specific adjustments

Compare against Hall of Fame standards

Generate a career value report with visualizations

Solution Approach:

# R: Career Value Calculator
calculate_career_value <- function(player_stats, position) {
  # player_stats should be a dataframe with columns: year, pa, h, 2b, 3b, hr, bb, hbp, sf, ab

  # Calculate rate stats
  player_stats <- player_stats %>%
    mutate(
      obp = (h + bb + hbp) / (ab + bb + hbp + sf),
      slg = (h - `2b` - `3b` - hr + 2*`2b` + 3*`3b` + 4*hr) / ab,
      ops = obp + slg,
      # Simplified WAR estimation
      war = (ops - 0.710) * pa / 20
    )

  # Career totals
  career_war <- sum(player_stats$war, na.rm = TRUE)
  seasons <- nrow(player_stats)

  # Peak 7-year WAR
  if (nrow(player_stats) >= 7) {
    peak_war <- max(rollapply(player_stats$war, width = 7,
                              FUN = sum, align = "left", partial = TRUE))
  } else {
    peak_war <- career_war
  }

  # JAWS
  jaws <- (career_war + peak_war) / 2

  # Position-specific HOF standards (approximate)
  position_standards <- list(
    C = 53, `1B` = 67, `2B` = 69, `3B` = 67, SS = 67,
    LF = 65, CF = 70, RF = 70, DH = 27
  )

  hof_standard <- position_standards[[position]]
  hof_percentage <- (jaws / hof_standard) * 100

  # Return results
  list(
    career_war = career_war,
    peak_war = peak_war,
    jaws = jaws,
    seasons = seasons,
    war_per_season = career_war / seasons,
    hof_standard = hof_standard,
    hof_percentage = hof_percentage,
    yearly_war = player_stats$war
  )
}

# Test the function
# (Implementation depends on your specific data)

Exercise 2: HOF Predictor with Machine Learning

Develop a machine learning model that predicts Hall of Fame induction probability using various features.

Tasks:

Collect data on HOF-eligible players (both inducted and not)

Engineer relevant features (career stats, awards, era adjustments)

Train multiple models (Logistic Regression, Random Forest, Gradient Boosting)

Evaluate model performance using cross-validation

Generate predictions for current players

Exercise 3: Aging Curve Analysis

Analyze how different skills age and create projections for active players.

Tasks:

Calculate aging curves for power, speed, and contact skills

Use the delta method to control for survivorship bias

Identify peak ages for each skill type

Project the next 3 years of performance for 5 active players

Visualize skill-specific aging patterns

Bonus: Compare aging curves by position (e.g., catchers vs. outfielders).

Exercise 4: Build Your Own GOAT Ranking System

Create a customizable GOAT ranking system that allows different weighting preferences.

Tasks:

Compile comprehensive statistics for top 20 all-time players

Normalize all metrics to a common scale (0-100)

Create at least 5 different weighting schemes representing different perspectives

Calculate rankings under each scheme

Identify which players are most consensus and most controversial

Build an interactive visualization showing how rankings change with different weights

Extension: Add era adjustments and park factors to make fairer historical comparisons.

Summary {#summary}

Career analytics combines statistical rigor with historical context to evaluate player legacies and predict future outcomes. Key takeaways:

WAR and JAWS provide comprehensive frameworks for measuring career value, balancing longevity with peak performance

Hall of Fame prediction requires both statistical excellence and contextual factors like awards, team success, and narrative

Award prediction models reveal the combination of performance and story that drives voting

Aging curves show universal patterns in player development and decline, with skill-specific variations

What-if analysis explores hypothetical scenarios to understand luck and context in career outcomes

GOAT debates benefit from multi-dimensional analysis, acknowledging that different perspectives yield different conclusions

Career analytics ultimately tells the story of baseball through individual achievement, connecting past legends to present stars and future Hall of Famers.

# R: Career Value Calculator
calculate_career_value <- function(player_stats, position) {
  # player_stats should be a dataframe with columns: year, pa, h, 2b, 3b, hr, bb, hbp, sf, ab

  # Calculate rate stats
  player_stats <- player_stats %>%
    mutate(
      obp = (h + bb + hbp) / (ab + bb + hbp + sf),
      slg = (h - `2b` - `3b` - hr + 2*`2b` + 3*`3b` + 4*hr) / ab,
      ops = obp + slg,
      # Simplified WAR estimation
      war = (ops - 0.710) * pa / 20
    )

  # Career totals
  career_war <- sum(player_stats$war, na.rm = TRUE)
  seasons <- nrow(player_stats)

  # Peak 7-year WAR
  if (nrow(player_stats) >= 7) {
    peak_war <- max(rollapply(player_stats$war, width = 7,
                              FUN = sum, align = "left", partial = TRUE))
  } else {
    peak_war <- career_war
  }

  # JAWS
  jaws <- (career_war + peak_war) / 2

  # Position-specific HOF standards (approximate)
  position_standards <- list(
    C = 53, `1B` = 67, `2B` = 69, `3B` = 67, SS = 67,
    LF = 65, CF = 70, RF = 70, DH = 27
  )

  hof_standard <- position_standards[[position]]
  hof_percentage <- (jaws / hof_standard) * 100

  # Return results
  list(
    career_war = career_war,
    peak_war = peak_war,
    jaws = jaws,
    seasons = seasons,
    war_per_season = career_war / seasons,
    hof_standard = hof_standard,
    hof_percentage = hof_percentage,
    yearly_war = player_stats$war
  )
}

# Test the function
# (Implementation depends on your specific data)

Chapter 18: Career Analytics & Hall of Fame Analysis

Book Progress

What You'll Learn

Languages in This Chapter

Table of Contents

Quick Navigation

18.1 Measuring Career Value (WAR, JAWS)

Wins Above Replacement (WAR)

Career WAR Analysis

JAWS (Jaffe WAR Score System)

Position-Specific JAWS Standards

18.2 HOF Prediction Models

Feature Engineering for HOF Prediction

Logistic Regression Approach

18.3 Award Prediction

MVP Award Prediction

Cy Young Award Prediction

18.4 Aging Curves

Delta Method for Aging Curves

Position-Specific and Skill-Specific Aging

18.5 What-If Analysis

Injury Impact Analysis

Era Adjustment Analysis

18.6 GOAT Debate

Multi-Dimensional GOAT Analysis

Context-Specific GOAT Rankings

18.7 Exercises

Exercise 1: Career Value Calculator

Exercise 2: HOF Predictor with Machine Learning

Exercise 3: Aging Curve Analysis

Exercise 4: Build Your Own GOAT Ranking System

Summary {#summary}

Chapter Summary

Chapter 18: Career Analytics & Hall of Fame Analysis

Book Progress

What You'll Learn

Languages in This Chapter

Table of Contents

Quick Navigation

18.1 Measuring Career Value (WAR, JAWS)

Wins Above Replacement (WAR)

Career WAR Analysis

JAWS (Jaffe WAR Score System)

Position-Specific JAWS Standards

18.2 HOF Prediction Models

Feature Engineering for HOF Prediction

Logistic Regression Approach

18.3 Award Prediction

MVP Award Prediction

Cy Young Award Prediction

18.4 Aging Curves

Delta Method for Aging Curves

Position-Specific and Skill-Specific Aging

18.5 What-If Analysis

Injury Impact Analysis

Era Adjustment Analysis

18.6 GOAT Debate

Multi-Dimensional GOAT Analysis

Context-Specific GOAT Rankings

18.7 Exercises

Exercise 1: Career Value Calculator

Exercise 2: HOF Predictor with Machine Learning

Exercise 3: Aging Curve Analysis

Exercise 4: Build Your Own GOAT Ranking System

Summary {#summary}

Chapter Summary

Related Resources

Glossary

Resources

All Chapters