Wins Above Replacement (WAR)
WAR is the most comprehensive single metric for measuring a player's total value. It combines batting, baserunning, fielding, and positional adjustments into one number representing wins contributed above a replacement-level player.
Key Components:
- Batting Runs: Offensive value above average
- Baserunning Runs: Value from steals, extra bases taken, and outs avoided on the bases
- Fielding Runs: Defensive value above average
- Positional Adjustment: Credit for playing demanding positions
- Replacement Level: Baseline for comparison (typically .294 winning percentage)
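The components above are all denominated in runs; they sum together and convert to wins at roughly 10 runs per win (the exact divisor varies with each season's run environment). A minimal sketch of that arithmetic; the component values and the `runs_per_win` default are illustrative, not from any published implementation.

```python
def war_from_components(batting, baserunning, fielding,
                        positional, replacement, runs_per_win=10.0):
    """Sum run-valued components and convert runs to wins.

    runs_per_win ~ 10 is a common approximation; the true divisor
    depends on the season's run-scoring environment.
    """
    total_runs = batting + baserunning + fielding + positional + replacement
    return total_runs / runs_per_win

# Illustrative season: +35 batting, +3 baserunning, +5 fielding,
# +2.5 positional (e.g. shortstop), +20 replacement runs
print(round(war_from_components(35, 3, 5, 2.5, 20), 1))  # an MVP-caliber ~6.5 WAR season
```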
The two primary WAR implementations are:
- bWAR (Baseball-Reference): Uses Defensive Runs Saved (DRS)
- fWAR (FanGraphs): Uses Ultimate Zone Rating (UZR)
Career WAR Analysis
# R: Analyzing Career WAR
library(tidyverse)
library(Lahman)

# Top 20 position players by estimated career WAR (batting stats as a proxy)
career_war <- Batting %>%
  group_by(playerID) %>%
  summarise(
    seasons = n(),
    games = sum(G, na.rm = TRUE),
    pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
    hr = sum(HR, na.rm = TRUE),
    sb = sum(SB, na.rm = TRUE),
    avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    obp = sum(H + BB + HBP, na.rm = TRUE) /
      sum(AB + BB + HBP + SF, na.rm = TRUE),
    slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
             2 * sum(`2B`, na.rm = TRUE) +
             3 * sum(`3B`, na.rm = TRUE) +
             4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
  ) %>%
  filter(pa >= 3000) %>%
  mutate(
    ops = obp + slg,
    # Simplified WAR estimate; .710 approximates a league-average OPS
    estimated_war = (ops - 0.710) * pa / 20
  )

# Get player names
player_names <- People %>%
  select(playerID, nameFirst, nameLast, debut, finalGame)

career_leaders <- career_war %>%
  left_join(player_names, by = "playerID") %>%
  arrange(desc(estimated_war)) %>%
  head(20) %>%
  mutate(
    name = paste(nameFirst, nameLast),
    war_per_season = estimated_war / seasons
  ) %>%
  select(name, seasons, games, pa, hr, avg, ops, estimated_war, war_per_season)

print(career_leaders)

# Visualize career WAR distribution
ggplot(career_war %>% filter(estimated_war > 0),
       aes(x = estimated_war)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = 50, linetype = "dashed", color = "red", linewidth = 1) +
  annotate("text", x = 52, y = 100, label = "HOF Threshold (~50 WAR)",
           hjust = 0, color = "red") +
  labs(title = "Distribution of Career WAR",
       subtitle = "Players with 3000+ Plate Appearances",
       x = "Career WAR (Estimated)",
       y = "Number of Players") +
  theme_minimal()
# Python: Career WAR Analysis
import pandas as pd
import matplotlib.pyplot as plt
from pybaseball import batting_stats

# Get career statistics for recent players
def calculate_career_war(start_year=1995, end_year=2023):
    """Aggregate season batting lines into career totals and rates."""
    all_stats = []
    for year in range(start_year, end_year + 1):
        try:
            stats = batting_stats(year)
            stats['Season'] = year
            all_stats.append(stats)
        except Exception:
            continue  # skip seasons that fail to download
    career_df = pd.concat(all_stats, ignore_index=True)

    # Aggregate by player
    career_totals = career_df.groupby('Name').agg({
        'Season': 'count',
        'G': 'sum',
        'PA': 'sum',
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'BB': 'sum',
        'SO': 'sum',
        'WAR': 'sum',
        'AVG': 'mean',  # unweighted season means; a PA-weighted mean is more precise
        'OBP': 'mean',
        'SLG': 'mean'
    }).reset_index()
    career_totals.columns = ['Name', 'Seasons', 'G', 'PA', 'AB', 'H',
                             'HR', 'BB', 'SO', 'Career_WAR', 'AVG', 'OBP', 'SLG']

    # Filter for qualified players
    career_totals = career_totals[career_totals['PA'] >= 3000].copy()

    # Calculate rate stats
    career_totals['WAR_per_Season'] = career_totals['Career_WAR'] / career_totals['Seasons']
    career_totals['WAR_per_162'] = career_totals['Career_WAR'] / career_totals['G'] * 162
    return career_totals.sort_values('Career_WAR', ascending=False)

# Example with historical career-value data (approximate bWAR figures)
hof_players = pd.DataFrame({
    'Name': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
             'Ted Williams', 'Mike Trout', 'Albert Pujols', 'Derek Jeter'],
    'Career_WAR': [183.1, 156.1, 162.8, 143.0, 121.9, 85.5, 99.9, 72.4],
    'Seasons': [22, 23, 22, 23, 19, 13, 22, 20],
    'Peak_WAR_7yr': [71.0, 72.9, 72.8, 59.6, 63.3, 56.8, 56.4, 38.5],
    'HOF': [1, 1, 0, 1, 1, 0, 0, 1]
})
hof_players['WAR_per_Season'] = hof_players['Career_WAR'] / hof_players['Seasons']
hof_players['JAWS'] = (hof_players['Career_WAR'] + hof_players['Peak_WAR_7yr']) / 2

print("Career Value Comparison:")
print(hof_players.to_string(index=False))

# Visualize WAR components
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Career WAR vs Peak WAR
axes[0].scatter(hof_players['Career_WAR'], hof_players['Peak_WAR_7yr'],
                c=hof_players['HOF'], cmap='RdYlGn', s=200, alpha=0.6)
for idx, row in hof_players.iterrows():
    axes[0].annotate(row['Name'], (row['Career_WAR'], row['Peak_WAR_7yr']),
                     fontsize=8, ha='right')
axes[0].set_xlabel('Career WAR')
axes[0].set_ylabel('Peak WAR (7-year)')
axes[0].set_title('Career vs Peak Value')
axes[0].grid(True, alpha=0.3)

# WAR per Season
axes[1].barh(hof_players['Name'], hof_players['WAR_per_Season'],
             color=['green' if x == 1 else 'gray' for x in hof_players['HOF']])
axes[1].set_xlabel('WAR per Season')
axes[1].set_title('Average Annual Value')
axes[1].axvline(x=5.0, color='red', linestyle='--', label='All-Star Level')
axes[1].legend()
plt.tight_layout()
plt.savefig('career_war_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
JAWS (Jaffe WAR Score System)
Developed by Jay Jaffe, JAWS balances career value with peak performance to evaluate Hall of Fame worthiness. It averages a player's career WAR with their best 7-year peak WAR.
$$\text{JAWS} = \frac{\text{Career WAR} + \text{Peak WAR (7-year)}}{2}$$
Why JAWS Matters:
- Keeps "compilers" (long but unspectacular careers) from dominating on counting stats alone
- Rewards sustained excellence
- Enables position-specific comparisons
- Provides an objective Hall of Fame benchmark
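Note that Jaffe's peak term (often called WAR7) sums a player's seven best seasons, which need not be consecutive. The formula above can be sketched in a few lines; the season values here are illustrative, not any real player's record.

```python
def jaws(season_war):
    """JAWS = (career WAR + WAR7) / 2, where WAR7 sums the seven
    best seasons (not necessarily consecutive)."""
    career = sum(season_war)
    war7 = sum(sorted(season_war, reverse=True)[:7])
    return (career + war7) / 2

# Illustrative career arc: rise, peak, decline
seasons = [1.5, 4.0, 6.2, 7.1, 8.0, 6.8, 5.5, 4.9, 3.0, 1.2]
print(round(jaws(seasons), 1))
```

Because WAR7 takes the best seasons wherever they fall, a player whose peak was interrupted (injury, military service) is not penalized the way a consecutive-window calculation would penalize them.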
# R: JAWS Calculation
library(zoo)

# Peak WAR over the best consecutive 7-year window
# (Jaffe's official WAR7 uses the seven best seasons, which need not be
# consecutive; a rolling window is a simpler approximation)
calculate_peak_war <- function(yearly_war, window = 7) {
  if (length(yearly_war) < window) {
    return(sum(yearly_war, na.rm = TRUE))
  }
  max(rollapply(yearly_war, width = window, FUN = sum,
                align = "left", partial = TRUE))
}

# Simulate career trajectories for HOF analysis
set.seed(123)
simulate_careers <- function(n = 1000) {
  careers <- lapply(1:n, function(i) {
    # Random career length (10-25 years)
    years <- sample(10:25, 1)
    # Peak begins around year 5-7
    peak_year <- sample(5:7, 1)
    # Generate WAR trajectory
    war <- numeric(years)
    for (j in 1:years) {
      if (j < peak_year) {
        # Rising phase
        war[j] <- rnorm(1, mean = 2 + j * 0.8, sd = 1.5)
      } else if (j <= peak_year + 3) {
        # Peak phase
        war[j] <- rnorm(1, mean = 6, sd = 1.5)
      } else {
        # Decline phase
        decline_factor <- (j - peak_year - 3) * 0.5
        war[j] <- rnorm(1, mean = max(1, 5 - decline_factor), sd = 1.5)
      }
      war[j] <- max(0, war[j])  # no negative WAR seasons
    }
    career_war <- sum(war)
    peak_war <- calculate_peak_war(war, 7)
    data.frame(
      career_id = i,
      years = years,
      career_war = career_war,
      peak_war = peak_war,
      jaws = (career_war + peak_war) / 2
    )
  })
  bind_rows(careers)
}

career_sims <- simulate_careers(1000)

# HOF threshold analysis
hof_threshold <- 50  # typical position-player threshold
career_sims <- career_sims %>%
  mutate(
    hof_worthy = career_war >= hof_threshold,
    jaws_worthy = jaws >= (hof_threshold * 0.75)
  )

# Compare criteria
criteria_comparison <- career_sims %>%
  summarise(
    total = n(),
    war_hof = sum(hof_worthy),
    jaws_hof = sum(jaws_worthy),
    both = sum(hof_worthy & jaws_worthy)
  )
print("HOF Criteria Comparison:")
print(criteria_comparison)

# Visualize career vs peak
ggplot(career_sims, aes(x = career_war, y = peak_war)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  geom_vline(xintercept = 50, linetype = "dashed", color = "darkgreen") +
  geom_hline(yintercept = 35, linetype = "dashed", color = "darkgreen") +
  annotate("rect", xmin = 50, xmax = Inf, ymin = 35, ymax = Inf,
           alpha = 0.2, fill = "green") +
  labs(title = "Career WAR vs Peak WAR (7-year)",
       subtitle = "Green zone indicates typical HOF territory",
       x = "Career WAR",
       y = "Peak 7-Year WAR") +
  theme_minimal()
Position-Specific JAWS Standards
# Python: Position-Specific JAWS Analysis
import pandas as pd
import matplotlib.pyplot as plt

# Historical HOF standards by position (approximate)
position_standards = pd.DataFrame({
    'Position': ['C', '1B', '2B', '3B', 'SS', 'LF', 'CF', 'RF', 'DH'],
    'HOF_Avg_WAR': [53.5, 66.7, 69.2, 67.1, 67.4, 64.9, 69.7, 70.3, 27.4],
    'HOF_Avg_Peak': [33.4, 42.3, 42.7, 42.1, 42.8, 41.5, 43.5, 43.6, 18.8],
    'HOF_Members': [17, 21, 19, 15, 24, 20, 19, 26, 4]
})
position_standards['HOF_Avg_JAWS'] = (
    (position_standards['HOF_Avg_WAR'] + position_standards['HOF_Avg_Peak']) / 2
)
print("Hall of Fame Standards by Position:")
print(position_standards.to_string(index=False))

# Active players vs HOF standards (WAR figures approximate)
active_players = pd.DataFrame({
    'Name': ['Mike Trout', 'Mookie Betts', 'Francisco Lindor',
             'Nolan Arenado', 'Jose Ramirez', 'Aaron Judge'],
    'Position': ['CF', 'RF', 'SS', '3B', '3B', 'RF'],
    'Age': [32, 31, 30, 33, 31, 32],
    'Career_WAR': [85.5, 68.2, 51.3, 49.8, 45.6, 42.1],
    'Peak_7yr_WAR': [56.8, 48.3, 38.7, 35.2, 38.9, 35.4]
})
active_players['JAWS'] = (
    (active_players['Career_WAR'] + active_players['Peak_7yr_WAR']) / 2
)

# Merge with standards
active_comparison = active_players.merge(
    position_standards[['Position', 'HOF_Avg_WAR', 'HOF_Avg_JAWS']],
    on='Position'
)
active_comparison['WAR_pct_of_standard'] = (
    active_comparison['Career_WAR'] / active_comparison['HOF_Avg_WAR'] * 100
)
active_comparison['JAWS_pct_of_standard'] = (
    active_comparison['JAWS'] / active_comparison['HOF_Avg_JAWS'] * 100
)
print("\nActive Players vs HOF Standards:")
print(active_comparison[['Name', 'Position', 'Career_WAR', 'HOF_Avg_WAR',
                         'WAR_pct_of_standard',
                         'JAWS_pct_of_standard']].to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Position standards
axes[0].barh(position_standards['Position'], position_standards['HOF_Avg_JAWS'],
             color='steelblue', alpha=0.7)
axes[0].set_xlabel('JAWS')
axes[0].set_title('Average JAWS by Position (HOF Members)')
axes[0].grid(True, alpha=0.3, axis='x')

# Active players comparison
colors = ['green' if x >= 100 else 'orange' if x >= 75 else 'red'
          for x in active_comparison['JAWS_pct_of_standard']]
axes[1].barh(active_comparison['Name'],
             active_comparison['JAWS_pct_of_standard'],
             color=colors, alpha=0.7)
axes[1].axvline(x=100, color='black', linestyle='--', label='HOF Standard')
axes[1].set_xlabel('% of HOF Standard (JAWS)')
axes[1].set_title('Active Players vs Position Standards')
axes[1].legend()
plt.tight_layout()
plt.savefig('jaws_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
Feature Engineering for HOF Prediction
The Hall of Fame election depends on both statistical achievement and narrative factors. Effective prediction models combine:
Statistical Features:
- Career counting stats (hits, home runs, RBI)
- Rate statistics (batting average, OPS+)
- Advanced metrics (WAR, JAWS)
- Awards (MVP, All-Star selections)
- Peak performance metrics
Contextual Features:
- Position played
- Era adjustments
- Postseason performance
- Team success
- Media coverage and reputation
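Assembling the statistical features typically means joining several tables: career batting totals, award counts, and induction status keyed by player ID. A minimal pandas sketch of that join pattern, using tiny hypothetical tables (`a01`, `b02`, `c03` are made-up IDs) standing in for Lahman-style Batting, AwardsPlayers, and HallOfFame data.

```python
import pandas as pd

# Toy stand-ins for Lahman-style tables; all values are illustrative
batting = pd.DataFrame({
    'playerID': ['a01', 'b02', 'c03'],
    'hits': [3000, 1800, 2500],
    'hr': [500, 150, 350],
    'war': [95.0, 30.0, 60.0],
})
awards = pd.DataFrame({'playerID': ['a01', 'a01', 'c03'],
                       'awardID': ['MVP', 'MVP', 'MVP']})
hof = pd.DataFrame({'playerID': ['a01'], 'inducted': ['Y']})

# Count MVP awards per player
mvp_counts = (awards[awards['awardID'] == 'MVP']
              .groupby('playerID').size().reset_index(name='mvp_awards'))

# Left joins keep every candidate, even those with no awards or induction
features = (batting
            .merge(mvp_counts, on='playerID', how='left')
            .merge(hof, on='playerID', how='left'))
features['mvp_awards'] = features['mvp_awards'].fillna(0).astype(int)
features['hof'] = (features['inducted'] == 'Y').astype(int)  # binary label
print(features[['playerID', 'war', 'mvp_awards', 'hof']])
```

The left joins matter: an inner join would silently drop every candidate who never won an award, biasing the training set toward stars.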
# R: HOF Prediction Model
library(randomForest)
library(caret)
library(pROC)

# Create HOF prediction dataset
create_hof_dataset <- function() {
  # Career batting lines
  batting_career <- Batting %>%
    filter(yearID >= 1950) %>%
    group_by(playerID) %>%
    summarise(
      seasons = n(),
      games = sum(G),
      pa = sum(AB + BB + HBP + SF + SH, na.rm = TRUE),
      hits = sum(H, na.rm = TRUE),
      doubles = sum(`2B`, na.rm = TRUE),
      triples = sum(`3B`, na.rm = TRUE),
      hr = sum(HR, na.rm = TRUE),
      rbi = sum(RBI, na.rm = TRUE),
      sb = sum(SB, na.rm = TRUE),
      bb = sum(BB, na.rm = TRUE),
      so = sum(SO, na.rm = TRUE),
      avg = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
      obp = sum(H + BB + HBP, na.rm = TRUE) /
        sum(AB + BB + HBP + SF, na.rm = TRUE),
      slg = (sum(H, na.rm = TRUE) - sum(`2B` + `3B` + HR, na.rm = TRUE) +
               2 * sum(`2B`, na.rm = TRUE) +
               3 * sum(`3B`, na.rm = TRUE) +
               4 * sum(HR, na.rm = TRUE)) / sum(AB, na.rm = TRUE)
    ) %>%
    filter(pa >= 3000)

  # All-Star selections
  allstar_counts <- AllstarFull %>%
    group_by(playerID) %>%
    summarise(allstar_games = n())

  # MVP awards (Lahman records the award as "Most Valuable Player")
  mvp_counts <- AwardsPlayers %>%
    filter(awardID == "Most Valuable Player") %>%
    group_by(playerID) %>%
    summarise(mvp_awards = n())

  # HOF status (players only, excluding managers and executives)
  hof_status <- HallOfFame %>%
    filter(inducted == "Y", category == "Player") %>%
    distinct(playerID, .keep_all = TRUE) %>%
    select(playerID, inducted)

  # Combine datasets
  batting_career %>%
    left_join(allstar_counts, by = "playerID") %>%
    left_join(mvp_counts, by = "playerID") %>%
    left_join(hof_status, by = "playerID") %>%
    mutate(
      allstar_games = ifelse(is.na(allstar_games), 0, allstar_games),
      mvp_awards = ifelse(is.na(mvp_awards), 0, mvp_awards),
      hof = ifelse(!is.na(inducted), 1, 0),
      ops = obp + slg,
      estimated_war = (ops - 0.710) * pa / 20,
      war_per_season = estimated_war / seasons
    ) %>%
    select(-playerID, -inducted) %>%
    filter(!is.na(avg), !is.na(obp), !is.na(slg))
}

hof_data <- create_hof_dataset()

# Split data
set.seed(42)
train_index <- createDataPartition(hof_data$hof, p = 0.8, list = FALSE)
train_data <- hof_data[train_index, ]
test_data <- hof_data[-train_index, ]

# Train Random Forest model
rf_model <- randomForest(
  factor(hof) ~ seasons + games + hits + hr + rbi + sb +
    avg + obp + slg + ops + estimated_war +
    allstar_games + mvp_awards,
  data = train_data,
  ntree = 500,
  importance = TRUE
)

# Predictions
predictions <- predict(rf_model, test_data, type = "prob")[, 2]
pred_class <- ifelse(predictions > 0.5, 1, 0)

# Evaluate model (fix levels so the table is well-formed even if a
# class is absent from the predictions)
confusion_matrix <- confusionMatrix(
  factor(pred_class, levels = c(0, 1)),
  factor(test_data$hof, levels = c(0, 1))
)
print("HOF Prediction Model Performance:")
print(confusion_matrix)

# Feature importance
importance_df <- data.frame(
  feature = rownames(importance(rf_model)),
  importance = importance(rf_model)[, "MeanDecreaseGini"]
) %>%
  arrange(desc(importance))
cat("\nFeature Importance:\n")
print(head(importance_df, 10))

# ROC curve
roc_obj <- roc(test_data$hof, predictions)
auc_value <- auc(roc_obj)
plot(roc_obj, main = paste("ROC Curve (AUC =", round(auc_value, 3), ")"),
     col = "blue", lwd = 2)
abline(a = 0, b = 1, lty = 2, col = "red")

# Variable importance plot
varImpPlot(rf_model, main = "Feature Importance for HOF Prediction")
Logistic Regression Approach
# Python: Logistic Regression HOF Model
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve)
import matplotlib.pyplot as plt
import seaborn as sns

# Create synthetic HOF dataset (in practice, use real data)
np.random.seed(42)

def generate_hof_data(n_samples=500):
    """Generate synthetic HOF candidate data."""
    data = {
        'seasons': np.random.randint(10, 25, n_samples),
        'hits': np.random.randint(500, 3500, n_samples),
        'hr': np.random.randint(50, 750, n_samples),
        'rbi': np.random.randint(200, 2300, n_samples),
        'avg': np.random.uniform(0.240, 0.340, n_samples),
        'war': np.random.uniform(10, 150, n_samples),
        'allstar': np.random.randint(0, 15, n_samples),
        'mvp': np.random.randint(0, 3, n_samples),
        'gold_glove': np.random.randint(0, 12, n_samples)
    }
    df = pd.DataFrame(data)

    # Derived features
    df['avg_war_per_season'] = df['war'] / df['seasons']
    df['hr_per_season'] = df['hr'] / df['seasons']
    df['peak_indicator'] = (df['avg_war_per_season'] > 5).astype(int)

    # Generate HOF status from a weighted feature score plus noise
    hof_score = (
        0.3 * (df['war'] - df['war'].mean()) / df['war'].std() +
        0.2 * (df['hits'] - df['hits'].mean()) / df['hits'].std() +
        0.15 * (df['hr'] - df['hr'].mean()) / df['hr'].std() +
        0.15 * df['allstar'] / 15 +
        0.2 * df['mvp'] / 3 +
        np.random.normal(0, 0.3, n_samples)
    )
    df['hof'] = (hof_score > 0.5).astype(int)
    return df

# Generate data
hof_df = generate_hof_data(500)
print("Dataset Overview:")
print(hof_df.head(10))
print(f"\nHOF Inductees: {hof_df['hof'].sum()} ({hof_df['hof'].mean()*100:.1f}%)")

# Prepare features and target
feature_cols = ['seasons', 'hits', 'hr', 'rbi', 'avg', 'war',
                'allstar', 'mvp', 'gold_glove', 'avg_war_per_season',
                'hr_per_season', 'peak_indicator']
X = hof_df[feature_cols]
y = hof_df['hof']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(random_state=42, max_iter=1000, C=0.1)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print("\nModel Performance:")
print(classification_report(y_test, y_pred, target_names=['Not HOF', 'HOF']))
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Feature coefficients
coef_df = pd.DataFrame({
    'Feature': feature_cols,
    'Coefficient': log_reg.coef_[0]
}).sort_values('Coefficient', ascending=False)
print("\nFeature Coefficients (Standardized):")
print(coef_df.to_string(index=False))

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xlabel('Predicted')

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, 'b-', label=f'ROC (AUC = {auc_score:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Feature coefficients (sign shows direction of effect)
axes[1, 0].barh(coef_df['Feature'], coef_df['Coefficient'],
                color=['green' if x > 0 else 'red' for x in coef_df['Coefficient']])
axes[1, 0].set_xlabel('Coefficient')
axes[1, 0].set_title('Feature Importance')
axes[1, 0].axvline(x=0, color='black', linestyle='-', linewidth=0.5)

# Predicted probability distribution
axes[1, 1].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5,
                label='Not HOF', color='red')
axes[1, 1].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5,
                label='HOF', color='green')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Predicted Probability Distribution')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('hof_prediction_model.png', dpi=300, bbox_inches='tight')
plt.show()

# Cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-Validation AUC Scores: {cv_scores}")
print(f"Mean AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
# Predictions
y_pred = log_reg.predict(X_test_scaled)
y_pred_proba = log_reg.predict_proba(X_test_scaled)[:, 1]
# Evaluation
print("\nModel Performance:")
print(classification_report(y_test, y_pred, target_names=['Not HOF', 'HOF']))
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Feature coefficients
coef_df = pd.DataFrame({
'Feature': feature_cols,
'Coefficient': log_reg.coef_[0]
}).sort_values('Coefficient', ascending=False)
print("\nFeature Coefficients (Standardized):")
print(coef_df.to_string(index=False))
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Confusion Matrix
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0])
axes[0, 0].set_title('Confusion Matrix')
axes[0, 0].set_ylabel('Actual')
axes[0, 0].set_xlabel('Predicted')
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)
axes[0, 1].plot(fpr, tpr, 'b-', label=f'ROC (AUC = {auc_score:.3f})')
axes[0, 1].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Feature Importance
axes[1, 0].barh(coef_df['Feature'], coef_df['Coefficient'],
color=['green' if x > 0 else 'red' for x in coef_df['Coefficient']])
axes[1, 0].set_xlabel('Coefficient')
axes[1, 0].set_title('Feature Importance')
axes[1, 0].axvline(x=0, color='black', linestyle='-', linewidth=0.5)
# Predicted Probability Distribution
axes[1, 1].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.5,
label='Not HOF', color='red')
axes[1, 1].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.5,
label='HOF', color='green')
axes[1, 1].axvline(x=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Predicted Probability')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Predicted Probability Distribution')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('hof_prediction_model.png', dpi=300, bbox_inches='tight')
plt.show()
# Cross-validation
cv_scores = cross_val_score(log_reg, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print(f"\nCross-Validation AUC Scores: {cv_scores}")
print(f"Mean AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
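Once a model and scaler are fitted as above, scoring a new candidate is a transform-then-predict step. The sketch below is self-contained (it refits a toy two-feature model rather than reusing the objects above), and the candidate's stat line is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy training set: two features (career WAR, All-Star selections); induction
# is loosely driven by both, plus noise. All numbers are illustrative.
X = np.column_stack([rng.uniform(10, 150, 300), rng.integers(0, 15, 300)])
y = (X[:, 0] + 5 * X[:, 1] + rng.normal(0, 20, 300) > 120).astype(int)

# A pipeline keeps the scaler paired with the classifier, so a new candidate
# is standardized with the training-set statistics automatically.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Hypothetical candidate: 65 career WAR, 8 All-Star selections.
candidate = np.array([[65.0, 8.0]])
prob = model.predict_proba(candidate)[0, 1]
print(f"Estimated HOF probability: {prob:.3f}")
```

Bundling the scaler into the pipeline avoids the common mistake of forgetting to standardize new inputs before prediction.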
MVP Award Prediction
MVP voting is influenced by both performance and narrative. Key predictive factors include:
- WAR: Single best predictor
- Team Performance: Playoff teams heavily favored
- Counting Stats: HR, RBI still matter to voters
- Position: Premium positions get bonus consideration
- Media Narrative: Difficult to quantify but impactful
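A first pass at combining these factors is a weighted linear score. The weights below are illustrative assumptions (not fitted values), each term roughly scaled to the 0-1 range:

```python
# Hypothetical weights for folding MVP-voting factors into one score;
# the weights and scaling constants are assumptions for illustration.
def mvp_score(war, hr, rbi, avg, playoff_team):
    """Weighted combination of MVP factors, each roughly normalized to [0, 1]."""
    return (0.35 * war / 10
            + 0.20 * hr / 50
            + 0.15 * rbi / 120
            + 0.10 * (avg - 0.250) / 0.050
            + 0.20 * playoff_team)

# The same stat line on a playoff team vs. a non-playoff team: the playoff
# bonus alone moves the score by a flat 0.20.
contender = mvp_score(war=7.5, hr=42, rbi=110, avg=0.305, playoff_team=1)
also_ran = mvp_score(war=7.5, hr=42, rbi=110, avg=0.305, playoff_team=0)
print(round(contender, 3), round(also_ran, 3))
```

The simulation code that follows builds a similar score, with noise added to mimic the unquantifiable narrative component.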
# R: MVP Award Prediction
library(nnet)
library(MASS)
# Create MVP prediction dataset (simplified example)
set.seed(123)
mvp_data <- data.frame(
player = paste("Player", 1:200),
war = rnorm(200, mean = 5, sd = 2),
hr = rpois(200, lambda = 25),
rbi = rpois(200, lambda = 90),
avg = rnorm(200, mean = 0.275, sd = 0.030),
team_wins = rpois(200, lambda = 81),
playoff_team = rbinom(200, 1, 0.33)
) %>%
mutate(
war = pmax(0, war),
avg = pmin(0.350, pmax(0.200, avg)),
# MVP probability influenced by multiple factors
mvp_score = 0.35 * (war / 10) +
0.20 * (hr / 50) +
0.15 * (rbi / 120) +
0.10 * ((avg - 0.250) / 0.050) +
0.20 * playoff_team +
rnorm(200, 0, 0.15)
) %>%
arrange(desc(mvp_score)) %>%
mutate(
mvp_rank = row_number(),
mvp_top3 = ifelse(mvp_rank <= 3, 1, 0),
mvp_winner = ifelse(mvp_rank == 1, 1, 0)
)
# Multinomial logistic regression for MVP voting
mvp_data$mvp_category <- cut(mvp_data$mvp_rank,
breaks = c(0, 1, 3, 10, Inf),
labels = c("Winner", "Top3", "Top10", "Other"))
mvp_model <- multinom(
mvp_category ~ war + hr + rbi + avg + team_wins + playoff_team,
data = mvp_data
)
summary(mvp_model)
# Predict MVP likelihood
mvp_predictions <- predict(mvp_model, mvp_data, type = "probs")
mvp_data$pred_winner_prob <- mvp_predictions[, "Winner"]
# Top MVP candidates
top_mvp_candidates <- mvp_data %>%
arrange(desc(pred_winner_prob)) %>%
head(10) %>%
select(player, war, hr, rbi, avg, team_wins, playoff_team,
mvp_rank, pred_winner_prob)
print("Top MVP Candidates:")
print(top_mvp_candidates)
# Feature impact analysis
mvp_linear <- lm(mvp_score ~ war + hr + rbi + avg + playoff_team,
data = mvp_data)
cat("\nFeature Impact on MVP Score:\n")
print(summary(mvp_linear)$coefficients)
# Visualization
par(mfrow = c(2, 2))
# WAR vs MVP Score
plot(mvp_data$war, mvp_data$mvp_score,
xlab = "WAR", ylab = "MVP Score",
main = "WAR vs MVP Score",
col = ifelse(mvp_data$playoff_team == 1, "darkgreen", "gray"),
pch = 19)
legend("topleft", legend = c("Playoff Team", "Non-Playoff"),
col = c("darkgreen", "gray"), pch = 19)
# Home Runs vs MVP Score
plot(mvp_data$hr, mvp_data$mvp_score,
xlab = "Home Runs", ylab = "MVP Score",
main = "Home Runs vs MVP Score",
col = "steelblue", pch = 19)
# Team Wins distribution
boxplot(team_wins ~ mvp_category, data = mvp_data,
xlab = "MVP Category", ylab = "Team Wins",
main = "Team Wins by MVP Category",
col = "lightblue")
# Predicted probability distribution
hist(mvp_data$pred_winner_prob, breaks = 30,
xlab = "Predicted MVP Winner Probability",
main = "Distribution of MVP Probabilities",
col = "steelblue")
Cy Young Award Prediction
# Python: Cy Young Prediction for Pitchers
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
import seaborn as sns
# Generate synthetic Cy Young candidate data
np.random.seed(42)
def generate_cy_young_data(n_samples=300):
"""Generate synthetic pitcher data for Cy Young prediction"""
data = {
'wins': np.random.randint(5, 25, n_samples),
'losses': np.random.randint(2, 15, n_samples),
'era': np.random.uniform(2.0, 5.5, n_samples),
'innings': np.random.uniform(100, 250, n_samples),
'strikeouts': np.random.randint(100, 350, n_samples),
'whip': np.random.uniform(0.90, 1.60, n_samples),
'war': np.random.uniform(1, 10, n_samples),
'saves': np.random.randint(0, 50, n_samples),
'team_wins': np.random.randint(60, 105, n_samples)
}
df = pd.DataFrame(data)
# Calculate derivative features
df['win_pct'] = df['wins'] / (df['wins'] + df['losses'])
df['k_per_9'] = (df['strikeouts'] / df['innings']) * 9
df['k_per_bb'] = df['strikeouts'] / (df['innings'] * 0.35)  # rough BB estimate (~3.2 BB/9); dividing by strikeouts * 0.3 would make this a constant
df['is_closer'] = (df['saves'] > 20).astype(int)
# Generate Cy Young winner based on features
cy_score = (
0.25 * (10 - df['era']) / 3 + # Lower ERA better
0.25 * df['war'] / 10 +
0.15 * df['wins'] / 25 +
0.15 * df['strikeouts'] / 350 +
0.10 * (1.5 - df['whip']) / 0.6 + # Lower WHIP better
0.10 * (df['team_wins'] - 60) / 45 +
np.random.normal(0, 0.2, n_samples)
)
df['cy_young_winner'] = (cy_score > np.percentile(cy_score, 97)).astype(int)
return df
# Generate data
cy_young_df = generate_cy_young_data(300)
print("Cy Young Dataset Overview:")
print(cy_young_df.describe())
print(f"\nCy Young Winners: {cy_young_df['cy_young_winner'].sum()}")
# Prepare features
feature_cols = ['wins', 'era', 'innings', 'strikeouts', 'whip', 'war',
'saves', 'team_wins', 'win_pct', 'k_per_9', 'is_closer']
X = cy_young_df[feature_cols]
y = cy_young_df['cy_young_winner']
# Split and scale
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Gradient Boosting model
gb_model = GradientBoostingClassifier(
n_estimators=100,
learning_rate=0.1,
max_depth=3,
random_state=42
)
gb_model.fit(X_train_scaled, y_train)
# Predictions
y_pred = gb_model.predict(X_test_scaled)
y_pred_proba = gb_model.predict_proba(X_test_scaled)[:, 1]
# Evaluation
print("\nCy Young Prediction Model Performance:")
print(classification_report(y_test, y_pred, target_names=['Not Winner', 'Winner']))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_pred_proba):.3f}")
# Feature importance
feature_importance = pd.DataFrame({
'Feature': feature_cols,
'Importance': gb_model.feature_importances_
}).sort_values('Importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance.to_string(index=False))
# Predict on all data for ranking
all_predictions = gb_model.predict_proba(scaler.transform(X))[:, 1]
cy_young_df['cy_young_prob'] = all_predictions
cy_young_df['cy_young_rank'] = cy_young_df['cy_young_prob'].rank(ascending=False)
# Top candidates
top_candidates = cy_young_df.nsmallest(10, 'cy_young_rank')[
['wins', 'era', 'strikeouts', 'war', 'cy_young_prob', 'cy_young_winner']
]
print("\nTop 10 Cy Young Candidates:")
print(top_candidates.to_string(index=False))
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Feature Importance
axes[0, 0].barh(feature_importance['Feature'], feature_importance['Importance'],
color='steelblue')
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Feature Importance for Cy Young Prediction')
# ERA vs WAR with winner highlighting
scatter = axes[0, 1].scatter(cy_young_df['era'], cy_young_df['war'],
c=cy_young_df['cy_young_prob'],
cmap='RdYlGn', s=50, alpha=0.6)
axes[0, 1].set_xlabel('ERA')
axes[0, 1].set_ylabel('WAR')
axes[0, 1].set_title('ERA vs WAR (colored by Cy Young probability)')
plt.colorbar(scatter, ax=axes[0, 1])
axes[0, 1].invert_xaxis() # Lower ERA is better
# Wins distribution
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==0]['wins'],
bins=15, alpha=0.5, label='Not Winner', color='gray')
axes[1, 0].hist(cy_young_df[cy_young_df['cy_young_winner']==1]['wins'],
bins=15, alpha=0.5, label='Winner', color='gold')
axes[1, 0].set_xlabel('Wins')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Win Distribution by Cy Young Status')
axes[1, 0].legend()
# Predicted probability vs actual
axes[1, 1].scatter(range(len(y_test)), y_pred_proba,
c=y_test, cmap='RdYlGn', s=50, alpha=0.6)
axes[1, 1].axhline(y=0.5, color='black', linestyle='--', label='Threshold')
axes[1, 1].set_xlabel('Sample Index')
axes[1, 1].set_ylabel('Predicted Probability')
axes[1, 1].set_title('Predicted Cy Young Probabilities (Test Set)')
axes[1, 1].legend()
plt.tight_layout()
plt.savefig('cy_young_prediction.png', dpi=300, bbox_inches='tight')
plt.show()
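With winners defined as roughly the top 3% of scores, this is a sharply imbalanced classification problem, and accuracy alone is misleading: always predicting "not winner" already scores about 97%. A self-contained sketch on toy imbalanced data (generated with `make_classification`, not the pitcher features above) showing why average precision is the more informative metric:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split

# Imbalanced toy problem: ~3% positives, mimicking the award-winner rate.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.97],
                           random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# A majority-class guesser already reaches ~97% accuracy here, so accuracy
# says little; average precision (area under the PR curve) is more honest,
# since its baseline is the positive rate rather than 0.5.
print(f"Accuracy:                 {accuracy_score(y_te, clf.predict(X_te)):.3f}")
print(f"Average precision:        {average_precision_score(y_te, proba):.3f}")
print(f"Baseline (positive rate): {y_te.mean():.3f}")
```

For award prediction, ranking candidates by predicted probability (as done above) and checking whether the true winner lands near the top is usually more meaningful than any fixed-threshold metric.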
Aging curves reveal how player performance changes over time, helping teams make personnel decisions and predict future performance.
Delta Method for Aging Curves
The delta method tracks year-over-year changes for the same players, controlling for selection bias.
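The mechanics of the delta method fit in a few lines of pandas. A toy sketch with invented OPS values (two players, consecutive seasons) before the full Lahman analysis:

```python
import pandas as pd

# Toy player-season OPS data; the values are invented to show the mechanics.
seasons = pd.DataFrame({
    'playerID': ['a', 'a', 'a', 'b', 'b', 'b'],
    'age':      [26,   27,  28,  27,  28,  29],
    'ops':      [.780, .810, .790, .850, .845, .800],
})

# Delta method: within each player, take the change from the previous season,
# then average those within-player changes at each (ending) age. Averaging
# changes rather than levels controls for who happens to be in the sample.
seasons = seasons.sort_values(['playerID', 'age'])
seasons['ops_change'] = seasons.groupby('playerID')['ops'].diff()
delta_by_age = (seasons.dropna(subset=['ops_change'])
                       .groupby('age')['ops_change'].mean())
print(delta_by_age)
```

Here age 28 averages player a's decline (-.020) with player b's (-.005), giving -.0125; the full R analysis below applies the same grouping logic to real seasons.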
# R: Aging Curves Analysis
library(mgcv) # For smoothing
# Create aging curve dataset
create_aging_data <- function() {
batting_yearly <- Batting %>%
filter(yearID >= 1980, AB >= 300) %>%
left_join(People %>% select(playerID, birthYear, birthMonth),
by = "playerID") %>%
mutate(
age = yearID - birthYear -
ifelse(is.na(birthMonth) | birthMonth > 6, 1, 0),
pa = AB + BB + HBP + SF + SH,
obp = (H + BB + HBP) / (AB + BB + HBP + SF),
slg = (H - `2B` - `3B` - HR + 2*`2B` + 3*`3B` + 4*HR) / AB,
ops = obp + slg,
iso = slg - (H / AB)
) %>%
filter(age >= 20, age <= 40, !is.na(ops)) %>%
select(playerID, yearID, age, pa, obp, slg, ops, iso, HR, SB)
return(batting_yearly)
}
aging_data <- create_aging_data()
# Calculate average performance by age
age_performance <- aging_data %>%
group_by(age) %>%
summarise(
n_players = n(),
avg_ops = mean(ops, na.rm = TRUE),
avg_iso = mean(iso, na.rm = TRUE),
avg_hr = mean(HR, na.rm = TRUE),
avg_sb = mean(SB, na.rm = TRUE),
se_ops = sd(ops, na.rm = TRUE) / sqrt(n_players)
)
print("Performance by Age:")
print(age_performance)
# Fit smooth aging curve using GAM
gam_model <- gam(ops ~ s(age, bs = "cs"), data = aging_data)
# Predict smooth curve
age_range <- data.frame(age = 20:40)
age_range$predicted_ops <- predict(gam_model, age_range)
# Find peak age
peak_age <- age_range$age[which.max(age_range$predicted_ops)]
peak_ops <- max(age_range$predicted_ops)
print(paste("Peak performance age:", peak_age))
print(paste("Peak OPS:", round(peak_ops, 3)))
# Visualize aging curve
ggplot() +
geom_point(data = age_performance, aes(x = age, y = avg_ops),
size = 2, alpha = 0.6) +
geom_errorbar(data = age_performance,
aes(x = age, ymin = avg_ops - se_ops, ymax = avg_ops + se_ops),
width = 0.3, alpha = 0.4) +
geom_line(data = age_range, aes(x = age, y = predicted_ops),
color = "red", size = 1.2) +
geom_vline(xintercept = peak_age, linetype = "dashed", color = "blue") +
annotate("text", x = peak_age + 1, y = min(age_performance$avg_ops),
label = paste("Peak:", peak_age), color = "blue") +
labs(title = "Baseball Aging Curve (OPS)",
subtitle = "Smoothed curve shows typical performance trajectory",
x = "Age",
y = "OPS") +
theme_minimal()
# Delta method: year-over-year change
delta_data <- aging_data %>%
arrange(playerID, yearID) %>%
group_by(playerID) %>%
mutate(
next_age = lead(age),
next_ops = lead(ops),
age_diff = next_age - age,
ops_change = next_ops - ops
) %>%
filter(age_diff == 1) %>% # Only consecutive years
ungroup()
# Average delta by age
age_delta <- delta_data %>%
group_by(age) %>%
summarise(
n = n(),
avg_delta = mean(ops_change, na.rm = TRUE),
se_delta = sd(ops_change, na.rm = TRUE) / sqrt(n)
)
cat("\nYear-over-Year Change in OPS:\n")
print(age_delta)
# Plot delta curve
ggplot(age_delta, aes(x = age, y = avg_delta)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
geom_line(color = "darkred", size = 1.2) +
geom_point(size = 2) +
geom_errorbar(aes(ymin = avg_delta - se_delta, ymax = avg_delta + se_delta),
width = 0.3, alpha = 0.5) +
labs(title = "Year-over-Year Change in OPS by Age",
subtitle = "Delta method controls for selection bias",
x = "Age",
y = "Change in OPS from Previous Year") +
theme_minimal()
Position-Specific and Skill-Specific Aging
# Python: Skill-Specific Aging Curves
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import UnivariateSpline
# Generate synthetic career trajectory data
np.random.seed(42)
def generate_career_data(n_players=200):
"""Generate synthetic career data with realistic aging patterns"""
careers = []
for player_id in range(n_players):
# Random career span
debut_age = np.random.randint(20, 25)
career_length = np.random.randint(8, 18)
# Player quality (talent level)
talent = np.random.normal(100, 15)
for year_in_career in range(career_length):
age = debut_age + year_in_career
# Different aging patterns for different skills
# Power peaks later and declines faster
power_modifier = 1.0 + (age - 27) * 0.02 if age < 27 else 1.0 - (age - 27) * 0.04
power = max(0, talent * 0.8 * power_modifier + np.random.normal(0, 5))
# Speed declines earlier and more gradually
speed_modifier = 1.0 + (age - 25) * 0.01 if age < 25 else 1.0 - (age - 25) * 0.06
speed = max(0, talent * 1.2 * speed_modifier + np.random.normal(0, 5))
# Contact skills more stable
contact_modifier = 1.0 + (age - 29) * 0.01 if age < 29 else 1.0 - (age - 29) * 0.02
contact = max(0, talent * contact_modifier + np.random.normal(0, 5))
careers.append({
'player_id': player_id,
'age': age,
'power': power,
'speed': speed,
'contact': contact,
'overall': (power + speed + contact) / 3
})
return pd.DataFrame(careers)
# Generate data
career_df = generate_career_data(200)
print("Career Data Sample:")
print(career_df.head(20))
# Calculate average by age for each skill
aging_curves = career_df.groupby('age').agg({
'power': ['mean', 'std', 'count'],
'speed': ['mean', 'std', 'count'],
'contact': ['mean', 'std', 'count'],
'overall': ['mean', 'std', 'count']
}).reset_index()
aging_curves.columns = ['age', 'power_mean', 'power_std', 'power_n',
'speed_mean', 'speed_std', 'speed_n',
'contact_mean', 'contact_std', 'contact_n',
'overall_mean', 'overall_std', 'overall_n']
print("\nAging Curves by Skill:")
print(aging_curves)
# Smooth curves using splines
ages = aging_curves['age'].values
skills = ['power', 'speed', 'contact', 'overall']
smoothed = {}
for skill in skills:
values = aging_curves[f'{skill}_mean'].values
spline = UnivariateSpline(ages, values, s=50)
smoothed[skill] = spline(ages)
# Find peak
peak_idx = np.argmax(smoothed[skill])
peak_age = ages[peak_idx]
print(f"{skill.capitalize()} peaks at age {peak_age}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Skill-specific aging curves
ax = axes[0, 0]
for skill in ['power', 'speed', 'contact']:
ax.plot(ages, smoothed[skill], marker='o', label=skill.capitalize(), linewidth=2)
ax.set_xlabel('Age')
ax.set_ylabel('Skill Level')
ax.set_title('Skill-Specific Aging Curves')
ax.legend()
ax.grid(True, alpha=0.3)
# Decline rates by age
ax = axes[0, 1]
for skill in ['power', 'speed', 'contact']:
values = smoothed[skill]
peak_value = np.max(values)
decline_pct = ((values - peak_value) / peak_value) * 100
ax.plot(ages, decline_pct, marker='o', label=skill.capitalize(), linewidth=2)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('% Change from Peak')
ax.set_title('Performance Decline from Peak')
ax.legend()
ax.grid(True, alpha=0.3)
# Distribution by age group
ax = axes[1, 0]
age_groups = pd.cut(career_df['age'], bins=[19, 25, 29, 33, 41],
labels=['20-25', '26-29', '30-33', '34+'])
career_df['age_group'] = age_groups
age_group_means = career_df.groupby('age_group')[['power', 'speed', 'contact']].mean()
age_group_means.plot(kind='bar', ax=ax, color=['#e74c3c', '#3498db', '#2ecc71'])
ax.set_xlabel('Age Group')
ax.set_ylabel('Skill Level')
ax.set_title('Average Skills by Age Group')
ax.legend(title='Skill')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# Year-over-year change
ax = axes[1, 1]
career_sorted = career_df.sort_values(['player_id', 'age'])
yoy_changes = career_sorted.assign(
overall_change=career_sorted.groupby('player_id')['overall'].diff()
)
yoy_by_age = yoy_changes.dropna(subset=['overall_change']).groupby('age')['overall_change'].agg(['mean', 'std'])
ax.plot(yoy_by_age.index, yoy_by_age['mean'], 'o-', color='darkred', linewidth=2)
ax.fill_between(yoy_by_age.index,
yoy_by_age['mean'] - yoy_by_age['std'],
yoy_by_age['mean'] + yoy_by_age['std'],
alpha=0.3, color='red')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('Year-over-Year Change')
ax.set_title('Annual Performance Change')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('skill_aging_curves.png', dpi=300, bbox_inches='tight')
plt.show()
# Statistical summary of decline rates
print("\n=== Decline Rate Analysis ===")
for skill in ['power', 'speed', 'contact']:
peak_idx = np.argmax(smoothed[skill])
peak_age = ages[peak_idx]
peak_value = smoothed[skill][peak_idx]
# Age 35 value (typical late career)
age_35_idx = np.where(ages == 35)[0]
if len(age_35_idx) > 0:
age_35_value = smoothed[skill][age_35_idx[0]]
decline = ((age_35_value - peak_value) / peak_value) * 100
print(f"{skill.capitalize()}: Peak at {peak_age}, decline to age 35: {decline:.1f}%")
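A quick cross-check on spline-based peak estimates is a quadratic fit, since aging curves are roughly parabolic near the peak: for y = a*x^2 + b*x + c the vertex sits at x = -b/(2a). A minimal sketch on a noise-free toy curve with a known peak at age 27:

```python
import numpy as np

# Deterministic toy aging curve: a downward parabola peaking at age 27.
ages = np.arange(21, 36)
performance = 100 - 0.4 * (ages - 27.0) ** 2

# Fit a quadratic and recover the peak age in closed form from the vertex.
a, b, c = np.polyfit(ages, performance, deg=2)
estimated_peak = -b / (2 * a)
print(f"Estimated peak age: {estimated_peak:.1f}")
```

On real, noisy data the quadratic and spline estimates will disagree somewhat; a large gap between them is a useful sign that the curve is asymmetric (e.g. slow rise, fast decline) and a parabola is too rigid.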
values = aging_curves[f'{skill}_mean'].values
spline = UnivariateSpline(ages, values, s=50)
smoothed[skill] = spline(ages)
# Find peak
peak_idx = np.argmax(smoothed[skill])
peak_age = ages[peak_idx]
print(f"{skill.capitalize()} peaks at age {peak_age}")
# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Skill-specific aging curves
ax = axes[0, 0]
for skill in ['power', 'speed', 'contact']:
ax.plot(ages, smoothed[skill], marker='o', label=skill.capitalize(), linewidth=2)
ax.set_xlabel('Age')
ax.set_ylabel('Skill Level')
ax.set_title('Skill-Specific Aging Curves')
ax.legend()
ax.grid(True, alpha=0.3)
# Decline rates by age
ax = axes[0, 1]
for skill in ['power', 'speed', 'contact']:
values = smoothed[skill]
peak_value = np.max(values)
decline_pct = ((values - peak_value) / peak_value) * 100
ax.plot(ages, decline_pct, marker='o', label=skill.capitalize(), linewidth=2)
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('% Change from Peak')
ax.set_title('Performance Decline from Peak')
ax.legend()
ax.grid(True, alpha=0.3)
# Distribution by age group
ax = axes[1, 0]
age_groups = pd.cut(career_df['age'], bins=[19, 25, 29, 33, 41],
labels=['20-25', '26-29', '30-33', '34+'])
career_df['age_group'] = age_groups
age_group_means = career_df.groupby('age_group')[['power', 'speed', 'contact']].mean()
age_group_means.plot(kind='bar', ax=ax, color=['#e74c3c', '#3498db', '#2ecc71'])
ax.set_xlabel('Age Group')
ax.set_ylabel('Skill Level')
ax.set_title('Average Skills by Age Group')
ax.legend(title='Skill')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
# Year-over-year change
ax = axes[1, 1]
yoy_changes = career_df.sort_values(['player_id', 'age']).copy()
yoy_changes['overall_change'] = yoy_changes.groupby('player_id')['overall'].diff()
yoy_by_age = yoy_changes.dropna().groupby('age')['overall_change'].agg(['mean', 'std'])
ax.plot(yoy_by_age.index, yoy_by_age['mean'], 'o-', color='darkred', linewidth=2)
ax.fill_between(yoy_by_age.index,
yoy_by_age['mean'] - yoy_by_age['std'],
yoy_by_age['mean'] + yoy_by_age['std'],
alpha=0.3, color='red')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Age')
ax.set_ylabel('Year-over-Year Change')
ax.set_title('Annual Performance Change')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('skill_aging_curves.png', dpi=300, bbox_inches='tight')
plt.show()
# Statistical summary of decline rates
print("\n=== Decline Rate Analysis ===")
for skill in ['power', 'speed', 'contact']:
peak_idx = np.argmax(smoothed[skill])
peak_age = ages[peak_idx]
peak_value = smoothed[skill][peak_idx]
# Age 35 value (typical late career)
age_35_idx = np.where(ages == 35)[0]
if len(age_35_idx) > 0:
age_35_value = smoothed[skill][age_35_idx[0]]
decline = ((age_35_value - peak_value) / peak_value) * 100
print(f"{skill.capitalize()}: Peak at {peak_age}, decline to age 35: {decline:.1f}%")
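A quick cross-check on the spline peaks: fit a concave quadratic and read the peak age off the vertex at -c1 / (2 * c2). A sketch on synthetic data (the curve parameters here are assumptions, not the smoothed values above):

```python
import numpy as np

ages = np.arange(21, 38)
true_peak = 27.0
ops = -0.0015 * (ages - true_peak) ** 2 + 0.780  # concave-down synthetic curve

c2, c1, c0 = np.polyfit(ages, ops, 2)  # highest-degree coefficient first
peak_age = -c1 / (2 * c2)              # vertex of the fitted parabola
print(f"Quadratic-fit peak age: {peak_age:.1f}")
```

The closed-form vertex makes the peak estimate reproducible, whereas a spline's argmax shifts with the smoothing parameter `s`.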
What-if analysis explores hypothetical scenarios: career-ending injuries, steroid era adjustments, or different team contexts.
Injury Impact Analysis
# R: Career Projection with Injury Scenarios
# (no additional packages needed: loess() lives in base R's stats, ggplot2 is already loaded)
# Create career projection function
project_career <- function(age_values, war_values, project_to_age = 40) {
career_df <- data.frame(age = age_values, war = war_values)
# Fit aging curve model
model <- loess(war ~ age, data = career_df, span = 0.75)
# Project remaining years; seq_len() guards against R's backward-counting
# `:` when the player is already at or past project_to_age
future_ages <- max(age_values) + seq_len(max(0, project_to_age - max(age_values)))
if (length(future_ages) > 0) {
projected_war <- predict(model, newdata = data.frame(age = future_ages))
# Apply decline factor
decline_factor <- seq(0.95, 0.5, length.out = length(future_ages))
projected_war <- projected_war * decline_factor
projected_war <- pmax(0, projected_war) # No negative WAR
} else {
projected_war <- numeric(0)
future_ages <- integer(0)
}
list(
actual = career_df,
projected = data.frame(age = future_ages, war = projected_war),
total_war = sum(war_values) + sum(projected_war)
)
}
# Example: Mike Trout career projection
trout_actual <- data.frame(
age = 20:31,
war = c(2.1, 10.2, 10.5, 9.2, 10.0, 10.6, 8.2, 8.6, 8.0, 4.5, 3.5, 1.8)
)
# Scenario 1: No injury (baseline)
scenario_healthy <- project_career(trout_actual$age, trout_actual$war)
# Scenario 2: Injury at age 32 (miss full season)
trout_injury <- trout_actual
scenario_injury <- project_career(trout_injury$age, trout_injury$war)
scenario_injury$total_war <- scenario_injury$total_war - 6 # Lost prime season
# Scenario 3: Career-ending injury at 32
scenario_career_end <- list(
actual = trout_actual,
projected = data.frame(age = integer(0), war = numeric(0)),
total_war = sum(trout_actual$war)
)
# Compare scenarios
comparison <- data.frame(
scenario = c("Healthy Career", "One Year Injury", "Career Ending"),
projected_total_war = c(
scenario_healthy$total_war,
scenario_injury$total_war,
scenario_career_end$total_war
),
hof_probability = c(0.99, 0.90, 0.75)  # illustrative estimates, not model output
)
print("Career Projection Scenarios:")
print(comparison)
# Visualize scenarios
plot_data <- rbind(
data.frame(scenario = "Actual", scenario_healthy$actual),
data.frame(scenario = "Healthy Projection", scenario_healthy$projected),
data.frame(scenario = "Injury Projection", scenario_injury$projected)
)
ggplot(plot_data, aes(x = age, y = war, color = scenario, linetype = scenario)) +
geom_line(size = 1.2) +
geom_point(data = plot_data[plot_data$scenario == "Actual", ], size = 2) +
geom_hline(yintercept = 0, linetype = "dashed", alpha = 0.5) +
scale_color_manual(values = c("Actual" = "black",
"Healthy Projection" = "darkgreen",
"Injury Projection" = "red")) +
scale_linetype_manual(values = c("Actual" = "solid",
"Healthy Projection" = "dashed",
"Injury Projection" = "dotted")) +
labs(title = "Career Projection: Injury Impact Analysis",
subtitle = "Comparing healthy vs injury scenarios",
x = "Age",
y = "WAR",
color = "Scenario",
linetype = "Scenario") +
theme_minimal() +
theme(legend.position = "bottom")
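The three scenarios above are deterministic. A probabilistic variant samples an injury in each remaining season and averages over many simulated careers. The injury probability, decline rate, and starting WAR below are illustrative assumptions, not values taken from the scenarios above:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_remaining_war(start_war=5.0, ages=range(32, 41),
                           injury_prob=0.15, decline=0.85,
                           n_sims=10_000):
    """Average remaining WAR when each season carries an injury risk
    that wipes out that year's production. All parameters illustrative."""
    totals = np.zeros(n_sims)
    season_war = start_war
    for _ in ages:
        healthy = rng.random(n_sims) >= injury_prob  # False = injured year
        totals += np.where(healthy, season_war, 0.0)
        season_war *= decline                        # age-related decline
    return totals.mean()

expected = simulate_remaining_war()
print(f"Expected remaining WAR: {expected:.1f}")
```

Averaging over simulated careers bakes the injury risk into a single expected value, rather than presenting best- and worst-case scenarios side by side.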
Era Adjustment Analysis
# Python: Era-Adjusted Performance Comparison
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Historical context adjustments
era_factors = pd.DataFrame({
'era': ['Dead Ball (1901-1919)', 'Live Ball (1920-1941)',
'Integration (1942-1960)', 'Expansion (1961-1976)',
'Free Agency (1977-1993)', 'Steroid (1994-2005)',
'Post-Steroid (2006-2015)', 'Modern (2016-2023)'],
'start_year': [1901, 1920, 1942, 1961, 1977, 1994, 2006, 2016],
'end_year': [1919, 1941, 1960, 1976, 1993, 2005, 2015, 2023],
'avg_runs_per_game': [3.8, 4.8, 4.3, 4.0, 4.3, 5.0, 4.6, 4.4],
'avg_hr_per_game': [0.15, 0.5, 0.9, 0.95, 0.85, 1.1, 1.05, 1.25],
'offensive_adjustment': [1.20, 1.05, 1.10, 1.15, 1.10, 0.95, 1.05, 1.00]
})
print("Era Adjustment Factors:")
print(era_factors[['era', 'offensive_adjustment']])
# Compare players across eras
historical_seasons = pd.DataFrame({
'player': ['Babe Ruth', 'Ted Williams', 'Willie Mays', 'Mike Schmidt',
'Barry Bonds', 'Albert Pujols', 'Mike Trout'],
'peak_season': [1921, 1941, 1965, 1980, 2001, 2008, 2016],
'hr': [59, 37, 52, 48, 73, 37, 29],
'avg': [.378, .406, .317, .286, .328, .357, .315],
'ops': [1.379, 1.287, 1.053, .974, 1.379, 1.114, 1.004],
'war': [12.9, 10.4, 11.1, 8.5, 12.5, 9.6, 10.5]
})
# Add era information
def assign_era(year):
for _, row in era_factors.iterrows():
if row['start_year'] <= year <= row['end_year']:
return row['era'], row['offensive_adjustment']
return 'Unknown', 1.0
historical_seasons['era'] = historical_seasons['peak_season'].apply(
lambda x: assign_era(x)[0]
)
historical_seasons['era_adjustment'] = historical_seasons['peak_season'].apply(
lambda x: assign_era(x)[1]
)
# Calculate era-adjusted stats
historical_seasons['adjusted_ops'] = (
historical_seasons['ops'] * historical_seasons['era_adjustment']
)
historical_seasons['adjusted_war'] = (
historical_seasons['war'] * historical_seasons['era_adjustment']
)
print("\nEra-Adjusted Performance Comparison:")
print(historical_seasons[['player', 'era', 'ops', 'adjusted_ops',
'war', 'adjusted_war']].to_string(index=False))
# Rank players by adjusted performance
historical_seasons_ranked = historical_seasons.sort_values(
'adjusted_war', ascending=False
)
print("\nRanking by Era-Adjusted WAR:")
for idx, row in historical_seasons_ranked.iterrows():
print(f"{row['player']:20s} {row['adjusted_war']:5.1f} ({row['era']})")
# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Raw vs Adjusted OPS
ax = axes[0, 0]
x = np.arange(len(historical_seasons))
width = 0.35
ax.bar(x - width/2, historical_seasons['ops'], width, label='Raw OPS', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_ops'], width,
label='Adjusted OPS', alpha=0.7)
ax.set_ylabel('OPS')
ax.set_title('Raw vs Era-Adjusted OPS')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Raw vs Adjusted WAR
ax = axes[0, 1]
ax.bar(x - width/2, historical_seasons['war'], width, label='Raw WAR', alpha=0.7)
ax.bar(x + width/2, historical_seasons['adjusted_war'], width,
label='Adjusted WAR', alpha=0.7)
ax.set_ylabel('WAR')
ax.set_title('Raw vs Era-Adjusted WAR')
ax.set_xticks(x)
ax.set_xticklabels(historical_seasons['player'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
# Era offensive levels
ax = axes[1, 0]
ax.plot(era_factors['start_year'], era_factors['avg_runs_per_game'],
'o-', label='Runs/Game', linewidth=2)
ax2 = ax.twinx()
ax2.plot(era_factors['start_year'], era_factors['avg_hr_per_game'],
's-', color='orange', label='HR/Game', linewidth=2)
ax.set_xlabel('Year')
ax.set_ylabel('Runs per Game', color='blue')
ax2.set_ylabel('Home Runs per Game', color='orange')
ax.set_title('Offensive Environment by Era')
ax.tick_params(axis='y', labelcolor='blue')
ax2.tick_params(axis='y', labelcolor='orange')
ax.grid(True, alpha=0.3)
# Adjustment factor timeline
ax = axes[1, 1]
ax.step(era_factors['start_year'], era_factors['offensive_adjustment'],
where='post', linewidth=2, color='darkred')
ax.axhline(y=1.0, color='black', linestyle='--', alpha=0.5)
ax.set_xlabel('Year')
ax.set_ylabel('Adjustment Factor')
ax.set_title('Era Adjustment Factors Over Time')
ax.set_ylim([0.9, 1.25])
ax.grid(True, alpha=0.3)
# Add era labels
for _, row in era_factors.iterrows():
ax.text(row['start_year'], row['offensive_adjustment'] + 0.02,
row['era'].split('(')[0].strip(), rotation=45,
fontsize=7, ha='left')
plt.tight_layout()
plt.savefig('era_adjustments.png', dpi=300, bbox_inches='tight')
plt.show()
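Flat era multipliers are a blunt instrument. An alternative is to standardize each season within its era: express the player's OPS as standard deviations above the league mean for that period. The league means and SDs below are illustrative assumptions, not computed values:

```python
# (player OPS, era mean OPS, era SD of OPS) -- all three illustrative
seasons = {
    "Player A (1921)": (1.379, 0.720, 0.085),
    "Player B (2001)": (1.379, 0.755, 0.080),
}

z_scores = {}
for name, (ops, league_mean, league_sd) in seasons.items():
    z_scores[name] = (ops - league_mean) / league_sd  # SDs above era average
    print(f"{name}: z = {z_scores[name]:+.2f}")
```

Identical raw OPS values can translate into different z-scores once each era's offensive environment is taken into account.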
The "Greatest of All Time" debate is baseball's most entertaining analytical exercise, combining statistics, context, and personal values.
Multi-Dimensional GOAT Analysis
# R: Comprehensive GOAT Framework
library(FactoMineR)
library(factoextra)
# Create comprehensive player comparison
goat_candidates <- data.frame(
player = c("Babe Ruth", "Willie Mays", "Barry Bonds", "Hank Aaron",
"Ted Williams", "Mickey Mantle", "Stan Musial", "Ty Cobb",
"Honus Wagner", "Rogers Hornsby", "Mike Trout"),
# Career value
career_war = c(183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 128.0, 151.4, 130.8, 127.0, 85.5),
peak_war_7yr = c(71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 57.7, 62.1, 57.4, 66.3, 56.8),
# Counting stats (normalized to per 162 games)
hr_per_162 = c(46.3, 35.2, 37.1, 37.2, 37.0, 33.8, 27.3, 4.0, 6.8, 22.8, 37.0),
avg = c(.342, .301, .298, .305, .344, .298, .331, .366, .328, .358, .303),
ops = c(1.164, .941, 1.051, .928, 1.116, .977, .976, .945, .858, 1.010, .988),
# Peak seasons
mvps = c(1, 2, 7, 1, 2, 3, 3, 1, 0, 2, 3),
allstar = c(2, 24, 14, 25, 19, 20, 24, 0, 0, 0, 11),
# Postseason
ws_titles = c(7, 1, 0, 1, 0, 7, 3, 0, 1, 0, 0),
# Longevity
seasons = c(22, 23, 22, 23, 19, 18, 22, 24, 21, 23, 13),
# Era context (offensive environment, 1.0 = neutral)
era_context = c(1.15, 1.05, 0.90, 1.05, 1.10, 1.00, 1.05, 1.20, 1.15, 1.10, 1.00)
)
# Calculate composite scores
goat_candidates <- goat_candidates %>%
mutate(
jaws = (career_war + peak_war_7yr) / 2,
era_adj_war = career_war * era_context,
peak_score = (peak_war_7yr / 75) * 100,
counting_score = (hr_per_162 / 50 + (avg - .250) / .100 +
(ops - .700) / .500) / 3 * 100,
awards_score = (mvps * 10 + allstar * 2) / 2,
longevity_score = pmin(100, seasons * 4),
# Overall GOAT score (weighted combination)
goat_score = 0.40 * era_adj_war / 2 +
0.25 * peak_score +
0.15 * counting_score +
0.10 * awards_score +
0.10 * longevity_score
)
# Rank players
goat_rankings <- goat_candidates %>%
arrange(desc(goat_score)) %>%
mutate(rank = row_number())
print("GOAT Rankings:")
print(goat_rankings %>%
select(rank, player, career_war, jaws, goat_score) %>%
head(10))
# Principal Component Analysis for visualization
pca_data <- goat_candidates %>%
select(career_war, peak_war_7yr, ops, mvps, allstar, seasons)
pca_result <- PCA(pca_data, graph = FALSE)
# Visualize PCA
fviz_pca_biplot(pca_result,
label = "var",
habillage = factor(goat_candidates$player),
addEllipses = FALSE,
col.var = "darkred",
repel = TRUE) +
labs(title = "GOAT Candidates - Principal Component Analysis",
subtitle = "Multi-dimensional comparison of greatest players") +
theme_minimal()
# Create radar chart comparison (top 5)
library(fmsb)
top5 <- goat_rankings %>% head(5)
# Normalize scores to 0-100 scale
radar_data <- top5 %>%
mutate(
war_norm = (career_war / max(career_war)) * 100,
peak_norm = (peak_war_7yr / max(peak_war_7yr)) * 100,
rate_norm = ((ops - min(ops)) / (max(ops) - min(ops))) * 100,
awards_norm = (awards_score / max(awards_score)) * 100,
longevity_norm = (longevity_score / max(longevity_score)) * 100
) %>%
select(player, war_norm, peak_norm, rate_norm, awards_norm, longevity_norm)
# Prepare data for radar chart
radar_matrix <- as.data.frame(t(radar_data[, -1]))
colnames(radar_matrix) <- radar_data$player
radar_matrix <- rbind(rep(100, 5), rep(0, 5), radar_matrix)
# Plot radar chart
colors_border <- c("red", "blue", "green", "orange", "purple")
colors_in <- scales::alpha(colors_border, 0.3)  # namespace the scales function explicitly
par(mar = c(1, 1, 3, 1))
radarchart(radar_matrix,
axistype = 1,
pcol = colors_border,
pfcol = colors_in,
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey",
cglwd = 0.8,
vlcex = 0.8,
title = "Top 5 GOAT Candidates - Radar Comparison")
legend("topright", legend = colnames(radar_matrix),
col = colors_border, lty = 1, lwd = 2, cex = 0.8)
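The `jaws` column above follows Jay Jaffe's JAWS definition: the average of career WAR and WAR over the player's best seven seasons (not necessarily consecutive). A standalone sketch of the computation, reusing the season-by-season WAR line from the injury example:

```python
import numpy as np

# Season-by-season WAR (the age-20-to-31 line from the injury example)
season_war = np.array([2.1, 10.2, 10.5, 9.2, 10.0, 10.6,
                       8.2, 8.6, 8.0, 4.5, 3.5, 1.8])

career_war = season_war.sum()
peak7_war = np.sort(season_war)[-7:].sum()  # best seven seasons, any order
jaws = (career_war + peak7_war) / 2
print(f"Career {career_war:.1f}, Peak-7 {peak7_war:.1f}, JAWS {jaws:.2f}")
```

Averaging career and peak value keeps JAWS from rewarding pure longevity or a short brilliant peak alone.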
Context-Specific GOAT Rankings
# Python: Situational GOAT Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
# Create comprehensive dataset
goat_data = pd.DataFrame({
'player': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
'Ted Williams', 'Mickey Mantle', 'Mike Trout', 'Stan Musial'],
'career_war': [183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 85.5, 128.0],
'peak_war': [71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 56.8, 57.7],
'ops_plus': [207, 156, 182, 155, 190, 172, 176, 159],
'war_per_season': [8.3, 6.8, 7.4, 6.2, 6.4, 6.1, 6.6, 5.8],
'mvps': [1, 2, 7, 1, 2, 3, 3, 3],
'championships': [7, 1, 0, 1, 0, 7, 0, 3],
'iconic_moments': [10, 8, 5, 7, 6, 9, 4, 6], # Subjective
'cultural_impact': [10, 9, 5, 8, 8, 9, 7, 7] # Subjective
})
# Different weighting schemes for different perspectives
weighting_schemes = {
'Pure Statistics': {
'career_war': 0.40,
'peak_war': 0.30,
'ops_plus': 0.20,
'war_per_season': 0.10,
'mvps': 0.0,
'championships': 0.0,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Peak Performance': {
'career_war': 0.20,
'peak_war': 0.40,
'ops_plus': 0.15,
'war_per_season': 0.25,
'mvps': 0.0,
'championships': 0.0,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Career Achievement': {
'career_war': 0.35,
'peak_war': 0.15,
'ops_plus': 0.15,
'war_per_season': 0.10,
'mvps': 0.15,
'championships': 0.10,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Legacy & Impact': {
'career_war': 0.20,
'peak_war': 0.15,
'ops_plus': 0.10,
'war_per_season': 0.05,
'mvps': 0.10,
'championships': 0.15,
'iconic_moments': 0.15,
'cultural_impact': 0.10
},
'Balanced': {
'career_war': 0.25,
'peak_war': 0.20,
'ops_plus': 0.15,
'war_per_season': 0.15,
'mvps': 0.10,
'championships': 0.08,
'iconic_moments': 0.04,
'cultural_impact': 0.03
}
}
# Normalize all metrics to 0-100 scale
metrics = ['career_war', 'peak_war', 'ops_plus', 'war_per_season',
'mvps', 'championships', 'iconic_moments', 'cultural_impact']
goat_normalized = goat_data.copy()
for metric in metrics:
goat_normalized[f'{metric}_norm'] = (
(goat_data[metric] - goat_data[metric].min()) /
(goat_data[metric].max() - goat_data[metric].min()) * 100
)
# Calculate scores for each weighting scheme
results = []
for scheme_name, weights in weighting_schemes.items():
goat_normalized[f'score_{scheme_name}'] = 0
for metric, weight in weights.items():
if weight > 0:
goat_normalized[f'score_{scheme_name}'] += (
goat_normalized[f'{metric}_norm'] * weight
)
# Rank players
scheme_rankings = goat_normalized[['player', f'score_{scheme_name}']].copy()
scheme_rankings = scheme_rankings.sort_values(
f'score_{scheme_name}', ascending=False
)
scheme_rankings['rank'] = range(1, len(scheme_rankings) + 1)
scheme_rankings['scheme'] = scheme_name
results.append(scheme_rankings)
# Combine all rankings
all_rankings = pd.concat(results)
print("GOAT Rankings by Perspective:")
print("=" * 80)
for scheme in weighting_schemes.keys():
scheme_data = all_rankings[all_rankings['scheme'] == scheme]
print(f"\n{scheme}:")
print(scheme_data[['rank', 'player', f'score_{scheme}']].to_string(index=False))
# Visualizations
fig = plt.figure(figsize=(16, 10))
# Ranking heatmap
ax1 = plt.subplot(2, 2, 1)
ranking_pivot = all_rankings.pivot(index='player', columns='scheme', values='rank')
ranking_pivot = ranking_pivot[list(weighting_schemes.keys())] # Order columns
sns.heatmap(ranking_pivot, annot=True, fmt='g', cmap='RdYlGn_r',
cbar_kws={'label': 'Rank'}, ax=ax1)
ax1.set_title('Player Rankings by Perspective')
ax1.set_ylabel('')
# Score heatmap
ax2 = plt.subplot(2, 2, 2)
score_cols = [f'score_{scheme}' for scheme in weighting_schemes.keys()]
score_data = goat_normalized[['player'] + score_cols].set_index('player')
score_data.columns = list(weighting_schemes.keys())
sns.heatmap(score_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax2)
ax2.set_title('GOAT Scores by Perspective')
ax2.set_ylabel('')
# Ranking stability
ax3 = plt.subplot(2, 2, 3)
rank_variance = ranking_pivot.var(axis=1).sort_values()
colors = ['green' if x < 2 else 'orange' if x < 5 else 'red' for x in rank_variance]
ax3.barh(rank_variance.index, rank_variance.values, color=colors, alpha=0.7)
ax3.set_xlabel('Variance in Rankings')
ax3.set_title('Ranking Stability Across Perspectives')
ax3.axvline(x=2, color='green', linestyle='--', alpha=0.5, label='Consistent')
ax3.axvline(x=5, color='red', linestyle='--', alpha=0.5, label='Variable')
ax3.legend()
# Average ranking
ax4 = plt.subplot(2, 2, 4)
avg_ranking = ranking_pivot.mean(axis=1).sort_values()
bars = ax4.barh(avg_ranking.index, avg_ranking.values, alpha=0.7)
ax4.set_xlabel('Average Rank')
ax4.set_title('Consensus GOAT Rankings')
ax4.invert_xaxis()
ax4.set_xlim(ax4.get_xlim()[1], 0.5) # Start from 0.5 to show #1 clearly
# Color gradient for bars
colors_gradient = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(bars)))
for bar, color in zip(bars, colors_gradient):
bar.set_color(color)
plt.tight_layout()
plt.savefig('goat_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Summary statistics
print("\n" + "=" * 80)
print("CONSENSUS RANKINGS (Average across all perspectives):")
print("=" * 80)
consensus = ranking_pivot.mean(axis=1).sort_values()
for rank, (player, avg_rank) in enumerate(consensus.items(), 1):
std = ranking_pivot.loc[player].std()
print(f"{rank}. {player:20s} Avg Rank: {avg_rank:.2f} (σ = {std:.2f})")
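Ranking stability can also be summarized pairwise: Kendall's tau between two perspectives measures how often pairs of players keep their relative order. A self-contained sketch with illustrative ranks (stand-ins, not the computed `ranking_pivot` columns):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / total pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

ranks_stats = [1, 2, 3, 4, 5, 6]    # order under "Pure Statistics"
ranks_legacy = [1, 3, 2, 6, 4, 5]   # order under "Legacy & Impact"

tau = kendall_tau(ranks_stats, ranks_legacy)
print(f"Kendall tau = {tau:.2f}")   # +1 = identical order, -1 = fully reversed
```

A tau near +1 across all scheme pairs would indicate the GOAT ordering is insensitive to the choice of weights.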
# R: Comprehensive GOAT Framework
library(FactoMineR)
library(factoextra)
# Create comprehensive player comparison
goat_candidates <- data.frame(
player = c("Babe Ruth", "Willie Mays", "Barry Bonds", "Hank Aaron",
"Ted Williams", "Mickey Mantle", "Stan Musial", "Ty Cobb",
"Honus Wagner", "Rogers Hornsby", "Mike Trout"),
# Career value
career_war = c(183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 128.0, 151.4, 130.8, 127.0, 85.5),
peak_war_7yr = c(71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 57.7, 62.1, 57.4, 66.3, 56.8),
# Counting stats (normalized to per 162 games)
hr_per_162 = c(46.3, 35.2, 37.1, 37.2, 37.0, 33.8, 27.3, 4.0, 6.8, 22.8, 37.0),
avg = c(.342, .301, .298, .305, .344, .298, .331, .366, .328, .358, .303),
ops = c(1.164, .941, 1.051, .928, 1.116, .977, .976, .945, .858, 1.010, .988),
# Peak seasons
mvps = c(1, 2, 7, 1, 2, 3, 3, 1, 0, 2, 3),
allstar = c(2, 24, 14, 25, 19, 20, 24, 0, 0, 0, 11),
# Postseason
ws_titles = c(7, 1, 0, 1, 0, 7, 3, 0, 1, 0, 0),
# Longevity
seasons = c(22, 23, 22, 23, 19, 18, 22, 24, 21, 23, 13),
# Era context (offensive environment, 1.0 = neutral)
era_context = c(1.15, 1.05, 0.90, 1.05, 1.10, 1.00, 1.05, 1.20, 1.15, 1.10, 1.00)
)
# Calculate composite scores
goat_candidates <- goat_candidates %>%
mutate(
jaws = (career_war + peak_war_7yr) / 2,
era_adj_war = career_war * era_context,
peak_score = (peak_war_7yr / 75) * 100,
counting_score = (hr_per_162 / 50 + (avg - .250) / .100 +
(ops - .700) / .500) / 3 * 100,
awards_score = (mvps * 10 + allstar * 2) / 2,
longevity_score = pmin(100, seasons * 4),
# Overall GOAT score (weighted combination)
goat_score = 0.40 * era_adj_war / 2 +
0.25 * peak_score +
0.15 * counting_score +
0.10 * awards_score +
0.10 * longevity_score
)
# Rank players
goat_rankings <- goat_candidates %>%
arrange(desc(goat_score)) %>%
mutate(rank = row_number())
print("GOAT Rankings:")
print(goat_rankings %>%
select(rank, player, career_war, jaws, goat_score) %>%
head(10))
# Principal Component Analysis for visualization
pca_data <- goat_candidates %>%
select(career_war, peak_war_7yr, ops, mvps, allstar, seasons)
pca_result <- PCA(pca_data, graph = FALSE)
# Visualize PCA
fviz_pca_biplot(pca_result,
label = "var",
habillage = factor(goat_candidates$player),
addEllipses = FALSE,
col.var = "darkred",
repel = TRUE) +
labs(title = "GOAT Candidates - Principal Component Analysis",
subtitle = "Multi-dimensional comparison of greatest players") +
theme_minimal()
# Create radar chart comparison (top 5)
library(fmsb)
top5 <- goat_rankings %>% head(5)
# Normalize scores to 0-100 scale
radar_data <- top5 %>%
mutate(
war_norm = (career_war / max(career_war)) * 100,
peak_norm = (peak_war_7yr / max(peak_war_7yr)) * 100,
rate_norm = ((ops - min(ops)) / (max(ops) - min(ops))) * 100,
awards_norm = (awards_score / max(awards_score)) * 100,
longevity_norm = (longevity_score / max(longevity_score)) * 100
) %>%
select(player, war_norm, peak_norm, rate_norm, awards_norm, longevity_norm)
# Prepare data for radar chart: rows = players, columns = metrics
radar_matrix <- as.data.frame(radar_data[, -1])
rownames(radar_matrix) <- radar_data$player
# radarchart() expects the first two rows to hold the axis max and min
radar_matrix <- rbind(rep(100, 5), rep(0, 5), radar_matrix)
# Plot radar chart
colors_border <- c("red", "blue", "green", "orange", "purple")
colors_in <- scales::alpha(colors_border, 0.3)
par(mar = c(1, 1, 3, 1))
radarchart(radar_matrix,
axistype = 1,
pcol = colors_border,
pfcol = colors_in,
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey",
cglwd = 0.8,
vlcex = 0.8,
title = "Top 5 GOAT Candidates - Radar Comparison")
legend("topright", legend = radar_data$player,
col = colors_border, lty = 1, lwd = 2, cex = 0.8)
# Python: Situational GOAT Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
# Create comprehensive dataset
goat_data = pd.DataFrame({
'player': ['Babe Ruth', 'Willie Mays', 'Barry Bonds', 'Hank Aaron',
'Ted Williams', 'Mickey Mantle', 'Mike Trout', 'Stan Musial'],
'career_war': [183.1, 156.1, 162.8, 143.0, 121.9, 110.3, 85.5, 128.0],
'peak_war': [71.0, 72.9, 72.8, 59.6, 63.3, 54.2, 56.8, 57.7],
'ops_plus': [207, 156, 182, 155, 190, 172, 176, 159],
'war_per_season': [8.3, 6.8, 7.4, 6.2, 6.4, 6.1, 6.6, 5.8],
'mvps': [1, 2, 7, 1, 2, 3, 3, 3],
'championships': [7, 1, 0, 1, 0, 7, 0, 3],
'iconic_moments': [10, 8, 5, 7, 6, 9, 4, 6], # Subjective
'cultural_impact': [10, 9, 5, 8, 8, 9, 7, 7] # Subjective
})
# Different weighting schemes for different perspectives
weighting_schemes = {
'Pure Statistics': {
'career_war': 0.40,
'peak_war': 0.30,
'ops_plus': 0.20,
'war_per_season': 0.10,
'mvps': 0.0,
'championships': 0.0,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Peak Performance': {
'career_war': 0.20,
'peak_war': 0.40,
'ops_plus': 0.15,
'war_per_season': 0.25,
'mvps': 0.0,
'championships': 0.0,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Career Achievement': {
'career_war': 0.35,
'peak_war': 0.15,
'ops_plus': 0.15,
'war_per_season': 0.10,
'mvps': 0.15,
'championships': 0.10,
'iconic_moments': 0.0,
'cultural_impact': 0.0
},
'Legacy & Impact': {
'career_war': 0.20,
'peak_war': 0.15,
'ops_plus': 0.10,
'war_per_season': 0.05,
'mvps': 0.10,
'championships': 0.15,
'iconic_moments': 0.15,
'cultural_impact': 0.10
},
'Balanced': {
'career_war': 0.25,
'peak_war': 0.20,
'ops_plus': 0.15,
'war_per_season': 0.15,
'mvps': 0.10,
'championships': 0.08,
'iconic_moments': 0.04,
'cultural_impact': 0.03
}
}
# Normalize all metrics to 0-100 scale
metrics = ['career_war', 'peak_war', 'ops_plus', 'war_per_season',
'mvps', 'championships', 'iconic_moments', 'cultural_impact']
goat_normalized = goat_data.copy()
for metric in metrics:
goat_normalized[f'{metric}_norm'] = (
(goat_data[metric] - goat_data[metric].min()) /
(goat_data[metric].max() - goat_data[metric].min()) * 100
)
# Calculate scores for each weighting scheme
results = []
for scheme_name, weights in weighting_schemes.items():
goat_normalized[f'score_{scheme_name}'] = 0
for metric, weight in weights.items():
if weight > 0:
goat_normalized[f'score_{scheme_name}'] += (
goat_normalized[f'{metric}_norm'] * weight
)
# Rank players
scheme_rankings = goat_normalized[['player', f'score_{scheme_name}']].copy()
scheme_rankings = scheme_rankings.sort_values(
f'score_{scheme_name}', ascending=False
)
scheme_rankings['rank'] = range(1, len(scheme_rankings) + 1)
scheme_rankings['scheme'] = scheme_name
results.append(scheme_rankings)
# Combine all rankings
all_rankings = pd.concat(results)
print("GOAT Rankings by Perspective:")
print("=" * 80)
for scheme in weighting_schemes.keys():
scheme_data = all_rankings[all_rankings['scheme'] == scheme]
print(f"\n{scheme}:")
print(scheme_data[['rank', 'player', f'score_{scheme}']].to_string(index=False))
# Visualizations
fig = plt.figure(figsize=(16, 10))
# Ranking heatmap
ax1 = plt.subplot(2, 2, 1)
ranking_pivot = all_rankings.pivot(index='player', columns='scheme', values='rank')
ranking_pivot = ranking_pivot[list(weighting_schemes.keys())] # Order columns
sns.heatmap(ranking_pivot, annot=True, fmt='g', cmap='RdYlGn_r',
cbar_kws={'label': 'Rank'}, ax=ax1)
ax1.set_title('Player Rankings by Perspective')
ax1.set_ylabel('')
# Score heatmap
ax2 = plt.subplot(2, 2, 2)
score_cols = [f'score_{scheme}' for scheme in weighting_schemes.keys()]
score_data = goat_normalized[['player'] + score_cols].set_index('player')
score_data.columns = list(weighting_schemes.keys())
sns.heatmap(score_data, annot=True, fmt='.1f', cmap='YlOrRd', ax=ax2)
ax2.set_title('GOAT Scores by Perspective')
ax2.set_ylabel('')
# Ranking stability
ax3 = plt.subplot(2, 2, 3)
rank_variance = ranking_pivot.var(axis=1).sort_values()
colors = ['green' if x < 2 else 'orange' if x < 5 else 'red' for x in rank_variance]
ax3.barh(rank_variance.index, rank_variance.values, color=colors, alpha=0.7)
ax3.set_xlabel('Variance in Rankings')
ax3.set_title('Ranking Stability Across Perspectives')
ax3.axvline(x=2, color='green', linestyle='--', alpha=0.5, label='Consistent')
ax3.axvline(x=5, color='red', linestyle='--', alpha=0.5, label='Variable')
ax3.legend()
# Average ranking
ax4 = plt.subplot(2, 2, 4)
avg_ranking = ranking_pivot.mean(axis=1).sort_values()
bars = ax4.barh(avg_ranking.index, avg_ranking.values, alpha=0.7)
ax4.set_xlabel('Average Rank')
ax4.set_title('Consensus GOAT Rankings')
ax4.set_xlim(avg_ranking.max() + 0.5, 0.5)  # reversed limits so rank 1 sits at the right edge
# Color gradient for bars
colors_gradient = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(bars)))
for bar, color in zip(bars, colors_gradient):
bar.set_color(color)
plt.tight_layout()
plt.savefig('goat_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Summary statistics
print("\n" + "=" * 80)
print("CONSENSUS RANKINGS (Average across all perspectives):")
print("=" * 80)
consensus = ranking_pivot.mean(axis=1).sort_values()
for rank, (player, avg_rank) in enumerate(consensus.items(), 1):
std = ranking_pivot.loc[player].std()
print(f"{rank}. {player:20s} Avg Rank: {avg_rank:.2f} (σ = {std:.2f})")
Exercise 1: Career Value Calculator
Build a comprehensive career value calculator that computes WAR, JAWS, and position-adjusted metrics for any player.
Tasks:
- Create a function that calculates career WAR from basic statistics
- Implement the JAWS calculation with a rolling 7-year window
- Add position-specific adjustments
- Compare against Hall of Fame standards
- Generate a career value report with visualizations
Solution Approach:
# R: Career Value Calculator
calculate_career_value <- function(player_stats, position) {
# player_stats should be a dataframe with columns: year, pa, h, 2b, 3b, hr, bb, hbp, sf, ab
# Calculate rate stats
player_stats <- player_stats %>%
mutate(
obp = (h + bb + hbp) / (ab + bb + hbp + sf),
slg = (h - `2b` - `3b` - hr + 2*`2b` + 3*`3b` + 4*hr) / ab,
ops = obp + slg,
# Simplified WAR estimation
war = (ops - 0.710) * pa / 20
)
# Career totals
career_war <- sum(player_stats$war, na.rm = TRUE)
seasons <- nrow(player_stats)
# Peak 7-year WAR (rollapply() comes from the zoo package)
if (nrow(player_stats) >= 7) {
peak_war <- max(zoo::rollapply(player_stats$war, width = 7,
FUN = sum, align = "left", partial = TRUE))
} else {
peak_war <- career_war
}
# JAWS
jaws <- (career_war + peak_war) / 2
# Position-specific HOF standards (approximate)
position_standards <- list(
C = 53, `1B` = 67, `2B` = 69, `3B` = 67, SS = 67,
LF = 65, CF = 70, RF = 70, DH = 27
)
hof_standard <- position_standards[[position]]
hof_percentage <- (jaws / hof_standard) * 100
# Return results
list(
career_war = career_war,
peak_war = peak_war,
jaws = jaws,
seasons = seasons,
war_per_season = career_war / seasons,
hof_standard = hof_standard,
hof_percentage = hof_percentage,
yearly_war = player_stats$war
)
}
# Test the function
# (Implementation depends on your specific data)
Exercise 2: HOF Predictor with Machine Learning
Develop a machine learning model that predicts Hall of Fame induction probability using various features.
Tasks:
- Collect data on HOF-eligible players (both inducted and not)
- Engineer relevant features (career stats, awards, era adjustments)
- Train multiple models (Logistic Regression, Random Forest, Gradient Boosting)
- Evaluate model performance using cross-validation
- Generate predictions for current players
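The modeling step of this exercise can be sketched with a hand-rolled logistic regression; the career lines below are invented purely for illustration, and with real data you would substitute a library classifier (e.g. scikit-learn's LogisticRegression or RandomForestClassifier) plus proper cross-validation:

```python
import math

# Hypothetical training rows: (career WAR, All-Star selections) -> HOF (1) or not (0).
# All numbers here are invented for illustration.
train = [
    ((150.0, 20), 1), ((85.0, 12), 1), ((70.0, 10), 1),
    ((45.0, 4), 0), ((30.0, 2), 0), ((55.0, 5), 0),
]

def scale(x, lo, hi):
    """Min-max scale a feature to [0, 1] so gradient descent behaves."""
    return (x - lo) / (hi - lo)

X = [(scale(w, 30, 150), scale(a, 2, 20)) for (w, a), _ in train]
y = [label for _, label in train]

# Plain batch gradient descent on the logistic loss.
w1 = w2 = b = 0.0
lr = 0.5
for _ in range(2000):
    g1 = g2 = gb = 0.0
    for (x1, x2), t in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))
        err = p - t
        g1 += err * x1
        g2 += err * x2
        gb += err
    n = len(X)
    w1 -= lr * g1 / n
    w2 -= lr * g2 / n
    b -= lr * gb / n

def hof_probability(war, allstar):
    """Predicted induction probability for a new career line."""
    x1, x2 = scale(war, 30, 150), scale(allstar, 2, 20)
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))

print(f"WAR 120, 15 All-Star games -> P(HOF) = {hof_probability(120, 15):.2f}")
print(f"WAR 35,   3 All-Star games -> P(HOF) = {hof_probability(35, 3):.2f}")
```

The same shape scales up: swap the two toy features for the full engineered feature set and the hand-written loop for a fitted library model.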
Exercise 3: Aging Curve Analysis
Analyze how different skills age and create projections for active players.
Tasks:
- Calculate aging curves for power, speed, and contact skills
- Use the delta method to control for survivorship bias
- Identify peak ages for each skill type
- Project the next 3 years of performance for 5 active players
- Visualize skill-specific aging patterns
Bonus: Compare aging curves by position (e.g., catchers vs. outfielders).
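The delta method at the heart of this exercise can be sketched in a few lines: pair each player's consecutive seasons, record the change in the stat, and average those changes by age. Only players who appear in both seasons contribute a pair, which is what controls for survivorship bias. The season lines below are hypothetical:

```python
from collections import defaultdict

# Hypothetical (player, age, OPS) season lines.
seasons = [
    ("A", 26, 0.850), ("A", 27, 0.870), ("A", 28, 0.860),
    ("B", 26, 0.780), ("B", 27, 0.800), ("B", 28, 0.780),
    ("C", 27, 0.900), ("C", 28, 0.880),
]

def aging_deltas(rows):
    """Mean year-over-year change in the stat, keyed by the younger age.

    Only consecutive-season pairs for the same player count, so players
    who drop out of the league never contribute a phantom decline -- the
    delta method's control for survivorship bias."""
    by_player = defaultdict(dict)
    for player, age, stat in rows:
        by_player[player][age] = stat
    deltas = defaultdict(list)
    for ages in by_player.values():
        for age in sorted(ages):
            if age + 1 in ages:
                deltas[age].append(ages[age + 1] - ages[age])
    return {age: sum(v) / len(v) for age, v in sorted(deltas.items())}

curve = aging_deltas(seasons)
for age, d in curve.items():
    print(f"age {age} -> {age + 1}: mean OPS change {d:+.3f}")
```

Chaining the age-to-age deltas from a baseline age reconstructs the full aging curve, and running the function on power, speed, and contact stats separately gives the skill-specific comparison the tasks ask for.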
Exercise 4: Build Your Own GOAT Ranking System
Create a customizable GOAT ranking system that allows different weighting preferences.
Tasks:
- Compile comprehensive statistics for top 20 all-time players
- Normalize all metrics to a common scale (0-100)
- Create at least 5 different weighting schemes representing different perspectives
- Calculate rankings under each scheme
- Identify which players are most consensus and most controversial
- Build an interactive visualization showing how rankings change with different weights
Extension: Add era adjustments and park factors to make fairer historical comparisons.
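As a starting point for the customizable system, the re-ranking step reduces to one function that takes an arbitrary weight dictionary, min-max normalizes each metric onto a 0-100 scale, and returns players sorted by weighted score (the three career lines below are real WAR/MVP figures from the chapter's tables, but trimmed to a toy example):

```python
def rank_players(stats, weights):
    """Rank players under a custom weighting scheme.

    stats:   {player: {metric: value}}
    weights: {metric: weight}; each metric is min-max normalized
             to 0-100 before the weights are applied.
    """
    metrics = list(weights)
    lo = {m: min(p[m] for p in stats.values()) for m in metrics}
    hi = {m: max(p[m] for p in stats.values()) for m in metrics}

    def score(player):
        return sum(
            weights[m] * 100 * (stats[player][m] - lo[m]) / (hi[m] - lo[m])
            for m in metrics
        )

    return sorted(stats, key=score, reverse=True)

# Toy subset of the chapter's GOAT candidates.
stats = {
    "Ruth":  {"career_war": 183.1, "peak_war": 71.0, "mvps": 1},
    "Bonds": {"career_war": 162.8, "peak_war": 72.8, "mvps": 7},
    "Trout": {"career_war": 85.5,  "peak_war": 56.8, "mvps": 3},
}

# A longevity-heavy scheme and an awards-heavy scheme disagree at the top:
print(rank_players(stats, {"career_war": 0.7, "peak_war": 0.3, "mvps": 0.0}))
print(rank_players(stats, {"career_war": 0.1, "peak_war": 0.2, "mvps": 0.7}))
```

Wiring `rank_players` to sliders (e.g. in a Shiny or Streamlit app) gives the interactive visualization the last task describes.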
Summary
Career analytics combines statistical rigor with historical context to evaluate player legacies and predict future outcomes. Key takeaways:
- WAR and JAWS provide comprehensive frameworks for measuring career value, balancing longevity with peak performance
- Hall of Fame prediction requires both statistical excellence and contextual factors like awards, team success, and narrative
- Award prediction models reveal the combination of performance and story that drives voting
- Aging curves show universal patterns in player development and decline, with skill-specific variations
- What-if analysis explores hypothetical scenarios to understand luck and context in career outcomes
- GOAT debates benefit from multi-dimensional analysis, acknowledging that different perspectives yield different conclusions
Career analytics ultimately tells the story of baseball through individual achievement, connecting past legends to present stars and future Hall of Famers.