Why Raw Statistics Fail
Consider this comparison: In 1930, Bill Terry batted .401 for the New York Giants. In 2023, Luis Arraez led the National League with a .354 batting average. Does this mean Terry was the better hitter? Not necessarily. The 1930 National League averaged .303 as a whole, while the 2023 NL averaged .248. Terry was 98 points above his league average; Arraez was 106 points above his.
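The arithmetic behind this comparison can be sketched in a few lines. This is a minimal illustration using only the four averages quoted above; the helper name `relative_average` is a hypothetical choice, not part of any library. It reports both the point gap and the ratio to league average, since the two views can differ in how large the edge looks.

```python
def relative_average(player_avg: float, league_avg: float) -> float:
    """Return a player's batting average as a multiple of the league average."""
    return player_avg / league_avg


# (player average, league average) as quoted in the text
seasons = {
    "Bill Terry, 1930": (0.401, 0.303),
    "Luis Arraez, 2023": (0.354, 0.248),
}

for label, (avg, league) in seasons.items():
    points_above = round((avg - league) * 1000)  # batting "points" above league
    ratio = relative_average(avg, league)
    print(f"{label}: {points_above} points above league, {ratio:.2f}x league average")
```

On either measure Arraez's margin over his league is at least as large as Terry's, even though his raw average is 47 points lower.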
Raw statistics mean little without context. A 3.00 ERA in 1968, the "Year of the Pitcher," represents very different performance than a 3.00 ERA in 2000, when league-wide offense was near historic highs. Similarly, reaching 50 home runs in 1927, when Babe Ruth hit 60, was far more remarkable than reaching 50 in 1998, when several players did so.
The fundamental challenge is that baseball has never been played in a static environment. Every aspect of the game—from the ball itself to the talent pool to the rules governing play—has evolved continuously over more than a century of professional competition.
Changes in Rules and Equipment
Baseball's rules have undergone numerous changes that dramatically affected statistical outcomes:
The Pitching Distance: In 1893, the pitching distance was extended from 50 feet to the current 60 feet, 6 inches. This single change increased offense significantly and rendered all pre-1893 pitching statistics incomparable to later eras.
The Foul Strike Rule: The National League adopted the foul strike rule in 1901, and the American League followed in 1903. Before this, foul balls didn't count as strikes. This change dramatically favored pitchers and increased strikeout rates.
The Mound Height: After the 1968 season, MLB lowered the pitcher's mound from 15 inches to 10 inches, contributing to increased offense in subsequent years.
The Designated Hitter: The American League's adoption of the DH in 1973 created two different playing environments that persist today, complicating cross-league comparisons.
The Strike Zone: The official strike zone has been modified several times, and its practical enforcement has varied considerably across eras, even when the rules remained constant.
Equipment changes have been equally significant:
The Baseball Itself: The ball has been modified numerous times, sometimes deliberately (as in 1920, when spitballs were banned and a livelier ball introduced) and sometimes inadvertently through manufacturing changes. The transition from dead ball to live ball around 1920 marks the single most important dividing line in baseball history.
Batting Gloves and Equipment: Modern players use batting gloves, lighter bats, and more protective equipment that may influence performance.
Ballpark Design: Stadium construction has evolved from asymmetric, quirky parks to more standardized modern facilities, affecting both offense and defense.
The Talent Pool Evolution
Perhaps the most complex factor in cross-era comparison is the changing talent pool. Several factors have dramatically affected the quality of competition:
Integration: Before Jackie Robinson broke the color barrier in 1947, Black players were excluded from the major leagues. This meant that pre-integration stars competed against a fractured talent pool that excluded some of the game's best players. Post-integration statistics represent competition against a deeper, more talented field.
International Expansion: The influx of Latin American players beginning in the 1950s and 1960s, followed by Asian players in the 1990s and beyond, has continually deepened the talent pool.
Population Growth: The U.S. population has grown from approximately 76 million in 1900 to over 330 million today. A larger population means more potential players, though this is partially offset by competition from other sports.
Expansion: MLB has expanded from 16 teams in 1960 to 30 teams today. More teams mean more major league jobs, potentially diluting talent (though the expanded talent pool from internationalization has more than compensated).
Specialization: Modern players are better trained, better conditioned, and more specialized than their historical counterparts. Relief pitchers throw harder for shorter stints; players lift weights and follow sophisticated nutrition programs. The average player today is almost certainly more athletic than the average player of 1930.
The Dead Ball Era (1900-1919)
The "dead ball era" refers to the period when offense was suppressed by multiple factors:
- A softer, less resilient baseball that didn't travel as far when hit
- Rules that favored pitchers, including allowing spitballs and other doctored pitches
- Large, spacious ballparks with huge outfields
- Tactical approaches that emphasized "small ball" over power hitting
- Balls kept in play until they were misshapen and dirty, making them harder to see and hit
During this period, batting averages were relatively high (because fielding was poor and strikeouts rare), but home runs were extremely uncommon. In 1908, the entire American League hit only 278 home runs, fewer than some single teams would hit in one season in later eras.
Ty Cobb's .366 lifetime batting average, compiled mostly in the dead ball era, reflects both his extraordinary talent and the context of high batting averages. In 1911, Cobb batted .420, and that season the American League as a whole batted .273.
The Live Ball Era Transition (1920-1930)
The game changed dramatically in 1920 following the death of Ray Chapman, who was struck by a pitch. Several changes were implemented:
- Spitballs and other trick pitches were banned (with a grandfather clause for existing spitball pitchers)
- Balls were replaced more frequently, keeping them white and visible
- The ball itself was made livelier, with tighter winding
The effect was immediate. Home runs increased dramatically, and Babe Ruth became the game's first true power hitter, hitting 54 home runs in 1920—more than any entire team had hit the previous year.
The 1920s and especially 1930 saw offense reach absurd levels. The 1930 National League batted .303 as a whole, and Hack Wilson drove in 191 runs. The ball was subsequently deadened slightly, but offense remained higher than in the dead ball era.
Modern Era Variations
Even within the "modern era" (post-1920), offense has varied considerably:
The 1960s Pitcher Dominance: The strike zone was enlarged in 1963, and by 1968, pitchers dominated to historic levels. The American League batted just .230 that year, and Bob Gibson posted a 1.12 ERA.
The 1990s-2000s Offensive Explosion: Often called the "steroid era," this period saw offense reach historic highs, with home runs and other power numbers exploding. However, multiple factors contributed beyond performance-enhancing drugs, including smaller ballparks, possible ball changes, and expansion diluting pitching.
The 2010s Return to Pitching: Strikeout rates soared as pitchers threw harder and batters sold out for power. Batting averages declined to levels not seen since the 1960s.
The 2020s Three True Outcomes Era: Modern baseball features historic levels of strikeouts, walks, and home runs, with traditional balls in play becoming less common.
Understanding these contextual factors is essential before attempting any cross-era analysis.
To make meaningful comparisons across eras, we need statistics that adjust for context. The most widely used era-adjusted statistics express performance relative to league average, typically using 100 as the baseline.
OPS+ (Adjusted OPS)
OPS+ adjusts a player's OPS (On-Base Plus Slugging) for their league and ballpark. The formula is:
OPS+ = 100 * (OBP/lgOBP + SLG/lgSLG - 1)
This formula is then adjusted for park factors. An OPS+ of 100 is league average; 150 means the player was 50% better than average; 75 means 25% below average.
Let's calculate OPS+ for Babe Ruth's legendary 1927 season:
Ruth's 1927 Stats:
- OBP: .486
- SLG: .772
- OPS: 1.258
1927 AL Average:
- OBP: .333
- SLG: .392
- OPS: .725
# Calculate OPS+ for Babe Ruth 1927
ruth_obp <- 0.486
ruth_slg <- 0.772
league_obp <- 0.333
league_slg <- 0.392
# Basic OPS+ calculation (before park adjustment)
ops_plus <- 100 * ((ruth_obp / league_obp) + (ruth_slg / league_slg) - 1)
print(paste("Ruth's 1927 OPS+:", round(ops_plus, 0)))
# Calculate OPS+ for Babe Ruth 1927
ruth_obp = 0.486
ruth_slg = 0.772
league_obp = 0.333
league_slg = 0.392
# Basic OPS+ calculation (before park adjustment)
ops_plus = 100 * ((ruth_obp / league_obp) + (ruth_slg / league_slg) - 1)
print(f"Ruth's 1927 OPS+: {round(ops_plus, 0)}")
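With these unadjusted inputs, the code prints roughly 243. The commonly published, park-adjusted figure for Ruth's 1927 season is 225, meaning he was 125% better than the average hitter. Either number is astronomical, which helps explain why many analysts consider this the greatest offensive season in baseball history.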
ERA+ (Adjusted ERA)
ERA+ works similarly for pitchers, with the formula:
ERA+ = 100 * (lgERA / ERA)
Again adjusted for park factors. Higher is better (the opposite of ERA itself).
Let's calculate ERA+ for Pedro Martinez's incredible 2000 season:
Pedro's 2000 Stats:
- ERA: 1.74
2000 AL Average ERA: 4.91
# Calculate ERA+ for Pedro Martinez 2000
pedro_era <- 1.74
league_era <- 4.91
# Basic ERA+ calculation (before park adjustment)
era_plus <- 100 * (league_era / pedro_era)
print(paste("Pedro's 2000 ERA+:", round(era_plus, 0)))
# Calculate ERA+ for Pedro Martinez 2000
pedro_era = 1.74
league_era = 4.91
# Basic ERA+ calculation (before park adjustment)
era_plus = 100 * (league_era / pedro_era)
print(f"Pedro's 2000 ERA+: {round(era_plus, 0)}")
This yields an unadjusted ERA+ of 282. Martinez's published, park-adjusted ERA+ for 2000 is 291, the highest single-season mark since 1900 for a pitcher with 150+ innings. He was nearly three times as effective as the average pitcher in a high-offense environment.
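The same formula makes the earlier "3.00 ERA in 1968 versus 2000" point concrete. The sketch below is illustrative only: the 2000 league ERA is the AL figure quoted above, while the 1968 league ERA of 2.98 is an approximate value assumed for this example, and no park adjustment is applied.

```python
def era_plus(era: float, league_era: float) -> float:
    """Unadjusted ERA+: 100 is league average, higher is better."""
    return 100 * league_era / era


# League ERA by season. 4.91 comes from the text (2000 AL);
# 2.98 is an assumed, approximate figure for 1968.
LEAGUE_ERA = {1968: 2.98, 2000: 4.91}

for year, league_era in LEAGUE_ERA.items():
    value = era_plus(3.00, league_era)
    print(f"A 3.00 ERA in {year}: ERA+ of about {value:.0f}")
```

Under these assumptions the identical 3.00 ERA is almost exactly league average in 1968 but well above average in 2000.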
wRC+ (Weighted Runs Created Plus)
wRC+ is a more sophisticated offensive statistic that weights different offensive events by their actual run value. It's calculated using linear weights derived from run expectancy matrices, then adjusted for league and park.
The basic concept:
- Calculate wOBA (weighted on-base average) using appropriate weights for each event
- Convert to wRAA (weighted runs above average)
- Add a park factor adjustment
- Scale to 100 (league average)
Here's a simplified calculation:
# Simplified wRC+ calculation
calculate_wrc_plus <- function(player_stats, league_stats, park_factor = 1.00) {
  # Approximate 2023 weights, used here for illustration
  woba_weights <- list(
    BB = 0.69,
    HBP = 0.72,
    '1B' = 0.88,
    '2B' = 1.24,
    '3B' = 1.56,
    HR = 2.08
  )
  # Calculate player wOBA
  player_woba <- (
    woba_weights$BB * player_stats$BB +
      woba_weights$HBP * player_stats$HBP +
      woba_weights$'1B' * player_stats$'1B' +
      woba_weights$'2B' * player_stats$'2B' +
      woba_weights$'3B' * player_stats$'3B' +
      woba_weights$HR * player_stats$HR
  ) / (player_stats$AB + player_stats$BB - player_stats$IBB + player_stats$SF + player_stats$HBP)
  # League wOBA
  league_woba <- league_stats$wOBA
  # League runs per plate appearance; 0.115 is an assumed, approximate
  # fallback used when league_stats carries no R_PA entry
  league_r_pa <- if (is.null(league_stats$R_PA)) 0.115 else league_stats$R_PA
  # wRC+ (simplified): runs above average per PA (wOBA gap / wOBA scale of ~1.15),
  # expressed relative to league runs per PA, then scaled so 100 is average
  wrc_plus <- (((player_woba - league_woba) / 1.15) / league_r_pa + 1) * 100 / park_factor
  return(wrc_plus)
}
# Example: Mike Trout 2012
trout_2012 <- list(
AB = 559, BB = 67, HBP = 1, '1B' = 113, '2B' = 27, '3B' = 8, HR = 30,
IBB = 0, SF = 4
)
league_2012 <- list(wOBA = 0.315)
trout_wrc_plus <- calculate_wrc_plus(trout_2012, league_2012, park_factor = 1.00)
print(paste("Trout's 2012 wRC+:", round(trout_wrc_plus, 0)))
def calculate_wrc_plus(player_stats, league_woba, park_factor=1.00, league_r_per_pa=0.115):
    """
    Simplified wRC+ calculation.
    league_r_per_pa is league runs per plate appearance; the 0.115 default
    is an assumed, approximate value.
    """
    # Approximate 2023 weights, used here for illustration
    woba_weights = {
        'BB': 0.69,
        'HBP': 0.72,
        '1B': 0.88,
        '2B': 1.24,
        '3B': 1.56,
        'HR': 2.08
    }
    # Calculate player wOBA
    numerator = (
        woba_weights['BB'] * player_stats['BB'] +
        woba_weights['HBP'] * player_stats['HBP'] +
        woba_weights['1B'] * player_stats['1B'] +
        woba_weights['2B'] * player_stats['2B'] +
        woba_weights['3B'] * player_stats['3B'] +
        woba_weights['HR'] * player_stats['HR']
    )
    denominator = (
        player_stats['AB'] + player_stats['BB'] - player_stats['IBB'] +
        player_stats['SF'] + player_stats['HBP']
    )
    player_woba = numerator / denominator
    # wRC+ (simplified): runs above average per PA (wOBA gap / wOBA scale of ~1.15),
    # expressed relative to league runs per PA, then scaled so 100 is average
    runs_above_avg_per_pa = (player_woba - league_woba) / 1.15
    wrc_plus = (runs_above_avg_per_pa / league_r_per_pa + 1) * 100 / park_factor
    return wrc_plus
# Example: Mike Trout 2012
trout_2012 = {
'AB': 559, 'BB': 67, 'HBP': 1, '1B': 113, '2B': 27, '3B': 8, 'HR': 30,
'IBB': 0, 'SF': 4
}
league_woba_2012 = 0.315
trout_wrc_plus = calculate_wrc_plus(trout_2012, league_woba_2012, park_factor=1.00)
print(f"Trout's 2012 wRC+: {round(trout_wrc_plus, 0)}")
Park Factors
Park factors adjust for the offensive environment of a player's home ballpark. Some parks (like Coors Field) dramatically increase offense; others (like Oracle Park in San Francisco) suppress it.
Park factors are typically calculated by comparing runs scored in a park (by both teams) to runs scored in road games:
Park Factor = (Runs in Home Games / Home Games) / (Runs in Road Games / Road Games) * 100
A park factor of 100 is neutral; above 100 favors hitters; below 100 favors pitchers.
Here's how to calculate park factors:
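# Calculate park factor
# Note: home_runs and road_runs are total runs scored (by both teams) in home
# and road games, not home-run counts
calculate_park_factor <- function(home_runs, home_games, road_runs, road_games) {
  park_factor <- (home_runs / home_games) / (road_runs / road_games) * 100
  return(park_factor)
}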
# Example: Coors Field (notorious hitter's park)
# Hypothetical season data
coors_pf <- calculate_park_factor(
home_runs = 900,
home_games = 81,
road_runs = 700,
road_games = 81
)
print(paste("Coors Field Park Factor:", round(coors_pf, 0)))
# Example: Oracle Park (pitcher's park)
oracle_pf <- calculate_park_factor(
home_runs = 650,
home_games = 81,
road_runs = 750,
road_games = 81
)
print(paste("Oracle Park Factor:", round(oracle_pf, 0)))
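def calculate_park_factor(home_runs, home_games, road_runs, road_games):
    """
    Calculate park factor.
    home_runs and road_runs are total runs scored (by both teams) in home
    and road games, not home-run counts.
    """
    park_factor = (home_runs / home_games) / (road_runs / road_games) * 100
    return park_factor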
# Example: Coors Field (notorious hitter's park)
coors_pf = calculate_park_factor(
home_runs=900,
home_games=81,
road_runs=700,
road_games=81
)
print(f"Coors Field Park Factor: {round(coors_pf, 0)}")
# Example: Oracle Park (pitcher's park)
oracle_pf = calculate_park_factor(
home_runs=650,
home_games=81,
road_runs=750,
road_games=81
)
print(f"Oracle Park Factor: {round(oracle_pf, 0)}")
Historical Park Factor Challenges
Park factors become more complex when analyzing historical players because:
- Parks Changed: Many historical players played in parks that no longer exist
- Multi-Year Analysis: Players often played in multiple parks throughout their careers
- Era Effects: The same physical park might play differently in different eras due to changes in the ball, rules, or playing style
For historical analysis, we often use multi-year park factors and apply them carefully, recognizing that they're estimates rather than precise measurements.
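One simple way to build a multi-year park factor is to pool runs and games across several seasons before taking the ratio, which dampens single-season noise. The sketch below assumes that pooling approach; the `SeasonSplit` class, the `multi_year_park_factor` function, and all of the run totals are hypothetical and invented for illustration. Published park factors typically involve further adjustments that are not shown here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SeasonSplit:
    """One season of home/road totals for a team (hypothetical structure)."""
    home_runs_scored: int  # runs by both teams in home games (not homers)
    home_games: int
    road_runs_scored: int  # runs by both teams in road games
    road_games: int


def multi_year_park_factor(seasons: Iterable[SeasonSplit]) -> float:
    """Pool several seasons, then apply the same ratio as the one-year formula."""
    seasons = list(seasons)
    home_rate = sum(s.home_runs_scored for s in seasons) / sum(s.home_games for s in seasons)
    road_rate = sum(s.road_runs_scored for s in seasons) / sum(s.road_games for s in seasons)
    return home_rate / road_rate * 100


# Invented three-year sample for a hitter-friendly park
sample = [
    SeasonSplit(900, 81, 700, 81),
    SeasonSplit(850, 81, 720, 81),
    SeasonSplit(880, 80, 710, 82),
]
print(f"Three-year park factor: {multi_year_park_factor(sample):.1f}")
```

With a single season in the list, the function reduces to the one-year formula given earlier.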
Applying Era Adjustments
Let's apply these concepts to compare two legendary seasons:
Babe Ruth, 1927: OPS 1.258, AL average OPS .725
Barry Bonds, 2004: OPS 1.422, NL average OPS .758
Raw OPS suggests Bonds was better. But adjusted for context:
# Compare Ruth and Bonds
compare_seasons <- function(player_ops, league_ops, player_name, year) {
  relative_ops <- player_ops / league_ops
  print(paste(player_name, year, "- Relative to league average:", round(relative_ops, 3)))
  return(relative_ops)
}
ruth_relative <- compare_seasons(1.258, 0.725, "Babe Ruth", 1927)
bonds_relative <- compare_seasons(1.422, 0.758, "Barry Bonds", 2004)
print(paste("Ruth was", round((ruth_relative - 1) * 100, 1), "% above average"))
print(paste("Bonds was", round((bonds_relative - 1) * 100, 1), "% above average"))
def compare_seasons(player_ops, league_ops, player_name, year):
    """
    Compare seasons relative to league average
    """
    relative_ops = player_ops / league_ops
    print(f"{player_name} {year} - Relative to league average: {relative_ops:.3f}")
    return relative_ops
ruth_relative = compare_seasons(1.258, 0.725, "Babe Ruth", 1927)
bonds_relative = compare_seasons(1.422, 0.758, "Barry Bonds", 2004)
print(f"Ruth was {(ruth_relative - 1) * 100:.1f}% above average")
print(f"Bonds was {(bonds_relative - 1) * 100:.1f}% above average")
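Relative to league OPS, Ruth comes out about 74% above average and Bonds about 88% above. Bonds still leads, but his raw OPS edge of roughly 13% shrinks to about 8% once league context is considered, making these comparably dominant seasons despite different raw numbers.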
The Lahman Database is the most comprehensive source of historical baseball statistics, covering every player and team from 1871 to the present. It's available as a free download and can be accessed through R and Python packages.
Setting Up the Lahman Database
The easiest way to access Lahman data is through dedicated packages:
# Install and load the Lahman package
# install.packages("Lahman")
library(Lahman)
library(dplyr)
# The package includes multiple datasets
# Let's explore what's available
data(package = "Lahman")
# Key datasets:
# - People: biographical information
# - Batting: batting statistics by season
# - Pitching: pitching statistics by season
# - Teams: team statistics by season
# - Fielding: fielding statistics by season
# View the structure of the Batting dataset
str(Batting)
# See the first few rows
head(Batting)
# Install pybaseball (includes Lahman data access)
# pip install pybaseball
import pybaseball as pyb
import pandas as pd
# Enable pybaseball's local cache so repeated calls reuse downloaded data
pyb.cache.enable()
# Download Lahman data
# The first time you run this, it will download the data
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
people = pyb.lahman.people()
teams = pyb.lahman.teams()
# View the structure
print(batting.info())
print(batting.head())
Querying Career Statistics
Let's pull complete career statistics for some legendary players:
library(Lahman)
library(dplyr)
# Get Babe Ruth's career batting stats
# First, find his playerID
ruth_id <- People %>%
  filter(nameFirst == "Babe", nameLast == "Ruth") %>%
  pull(playerID)
# Get his career stats
ruth_career <- Batting %>%
  filter(playerID == ruth_id) %>%
  arrange(yearID)
print(ruth_career)
# Calculate career totals
ruth_totals <- ruth_career %>%
  summarise(
    Years = n(),
    Games = sum(G, na.rm = TRUE),
    AB = sum(AB, na.rm = TRUE),
    Hits = sum(H, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE),
    RBI = sum(RBI, na.rm = TRUE),
    AVG = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
  )
print(ruth_totals)
# Compare multiple players
compare_players <- function(first_names, last_names) {
  # Get player IDs
  player_data <- data.frame(firstName = first_names, lastName = last_names)
  results <- list()
  for (i in seq_len(nrow(player_data))) {
    player_id <- People %>%
      filter(nameFirst == player_data$firstName[i],
             nameLast == player_data$lastName[i]) %>%
      pull(playerID)
    if (length(player_id) > 0) {
      career <- Batting %>%
        filter(playerID == player_id[1]) %>%
        summarise(
          Name = paste(player_data$firstName[i], player_data$lastName[i]),
          Years = n(),
          Games = sum(G, na.rm = TRUE),
          AB = sum(AB, na.rm = TRUE),
          Hits = sum(H, na.rm = TRUE),
          HR = sum(HR, na.rm = TRUE),
          AVG = round(sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), 3)
        )
      results[[i]] <- career
    }
  }
  return(bind_rows(results))
}
# Compare Ruth, Mays, Bonds, Trout (career through available data)
comparison <- compare_players(
c("Babe", "Willie", "Barry", "Mike"),
c("Ruth", "Mays", "Bonds", "Trout")
)
print(comparison)
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
# Load Lahman data
batting = pyb.lahman.batting()
people = pyb.lahman.people()
# Get Babe Ruth's career batting stats
ruth = people[
(people['nameFirst'] == 'Babe') &
(people['nameLast'] == 'Ruth')
]
ruth_id = ruth['playerID'].values[0]
# Get his career stats
ruth_career = batting[batting['playerID'] == ruth_id].sort_values('yearID')
print(ruth_career)
# Calculate career totals
ruth_totals = pd.DataFrame({
'Years': [len(ruth_career)],
'Games': [ruth_career['G'].sum()],
'AB': [ruth_career['AB'].sum()],
'Hits': [ruth_career['H'].sum()],
'HR': [ruth_career['HR'].sum()],
'RBI': [ruth_career['RBI'].sum()],
'AVG': [ruth_career['H'].sum() / ruth_career['AB'].sum()]
})
print(ruth_totals)
# Compare multiple players
def compare_players(player_names):
    """
    Compare career stats for multiple players
    player_names: list of tuples (first_name, last_name)
    """
    results = []
    for first_name, last_name in player_names:
        player = people[
            (people['nameFirst'] == first_name) &
            (people['nameLast'] == last_name)
        ]
        if len(player) > 0:
            player_id = player['playerID'].values[0]
            career = batting[batting['playerID'] == player_id]
            stats = {
                'Name': f"{first_name} {last_name}",
                'Years': len(career),
                'Games': career['G'].sum(),
                'AB': career['AB'].sum(),
                'Hits': career['H'].sum(),
                'HR': career['HR'].sum(),
                'AVG': round(career['H'].sum() / career['AB'].sum(), 3)
            }
            results.append(stats)
    return pd.DataFrame(results)
# Compare Ruth, Mays, Bonds, Trout
comparison = compare_players([
('Babe', 'Ruth'),
('Willie', 'Mays'),
('Barry', 'Bonds'),
('Mike', 'Trout')
])
print(comparison)
Building Era Comparison Tools
Let's create a tool that automatically calculates era-adjusted statistics:
library(Lahman)
library(dplyr)
# Function to calculate league averages for a given year
get_league_averages <- function(year, league) {
  league_stats <- Batting %>%
    filter(yearID == year, lgID == league) %>%
    summarise(
      lgAB = sum(AB, na.rm = TRUE),
      lgH = sum(H, na.rm = TRUE),
      lgBB = sum(BB, na.rm = TRUE),
      lgHBP = sum(HBP, na.rm = TRUE),
      lgSF = sum(SF, na.rm = TRUE),
      lgTB = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE)
    ) %>%
    mutate(
      lgPA = lgAB + lgBB + lgHBP + lgSF,
      lgOBP = (lgH + lgBB + lgHBP) / lgPA,
      lgSLG = lgTB / lgAB,
      lgOPS = lgOBP + lgSLG
    )
  return(league_stats)
}
# Function to calculate OPS+ for a player season
calculate_ops_plus <- function(player_id, year) {
  # Get player stats
  # TB is computed first: inside summarise(), a name such as H refers to the
  # already-summed value once it has been reassigned, which would distort TB
  # for players with more than one stint in a season
  player_stats <- Batting %>%
    filter(playerID == player_id, yearID == year) %>%
    summarise(
      TB = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE),
      AB = sum(AB, na.rm = TRUE),
      H = sum(H, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      HBP = sum(HBP, na.rm = TRUE),
      SF = sum(SF, na.rm = TRUE),
      lgID = first(lgID)
    )
  if (nrow(player_stats) == 0 || player_stats$AB == 0) {
    return(NA)
  }
  # Calculate player OBP and SLG
  player_PA <- player_stats$AB + player_stats$BB + player_stats$HBP + player_stats$SF
  player_OBP <- (player_stats$H + player_stats$BB + player_stats$HBP) / player_PA
  player_SLG <- player_stats$TB / player_stats$AB
  # Get league averages
  league_avg <- get_league_averages(year, player_stats$lgID)
  # Calculate OPS+
  ops_plus <- 100 * ((player_OBP / league_avg$lgOBP) + (player_SLG / league_avg$lgSLG) - 1)
  return(round(ops_plus, 0))
}
# Example: Calculate OPS+ for famous seasons
# Babe Ruth 1927
ruth_id <- People %>% filter(nameFirst == "Babe", nameLast == "Ruth") %>% pull(playerID)
ruth_1927_ops_plus <- calculate_ops_plus(ruth_id, 1927)
print(paste("Babe Ruth 1927 OPS+:", ruth_1927_ops_plus))
# Ted Williams 1941
williams_id <- People %>% filter(nameFirst == "Ted", nameLast == "Williams") %>% pull(playerID)
williams_1941_ops_plus <- calculate_ops_plus(williams_id, 1941)
print(paste("Ted Williams 1941 OPS+:", williams_1941_ops_plus))
# Barry Bonds 2004
bonds_id <- People %>% filter(nameFirst == "Barry", nameLast == "Bonds") %>% pull(playerID)
bonds_2004_ops_plus <- calculate_ops_plus(bonds_id, 2004)
print(paste("Barry Bonds 2004 OPS+:", bonds_2004_ops_plus))
import pybaseball as pyb
import pandas as pd
import numpy as np
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def get_league_averages(year, league):
    """
    Calculate league averages for a given year
    """
    league_stats = batting[
        (batting['yearID'] == year) &
        (batting['lgID'] == league)
    ]
    lg_ab = league_stats['AB'].sum()
    lg_h = league_stats['H'].sum()
    lg_bb = league_stats['BB'].sum()
    lg_hbp = league_stats['HBP'].sum()
    lg_sf = league_stats['SF'].sum()
    lg_2b = league_stats['2B'].sum()
    lg_3b = league_stats['3B'].sum()
    lg_hr = league_stats['HR'].sum()
    lg_tb = lg_h + lg_2b + 2*lg_3b + 3*lg_hr
    lg_pa = lg_ab + lg_bb + lg_hbp + lg_sf
    lg_obp = (lg_h + lg_bb + lg_hbp) / lg_pa
    lg_slg = lg_tb / lg_ab
    lg_ops = lg_obp + lg_slg
    return {
        'lgOBP': lg_obp,
        'lgSLG': lg_slg,
        'lgOPS': lg_ops
    }


def calculate_ops_plus(player_id, year):
    """
    Calculate OPS+ for a player season
    """
    # Get player stats
    player_stats = batting[
        (batting['playerID'] == player_id) &
        (batting['yearID'] == year)
    ]
    if len(player_stats) == 0:
        return None
    # Aggregate if player played for multiple teams
    ab = player_stats['AB'].sum()
    h = player_stats['H'].sum()
    bb = player_stats['BB'].sum()
    hbp = player_stats['HBP'].sum()
    sf = player_stats['SF'].sum()
    doubles = player_stats['2B'].sum()
    triples = player_stats['3B'].sum()
    hr = player_stats['HR'].sum()
    league = player_stats['lgID'].values[0]
    if ab == 0:
        return None
    # Calculate player OBP and SLG
    tb = h + doubles + 2*triples + 3*hr
    pa = ab + bb + hbp + sf
    player_obp = (h + bb + hbp) / pa
    player_slg = tb / ab
    # Get league averages
    league_avg = get_league_averages(year, league)
    # Calculate OPS+
    ops_plus = 100 * (
        (player_obp / league_avg['lgOBP']) +
        (player_slg / league_avg['lgSLG']) - 1
    )
    return round(ops_plus, 0)
# Example: Calculate OPS+ for famous seasons
# Babe Ruth 1927
ruth = people[(people['nameFirst'] == 'Babe') & (people['nameLast'] == 'Ruth')]
ruth_id = ruth['playerID'].values[0]
ruth_1927 = calculate_ops_plus(ruth_id, 1927)
print(f"Babe Ruth 1927 OPS+: {ruth_1927}")
# Ted Williams 1941
williams = people[(people['nameFirst'] == 'Ted') & (people['nameLast'] == 'Williams')]
williams_id = williams['playerID'].values[0]
williams_1941 = calculate_ops_plus(williams_id, 1941)
print(f"Ted Williams 1941 OPS+: {williams_1941}")
# Barry Bonds 2004
bonds = people[(people['nameFirst'] == 'Barry') & (people['nameLast'] == 'Bonds')]
bonds_id = bonds['playerID'].values[0]
bonds_2004 = calculate_ops_plus(bonds_id, 2004)
print(f"Barry Bonds 2004 OPS+: {bonds_2004}")
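The per-player function above recomputes league totals on every call. When many seasons are needed, one alternative is to total each league-season once and reuse it. The sketch below illustrates that idea with the standard library only; the rows are invented, the column names merely mimic the Lahman batting table, and `all_ops_plus` is a hypothetical helper that applies the same unadjusted OPS+ formula (no park adjustment).

```python
from collections import defaultdict

COUNTING_COLS = ("AB", "H", "2B", "3B", "HR", "BB", "HBP", "SF")


def obp_slg(totals):
    """Return (OBP, SLG) from a dict of counting stats."""
    tb = totals["H"] + totals["2B"] + 2 * totals["3B"] + 3 * totals["HR"]
    pa = totals["AB"] + totals["BB"] + totals["HBP"] + totals["SF"]
    return (totals["H"] + totals["BB"] + totals["HBP"]) / pa, tb / totals["AB"]


def all_ops_plus(rows):
    """Unadjusted OPS+ for every (playerID, yearID, lgID) found in rows."""
    league = defaultdict(lambda: defaultdict(int))
    player = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for col in COUNTING_COLS:
            value = row.get(col) or 0  # treat missing values as zero
            league[(row["yearID"], row["lgID"])][col] += value
            player[(row["playerID"], row["yearID"], row["lgID"])][col] += value
    result = {}
    for (pid, year, lg), totals in player.items():
        if totals["AB"] == 0:
            continue  # no at-bats, so SLG is undefined
        p_obp, p_slg = obp_slg(totals)
        l_obp, l_slg = obp_slg(league[(year, lg)])
        result[(pid, year, lg)] = round(100 * (p_obp / l_obp + p_slg / l_slg - 1))
    return result


# Invented two-player "league" for demonstration
rows = [
    {"playerID": "a", "yearID": 2001, "lgID": "AL", "AB": 500, "H": 170,
     "2B": 30, "3B": 5, "HR": 40, "BB": 80, "HBP": 5, "SF": 5},
    {"playerID": "b", "yearID": 2001, "lgID": "AL", "AB": 500, "H": 120,
     "2B": 20, "3B": 2, "HR": 10, "BB": 40, "HBP": 2, "SF": 4},
]
for key, value in sorted(all_ops_plus(rows).items()):
    print(key, value)
```

Because stints are summed per player, league, and season before any rate is computed, a player traded within a league is handled in the same pass.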
Advanced Historical Queries
Let's find the best single seasons in history by OPS+:
library(Lahman)
library(dplyr)
# Collect qualified seasons; AB is used as a rough proxy for the 502-PA standard
find_best_seasons <- function(min_ab = 400, n_seasons = 20) {
  # Combine stints first so players traded mid-season are not dropped
  results <- Batting %>%
    select(playerID, yearID, lgID, AB, H, X2B, X3B, HR, BB, HBP, SF) %>%
    group_by(playerID, yearID) %>%
    summarise(
      lgID = first(lgID),
      AB = sum(AB, na.rm = TRUE),
      H = sum(H, na.rm = TRUE),
      X2B = sum(X2B, na.rm = TRUE),
      X3B = sum(X3B, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      HBP = sum(HBP, na.rm = TRUE),
      SF = sum(SF, na.rm = TRUE),
      .groups = 'drop'
    ) %>%
    filter(AB >= min_ab)
  # Simplified version: this returns the qualified seasons only.
  # A full ranking would compute OPS+ for each row and keep the top n_seasons.
  return(results)
}
# Career totals for long-career hitters, ranked by home runs
# (OPS+ itself is not computed in this simplified version)
career_ops_plus <- function(min_ab = 3000) {
  # Calculate career stats
  career_stats <- Batting %>%
    group_by(playerID) %>%
    summarise(
      AB = sum(AB, na.rm = TRUE),
      H = sum(H, na.rm = TRUE),
      X2B = sum(X2B, na.rm = TRUE),
      X3B = sum(X3B, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      .groups = 'drop'
    ) %>%
    filter(AB >= min_ab) %>%
    arrange(desc(HR))
  # Join with player names
  career_with_names <- career_stats %>%
    left_join(People, by = "playerID") %>%
    select(nameFirst, nameLast, AB, H, HR, BB) %>%
    head(20)
  return(career_with_names)
}
top_careers <- career_ops_plus()
print(top_careers)
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def find_best_seasons(min_ab=400, n_seasons=20):
    """
    Find standout single seasons among qualified hitters.
    Note: simplified for demonstration; ranks by HR, not OPS+.
    """
    # Combine stints so a mid-season trade counts as one season
    stat_cols = ['AB', 'H', '2B', '3B', 'HR']
    seasons = batting.groupby(['playerID', 'yearID'])[stat_cols].sum().reset_index()
    # Keep qualified seasons
    qualified = seasons[seasons['AB'] >= min_ab].copy()
    # Calculate basic rate stats
    qualified['AVG'] = qualified['H'] / qualified['AB']
    qualified['TB'] = (
        qualified['H'] + qualified['2B'] +
        2*qualified['3B'] + 3*qualified['HR']
    )
    qualified['SLG'] = qualified['TB'] / qualified['AB']
    # Sort by HR (as a stand-in for OPS+ in this example)
    best_seasons = qualified.nlargest(n_seasons, 'HR')
    # Join with player names
    best_with_names = best_seasons.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID',
        how='left'
    )
    result = best_with_names[[
        'nameFirst', 'nameLast', 'yearID', 'AB', 'H', 'HR', 'AVG', 'SLG'
    ]].sort_values('HR', ascending=False)
    return result
# Find top single seasons
top_seasons = find_best_seasons()
print(top_seasons)
def career_ops_plus(min_ab=3000):
    """
    Career totals for long-career hitters, ranked by home runs.
    Despite the name, this does not compute OPS+.
    """
    # Calculate career stats
    career_stats = batting.groupby('playerID').agg({
        'AB': 'sum',
        'H': 'sum',
        '2B': 'sum',
        '3B': 'sum',
        'HR': 'sum',
        'BB': 'sum'
    }).reset_index()
    # Filter qualified players
    career_stats = career_stats[career_stats['AB'] >= min_ab]
    # Calculate AVG
    career_stats['AVG'] = career_stats['H'] / career_stats['AB']
    # Join with names
    career_with_names = career_stats.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID',
        how='left'
    )
    # Sort by HR and get top 20
    top_careers = career_with_names.nlargest(20, 'HR')[[
        'nameFirst', 'nameLast', 'AB', 'H', 'HR', 'BB', 'AVG'
    ]]
    return top_careers
top_careers = career_ops_plus()
print(top_careers)
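The simplified queries above rank by home runs. As a rough sketch of what a full ranking could look like, the block below computes an unadjusted OPS+ for every player-season in one pass, using the same formula as the single-season calculation (no park factors). The names `add_obp_slg` and `rank_seasons_by_ops_plus` are hypothetical helpers, the input is assumed to be a Lahman-style batting table with the usual column names, and the demo rows are made-up numbers rather than real seasons.

```python
import pandas as pd

STAT_COLS = ['AB', 'H', '2B', '3B', 'HR', 'BB', 'HBP', 'SF']


def add_obp_slg(frame, prefix=''):
    """Add OBP and SLG columns (optionally prefixed) from counting stats."""
    total_bases = frame['H'] + frame['2B'] + 2 * frame['3B'] + 3 * frame['HR']
    plate_apps = frame['AB'] + frame['BB'] + frame['HBP'] + frame['SF']
    frame[prefix + 'OBP'] = (frame['H'] + frame['BB'] + frame['HBP']) / plate_apps
    frame[prefix + 'SLG'] = total_bases / frame['AB']
    return frame


def rank_seasons_by_ops_plus(batting, min_ab=400, n_seasons=20):
    """Rank player-seasons by unadjusted OPS+ (assumption: no park factors)."""
    df = batting.copy()
    # Missing HBP/SF values (common in early seasons) are treated as zero
    df[STAT_COLS] = df[STAT_COLS].fillna(0)

    # League totals for each league-season
    league = df.groupby(['yearID', 'lgID'], as_index=False)[STAT_COLS].sum()
    league = add_obp_slg(league, prefix='lg')

    # Player-season totals; stints within the same league are combined,
    # while a player who changed leagues mid-season gets one row per league
    seasons = df.groupby(['playerID', 'yearID', 'lgID'], as_index=False)[STAT_COLS].sum()
    seasons = add_obp_slg(seasons[seasons['AB'] >= min_ab].copy())

    merged = seasons.merge(
        league[['yearID', 'lgID', 'lgOBP', 'lgSLG']],
        on=['yearID', 'lgID'],
        how='left'
    )
    merged['OPS_plus'] = 100 * (
        merged['OBP'] / merged['lgOBP'] + merged['SLG'] / merged['lgSLG'] - 1
    )
    top = merged.nlargest(n_seasons, 'OPS_plus')
    columns = ['playerID', 'yearID', 'lgID', 'AB', 'OBP', 'SLG', 'OPS_plus']
    return top[columns].reset_index(drop=True)


# Made-up three-player league, only to show the mechanics
demo = pd.DataFrame({
    'playerID': ['a', 'b', 'c'],
    'yearID': [2000, 2000, 2000],
    'lgID': ['NL', 'NL', 'NL'],
    'AB': [500, 500, 500],
    'H': [200, 150, 100],
    '2B': [0, 0, 0], '3B': [0, 0, 0], 'HR': [0, 0, 0],
    'BB': [0, 0, 0], 'HBP': [0, 0, 0], 'SF': [0, 0, 0],
})
ranked = rank_seasons_by_ops_plus(demo)
print(ranked)
```

With real data, the same function could be called on the `batting` table loaded earlier. League averages here include pitchers' batting lines, so the results will differ somewhat from published OPS+ values.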
Understanding how baseball has evolved requires examining statistical trends over time. Let's analyze how key metrics have changed decade by decade.
Calculating Decade Averages
First, we'll calculate league-wide statistics for each decade:
library(Lahman)
library(dplyr)
library(ggplot2)
# Calculate decade-by-decade league statistics
decade_analysis <- Batting %>%
  filter(yearID >= 1900) %>% # Focus on modern era
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_per_AB = Total_HR / Total_AB,
    SO_per_AB = Total_SO / Total_AB,
    BB_per_AB = Total_BB / Total_AB,
    # Per-team-game estimates, assuming roughly 34 team at-bats per game
    HR_per_Game = (Total_HR / Total_AB) * 34,
    SO_per_Game = (Total_SO / Total_AB) * 34
  )
print(decade_analysis)
# Pitching trends by decade
pitching_decade <- Pitching %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_IP = sum(IPouts, na.rm = TRUE) / 3, # IPouts counts outs, 3 per inning
    Total_ER = sum(ER, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    ERA = (Total_ER / Total_IP) * 9,
    SO_per_9 = (Total_SO / Total_IP) * 9,
    BB_per_9 = (Total_BB / Total_IP) * 9,
    SO_BB_ratio = Total_SO / Total_BB
  )
print(pitching_decade)
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
# Calculate decade-by-decade league statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10
decade_analysis = batting_modern.groupby('decade').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()
decade_analysis.columns = ['decade', 'Total_AB', 'Total_H', 'Total_HR', 'Total_SO', 'Total_BB']
# Calculate rate stats
decade_analysis['AVG'] = decade_analysis['Total_H'] / decade_analysis['Total_AB']
decade_analysis['HR_per_AB'] = decade_analysis['Total_HR'] / decade_analysis['Total_AB']
decade_analysis['SO_per_AB'] = decade_analysis['Total_SO'] / decade_analysis['Total_AB']
decade_analysis['BB_per_AB'] = decade_analysis['Total_BB'] / decade_analysis['Total_AB']
# Per-team-game estimates, assuming roughly 34 team at-bats per game
decade_analysis['HR_per_Game'] = (decade_analysis['Total_HR'] / decade_analysis['Total_AB']) * 34
decade_analysis['SO_per_Game'] = (decade_analysis['Total_SO'] / decade_analysis['Total_AB']) * 34
print(decade_analysis)
# Pitching trends by decade
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10
pitching_decade = pitching_modern.groupby('decade').agg({
    'IPouts': 'sum',
    'ER': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()
# Calculate IP from outs (3 outs per inning)
pitching_decade['Total_IP'] = pitching_decade['IPouts'] / 3
# Calculate rate stats
pitching_decade['ERA'] = (pitching_decade['ER'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_per_9'] = (pitching_decade['SO'] / pitching_decade['Total_IP']) * 9
pitching_decade['BB_per_9'] = (pitching_decade['BB'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_BB_ratio'] = pitching_decade['SO'] / pitching_decade['BB']
print(pitching_decade)
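The per-game columns above scale a per-at-bat rate by an assumed number of at-bats per game. One way to avoid that assumption, sketched below, is to divide by actual games played. The function `per_game_rates_by_decade` is a hypothetical helper; it assumes a team-season table with `yearID`, `G`, `AB`, `HR`, and `SO` columns (the layout of the Lahman Teams table), and the demo rows are made-up numbers.

```python
import pandas as pd


def per_game_rates_by_decade(teams, start_year=1900):
    """Per-team-game AB, HR, and SO rates by decade from team-season totals."""
    df = teams[teams['yearID'] >= start_year].copy()
    df['decade'] = (df['yearID'] // 10) * 10
    totals = df.groupby('decade')[['G', 'AB', 'HR', 'SO']].sum().reset_index()
    # Each row's G is one team's games, so these are rates per team-game.
    # Caveat: seasons with missing SO still contribute games, which would
    # pull the strikeout rate down for those decades.
    totals['AB_per_Game'] = totals['AB'] / totals['G']
    totals['HR_per_Game'] = totals['HR'] / totals['G']
    totals['SO_per_Game'] = totals['SO'] / totals['G']
    return totals


# Made-up team seasons, only to show the mechanics
demo_teams = pd.DataFrame({
    'yearID': [1998, 1999, 2005],
    'G': [162, 162, 162],
    'AB': [5500, 5600, 5500],
    'HR': [200, 160, 170],
    'SO': [1000, 1100, 1050],
})
rates = per_game_rates_by_decade(demo_teams)
print(rates)
```

The `AB_per_Game` column also gives a direct check on whatever at-bats-per-game constant is used in the approximation.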
Key Observations from Decade Analysis
Looking at the data reveals several clear trends:
The Dead Ball Era (1900-1919):
- Very low home run rates (< 0.5% of at-bats)
- Modest batting averages (roughly the mid-.250s)
- Low strikeout rates (< 10% of at-bats)
- Low ERAs (decade averages below 3.00)
The Live Ball Transition (1920-1930):
- Home run rates doubled or tripled relative to the Dead Ball Era
- Batting averages peaked around .280
- Strikeout rates remained low
- ERA spiked in the late 1920s
Post-Integration Era (1950-1960):
- More balanced offense and pitching
- Steady increase in power numbers
- Rising strikeout rates
- ERA stabilized around 4.00
The 1960s Pitching Dominance:
- Lowest ERAs since the Dead Ball Era
- Strikeouts began rapid ascent
- Batting averages declined
- Home runs suppressed
The Expansion and Power Era (1970-2000):
- Steady increase in home runs
- Batting averages remained stable
- Strikeouts continued rising
- ERAs fluctuated with rule changes
The Steroid Era Peak (1990-2010):
- Historic home run rates
- Elevated offensive numbers across the board
- Strikeouts accelerating
- Higher ERAs even as pitchers struck out more batters
Modern Three True Outcomes (2010-present):
- Historic strikeout rates (> 20% of at-bats)
- Continued high home run rates
- Lowest batting averages since 1960s
- Pitcher dominance returning
Visualizing Historical Trends
Let's create compelling visualizations of these trends:
library(ggplot2)
library(gridExtra)
# Note: linewidth replaces size for lines in ggplot2 3.4.0 and later
# Batting Average over time
avg_plot <- ggplot(decade_analysis, aes(x = decade, y = AVG)) +
  geom_line(linewidth = 1.5, color = "blue") +
  geom_point(size = 3, color = "blue") +
  labs(
    title = "MLB Batting Average by Decade",
    x = "Decade",
    y = "Batting Average"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
  # Limits wide enough to keep high-average decades such as the 1920s;
  # points outside the limits would be dropped
  scale_y_continuous(limits = c(0.230, 0.300))
# Home runs per at-bat
hr_plot <- ggplot(decade_analysis, aes(x = decade, y = HR_per_AB * 100)) +
  geom_line(linewidth = 1.5, color = "red") +
  geom_point(size = 3, color = "red") +
  labs(
    title = "MLB Home Run Rate by Decade",
    x = "Decade",
    y = "Home Runs per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# Strikeouts per at-bat
so_plot <- ggplot(decade_analysis, aes(x = decade, y = SO_per_AB * 100)) +
  geom_line(linewidth = 1.5, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  labs(
    title = "MLB Strikeout Rate by Decade",
    x = "Decade",
    y = "Strikeouts per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# ERA over time
era_plot <- ggplot(pitching_decade, aes(x = decade, y = ERA)) +
  geom_line(linewidth = 1.5, color = "purple") +
  geom_point(size = 3, color = "purple") +
  labs(
    title = "MLB ERA by Decade",
    x = "Decade",
    y = "ERA"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# Combine plots
grid.arrange(avg_plot, hr_plot, so_plot, era_plot, ncol = 2)
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)
# Create subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
# Batting Average over time
ax1.plot(decade_analysis['decade'], decade_analysis['AVG'],
         marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('MLB Batting Average by Decade', fontsize=14, fontweight='bold')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Batting Average')
# Limits wide enough to keep high-average decades such as the 1920s in view
ax1.set_ylim(0.230, 0.300)
ax1.grid(True, alpha=0.3)
# Home runs per at-bat
ax2.plot(decade_analysis['decade'], decade_analysis['HR_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('MLB Home Run Rate by Decade', fontsize=14, fontweight='bold')
ax2.set_xlabel('Decade')
ax2.set_ylabel('Home Runs per 100 AB')
ax2.grid(True, alpha=0.3)
# Strikeouts per at-bat
ax3.plot(decade_analysis['decade'], decade_analysis['SO_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='darkgreen')
ax3.set_title('MLB Strikeout Rate by Decade', fontsize=14, fontweight='bold')
ax3.set_xlabel('Decade')
ax3.set_ylabel('Strikeouts per 100 AB')
ax3.grid(True, alpha=0.3)
# ERA over time
ax4.plot(pitching_decade['decade'], pitching_decade['ERA'],
         marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('MLB ERA by Decade', fontsize=14, fontweight='bold')
ax4.set_xlabel('Decade')
ax4.set_ylabel('ERA')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('decade_trends.png', dpi=300, bbox_inches='tight')
plt.show()
Year-by-Year Analysis for Recent Trends
For more granular analysis, let's look at year-by-year trends in the modern era:
library(Lahman)
library(dplyr)
library(ggplot2)
# Year-by-year since 1990
modern_yearly <- Batting %>%
  filter(yearID >= 1990) %>%
  group_by(yearID) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_Rate = (Total_HR / Total_AB) * 100,
    SO_Rate = (Total_SO / Total_AB) * 100,
    BB_Rate = (Total_BB / Total_AB) * 100
  )
# Create visualization (linewidth replaces size for lines in ggplot2 3.4.0+)
ggplot(modern_yearly) +
  geom_line(aes(x = yearID, y = AVG * 1000), color = "blue", linewidth = 1) +
  geom_line(aes(x = yearID, y = HR_Rate * 10), color = "red", linewidth = 1) +
  geom_line(aes(x = yearID, y = SO_Rate * 10), color = "green", linewidth = 1) +
  labs(
    title = "Modern Era Trends (1990-Present)",
    subtitle = "Blue = AVG (×1000), Red = HR Rate (×10), Green = SO Rate (×10)",
    x = "Year",
    y = "Scaled Value"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
pyb.cache.enable()
batting = pyb.lahman.batting()
# Year-by-year since 1990
modern_batting = batting[batting['yearID'] >= 1990].copy()
modern_yearly = modern_batting.groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()
# Calculate rate stats
modern_yearly['AVG'] = modern_yearly['H'] / modern_yearly['AB']
modern_yearly['HR_Rate'] = (modern_yearly['HR'] / modern_yearly['AB']) * 100
modern_yearly['SO_Rate'] = (modern_yearly['SO'] / modern_yearly['AB']) * 100
modern_yearly['BB_Rate'] = (modern_yearly['BB'] / modern_yearly['AB']) * 100
# Create visualization
plt.figure(figsize=(14, 8))
plt.plot(modern_yearly['yearID'], modern_yearly['AVG'] * 1000,
         label='AVG (×1000)', linewidth=2, color='blue')
plt.plot(modern_yearly['yearID'], modern_yearly['HR_Rate'] * 10,
         label='HR Rate (×10)', linewidth=2, color='red')
plt.plot(modern_yearly['yearID'], modern_yearly['SO_Rate'] * 10,
         label='SO Rate (×10)', linewidth=2, color='green')
plt.title('Modern Era Trends (1990-Present)', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Value', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('modern_era_trends.png', dpi=300, bbox_inches='tight')
plt.show()
print(modern_yearly)
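The chart above puts three metrics on one axis by multiplying each by a hand-picked factor. An alternative worth considering, sketched below, is to index every metric to a base year so that each line starts at 100 and shows relative change. The helper `index_to_base_year` is hypothetical, it assumes a yearly table with a `yearID` column like `modern_yearly`, and the demo values are made up.

```python
import pandas as pd


def index_to_base_year(yearly, columns, base_year):
    """Rescale each column so that its value in base_year equals 100."""
    base = yearly.loc[yearly['yearID'] == base_year, columns]
    if len(base) != 1:
        raise ValueError(f"expected exactly one row for {base_year}")
    indexed = yearly[['yearID']].copy()
    for col in columns:
        indexed[col + '_index'] = 100 * yearly[col] / base[col].iloc[0]
    return indexed


# Made-up yearly rates, only to show the mechanics
demo_yearly = pd.DataFrame({
    'yearID': [1990, 2000, 2010],
    'AVG': [0.250, 0.275, 0.250],
    'SO_Rate': [16.0, 18.0, 20.0],
})
indexed = index_to_base_year(demo_yearly, ['AVG', 'SO_Rate'], base_year=1990)
print(indexed)
```

Plotting the `_index` columns against `yearID` would show, for example, a 25% rise in strikeout rate as a move from 100 to 125, without needing a per-metric scale factor in the legend.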
library(Lahman)
library(dplyr)
library(ggplot2)
# Calculate decade-by-decade league statistics
decade_analysis <- Batting %>%
filter(yearID >= 1900) %>% # Focus on modern era
mutate(decade = floor(yearID / 10) * 10) %>%
group_by(decade) %>%
summarise(
Total_AB = sum(AB, na.rm = TRUE),
Total_H = sum(H, na.rm = TRUE),
Total_HR = sum(HR, na.rm = TRUE),
Total_SO = sum(SO, na.rm = TRUE),
Total_BB = sum(BB, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
AVG = Total_H / Total_AB,
HR_per_AB = Total_HR / Total_AB,
SO_per_AB = Total_SO / Total_AB,
BB_per_AB = Total_BB / Total_AB,
HR_per_Game = (Total_HR / Total_AB) * 4.5, # Approximate ABs per game
SO_per_Game = (Total_SO / Total_AB) * 4.5
)
print(decade_analysis)
# Pitching trends by decade
pitching_decade <- Pitching %>%
filter(yearID >= 1900) %>%
mutate(decade = floor(yearID / 10) * 10) %>%
group_by(decade) %>%
summarise(
Total_IP = sum(IPouts, na.rm = TRUE) / 3,
Total_ER = sum(ER, na.rm = TRUE),
Total_SO = sum(SO, na.rm = TRUE),
Total_BB = sum(BB, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
ERA = (Total_ER / Total_IP) * 9,
SO_per_9 = (Total_SO / Total_IP) * 9,
BB_per_9 = (Total_BB / Total_IP) * 9,
SO_BB_ratio = Total_SO / Total_BB
)
print(pitching_decade)
library(ggplot2)
library(gridExtra)
# Batting Average over time
avg_plot <- ggplot(decade_analysis, aes(x = decade, y = AVG)) +
geom_line(size = 1.5, color = "blue") +
geom_point(size = 3, color = "blue") +
labs(
title = "MLB Batting Average by Decade",
x = "Decade",
y = "Batting Average"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
scale_y_continuous(limits = c(0.240, 0.280))
# Home runs per at-bat
hr_plot <- ggplot(decade_analysis, aes(x = decade, y = HR_per_AB * 100)) +
geom_line(size = 1.5, color = "red") +
geom_point(size = 3, color = "red") +
labs(
title = "MLB Home Run Rate by Decade",
x = "Decade",
y = "Home Runs per 100 AB"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# Strikeouts per at-bat
so_plot <- ggplot(decade_analysis, aes(x = decade, y = SO_per_AB * 100)) +
geom_line(size = 1.5, color = "darkgreen") +
geom_point(size = 3, color = "darkgreen") +
labs(
title = "MLB Strikeout Rate by Decade",
x = "Decade",
y = "Strikeouts per 100 AB"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# ERA over time
era_plot <- ggplot(pitching_decade, aes(x = decade, y = ERA)) +
geom_line(size = 1.5, color = "purple") +
geom_point(size = 3, color = "purple") +
labs(
title = "MLB ERA by Decade",
x = "Decade",
y = "ERA"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
# Combine plots
grid.arrange(avg_plot, hr_plot, so_plot, era_plot, ncol = 2)
library(Lahman)
library(dplyr)
library(ggplot2)
# Year-by-year since 1990
modern_yearly <- Batting %>%
filter(yearID >= 1990) %>%
group_by(yearID) %>%
summarise(
Total_AB = sum(AB, na.rm = TRUE),
Total_H = sum(H, na.rm = TRUE),
Total_HR = sum(HR, na.rm = TRUE),
Total_SO = sum(SO, na.rm = TRUE),
Total_BB = sum(BB, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
AVG = Total_H / Total_AB,
HR_Rate = (Total_HR / Total_AB) * 100,
SO_Rate = (Total_SO / Total_AB) * 100,
BB_Rate = (Total_BB / Total_AB) * 100
)
# Create visualization
ggplot(modern_yearly) +
geom_line(aes(x = yearID, y = AVG * 1000), color = "blue", size = 1) +
geom_line(aes(x = yearID, y = HR_Rate * 10), color = "red", size = 1) +
geom_line(aes(x = yearID, y = SO_Rate * 10), color = "green", size = 1) +
labs(
title = "Modern Era Trends (1990-Present)",
subtitle = "Blue = AVG (×1000), Red = HR Rate (×10), Green = SO Rate (×10)",
x = "Year",
y = "Scaled Value"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
# Calculate decade-by-decade league statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10
decade_analysis = batting_modern.groupby('decade').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum',
'BB': 'sum'
}).reset_index()
decade_analysis.columns = ['decade', 'Total_AB', 'Total_H', 'Total_HR', 'Total_SO', 'Total_BB']
# Calculate rate stats
decade_analysis['AVG'] = decade_analysis['Total_H'] / decade_analysis['Total_AB']
decade_analysis['HR_per_AB'] = decade_analysis['Total_HR'] / decade_analysis['Total_AB']
decade_analysis['SO_per_AB'] = decade_analysis['Total_SO'] / decade_analysis['Total_AB']
decade_analysis['BB_per_AB'] = decade_analysis['Total_BB'] / decade_analysis['Total_AB']
decade_analysis['HR_per_Game'] = (decade_analysis['Total_HR'] / decade_analysis['Total_AB']) * 4.5
decade_analysis['SO_per_Game'] = (decade_analysis['Total_SO'] / decade_analysis['Total_AB']) * 4.5
print(decade_analysis)
# Pitching trends by decade
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10
pitching_decade = pitching_modern.groupby('decade').agg({
'IPouts': 'sum',
'ER': 'sum',
'SO': 'sum',
'BB': 'sum'
}).reset_index()
# Calculate IP from outs
pitching_decade['Total_IP'] = pitching_decade['IPouts'] / 3
# Calculate rate stats
pitching_decade['ERA'] = (pitching_decade['ER'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_per_9'] = (pitching_decade['SO'] / pitching_decade['Total_IP']) * 9
pitching_decade['BB_per_9'] = (pitching_decade['BB'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_BB_ratio'] = pitching_decade['SO'] / pitching_decade['BB']
print(pitching_decade)
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)
# Create subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
# Batting Average over time
ax1.plot(decade_analysis['decade'], decade_analysis['AVG'],
marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('MLB Batting Average by Decade', fontsize=14, fontweight='bold')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Batting Average')
ax1.set_ylim(0.240, 0.280)
ax1.grid(True, alpha=0.3)
# Home runs per at-bat
ax2.plot(decade_analysis['decade'], decade_analysis['HR_per_AB'] * 100,
marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('MLB Home Run Rate by Decade', fontsize=14, fontweight='bold')
ax2.set_xlabel('Decade')
ax2.set_ylabel('Home Runs per 100 AB')
ax2.grid(True, alpha=0.3)
# Strikeouts per at-bat
ax3.plot(decade_analysis['decade'], decade_analysis['SO_per_AB'] * 100,
marker='o', linewidth=2, markersize=8, color='darkgreen')
ax3.set_title('MLB Strikeout Rate by Decade', fontsize=14, fontweight='bold')
ax3.set_xlabel('Decade')
ax3.set_ylabel('Strikeouts per 100 AB')
ax3.grid(True, alpha=0.3)
# ERA over time
ax4.plot(pitching_decade['decade'], pitching_decade['ERA'],
marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('MLB ERA by Decade', fontsize=14, fontweight='bold')
ax4.set_xlabel('Decade')
ax4.set_ylabel('ERA')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('decade_trends.png', dpi=300, bbox_inches='tight')
plt.show()
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
pyb.cache.enable()
batting = pyb.lahman.batting()
# Year-by-year since 1990
modern_batting = batting[batting['yearID'] >= 1990].copy()
modern_yearly = modern_batting.groupby('yearID').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum',
'BB': 'sum'
}).reset_index()
# Calculate rate stats
modern_yearly['AVG'] = modern_yearly['H'] / modern_yearly['AB']
modern_yearly['HR_Rate'] = (modern_yearly['HR'] / modern_yearly['AB']) * 100
modern_yearly['SO_Rate'] = (modern_yearly['SO'] / modern_yearly['AB']) * 100
modern_yearly['BB_Rate'] = (modern_yearly['BB'] / modern_yearly['AB']) * 100
# Create visualization
plt.figure(figsize=(14, 8))
plt.plot(modern_yearly['yearID'], modern_yearly['AVG'] * 1000,
         label='AVG (×1000)', linewidth=2, color='blue')
# HR_Rate and SO_Rate are per 100 AB, so ×10 puts them on a per-1,000-AB scale
plt.plot(modern_yearly['yearID'], modern_yearly['HR_Rate'] * 10,
         label='HR per 1,000 AB', linewidth=2, color='red')
plt.plot(modern_yearly['yearID'], modern_yearly['SO_Rate'] * 10,
         label='SO per 1,000 AB', linewidth=2, color='green')
plt.title('Modern Era Trends (1990-Present)', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Value', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('modern_era_trends.png', dpi=300, bbox_inches='tight')
plt.show()
print(modern_yearly)
Comparing individual players across eras is the ultimate test of our analytical methods. We need frameworks that account for different competitive environments while still allowing meaningful comparisons.
WAR as a Cross-Era Comparison Tool
Wins Above Replacement (WAR) is designed to be era-neutral because it compares players to replacement level within their own era. A player worth 8 WAR in 1927 provided approximately the same value as a player worth 8 WAR in 2023, even though their raw statistics might look completely different.
WAR's key advantages for cross-era comparison:
- Position Adjustment: Accounts for defensive value at different positions
- League and Park Adjustment: Built-in era and park factors
- Playing Time: Rewards durability and availability
- Replacement Level: Compares to the same baseline across eras
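As a rough illustration of the replacement-level idea, the sketch below sums a set of run components and converts them to wins. The component names follow the general structure of published position-player WAR formulas, but this is a simplification: every number here, including the runs-per-win values, is a made-up placeholder and not real player data.

```python
from dataclasses import dataclass


@dataclass
class RunComponents:
    """Runs relative to league average, plus a replacement-level credit.

    All values used below are hypothetical placeholders.
    """
    batting: float
    baserunning: float
    fielding: float
    positional: float
    replacement: float  # credit for playing time over a replacement player


def wins_above_replacement(runs: RunComponents, runs_per_win: float) -> float:
    """Sum the run components and convert runs to wins."""
    total_runs = (runs.batting + runs.baserunning + runs.fielding
                  + runs.positional + runs.replacement)
    return total_runs / runs_per_win


# Hypothetical season in a high-scoring era: a win costs more runs.
high_offense = wins_above_replacement(
    RunComponents(batting=55.0, baserunning=2.0, fielding=5.0,
                  positional=-3.0, replacement=21.0),
    runs_per_win=10.0,
)

# Hypothetical season in a low-scoring era: a win costs fewer runs.
low_offense = wins_above_replacement(
    RunComponents(batting=44.0, baserunning=2.0, fielding=5.0,
                  positional=-3.0, replacement=16.0),
    runs_per_win=8.0,
)

print(round(high_offense, 1), round(low_offense, 1))
```

The two invented seasons have different run totals but the same win value, which is the sense in which a wins-based metric can be compared across run environments.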
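The Lahman database does not include WAR, so the code below compares career counting statistics for several legendary players as a simplified stand-in. For actual WAR, use Baseball-Reference or FanGraphs data: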
library(Lahman)
library(dplyr)
# Note: Lahman database doesn't include WAR directly
# We'll use a simplified framework based on available data
# For actual WAR, you'd use Baseball-Reference or FanGraphs data
# Function to get career value statistics
get_career_value <- function(first_name, last_name) {
  # Get player ID (uses the first match if several players share a name)
  player <- People %>%
    filter(nameFirst == first_name, nameLast == last_name)
  if (nrow(player) == 0) {
    return(NULL)
  }
  player_id <- player$playerID[1]
  # Get career stats
  career <- Batting %>%
    filter(playerID == player_id) %>%
    summarise(
      Name = paste(first_name, last_name),
      # Batting has one row per stint, so count distinct seasons, not rows
      Years = n_distinct(yearID),
      Games = sum(G, na.rm = TRUE),
      AB = sum(AB, na.rm = TRUE),
      Hits = sum(H, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      RBI = sum(RBI, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      AVG = round(sum(H) / sum(AB), 3),
      First_Year = min(yearID),
      Last_Year = max(yearID)
    )
  return(career)
}
# Compare legendary players
legends <- list(
c("Babe", "Ruth"),
c("Ted", "Williams"),
c("Willie", "Mays"),
c("Hank", "Aaron"),
c("Barry", "Bonds"),
c("Mike", "Trout"),
c("Albert", "Pujols")
)
comparison <- bind_rows(lapply(legends, function(x) get_career_value(x[1], x[2])))
print(comparison)
# Calculate per-season averages
comparison <- comparison %>%
mutate(
HR_per_Season = round(HR / Years, 1),
Games_per_Season = round(Games / Years, 0)
)
print(comparison[, c("Name", "Years", "HR", "HR_per_Season", "AVG")])
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def get_career_value(first_name, last_name):
    """
    Get career value statistics for a player
    """
    # Get player ID (uses the first match if several players share a name)
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]
    if len(player) == 0:
        return None
    player_id = player['playerID'].values[0]
    # Get career stats
    career = batting[batting['playerID'] == player_id]
    stats = {
        'Name': f"{first_name} {last_name}",
        # One row per stint, so count distinct seasons rather than rows
        'Years': career['yearID'].nunique(),
        'Games': career['G'].sum(),
        'AB': career['AB'].sum(),
        'Hits': career['H'].sum(),
        'HR': career['HR'].sum(),
        'RBI': career['RBI'].sum(),
        'BB': career['BB'].sum(),
        'AVG': round(career['H'].sum() / career['AB'].sum(), 3),
        'First_Year': career['yearID'].min(),
        'Last_Year': career['yearID'].max()
    }
    return stats
# Compare legendary players
legends = [
('Babe', 'Ruth'),
('Ted', 'Williams'),
('Willie', 'Mays'),
('Hank', 'Aaron'),
('Barry', 'Bonds'),
('Mike', 'Trout'),
('Albert', 'Pujols')
]
comparison_data = []
for first, last in legends:
    stats = get_career_value(first, last)
    if stats:
        comparison_data.append(stats)
comparison = pd.DataFrame(comparison_data)
# Calculate per-season averages
comparison['HR_per_Season'] = (comparison['HR'] / comparison['Years']).round(1)
comparison['Games_per_Season'] = (comparison['Games'] / comparison['Years']).round(0)
print(comparison[['Name', 'Years', 'HR', 'HR_per_Season', 'AVG']])
Peak Value vs. Career Value
One of the great debates in player comparison is peak value versus career value. Should we favor a player who dominated for a decade or one who was very good for two decades?
Different perspectives:
Peak Value Advocates argue that:
- Peak performance shows what a player was truly capable of
- Health and durability are partly luck
- Hall of Fame should be about greatness, not accumulation
- A player's best 7 years show their true talent level
Career Value Advocates argue that:
- Longevity requires skill (staying healthy, adapting, maintaining fitness)
- Consistency over time is valuable
- Total contribution to teams matters
- Durability is a skill, not just luck
Let's analyze both perspectives:
library(Lahman)
library(dplyr)
# Function to get peak seasons (top 7 years)
get_peak_value <- function(first_name, last_name, n_years = 7) {
# Get player ID
player <- People %>%
filter(nameFirst == first_name, nameLast == last_name)
if(nrow(player) == 0) {
return(NULL)
}
player_id <- player$playerID[1]
# Get all seasons
seasons <- Batting %>%
filter(playerID == player_id) %>%
group_by(yearID) %>%
summarise(
AB = sum(AB),
H = sum(H),
HR = sum(HR),
RBI = sum(RBI),
BB = sum(BB),
.groups = 'drop'
) %>%
filter(AB >= 300) %>% # Qualified seasons only
arrange(desc(HR)) # Sort by HR (could use other metrics)
# Get top N seasons
peak <- seasons %>%
head(n_years) %>%
summarise(
Name = paste(first_name, last_name),
Peak_Years = n(),
Total_AB = sum(AB),
Total_HR = sum(HR),
Total_RBI = sum(RBI),
Avg_HR = round(mean(HR), 1),
Peak_AVG = round(sum(H) / sum(AB), 3)
)
return(peak)
}
# Compare peak value
peak_legends <- bind_rows(lapply(legends, function(x) get_peak_value(x[1], x[2])))
print(peak_legends)
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def get_peak_value(first_name, last_name, n_years=7):
    """
    Get peak value (best N seasons) for a player
    """
    # Get player ID
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]
    if len(player) == 0:
        return None
    player_id = player['playerID'].values[0]
    # Get all seasons (stints within a year are combined)
    seasons = batting[batting['playerID'] == player_id].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'RBI': 'sum',
        'BB': 'sum'
    }).reset_index()
    # Keep seasons with at least 300 AB
    seasons = seasons[seasons['AB'] >= 300]
    # Sort by HR (could use other metrics)
    seasons = seasons.sort_values('HR', ascending=False)
    # Get top N seasons
    peak = seasons.head(n_years)
    stats = {
        'Name': f"{first_name} {last_name}",
        'Peak_Years': len(peak),
        'Total_AB': peak['AB'].sum(),
        'Total_HR': peak['HR'].sum(),
        'Total_RBI': peak['RBI'].sum(),
        'Avg_HR': round(peak['HR'].mean(), 1),
        'Peak_AVG': round(peak['H'].sum() / peak['AB'].sum(), 3)
    }
    return stats
# Compare peak value
legends = [
('Babe', 'Ruth'),
('Ted', 'Williams'),
('Willie', 'Mays'),
('Hank', 'Aaron'),
('Barry', 'Bonds'),
('Mike', 'Trout')
]
peak_data = []
for first, last in legends:
    stats = get_peak_value(first, last)
    if stats:
        peak_data.append(stats)
peak_legends = pd.DataFrame(peak_data)
print(peak_legends)
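One way to avoid choosing between the two camps is to report both numbers and their average, which is the idea behind Jay Jaffe's JAWS (career WAR averaged with the best seven seasons of WAR). The sketch below applies that idea to any list of per-season values. The function name `peak_and_career` is an illustrative choice, and the two careers are invented to show the contrast; they are not real players.

```python
def peak_and_career(season_values, n_peak=7):
    """Return (career total, peak total, average of the two).

    season_values: one value per season (for example seasonal WAR).
    The peak is the sum of the best n_peak seasons, which need not be
    consecutive.
    """
    career = sum(season_values)
    peak = sum(sorted(season_values, reverse=True)[:n_peak])
    return career, peak, (career + peak) / 2


# Hypothetical careers, not real players.
short_brilliant = [9, 8, 8, 7, 7, 6, 5, 2, 1, 1]  # dominant for under a decade
long_steady = [4] * 20                             # very good for 20 years

for label, seasons in [("short_brilliant", short_brilliant),
                       ("long_steady", long_steady)]:
    career, peak, blend = peak_and_career(seasons)
    print(f"{label}: career={career}, peak={peak}, blend={blend}")
```

With these invented inputs the long career wins on total value and the short career wins on peak, while the blended figure lands close for both, which is why the choice of weighting matters so much in these debates.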
Building a "Best Seasons Ever" Analysis
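Let's start with the greatest single seasons in baseball history by raw home runs and batting average. These leaderboards are not era-adjusted, so treat them as a starting point for the adjustments discussed in this chapter: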
library(Lahman)
library(dplyr)
# Find the best single seasons by home runs (as a starting point)
best_hr_seasons <- Batting %>%
filter(yearID >= 1900) %>%
group_by(playerID, yearID) %>%
summarise(
lgID = first(lgID),
HR = sum(HR, na.rm = TRUE),
AB = sum(AB, na.rm = TRUE),
H = sum(H, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(AB >= 400) %>% # Qualified seasons
arrange(desc(HR)) %>%
head(30)
# Join with player names
best_hr_with_names <- best_hr_seasons %>%
left_join(People, by = "playerID") %>%
mutate(
Name = paste(nameFirst, nameLast),
AVG = round(H / AB, 3)
) %>%
select(Name, yearID, HR, AB, AVG) %>%
arrange(desc(HR))
print(best_hr_with_names)
# Now find best seasons by batting average (qualified)
best_avg_seasons <- Batting %>%
filter(yearID >= 1900) %>%
group_by(playerID, yearID) %>%
summarise(
AB = sum(AB, na.rm = TRUE),
H = sum(H, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(AB >= 400) %>%
mutate(AVG = H / AB) %>%
arrange(desc(AVG)) %>%
head(30)
best_avg_with_names <- best_avg_seasons %>%
left_join(People, by = "playerID") %>%
mutate(Name = paste(nameFirst, nameLast)) %>%
select(Name, yearID, AB, H, AVG) %>%
arrange(desc(AVG))
print(best_avg_with_names)
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
# Find the best single seasons by home runs
batting_modern = batting[batting['yearID'] >= 1900].copy()
best_hr_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
'lgID': 'first',
'HR': 'sum',
'AB': 'sum',
'H': 'sum'
}).reset_index()
# Filter qualified seasons
best_hr_seasons = best_hr_seasons[best_hr_seasons['AB'] >= 400]
best_hr_seasons = best_hr_seasons.sort_values('HR', ascending=False).head(30)
# Join with player names
best_hr_with_names = best_hr_seasons.merge(
people[['playerID', 'nameFirst', 'nameLast']],
on='playerID',
how='left'
)
best_hr_with_names['Name'] = (
best_hr_with_names['nameFirst'] + ' ' + best_hr_with_names['nameLast']
)
best_hr_with_names['AVG'] = (best_hr_with_names['H'] / best_hr_with_names['AB']).round(3)
print(best_hr_with_names[['Name', 'yearID', 'HR', 'AB', 'AVG']])
# Find best seasons by batting average
best_avg_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
'AB': 'sum',
'H': 'sum'
}).reset_index()
best_avg_seasons = best_avg_seasons[best_avg_seasons['AB'] >= 400]
best_avg_seasons['AVG'] = best_avg_seasons['H'] / best_avg_seasons['AB']
best_avg_seasons = best_avg_seasons.sort_values('AVG', ascending=False).head(30)
# Join with names
best_avg_with_names = best_avg_seasons.merge(
people[['playerID', 'nameFirst', 'nameLast']],
on='playerID',
how='left'
)
best_avg_with_names['Name'] = (
best_avg_with_names['nameFirst'] + ' ' + best_avg_with_names['nameLast']
)
print(best_avg_with_names[['Name', 'yearID', 'AB', 'H', 'AVG']])
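The leaderboards above rank raw totals, so seasons from high-offense years dominate them. A simple first step toward era adjustment is to divide each player's rate by his league's rate for that season. The sketch below does this with the batting averages quoted at the start of the chapter (Terry in 1930, Arraez in 2023). The function name `relative_index` is an illustrative choice, and no park adjustment is applied.

```python
def relative_index(player_rate, league_rate):
    """Scale a player's rate so that 100 equals league average."""
    if league_rate <= 0:
        raise ValueError("league_rate must be positive")
    return 100 * player_rate / league_rate


# (player, season, player AVG, league AVG) as quoted earlier in the chapter
seasons = [
    ("Bill Terry", 1930, 0.401, 0.303),
    ("Luis Arraez", 2023, 0.354, 0.248),
]

for name, year, avg, lg_avg in seasons:
    index = relative_index(avg, lg_avg)
    print(f"{name} {year}: AVG {avg:.3f}, index {index:.0f}")
```

On this scale the lower raw average can rate higher, because it was produced in a tougher hitting environment. The same function works for any rate statistic once a league baseline is available.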
Creating a Unified Comparison Framework
A more complete approach combines several metrics into one comparison table. Note that the OBP computed below is a simplified version that ignores hit-by-pitch and sacrifice flies:
# Create a comprehensive player comparison function
compare_players_comprehensive <- function(players_list) {
  # players_list should be a list of (firstName, lastName) pairs
  results <- list()
  for (i in seq_along(players_list)) {
    first <- players_list[[i]][1]
    last <- players_list[[i]][2]
    # Get player ID
    player <- People %>%
      filter(nameFirst == first, nameLast == last)
    if (nrow(player) == 0) next
    player_id <- player$playerID[1]
    # Career stats
    career <- Batting %>%
      filter(playerID == player_id) %>%
      summarise(
        Name = paste(first, last),
        # Batting has one row per stint, so count distinct seasons, not rows
        Seasons = n_distinct(yearID),
        Games = sum(G, na.rm = TRUE),
        AB = sum(AB, na.rm = TRUE),
        H = sum(H, na.rm = TRUE),
        X2B = sum(X2B, na.rm = TRUE),
        X3B = sum(X3B, na.rm = TRUE),
        HR = sum(HR, na.rm = TRUE),
        RBI = sum(RBI, na.rm = TRUE),
        BB = sum(BB, na.rm = TRUE),
        SO = sum(SO, na.rm = TRUE),
        SB = sum(SB, na.rm = TRUE)
      ) %>%
      mutate(
        AVG = round(H / AB, 3),
        # Simplified OBP: ignores hit-by-pitch and sacrifice flies
        OBP = round((H + BB) / (AB + BB), 3),
        SLG = round((H + X2B + 2*X3B + 3*HR) / AB, 3),
        OPS = round(OBP + SLG, 3),
        HR_per_Season = round(HR / Seasons, 1)
      )
    results[[i]] <- career
  }
  return(bind_rows(results))
}
# Compare all-time greats
all_time_greats <- list(
c("Babe", "Ruth"),
c("Ted", "Williams"),
c("Willie", "Mays"),
c("Barry", "Bonds")
)
comprehensive <- compare_players_comprehensive(all_time_greats)
print(comprehensive[, c("Name", "Seasons", "HR", "AVG", "OBP", "SLG", "OPS")])
def compare_players_comprehensive(players_list):
    """
    Comprehensive player comparison
    players_list: list of tuples (first_name, last_name)
    """
    results = []
    for first, last in players_list:
        # Get player ID
        player = people[
            (people['nameFirst'] == first) &
            (people['nameLast'] == last)
        ]
        if len(player) == 0:
            continue
        player_id = player['playerID'].values[0]
        # Career stats
        career = batting[batting['playerID'] == player_id]
        stats = {
            'Name': f"{first} {last}",
            # One row per stint, so count distinct seasons rather than rows
            'Seasons': career['yearID'].nunique(),
            'Games': career['G'].sum(),
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum(),
            'HR': career['HR'].sum(),
            'RBI': career['RBI'].sum(),
            'BB': career['BB'].sum(),
            'SO': career['SO'].sum(),
            'SB': career['SB'].sum()
        }
        # Calculate rate stats
        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        # Simplified OBP: ignores hit-by-pitch and sacrifice flies
        stats['OBP'] = round((stats['H'] + stats['BB']) / (stats['AB'] + stats['BB']), 3)
        stats['SLG'] = round(
            (stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']) / stats['AB'], 3
        )
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)
        stats['HR_per_Season'] = round(stats['HR'] / stats['Seasons'], 1)
        results.append(stats)
    return pd.DataFrame(results)
# Compare all-time greats
all_time_greats = [
('Babe', 'Ruth'),
('Ted', 'Williams'),
('Willie', 'Mays'),
('Barry', 'Bonds')
]
comprehensive = compare_players_comprehensive(all_time_greats)
print(comprehensive[['Name', 'Seasons', 'HR', 'AVG', 'OBP', 'SLG', 'OPS']])
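The OBP used above is a shortcut that counts only hits and walks. The official definition also includes hit-by-pitch in the numerator and hit-by-pitch plus sacrifice flies in the denominator. A minimal sketch of the fuller formulas follows. The stat line is invented, and because the Lahman `HBP` and `SF` columns are missing for some early seasons, treating missing values as zero there would be an assumption to state explicitly.

```python
def on_base_pct(h, bb, hbp, ab, sf):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


def slugging_pct(h, doubles, triples, hr, ab):
    """SLG = total bases / AB; H already counts one base for every hit."""
    total_bases = h + doubles + 2 * triples + 3 * hr
    return total_bases / ab if ab else 0.0


# Hypothetical stat line, not a real player season.
line = dict(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60, hbp=5, sf=5)

obp = on_base_pct(line["h"], line["bb"], line["hbp"], line["ab"], line["sf"])
slg = slugging_pct(line["h"], line["doubles"], line["triples"],
                   line["hr"], line["ab"])
shortcut_obp = (line["h"] + line["bb"]) / (line["ab"] + line["bb"])

print(f"OBP {obp:.3f}  SLG {slg:.3f}  OPS {obp + slg:.3f}")
print(f"Shortcut OBP {shortcut_obp:.3f}")
```

The two OBP versions usually differ by only a few points, but the gap is larger for players who are hit by pitches often, so the fuller formula is preferable whenever the columns are available.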
The so-called "steroid era" presents unique challenges for historical analysis. Performance-enhancing drug use was widespread in baseball from approximately the mid-1990s through the mid-2000s, distorting statistical records and complicating player comparisons.
Identifying the Steroid Era Statistically
While we can't definitively identify PED users through statistics alone (testing and investigation are required), we can identify the period when offense was anomalously high:
Statistical Markers of the Steroid Era:
- Home Run Explosion: The 1990s and early 2000s saw unprecedented home run rates
- Power at All Ages: Players maintained or increased power into their late 30s
- Muscle Mass Increase: Visual evidence showed dramatic physical changes (an observational marker rather than a statistical one)
- Breaking of "Unbreakable" Records: Maris's 61 HR record broken multiple times
- League-Wide Offensive Spike: Not just individual performances but systematic elevation
Let's analyze the data:
library(Lahman)
library(dplyr)
library(ggplot2)
# Calculate league-wide home run rates by year
hr_by_year <- Batting %>%
filter(yearID >= 1950) %>%
group_by(yearID) %>%
summarise(
Total_AB = sum(AB, na.rm = TRUE),
Total_HR = sum(HR, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
HR_Rate = (Total_HR / Total_AB) * 100,
Era = case_when(
yearID < 1994 ~ "Pre-Steroid",
yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
yearID > 2007 ~ "Post-Steroid"
)
)
# Visualize
ggplot(hr_by_year, aes(x = yearID, y = HR_Rate, color = Era)) +
geom_line(linewidth = 1.5) +  # 'size' for lines is deprecated since ggplot2 3.4.0
geom_point(size = 2) +
labs(
title = "MLB Home Run Rate by Year (1950-Present)",
subtitle = "Identifying the Steroid Era",
x = "Year",
y = "Home Runs per 100 AB"
) +
scale_color_manual(values = c("Pre-Steroid" = "blue",
"Steroid Era" = "red",
"Post-Steroid" = "green")) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
) +
geom_vline(xintercept = c(1994, 2007), linetype = "dashed", alpha = 0.5)
# Calculate average by era
era_averages <- hr_by_year %>%
group_by(Era) %>%
summarise(
Avg_HR_Rate = mean(HR_Rate),
.groups = 'drop'
)
print(era_averages)
# Look at 40+ HR seasons by era
hr_40_plus <- Batting %>%
filter(yearID >= 1950) %>%
group_by(playerID, yearID) %>%
summarise(
HR = sum(HR, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(HR >= 40) %>%
mutate(
Era = case_when(
yearID < 1994 ~ "Pre-Steroid",
yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
yearID > 2007 ~ "Post-Steroid"
)
)
# Count by era
hr_40_by_era <- hr_40_plus %>%
group_by(Era) %>%
summarise(
Count_40_HR_Seasons = n(),
.groups = 'drop'
)
print(hr_40_by_era)
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
# Calculate league-wide home run rates by year
batting_modern = batting[batting['yearID'] >= 1950].copy()
hr_by_year = batting_modern.groupby('yearID').agg({
'AB': 'sum',
'HR': 'sum'
}).reset_index()
hr_by_year.columns = ['yearID', 'Total_AB', 'Total_HR']
hr_by_year['HR_Rate'] = (hr_by_year['Total_HR'] / hr_by_year['Total_AB']) * 100
# Define eras
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid Era"
    else:
        return "Post-Steroid"
hr_by_year['Era'] = hr_by_year['yearID'].apply(classify_era)
# Visualize
plt.figure(figsize=(14, 8))
colors = {'Pre-Steroid': 'blue', 'Steroid Era': 'red', 'Post-Steroid': 'green'}
for era in ['Pre-Steroid', 'Steroid Era', 'Post-Steroid']:
    data = hr_by_year[hr_by_year['Era'] == era]
    plt.plot(data['yearID'], data['HR_Rate'],
             label=era, linewidth=2, marker='o', color=colors[era])
plt.axvline(x=1994, linestyle='--', alpha=0.5, color='black')
plt.axvline(x=2007, linestyle='--', alpha=0.5, color='black')
plt.title('MLB Home Run Rate by Year (1950-Present)\nIdentifying the Steroid Era',
fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Home Runs per 100 AB', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('steroid_era_hr_rate.png', dpi=300, bbox_inches='tight')
plt.show()
# Calculate average by era
era_averages = hr_by_year.groupby('Era')['HR_Rate'].mean().reset_index()
era_averages.columns = ['Era', 'Avg_HR_Rate']
print(era_averages)
# Look at 40+ HR seasons by era
hr_40_plus = batting_modern.groupby(['playerID', 'yearID'])['HR'].sum().reset_index()
hr_40_plus = hr_40_plus[hr_40_plus['HR'] >= 40]
hr_40_plus['Era'] = hr_40_plus['yearID'].apply(classify_era)
# Count by era
hr_40_by_era = hr_40_plus.groupby('Era').size().reset_index()
hr_40_by_era.columns = ['Era', 'Count_40_HR_Seasons']
print(hr_40_by_era)
Key Findings
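The data show a clear pattern (exact values depend on the database version):
- Pre-Steroid Era (1950-1993): HR rate averaged roughly 2.5% of at-bats
- Steroid Era (1994-2007): HR rate rose to roughly 3.0-3.5% of at-bats
- Post-Steroid Era (2008+): HR rate initially declined but has since rebounded, a change often attributed to the launch angle revolution
The number of 40+ HR seasons also spiked during the steroid era. Because the three eras span different numbers of seasons, compare these counts per season rather than as raw totals.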
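Raw counts of 40+ HR seasons favor whichever era covers the most years, so a per-season rate gives a fairer comparison. The sketch below uses the same era boundaries as the code above. The helper name `big_seasons_per_year` and the list of years are hypothetical and only illustrate the calculation; they are not the actual counts.

```python
from collections import Counter


def classify_era(year):
    """Same era boundaries as the analysis above."""
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid Era"
    return "Post-Steroid"


def big_seasons_per_year(season_years, first_year, last_year):
    """Average number of qualifying player-seasons per calendar year, by era.

    season_years: one entry per qualifying player-season (its year).
    """
    counts = Counter(classify_era(y) for y in season_years)
    span = Counter(classify_era(y) for y in range(first_year, last_year + 1))
    return {era: counts[era] / span[era] for era in span}


# Hypothetical years of 40+ HR seasons, invented for illustration only.
example_years = [1961, 1961, 1977, 1996, 1996, 1998, 1998,
                 2001, 2001, 2006, 2015, 2019]

rates = big_seasons_per_year(example_years, 1950, 2023)
for era, rate in rates.items():
    print(f"{era}: {rate:.2f} qualifying seasons per year")
```

To run this on real data, pass the `yearID` column of the filtered 40+ HR table in place of `example_years`.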
Adjusting for Performance Enhancement
How should analysts handle steroid-era statistics? Several approaches exist:
Approach 1: Accept the Numbers
- Statistics are what they are, regardless of cause
- We can't definitively know who used PEDs
- Adjusting requires subjective judgments
- Many factors contributed beyond PEDs
Approach 2: Era-Adjust More Aggressively
- Apply stronger era adjustments for 1994-2007
- Treat this period like any other high-offense era
- Use OPS+, ERA+, etc., which already account for context
- Don't make player-specific PED assumptions
Approach 3: Exclude Suspected Users
- Don't consider players with positive tests or evidence
- Creates a "clean" record book
- Risks unfairly excluding some players
- Difficult to apply consistently
Approach 4: Separate Era
- Treat steroid era as its own category
- Don't compare steroid-era players to other eras
- Create separate record books or rankings
- Acknowledges the unique circumstances
Most analysts favor Approach 2: using standard era-adjustment methods that naturally account for the inflated offense of the period.
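Approach 2 can be made concrete with an OPS+-style index. The sketch below uses the commonly cited form 100 × (OBP/lgOBP + SLG/lgSLG − 1) and leaves out the park adjustment that published OPS+ figures include. The stat line and both league baselines are invented round numbers.

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """Unadjusted OPS+: 100 is league average; no park factor applied."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)


# The same hypothetical raw line placed in two hypothetical league environments.
raw_obp, raw_slg = 0.400, 0.600

high_offense_league = ops_plus(raw_obp, raw_slg, lg_obp=0.345, lg_slg=0.435)
low_offense_league = ops_plus(raw_obp, raw_slg, lg_obp=0.310, lg_slg=0.360)

print(f"High-offense league: {high_offense_league:.0f}")
print(f"Low-offense league:  {low_offense_league:.0f}")
```

The identical raw line earns a lower index in the high-offense league, which is how a standard era adjustment discounts inflated numbers without any assumption about individual players.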
Example: Comparing Steroid-Era Stats
Let's compare Barry Bonds's legendary 2001 season (73 HR) to Babe Ruth's 1927 (60 HR) using era adjustment:
# Bonds 2001 vs Ruth 1927 (era-adjusted)
library(Lahman)
library(dplyr)
# League-wide HR rate (HR per AB) for each player's league and season
bonds_2001_lg <- Batting %>%
  filter(yearID == 2001, lgID == "NL") %>%
  summarise(
    lgAB = sum(AB, na.rm = TRUE),
    lgHR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(lgHR_rate = lgHR / lgAB)
ruth_1927_lg <- Batting %>%
  filter(yearID == 1927, lgID == "AL") %>%
  summarise(
    lgAB = sum(AB, na.rm = TRUE),
    lgHR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(lgHR_rate = lgHR / lgAB)
print(paste("2001 NL HR Rate:", round(bonds_2001_lg$lgHR_rate * 100, 2), "%"))
print(paste("1927 AL HR Rate:", round(ruth_1927_lg$lgHR_rate * 100, 2), "%"))
# Bonds hit 73 HR in 476 AB (15.3% of his ABs)
# Ruth hit 60 HR in 540 AB (11.1% of his ABs)
bonds_hr_rate <- 73 / 476
ruth_hr_rate <- 60 / 540
# Player HR rate as a multiple of the league HR rate
bonds_relative <- bonds_hr_rate / bonds_2001_lg$lgHR_rate
ruth_relative <- ruth_hr_rate / ruth_1927_lg$lgHR_rate
print(paste("Bonds homered at", round(bonds_relative, 1), "times the league-average rate"))
print(paste("Ruth homered at", round(ruth_relative, 1), "times the league-average rate"))
# Bonds 2001 vs Ruth 1927 (era-adjusted)
from pybaseball import lahman
batting = lahman.batting()
# League-wide HR rate (HR per AB) for each player's league and season
bonds_2001_lg = batting[
    (batting['yearID'] == 2001) &
    (batting['lgID'] == 'NL')
]
bonds_lg_ab = bonds_2001_lg['AB'].sum()
bonds_lg_hr = bonds_2001_lg['HR'].sum()
bonds_lg_rate = bonds_lg_hr / bonds_lg_ab
ruth_1927_lg = batting[
    (batting['yearID'] == 1927) &
    (batting['lgID'] == 'AL')
]
ruth_lg_ab = ruth_1927_lg['AB'].sum()
ruth_lg_hr = ruth_1927_lg['HR'].sum()
ruth_lg_rate = ruth_lg_hr / ruth_lg_ab
print(f"2001 NL HR Rate: {bonds_lg_rate * 100:.2f}%")
print(f"1927 AL HR Rate: {ruth_lg_rate * 100:.2f}%")
# Bonds hit 73 HR in 476 AB (15.3% of his ABs)
# Ruth hit 60 HR in 540 AB (11.1% of his ABs)
bonds_hr_rate = 73 / 476
ruth_hr_rate = 60 / 540
# Player HR rate as a multiple of the league HR rate
bonds_relative = bonds_hr_rate / bonds_lg_rate
ruth_relative = ruth_hr_rate / ruth_lg_rate
print(f"Bonds homered at {bonds_relative:.1f} times the league-average rate")
print(f"Ruth homered at {ruth_relative:.1f} times the league-average rate")
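Run against the Lahman data, the two ratios are far from equal. The 1927 American League homered in only about 1% of its at-bats, while the 2001 National League did so in more than 3%. Ruth's home run rate was therefore roughly 10 times his league's, against roughly 4.5 times for Bonds. Both seasons were extreme outliers, but relative to league context Ruth's was the more dominant.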
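One refinement worth considering: a dominant slugger's own home runs are part of the league total he is measured against, which matters most in low-homer leagues. A rough sketch follows, assuming a hypothetical helper `relative_rate`. The player line comes from the text, but the league totals are round placeholder numbers chosen for the example, not actual league figures.

```python
def relative_rate(player_hr, player_ab, league_hr, league_ab, exclude_player=False):
    """Ratio of a player's HR rate to the league HR rate.

    With exclude_player=True the player's own HR and AB are removed from
    the league totals before the league rate is computed.
    """
    if exclude_player:
        league_hr -= player_hr
        league_ab -= player_ab
    if player_ab <= 0 or league_ab <= 0 or league_hr <= 0:
        raise ValueError("at-bats and league home runs must be positive")
    return (player_hr / player_ab) / (league_hr / league_ab)


# Player line from the text (60 HR in 540 AB); league totals are placeholders
with_player = relative_rate(60, 540, league_hr=400, league_ab=40_000)
without_player = relative_rate(60, 540, league_hr=400, league_ab=40_000,
                               exclude_player=True)
print(f"Player included in league totals: {with_player:.1f}x")
print(f"Player excluded from league totals: {without_player:.1f}x")
```

Whether to exclude the player is a design choice. Standard indexes such as OPS+ keep the player in the league baseline, so the excluded version is best treated as a sensitivity check.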
Ethical Considerations in Historical Analysis
When analyzing the steroid era, analysts must balance several ethical considerations:
Statistical Integrity: Our job is to analyze numbers accurately, not to make moral judgments about players.
Historical Context: We must acknowledge the unique circumstances of each era without dismissing achievements.
Uncertainty: We often don't know who used PEDs and who didn't. Assumptions based on statistics alone can be unfair.
Consistency: Whatever approach we take should be applied consistently across all eras and players.
Transparency: We should be clear about our methods and assumptions when handling steroid-era data.
The safest approach is to:
- Use standard era-adjustment methods
- Note when players competed in the steroid era
- Avoid player-specific PED assumptions without evidence
- Let readers draw their own conclusions about individual cases
library(Lahman)
library(dplyr)
library(ggplot2)
# Calculate league-wide home run rates by year
hr_by_year <- Batting %>%
filter(yearID >= 1950) %>%
group_by(yearID) %>%
summarise(
Total_AB = sum(AB, na.rm = TRUE),
Total_HR = sum(HR, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
HR_Rate = (Total_HR / Total_AB) * 100,
Era = case_when(
yearID < 1994 ~ "Pre-Steroid",
yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
yearID > 2007 ~ "Post-Steroid"
)
)
# Visualize
ggplot(hr_by_year, aes(x = yearID, y = HR_Rate, color = Era)) +
geom_line(size = 1.5) +
geom_point(size = 2) +
labs(
title = "MLB Home Run Rate by Year (1950-Present)",
subtitle = "Identifying the Steroid Era",
x = "Year",
y = "Home Runs per 100 AB"
) +
scale_color_manual(values = c("Pre-Steroid" = "blue",
"Steroid Era" = "red",
"Post-Steroid" = "green")) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5)
) +
geom_vline(xintercept = c(1994, 2007), linetype = "dashed", alpha = 0.5)
# Calculate average by era
era_averages <- hr_by_year %>%
group_by(Era) %>%
summarise(
Avg_HR_Rate = mean(HR_Rate),
.groups = 'drop'
)
print(era_averages)
# Look at 40+ HR seasons by era
hr_40_plus <- Batting %>%
filter(yearID >= 1950) %>%
group_by(playerID, yearID) %>%
summarise(
HR = sum(HR, na.rm = TRUE),
.groups = 'drop'
) %>%
filter(HR >= 40) %>%
mutate(
Era = case_when(
yearID < 1994 ~ "Pre-Steroid",
yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
yearID > 2007 ~ "Post-Steroid"
)
)
# Count by era
hr_40_by_era <- hr_40_plus %>%
group_by(Era) %>%
summarise(
Count_40_HR_Seasons = n(),
.groups = 'drop'
)
print(hr_40_by_era)
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
# Calculate league-wide home run rates by year
batting_modern = batting[batting['yearID'] >= 1950].copy()
hr_by_year = batting_modern.groupby('yearID').agg({
'AB': 'sum',
'HR': 'sum'
}).reset_index()
hr_by_year.columns = ['yearID', 'Total_AB', 'Total_HR']
hr_by_year['HR_Rate'] = (hr_by_year['Total_HR'] / hr_by_year['Total_AB']) * 100
# Define eras
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid Era"
    else:
        return "Post-Steroid"
hr_by_year['Era'] = hr_by_year['yearID'].apply(classify_era)
# Visualize
plt.figure(figsize=(14, 8))
colors = {'Pre-Steroid': 'blue', 'Steroid Era': 'red', 'Post-Steroid': 'green'}
for era in ['Pre-Steroid', 'Steroid Era', 'Post-Steroid']:
    data = hr_by_year[hr_by_year['Era'] == era]
    plt.plot(data['yearID'], data['HR_Rate'],
             label=era, linewidth=2, marker='o', color=colors[era])
plt.axvline(x=1994, linestyle='--', alpha=0.5, color='black')
plt.axvline(x=2007, linestyle='--', alpha=0.5, color='black')
plt.title('MLB Home Run Rate by Year (1950-Present)\nIdentifying the Steroid Era',
fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Home Runs per 100 AB', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('steroid_era_hr_rate.png', dpi=300, bbox_inches='tight')
plt.show()
# Calculate average by era
era_averages = hr_by_year.groupby('Era')['HR_Rate'].mean().reset_index()
era_averages.columns = ['Era', 'Avg_HR_Rate']
print(era_averages)
# Look at 40+ HR seasons by era
hr_40_plus = batting_modern.groupby(['playerID', 'yearID'])['HR'].sum().reset_index()
hr_40_plus = hr_40_plus[hr_40_plus['HR'] >= 40]
hr_40_plus['Era'] = hr_40_plus['yearID'].apply(classify_era)
# Count by era
hr_40_by_era = hr_40_plus.groupby('Era').size().reset_index()
hr_40_by_era.columns = ['Era', 'Count_40_HR_Seasons']
print(hr_40_by_era)
Modern data visualization tools enable us to explore baseball's historical evolution through interactive graphics that reveal patterns impossible to detect in static tables. This section introduces three powerful interactive visualization approaches for historical analysis: animated timelines showing how league statistics evolved across decades, interactive era comparison tools for searching and comparing players, and dynamic trend analysis with range sliders for examining specific time periods.
Animated Timeline of League Averages
One of the most compelling ways to understand baseball's evolution is through animated visualizations that show how key statistics changed over time. We can create animations that step through each season, revealing the dramatic shifts in offensive production that define different eras.
Let's build an animated timeline showing home run rate, strikeout rate, and batting average from 1900 to 2023:
# Animated timeline of league statistics over time
library(tidyverse)
library(gganimate)
library(Lahman)
# Calculate league-wide statistics by year
league_evolution <- Batting %>%
filter(yearID >= 1900, yearID <= 2023) %>%
group_by(yearID) %>%
summarise(
total_ab = sum(AB, na.rm = TRUE),
total_h = sum(H, na.rm = TRUE),
total_hr = sum(HR, na.rm = TRUE),
total_so = sum(SO, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
avg = total_h / total_ab,
hr_rate = (total_hr / total_ab) * 100, # HR per 100 AB
k_rate = (total_so / total_ab) * 100, # K per 100 AB
decade = floor(yearID / 10) * 10,
era = case_when(
yearID < 1920 ~ "Dead Ball",
yearID < 1947 ~ "Live Ball (Pre-Integration)",
yearID < 1961 ~ "Integration Era",
yearID < 1993 ~ "Expansion Era",
yearID < 2006 ~ "Steroid Era",
TRUE ~ "Modern Era"
)
)
# Create animated plot
# transition_reveal() builds the line up year by year; its label variable is frame_along
anim <- ggplot(league_evolution,
               aes(x = yearID, y = avg, group = 1)) +
  geom_line(linewidth = 1.2, color = "darkblue") +
  geom_point(size = 3, color = "darkblue") +
  geom_text(aes(label = sprintf("%.3f", avg)),
            vjust = -1, size = 3.5, color = "darkblue") +
  labs(title = "MLB Batting Average Evolution: {round(frame_along)}",
       subtitle = "League-wide batting average by year",
       x = "Year",
       y = "Batting Average") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  transition_reveal(yearID)
# Render animation (124 frames, one per season from 1900 to 2023)
animate(anim, nframes = 124, fps = 4, width = 800, height = 500)
# Create multi-metric comparison
league_long <- league_evolution %>%
select(yearID, avg, hr_rate, k_rate, era) %>%
pivot_longer(cols = c(avg, hr_rate, k_rate),
names_to = "metric",
values_to = "value") %>%
mutate(
metric_label = case_when(
metric == "avg" ~ "Batting Average",
metric == "hr_rate" ~ "HR Rate (per 100 AB)",
metric == "k_rate" ~ "K Rate (per 100 AB)"
)
)
# Faceted animation showing all three metrics
# transition_reveal() exposes frame_along (not frame_time) for use in labels
multi_anim <- ggplot(league_long,
                     aes(x = yearID, y = value, color = metric_label)) +
  geom_line(linewidth = 1) +
  facet_wrap(~metric_label, scales = "free_y", ncol = 1) +
  labs(title = "Evolution of MLB Statistics: {round(frame_along)}",
       x = "Year",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, face = "bold")) +
  transition_reveal(yearID)
animate(multi_anim, nframes = 150, fps = 10, width = 800, height = 600)
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman
# Load historical batting data
batting = lahman.batting()
# Calculate league-wide statistics by year
league_evolution = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum'
}).reset_index()
league_evolution['avg'] = league_evolution['H'] / league_evolution['AB']
league_evolution['hr_rate'] = (league_evolution['HR'] / league_evolution['AB']) * 100
league_evolution['k_rate'] = (league_evolution['SO'] / league_evolution['AB']) * 100
# Add era classifications
def classify_era(year):
    if year < 1920:
        return "Dead Ball"
    elif year < 1947:
        return "Live Ball (Pre-Integration)"
    elif year < 1961:
        return "Integration Era"
    elif year < 1993:
        return "Expansion Era"
    elif year < 2006:
        return "Steroid Era"
    else:
        return "Modern Era"
league_evolution['era'] = league_evolution['yearID'].apply(classify_era)
league_evolution['decade'] = (league_evolution['yearID'] // 10) * 10
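# Create animated line chart with Plotly
# Each animation frame holds every season up to that year, so the line grows
# over time; a frame with a single season gives px.line nothing to connect
frames = pd.concat(
    [league_evolution[league_evolution['yearID'] <= year].assign(frame_year=year)
     for year in league_evolution['yearID']],
    ignore_index=True
)
fig = px.line(frames,
              x='yearID',
              y='avg',
              animation_frame='frame_year',
              range_x=[1900, 2023],
              range_y=[0.23, 0.31],
              title='MLB Batting Average Evolution Over Time',
              labels={'yearID': 'Year', 'avg': 'Batting Average',
                      'frame_year': 'Through'})
fig.update_traces(line=dict(color='darkblue', width=3))
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Batting Average',
    hovermode='x unified',
    showlegend=False
)
# Show the animation
fig.show()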
# Create multi-metric animated visualization
metrics = [('avg', 'AVG', 'blue'), ('hr_rate', 'HR', 'red'), ('k_rate', 'K', 'orange')]
fig_multi = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Batting Average', 'HR Rate (per 100 AB)', 'K Rate (per 100 AB)'),
    vertical_spacing=0.1
)
# Add one trace per metric (the full series is the initial state)
for row, (col, name, color) in enumerate(metrics, start=1):
    fig_multi.add_trace(
        go.Scatter(x=league_evolution['yearID'], y=league_evolution[col],
                   mode='lines', name=name, line=dict(color=color)),
        row=row, col=1
    )
# Build one frame per year; each frame redraws the three traces up to that year
def frame_for(year):
    year_data = league_evolution[league_evolution['yearID'] <= year]
    return go.Frame(
        name=str(year),
        traces=[0, 1, 2],
        data=[go.Scatter(x=year_data['yearID'], y=year_data[col])
              for col, _, _ in metrics]
    )
fig_multi.frames = [frame_for(year) for year in league_evolution['yearID']]
fig_multi.update_xaxes(title_text="Year", row=3, col=1)
fig_multi.update_layout(
    height=900,
    title_text="Evolution of MLB Statistics Over Time",
    showlegend=False,
    # Play button steps through the frames in order
    updatemenus=[dict(type='buttons',
                      buttons=[dict(label='Play', method='animate', args=[None])])]
)
fig_multi.show()
These animated visualizations clearly reveal baseball's major transitions: the dead ball era's low home run rates, the 1920s offensive explosion, the 1968 pitcher dominance, the steroid era's power surge, and the modern game's strikeout epidemic. The ability to watch these trends unfold year by year provides intuition that static charts cannot match.
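The transitions visible in the animations can also be located numerically. As a small sketch, assuming a hypothetical helper `largest_shifts` and a made-up series of league home run rates (illustrative values, not Lahman output), the function below ranks seasons by the absolute change from the previous season:

```python
def largest_shifts(rates_by_year: dict, top_n: int = 3) -> list:
    """Return (year, change) pairs with the largest absolute year-over-year change."""
    years = sorted(rates_by_year)
    changes = [
        (year, rates_by_year[year] - rates_by_year[prev])
        for prev, year in zip(years, years[1:])
    ]
    return sorted(changes, key=lambda pair: abs(pair[1]), reverse=True)[:top_n]


# Illustrative HR-per-100-AB values (made up for the example)
sample_rates = {1917: 0.4, 1918: 0.4, 1919: 0.5, 1920: 0.7, 1921: 1.1, 1922: 1.2}
top_two = largest_shifts(sample_rates, top_n=2)
for year, change in top_two:
    print(f"{year}: {change:+.2f} HR per 100 AB vs. previous season")
```

Applied to the `hr_rate` column of the yearly league table, the same ranking would give a quick list of seasons worth pausing on in the animation.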
Interactive Era Comparison Tool
To facilitate player comparisons across eras, we can build an interactive tool that allows users to search for players, select seasons, and instantly see era-adjusted comparisons. This approach combines database queries with interactive plotting.
library(tidyverse)
library(Lahman)
library(plotly)
# Function to calculate a simplified era-adjusted OPS+
# (ratio of player OPS to league OPS; no park adjustment, OBP ignores HBP and SF)
calculate_ops_plus <- function(player_batting, league_batting) {
  # Total bases = H + 2B + 2*3B + 3*HR, because H already counts every hit once
  player_ops <- with(player_batting,
                     (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)
  league_ops <- with(league_batting,
                     (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)
  ops_plus <- (player_ops / league_ops) * 100
  return(ops_plus)
}
# Create interactive player comparison
compare_players <- function(player_names, min_year = 1900, max_year = 2023) {
# Get player IDs
player_ids <- People %>%
filter(paste(nameFirst, nameLast) %in% player_names) %>%
pull(playerID)
# Get batting stats for these players
player_stats <- Batting %>%
filter(playerID %in% player_ids,
yearID >= min_year,
yearID <= max_year,
AB >= 300) %>%
left_join(People %>% select(playerID, nameFirst, nameLast),
by = "playerID") %>%
mutate(player_name = paste(nameFirst, nameLast))
# Calculate league averages by year
league_averages <- Batting %>%
filter(yearID >= min_year, yearID <= max_year) %>%
group_by(yearID) %>%
summarise(
lg_AB = sum(AB, na.rm = TRUE),
lg_H = sum(H, na.rm = TRUE),
lg_BB = sum(BB, na.rm = TRUE),
lg_X2B = sum(X2B, na.rm = TRUE),
lg_X3B = sum(X3B, na.rm = TRUE),
lg_HR = sum(HR, na.rm = TRUE)
)
# Calculate OPS+ for each player-season
# Total bases = H + 2B + 2*3B + 3*HR, because H already counts every hit once
comparison_data <- player_stats %>%
  left_join(league_averages, by = "yearID") %>%
  mutate(
    player_ops = (H + BB) / (AB + BB) +
      (H + X2B + 2*X3B + 3*HR) / AB,
    league_ops = (lg_H + lg_BB) / (lg_AB + lg_BB) +
      (lg_H + lg_X2B + 2*lg_X3B + 3*lg_HR) / lg_AB,
    ops_plus = (player_ops / league_ops) * 100,
    ba = H / AB
  )
# Create interactive plot
p <- plot_ly(comparison_data,
x = ~yearID,
y = ~ops_plus,
color = ~player_name,
type = 'scatter',
mode = 'lines+markers',
text = ~paste("Year:", yearID,
"<br>Player:", player_name,
"<br>OPS+:", round(ops_plus, 1),
"<br>BA:", sprintf("%.3f", ba)),
hoverinfo = 'text') %>%
layout(title = "Era-Adjusted Performance Comparison",
xaxis = list(title = "Year"),
yaxis = list(title = "OPS+ (100 = League Average)"),
hovermode = 'closest')
return(p)
}
# Example usage: Compare Ruth, Williams, Bonds
comparison <- compare_players(
c("Babe Ruth", "Ted Williams", "Barry Bonds"),
min_year = 1914,
max_year = 2007
)
comparison
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from pybaseball import lahman
# Load data
batting = lahman.batting()
people = lahman.people()
def calculate_ops(row):
    """Calculate OPS from batting statistics"""
    if row['AB'] == 0:
        return 0
    obp = (row['H'] + row['BB']) / (row['AB'] + row['BB']) if (row['AB'] + row['BB']) > 0 else 0
    slg = (row['H'] + row['2B'] + 2*row['3B'] + 3*row['HR']) / row['AB'] if row['AB'] > 0 else 0
    return obp + slg
def compare_players_interactive(player_names, min_year=1900, max_year=2023):
    """
    Create interactive comparison of players across eras
    Parameters:
    -----------
    player_names : list
        List of player names in format ["First Last", ...]
    min_year : int
        Starting year for comparison
    max_year : int
        Ending year for comparison
    """
    # Parse player names
    player_data = []
    for name in player_names:
        first, last = name.split()[0], ' '.join(name.split()[1:])
        player_data.append((first, last))
    # Get player IDs
    player_ids = []
    for first, last in player_data:
        matches = people[(people['nameFirst'] == first) &
                         (people['nameLast'] == last)]
        if len(matches) > 0:
            player_ids.append(matches.iloc[0]['playerID'])
    # Filter batting data
    player_stats = batting[
        (batting['playerID'].isin(player_ids)) &
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year) &
        (batting['AB'] >= 300)
    ].copy()
    # Merge with player names
    player_stats = player_stats.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID'
    )
    player_stats['player_name'] = (player_stats['nameFirst'] + ' ' +
                                   player_stats['nameLast'])
    # Calculate league averages by year
    league_avg = batting[
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year)
    ].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'BB': 'sum',
        '2B': 'sum',
        '3B': 'sum',
        'HR': 'sum'
    }).reset_index()
    league_avg['league_ops'] = league_avg.apply(calculate_ops, axis=1)
    # Calculate player OPS and OPS+
    player_stats['player_ops'] = player_stats.apply(calculate_ops, axis=1)
    player_stats = player_stats.merge(
        league_avg[['yearID', 'league_ops']],
        on='yearID'
    )
    player_stats['ops_plus'] = (player_stats['player_ops'] /
                                player_stats['league_ops']) * 100
    player_stats['ba'] = player_stats['H'] / player_stats['AB']
    # Create interactive plot
    fig = px.line(player_stats,
                  x='yearID',
                  y='ops_plus',
                  color='player_name',
                  markers=True,
                  title='Era-Adjusted Performance Comparison',
                  labels={'yearID': 'Year',
                          'ops_plus': 'OPS+ (100 = League Average)',
                          'player_name': 'Player'})
    fig.add_hline(y=100, line_dash="dash", line_color="gray",
                  annotation_text="League Average")
    fig.update_traces(
        hovertemplate='<b>%{fullData.name}</b><br>' +
                      'Year: %{x}<br>' +
                      'OPS+: %{y:.1f}<br>' +
                      '<extra></extra>'
    )
    fig.update_layout(
        hovermode='x unified',
        xaxis_title='Year',
        yaxis_title='OPS+ (100 = League Average)',
        legend_title='Player',
        height=600
    )
    return fig
# Example: Compare legendary players across eras
fig = compare_players_interactive(
["Babe Ruth", "Ted Williams", "Barry Bonds"],
min_year=1914,
max_year=2007
)
fig.show()
This interactive tool enables users to explore how players performed relative to their peers, regardless of when they played. Hovering over data points reveals detailed statistics, and the ability to add or remove players makes it easy to test different hypotheses about historical greatness.
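The functions above approximate OPS+ as a simple ratio of player OPS to league OPS. The commonly published definition (for example, the one Baseball-Reference describes) instead normalizes on-base percentage and slugging separately, and also applies a park adjustment. A minimal sketch of the unadjusted form follows, assuming hypothetical helper names and illustrative inputs rather than a real player line:

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """Unadjusted OPS+: 100 * (OBP / lgOBP + SLG / lgSLG - 1). No park factor."""
    if lg_obp <= 0 or lg_slg <= 0:
        raise ValueError("league OBP and SLG must be positive")
    return 100.0 * (obp / lg_obp + slg / lg_slg - 1.0)


def ops_ratio(obp, slg, lg_obp, lg_slg):
    """The simpler ratio used above: 100 * player OPS / league OPS."""
    return 100.0 * (obp + slg) / (lg_obp + lg_slg)


# Illustrative inputs (not a real player-season)
player = dict(obp=0.450, slg=0.700)
league = dict(lg_obp=0.330, lg_slg=0.400)

component_form = ops_plus(**player, **league)
simple_form = ops_ratio(**player, **league)
print(f"OPS+ (component form): {component_form:.0f}")
print(f"OPS ratio (simple form): {simple_form:.0f}")
```

The two forms agree at exactly league average, where both give 100, but diverge for extreme seasons. The y-axis of the comparison plot is therefore best read as a ratio index rather than as published OPS+ values.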
Historical Trends with Range Slider
For detailed analysis of specific time periods, Plotly's range slider functionality allows users to zoom into particular eras while maintaining context of the full timeline. This is particularly useful for examining shorter-term trends within longer historical narratives.
library(tidyverse)
library(plotly)
library(Lahman)
# Prepare comprehensive historical data
historical_trends <- Batting %>%
filter(yearID >= 1900, yearID <= 2023) %>%
group_by(yearID) %>%
summarise(
total_ab = sum(AB, na.rm = TRUE),
total_h = sum(H, na.rm = TRUE),
total_hr = sum(HR, na.rm = TRUE),
total_so = sum(SO, na.rm = TRUE),
total_bb = sum(BB, na.rm = TRUE),
total_sb = sum(SB, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
batting_avg = total_h / total_ab,
hr_per_game = total_hr / (total_ab / 4), # Rough proxy: assumes about 4 AB per player-game
k_per_pa = total_so / (total_ab + total_bb), # PA approximated as AB + BB
bb_per_pa = total_bb / (total_ab + total_bb),
sb_per_game = total_sb / (total_ab / 4) # Same 4-AB proxy as hr_per_game
)
# Create multi-trace plot with range slider
fig <- plot_ly()
# Add batting average trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~batting_avg,
type = 'scatter',
mode = 'lines',
name = 'Batting Average',
line = list(color = 'blue', width = 2)
)
# Add HR rate trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~hr_per_game * 10, # Scale for visibility
type = 'scatter',
mode = 'lines',
name = 'HR Rate (×10)',
line = list(color = 'red', width = 2),
yaxis = 'y2'
)
# Add K rate trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~k_per_pa,
type = 'scatter',
mode = 'lines',
name = 'K Rate',
line = list(color = 'orange', width = 2)
)
# Configure layout with range slider
fig <- fig %>% layout(
  title = "Historical Trends in MLB Statistics (1900-2023)",
  xaxis = list(
    title = "Year",
    # yearID is numeric, so the slider needs no date type
    rangeslider = list(visible = TRUE),
    range = c(1900, 2023)
  ),
  yaxis = list(
    title = "Rate",
    side = "left"
  ),
  yaxis2 = list(
    title = "HR Rate (×10)",
    overlaying = "y",
    side = "right",
    showgrid = FALSE
  ),
  hovermode = 'x unified',
  legend = list(x = 0.1, y = 0.9)
)
fig
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman
# Load and prepare data
batting = lahman.batting()
historical_trends = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum',
'BB': 'sum',
'SB': 'sum'
}).reset_index()
# Calculate rates
historical_trends['batting_avg'] = historical_trends['H'] / historical_trends['AB']
historical_trends['hr_rate'] = (historical_trends['HR'] / historical_trends['AB']) * 100
historical_trends['k_rate'] = historical_trends['SO'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['bb_rate'] = historical_trends['BB'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['iso'] = (historical_trends['HR'] * 3) / historical_trends['AB'] # Simplified ISO
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['batting_avg'],
name='Batting Average',
line=dict(color='blue', width=2)
),
secondary_y=False
)
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['hr_rate'],
name='HR Rate (per 100 AB)',
line=dict(color='red', width=2)
),
secondary_y=True
)
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['k_rate'],
name='K Rate',
line=dict(color='orange', width=2)
),
secondary_y=False
)
# Add era markers
eras = [
(1920, 'Live Ball Era'),
(1947, 'Integration'),
(1961, 'Expansion'),
(1993, 'Steroid Era'),
(2006, 'Modern Era')
]
for year, label in eras:
    fig.add_vline(
        x=year,
        line_dash="dash",
        line_color="gray",
        annotation_text=label,
        annotation_position="top"
    )
# Update layout with range slider
# Note: rangeselector buttons use date steps ("year"); with integer yearID
# values on a numeric axis they may not appear unless the x values are
# converted to datetimes
fig.update_xaxes(
    title_text="Year",
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=list([
            dict(count=10, label="10y", step="year", stepmode="backward"),
            dict(count=25, label="25y", step="year", stepmode="backward"),
            dict(count=50, label="50y", step="year", stepmode="backward"),
            dict(step="all", label="All")
        ])
    )
)
fig.update_yaxes(title_text="Batting Average / K Rate", secondary_y=False)
fig.update_yaxes(title_text="HR Rate (per 100 AB)", secondary_y=True)
fig.update_layout(
title_text="Historical Trends in MLB Statistics (1900-2023)",
hovermode='x unified',
height=600,
legend=dict(x=0.01, y=0.99)
)
fig.show()
The range slider enables users to focus on specific periods (like the steroid era from 1993-2005) while maintaining awareness of broader historical context. The range selector buttons provide quick access to common analysis windows (10 years, 25 years, etc.), making it easy to examine how quickly baseball statistics have evolved during different periods.
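The same analysis windows can be summarized numerically. As a sketch, assuming a hypothetical helper `window_change` and a made-up yearly series (illustrative values only), the function below reports how much a rate changed across the trailing N seasons, mirroring the 10-, 25-, and 50-year windows:

```python
def window_change(series: dict, end_year: int, span: int) -> float:
    """Change in a yearly rate between end_year - span and end_year."""
    start_year = end_year - span
    if start_year not in series or end_year not in series:
        raise KeyError(f"series must contain both {start_year} and {end_year}")
    return series[end_year] - series[start_year]


# Illustrative strikeout rates per plate appearance (made up for the example)
k_rate = {1973: 0.13, 1998: 0.17, 2013: 0.20, 2023: 0.23}

ten_year = window_change(k_rate, end_year=2023, span=10)
fifty_year = window_change(k_rate, end_year=2023, span=50)
print(f"10-year change: {ten_year:+.3f}")
print(f"50-year change: {fifty_year:+.3f}")
```

Dividing each change by its span gives a per-season pace, which makes windows of different lengths directly comparable.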
These interactive visualization techniques transform historical baseball analysis from static number-crunching into dynamic exploration. They reveal patterns, enable comparisons, and provide intuitive understanding of how dramatically the game has changed over more than a century of professional play.
# Animated timeline of league statistics over time
library(tidyverse)
library(gganimate)
library(Lahman)
# Calculate league-wide statistics by year
league_evolution <- Batting %>%
filter(yearID >= 1900, yearID <= 2023) %>%
group_by(yearID) %>%
summarise(
total_ab = sum(AB, na.rm = TRUE),
total_h = sum(H, na.rm = TRUE),
total_hr = sum(HR, na.rm = TRUE),
total_so = sum(SO, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
avg = total_h / total_ab,
hr_rate = (total_hr / total_ab) * 100, # HR per 100 AB
k_rate = (total_so / total_ab) * 100, # K per 100 AB
decade = floor(yearID / 10) * 10,
era = case_when(
yearID < 1920 ~ "Dead Ball",
yearID < 1947 ~ "Live Ball (Pre-Integration)",
yearID < 1961 ~ "Integration Era",
yearID < 1993 ~ "Expansion Era",
yearID < 2006 ~ "Steroid Era",
TRUE ~ "Modern Era"
)
)
# Create animated plot
anim <- ggplot(league_evolution,
aes(x = yearID, y = avg, group = 1)) +
geom_line(size = 1.2, color = "darkblue") +
geom_point(size = 3, color = "darkblue") +
geom_text(aes(label = sprintf("%.3f", avg)),
vjust = -1, size = 3.5, color = "darkblue") +
labs(title = "MLB Batting Average Evolution: {frame_time}",
subtitle = "League-wide batting average by year",
x = "Year",
y = "Batting Average") +
theme_minimal() +
theme(plot.title = element_text(size = 16, face = "bold")) +
transition_time(yearID) +
ease_aes('linear') +
shadow_wake(wake_length = 0.1)
# Render animation
animate(anim, nframes = 124, fps = 4, width = 800, height = 500)
# Create multi-metric comparison
league_long <- league_evolution %>%
select(yearID, avg, hr_rate, k_rate, era) %>%
pivot_longer(cols = c(avg, hr_rate, k_rate),
names_to = "metric",
values_to = "value") %>%
mutate(
metric_label = case_when(
metric == "avg" ~ "Batting Average",
metric == "hr_rate" ~ "HR Rate (per 100 AB)",
metric == "k_rate" ~ "K Rate (per 100 AB)"
)
)
# Faceted animation showing all three metrics
multi_anim <- ggplot(league_long,
aes(x = yearID, y = value, color = metric_label)) +
geom_line(size = 1) +
facet_wrap(~metric_label, scales = "free_y", ncol = 1) +
labs(title = "Evolution of MLB Statistics: {frame_time}",
x = "Year",
y = "Value") +
theme_minimal() +
theme(legend.position = "none",
strip.text = element_text(size = 12, face = "bold")) +
transition_reveal(yearID)
animate(multi_anim, nframes = 150, fps = 10, width = 800, height = 600)
library(tidyverse)
library(Lahman)
library(plotly)
# Function to calculate era-adjusted OPS+
calculate_ops_plus <- function(player_batting, league_batting) {
player_ops <- with(player_batting,
(H + BB) / (AB + BB) + (H + 2*X2B + 3*X3B + 4*HR) / AB)
league_ops <- with(league_batting,
(H + BB) / (AB + BB) + (H + 2*X2B + 3*X3B + 4*HR) / AB)
ops_plus <- (player_ops / league_ops) * 100
return(ops_plus)
}
# Create interactive player comparison
compare_players <- function(player_names, min_year = 1900, max_year = 2023) {
# Get player IDs
player_ids <- People %>%
filter(paste(nameFirst, nameLast) %in% player_names) %>%
pull(playerID)
# Get batting stats for these players
player_stats <- Batting %>%
filter(playerID %in% player_ids,
yearID >= min_year,
yearID <= max_year,
AB >= 300) %>%
left_join(People %>% select(playerID, nameFirst, nameLast),
by = "playerID") %>%
mutate(player_name = paste(nameFirst, nameLast))
# Calculate league averages by year
league_averages <- Batting %>%
filter(yearID >= min_year, yearID <= max_year) %>%
group_by(yearID) %>%
summarise(
lg_AB = sum(AB, na.rm = TRUE),
lg_H = sum(H, na.rm = TRUE),
lg_BB = sum(BB, na.rm = TRUE),
lg_X2B = sum(X2B, na.rm = TRUE),
lg_X3B = sum(X3B, na.rm = TRUE),
lg_HR = sum(HR, na.rm = TRUE)
)
# Calculate OPS+ for each player-season
comparison_data <- player_stats %>%
left_join(league_averages, by = "yearID") %>%
rowwise() %>%
mutate(
player_ops = (H + BB) / (AB + BB) +
(H + 2*X2B + 3*X3B + 4*HR) / AB,
league_ops = (lg_H + lg_BB) / (lg_AB + lg_BB) +
(lg_H + 2*lg_X2B + 3*lg_X3B + 4*lg_HR) / lg_AB,
ops_plus = (player_ops / league_ops) * 100,
ba = H / AB
) %>%
ungroup()
# Create interactive plot
p <- plot_ly(comparison_data,
x = ~yearID,
y = ~ops_plus,
color = ~player_name,
type = 'scatter',
mode = 'lines+markers',
text = ~paste("Year:", yearID,
"<br>Player:", player_name,
"<br>OPS+:", round(ops_plus, 1),
"<br>BA:", sprintf("%.3f", ba)),
hoverinfo = 'text') %>%
layout(title = "Era-Adjusted Performance Comparison",
xaxis = list(title = "Year"),
yaxis = list(title = "OPS+ (100 = League Average)"),
hovermode = 'closest')
return(p)
}
# Example usage: Compare Ruth, Williams, Bonds
comparison <- compare_players(
c("Babe Ruth", "Ted Williams", "Barry Bonds"),
min_year = 1914,
max_year = 2007
)
comparison
library(tidyverse)
library(plotly)
library(Lahman)
# Prepare comprehensive historical data
historical_trends <- Batting %>%
filter(yearID >= 1900, yearID <= 2023) %>%
group_by(yearID) %>%
summarise(
total_ab = sum(AB, na.rm = TRUE),
total_h = sum(H, na.rm = TRUE),
total_hr = sum(HR, na.rm = TRUE),
total_so = sum(SO, na.rm = TRUE),
total_bb = sum(BB, na.rm = TRUE),
total_sb = sum(SB, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
batting_avg = total_h / total_ab,
hr_per_game = total_hr / (total_ab / 4), # Approximate games
k_per_pa = total_so / (total_ab + total_bb),
bb_per_pa = total_bb / (total_ab + total_bb),
sb_per_game = total_sb / (total_ab / 4)
)
# Create multi-trace plot with range slider
fig <- plot_ly()
# Add batting average trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~batting_avg,
type = 'scatter',
mode = 'lines',
name = 'Batting Average',
line = list(color = 'blue', width = 2)
)
# Add HR rate trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~hr_per_game * 10, # Scale for visibility
type = 'scatter',
mode = 'lines',
name = 'HR Rate (×10)',
line = list(color = 'red', width = 2),
yaxis = 'y2'
)
# Add K rate trace
fig <- fig %>% add_trace(
data = historical_trends,
x = ~yearID,
y = ~k_per_pa,
type = 'scatter',
mode = 'lines',
name = 'K Rate',
line = list(color = 'orange', width = 2)
)
# Configure layout with range slider
fig <- fig %>% layout(
title = "Historical Trends in MLB Statistics (1900-2023)",
xaxis = list(
title = "Year",
rangeslider = list(type = "date", visible = TRUE),
range = c(1900, 2023)
),
yaxis = list(
title = "Rate",
side = "left"
),
yaxis2 = list(
overlaying = "y",
side = "right",
showgrid = FALSE
),
hovermode = 'x unified',
legend = list(x = 0.1, y = 0.9)
)
fig
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman
# Load historical batting data
batting = lahman.batting()
# Calculate league-wide statistics by year
league_evolution = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum'
}).reset_index()
league_evolution['avg'] = league_evolution['H'] / league_evolution['AB']
league_evolution['hr_rate'] = (league_evolution['HR'] / league_evolution['AB']) * 100
league_evolution['k_rate'] = (league_evolution['SO'] / league_evolution['AB']) * 100
# Add era classifications
def classify_era(year):
    if year < 1920:
        return "Dead Ball"
    elif year < 1947:
        return "Live Ball (Pre-Integration)"
    elif year < 1961:
        return "Integration Era"
    elif year < 1993:
        return "Expansion Era"
    elif year < 2006:
        return "Steroid Era"
    else:
        return "Modern Era"
league_evolution['era'] = league_evolution['yearID'].apply(classify_era)
league_evolution['decade'] = (league_evolution['yearID'] // 10) * 10
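# Create animated line chart with Plotly.
# Each frame holds every season up to that year, so the line grows over time;
# animating on 'yearID' directly would leave a single point in each frame.
cumulative = pd.concat(
    [league_evolution[league_evolution['yearID'] <= year].assign(frame_year=year)
     for year in league_evolution['yearID']],
    ignore_index=True
)
fig = px.line(cumulative,
              x='yearID',
              y='avg',
              animation_frame='frame_year',
              range_x=[1900, 2023],
              range_y=[0.23, 0.31],
              title='MLB Batting Average Evolution Over Time',
              labels={'yearID': 'Year', 'avg': 'Batting Average'})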
fig.update_traces(line=dict(color='darkblue', width=3))
fig.update_layout(
xaxis_title='Year',
yaxis_title='Batting Average',
hovermode='x unified',
showlegend=False
)
# Show the animation
fig.show()
# Create multi-metric visualization (one static panel per metric)
fig_multi = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Batting Average', 'HR Rate (per 100 AB)', 'K Rate (per 100 AB)'),
    vertical_spacing=0.1
)
# Add one trace per metric. Looping over years and adding a trace each time
# would only stack hundreds of overlapping lines, not animate them.
fig_multi.add_trace(
    go.Scatter(x=league_evolution['yearID'], y=league_evolution['avg'],
               mode='lines', name='AVG', line=dict(color='blue')),
    row=1, col=1
)
fig_multi.add_trace(
    go.Scatter(x=league_evolution['yearID'], y=league_evolution['hr_rate'],
               mode='lines', name='HR', line=dict(color='red')),
    row=2, col=1
)
fig_multi.add_trace(
    go.Scatter(x=league_evolution['yearID'], y=league_evolution['k_rate'],
               mode='lines', name='K', line=dict(color='orange')),
    row=3, col=1
)
fig_multi.update_xaxes(title_text="Year", row=3, col=1)
fig_multi.update_layout(
height=900,
title_text="Evolution of MLB Statistics Over Time",
showlegend=False
)
fig_multi.show()
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from pybaseball import lahman
# Load data
batting = lahman.batting()
people = lahman.people()
def calculate_ops(row):
    """Calculate OPS from batting statistics"""
    if row['AB'] == 0:
        return 0
    obp = (row['H'] + row['BB']) / (row['AB'] + row['BB']) if (row['AB'] + row['BB']) > 0 else 0
    slg = (row['H'] + row['2B'] + 2*row['3B'] + 3*row['HR']) / row['AB'] if row['AB'] > 0 else 0
    return obp + slg
def compare_players_interactive(player_names, min_year=1900, max_year=2023):
    """
    Create interactive comparison of players across eras
    Parameters:
    -----------
    player_names : list
        List of player names in format ["First Last", ...]
    min_year : int
        Starting year for comparison
    max_year : int
        Ending year for comparison
    """
    # Parse player names
    player_data = []
    for name in player_names:
        first, last = name.split()[0], ' '.join(name.split()[1:])
        player_data.append((first, last))
    # Get player IDs (names with no match are skipped)
    player_ids = []
    for first, last in player_data:
        matches = people[(people['nameFirst'] == first) &
                         (people['nameLast'] == last)]
        if len(matches) > 0:
            player_ids.append(matches.iloc[0]['playerID'])
    # Filter batting data
    player_stats = batting[
        (batting['playerID'].isin(player_ids)) &
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year) &
        (batting['AB'] >= 300)
    ].copy()
    # Merge with player names
    player_stats = player_stats.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID'
    )
    player_stats['player_name'] = (player_stats['nameFirst'] + ' ' +
                                   player_stats['nameLast'])
    # Calculate league averages by year
    league_avg = batting[
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year)
    ].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'BB': 'sum',
        '2B': 'sum',
        '3B': 'sum',
        'HR': 'sum'
    }).reset_index()
    league_avg['league_ops'] = league_avg.apply(calculate_ops, axis=1)
    # Calculate player OPS and OPS+
    player_stats['player_ops'] = player_stats.apply(calculate_ops, axis=1)
    player_stats = player_stats.merge(
        league_avg[['yearID', 'league_ops']],
        on='yearID'
    )
    # Simplified OPS+: the ratio of player OPS to league OPS, with no park
    # adjustment and no separate scaling of OBP and SLG
    player_stats['ops_plus'] = (player_stats['player_ops'] /
                                player_stats['league_ops']) * 100
    player_stats['ba'] = player_stats['H'] / player_stats['AB']
    # Create interactive plot
    fig = px.line(player_stats,
                  x='yearID',
                  y='ops_plus',
                  color='player_name',
                  markers=True,
                  title='Era-Adjusted Performance Comparison',
                  labels={'yearID': 'Year',
                          'ops_plus': 'OPS+ (100 = League Average)',
                          'player_name': 'Player'})
    fig.add_hline(y=100, line_dash="dash", line_color="gray",
                  annotation_text="League Average")
    fig.update_traces(
        hovertemplate='<b>%{fullData.name}</b><br>' +
                      'Year: %{x}<br>' +
                      'OPS+: %{y:.1f}<br>' +
                      '<extra></extra>'
    )
    fig.update_layout(
        hovermode='x unified',
        xaxis_title='Year',
        yaxis_title='OPS+ (100 = League Average)',
        legend_title='Player',
        height=600
    )
    return fig
# Example: Compare legendary players across eras
fig = compare_players_interactive(
["Babe Ruth", "Ted Williams", "Barry Bonds"],
min_year=1914,
max_year=2007
)
fig.show()
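One caveat about the `AB >= 300` filter in the comparison function: the Lahman batting table keeps one row per player, season, and stint, so a player traded mid-season is split across rows and can miss the threshold on each row separately. The sketch below shows one way to collapse stints before applying a playing-time cutoff. The helper name `combine_stints`, the player IDs, and the toy rows are illustrative assumptions, not part of the chapter's code.

```python
import pandas as pd

# Counting stats that can be summed safely across stints
COUNTING_STATS = ['AB', 'H', '2B', '3B', 'HR', 'BB']

def combine_stints(batting: pd.DataFrame) -> pd.DataFrame:
    """Sum counting stats across stints so each player-season is one row."""
    return (batting
            .groupby(['playerID', 'yearID'], as_index=False)[COUNTING_STATS]
            .sum())

# Toy data: 'tradedpl01' splits 2001 across two stints (hypothetical IDs)
toy = pd.DataFrame({
    'playerID': ['tradedpl01', 'tradedpl01', 'regular01'],
    'yearID':   [2001, 2001, 2001],
    'stint':    [1, 2, 1],
    'AB':       [250, 200, 550],
    'H':        [70, 60, 160],
    '2B':       [12, 10, 30],
    '3B':       [1, 1, 3],
    'HR':       [8, 9, 25],
    'BB':       [20, 22, 60],
})

seasons = combine_stints(toy)
qualified = seasons[seasons['AB'] >= 300]  # threshold applied per season, not per stint
print(qualified[['playerID', 'yearID', 'AB', 'HR']])
```

With real data, the same helper could be applied to `batting` before the filtering step, so that rate statistics are computed from full-season totals.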
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman
# Load and prepare data
batting = lahman.batting()
historical_trends = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'SO': 'sum',
'BB': 'sum',
'SB': 'sum'
}).reset_index()
# Calculate rates
historical_trends['batting_avg'] = historical_trends['H'] / historical_trends['AB']
historical_trends['hr_rate'] = (historical_trends['HR'] / historical_trends['AB']) * 100
historical_trends['k_rate'] = historical_trends['SO'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['bb_rate'] = historical_trends['BB'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['iso'] = (historical_trends['HR'] * 3) / historical_trends['AB'] # HR-only portion of ISO (omits doubles and triples)
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['batting_avg'],
name='Batting Average',
line=dict(color='blue', width=2)
),
secondary_y=False
)
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['hr_rate'],
name='HR Rate (per 100 AB)',
line=dict(color='red', width=2)
),
secondary_y=True
)
fig.add_trace(
go.Scatter(
x=historical_trends['yearID'],
y=historical_trends['k_rate'],
name='K Rate',
line=dict(color='orange', width=2)
),
secondary_y=False
)
# Add era markers
eras = [
(1920, 'Live Ball Era'),
(1947, 'Integration'),
(1961, 'Expansion'),
(1993, 'Steroid Era'),
(2006, 'Modern Era')
]
for year, label in eras:
    fig.add_vline(
        x=year,
        line_dash="dash",
        line_color="gray",
        annotation_text=label,
        annotation_position="top"
    )
# Update layout with range slider
# (range selector buttons are left out: they step through date axes, and yearID is numeric)
fig.update_xaxes(
    title_text="Year",
    rangeslider_visible=True
)
fig.update_yaxes(title_text="Batting Average / K Rate", secondary_y=False)
fig.update_yaxes(title_text="HR Rate (per 100 AB)", secondary_y=True)
fig.update_layout(
title_text="Historical Trends in MLB Statistics (1900-2023)",
hovermode='x unified',
height=600,
legend=dict(x=0.01, y=0.99)
)
fig.show()
Exercise 1: Calculate Era-Adjusted Statistics
Calculate OPS+ and ERA+ for the following player seasons and compare them:
Hitters:
- Rogers Hornsby, 1924 (.424 AVG, .507 OBP, .696 SLG)
- Tony Gwynn, 1994 (.394 AVG, .454 OBP, .568 SLG)
- Ichiro Suzuki, 2004 (.372 AVG, .414 OBP, .455 SLG)
Pitchers:
- Walter Johnson, 1913 (1.14 ERA)
- Dwight Gooden, 1985 (1.53 ERA)
- Jacob deGrom, 2018 (1.70 ERA)
# Solution for Exercise 1
library(Lahman)
library(dplyr)
# Hitter OPS+ calculations
# Note: You'll need to look up league averages for each year
# 1924 NL Averages: OBP ~.330, SLG ~.395
hornsby_ops_plus <- 100 * ((0.507 / 0.330) + (0.696 / 0.395) - 1)
print(paste("Hornsby 1924 OPS+:", round(hornsby_ops_plus, 0)))
# 1994 NL Averages: OBP ~.330, SLG ~.415
gwynn_ops_plus <- 100 * ((0.454 / 0.330) + (0.568 / 0.415) - 1)
print(paste("Gwynn 1994 OPS+:", round(gwynn_ops_plus, 0)))
# 2004 AL Averages: OBP ~.333, SLG ~.423
ichiro_ops_plus <- 100 * ((0.414 / 0.333) + (0.455 / 0.423) - 1)
print(paste("Ichiro 2004 OPS+:", round(ichiro_ops_plus, 0)))
# Pitcher ERA+ calculations
# 1913 AL ERA: ~3.00
johnson_era_plus <- 100 * (3.00 / 1.14)
print(paste("Walter Johnson 1913 ERA+:", round(johnson_era_plus, 0)))
# 1985 NL ERA: ~3.58
gooden_era_plus <- 100 * (3.58 / 1.53)
print(paste("Dwight Gooden 1985 ERA+:", round(gooden_era_plus, 0)))
# 2018 NL ERA: ~4.04
degrom_era_plus <- 100 * (4.04 / 1.70)
print(paste("Jacob deGrom 2018 ERA+:", round(degrom_era_plus, 0)))
# Solution for Exercise 1
# Hitter OPS+ calculations
# 1924 NL Averages: OBP ~.330, SLG ~.395
hornsby_ops_plus = 100 * ((0.507 / 0.330) + (0.696 / 0.395) - 1)
print(f"Hornsby 1924 OPS+: {round(hornsby_ops_plus, 0)}")
# 1994 NL Averages: OBP ~.330, SLG ~.415
gwynn_ops_plus = 100 * ((0.454 / 0.330) + (0.568 / 0.415) - 1)
print(f"Gwynn 1994 OPS+: {round(gwynn_ops_plus, 0)}")
# 2004 AL Averages: OBP ~.333, SLG ~.423
ichiro_ops_plus = 100 * ((0.414 / 0.333) + (0.455 / 0.423) - 1)
print(f"Ichiro 2004 OPS+: {round(ichiro_ops_plus, 0)}")
# Pitcher ERA+ calculations
# 1913 AL ERA: ~3.00
johnson_era_plus = 100 * (3.00 / 1.14)
print(f"Walter Johnson 1913 ERA+: {round(johnson_era_plus, 0)}")
# 1985 NL ERA: ~3.58
gooden_era_plus = 100 * (3.58 / 1.53)
print(f"Dwight Gooden 1985 ERA+: {round(gooden_era_plus, 0)}")
# 2018 NL ERA: ~4.04
degrom_era_plus = 100 * (4.04 / 1.70)
print(f"Jacob deGrom 2018 ERA+: {round(degrom_era_plus, 0)}")
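The repeated arithmetic in the Exercise 1 solutions can be wrapped in two small helpers. This is a minimal sketch of the unadjusted formulas used above, OPS+ = 100 × (OBP/lgOBP + SLG/lgSLG − 1) and ERA+ = 100 × lgERA/ERA; it applies no park adjustment, and the league averages are the same approximate values quoted in the solution comments. The function names `ops_plus` and `era_plus` are illustrative.

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """Unadjusted OPS+: 100 * (OBP/lgOBP + SLG/lgSLG - 1)."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

def era_plus(era, lg_era):
    """Unadjusted ERA+: 100 * lgERA / ERA, so higher is better."""
    return 100 * lg_era / era

# (OBP, SLG, approximate league OBP, approximate league SLG)
hitters = {
    "Hornsby 1924": (0.507, 0.696, 0.330, 0.395),
    "Gwynn 1994":   (0.454, 0.568, 0.330, 0.415),
    "Ichiro 2004":  (0.414, 0.455, 0.333, 0.423),
}
# (ERA, approximate league ERA)
pitchers = {
    "Johnson 1913": (1.14, 3.00),
    "Gooden 1985":  (1.53, 3.58),
    "deGrom 2018":  (1.70, 4.04),
}

for name, args in hitters.items():
    print(f"{name} OPS+: {ops_plus(*args):.0f}")
for name, args in pitchers.items():
    print(f"{name} ERA+: {era_plus(*args):.0f}")
```

Because the inputs are rounded league averages without park factors, the results will differ somewhat from the OPS+ and ERA+ values published on reference sites.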
Exercise 2: Decade-by-Decade Trend Analysis
Using the Lahman database, create visualizations showing how the following statistics have changed by decade since 1900:
- Stolen base rate (SB per game)
- Complete game percentage
- Batting average on balls in play (BABIP)
- Walk rate (BB per PA)
# Solution for Exercise 2
library(Lahman)
library(dplyr)
library(ggplot2)
library(gridExtra)
# Pitching totals: complete games and games started.
# Every team game has exactly one starting pitcher per team,
# so summed GS also counts team games.
pitching_trends <- Pitching %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_GS = sum(GS, na.rm = TRUE),
    Total_CG = sum(CG, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(CG_Pct = Total_CG / Total_GS)
# Calculate decade statistics
decade_trends <- Batting %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_SB = sum(SB, na.rm = TRUE),
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  left_join(select(pitching_trends, decade, Total_GS), by = "decade") %>%
  mutate(
    # Summing G in Batting counts player appearances, not games,
    # so team games (Total_GS) are used as the denominator
    SB_per_Game = Total_SB / Total_GS,
    BIP = Total_AB - Total_SO - Total_HR,
    BABIP = (Total_H - Total_HR) / BIP,
    PA = Total_AB + Total_BB,
    BB_Rate = Total_BB / PA
  )
# Create plots
p1 <- ggplot(decade_trends, aes(x = decade, y = SB_per_Game)) +
geom_line(size = 1.5, color = "blue") +
geom_point(size = 3) +
labs(title = "Stolen Bases per Game", y = "SB/Game") +
theme_minimal()
p2 <- ggplot(pitching_trends, aes(x = decade, y = CG_Pct * 100)) +
geom_line(size = 1.5, color = "red") +
geom_point(size = 3) +
labs(title = "Complete Game Percentage", y = "CG %") +
theme_minimal()
p3 <- ggplot(decade_trends, aes(x = decade, y = BABIP)) +
geom_line(size = 1.5, color = "green") +
geom_point(size = 3) +
labs(title = "BABIP by Decade", y = "BABIP") +
theme_minimal()
p4 <- ggplot(decade_trends, aes(x = decade, y = BB_Rate * 100)) +
geom_line(size = 1.5, color = "purple") +
geom_point(size = 3) +
labs(title = "Walk Rate", y = "BB%") +
theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol = 2)
# Solution for Exercise 2
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
# Calculate decade statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10
decade_trends = batting_modern.groupby('decade').agg({
    'SB': 'sum',
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'BB': 'sum',
    'SO': 'sum'
}).reset_index()
# Pitching complete games
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10
pitching_trends = pitching_modern.groupby('decade').agg({
    'GS': 'sum',
    'CG': 'sum'
}).reset_index()
pitching_trends['CG_Pct'] = pitching_trends['CG'] / pitching_trends['GS']
# Calculate rate stats
# Summing batters' G counts player appearances, not games. Every team game has
# exactly one starting pitcher per team, so summed GS gives team games instead.
decade_trends = decade_trends.merge(pitching_trends[['decade', 'GS']], on='decade', how='left')
decade_trends['SB_per_Game'] = decade_trends['SB'] / decade_trends['GS']
decade_trends['BIP'] = decade_trends['AB'] - decade_trends['SO'] - decade_trends['HR']
decade_trends['BABIP'] = (decade_trends['H'] - decade_trends['HR']) / decade_trends['BIP']
decade_trends['PA'] = decade_trends['AB'] + decade_trends['BB']
decade_trends['BB_Rate'] = decade_trends['BB'] / decade_trends['PA']
# Create plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
ax1.plot(decade_trends['decade'], decade_trends['SB_per_Game'],
marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('Stolen Bases per Game', fontsize=12, fontweight='bold')
ax1.set_ylabel('SB/Game')
ax1.grid(True, alpha=0.3)
ax2.plot(pitching_trends['decade'], pitching_trends['CG_Pct'] * 100,
marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('Complete Game Percentage', fontsize=12, fontweight='bold')
ax2.set_ylabel('CG %')
ax2.grid(True, alpha=0.3)
ax3.plot(decade_trends['decade'], decade_trends['BABIP'],
marker='o', linewidth=2, markersize=8, color='green')
ax3.set_title('BABIP by Decade', fontsize=12, fontweight='bold')
ax3.set_ylabel('BABIP')
ax3.grid(True, alpha=0.3)
ax4.plot(decade_trends['decade'], decade_trends['BB_Rate'] * 100,
marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('Walk Rate', fontsize=12, fontweight='bold')
ax4.set_ylabel('BB%')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('decade_trends_detailed.png', dpi=300, bbox_inches='tight')
plt.show()
Exercise 3: Cross-Era Player Comparison
Compare the following three shortstops from different eras using era-adjusted statistics:
- Honus Wagner (career: 1897-1917)
- Cal Ripken Jr. (career: 1981-2001)
- Derek Jeter (career: 1995-2014)
Calculate and compare:
- Career batting average relative to league average
- Career OPS relative to league average
- Best single-season OPS+
- Career home runs relative to position average
# Solution for Exercise 3
library(Lahman)
library(dplyr)
compare_shortstops <- function() {
# Get player IDs
wagner_id <- People %>%
filter(nameLast == "Wagner", nameFirst == "Honus") %>%
pull(playerID)
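# The name "Cal Ripken" can match more than one person (father and son),
# so the birth year is used to select Cal Ripken Jr.
ripken_id <- People %>%
filter(nameLast == "Ripken", nameFirst == "Cal", birthYear == 1960) %>%
pull(playerID)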
jeter_id <- People %>%
filter(nameLast == "Jeter", nameFirst == "Derek") %>%
pull(playerID)
# Get career stats
get_career <- function(pid, name) {
career <- Batting %>%
filter(playerID == pid) %>%
summarise(
Name = name,
Years = paste(min(yearID), max(yearID), sep = "-"),
AB = sum(AB, na.rm = TRUE),
H = sum(H, na.rm = TRUE),
HR = sum(HR, na.rm = TRUE),
BB = sum(BB, na.rm = TRUE),
X2B = sum(X2B, na.rm = TRUE),
X3B = sum(X3B, na.rm = TRUE)
) %>%
mutate(
AVG = round(H / AB, 3),
TB = H + X2B + 2*X3B + 3*HR,
SLG = round(TB / AB, 3),
OBP = round((H + BB) / (AB + BB), 3),
OPS = round(OBP + SLG, 3)
)
return(career)
}
wagner <- get_career(wagner_id, "Honus Wagner")
ripken <- get_career(ripken_id, "Cal Ripken Jr.")
jeter <- get_career(jeter_id, "Derek Jeter")
comparison <- bind_rows(wagner, ripken, jeter)
return(comparison)
}
shortstop_comparison <- compare_shortstops()
print(shortstop_comparison[, c("Name", "Years", "AVG", "OBP", "SLG", "OPS", "HR")])
# Solution for Exercise 3
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def compare_shortstops():
    """
    Compare three legendary shortstops
    """
    # Get player IDs
    wagner = people[
        (people['nameLast'] == 'Wagner') &
        (people['nameFirst'] == 'Honus')
    ]
    wagner_id = wagner['playerID'].values[0] if len(wagner) > 0 else None
    # The name "Cal Ripken" can match more than one person (father and son),
    # so the birth year is used to select Cal Ripken Jr.
    ripken = people[
        (people['nameLast'] == 'Ripken') &
        (people['nameFirst'] == 'Cal') &
        (people['birthYear'] == 1960)
    ]
    ripken_id = ripken['playerID'].values[0] if len(ripken) > 0 else None
    jeter = people[
        (people['nameLast'] == 'Jeter') &
        (people['nameFirst'] == 'Derek')
    ]
    jeter_id = jeter['playerID'].values[0] if len(jeter) > 0 else None

    def get_career(pid, name):
        career = batting[batting['playerID'] == pid]
        stats = {
            'Name': name,
            'Years': f"{career['yearID'].min()}-{career['yearID'].max()}",
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            'HR': career['HR'].sum(),
            'BB': career['BB'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum()
        }
        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        stats['TB'] = stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']
        stats['SLG'] = round(stats['TB'] / stats['AB'], 3)
        stats['OBP'] = round((stats['H'] + stats['BB']) / (stats['AB'] + stats['BB']), 3)
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)
        return stats

    comparison = pd.DataFrame([
        get_career(wagner_id, "Honus Wagner"),
        get_career(ripken_id, "Cal Ripken Jr."),
        get_career(jeter_id, "Derek Jeter")
    ])
    return comparison
shortstop_comparison = compare_shortstops()
print(shortstop_comparison[['Name', 'Years', 'AVG', 'OBP', 'SLG', 'OPS', 'HR']])
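The solutions above report raw career totals, while the exercise asks for statistics relative to league average. One possible way to take that extra step is sketched below: compute league OPS for each season, take the player's OPS as a ratio to it, and average those ratios weighted by the player's at-bats. The function names, the toy table, and the AB weighting are assumptions made for illustration; OBP uses the same simplified formula as the solution (no HBP or sacrifice flies), and no park adjustment is applied.

```python
import pandas as pd

COLS = ['AB', 'H', '2B', '3B', 'HR', 'BB']

def add_ops(df):
    """Add simplified OBP, SLG, and OPS columns (no HBP or SF)."""
    out = df.copy()
    total_bases = out['H'] + out['2B'] + 2 * out['3B'] + 3 * out['HR']
    out['OBP'] = (out['H'] + out['BB']) / (out['AB'] + out['BB'])
    out['SLG'] = total_bases / out['AB']
    out['OPS'] = out['OBP'] + out['SLG']
    return out

def career_relative_ops(batting, player_id):
    """AB-weighted career average of 100 * (player OPS / league OPS) by season."""
    league = add_ops(batting.groupby('yearID', as_index=False)[COLS].sum())
    player = add_ops(batting[batting['playerID'] == player_id]
                     .groupby('yearID', as_index=False)[COLS].sum())
    merged = player.merge(league[['yearID', 'OPS']], on='yearID',
                          suffixes=('', '_lg'))
    ratio = merged['OPS'] / merged['OPS_lg']
    return 100 * (ratio * merged['AB']).sum() / merged['AB'].sum()

# Toy league with two hypothetical hitters over two seasons
toy = pd.DataFrame({
    'playerID': ['a', 'b', 'a', 'b'],
    'yearID':   [1950, 1950, 1951, 1951],
    'AB':       [500, 500, 520, 480],
    'H':        [150, 125, 160, 120],
    '2B':       [30, 20, 32, 18],
    '3B':       [5, 2, 4, 2],
    'HR':       [20, 10, 22, 8],
    'BB':       [50, 40, 55, 35],
})

for pid in ['a', 'b']:
    print(pid, round(career_relative_ops(toy, pid), 1))
```

With the Lahman data loaded as in the solution, a call such as `career_relative_ops(batting, wagner_id)` would give a comparable figure for each shortstop, where values above 100 indicate better-than-league production.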
Exercise 4: Steroid Era Impact Analysis
Analyze the impact of the steroid era on career milestones:
- Count how many players reached 500 career home runs in different eras:
  - Pre-steroid (before 1994)
  - Steroid era (1994-2007)
  - Post-steroid (after 2007)
- Calculate the average age at which players hit their career peak (most HR in a season) for each era
- Identify players whose late-career performance (ages 35+) was anomalously good compared to early career
# Solution for Exercise 4
library(Lahman)
library(dplyr)
# Part 1: 500 HR club by era
hr_500_club <- Batting %>%
group_by(playerID) %>%
summarise(
Career_HR = sum(HR, na.rm = TRUE),
Last_Year = max(yearID),
.groups = 'drop'
) %>%
filter(Career_HR >= 500) %>%
mutate(
Era = case_when(
Last_Year < 1994 ~ "Pre-Steroid",
Last_Year >= 1994 & Last_Year <= 2007 ~ "Steroid",
Last_Year > 2007 ~ "Post-Steroid"
)
)
# Join with names
hr_500_with_names <- hr_500_club %>%
left_join(People, by = "playerID") %>%
mutate(Name = paste(nameFirst, nameLast)) %>%
select(Name, Career_HR, Last_Year, Era) %>%
arrange(desc(Career_HR))
print(hr_500_with_names)
# Count by era
era_counts <- hr_500_with_names %>%
group_by(Era) %>%
summarise(Count = n(), .groups = 'drop')
print(era_counts)
# Part 2: Peak age by era
peak_age_analysis <- Batting %>%
left_join(People, by = "playerID") %>%
mutate(
Age = yearID - birthYear,
Era = case_when(
yearID < 1994 ~ "Pre-Steroid",
yearID >= 1994 & yearID <= 2007 ~ "Steroid",
yearID > 2007 ~ "Post-Steroid"
)
) %>%
filter(!is.na(Age), Age >= 20, Age <= 45) %>%
group_by(playerID) %>%
filter(HR == max(HR)) %>%
ungroup() %>%
group_by(Era) %>%
summarise(
Avg_Peak_Age = mean(Age, na.rm = TRUE),
.groups = 'drop'
)
print(peak_age_analysis)
# Solution for Exercise 4
import pybaseball as pyb
import pandas as pd
import numpy as np
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
# Part 1: 500 HR club by era
career_hr = batting.groupby('playerID').agg({
'HR': 'sum',
'yearID': 'max'
}).reset_index()
career_hr.columns = ['playerID', 'Career_HR', 'Last_Year']
hr_500_club = career_hr[career_hr['Career_HR'] >= 500].copy()
# Define era
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid"
    else:
        return "Post-Steroid"
hr_500_club['Era'] = hr_500_club['Last_Year'].apply(classify_era)
# Join with names
hr_500_with_names = hr_500_club.merge(
people[['playerID', 'nameFirst', 'nameLast']],
on='playerID',
how='left'
)
hr_500_with_names['Name'] = (
hr_500_with_names['nameFirst'] + ' ' + hr_500_with_names['nameLast']
)
hr_500_with_names = hr_500_with_names[['Name', 'Career_HR', 'Last_Year', 'Era']]
hr_500_with_names = hr_500_with_names.sort_values('Career_HR', ascending=False)
print(hr_500_with_names)
# Count by era
era_counts = hr_500_with_names.groupby('Era').size().reset_index()
era_counts.columns = ['Era', 'Count']
print(era_counts)
# Part 2: Peak age by era
batting_with_age = batting.merge(
people[['playerID', 'birthYear']],
on='playerID',
how='left'
)
batting_with_age['Age'] = batting_with_age['yearID'] - batting_with_age['birthYear']
batting_with_age['Era'] = batting_with_age['yearID'].apply(classify_era)
# Filter reasonable ages
batting_with_age = batting_with_age[
(batting_with_age['Age'] >= 20) &
(batting_with_age['Age'] <= 45)
]
# Find peak HR season for each player
peak_seasons = batting_with_age.loc[
batting_with_age.groupby('playerID')['HR'].idxmax()
]
# Average peak age by era
peak_age_by_era = peak_seasons.groupby('Era')['Age'].mean().reset_index()
peak_age_by_era.columns = ['Era', 'Avg_Peak_Age']
print(peak_age_by_era)
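The third part of the exercise (late-career performance at ages 35+ compared with earlier seasons) is not covered by the solution above. A minimal sketch of one approach follows: compare each player's home run rate per at-bat from age 35 onward with the rate before 35, keeping only players with enough at-bats in both phases. The function name `late_career_surge`, the 1,000 AB cutoff, and the toy rows are assumptions for illustration, and a high ratio is only a descriptive flag for further review.

```python
import pandas as pd

def late_career_surge(df, min_ab=1000, split_age=35):
    """Ratio of HR per AB at ages >= split_age to HR per AB before it."""
    late = df[df['Age'] >= split_age].groupby('playerID')[['HR', 'AB']].sum()
    early = df[df['Age'] < split_age].groupby('playerID')[['HR', 'AB']].sum()
    # Inner join keeps only players with seasons in both phases
    both = early.join(late, lsuffix='_early', rsuffix='_late', how='inner')
    both = both[(both['AB_early'] >= min_ab) & (both['AB_late'] >= min_ab)].copy()
    both['hr_rate_early'] = both['HR_early'] / both['AB_early']
    both['hr_rate_late'] = both['HR_late'] / both['AB_late']
    both['late_to_early'] = both['hr_rate_late'] / both['hr_rate_early']
    return both.sort_values('late_to_early', ascending=False)

# Toy data: hypothetical player 'x' improves with age, 'y' declines
toy = pd.DataFrame({
    'playerID': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
    'Age':      [28, 30, 36, 37, 27, 29, 35, 36],
    'HR':       [20, 25, 40, 45, 30, 32, 15, 12],
    'AB':       [500, 550, 500, 520, 550, 560, 500, 510],
})

result = late_career_surge(toy)
print(result[['hr_rate_early', 'hr_rate_late', 'late_to_early']])
```

With real data, the `batting_with_age` table built in the solution has the columns this function expects (`playerID`, `Age`, `HR`, `AB`).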
Summary
Historical analysis and cross-era comparison represent some of the most intellectually challenging—and rewarding—aspects of baseball analytics. By understanding the contexts in which players competed, applying appropriate adjustments, and using sophisticated analytical frameworks, we can make meaningful comparisons across the decades.
Key takeaways from this chapter:
- Context is Everything: Raw statistics are meaningless without understanding the era in which they were accumulated
- Era-Adjusted Statistics: Tools like OPS+, ERA+, and wRC+ allow apples-to-apples comparisons across different offensive environments
- The Lahman Database: This comprehensive historical database enables sophisticated analysis of baseball's entire history
- Historical Trends: Baseball has evolved continuously, with clear patterns in offense, defense, and pitching across decades
- Cross-Era Comparison: With appropriate methods, we can meaningfully compare players from different eras while respecting their unique contexts
- The Steroid Era: This period presents special challenges but can be handled using standard era-adjustment techniques
- Peak vs. Career: Different analytical frameworks favor different types of players; both perspectives have merit
As you continue your work in baseball analytics, always remember that numbers tell stories about human achievement in specific historical moments. Our job is to understand those achievements in context while still allowing meaningful comparison across time. The greatest players of the dead ball era, the integration era, the expansion era, and today's game all deserve to be evaluated fairly within their own contexts while still being measured against each other using sophisticated analytical tools.
The methods you've learned in this chapter will enable you to participate in the great debates of baseball history with statistical rigor and historical awareness—combining the best of both traditional and modern approaches to understanding this timeless game.
# Solution for Exercise 2
import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pyb.cache.enable()
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
# Calculate decade statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10
decade_trends = batting_modern.groupby('decade').agg({
'SB': 'sum',
'G': 'sum',
'AB': 'sum',
'H': 'sum',
'HR': 'sum',
'BB': 'sum',
'SO': 'sum'
}).reset_index()
# Calculate rate stats
decade_trends['SB_per_Game'] = decade_trends['SB'] / (decade_trends['G'] / 2)
decade_trends['BIP'] = decade_trends['AB'] - decade_trends['SO'] - decade_trends['HR']
decade_trends['BABIP'] = (decade_trends['H'] - decade_trends['HR']) / decade_trends['BIP']
decade_trends['PA'] = decade_trends['AB'] + decade_trends['BB']
decade_trends['BB_Rate'] = decade_trends['BB'] / decade_trends['PA']
# Pitching complete games
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10
pitching_trends = pitching_modern.groupby('decade').agg({
'GS': 'sum',
'CG': 'sum'
}).reset_index()
pitching_trends['CG_Pct'] = pitching_trends['CG'] / pitching_trends['GS']
# Create plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))
ax1.plot(decade_trends['decade'], decade_trends['SB_per_Game'],
marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('Stolen Bases per Game', fontsize=12, fontweight='bold')
ax1.set_ylabel('SB/Game')
ax1.grid(True, alpha=0.3)
ax2.plot(pitching_trends['decade'], pitching_trends['CG_Pct'] * 100,
marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('Complete Game Percentage', fontsize=12, fontweight='bold')
ax2.set_ylabel('CG %')
ax2.grid(True, alpha=0.3)
ax3.plot(decade_trends['decade'], decade_trends['BABIP'],
marker='o', linewidth=2, markersize=8, color='green')
ax3.set_title('BABIP by Decade', fontsize=12, fontweight='bold')
ax3.set_ylabel('BABIP')
ax3.grid(True, alpha=0.3)
ax4.plot(decade_trends['decade'], decade_trends['BB_Rate'] * 100,
marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('Walk Rate', fontsize=12, fontweight='bold')
ax4.set_ylabel('BB%')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('decade_trends_detailed.png', dpi=300, bbox_inches='tight')
plt.show()
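Decade-level swings like the ones plotted above are the reason a raw rate needs a league baseline before it can be compared across eras. As a minimal sketch, and an assumption rather than a method this chapter prescribes, the hypothetical helper `relative_index` below expresses a player's rate as a percentage of the league rate, using the Terry and Arraez batting averages quoted earlier.

```python
def relative_index(player_rate, league_rate):
    """Hypothetical helper: player rate as a percentage of the league rate.

    A value of 100 means exactly league average.
    """
    if league_rate <= 0:
        raise ValueError("league_rate must be positive")
    return 100.0 * player_rate / league_rate


# Batting averages and league averages quoted earlier in the chapter
terry_1930 = relative_index(0.401, 0.303)
arraez_2023 = relative_index(0.354, 0.248)

print(f"Terry 1930:  {terry_1930:.0f}")   # 132
print(f"Arraez 2023: {arraez_2023:.0f}")  # 143
```

On this scale Arraez's season rates higher relative to his league, even though Terry's raw average is 47 points higher.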
# Solution for Exercise 3
import pybaseball as pyb
import pandas as pd
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
def compare_shortstops():
    """
    Compare three legendary shortstops
    """
    # Get player IDs
    # Name lookups can match more than one person; .values[0] takes the first row
    wagner = people[
        (people['nameLast'] == 'Wagner') &
        (people['nameFirst'] == 'Honus')
    ]
    wagner_id = wagner['playerID'].values[0] if len(wagner) > 0 else None
    ripken = people[
        (people['nameLast'] == 'Ripken') &
        (people['nameFirst'] == 'Cal')
    ]
    ripken_id = ripken['playerID'].values[0] if len(ripken) > 0 else None
    jeter = people[
        (people['nameLast'] == 'Jeter') &
        (people['nameFirst'] == 'Derek')
    ]
    jeter_id = jeter['playerID'].values[0] if len(jeter) > 0 else None

    def get_career(pid, name):
        # Sum counting stats over every season (and stint) for this player
        career = batting[batting['playerID'] == pid]
        stats = {
            'Name': name,
            'Years': f"{career['yearID'].min()}-{career['yearID'].max()}",
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            'HR': career['HR'].sum(),
            'BB': career['BB'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum()
        }
        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        # Total bases: every hit counts once, plus the extra bases for 2B, 3B, HR
        stats['TB'] = stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']
        stats['SLG'] = round(stats['TB'] / stats['AB'], 3)
        # Simplified OBP: hit-by-pitch and sacrifice flies are omitted
        stats['OBP'] = round((stats['H'] + stats['BB']) / (stats['AB'] + stats['BB']), 3)
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)
        return stats

    comparison = pd.DataFrame([
        get_career(wagner_id, "Honus Wagner"),
        get_career(ripken_id, "Cal Ripken Jr."),
        get_career(jeter_id, "Derek Jeter")
    ])
    return comparison
shortstop_comparison = compare_shortstops()
print(shortstop_comparison[['Name', 'Years', 'AVG', 'OBP', 'SLG', 'OPS', 'HR']])
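The OBP in the solution above is a simplification: it counts only hits and walks. The standard formula also includes hit-by-pitch (HBP) in the numerator and HBP plus sacrifice flies (SF) in the denominator. The sketch below compares the two on a made-up stat line; the function names and the numbers are hypothetical, and treating a missing HBP or SF (passed as None) as zero is an assumption, not something the exercise specifies.

```python
def simple_obp(h, bb, ab):
    """Simplified OBP as used in the exercise solution: (H + BB) / (AB + BB)."""
    return (h + bb) / (ab + bb)


def full_obp(h, bb, ab, hbp=None, sf=None):
    """Standard OBP: (H + BB + HBP) / (AB + BB + HBP + SF).

    Assumption: a missing HBP or SF (passed as None) is treated as zero.
    """
    hbp = hbp or 0
    sf = sf or 0
    return (h + bb + hbp) / (ab + bb + hbp + sf)


# Hypothetical stat line, for illustration only
h, bb, ab, hbp, sf = 150, 50, 500, 10, 5

print(f"Simplified OBP: {simple_obp(h, bb, ab):.3f}")
print(f"Full OBP:       {full_obp(h, bb, ab, hbp, sf):.3f}")
```

When HBP and SF are both unavailable, `full_obp` reduces to the simplified version, so the two agree for seasons where those columns were not recorded.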
# Solution for Exercise 4
import pybaseball as pyb
import pandas as pd
import numpy as np
pyb.cache.enable()
batting = pyb.lahman.batting()
people = pyb.lahman.people()
# Part 1: 500 HR club by era
career_hr = batting.groupby('playerID').agg({
    'HR': 'sum',
    'yearID': 'max'
}).reset_index()
career_hr.columns = ['playerID', 'Career_HR', 'Last_Year']
hr_500_club = career_hr[career_hr['Career_HR'] >= 500].copy()
# Define era
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid"
    else:
        return "Post-Steroid"
hr_500_club['Era'] = hr_500_club['Last_Year'].apply(classify_era)
# Join with names
hr_500_with_names = hr_500_club.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)
hr_500_with_names['Name'] = (
    hr_500_with_names['nameFirst'] + ' ' + hr_500_with_names['nameLast']
)
hr_500_with_names = hr_500_with_names[['Name', 'Career_HR', 'Last_Year', 'Era']]
hr_500_with_names = hr_500_with_names.sort_values('Career_HR', ascending=False)
print(hr_500_with_names)
# Count by era
era_counts = hr_500_with_names.groupby('Era').size().reset_index()
era_counts.columns = ['Era', 'Count']
print(era_counts)
# Part 2: Peak age by era
# Combine stints first so a season split across teams counts as one row
season_hr = batting.groupby(['playerID', 'yearID'], as_index=False)['HR'].sum()
batting_with_age = season_hr.merge(
    people[['playerID', 'birthYear']],
    on='playerID',
    how='left'
)
batting_with_age['Age'] = batting_with_age['yearID'] - batting_with_age['birthYear']
batting_with_age['Era'] = batting_with_age['yearID'].apply(classify_era)
# Filter reasonable ages (rows with a missing birth year drop out here too)
batting_with_age = batting_with_age[
    (batting_with_age['Age'] >= 20) &
    (batting_with_age['Age'] <= 45)
]
# Find peak HR season for each player (ties go to the earliest season)
peak_seasons = batting_with_age.loc[
    batting_with_age.groupby('playerID')['HR'].idxmax()
]
# Average peak age by era, where the era is that of the peak season
peak_age_by_era = peak_seasons.groupby('Era')['Age'].mean().reset_index()
peak_age_by_era.columns = ['Era', 'Avg_Peak_Age']
print(peak_age_by_era)