Chapter 13: Historical Analysis & Era Comparison

13.1 The Challenge of Cross-Era Comparison

Why Raw Statistics Fail

Consider this comparison: In 1930, Bill Terry batted .401 for the New York Giants. In 2023, Luis Arraez led the National League with a .354 batting average. Does this mean Terry was the better hitter? Not necessarily. The 1930 National League averaged .303 as a whole, while the 2023 NL averaged .248. Terry was 98 points above his league average; Arraez was 106 points above his.
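The arithmetic above can be sketched in a few lines. The helper name `relative_to_league` is an illustrative choice rather than a library function; it reports both the gap in points and the ratio to league average, using only the figures quoted in this section.

```python
def relative_to_league(player_avg, league_avg):
    """Return (points above league, ratio to league) for a batting average."""
    points_above = round((player_avg - league_avg) * 1000)
    ratio = player_avg / league_avg
    return points_above, ratio

# Figures quoted in the text: (player average, league average)
seasons = {
    "Bill Terry, 1930": (0.401, 0.303),
    "Luis Arraez, 2023": (0.354, 0.248),
}

results = {name: relative_to_league(p, lg) for name, (p, lg) in seasons.items()}
for name, (points, ratio) in results.items():
    print(f"{name}: {points} points above league, {ratio:.2f}x league average")
```

By either measure, Arraez stands further above his league than Terry did, even though his raw average is 47 points lower.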

Raw statistics are hard to interpret without context. A 3.00 ERA in 1968, the "Year of the Pitcher," represents very different performance than a 3.00 ERA in 2000, when league-wide offense was at historic highs. Similarly, hitting 50 home runs in 1927, when Babe Ruth hit 60 and no other player reached 50, was far more remarkable than hitting 50 in 1998, when multiple players exceeded that mark.

The fundamental challenge is that baseball has never been played in a static environment. Every aspect of the game—from the ball itself to the talent pool to the rules governing play—has evolved continuously over more than a century of professional competition.

Changes in Rules and Equipment

Baseball's rules have undergone numerous changes that dramatically affected statistical outcomes:

The Pitching Distance: In 1893, the pitching distance was extended from 50 feet to the current 60 feet, 6 inches. This single change increased offense significantly and rendered all pre-1893 pitching statistics incomparable to later eras.

The Foul Strike Rule: The National League adopted the foul strike rule in 1901, and the American League followed in 1903. Before this, foul balls didn't count as strikes. This change dramatically favored pitchers and increased strikeout rates.

The Mound Height: After the 1968 season, MLB lowered the pitcher's mound from 15 inches to 10 inches, contributing to increased offense in subsequent years.

The Designated Hitter: The American League's adoption of the DH in 1973 created two different playing environments that persist today, complicating cross-league comparisons.

The Strike Zone: The official strike zone has been modified several times, and its practical enforcement has varied considerably across eras, even when the rules remained constant.

Equipment changes have been equally significant:

The Baseball Itself: The ball has been modified numerous times, sometimes deliberately (as in 1920, when spitballs were banned and a livelier ball introduced) and sometimes inadvertently through manufacturing changes. The transition from dead ball to live ball around 1920 marks the single most important dividing line in baseball history.

Batting Gloves and Equipment: Modern players use batting gloves, lighter bats, and more protective equipment that may influence performance.

Ballpark Design: Stadium construction has evolved from asymmetric, quirky parks to more standardized modern facilities, affecting both offense and defense.

The Talent Pool Evolution

Perhaps the most complex factor in cross-era comparison is the changing talent pool. Several factors have dramatically affected the quality of competition:

Integration: Before Jackie Robinson broke the color barrier in 1947, Black players were excluded from the major leagues. This meant that pre-integration stars competed against a fractured talent pool that excluded some of the game's best players. Post-integration statistics represent competition against a deeper, more talented field.

International Expansion: The influx of Latin American players beginning in the 1950s and 1960s, followed by Asian players in the 1990s and beyond, has continually deepened the talent pool.

Population Growth: The U.S. population has grown from approximately 76 million in 1900 to over 330 million today. A larger population means more potential players, though this is partially offset by competition from other sports.

Expansion: MLB has expanded from 16 teams in 1960 to 30 teams today. More teams mean more major league jobs, potentially diluting talent (though the expanded talent pool from internationalization has more than compensated).

Specialization: Modern players are better trained, better conditioned, and more specialized than their historical counterparts. Relief pitchers throw harder for shorter stints; players lift weights and follow sophisticated nutrition programs. The average player today is almost certainly more athletic than the average player of 1930.

The Dead Ball Era (1900-1919)

The "dead ball era" refers to the period when offense was suppressed by multiple factors:

A softer, less resilient baseball that didn't travel as far when hit
Rules that favored pitchers, including allowing spitballs and other doctored pitches
Large, spacious ballparks with huge outfields
Tactical approaches that emphasized "small ball" over power hitting
Balls kept in play until they were misshapen and dirty, making them harder to see and hit

During this period, batting averages were relatively high (because fielding was poor and strikeouts rare), but home runs were extremely uncommon. In 1908, the entire American League hit only 278 home runs, fewer than some individual teams would hit in a single season in later eras.

Ty Cobb's .366 lifetime batting average, compiled mostly in the dead ball era, reflects both his extraordinary talent and the context of high batting averages. In 1911, Cobb batted .420, and that season the American League as a whole batted .273.

The Live Ball Era Transition (1920-1930)

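The game changed dramatically in 1920. The spitball ban was adopted before the season began, and the death of Ray Chapman, who was struck by a pitch that August, added pressure to keep clean, visible balls in play. Several changes took hold: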

Spitballs and other trick pitches were banned (with a grandfather clause for existing spitball pitchers)
Balls were replaced more frequently, keeping them white and visible
The ball itself was made livelier, with tighter winding

The effect was immediate. Home runs increased dramatically, and Babe Ruth became the game's first true power hitter, hitting 54 home runs in 1920—more than any entire team had hit the previous year.

The 1920s and especially 1930 saw offense reach absurd levels. The 1930 National League batted .303 as a whole, and Hack Wilson drove in 191 runs. The ball was subsequently deadened slightly, but offense remained higher than in the dead ball era.

Modern Era Variations

Even within the "modern era" (post-1920), offense has varied considerably:

The 1960s Pitcher Dominance: The strike zone was enlarged in 1963, and by 1968, pitchers dominated to historic levels. The American League batted just .230 that year, and Bob Gibson posted a 1.12 ERA.

The 1990s-2000s Offensive Explosion: Often called the "steroid era," this period saw offense reach historic highs, with home runs and other power numbers exploding. However, multiple factors contributed beyond performance-enhancing drugs, including smaller ballparks, possible ball changes, and expansion diluting pitching.

The 2010s Return to Pitching: Strikeout rates soared as pitchers threw harder and batters sold out for power. Batting averages declined to levels not seen since the 1960s.

The 2020s Three True Outcomes Era: Modern baseball features historic levels of strikeouts, walks, and home runs, with traditional balls in play becoming less common.

Understanding these contextual factors is essential before attempting any cross-era analysis.

13.2 Era-Adjusted Statistics

To make meaningful comparisons across eras, we need statistics that adjust for context. The most widely used era-adjusted statistics express performance relative to league average, typically using 100 as the baseline.

OPS+ (Adjusted OPS)

OPS+ adjusts a player's OPS (On-Base Plus Slugging) for their league and ballpark. The formula is:

OPS+ = 100 * (OBP/lgOBP + SLG/lgSLG - 1)

This formula is then adjusted for park factors. An OPS+ of 100 is league average; 150 means the player was 50% better than average; 75 means 25% below average.

Let's calculate OPS+ for Babe Ruth's legendary 1927 season:

Ruth's 1927 Stats:

OBP: .486

SLG: .772

OPS: 1.258

1927 AL Average:

OBP: .333

SLG: .392

OPS: .725

# Calculate OPS+ for Babe Ruth 1927
ruth_obp <- 0.486
ruth_slg <- 0.772
league_obp <- 0.333
league_slg <- 0.392

# Basic OPS+ calculation (before park adjustment)
ops_plus <- 100 * ((ruth_obp / league_obp) + (ruth_slg / league_slg) - 1)
print(paste("Ruth's 1927 OPS+:", round(ops_plus, 0)))

# Calculate OPS+ for Babe Ruth 1927
ruth_obp = 0.486
ruth_slg = 0.772
league_obp = 0.333
league_slg = 0.392

# Basic OPS+ calculation (before park adjustment)
ops_plus = 100 * ((ruth_obp / league_obp) + (ruth_slg / league_slg) - 1)
print(f"Ruth's 1927 OPS+: {ops_plus:.0f}")

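With these unadjusted inputs the formula returns about 243. Published OPS+ values, which add park adjustments and refine the league baseline, put Ruth's 1927 season at 225, meaning he was 125% better than the average hitter. This astronomical figure helps explain why many analysts consider this the greatest offensive season in baseball history.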

ERA+ (Adjusted ERA)

ERA+ works similarly for pitchers, with the formula:

ERA+ = 100 * (lgERA / ERA)

Again adjusted for park factors. Higher is better (the opposite of ERA itself).

Let's calculate ERA+ for Pedro Martinez's incredible 2000 season:

Pedro's 2000 Stats:

ERA: 1.74

2000 AL Average ERA: 4.91

# Calculate ERA+ for Pedro Martinez 2000
pedro_era <- 1.74
league_era <- 4.91

# Basic ERA+ calculation (before park adjustment)
era_plus <- 100 * (league_era / pedro_era)
print(paste("Pedro's 2000 ERA+:", round(era_plus, 0)))

# Calculate ERA+ for Pedro Martinez 2000
pedro_era = 1.74
league_era = 4.91

# Basic ERA+ calculation (before park adjustment)
era_plus = 100 * (league_era / pedro_era)
print(f"Pedro's 2000 ERA+: {era_plus:.0f}")

This yields an ERA+ of about 282 before park adjustment, the highest single-season mark since 1900 for a pitcher with 150+ innings. Pedro was nearly three times as effective as the average pitcher in a high-offense environment.
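Park adjustment can be folded into the same calculation. One common simplification, assumed here rather than taken from any official definition, scales the league ERA by the pitcher's home park factor (expressed so that 1.00 is neutral) before dividing. The park factor of 1.03 below is a hypothetical value chosen for illustration, not a measured figure for any real park.

```python
def era_plus(era, league_era, park_factor=1.00):
    """ERA+ with a simple park adjustment (assumed convention: 1.00 = neutral park)."""
    if era <= 0:
        raise ValueError("ERA must be positive")
    return 100 * (league_era * park_factor) / era

# Pedro Martinez 2000, using the figures from the text
unadjusted = era_plus(1.74, 4.91)
adjusted = era_plus(1.74, 4.91, park_factor=1.03)  # hypothetical park factor

print(f"Unadjusted ERA+: {unadjusted:.0f}")
print(f"With a hypothetical 1.03 park factor: {adjusted:.0f}")
```

A pitcher whose home park inflates scoring gets credit for it: the same ERA maps to a higher ERA+ when the park factor is above 1.00.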

wRC+ (Weighted Runs Created Plus)

wRC+ is a more sophisticated offensive statistic that weights different offensive events by their actual run value. It's calculated using linear weights derived from run expectancy matrices, then adjusted for league and park.

The basic concept:

Calculate wOBA (weighted on-base average) using appropriate weights for each event

Convert to wRAA (weighted runs above average)

Add a park factor adjustment

Scale to 100 (league average)

Here's a simplified calculation:

# Simplified wRC+ calculation
calculate_wrc_plus <- function(player_stats, league_stats, park_factor = 1.00) {
  # These are approximate 2023 weights
  woba_weights <- list(
    BB = 0.69,
    HBP = 0.72,
    '1B' = 0.88,
    '2B' = 1.24,
    '3B' = 1.56,
    HR = 2.08
  )

  # Calculate player wOBA
  player_woba <- (
    woba_weights$BB * player_stats$BB +
    woba_weights$HBP * player_stats$HBP +
    woba_weights$'1B' * player_stats$'1B' +
    woba_weights$'2B' * player_stats$'2B' +
    woba_weights$'3B' * player_stats$'3B' +
    woba_weights$HR * player_stats$HR
  ) / (player_stats$AB + player_stats$BB - player_stats$IBB + player_stats$SF + player_stats$HBP)

  # Calculate league wOBA
  league_woba <- league_stats$wOBA

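  # wRC+ formula (simplified): convert the wOBA gap to runs per PA,
  # then express it relative to league runs per PA.
  # Both constants are approximations assumed for this example.
  woba_scale <- 1.15     # approximate wOBA scale
  lg_r_per_pa <- 0.115   # assumed league runs per plate appearance
  wrc_plus <- (((player_woba - league_woba) / woba_scale) / lg_r_per_pa + 1) * 100 / park_factor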

  return(wrc_plus)
}

# Example: Mike Trout 2012
trout_2012 <- list(
  AB = 559, BB = 67, HBP = 1, '1B' = 113, '2B' = 27, '3B' = 8, HR = 30,
  IBB = 0, SF = 4
)

league_2012 <- list(wOBA = 0.315)

trout_wrc_plus <- calculate_wrc_plus(trout_2012, league_2012, park_factor = 1.00)
print(paste("Trout's 2012 wRC+:", round(trout_wrc_plus, 0)))

def calculate_wrc_plus(player_stats, league_woba, park_factor=1.00):
    """
    Simplified wRC+ calculation
    """
    # These are approximate 2023 weights
    woba_weights = {
        'BB': 0.69,
        'HBP': 0.72,
        '1B': 0.88,
        '2B': 1.24,
        '3B': 1.56,
        'HR': 2.08
    }

    # Calculate player wOBA
    numerator = (
        woba_weights['BB'] * player_stats['BB'] +
        woba_weights['HBP'] * player_stats['HBP'] +
        woba_weights['1B'] * player_stats['1B'] +
        woba_weights['2B'] * player_stats['2B'] +
        woba_weights['3B'] * player_stats['3B'] +
        woba_weights['HR'] * player_stats['HR']
    )

    denominator = (
        player_stats['AB'] + player_stats['BB'] - player_stats['IBB'] +
        player_stats['SF'] + player_stats['HBP']
    )

    player_woba = numerator / denominator

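    # wRC+ formula (simplified): convert the wOBA gap to runs per PA,
    # then express it relative to league runs per PA.
    # Both constants are approximations assumed for this example.
    woba_scale = 1.15     # approximate wOBA scale
    lg_r_per_pa = 0.115   # assumed league runs per plate appearance
    wrc_plus = (((player_woba - league_woba) / woba_scale) / lg_r_per_pa + 1) * 100 / park_factor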

    return wrc_plus

# Example: Mike Trout 2012
trout_2012 = {
    'AB': 559, 'BB': 67, 'HBP': 1, '1B': 113, '2B': 27, '3B': 8, 'HR': 30,
    'IBB': 0, 'SF': 4
}

league_woba_2012 = 0.315

trout_wrc_plus = calculate_wrc_plus(trout_2012, league_woba_2012, park_factor=1.00)
print(f"Trout's 2012 wRC+: {trout_wrc_plus:.0f}")

Park Factors

Park factors adjust for the offensive environment of a player's home ballpark. Some parks (like Coors Field) dramatically increase offense; others (like Oracle Park in San Francisco) suppress it.

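Park factors are typically calculated by comparing runs scored per game in a team's home games (by both teams) with runs scored per game in its road games:

Park Factor = (Runs in Home Games / Home Games) / (Runs in Road Games / Road Games) * 100

Here "runs" means total runs scored, not home runs.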

A park factor of 100 is neutral; above 100 favors hitters; below 100 favors pitchers.

Here's how to calculate park factors:

# Calculate park factor
calculate_park_factor <- function(home_runs, home_games, road_runs, road_games) {
  park_factor <- (home_runs / home_games) / (road_runs / road_games) * 100
  return(park_factor)
}

# Example: Coors Field (notorious hitter's park)
# Hypothetical season data
coors_pf <- calculate_park_factor(
  home_runs = 900,
  home_games = 81,
  road_runs = 700,
  road_games = 81
)

print(paste("Coors Field Park Factor:", round(coors_pf, 0)))

# Example: Oracle Park (pitcher's park)
oracle_pf <- calculate_park_factor(
  home_runs = 650,
  home_games = 81,
  road_runs = 750,
  road_games = 81
)

print(paste("Oracle Park Factor:", round(oracle_pf, 0)))

def calculate_park_factor(home_runs, home_games, road_runs, road_games):
    """
    Calculate park factor
    """
    park_factor = (home_runs / home_games) / (road_runs / road_games) * 100
    return park_factor

# Example: Coors Field (notorious hitter's park)
coors_pf = calculate_park_factor(
    home_runs=900,
    home_games=81,
    road_runs=700,
    road_games=81
)

print(f"Coors Field Park Factor: {coors_pf:.0f}")

# Example: Oracle Park (pitcher's park)
oracle_pf = calculate_park_factor(
    home_runs=650,
    home_games=81,
    road_runs=750,
    road_games=81
)

print(f"Oracle Park Factor: {oracle_pf:.0f}")

Historical Park Factor Challenges

Park factors become more complex when analyzing historical players because:

Parks Changed: Many historical players played in parks that no longer exist
Multi-Year Analysis: Players often played in multiple parks throughout their careers
Era Effects: The same physical park might play differently in different eras due to changes in the ball, rules, or playing style

For historical analysis, we often use multi-year park factors and apply them carefully, recognizing that they're estimates rather than precise measurements.
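One way to build a multi-year park factor is to pool runs and games across several seasons before taking the ratio, which damps single-season noise. The sketch below assumes that pooling approach and uses made-up season totals; the function name is illustrative, and published park factors typically apply further corrections that are not shown here.

```python
def multi_year_park_factor(seasons):
    """Pool several seasons into one park factor (100 = neutral).

    seasons: list of (runs_in_home_games, home_games,
                      runs_in_road_games, road_games) tuples,
             where runs count both teams.
    """
    seasons = list(seasons)
    home_total = sum(s[0] for s in seasons)
    home_games = sum(s[1] for s in seasons)
    road_total = sum(s[2] for s in seasons)
    road_games = sum(s[3] for s in seasons)
    if home_games == 0 or road_games == 0 or road_total == 0:
        raise ValueError("need non-zero games and road runs")
    return (home_total / home_games) / (road_total / road_games) * 100

# Hypothetical three-season sample for a single park
sample = [
    (900, 81, 700, 81),
    (840, 81, 720, 81),
    (870, 81, 740, 81),
]
pooled_pf = multi_year_park_factor(sample)
print(f"Three-year park factor: {pooled_pf:.0f}")
```

Pooling weights each season by its games played, so a shortened season counts for less than a full one.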

Applying Era Adjustments

Let's apply these concepts to compare two legendary seasons:

Babe Ruth, 1927: OPS 1.258, AL average OPS .725
Barry Bonds, 2004: OPS 1.422, NL average OPS .758

Raw OPS suggests Bonds was better. But adjusted for context:

# Compare Ruth and Bonds
compare_seasons <- function(player_ops, league_ops, player_name, year) {
  relative_ops <- player_ops / league_ops
  print(paste(player_name, year, "- Relative to league average:", round(relative_ops, 3)))
  return(relative_ops)
}

ruth_relative <- compare_seasons(1.258, 0.725, "Babe Ruth", 1927)
bonds_relative <- compare_seasons(1.422, 0.758, "Barry Bonds", 2004)

# paste0 avoids the stray space that paste() would insert before the % sign
print(paste0("Ruth was ", round((ruth_relative - 1) * 100, 1), "% above average"))
print(paste0("Bonds was ", round((bonds_relative - 1) * 100, 1), "% above average"))

def compare_seasons(player_ops, league_ops, player_name, year):
    """
    Compare seasons relative to league average
    """
    relative_ops = player_ops / league_ops
    print(f"{player_name} {year} - Relative to league average: {relative_ops:.3f}")
    return relative_ops

ruth_relative = compare_seasons(1.258, 0.725, "Babe Ruth", 1927)
bonds_relative = compare_seasons(1.422, 0.758, "Barry Bonds", 2004)

print(f"Ruth was {(ruth_relative - 1) * 100:.1f}% above average")
print(f"Bonds was {(bonds_relative - 1) * 100:.1f}% above average")

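By this simple ratio, Ruth's OPS was about 74% above his league's average and Bonds's about 88% above his. Bonds keeps the edge, but the gap narrows once league context is considered, and both rank among the most dominant seasons on record.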
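To see how much league context changes the comparison, the short sketch below sets the raw gap between the two seasons against the gap after each OPS is divided by its league average. It uses only the four figures quoted above; the variable names are illustrative.

```python
ruth_ops, ruth_league = 1.258, 0.725
bonds_ops, bonds_league = 1.422, 0.758

# How far Bonds's OPS exceeds Ruth's, before and after league adjustment
raw_gap = bonds_ops / ruth_ops - 1
relative_gap = (bonds_ops / bonds_league) / (ruth_ops / ruth_league) - 1

print(f"Raw OPS gap: {raw_gap:.1%}")
print(f"League-relative gap: {relative_gap:.1%}")
```

The league-relative gap is smaller than the raw gap because the 2004 National League hit better overall than the 1927 American League.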

13.3 Using the Lahman Database

The Lahman Database is the most comprehensive source of historical baseball statistics, covering every player and team from 1871 to the present. It's available as a free download and can be accessed through R and Python packages.

Setting Up the Lahman Database

The easiest way to access Lahman data is through dedicated packages:

# Install and load the Lahman package
# install.packages("Lahman")
library(Lahman)
library(dplyr)

# The package includes multiple datasets
# Let's explore what's available
data(package = "Lahman")

# Key datasets:
# - People: biographical information
# - Batting: batting statistics by season
# - Pitching: pitching statistics by season
# - Teams: team statistics by season
# - Fielding: fielding statistics by season

# View the structure of the Batting dataset
str(Batting)

# See the first few rows
head(Batting)

# Install pybaseball (includes Lahman data access)
# pip install pybaseball

import pybaseball as pyb
import pandas as pd

# Cache downloaded data locally so repeated calls are fast
pyb.cache.enable()

# Download Lahman data
# The first time you run this, it will download the data
batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()
people = pyb.lahman.people()
teams = pyb.lahman.teams()

# View the structure
batting.info()  # info() prints its summary directly and returns None
print(batting.head())

Querying Career Statistics

Let's pull complete career statistics for some legendary players:

library(Lahman)
library(dplyr)

# Get Babe Ruth's career batting stats
# First, find his playerID
ruth_id <- People %>%
  filter(nameFirst == "Babe", nameLast == "Ruth") %>%
  pull(playerID)

# Get his career stats
ruth_career <- Batting %>%
  filter(playerID == ruth_id) %>%
  arrange(yearID)

print(ruth_career)

# Calculate career totals
ruth_totals <- ruth_career %>%
  summarise(
    Years = n(),
    Games = sum(G, na.rm = TRUE),
    AB = sum(AB, na.rm = TRUE),
    Hits = sum(H, na.rm = TRUE),
    HR = sum(HR, na.rm = TRUE),
    RBI = sum(RBI, na.rm = TRUE),
    AVG = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE)
  )

print(ruth_totals)

# Compare multiple players
compare_players <- function(first_names, last_names) {
  # Get player IDs
  player_data <- data.frame(firstName = first_names, lastName = last_names)

  results <- list()

  for(i in 1:nrow(player_data)) {
    player_id <- People %>%
      filter(nameFirst == player_data$firstName[i],
             nameLast == player_data$lastName[i]) %>%
      pull(playerID)

    if(length(player_id) > 0) {
      career <- Batting %>%
        filter(playerID == player_id[1]) %>%
        summarise(
          Name = paste(player_data$firstName[i], player_data$lastName[i]),
          Years = n(),
          Games = sum(G, na.rm = TRUE),
          AB = sum(AB, na.rm = TRUE),
          Hits = sum(H, na.rm = TRUE),
          HR = sum(HR, na.rm = TRUE),
          AVG = round(sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE), 3)
        )
      results[[i]] <- career
    }
  }

  return(bind_rows(results))
}

# Compare Ruth, Mays, Bonds, Trout (career through available data)
comparison <- compare_players(
  c("Babe", "Willie", "Barry", "Mike"),
  c("Ruth", "Mays", "Bonds", "Trout")
)

print(comparison)

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

# Load Lahman data
batting = pyb.lahman.batting()
people = pyb.lahman.people()

# Get Babe Ruth's career batting stats
ruth = people[
    (people['nameFirst'] == 'Babe') &
    (people['nameLast'] == 'Ruth')
]

ruth_id = ruth['playerID'].values[0]

# Get his career stats
ruth_career = batting[batting['playerID'] == ruth_id].sort_values('yearID')
print(ruth_career)

# Calculate career totals
ruth_totals = pd.DataFrame({
    'Years': [len(ruth_career)],
    'Games': [ruth_career['G'].sum()],
    'AB': [ruth_career['AB'].sum()],
    'Hits': [ruth_career['H'].sum()],
    'HR': [ruth_career['HR'].sum()],
    'RBI': [ruth_career['RBI'].sum()],
    'AVG': [ruth_career['H'].sum() / ruth_career['AB'].sum()]
})

print(ruth_totals)

# Compare multiple players
def compare_players(player_names):
    """
    Compare career stats for multiple players
    player_names: list of tuples (first_name, last_name)
    """
    results = []

    for first_name, last_name in player_names:
        player = people[
            (people['nameFirst'] == first_name) &
            (people['nameLast'] == last_name)
        ]

        if len(player) > 0:
            player_id = player['playerID'].values[0]
            career = batting[batting['playerID'] == player_id]

            stats = {
                'Name': f"{first_name} {last_name}",
                'Years': len(career),
                'Games': career['G'].sum(),
                'AB': career['AB'].sum(),
                'Hits': career['H'].sum(),
                'HR': career['HR'].sum(),
                'AVG': round(career['H'].sum() / career['AB'].sum(), 3)
            }
            results.append(stats)

    return pd.DataFrame(results)

# Compare Ruth, Mays, Bonds, Trout
comparison = compare_players([
    ('Babe', 'Ruth'),
    ('Willie', 'Mays'),
    ('Barry', 'Bonds'),
    ('Mike', 'Trout')
])

print(comparison)

Building Era Comparison Tools

Let's create a tool that automatically calculates era-adjusted statistics:

library(Lahman)
library(dplyr)

# Function to calculate league averages for a given year
get_league_averages <- function(year, league) {
  league_stats <- Batting %>%
    filter(yearID == year, lgID == league) %>%
    summarise(
      lgAB = sum(AB, na.rm = TRUE),
      lgH = sum(H, na.rm = TRUE),
      lgBB = sum(BB, na.rm = TRUE),
      lgHBP = sum(HBP, na.rm = TRUE),
      lgSF = sum(SF, na.rm = TRUE),
      lgTB = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE)
    ) %>%
    mutate(
      lgPA = lgAB + lgBB + lgHBP + lgSF,
      lgOBP = (lgH + lgBB + lgHBP) / lgPA,
      lgSLG = lgTB / lgAB,
      lgOPS = lgOBP + lgSLG
    )

  return(league_stats)
}

# Function to calculate OPS+ for a player season
calculate_ops_plus <- function(player_id, year) {
  # Get player stats
  player_stats <- Batting %>%
    filter(playerID == player_id, yearID == year) %>%
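    summarise(
      # Compute total bases first: once H is summarised below, the name H
      # refers to the season total rather than the per-stint values
      TB = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE),
      AB = sum(AB, na.rm = TRUE),
      H = sum(H, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      HBP = sum(HBP, na.rm = TRUE),
      SF = sum(SF, na.rm = TRUE),
      lgID = first(lgID)
    )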

  if(nrow(player_stats) == 0 || player_stats$AB == 0) {
    return(NA)
  }

  # Calculate player OBP and SLG
  player_PA <- player_stats$AB + player_stats$BB + player_stats$HBP + player_stats$SF
  player_OBP <- (player_stats$H + player_stats$BB + player_stats$HBP) / player_PA
  player_SLG <- player_stats$TB / player_stats$AB

  # Get league averages
  league_avg <- get_league_averages(year, player_stats$lgID)

  # Calculate OPS+
  ops_plus <- 100 * ((player_OBP / league_avg$lgOBP) + (player_SLG / league_avg$lgSLG) - 1)

  return(round(ops_plus, 0))
}

# Example: Calculate OPS+ for famous seasons
# Babe Ruth 1927
ruth_id <- People %>% filter(nameFirst == "Babe", nameLast == "Ruth") %>% pull(playerID)
ruth_1927_ops_plus <- calculate_ops_plus(ruth_id, 1927)
print(paste("Babe Ruth 1927 OPS+:", ruth_1927_ops_plus))

# Ted Williams 1941
williams_id <- People %>% filter(nameFirst == "Ted", nameLast == "Williams") %>% pull(playerID)
williams_1941_ops_plus <- calculate_ops_plus(williams_id, 1941)
print(paste("Ted Williams 1941 OPS+:", williams_1941_ops_plus))

# Barry Bonds 2004
bonds_id <- People %>% filter(nameFirst == "Barry", nameLast == "Bonds") %>% pull(playerID)
bonds_2004_ops_plus <- calculate_ops_plus(bonds_id, 2004)
print(paste("Barry Bonds 2004 OPS+:", bonds_2004_ops_plus))

import pybaseball as pyb
import pandas as pd
import numpy as np

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def get_league_averages(year, league):
    """
    Calculate league averages for a given year
    """
    league_stats = batting[
        (batting['yearID'] == year) &
        (batting['lgID'] == league)
    ]

    lg_ab = league_stats['AB'].sum()
    lg_h = league_stats['H'].sum()
    lg_bb = league_stats['BB'].sum()
    lg_hbp = league_stats['HBP'].sum()
    lg_sf = league_stats['SF'].sum()
    lg_2b = league_stats['2B'].sum()
    lg_3b = league_stats['3B'].sum()
    lg_hr = league_stats['HR'].sum()

    lg_tb = lg_h + lg_2b + 2*lg_3b + 3*lg_hr
    lg_pa = lg_ab + lg_bb + lg_hbp + lg_sf

    lg_obp = (lg_h + lg_bb + lg_hbp) / lg_pa
    lg_slg = lg_tb / lg_ab
    lg_ops = lg_obp + lg_slg

    return {
        'lgOBP': lg_obp,
        'lgSLG': lg_slg,
        'lgOPS': lg_ops
    }

def calculate_ops_plus(player_id, year):
    """
    Calculate OPS+ for a player season
    """
    # Get player stats
    player_stats = batting[
        (batting['playerID'] == player_id) &
        (batting['yearID'] == year)
    ]

    if len(player_stats) == 0:
        return None

    # Aggregate if player played for multiple teams
    ab = player_stats['AB'].sum()
    h = player_stats['H'].sum()
    bb = player_stats['BB'].sum()
    hbp = player_stats['HBP'].sum()
    sf = player_stats['SF'].sum()
    doubles = player_stats['2B'].sum()
    triples = player_stats['3B'].sum()
    hr = player_stats['HR'].sum()
    league = player_stats['lgID'].values[0]

    if ab == 0:
        return None

    # Calculate player OBP and SLG
    tb = h + doubles + 2*triples + 3*hr
    pa = ab + bb + hbp + sf
    player_obp = (h + bb + hbp) / pa
    player_slg = tb / ab

    # Get league averages
    league_avg = get_league_averages(year, league)

    # Calculate OPS+
    ops_plus = 100 * (
        (player_obp / league_avg['lgOBP']) +
        (player_slg / league_avg['lgSLG']) - 1
    )

    return round(ops_plus, 0)

# Example: Calculate OPS+ for famous seasons
# Babe Ruth 1927
ruth = people[(people['nameFirst'] == 'Babe') & (people['nameLast'] == 'Ruth')]
ruth_id = ruth['playerID'].values[0]
ruth_1927 = calculate_ops_plus(ruth_id, 1927)
print(f"Babe Ruth 1927 OPS+: {ruth_1927}")

# Ted Williams 1941
williams = people[(people['nameFirst'] == 'Ted') & (people['nameLast'] == 'Williams')]
williams_id = williams['playerID'].values[0]
williams_1941 = calculate_ops_plus(williams_id, 1941)
print(f"Ted Williams 1941 OPS+: {williams_1941}")

# Barry Bonds 2004
bonds = people[(people['nameFirst'] == 'Barry') & (people['nameLast'] == 'Bonds')]
bonds_id = bonds['playerID'].values[0]
bonds_2004 = calculate_ops_plus(bonds_id, 2004)
print(f"Barry Bonds 2004 OPS+: {bonds_2004}")
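The formula inside calculate_ops_plus can be checked by hand without touching the Lahman data. The following is a minimal sketch with made-up inputs: the function name ops_plus_from_rates and all of the rate values are illustrative assumptions, not figures from any real season.

```python
def ops_plus_from_rates(obp, slg, lg_obp, lg_slg):
    """Unadjusted OPS+: 100 * (OBP / lgOBP + SLG / lgSLG - 1)."""
    if lg_obp <= 0 or lg_slg <= 0:
        raise ValueError("league rates must be positive")
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

# Hypothetical hitter in a hypothetical league (illustrative numbers only)
example = ops_plus_from_rates(obp=0.400, slg=0.500, lg_obp=0.320, lg_slg=0.400)
print(round(example))  # 150

# A hitter who exactly matches the league scores 100 by construction
average = ops_plus_from_rates(0.320, 0.400, 0.320, 0.400)
print(round(average))  # 100
```

Because each component is a ratio to the league rate, a hitter who is 25% better than the league in both OBP and SLG lands at 150, not 125.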

Advanced Historical Queries

Let's find the best single seasons in history by OPS+:

library(Lahman)
library(dplyr)

# Shared preparation: treat unrecorded counts (e.g. HBP and SF in early
# seasons) as 0, then add total bases and the OBP denominator.
# Note: PA here is AB + BB + HBP + SF, so it excludes sacrifice bunts.
prepare_batting <- function() {
  Batting %>%
    mutate(
      across(c(AB, H, X2B, X3B, HR, BB, HBP, SF), ~ coalesce(.x, 0L)),
      TB = H + X2B + 2 * X3B + 3 * HR,
      PA = AB + BB + HBP + SF
    )
}

# League OBP and SLG for every league-season
league_rates <- function(bat) {
  bat %>%
    group_by(yearID, lgID) %>%
    summarise(
      lgOBP = sum(H + BB + HBP) / sum(PA),
      lgSLG = sum(TB) / sum(AB),
      .groups = "drop"
    )
}

# Find the best qualified seasons (502+ PA) by unadjusted OPS+
find_best_seasons <- function(min_pa = 502, n_seasons = 20) {
  bat <- prepare_batting()

  bat %>%
    group_by(playerID, yearID) %>%  # Combine stints for traded players
    summarise(
      lgID = first(lgID),           # League of the first stint
      OBP = sum(H + BB + HBP) / sum(PA),
      SLG = sum(TB) / sum(AB),
      PA = sum(PA),                 # Defined last so the rates above use per-row PA
      .groups = "drop"
    ) %>%
    filter(PA >= min_pa) %>%
    inner_join(league_rates(bat), by = c("yearID", "lgID")) %>%
    mutate(OPS_plus = round(100 * (OBP / lgOBP + SLG / lgSLG - 1))) %>%
    left_join(select(People, playerID, nameFirst, nameLast), by = "playerID") %>%
    arrange(desc(OPS_plus)) %>%
    select(nameFirst, nameLast, yearID, PA, OBP, SLG, OPS_plus) %>%
    head(n_seasons)
}

# Find players with the highest career OPS+ (approximate, unadjusted)
career_ops_plus <- function(min_pa = 3000) {
  bat <- prepare_batting()

  bat %>%
    inner_join(league_rates(bat), by = c("yearID", "lgID")) %>%
    group_by(playerID) %>%
    summarise(
      OBP = sum(H + BB + HBP) / sum(PA),
      SLG = sum(TB) / sum(AB),
      # Career baseline: league rates weighted by the player's own PA and AB
      lgOBP = sum(lgOBP * PA) / sum(PA),
      lgSLG = sum(lgSLG * AB) / sum(AB),
      PA = sum(PA),
      HR = sum(HR),
      .groups = "drop"
    ) %>%
    filter(PA >= min_pa) %>%
    mutate(OPS_plus = round(100 * (OBP / lgOBP + SLG / lgSLG - 1))) %>%
    left_join(select(People, playerID, nameFirst, nameLast), by = "playerID") %>%
    arrange(desc(OPS_plus)) %>%
    select(nameFirst, nameLast, PA, HR, OBP, SLG, OPS_plus) %>%
    head(20)
}

top_seasons <- find_best_seasons()
print(top_seasons)

top_careers <- career_ops_plus()
print(top_careers)

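import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

COUNT_COLS = ['AB', 'H', '2B', '3B', 'HR', 'BB', 'HBP', 'SF']

def prepare_batting(df):
    """
    Treat unrecorded counts (e.g. HBP and SF in early seasons) as 0 and add
    total bases (TB), times on base (OB), and the OBP denominator (PA).
    PA here is AB + BB + HBP + SF, so it excludes sacrifice bunts.
    """
    bat = df.copy()
    bat[COUNT_COLS] = bat[COUNT_COLS].fillna(0)
    bat['TB'] = bat['H'] + bat['2B'] + 2*bat['3B'] + 3*bat['HR']
    bat['OB'] = bat['H'] + bat['BB'] + bat['HBP']
    bat['PA'] = bat['AB'] + bat['BB'] + bat['HBP'] + bat['SF']
    return bat

def league_rates(bat):
    """League OBP and SLG for every league-season."""
    lg = bat.groupby(['yearID', 'lgID'], as_index=False)[
        ['OB', 'PA', 'TB', 'AB']
    ].sum()
    lg['lgOBP'] = lg['OB'] / lg['PA']
    lg['lgSLG'] = lg['TB'] / lg['AB']
    return lg[['yearID', 'lgID', 'lgOBP', 'lgSLG']]

def add_names(df):
    """Attach player names, keeping the row order of df."""
    return df.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID',
        how='left'
    )

def find_best_seasons(min_pa=502, n_seasons=20):
    """
    Find the best qualified single seasons by unadjusted OPS+
    (no park factor; the league baseline includes all batters).
    """
    bat = prepare_batting(batting)

    # Combine stints for traded players; use the league of the first stint
    seasons = bat.groupby(['playerID', 'yearID'], as_index=False).agg(
        lgID=('lgID', 'first'),
        OB=('OB', 'sum'), PA=('PA', 'sum'),
        TB=('TB', 'sum'), AB=('AB', 'sum'), HR=('HR', 'sum')
    )
    seasons = seasons[seasons['PA'] >= min_pa].merge(
        league_rates(bat), on=['yearID', 'lgID'], how='inner'
    )
    seasons['OBP'] = seasons['OB'] / seasons['PA']
    seasons['SLG'] = seasons['TB'] / seasons['AB']
    seasons['OPS_plus'] = (100 * (
        seasons['OBP'] / seasons['lgOBP'] +
        seasons['SLG'] / seasons['lgSLG'] - 1
    )).round()

    best = add_names(seasons.nlargest(n_seasons, 'OPS_plus'))
    return best[[
        'nameFirst', 'nameLast', 'yearID', 'PA', 'HR', 'OBP', 'SLG', 'OPS_plus'
    ]]

# Find top single seasons
top_seasons = find_best_seasons()
print(top_seasons)

def career_ops_plus(min_pa=3000):
    """
    Approximate career OPS+: career OBP and SLG divided by league rates
    weighted by the player's own PA and AB in each season.
    """
    bat = prepare_batting(batting)
    bat = bat.merge(league_rates(bat), on=['yearID', 'lgID'], how='inner')
    bat['lgOBP_w'] = bat['lgOBP'] * bat['PA']
    bat['lgSLG_w'] = bat['lgSLG'] * bat['AB']

    career = bat.groupby('playerID', as_index=False)[
        ['OB', 'PA', 'TB', 'AB', 'HR', 'lgOBP_w', 'lgSLG_w']
    ].sum()
    career = career[career['PA'] >= min_pa].copy()
    career['OBP'] = career['OB'] / career['PA']
    career['SLG'] = career['TB'] / career['AB']
    career['OPS_plus'] = (100 * (
        career['OBP'] / (career['lgOBP_w'] / career['PA']) +
        career['SLG'] / (career['lgSLG_w'] / career['AB']) - 1
    )).round()

    top = add_names(career.nlargest(20, 'OPS_plus'))
    return top[['nameFirst', 'nameLast', 'PA', 'HR', 'OBP', 'SLG', 'OPS_plus']]

top_careers = career_ops_plus()
print(top_careers)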
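The 502 plate appearance cutoff used in the R version corresponds to 3.1 plate appearances per scheduled team game over a 162-game schedule. Shorter schedules imply a lower cutoff, which matters when seasons from different eras are filtered together. A minimal sketch, assuming the 3.1-per-game rule is applied uniformly to every schedule length (official qualification rules were different in earlier decades, so treat this as a convenience rather than a historical rulebook); qualifying_pa is a hypothetical helper.

```python
PA_PER_TEAM_GAME = 3.1  # Assumption: modern rule applied to every schedule length

def qualifying_pa(scheduled_games):
    """Plate appearances needed to qualify, rounded to the nearest whole PA."""
    if scheduled_games <= 0:
        raise ValueError("scheduled_games must be positive")
    return round(PA_PER_TEAM_GAME * scheduled_games)

for games in (162, 154, 60):
    print(games, qualifying_pa(games))
```

A per-season threshold like this could replace the fixed min_pa argument when the query spans both 154-game and 162-game schedules.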

13.4 Decade-by-Decade Analysis

Understanding how baseball has evolved requires examining statistical trends over time. Let's analyze how key metrics have changed decade by decade.

Calculating Decade Averages

First, we'll calculate league-wide statistics for each decade:

library(Lahman)
library(dplyr)
library(ggplot2)

# Calculate decade-by-decade league statistics
decade_analysis <- Batting %>%
  filter(yearID >= 1900) %>%  # Focus on modern era
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_per_AB = Total_HR / Total_AB,
    SO_per_AB = Total_SO / Total_AB,
    BB_per_AB = Total_BB / Total_AB,
    HR_per_Game = (Total_HR / Total_AB) * 34,  # Roughly 34 AB per team per game
    SO_per_Game = (Total_SO / Total_AB) * 34
  )

print(decade_analysis)

# Pitching trends by decade
pitching_decade <- Pitching %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_IP = sum(IPouts, na.rm = TRUE) / 3,
    Total_ER = sum(ER, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    ERA = (Total_ER / Total_IP) * 9,
    SO_per_9 = (Total_SO / Total_IP) * 9,
    BB_per_9 = (Total_BB / Total_IP) * 9,
    SO_BB_ratio = Total_SO / Total_BB
  )

print(pitching_decade)

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pyb.cache.enable()

batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()

# Calculate decade-by-decade league statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10

decade_analysis = batting_modern.groupby('decade').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

decade_analysis.columns = ['decade', 'Total_AB', 'Total_H', 'Total_HR', 'Total_SO', 'Total_BB']

# Calculate rate stats
decade_analysis['AVG'] = decade_analysis['Total_H'] / decade_analysis['Total_AB']
decade_analysis['HR_per_AB'] = decade_analysis['Total_HR'] / decade_analysis['Total_AB']
decade_analysis['SO_per_AB'] = decade_analysis['Total_SO'] / decade_analysis['Total_AB']
decade_analysis['BB_per_AB'] = decade_analysis['Total_BB'] / decade_analysis['Total_AB']
# Roughly 34 AB per team per game
decade_analysis['HR_per_Game'] = (decade_analysis['Total_HR'] / decade_analysis['Total_AB']) * 34
decade_analysis['SO_per_Game'] = (decade_analysis['Total_SO'] / decade_analysis['Total_AB']) * 34

print(decade_analysis)

# Pitching trends by decade
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10

pitching_decade = pitching_modern.groupby('decade').agg({
    'IPouts': 'sum',
    'ER': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

# Calculate IP from outs
pitching_decade['Total_IP'] = pitching_decade['IPouts'] / 3

# Calculate rate stats
pitching_decade['ERA'] = (pitching_decade['ER'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_per_9'] = (pitching_decade['SO'] / pitching_decade['Total_IP']) * 9
pitching_decade['BB_per_9'] = (pitching_decade['BB'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_BB_ratio'] = pitching_decade['SO'] / pitching_decade['BB']

print(pitching_decade)
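The decade code above sums the counts first and divides afterward. Averaging each season's rate instead would give a short season the same weight as a full one. The sketch below shows the difference on a tiny hand-made table; the numbers are invented for illustration and decade_rates is a hypothetical helper name.

```python
import pandas as pd

def decade_rates(df):
    """Pool H and AB within each decade, then compute AVG from the totals."""
    out = df.assign(decade=(df['yearID'] // 10) * 10)
    totals = out.groupby('decade', as_index=False)[['AB', 'H']].sum()
    totals['AVG'] = totals['H'] / totals['AB']
    return totals

# Invented data: one short season and one full season in the 1990s
toy = pd.DataFrame({
    'yearID': [1994, 1995, 2001],
    'AB':     [1000, 4000, 5000],
    'H':      [300, 1000, 1300],
})
pooled = decade_rates(toy)
print(pooled)

# Unweighted mean of the two 1990s season rates, for contrast
naive_1990s = (300 / 1000 + 1000 / 4000) / 2
print(round(naive_1990s, 3))
```

In this toy table the pooled 1990s average is .260, while the unweighted mean of the two season rates is .275, because the short season counts as much as the full one.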

Key Observations from Decade Analysis

Looking at the data reveals several clear trends:

The Dead Ball Era (1900-1919):

Very low home run rates (< 0.5% of at-bats)

Modest batting averages (roughly .250-.260)

Low strikeout rates (< 10% of at-bats)

Low ERAs (decade averages at or below about 3.00)

The Live Ball Transition (1920-1930):

Home runs doubled or tripled

Batting averages peaked around .280

Strikeout rates remained low

ERA spiked in the late 1920s

Post-Integration Era (1950-1960):

More balanced offense and pitching

Steady increase in power numbers

Rising strikeout rates

ERA stabilized around 4.00

The 1960s Pitching Dominance:

Lowest ERAs since dead ball era

Strikeouts began rapid ascent

Batting averages declined

Home runs suppressed

The Expansion and Power Era (1970-2000):

Steady increase in home runs

Batting averages remained stable

Strikeouts continued rising

ERAs fluctuated with rule changes

The Steroid Era Peak (1990-2010):

Historic home run rates

Elevated offensive numbers across the board

Strikeouts accelerating

ERA inflation despite better pitching

Modern Three True Outcomes (2010-present):

Historic strikeout rates (> 20% of at-bats)

Continued high home run rates

Lowest batting averages since 1960s

Increased pitcher dominance returning
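The "three true outcomes" label in the last group refers to plate appearances that end in a home run, walk, or strikeout. A minimal sketch of that rate, assuming Lahman-style column names and treating HBP and SF as optional columns; tto_rate is a hypothetical helper, the sample rows are invented, and the plate appearance total here omits sacrifice bunts.

```python
import pandas as pd

def tto_rate(df):
    """Share of plate appearances ending in HR, BB, or SO, computed from totals."""
    cols = ['AB', 'BB', 'HBP', 'SF', 'HR', 'SO']
    # reindex adds any missing column as NaN; missing counts are treated as 0
    totals = df.reindex(columns=cols).fillna(0).sum()
    pa = totals['AB'] + totals['BB'] + totals['HBP'] + totals['SF']
    if pa == 0:
        return float('nan')
    return (totals['HR'] + totals['BB'] + totals['SO']) / pa

# Invented rows; the second one has no recorded HBP or SF
toy = pd.DataFrame({
    'AB': [500, 400], 'BB': [60, 30], 'HBP': [5, None],
    'SF': [5, None], 'HR': [30, 10], 'SO': [120, 80],
})
print(round(tto_rate(toy), 3))  # 0.33
```

Applied to the rows of one decade at a time, the same function would trace how the share of plate appearances with no ball in play has grown.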

Visualizing Historical Trends

Let's plot these trends, with one panel each for batting average, home run rate, strikeout rate, and ERA:

library(ggplot2)
library(gridExtra)

# Batting Average over time
avg_plot <- ggplot(decade_analysis, aes(x = decade, y = AVG)) +
  geom_line(linewidth = 1.5, color = "blue") +
  geom_point(size = 3, color = "blue") +
  labs(
    title = "MLB Batting Average by Decade",
    x = "Decade",
    y = "Batting Average"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
  scale_y_continuous(limits = c(0.240, 0.290))  # Wide enough for the 1920s peak

# Home runs per at-bat
hr_plot <- ggplot(decade_analysis, aes(x = decade, y = HR_per_AB * 100)) +
  geom_line(linewidth = 1.5, color = "red") +
  geom_point(size = 3, color = "red") +
  labs(
    title = "MLB Home Run Rate by Decade",
    x = "Decade",
    y = "Home Runs per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# Strikeouts per at-bat
so_plot <- ggplot(decade_analysis, aes(x = decade, y = SO_per_AB * 100)) +
  geom_line(linewidth = 1.5, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  labs(
    title = "MLB Strikeout Rate by Decade",
    x = "Decade",
    y = "Strikeouts per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# ERA over time
era_plot <- ggplot(pitching_decade, aes(x = decade, y = ERA)) +
  geom_line(linewidth = 1.5, color = "purple") +
  geom_point(size = 3, color = "purple") +
  labs(
    title = "MLB ERA by Decade",
    x = "Decade",
    y = "ERA"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# Combine plots
grid.arrange(avg_plot, hr_plot, so_plot, era_plot, ncol = 2)

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)

# Create subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

# Batting Average over time
ax1.plot(decade_analysis['decade'], decade_analysis['AVG'],
         marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('MLB Batting Average by Decade', fontsize=14, fontweight='bold')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Batting Average')
ax1.set_ylim(0.240, 0.290)  # Wide enough for the 1920s peak
ax1.grid(True, alpha=0.3)

# Home runs per at-bat
ax2.plot(decade_analysis['decade'], decade_analysis['HR_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('MLB Home Run Rate by Decade', fontsize=14, fontweight='bold')
ax2.set_xlabel('Decade')
ax2.set_ylabel('Home Runs per 100 AB')
ax2.grid(True, alpha=0.3)

# Strikeouts per at-bat
ax3.plot(decade_analysis['decade'], decade_analysis['SO_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='darkgreen')
ax3.set_title('MLB Strikeout Rate by Decade', fontsize=14, fontweight='bold')
ax3.set_xlabel('Decade')
ax3.set_ylabel('Strikeouts per 100 AB')
ax3.grid(True, alpha=0.3)

# ERA over time
ax4.plot(pitching_decade['decade'], pitching_decade['ERA'],
         marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('MLB ERA by Decade', fontsize=14, fontweight='bold')
ax4.set_xlabel('Decade')
ax4.set_ylabel('ERA')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('decade_trends.png', dpi=300, bbox_inches='tight')
plt.show()

Year-by-Year Analysis for Recent Trends

For more granular analysis, let's look at year-by-year trends in the modern era:

library(Lahman)
library(dplyr)
library(ggplot2)

# Year-by-year since 1990
modern_yearly <- Batting %>%
  filter(yearID >= 1990) %>%
  group_by(yearID) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_Rate = (Total_HR / Total_AB) * 100,
    SO_Rate = (Total_SO / Total_AB) * 100,
    BB_Rate = (Total_BB / Total_AB) * 100
  )

# Create visualization
ggplot(modern_yearly) +
  geom_line(aes(x = yearID, y = AVG * 1000), color = "blue", linewidth = 1) +
  geom_line(aes(x = yearID, y = HR_Rate * 10), color = "red", linewidth = 1) +
  geom_line(aes(x = yearID, y = SO_Rate * 10), color = "green", linewidth = 1) +
  labs(
    title = "Modern Era Trends (1990-Present)",
    subtitle = "Blue = AVG (×1000), Red = HR Rate (×10), Green = SO Rate (×10)",
    x = "Year",
    y = "Scaled Value"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt

pyb.cache.enable()

batting = pyb.lahman.batting()

# Year-by-year since 1990
modern_batting = batting[batting['yearID'] >= 1990].copy()

modern_yearly = modern_batting.groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

# Calculate rate stats
modern_yearly['AVG'] = modern_yearly['H'] / modern_yearly['AB']
modern_yearly['HR_Rate'] = (modern_yearly['HR'] / modern_yearly['AB']) * 100
modern_yearly['SO_Rate'] = (modern_yearly['SO'] / modern_yearly['AB']) * 100
modern_yearly['BB_Rate'] = (modern_yearly['BB'] / modern_yearly['AB']) * 100

# Create visualization
plt.figure(figsize=(14, 8))

plt.plot(modern_yearly['yearID'], modern_yearly['AVG'] * 1000,
         label='AVG (×1000)', linewidth=2, color='blue')
plt.plot(modern_yearly['yearID'], modern_yearly['HR_Rate'] * 10,
         label='HR Rate (×10)', linewidth=2, color='red')
plt.plot(modern_yearly['yearID'], modern_yearly['SO_Rate'] * 10,
         label='SO Rate (×10)', linewidth=2, color='green')

plt.title('Modern Era Trends (1990-Present)', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Value', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('modern_era_trends.png', dpi=300, bbox_inches='tight')
plt.show()

print(modern_yearly)
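The plot above multiplies each series by a different constant so that all three share one axis. An alternative that avoids hand-picked multipliers is to index every series to its value in a base year, so each line starts at 100. A sketch under that assumption; index_to_base is a hypothetical helper and the toy figures are invented.

```python
import pandas as pd

def index_to_base(df, columns, base_year, year_col='yearID'):
    """Rescale each column so its value in base_year equals 100."""
    base = df.loc[df[year_col] == base_year, columns]
    if len(base) != 1:
        raise ValueError(f"expected exactly one row for {base_year}")
    indexed = df[[year_col]].copy()
    for col in columns:
        indexed[col] = 100 * df[col] / base[col].iloc[0]
    return indexed

# Invented yearly rates for illustration
toy = pd.DataFrame({
    'yearID':  [1990, 2000, 2010],
    'AVG':     [0.250, 0.275, 0.250],
    'SO_Rate': [16.0, 18.0, 20.0],
})
indexed = index_to_base(toy, ['AVG', 'SO_Rate'], base_year=1990)
print(indexed)
```

With the real table, the call would be index_to_base(modern_yearly, ['AVG', 'HR_Rate', 'SO_Rate'], 1990), and the result can be plotted directly with one y-axis labeled "Index (1990 = 100)".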

R

library(Lahman)
library(dplyr)
library(ggplot2)

# Calculate decade-by-decade league statistics
decade_analysis <- Batting %>%
  filter(yearID >= 1900) %>%  # Focus on modern era
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_per_AB = Total_HR / Total_AB,
    SO_per_AB = Total_SO / Total_AB,
    BB_per_AB = Total_BB / Total_AB,
    HR_per_Game = (Total_HR / Total_AB) * 4.5,  # Approximate ABs per game
    SO_per_Game = (Total_SO / Total_AB) * 4.5
  )

print(decade_analysis)

# Pitching trends by decade
pitching_decade <- Pitching %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_IP = sum(IPouts, na.rm = TRUE) / 3,
    Total_ER = sum(ER, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    ERA = (Total_ER / Total_IP) * 9,
    SO_per_9 = (Total_SO / Total_IP) * 9,
    BB_per_9 = (Total_BB / Total_IP) * 9,
    SO_BB_ratio = Total_SO / Total_BB
  )

print(pitching_decade)

R

library(ggplot2)
library(gridExtra)

# Batting Average over time
avg_plot <- ggplot(decade_analysis, aes(x = decade, y = AVG)) +
  geom_line(size = 1.5, color = "blue") +
  geom_point(size = 3, color = "blue") +
  labs(
    title = "MLB Batting Average by Decade",
    x = "Decade",
    y = "Batting Average"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold")) +
  scale_y_continuous(limits = c(0.240, 0.280))

# Home runs per at-bat
hr_plot <- ggplot(decade_analysis, aes(x = decade, y = HR_per_AB * 100)) +
  geom_line(size = 1.5, color = "red") +
  geom_point(size = 3, color = "red") +
  labs(
    title = "MLB Home Run Rate by Decade",
    x = "Decade",
    y = "Home Runs per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# Strikeouts per at-bat
so_plot <- ggplot(decade_analysis, aes(x = decade, y = SO_per_AB * 100)) +
  geom_line(linewidth = 1.5, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  labs(
    title = "MLB Strikeout Rate by Decade",
    x = "Decade",
    y = "Strikeouts per 100 AB"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# ERA over time
era_plot <- ggplot(pitching_decade, aes(x = decade, y = ERA)) +
  geom_line(linewidth = 1.5, color = "purple") +
  geom_point(size = 3, color = "purple") +
  labs(
    title = "MLB ERA by Decade",
    x = "Decade",
    y = "ERA"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

# Combine plots
grid.arrange(avg_plot, hr_plot, so_plot, era_plot, ncol = 2)

R

library(Lahman)
library(dplyr)
library(ggplot2)

# Year-by-year since 1990
modern_yearly <- Batting %>%
  filter(yearID >= 1990) %>%
  group_by(yearID) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG = Total_H / Total_AB,
    HR_Rate = (Total_HR / Total_AB) * 100,
    SO_Rate = (Total_SO / Total_AB) * 100,
    BB_Rate = (Total_BB / Total_AB) * 100
  )

# Create visualization
ggplot(modern_yearly) +
  # linewidth replaces size for lines in ggplot2 3.4.0+
  geom_line(aes(x = yearID, y = AVG * 1000), color = "blue", linewidth = 1) +
  geom_line(aes(x = yearID, y = HR_Rate * 10), color = "red", linewidth = 1) +
  geom_line(aes(x = yearID, y = SO_Rate * 10), color = "green", linewidth = 1) +
  labs(
    title = "Modern Era Trends (1990-Present)",
    subtitle = "Blue = AVG (×1000), Red = HR Rate (×10), Green = SO Rate (×10)",
    x = "Year",
    y = "Scaled Value"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Python

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pyb.cache.enable()

batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()

# Calculate decade-by-decade league statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10

decade_analysis = batting_modern.groupby('decade').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

decade_analysis.columns = ['decade', 'Total_AB', 'Total_H', 'Total_HR', 'Total_SO', 'Total_BB']

# Calculate rate stats
decade_analysis['AVG'] = decade_analysis['Total_H'] / decade_analysis['Total_AB']
decade_analysis['HR_per_AB'] = decade_analysis['Total_HR'] / decade_analysis['Total_AB']
decade_analysis['SO_per_AB'] = decade_analysis['Total_SO'] / decade_analysis['Total_AB']
decade_analysis['BB_per_AB'] = decade_analysis['Total_BB'] / decade_analysis['Total_AB']
# Rough per-player scaling (~4.5 AB per player per game); these are not team totals per game
decade_analysis['HR_per_Game'] = (decade_analysis['Total_HR'] / decade_analysis['Total_AB']) * 4.5
decade_analysis['SO_per_Game'] = (decade_analysis['Total_SO'] / decade_analysis['Total_AB']) * 4.5

print(decade_analysis)

# Pitching trends by decade
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10

pitching_decade = pitching_modern.groupby('decade').agg({
    'IPouts': 'sum',
    'ER': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

# Calculate IP from outs
pitching_decade['Total_IP'] = pitching_decade['IPouts'] / 3

# Calculate rate stats
pitching_decade['ERA'] = (pitching_decade['ER'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_per_9'] = (pitching_decade['SO'] / pitching_decade['Total_IP']) * 9
pitching_decade['BB_per_9'] = (pitching_decade['BB'] / pitching_decade['Total_IP']) * 9
pitching_decade['SO_BB_ratio'] = pitching_decade['SO'] / pitching_decade['BB']

print(pitching_decade)

Python

import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)

# Create subplots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)

# Batting Average over time
ax1.plot(decade_analysis['decade'], decade_analysis['AVG'],
         marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('MLB Batting Average by Decade', fontsize=14, fontweight='bold')
ax1.set_xlabel('Decade')
ax1.set_ylabel('Batting Average')
ax1.set_ylim(0.230, 0.300)  # Wide enough that high-offense decades (e.g. the 1920s) stay visible
ax1.grid(True, alpha=0.3)

# Home runs per at-bat
ax2.plot(decade_analysis['decade'], decade_analysis['HR_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('MLB Home Run Rate by Decade', fontsize=14, fontweight='bold')
ax2.set_xlabel('Decade')
ax2.set_ylabel('Home Runs per 100 AB')
ax2.grid(True, alpha=0.3)

# Strikeouts per at-bat
ax3.plot(decade_analysis['decade'], decade_analysis['SO_per_AB'] * 100,
         marker='o', linewidth=2, markersize=8, color='darkgreen')
ax3.set_title('MLB Strikeout Rate by Decade', fontsize=14, fontweight='bold')
ax3.set_xlabel('Decade')
ax3.set_ylabel('Strikeouts per 100 AB')
ax3.grid(True, alpha=0.3)

# ERA over time
ax4.plot(pitching_decade['decade'], pitching_decade['ERA'],
         marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('MLB ERA by Decade', fontsize=14, fontweight='bold')
ax4.set_xlabel('Decade')
ax4.set_ylabel('ERA')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('decade_trends.png', dpi=300, bbox_inches='tight')
plt.show()

Python

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt

pyb.cache.enable()

batting = pyb.lahman.batting()

# Year-by-year since 1990
modern_batting = batting[batting['yearID'] >= 1990].copy()

modern_yearly = modern_batting.groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum'
}).reset_index()

# Calculate rate stats
modern_yearly['AVG'] = modern_yearly['H'] / modern_yearly['AB']
modern_yearly['HR_Rate'] = (modern_yearly['HR'] / modern_yearly['AB']) * 100
modern_yearly['SO_Rate'] = (modern_yearly['SO'] / modern_yearly['AB']) * 100
modern_yearly['BB_Rate'] = (modern_yearly['BB'] / modern_yearly['AB']) * 100

# Create visualization
plt.figure(figsize=(14, 8))

plt.plot(modern_yearly['yearID'], modern_yearly['AVG'] * 1000,
         label='AVG (×1000)', linewidth=2, color='blue')
plt.plot(modern_yearly['yearID'], modern_yearly['HR_Rate'] * 10,
         label='HR Rate (×10)', linewidth=2, color='red')
plt.plot(modern_yearly['yearID'], modern_yearly['SO_Rate'] * 10,
         label='SO Rate (×10)', linewidth=2, color='green')

plt.title('Modern Era Trends (1990-Present)', fontsize=16, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Scaled Value', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('modern_era_trends.png', dpi=300, bbox_inches='tight')
plt.show()

print(modern_yearly)

13.5 Comparing Players Across Eras

Comparing individual players across eras is the ultimate test of our analytical methods. We need frameworks that account for different competitive environments while still allowing meaningful comparisons.

WAR as a Cross-Era Comparison Tool

Wins Above Replacement (WAR) is designed to be era-neutral because it compares players to replacement level within their own era. A player worth 8 WAR in 1927 provided approximately the same value as a player worth 8 WAR in 2023, even though their raw statistics might look completely different.

WAR's key advantages for cross-era comparison:

Position Adjustment: Accounts for defensive value at different positions
League and Park Adjustment: Built-in era and park factors
Playing Time: Rewards durability and availability
Replacement Level: Measures every player against a replacement-level baseline set within his own league and season

The Lahman database does not include WAR, so as a stand-in let's compare the career totals of several legendary players:

R

library(Lahman)
library(dplyr)

# Note: Lahman database doesn't include WAR directly
# We'll use a simplified framework based on available data
# For actual WAR, you'd use Baseball-Reference or FanGraphs data

# Function to get career value statistics
get_career_value <- function(first_name, last_name) {
  # Get player ID
  player <- People %>%
    filter(nameFirst == first_name, nameLast == last_name)

  if(nrow(player) == 0) {
    return(NULL)
  }

  player_id <- player$playerID[1]

  # Get career stats
  career <- Batting %>%
    filter(playerID == player_id) %>%
    summarise(
      Name = paste(first_name, last_name),
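      Years = n_distinct(yearID),  # Count seasons, not stints (a traded player has several rows per year)
      Games = sum(G, na.rm = TRUE),
      AB = sum(AB, na.rm = TRUE),
      Hits = sum(H, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      RBI = sum(RBI, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      AVG = round(Hits / AB, 3),  # Uses the career totals computed just above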
      First_Year = min(yearID),
      Last_Year = max(yearID)
    )

  return(career)
}

# Compare legendary players
legends <- list(
  c("Babe", "Ruth"),
  c("Ted", "Williams"),
  c("Willie", "Mays"),
  c("Hank", "Aaron"),
  c("Barry", "Bonds"),
  c("Mike", "Trout"),
  c("Albert", "Pujols")
)

comparison <- bind_rows(lapply(legends, function(x) get_career_value(x[1], x[2])))
print(comparison)

# Calculate per-season averages
comparison <- comparison %>%
  mutate(
    HR_per_Season = round(HR / Years, 1),
    Games_per_Season = round(Games / Years, 0)
  )

print(comparison[, c("Name", "Years", "HR", "HR_per_Season", "AVG")])

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def get_career_value(first_name, last_name):
    """
    Get career value statistics for a player
    """
    # Get player ID
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]

    if len(player) == 0:
        return None

    player_id = player['playerID'].values[0]

    # Get career stats
    career = batting[batting['playerID'] == player_id]

    stats = {
        'Name': f"{first_name} {last_name}",
        'Years': career['yearID'].nunique(),  # Count seasons, not stints (a traded player has several rows per year)
        'Games': career['G'].sum(),
        'AB': career['AB'].sum(),
        'Hits': career['H'].sum(),
        'HR': career['HR'].sum(),
        'RBI': career['RBI'].sum(),
        'BB': career['BB'].sum(),
        'AVG': round(career['H'].sum() / career['AB'].sum(), 3),
        'First_Year': career['yearID'].min(),
        'Last_Year': career['yearID'].max()
    }

    return stats

# Compare legendary players
legends = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Hank', 'Aaron'),
    ('Barry', 'Bonds'),
    ('Mike', 'Trout'),
    ('Albert', 'Pujols')
]

comparison_data = []
for first, last in legends:
    stats = get_career_value(first, last)
    if stats:
        comparison_data.append(stats)

comparison = pd.DataFrame(comparison_data)

# Calculate per-season averages
comparison['HR_per_Season'] = (comparison['HR'] / comparison['Years']).round(1)
comparison['Games_per_Season'] = (comparison['Games'] / comparison['Years']).round(0)

print(comparison[['Name', 'Years', 'HR', 'HR_per_Season', 'AVG']])

Peak Value vs. Career Value

One of the great debates in player comparison is peak value versus career value. Should we favor a player who dominated for a decade or one who was very good for two decades?

Different perspectives:

Peak Value Advocates argue that:

Peak performance shows what a player was truly capable of

Health and durability are partly luck

Hall of Fame should be about greatness, not accumulation

A player's best 7 years show their true talent level

Career Value Advocates argue that:

Longevity requires skill (staying healthy, adapting, maintaining fitness)

Consistency over time is valuable

Total contribution to teams matters

Durability is a skill, not just luck

Let's analyze both perspectives:

R

library(Lahman)
library(dplyr)

# Function to get peak seasons (top 7 years)
get_peak_value <- function(first_name, last_name, n_years = 7) {
  # Get player ID
  player <- People %>%
    filter(nameFirst == first_name, nameLast == last_name)

  if(nrow(player) == 0) {
    return(NULL)
  }

  player_id <- player$playerID[1]

  # Get all seasons
  seasons <- Batting %>%
    filter(playerID == player_id) %>%
    group_by(yearID) %>%
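    summarise(
      AB = sum(AB, na.rm = TRUE),
      H = sum(H, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      RBI = sum(RBI, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      .groups = 'drop'
    ) %>%
    filter(AB >= 300) %>%  # Minimum playing time (a simple cutoff, not the official qualifying standard)
    arrange(desc(HR))  # Rank seasons by HR (any season-level metric could be used)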

  # Get top N seasons
  peak <- seasons %>%
    head(n_years) %>%
    summarise(
      Name = paste(first_name, last_name),
      Peak_Years = n(),
      Total_AB = sum(AB),
      Total_HR = sum(HR),
      Total_RBI = sum(RBI),
      Avg_HR = round(mean(HR), 1),
      Peak_AVG = round(sum(H) / sum(AB), 3)
    )

  return(peak)
}

# Compare peak value
peak_legends <- bind_rows(lapply(legends, function(x) get_peak_value(x[1], x[2])))
print(peak_legends)

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def get_peak_value(first_name, last_name, n_years=7):
    """
    Get peak value (best N seasons) for a player
    """
    # Get player ID
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]

    if len(player) == 0:
        return None

    player_id = player['playerID'].values[0]

    # Get all seasons
    seasons = batting[batting['playerID'] == player_id].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'RBI': 'sum',
        'BB': 'sum'
    }).reset_index()

    # Minimum playing time (a simple cutoff, not the official qualifying standard)
    seasons = seasons[seasons['AB'] >= 300]

    # Sort by HR (could use other metrics)
    seasons = seasons.sort_values('HR', ascending=False)

    # Get top N seasons
    peak = seasons.head(n_years)

    stats = {
        'Name': f"{first_name} {last_name}",
        'Peak_Years': len(peak),
        'Total_AB': peak['AB'].sum(),
        'Total_HR': peak['HR'].sum(),
        'Total_RBI': peak['RBI'].sum(),
        'Avg_HR': round(peak['HR'].mean(), 1),
        'Peak_AVG': round(peak['H'].sum() / peak['AB'].sum(), 3)
    }

    return stats

# Compare peak value
legends = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Hank', 'Aaron'),
    ('Barry', 'Bonds'),
    ('Mike', 'Trout'),
    ('Albert', 'Pujols')
]

peak_data = []
for first, last in legends:
    stats = get_peak_value(first, last)
    if stats:
        peak_data.append(stats)

peak_legends = pd.DataFrame(peak_data)
print(peak_legends)
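
The functions above define a peak by home runs, but the peak-versus-career split applies to any season-level value measure (season WAR from Baseball-Reference or FanGraphs would be the natural choice). The sketch below is a minimal illustration, not an established metric: peak_and_career is a hypothetical helper, and the two lists of season values are invented to show how the two views can rank the same pair of players differently.

```python
from typing import Dict, Sequence


def peak_and_career(season_values: Sequence[float], n_peak: int = 7) -> Dict[str, float]:
    """Summarise a career as its total and as the sum of its best n_peak seasons."""
    ordered = sorted(season_values, reverse=True)  # best seasons first
    career_total = sum(ordered)
    peak_total = sum(ordered[:n_peak])  # careers shorter than n_peak are handled naturally
    return {
        "seasons": len(ordered),
        "career_total": career_total,
        "peak_total": peak_total,
    }


# Invented season values for illustration only (NOT real player data)
short_brilliant = [9, 8, 8, 7, 7, 6, 6, 2, 1]   # dominant for seven years, then a quick decline
long_steady = [5] * 7 + [4] * 8 + [3] * 3        # very good for eighteen years

brilliant = peak_and_career(short_brilliant)
steady = peak_and_career(long_steady)

print("Peak  :", brilliant["peak_total"], "vs", steady["peak_total"])
print("Career:", brilliant["career_total"], "vs", steady["career_total"])
```

With these made-up numbers the first player leads on peak value and the second leads on career value, which is exactly the disagreement at the heart of the debate.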

Building a "Best Seasons Ever" Analysis

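Let's build leaderboards of the greatest single seasons in baseball history. The code below ranks raw home run and batting average totals; treat these as a starting point, since seasons from different decades still need era adjustment before they can be compared fairly: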

R

library(Lahman)
library(dplyr)

# Find the best single seasons by home runs (as a starting point)
best_hr_seasons <- Batting %>%
  filter(yearID >= 1900) %>%
  group_by(playerID, yearID) %>%
  summarise(
    lgID = first(lgID),
    HR = sum(HR, na.rm = TRUE),
    AB = sum(AB, na.rm = TRUE),
    H = sum(H, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(AB >= 400) %>%  # Qualified seasons
  arrange(desc(HR)) %>%
  head(30)

# Join with player names
best_hr_with_names <- best_hr_seasons %>%
  left_join(People, by = "playerID") %>%
  mutate(
    Name = paste(nameFirst, nameLast),
    AVG = round(H / AB, 3)
  ) %>%
  select(Name, yearID, HR, AB, AVG) %>%
  arrange(desc(HR))

print(best_hr_with_names)

# Now find best seasons by batting average (qualified)
best_avg_seasons <- Batting %>%
  filter(yearID >= 1900) %>%
  group_by(playerID, yearID) %>%
  summarise(
    AB = sum(AB, na.rm = TRUE),
    H = sum(H, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(AB >= 400) %>%
  mutate(AVG = H / AB) %>%
  arrange(desc(AVG)) %>%
  head(30)

best_avg_with_names <- best_avg_seasons %>%
  left_join(People, by = "playerID") %>%
  mutate(Name = paste(nameFirst, nameLast)) %>%
  select(Name, yearID, AB, H, AVG) %>%
  arrange(desc(AVG))

print(best_avg_with_names)

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

# Find the best single seasons by home runs
batting_modern = batting[batting['yearID'] >= 1900].copy()

best_hr_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
    'lgID': 'first',
    'HR': 'sum',
    'AB': 'sum',
    'H': 'sum'
}).reset_index()

# Filter qualified seasons
best_hr_seasons = best_hr_seasons[best_hr_seasons['AB'] >= 400]
best_hr_seasons = best_hr_seasons.sort_values('HR', ascending=False).head(30)

# Join with player names
best_hr_with_names = best_hr_seasons.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)

best_hr_with_names['Name'] = (
    best_hr_with_names['nameFirst'] + ' ' + best_hr_with_names['nameLast']
)
best_hr_with_names['AVG'] = (best_hr_with_names['H'] / best_hr_with_names['AB']).round(3)

print(best_hr_with_names[['Name', 'yearID', 'HR', 'AB', 'AVG']])

# Find best seasons by batting average
best_avg_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
    'AB': 'sum',
    'H': 'sum'
}).reset_index()

best_avg_seasons = best_avg_seasons[best_avg_seasons['AB'] >= 400]
best_avg_seasons['AVG'] = best_avg_seasons['H'] / best_avg_seasons['AB']
best_avg_seasons = best_avg_seasons.sort_values('AVG', ascending=False).head(30)

# Join with names
best_avg_with_names = best_avg_seasons.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)

best_avg_with_names['Name'] = (
    best_avg_with_names['nameFirst'] + ' ' + best_avg_with_names['nameLast']
)

print(best_avg_with_names[['Name', 'yearID', 'AB', 'H', 'AVG']])
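
The leaderboards above rank raw totals, so seasons from high-offense years dominate them. One simple way to era-adjust batting average, in the spirit of the Terry and Arraez comparison in 13.1, is to divide each player's average by his league's average for that season. The sketch below assumes a table with the Lahman batting columns playerID, yearID, lgID, AB, and H; the function name add_relative_avg, the AVG_plus column, and the toy rows are illustrative assumptions rather than anything provided by Lahman or pybaseball.

```python
import pandas as pd


def add_relative_avg(batting_rows: pd.DataFrame, min_ab: int = 400) -> pd.DataFrame:
    """Return player seasons with league AVG and AVG_plus (100 = league average)."""
    # League totals for each league-season, computed from every batter's rows
    league = batting_rows.groupby(["yearID", "lgID"], as_index=False)[["AB", "H"]].sum()
    league["lg_AVG"] = league["H"] / league["AB"]

    # Player totals per league-season (combines stints within the same league)
    player = batting_rows.groupby(
        ["playerID", "yearID", "lgID"], as_index=False
    )[["AB", "H"]].sum()
    player = player[player["AB"] >= min_ab].copy()
    player["AVG"] = player["H"] / player["AB"]

    # Attach the league average and express the player's AVG relative to it
    out = player.merge(
        league[["yearID", "lgID", "lg_AVG"]], on=["yearID", "lgID"], how="left"
    )
    out["AVG_plus"] = 100 * out["AVG"] / out["lg_AVG"]
    return out.sort_values("AVG_plus", ascending=False).reset_index(drop=True)


# Invented rows for illustration only (NOT real Lahman data).
# "others" stands in for the rest of each league.
toy = pd.DataFrame({
    "playerID": ["hitterA", "others", "hitterB", "others"],
    "yearID":   [1930, 1930, 2023, 2023],
    "lgID":     ["NL", "NL", "NL", "NL"],
    "AB":       [600, 9400, 600, 9400],
    "H":        [240, 2790, 210, 2270],
})

result = add_relative_avg(toy)
stars = result[result["playerID"] != "others"]
print(stars[["playerID", "yearID", "AVG", "lg_AVG", "AVG_plus"]])
```

In the toy data, hitterB has the lower raw average but the higher league-relative figure, because his league hit far less overall. To try this on real data, pass in the batting_modern table built earlier, for example add_relative_avg(batting_modern). Under this grouping a player who changed leagues mid-season appears as one row per league.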

Creating a Unified Comparison Framework

A fuller comparison puts several counting and rate statistics (AVG, OBP, SLG, OPS) side by side in a single table:

R

# Create a comprehensive player comparison function
compare_players_comprehensive <- function(players_list) {
  # players_list should be a list of (firstName, lastName) pairs

  results <- list()

  for(i in seq_along(players_list)) {  # seq_along() is safe even for an empty list
    first <- players_list[[i]][1]
    last <- players_list[[i]][2]

    # Get player ID
    player <- People %>%
      filter(nameFirst == first, nameLast == last)

    if(nrow(player) == 0) next

    player_id <- player$playerID[1]

    # Career stats
    career <- Batting %>%
      filter(playerID == player_id) %>%
      summarise(
        Name = paste(first, last),
        Seasons = n_distinct(yearID),  # Count seasons, not stints
        Games = sum(G, na.rm = TRUE),
        AB = sum(AB, na.rm = TRUE),
        H = sum(H, na.rm = TRUE),
        X2B = sum(X2B, na.rm = TRUE),
        X3B = sum(X3B, na.rm = TRUE),
        HR = sum(HR, na.rm = TRUE),
        RBI = sum(RBI, na.rm = TRUE),
        BB = sum(BB, na.rm = TRUE),
        HBP = sum(HBP, na.rm = TRUE),
        SF = sum(SF, na.rm = TRUE),  # Not recorded in early seasons; missing values count as 0
        SO = sum(SO, na.rm = TRUE),
        SB = sum(SB, na.rm = TRUE)
      ) %>%
      mutate(
        AVG = round(H / AB, 3),
        # OBP includes hit-by-pitch and sacrifice flies
        OBP = round((H + BB + HBP) / (AB + BB + HBP + SF), 3),
        # Total bases = H + 2B + 2*3B + 3*HR
        SLG = round((H + X2B + 2*X3B + 3*HR) / AB, 3),
        OPS = round(OBP + SLG, 3),
        HR_per_Season = round(HR / Seasons, 1)
      )

    results[[i]] <- career
  }

  return(bind_rows(results))
}

# Compare all-time greats
all_time_greats <- list(
  c("Babe", "Ruth"),
  c("Ted", "Williams"),
  c("Willie", "Mays"),
  c("Barry", "Bonds")
)

comprehensive <- compare_players_comprehensive(all_time_greats)
print(comprehensive[, c("Name", "Seasons", "HR", "AVG", "OBP", "SLG", "OPS")])

Python

def compare_players_comprehensive(players_list):
    """
    Comprehensive player comparison
    players_list: list of tuples (first_name, last_name)
    """
    results = []

    for first, last in players_list:
        # Get player ID
        player = people[
            (people['nameFirst'] == first) &
            (people['nameLast'] == last)
        ]

        if len(player) == 0:
            continue

        player_id = player['playerID'].values[0]

        # Career stats
        career = batting[batting['playerID'] == player_id]

        stats = {
            'Name': f"{first} {last}",
            'Seasons': career['yearID'].nunique(),  # Count seasons, not stints
            'Games': career['G'].sum(),
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum(),
            'HR': career['HR'].sum(),
            'RBI': career['RBI'].sum(),
            'BB': career['BB'].sum(),
            'HBP': career['HBP'].sum(),
            'SF': career['SF'].sum(),  # Not recorded in early seasons; sum() skips missing values
            'SO': career['SO'].sum(),
            'SB': career['SB'].sum()
        }

        # Calculate rate stats
        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        # OBP includes hit-by-pitch and sacrifice flies
        stats['OBP'] = round(
            (stats['H'] + stats['BB'] + stats['HBP'])
            / (stats['AB'] + stats['BB'] + stats['HBP'] + stats['SF']), 3
        )
        # Total bases = H + 2B + 2*3B + 3*HR
        stats['SLG'] = round(
            (stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']) / stats['AB'], 3
        )
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)
        stats['HR_per_Season'] = round(stats['HR'] / stats['Seasons'], 1)

        results.append(stats)

    return pd.DataFrame(results)

# Compare all-time greats
all_time_greats = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Barry', 'Bonds')
]

comprehensive = compare_players_comprehensive(all_time_greats)
print(comprehensive[['Name', 'Seasons', 'HR', 'AVG', 'OBP', 'SLG', 'OPS']])

R

library(Lahman)
library(dplyr)

# Note: Lahman database doesn't include WAR directly
# We'll use a simplified framework based on available data
# For actual WAR, you'd use Baseball-Reference or FanGraphs data

# Function to get career value statistics
get_career_value <- function(first_name, last_name) {
  # Get player ID
  player <- People %>%
    filter(nameFirst == first_name, nameLast == last_name)

  if(nrow(player) == 0) {
    return(NULL)
  }

  player_id <- player$playerID[1]

  # Get career stats
  career <- Batting %>%
    filter(playerID == player_id) %>%
    summarise(
      Name = paste(first_name, last_name),
      Years = n(),
      Games = sum(G, na.rm = TRUE),
      AB = sum(AB, na.rm = TRUE),
      Hits = sum(H, na.rm = TRUE),
      HR = sum(HR, na.rm = TRUE),
      RBI = sum(RBI, na.rm = TRUE),
      BB = sum(BB, na.rm = TRUE),
      AVG = round(sum(H) / sum(AB), 3),
      First_Year = min(yearID),
      Last_Year = max(yearID)
    )

  return(career)
}

# Compare legendary players
legends <- list(
  c("Babe", "Ruth"),
  c("Ted", "Williams"),
  c("Willie", "Mays"),
  c("Hank", "Aaron"),
  c("Barry", "Bonds"),
  c("Mike", "Trout"),
  c("Albert", "Pujols")
)

comparison <- bind_rows(lapply(legends, function(x) get_career_value(x[1], x[2])))
print(comparison)

# Calculate per-season averages
comparison <- comparison %>%
  mutate(
    HR_per_Season = round(HR / Years, 1),
    Games_per_Season = round(Games / Years, 0)
  )

print(comparison[, c("Name", "Years", "HR", "HR_per_Season", "AVG")])

R

library(Lahman)
library(dplyr)

# Function to get peak seasons (top 7 years)
get_peak_value <- function(first_name, last_name, n_years = 7) {
  # Get player ID
  player <- People %>%
    filter(nameFirst == first_name, nameLast == last_name)

  if(nrow(player) == 0) {
    return(NULL)
  }

  player_id <- player$playerID[1]

  # Get all seasons
  seasons <- Batting %>%
    filter(playerID == player_id) %>%
    group_by(yearID) %>%
    summarise(
      AB = sum(AB),
      H = sum(H),
      HR = sum(HR),
      RBI = sum(RBI),
      BB = sum(BB),
      .groups = 'drop'
    ) %>%
    filter(AB >= 300) %>%  # Qualified seasons only
    arrange(desc(HR))  # Sort by HR (could use other metrics)

  # Get top N seasons
  peak <- seasons %>%
    head(n_years) %>%
    summarise(
      Name = paste(first_name, last_name),
      Peak_Years = n(),
      Total_AB = sum(AB),
      Total_HR = sum(HR),
      Total_RBI = sum(RBI),
      Avg_HR = round(mean(HR), 1),
      Peak_AVG = round(sum(H) / sum(AB), 3)
    )

  return(peak)
}

# Compare peak value
peak_legends <- bind_rows(lapply(legends, function(x) get_peak_value(x[1], x[2])))
print(peak_legends)

R

library(Lahman)
library(dplyr)

# Find the best single seasons by home runs (as a starting point)
best_hr_seasons <- Batting %>%
  filter(yearID >= 1900) %>%
  group_by(playerID, yearID) %>%
  summarise(
    lgID = first(lgID),
    HR = sum(HR, na.rm = TRUE),
    AB = sum(AB, na.rm = TRUE),
    H = sum(H, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(AB >= 400) %>%  # Qualified seasons
  arrange(desc(HR)) %>%
  head(30)

# Join with player names
best_hr_with_names <- best_hr_seasons %>%
  left_join(People, by = "playerID") %>%
  mutate(
    Name = paste(nameFirst, nameLast),
    AVG = round(H / AB, 3)
  ) %>%
  select(Name, yearID, HR, AB, AVG) %>%
  arrange(desc(HR))

print(best_hr_with_names)

# Now find best seasons by batting average (qualified)
best_avg_seasons <- Batting %>%
  filter(yearID >= 1900) %>%
  group_by(playerID, yearID) %>%
  summarise(
    AB = sum(AB, na.rm = TRUE),
    H = sum(H, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(AB >= 400) %>%
  mutate(AVG = H / AB) %>%
  arrange(desc(AVG)) %>%
  head(30)

best_avg_with_names <- best_avg_seasons %>%
  left_join(People, by = "playerID") %>%
  mutate(Name = paste(nameFirst, nameLast)) %>%
  select(Name, yearID, AB, H, AVG) %>%
  arrange(desc(AVG))

print(best_avg_with_names)

R

# Create a comprehensive player comparison function
compare_players_comprehensive <- function(players_list) {
  # players_list should be a list of (firstName, lastName) pairs

  results <- list()

  for(i in 1:length(players_list)) {
    first <- players_list[[i]][1]
    last <- players_list[[i]][2]

    # Get player ID
    player <- People %>%
      filter(nameFirst == first, nameLast == last)

    if(nrow(player) == 0) next

    player_id <- player$playerID[1]

    # Career stats
    career <- Batting %>%
      filter(playerID == player_id) %>%
      summarise(
        Name = paste(first, last),
        Seasons = n(),
        Games = sum(G, na.rm = TRUE),
        AB = sum(AB, na.rm = TRUE),
        H = sum(H, na.rm = TRUE),
        X2B = sum(X2B, na.rm = TRUE),
        X3B = sum(X3B, na.rm = TRUE),
        HR = sum(HR, na.rm = TRUE),
        RBI = sum(RBI, na.rm = TRUE),
        BB = sum(BB, na.rm = TRUE),
        SO = sum(SO, na.rm = TRUE),
        SB = sum(SB, na.rm = TRUE)
      ) %>%
      mutate(
        AVG = round(H / AB, 3),
        OBP = round((H + BB) / (AB + BB), 3),
        SLG = round((H + X2B + 2*X3B + 3*HR) / AB, 3),
        OPS = round(OBP + SLG, 3),
        HR_per_Season = round(HR / Seasons, 1)
      )

    results[[i]] <- career
  }

  return(bind_rows(results))
}

# Compare all-time greats
all_time_greats <- list(
  c("Babe", "Ruth"),
  c("Ted", "Williams"),
  c("Willie", "Mays"),
  c("Barry", "Bonds")
)

comprehensive <- compare_players_comprehensive(all_time_greats)
print(comprehensive[, c("Name", "Seasons", "HR", "AVG", "OBP", "SLG", "OPS")])

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def get_career_value(first_name, last_name):
    """
    Get career value statistics for a player
    """
    # Get player ID
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]

    if len(player) == 0:
        return None

    player_id = player['playerID'].values[0]

    # Get career stats
    career = batting[batting['playerID'] == player_id]

    stats = {
        'Name': f"{first_name} {last_name}",
        'Years': len(career),
        'Games': career['G'].sum(),
        'AB': career['AB'].sum(),
        'Hits': career['H'].sum(),
        'HR': career['HR'].sum(),
        'RBI': career['RBI'].sum(),
        'BB': career['BB'].sum(),
        'AVG': round(career['H'].sum() / career['AB'].sum(), 3),
        'First_Year': career['yearID'].min(),
        'Last_Year': career['yearID'].max()
    }

    return stats

# Compare legendary players
legends = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Hank', 'Aaron'),
    ('Barry', 'Bonds'),
    ('Mike', 'Trout'),
    ('Albert', 'Pujols')
]

comparison_data = []
for first, last in legends:
    stats = get_career_value(first, last)
    if stats:
        comparison_data.append(stats)

comparison = pd.DataFrame(comparison_data)

# Calculate per-season averages
comparison['HR_per_Season'] = (comparison['HR'] / comparison['Years']).round(1)
comparison['Games_per_Season'] = (comparison['Games'] / comparison['Years']).round(0)

print(comparison[['Name', 'Years', 'HR', 'HR_per_Season', 'AVG']])

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def get_peak_value(first_name, last_name, n_years=7):
    """
    Get peak value (best N seasons) for a player
    """
    # Get player ID
    player = people[
        (people['nameFirst'] == first_name) &
        (people['nameLast'] == last_name)
    ]

    if len(player) == 0:
        return None

    player_id = player['playerID'].values[0]

    # Get all seasons
    seasons = batting[batting['playerID'] == player_id].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'HR': 'sum',
        'RBI': 'sum',
        'BB': 'sum'
    }).reset_index()

    # Filter qualified seasons
    seasons = seasons[seasons['AB'] >= 300]

    # Sort by HR (could use other metrics)
    seasons = seasons.sort_values('HR', ascending=False)

    # Get top N seasons
    peak = seasons.head(n_years)

    stats = {
        'Name': f"{first_name} {last_name}",
        'Peak_Years': len(peak),
        'Total_AB': peak['AB'].sum(),
        'Total_HR': peak['HR'].sum(),
        'Total_RBI': peak['RBI'].sum(),
        'Avg_HR': round(peak['HR'].mean(), 1),
        'Peak_AVG': round(peak['H'].sum() / peak['AB'].sum(), 3)
    }

    return stats

# Compare peak value
legends = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Hank', 'Aaron'),
    ('Barry', 'Bonds'),
    ('Mike', 'Trout')
]

peak_data = []
for first, last in legends:
    stats = get_peak_value(first, last)
    if stats:
        peak_data.append(stats)

peak_legends = pd.DataFrame(peak_data)
print(peak_legends)

Python

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

# Find the best single seasons by home runs
batting_modern = batting[batting['yearID'] >= 1900].copy()

best_hr_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
    'lgID': 'first',
    'HR': 'sum',
    'AB': 'sum',
    'H': 'sum'
}).reset_index()

# Filter qualified seasons
best_hr_seasons = best_hr_seasons[best_hr_seasons['AB'] >= 400]
best_hr_seasons = best_hr_seasons.sort_values('HR', ascending=False).head(30)

# Join with player names
best_hr_with_names = best_hr_seasons.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)

best_hr_with_names['Name'] = (
    best_hr_with_names['nameFirst'] + ' ' + best_hr_with_names['nameLast']
)
best_hr_with_names['AVG'] = (best_hr_with_names['H'] / best_hr_with_names['AB']).round(3)

print(best_hr_with_names[['Name', 'yearID', 'HR', 'AB', 'AVG']])

# Find best seasons by batting average
best_avg_seasons = batting_modern.groupby(['playerID', 'yearID']).agg({
    'AB': 'sum',
    'H': 'sum'
}).reset_index()

best_avg_seasons = best_avg_seasons[best_avg_seasons['AB'] >= 400]
best_avg_seasons['AVG'] = best_avg_seasons['H'] / best_avg_seasons['AB']
best_avg_seasons = best_avg_seasons.sort_values('AVG', ascending=False).head(30)

# Join with names
best_avg_with_names = best_avg_seasons.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)

best_avg_with_names['Name'] = (
    best_avg_with_names['nameFirst'] + ' ' + best_avg_with_names['nameLast']
)

print(best_avg_with_names[['Name', 'yearID', 'AB', 'H', 'AVG']])

Python

def compare_players_comprehensive(players_list):
    """
    Comprehensive player comparison
    players_list: list of tuples (first_name, last_name)
    """
    results = []

    for first, last in players_list:
        # Get player ID
        player = people[
            (people['nameFirst'] == first) &
            (people['nameLast'] == last)
        ]

        if len(player) == 0:
            continue

        player_id = player['playerID'].values[0]

        # Career stats
        career = batting[batting['playerID'] == player_id]

        stats = {
            'Name': f"{first} {last}",
            'Seasons': career['yearID'].nunique(),  # distinct years, not stint rows
            'Games': career['G'].sum(),
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum(),
            'HR': career['HR'].sum(),
            'RBI': career['RBI'].sum(),
            'BB': career['BB'].sum(),
            'SO': career['SO'].sum(),
            'SB': career['SB'].sum()
        }

        # Calculate rate stats
        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        # Simplified OBP: ignores hit-by-pitch and sacrifice flies
        stats['OBP'] = round((stats['H'] + stats['BB']) / (stats['AB'] + stats['BB']), 3)
        stats['SLG'] = round(
            (stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']) / stats['AB'], 3
        )
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)
        stats['HR_per_Season'] = round(stats['HR'] / stats['Seasons'], 1)

        results.append(stats)

    return pd.DataFrame(results)

# Compare all-time greats
all_time_greats = [
    ('Babe', 'Ruth'),
    ('Ted', 'Williams'),
    ('Willie', 'Mays'),
    ('Barry', 'Bonds')
]

comprehensive = compare_players_comprehensive(all_time_greats)
print(comprehensive[['Name', 'Seasons', 'HR', 'AVG', 'OBP', 'SLG', 'OPS']])

13.6 The Steroid Era Problem

The so-called "steroid era" presents unique challenges for historical analysis. Performance-enhancing drug use was widespread in baseball from approximately the mid-1990s through the mid-2000s, distorting statistical records and complicating player comparisons.

Identifying the Steroid Era Statistically

While we can't definitively identify PED users through statistics alone (testing and investigation are required), we can identify the period when offense was anomalously high:

Statistical Markers of the Steroid Era:

Home Run Explosion: The 1990s and early 2000s saw unprecedented home run rates
Power at All Ages: Players maintained or increased power into their late 30s
Muscle Mass Increase: Visual evidence showed dramatic physical changes
Breaking of "Unbreakable" Records: Maris's 61 HR record broken multiple times
League-Wide Offensive Spike: Not just individual performances but systematic elevation

Let's analyze the data:

library(Lahman)
library(dplyr)
library(ggplot2)

# Calculate league-wide home run rates by year
hr_by_year <- Batting %>%
  filter(yearID >= 1950) %>%
  group_by(yearID) %>%
  summarise(
    Total_AB = sum(AB, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    HR_Rate = (Total_HR / Total_AB) * 100,
    Era = case_when(
      yearID < 1994 ~ "Pre-Steroid",
      yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
      yearID > 2007 ~ "Post-Steroid"
    )
  )

# Visualize
ggplot(hr_by_year, aes(x = yearID, y = HR_Rate, color = Era)) +
  geom_line(size = 1.5) +
  geom_point(size = 2) +
  labs(
    title = "MLB Home Run Rate by Year (1950-Present)",
    subtitle = "Identifying the Steroid Era",
    x = "Year",
    y = "Home Runs per 100 AB"
  ) +
  scale_color_manual(values = c("Pre-Steroid" = "blue",
                                 "Steroid Era" = "red",
                                 "Post-Steroid" = "green")) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5)
  ) +
  geom_vline(xintercept = c(1994, 2007), linetype = "dashed", alpha = 0.5)

# Calculate average by era
era_averages <- hr_by_year %>%
  group_by(Era) %>%
  summarise(
    Avg_HR_Rate = mean(HR_Rate),
    .groups = 'drop'
  )

print(era_averages)

# Look at 40+ HR seasons by era
hr_40_plus <- Batting %>%
  filter(yearID >= 1950) %>%
  group_by(playerID, yearID) %>%
  summarise(
    HR = sum(HR, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  filter(HR >= 40) %>%
  mutate(
    Era = case_when(
      yearID < 1994 ~ "Pre-Steroid",
      yearID >= 1994 & yearID <= 2007 ~ "Steroid Era",
      yearID > 2007 ~ "Post-Steroid"
    )
  )

# Count by era
hr_40_by_era <- hr_40_plus %>%
  group_by(Era) %>%
  summarise(
    Count_40_HR_Seasons = n(),
    .groups = 'drop'
  )

print(hr_40_by_era)

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pyb.cache.enable()

batting = pyb.lahman.batting()

# Calculate league-wide home run rates by year
batting_modern = batting[batting['yearID'] >= 1950].copy()

hr_by_year = batting_modern.groupby('yearID').agg({
    'AB': 'sum',
    'HR': 'sum'
}).reset_index()

hr_by_year.columns = ['yearID', 'Total_AB', 'Total_HR']
hr_by_year['HR_Rate'] = (hr_by_year['Total_HR'] / hr_by_year['Total_AB']) * 100

# Define eras
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid Era"
    else:
        return "Post-Steroid"

hr_by_year['Era'] = hr_by_year['yearID'].apply(classify_era)

# Visualize
plt.figure(figsize=(14, 8))
colors = {'Pre-Steroid': 'blue', 'Steroid Era': 'red', 'Post-Steroid': 'green'}

for era in ['Pre-Steroid', 'Steroid Era', 'Post-Steroid']:
    data = hr_by_year[hr_by_year['Era'] == era]
    plt.plot(data['yearID'], data['HR_Rate'],
             label=era, linewidth=2, marker='o', color=colors[era])

plt.axvline(x=1994, linestyle='--', alpha=0.5, color='black')
plt.axvline(x=2007, linestyle='--', alpha=0.5, color='black')

plt.title('MLB Home Run Rate by Year (1950-Present)\nIdentifying the Steroid Era',
          fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Home Runs per 100 AB', fontsize=12)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('steroid_era_hr_rate.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate average by era
era_averages = hr_by_year.groupby('Era')['HR_Rate'].mean().reset_index()
era_averages.columns = ['Era', 'Avg_HR_Rate']
print(era_averages)

# Look at 40+ HR seasons by era
hr_40_plus = batting_modern.groupby(['playerID', 'yearID'])['HR'].sum().reset_index()
hr_40_plus = hr_40_plus[hr_40_plus['HR'] >= 40]
hr_40_plus['Era'] = hr_40_plus['yearID'].apply(classify_era)

# Count by era
hr_40_by_era = hr_40_plus.groupby('Era').size().reset_index()
hr_40_by_era.columns = ['Era', 'Count_40_HR_Seasons']
print(hr_40_by_era)

Key Findings

The data clearly shows:

Pre-Steroid Era (1950-1993): HR rate averaged ~2.5% of at-bats
Steroid Era (1994-2007): HR rate jumped to ~3.0-3.5% of at-bats
Post-Steroid Era (2008+): HR rate initially declined, then rebounded in the late 2010s, a rise commonly attributed to the launch-angle revolution

The number of 40+ HR seasons also spiked dramatically during the steroid era.
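
Because the three eras span different numbers of seasons, raw counts of 40+ HR seasons are easier to compare once they are divided by era length. The following is a minimal sketch of that normalization, assuming tables shaped like `hr_40_plus` (one row per player-season) and `hr_by_year` (one row per year) from the code above. The function name `forty_hr_per_season` and the toy rows are hypothetical and exist only to show the shapes involved.

```python
import pandas as pd

def forty_hr_per_season(hr_40_plus: pd.DataFrame, hr_by_year: pd.DataFrame) -> pd.DataFrame:
    """Average number of 40+ HR player-seasons per calendar season, by era."""
    # Number of 40+ HR player-seasons in each era
    counts = hr_40_plus.groupby('Era').size().rename('Count_40_HR_Seasons')
    # Number of calendar seasons in each era (hr_by_year has one row per year)
    seasons = hr_by_year.groupby('Era')['yearID'].nunique().rename('Seasons')
    out = pd.concat([counts, seasons], axis=1).fillna({'Count_40_HR_Seasons': 0})
    out['Per_Season'] = out['Count_40_HR_Seasons'] / out['Seasons']
    return out.rename_axis('Era').reset_index()

# Made-up toy rows purely to illustrate the calculation (not real totals)
toy_years = pd.DataFrame({
    'yearID': [1990, 1991, 1992, 1993, 1994, 1995],
    'Era': ['Pre-Steroid'] * 4 + ['Steroid Era'] * 2,
})
toy_40 = pd.DataFrame({
    'playerID': ['a', 'b', 'c', 'd'],
    'yearID': [1990, 1994, 1994, 1995],
    'Era': ['Pre-Steroid', 'Steroid Era', 'Steroid Era', 'Steroid Era'],
})

result = forty_hr_per_season(toy_40, toy_years)
print(result)
```

With the real Lahman tables, the same call would be `forty_hr_per_season(hr_40_plus, hr_by_year)`. A further refinement, not shown here, would be to divide by the number of teams as well, since expansion increased the number of hitters in each season.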

Adjusting for Performance Enhancement

How should analysts handle steroid-era statistics? Several approaches exist:

Approach 1: Accept the Numbers

Statistics are what they are, regardless of cause

We can't definitively know who used PEDs

Adjusting requires subjective judgments

Many factors contributed beyond PEDs

Approach 2: Era-Adjust More Aggressively

Apply stronger era adjustments for 1994-2007

Treat this period like any other high-offense era

Use OPS+, ERA+, etc., which already account for context

Don't make player-specific PED assumptions

Approach 3: Exclude Suspected Users

Don't consider players with positive tests or evidence

Creates a "clean" record book

Risks unfairly excluding some players

Difficult to apply consistently

Approach 4: Separate Era

Treat steroid era as its own category

Don't compare steroid-era players to other eras

Create separate record books or rankings

Acknowledges the unique circumstances

Most analysts favor Approach 2: using standard era-adjustment methods that naturally account for the inflated offense of the period.

Example: Comparing Steroid-Era Stats

Let's compare Barry Bonds's legendary 2001 season (73 HR) to Babe Ruth's 1927 (60 HR) using era adjustment:

# Bonds 2001 vs Ruth 1927 (era-adjusted)

# Get league averages
bonds_2001_lg <- Batting %>%
  filter(yearID == 2001, lgID == "NL") %>%
  summarise(
    lgAB = sum(AB, na.rm = TRUE),
    lgHR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(lgHR_rate = lgHR / lgAB)

ruth_1927_lg <- Batting %>%
  filter(yearID == 1927, lgID == "AL") %>%
  summarise(
    lgAB = sum(AB, na.rm = TRUE),
    lgHR = sum(HR, na.rm = TRUE)
  ) %>%
  mutate(lgHR_rate = lgHR / lgAB)

print(paste("2001 NL HR Rate:", round(bonds_2001_lg$lgHR_rate * 100, 2), "%"))
print(paste("1927 AL HR Rate:", round(ruth_1927_lg$lgHR_rate * 100, 2), "%"))

# Bonds hit 73 HR in 476 AB (15.3% of his ABs)
# Ruth hit 60 HR in 540 AB (11.1% of his ABs)

bonds_hr_rate <- 73 / 476
ruth_hr_rate <- 60 / 540

# Relative to league
bonds_relative <- bonds_hr_rate / bonds_2001_lg$lgHR_rate
ruth_relative <- ruth_hr_rate / ruth_1927_lg$lgHR_rate

print(paste("Bonds was", round(bonds_relative, 1), "times better than league average"))
print(paste("Ruth was", round(ruth_relative, 1), "times better than league average"))

# Bonds 2001 vs Ruth 1927 (era-adjusted)

# Get league averages
bonds_2001_lg = batting[
    (batting['yearID'] == 2001) &
    (batting['lgID'] == 'NL')
]

bonds_lg_ab = bonds_2001_lg['AB'].sum()
bonds_lg_hr = bonds_2001_lg['HR'].sum()
bonds_lg_rate = bonds_lg_hr / bonds_lg_ab

ruth_1927_lg = batting[
    (batting['yearID'] == 1927) &
    (batting['lgID'] == 'AL')
]

ruth_lg_ab = ruth_1927_lg['AB'].sum()
ruth_lg_hr = ruth_1927_lg['HR'].sum()
ruth_lg_rate = ruth_lg_hr / ruth_lg_ab

print(f"2001 NL HR Rate: {bonds_lg_rate * 100:.2f}%")
print(f"1927 AL HR Rate: {ruth_lg_rate * 100:.2f}%")

# Bonds hit 73 HR in 476 AB (15.3% of his ABs)
# Ruth hit 60 HR in 540 AB (11.1% of his ABs)

bonds_hr_rate = 73 / 476
ruth_hr_rate = 60 / 540

# Relative to league
bonds_relative = bonds_hr_rate / bonds_lg_rate
ruth_relative = ruth_hr_rate / ruth_lg_rate

print(f"Bonds was {bonds_relative:.1f} times better than league average")
print(f"Ruth was {ruth_relative:.1f} times better than league average")

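The two ratios come out far apart. Ruth's 1927 home run rate was roughly ten times the American League rate, while Bonds's 2001 rate was roughly four to five times the National League rate. Both seasons were extreme outliers, but relative to league context Ruth's was the more dominant home run season.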
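
The league totals used above include the player's own at-bats and home runs, which slightly understates how far he stood from his peers. The following is a minimal sketch of the same ratio with the player removed from the baseline. The helper name `relative_hr_rate` is hypothetical, and the league totals in the usage lines are placeholders chosen for easy arithmetic, not real figures.

```python
def relative_hr_rate(player_hr, player_ab, league_hr, league_ab, exclude_player=True):
    """Ratio of a player's HR/AB to the league's HR/AB."""
    if exclude_player:
        # Remove the player's own line from the league baseline
        league_hr -= player_hr
        league_ab -= player_ab
    if player_ab <= 0 or league_ab <= 0 or league_hr <= 0:
        raise ValueError("at-bats and league home runs must be positive")
    return (player_hr / player_ab) / (league_hr / league_ab)

# Placeholder totals (not real data): a 50-HR hitter in a 1,000-HR league
with_player = relative_hr_rate(50, 500, 1000, 50000, exclude_player=False)
without_player = relative_hr_rate(50, 500, 1000, 50000)
print(round(with_player, 2), round(without_player, 2))
```

The gap between the two versions grows as one player accounts for a larger share of the league's home runs, which matters most in low-power seasons such as 1927.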

Ethical Considerations in Historical Analysis

When analyzing the steroid era, analysts must balance several ethical considerations:

Statistical Integrity: Our job is to analyze numbers accurately, not to make moral judgments about players.

Historical Context: We must acknowledge the unique circumstances of each era without dismissing achievements.

Uncertainty: We often don't know who used PEDs and who didn't. Assumptions based on statistics alone can be unfair.

Consistency: Whatever approach we take should be applied consistently across all eras and players.

Transparency: We should be clear about our methods and assumptions when handling steroid-era data.

The safest approach is to:

Use standard era-adjustment methods

Note when players competed in the steroid era

Avoid player-specific PED assumptions without evidence

Let readers draw their own conclusions about individual cases

13.7 Interactive Historical Exploration

Modern data visualization tools enable us to explore baseball's historical evolution through interactive graphics that reveal patterns impossible to detect in static tables. This section introduces three powerful interactive visualization approaches for historical analysis: animated timelines showing how league statistics evolved across decades, interactive era comparison tools for searching and comparing players, and dynamic trend analysis with range sliders for examining specific time periods.

Animated Timeline of League Averages

One of the most compelling ways to understand baseball's evolution is through animated visualizations that show how key statistics changed over time. We can create animations that step through each decade, revealing the dramatic shifts in offensive production that define different eras.

Let's build an animated timeline showing home run rate, strikeout rate, and batting average from 1900 to 2023:

# Animated timeline of league statistics over time
library(tidyverse)
library(gganimate)
library(Lahman)

# Calculate league-wide statistics by year
league_evolution <- Batting %>%
  filter(yearID >= 1900, yearID <= 2023) %>%
  group_by(yearID) %>%
  summarise(
    total_ab = sum(AB, na.rm = TRUE),
    total_h = sum(H, na.rm = TRUE),
    total_hr = sum(HR, na.rm = TRUE),
    total_so = sum(SO, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    avg = total_h / total_ab,
    hr_rate = (total_hr / total_ab) * 100,  # HR per 100 AB
    k_rate = (total_so / total_ab) * 100,   # K per 100 AB
    decade = floor(yearID / 10) * 10,
    era = case_when(
      yearID < 1920 ~ "Dead Ball",
      yearID < 1947 ~ "Live Ball (Pre-Integration)",
      yearID < 1961 ~ "Integration Era",
      yearID < 1994 ~ "Expansion Era",
      yearID < 2008 ~ "Steroid Era",  # 1994-2007, matching Section 13.6
      TRUE ~ "Modern Era"
    )
  )

# Create animated plot
anim <- ggplot(league_evolution,
               aes(x = yearID, y = avg, group = 1)) +
  geom_line(size = 1.2, color = "darkblue") +
  geom_point(size = 3, color = "darkblue") +
  geom_text(aes(label = sprintf("%.3f", avg)),
            vjust = -1, size = 3.5, color = "darkblue") +
  labs(title = "MLB Batting Average Evolution: {frame_time}",
       subtitle = "League-wide batting average by year",
       x = "Year",
       y = "Batting Average") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  transition_time(yearID) +
  ease_aes('linear') +
  shadow_wake(wake_length = 0.1)

# Render animation
animate(anim, nframes = 124, fps = 4, width = 800, height = 500)

# Create multi-metric comparison
league_long <- league_evolution %>%
  select(yearID, avg, hr_rate, k_rate, era) %>%
  pivot_longer(cols = c(avg, hr_rate, k_rate),
               names_to = "metric",
               values_to = "value") %>%
  mutate(
    metric_label = case_when(
      metric == "avg" ~ "Batting Average",
      metric == "hr_rate" ~ "HR Rate (per 100 AB)",
      metric == "k_rate" ~ "K Rate (per 100 AB)"
    )
  )

# Faceted animation showing all three metrics
multi_anim <- ggplot(league_long,
                     aes(x = yearID, y = value, color = metric_label)) +
  geom_line(size = 1) +
  facet_wrap(~metric_label, scales = "free_y", ncol = 1) +
  # transition_reveal() exposes frame_along (not frame_time) for labels
  labs(title = "Evolution of MLB Statistics: {as.integer(frame_along)}",
       x = "Year",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, face = "bold")) +
  transition_reveal(yearID)

animate(multi_anim, nframes = 150, fps = 10, width = 800, height = 600)

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman

# Load historical batting data
batting = lahman.batting()

# Calculate league-wide statistics by year
league_evolution = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum'
}).reset_index()

league_evolution['avg'] = league_evolution['H'] / league_evolution['AB']
league_evolution['hr_rate'] = (league_evolution['HR'] / league_evolution['AB']) * 100
league_evolution['k_rate'] = (league_evolution['SO'] / league_evolution['AB']) * 100

# Add era classifications
def classify_era(year):
    if year < 1920:
        return "Dead Ball"
    elif year < 1947:
        return "Live Ball (Pre-Integration)"
    elif year < 1961:
        return "Integration Era"
    elif year < 1994:
        return "Expansion Era"
    elif year < 2008:  # Steroid Era = 1994-2007, matching Section 13.6
        return "Steroid Era"
    else:
        return "Modern Era"

league_evolution['era'] = league_evolution['yearID'].apply(classify_era)
league_evolution['decade'] = (league_evolution['yearID'] // 10) * 10

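# Build cumulative frames so each animation step draws the line up to that year.
# (Using animation_frame='yearID' directly would put a single point in each frame,
# and a one-point line draws nothing.)
cumulative = pd.concat(
    [league_evolution[league_evolution['yearID'] <= yr].assign(through_year=yr)
     for yr in league_evolution['yearID']],
    ignore_index=True
)

# Create animated line chart with Plotly
fig = px.line(cumulative,
              x='yearID',
              y='avg',
              animation_frame='through_year',
              range_x=[1900, 2023],
              range_y=[0.23, 0.31],
              title='MLB Batting Average Evolution Over Time',
              labels={'yearID': 'Year', 'avg': 'Batting Average'})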

fig.update_traces(line=dict(color='darkblue', width=3))
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Batting Average',
    hovermode='x unified',
    showlegend=False
)

# Show the animation
fig.show()

# Create multi-metric animated visualization
fig_multi = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Batting Average', 'HR Rate (per 100 AB)', 'K Rate (per 100 AB)'),
    vertical_spacing=0.1
)

# (column, trace name, color) for each panel, top to bottom
metrics = [('avg', 'AVG', 'blue'), ('hr_rate', 'HR', 'red'), ('k_rate', 'K', 'orange')]

# Add one trace per metric; these full-series lines are shown before playback starts
for row, (col, name, color) in enumerate(metrics, start=1):
    fig_multi.add_trace(
        go.Scatter(x=league_evolution['yearID'], y=league_evolution[col],
                   mode='lines', name=name, line=dict(color=color)),
        row=row, col=1
    )

# One frame per year: each frame redraws the three traces with data up to that year
frames = []
for year in league_evolution['yearID']:
    year_data = league_evolution[league_evolution['yearID'] <= year]
    frames.append(go.Frame(
        name=str(year),
        data=[go.Scatter(x=year_data['yearID'], y=year_data[col]) for col, _, _ in metrics],
        traces=[0, 1, 2]
    ))
fig_multi.frames = frames

fig_multi.update_xaxes(title_text="Year", row=3, col=1)
fig_multi.update_layout(
    height=900,
    title_text="Evolution of MLB Statistics Over Time",
    showlegend=False,
    # Play button that steps through the yearly frames
    updatemenus=[dict(
        type='buttons',
        buttons=[dict(label='Play', method='animate',
                      args=[None, dict(frame=dict(duration=100, redraw=False),
                                       fromcurrent=True)])]
    )]
)

fig_multi.show()

These animated visualizations clearly reveal baseball's major transitions: the dead ball era's low home run rates, the 1920s offensive explosion, the 1968 pitcher dominance, the steroid era's power surge, and the modern game's strikeout epidemic. The ability to watch these trends unfold year by year provides intuition that static charts cannot match.

Interactive Era Comparison Tool

To facilitate player comparisons across eras, we can build an interactive tool that allows users to search for players, select seasons, and instantly see era-adjusted comparisons. This approach combines database queries with interactive plotting.

library(tidyverse)
library(Lahman)
library(plotly)

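# Function to calculate a simplified era-adjusted OPS index (player OPS / league OPS x 100).
# OBP is approximated as (H + BB) / (AB + BB), and no park adjustment is applied.
calculate_ops_plus <- function(player_batting, league_batting) {
  # Total bases = H + 2B + 2*3B + 3*HR, because H already counts every hit once
  player_ops <- with(player_batting,
                    (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)
  league_ops <- with(league_batting,
                    (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)

  ops_plus <- (player_ops / league_ops) * 100
  return(ops_plus)
}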

# Create interactive player comparison
compare_players <- function(player_names, min_year = 1900, max_year = 2023) {

  # Get player IDs
  player_ids <- People %>%
    filter(paste(nameFirst, nameLast) %in% player_names) %>%
    pull(playerID)

  # Get batting stats for these players
  player_stats <- Batting %>%
    filter(playerID %in% player_ids,
           yearID >= min_year,
           yearID <= max_year,
           AB >= 300) %>%
    left_join(People %>% select(playerID, nameFirst, nameLast),
              by = "playerID") %>%
    mutate(player_name = paste(nameFirst, nameLast))

  # Calculate league averages by year
  league_averages <- Batting %>%
    filter(yearID >= min_year, yearID <= max_year) %>%
    group_by(yearID) %>%
    summarise(
      lg_AB = sum(AB, na.rm = TRUE),
      lg_H = sum(H, na.rm = TRUE),
      lg_BB = sum(BB, na.rm = TRUE),
      lg_X2B = sum(X2B, na.rm = TRUE),
      lg_X3B = sum(X3B, na.rm = TRUE),
      lg_HR = sum(HR, na.rm = TRUE)
    )

  # Calculate OPS+ for each player-season
  comparison_data <- player_stats %>%
    left_join(league_averages, by = "yearID") %>%
    rowwise() %>%
    mutate(
      # Total bases = H + 2B + 2*3B + 3*HR (H already counts each hit once)
      player_ops = (H + BB) / (AB + BB) +
                   (H + X2B + 2*X3B + 3*HR) / AB,
      league_ops = (lg_H + lg_BB) / (lg_AB + lg_BB) +
                   (lg_H + lg_X2B + 2*lg_X3B + 3*lg_HR) / lg_AB,
      ops_plus = (player_ops / league_ops) * 100,
      ba = H / AB
    ) %>%
    ungroup()

  # Create interactive plot
  p <- plot_ly(comparison_data,
               x = ~yearID,
               y = ~ops_plus,
               color = ~player_name,
               type = 'scatter',
               mode = 'lines+markers',
               text = ~paste("Year:", yearID,
                           "<br>Player:", player_name,
                           "<br>OPS+:", round(ops_plus, 1),
                           "<br>BA:", sprintf("%.3f", ba)),
               hoverinfo = 'text') %>%
    layout(title = "Era-Adjusted Performance Comparison",
           xaxis = list(title = "Year"),
           yaxis = list(title = "OPS+ (100 = League Average)"),
           hovermode = 'closest')

  return(p)
}

# Example usage: Compare Ruth, Williams, Bonds
comparison <- compare_players(
  c("Babe Ruth", "Ted Williams", "Barry Bonds"),
  min_year = 1914,
  max_year = 2007
)

comparison

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from pybaseball import lahman

# Load data
batting = lahman.batting()
people = lahman.people()

def calculate_ops(row):
    """Calculate OPS from batting statistics"""
    if row['AB'] == 0:
        return 0
    obp = (row['H'] + row['BB']) / (row['AB'] + row['BB']) if (row['AB'] + row['BB']) > 0 else 0
    slg = (row['H'] + row['2B'] + 2*row['3B'] + 3*row['HR']) / row['AB'] if row['AB'] > 0 else 0
    return obp + slg

def compare_players_interactive(player_names, min_year=1900, max_year=2023):
    """
    Create interactive comparison of players across eras

    Parameters:
    -----------
    player_names : list
        List of player names in format ["First Last", ...]
    min_year : int
        Starting year for comparison
    max_year : int
        Ending year for comparison
    """

    # Parse player names
    player_data = []
    for name in player_names:
        first, last = name.split()[0], ' '.join(name.split()[1:])
        player_data.append((first, last))

    # Get player IDs
    player_ids = []
    for first, last in player_data:
        matches = people[(people['nameFirst'] == first) &
                        (people['nameLast'] == last)]
        if len(matches) > 0:
            player_ids.append(matches.iloc[0]['playerID'])

    # Filter batting data
    player_stats = batting[
        (batting['playerID'].isin(player_ids)) &
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year) &
        (batting['AB'] >= 300)
    ].copy()

    # Merge with player names
    player_stats = player_stats.merge(
        people[['playerID', 'nameFirst', 'nameLast']],
        on='playerID'
    )
    player_stats['player_name'] = (player_stats['nameFirst'] + ' ' +
                                   player_stats['nameLast'])

    # Calculate league averages by year
    league_avg = batting[
        (batting['yearID'] >= min_year) &
        (batting['yearID'] <= max_year)
    ].groupby('yearID').agg({
        'AB': 'sum',
        'H': 'sum',
        'BB': 'sum',
        '2B': 'sum',
        '3B': 'sum',
        'HR': 'sum'
    }).reset_index()

    league_avg['league_ops'] = league_avg.apply(calculate_ops, axis=1)

    # Calculate player OPS and OPS+
    player_stats['player_ops'] = player_stats.apply(calculate_ops, axis=1)
    player_stats = player_stats.merge(
        league_avg[['yearID', 'league_ops']],
        on='yearID'
    )
    player_stats['ops_plus'] = (player_stats['player_ops'] /
                                 player_stats['league_ops']) * 100
    player_stats['ba'] = player_stats['H'] / player_stats['AB']

    # Create interactive plot
    fig = px.line(player_stats,
                  x='yearID',
                  y='ops_plus',
                  color='player_name',
                  markers=True,
                  title='Era-Adjusted Performance Comparison',
                  labels={'yearID': 'Year',
                         'ops_plus': 'OPS+ (100 = League Average)',
                         'player_name': 'Player'})

    fig.add_hline(y=100, line_dash="dash", line_color="gray",
                  annotation_text="League Average")

    fig.update_traces(
        hovertemplate='<b>%{fullData.name}</b><br>' +
                     'Year: %{x}<br>' +
                     'OPS+: %{y:.1f}<br>' +
                     '<extra></extra>'
    )

    fig.update_layout(
        hovermode='x unified',
        xaxis_title='Year',
        yaxis_title='OPS+ (100 = League Average)',
        legend_title='Player',
        height=600
    )

    return fig

# Example: Compare legendary players across eras
fig = compare_players_interactive(
    ["Babe Ruth", "Ted Williams", "Barry Bonds"],
    min_year=1914,
    max_year=2007
)

fig.show()

This interactive tool enables users to explore how players performed relative to their peers, regardless of when they played. Hovering over data points reveals detailed statistics, and the ability to add or remove players makes it easy to test different hypotheses about historical greatness.
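
The index plotted by this tool is a plain ratio of player OPS to league OPS. The OPS+ figure commonly published on reference sites is instead built from the two components separately, as 100 × (OBP/lgOBP + SLG/lgSLG − 1), usually with a park adjustment on top. The following is a minimal sketch of both versions without park factors. The function names are hypothetical and the inputs are placeholders, not a real player season.

```python
def ops_plus(obp, slg, lg_obp, lg_slg):
    """OPS+ without park adjustment: 100 * (OBP/lgOBP + SLG/lgSLG - 1)."""
    if lg_obp <= 0 or lg_slg <= 0:
        raise ValueError("league OBP and SLG must be positive")
    return 100 * (obp / lg_obp + slg / lg_slg - 1)

def ops_ratio_index(obp, slg, lg_obp, lg_slg):
    """The simpler index used by the comparison tool: 100 * OPS / lgOPS."""
    if lg_obp + lg_slg <= 0:
        raise ValueError("league OPS must be positive")
    return 100 * (obp + slg) / (lg_obp + lg_slg)

# Placeholder inputs (not a real season)
component_index = ops_plus(0.400, 0.600, 0.320, 0.400)
ratio_index = ops_ratio_index(0.400, 0.600, 0.320, 0.400)
print(round(component_index, 1), round(ratio_index, 1))
```

The two indices agree at 100 for a league-average hitter but diverge for extreme seasons, so values from this tool should not be read as if they were published OPS+ figures.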

Historical Trends with Range Slider

For detailed analysis of specific time periods, Plotly's range slider functionality allows users to zoom into particular eras while maintaining context of the full timeline. This is particularly useful for examining shorter-term trends within longer historical narratives.

library(tidyverse)
library(plotly)
library(Lahman)

# Prepare comprehensive historical data
historical_trends <- Batting %>%
  filter(yearID >= 1900, yearID <= 2023) %>%
  group_by(yearID) %>%
  summarise(
    total_ab = sum(AB, na.rm = TRUE),
    total_h = sum(H, na.rm = TRUE),
    total_hr = sum(HR, na.rm = TRUE),
    total_so = sum(SO, na.rm = TRUE),
    total_bb = sum(BB, na.rm = TRUE),
    total_sb = sum(SB, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    batting_avg = total_h / total_ab,
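    hr_per_game = total_hr / (total_ab / 4),  # Rough proxy: assumes ~4 AB per batter-game, so this is HR per 4 AB
    k_per_pa = total_so / (total_ab + total_bb),  # PA approximated as AB + BB
    bb_per_pa = total_bb / (total_ab + total_bb),
    sb_per_game = total_sb / (total_ab / 4)  # Same ~4 AB proxy as above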
  )

# Create multi-trace plot with range slider
fig <- plot_ly()

# Add batting average trace
fig <- fig %>% add_trace(
  data = historical_trends,
  x = ~yearID,
  y = ~batting_avg,
  type = 'scatter',
  mode = 'lines',
  name = 'Batting Average',
  line = list(color = 'blue', width = 2)
)

# Add HR rate trace
fig <- fig %>% add_trace(
  data = historical_trends,
  x = ~yearID,
  y = ~hr_per_game * 10,  # Scale for visibility
  type = 'scatter',
  mode = 'lines',
  name = 'HR Rate (×10)',
  line = list(color = 'red', width = 2),
  yaxis = 'y2'
)

# Add K rate trace
fig <- fig %>% add_trace(
  data = historical_trends,
  x = ~yearID,
  y = ~k_per_pa,
  type = 'scatter',
  mode = 'lines',
  name = 'K Rate',
  line = list(color = 'orange', width = 2)
)

# Configure layout with range slider
fig <- fig %>% layout(
  title = "Historical Trends in MLB Statistics (1900-2023)",
  xaxis = list(
    title = "Year",
    rangeslider = list(visible = TRUE),  # yearID is numeric, so no date type is needed
    range = c(1900, 2023)
  ),
  yaxis = list(
    title = "Rate",
    side = "left"
  ),
  yaxis2 = list(
    title = "HR Rate (×10)",
    overlaying = "y",
    side = "right",
    showgrid = FALSE
  ),
  hovermode = 'x unified',
  legend = list(x = 0.1, y = 0.9)
)

fig

import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman

# Load and prepare data
batting = lahman.batting()

historical_trends = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum',
    'BB': 'sum',
    'SB': 'sum'
}).reset_index()

# Calculate rates (plate appearances approximated as AB + BB)
historical_trends['batting_avg'] = historical_trends['H'] / historical_trends['AB']
historical_trends['hr_rate'] = (historical_trends['HR'] / historical_trends['AB']) * 100
historical_trends['k_rate'] = historical_trends['SO'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['bb_rate'] = historical_trends['BB'] / (historical_trends['AB'] + historical_trends['BB'])
historical_trends['iso'] = (historical_trends['HR'] * 3) / historical_trends['AB']  # Home-run share of ISO only; doubles and triples are omitted

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(
        x=historical_trends['yearID'],
        y=historical_trends['batting_avg'],
        name='Batting Average',
        line=dict(color='blue', width=2)
    ),
    secondary_y=False
)

fig.add_trace(
    go.Scatter(
        x=historical_trends['yearID'],
        y=historical_trends['hr_rate'],
        name='HR Rate (per 100 AB)',
        line=dict(color='red', width=2)
    ),
    secondary_y=True
)

fig.add_trace(
    go.Scatter(
        x=historical_trends['yearID'],
        y=historical_trends['k_rate'],
        name='K Rate',
        line=dict(color='orange', width=2)
    ),
    secondary_y=False
)

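# Add era markers
eras = [
    (1920, 'Live Ball Era'),
    (1947, 'Integration'),
    (1961, 'Expansion'),
    (1993, 'Steroid Era'),
    (2006, 'Modern Era')
]

# Range selector buttons only work on a date axis, so place each season
# at January 1 of its year instead of plotting bare integers
fig.update_traces(
    x=pd.to_datetime(historical_trends['yearID'].astype(str), format='%Y')
)

for year, label in eras:
    fig.add_vline(
        # Epoch milliseconds keep the annotation placement numeric on a date axis
        x=pd.Timestamp(year=year, month=1, day=1).timestamp() * 1000,
        line_dash="dash",
        line_color="gray",
        annotation_text=label,
        annotation_position="top"
    )

# Update layout with range slider
fig.update_xaxes(
    title_text="Year",
    type="date",
    tickformat="%Y",
    hoverformat="%Y",
    rangeslider_visible=True,
    rangeselector=dict(
        buttons=[
            dict(count=10, label="10y", step="year", stepmode="backward"),
            dict(count=25, label="25y", step="year", stepmode="backward"),
            dict(count=50, label="50y", step="year", stepmode="backward"),
            dict(step="all", label="All")
        ]
    )
)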

fig.update_yaxes(title_text="Batting Average / K Rate", secondary_y=False)
fig.update_yaxes(title_text="HR Rate (per 100 AB)", secondary_y=True)

fig.update_layout(
    title_text="Historical Trends in MLB Statistics (1900-2023)",
    hovermode='x unified',
    height=600,
    legend=dict(x=0.01, y=0.99)
)

fig.show()

The range slider enables users to focus on specific periods (like the steroid era from 1993-2005) while maintaining awareness of broader historical context. The range selector buttons provide quick access to common analysis windows (10 years, 25 years, etc.), making it easy to examine how quickly baseball statistics have evolved during different periods.
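
What the slider shows visually can also be checked numerically by averaging a rate inside a chosen window and comparing it with the full timeline. The sketch below assumes a yearly table shaped like `historical_trends`; the helper name `window_mean` and the sample values are hypothetical, and the result is a simple mean of yearly rates rather than an at-bat-weighted figure.

```python
import pandas as pd

def window_mean(trends, column, start, end):
    """Mean of `column` for seasons with start <= yearID <= end."""
    in_window = trends['yearID'].between(start, end)
    return trends.loc[in_window, column].mean()

# Made-up yearly rates, only to show the mechanics
sample = pd.DataFrame({
    'yearID': [1990, 1995, 2000, 2010],
    'hr_rate': [2.0, 3.0, 3.4, 2.6],
})

era_mean = window_mean(sample, 'hr_rate', 1993, 2005)
overall_mean = sample['hr_rate'].mean()
print(f"1993-2005: {era_mean:.2f} vs. all years: {overall_mean:.2f}")
```

With real data, passing `historical_trends` in place of `sample` gives the same comparison for any window the slider can display.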

These interactive visualization techniques transform historical baseball analysis from static number-crunching into dynamic exploration. They reveal patterns, enable comparisons, and provide intuitive understanding of how dramatically the game has changed over more than a century of professional play.

R

# Animated timeline of league statistics over time
library(tidyverse)
library(gganimate)
library(Lahman)

# Calculate league-wide statistics by year
league_evolution <- Batting %>%
  filter(yearID >= 1900, yearID <= 2023) %>%
  group_by(yearID) %>%
  summarise(
    total_ab = sum(AB, na.rm = TRUE),
    total_h = sum(H, na.rm = TRUE),
    total_hr = sum(HR, na.rm = TRUE),
    total_so = sum(SO, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    avg = total_h / total_ab,
    hr_rate = (total_hr / total_ab) * 100,  # HR per 100 AB
    k_rate = (total_so / total_ab) * 100,   # K per 100 AB
    decade = floor(yearID / 10) * 10,
    era = case_when(
      yearID < 1920 ~ "Dead Ball",
      yearID < 1947 ~ "Live Ball (Pre-Integration)",
      yearID < 1961 ~ "Integration Era",
      yearID < 1993 ~ "Expansion Era",
      yearID < 2006 ~ "Steroid Era",
      TRUE ~ "Modern Era"
    )
  )

# Create animated plot
anim <- ggplot(league_evolution,
               aes(x = yearID, y = avg, group = 1)) +
  geom_line(linewidth = 1.2, color = "darkblue") +
  geom_point(size = 3, color = "darkblue") +
  geom_text(aes(label = sprintf("%.3f", avg)),
            vjust = -1, size = 3.5, color = "darkblue") +
  labs(title = "MLB Batting Average Evolution: {round(frame_along)}",
       subtitle = "League-wide batting average by year",
       x = "Year",
       y = "Batting Average") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16, face = "bold")) +
  # transition_reveal() draws the line progressively; transition_time() would
  # show one season per frame, leaving geom_line with nothing to connect
  transition_reveal(yearID)

# Render animation
animate(anim, nframes = 124, fps = 4, width = 800, height = 500)

# Create multi-metric comparison
league_long <- league_evolution %>%
  select(yearID, avg, hr_rate, k_rate, era) %>%
  pivot_longer(cols = c(avg, hr_rate, k_rate),
               names_to = "metric",
               values_to = "value") %>%
  mutate(
    metric_label = case_when(
      metric == "avg" ~ "Batting Average",
      metric == "hr_rate" ~ "HR Rate (per 100 AB)",
      metric == "k_rate" ~ "K Rate (per 100 AB)"
    )
  )

# Faceted animation showing all three metrics
multi_anim <- ggplot(league_long,
                     aes(x = yearID, y = value, color = metric_label)) +
  geom_line(linewidth = 1) +
  facet_wrap(~metric_label, scales = "free_y", ncol = 1) +
  # transition_reveal() exposes {frame_along}, not {frame_time}
  labs(title = "Evolution of MLB Statistics: {round(frame_along)}",
       x = "Year",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, face = "bold")) +
  transition_reveal(yearID)

animate(multi_anim, nframes = 150, fps = 10, width = 800, height = 600)

R

library(tidyverse)
library(Lahman)
library(plotly)

# Function to calculate a simplified OPS+ (player OPS / league OPS, no park adjustment)
# OBP is approximated as (H + BB) / (AB + BB)
# Total bases = H + 2B + 2*3B + 3*HR, because H already counts each hit once
calculate_ops_plus <- function(player_batting, league_batting) {
  player_ops <- with(player_batting,
                    (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)
  league_ops <- with(league_batting,
                    (H + BB) / (AB + BB) + (H + X2B + 2*X3B + 3*HR) / AB)

  ops_plus <- (player_ops / league_ops) * 100
  return(ops_plus)
}

# Create interactive player comparison
compare_players <- function(player_names, min_year = 1900, max_year = 2023) {

  # Get player IDs
  player_ids <- People %>%
    filter(paste(nameFirst, nameLast) %in% player_names) %>%
    pull(playerID)

  # Get batting stats for these players
  player_stats <- Batting %>%
    filter(playerID %in% player_ids,
           yearID >= min_year,
           yearID <= max_year,
           AB >= 300) %>%
    left_join(People %>% select(playerID, nameFirst, nameLast),
              by = "playerID") %>%
    mutate(player_name = paste(nameFirst, nameLast))

  # Calculate league averages by year
  league_averages <- Batting %>%
    filter(yearID >= min_year, yearID <= max_year) %>%
    group_by(yearID) %>%
    summarise(
      lg_AB = sum(AB, na.rm = TRUE),
      lg_H = sum(H, na.rm = TRUE),
      lg_BB = sum(BB, na.rm = TRUE),
      lg_X2B = sum(X2B, na.rm = TRUE),
      lg_X3B = sum(X3B, na.rm = TRUE),
      lg_HR = sum(HR, na.rm = TRUE)
    )

  # Calculate OPS+ for each player-season
  comparison_data <- player_stats %>%
    left_join(league_averages, by = "yearID") %>%
    rowwise() %>%
    mutate(
      # Total bases = H + 2B + 2*3B + 3*HR, because H already counts each hit once
      player_ops = (H + BB) / (AB + BB) +
                   (H + X2B + 2*X3B + 3*HR) / AB,
      league_ops = (lg_H + lg_BB) / (lg_AB + lg_BB) +
                   (lg_H + lg_X2B + 2*lg_X3B + 3*lg_HR) / lg_AB,
      ops_plus = (player_ops / league_ops) * 100,
      ba = H / AB
    ) %>%
    ungroup()

  # Create interactive plot
  p <- plot_ly(comparison_data,
               x = ~yearID,
               y = ~ops_plus,
               color = ~player_name,
               type = 'scatter',
               mode = 'lines+markers',
               text = ~paste("Year:", yearID,
                           "<br>Player:", player_name,
                           "<br>OPS+:", round(ops_plus, 1),
                           "<br>BA:", sprintf("%.3f", ba)),
               hoverinfo = 'text') %>%
    layout(title = "Era-Adjusted Performance Comparison",
           xaxis = list(title = "Year"),
           yaxis = list(title = "OPS+ (100 = League Average)"),
           hovermode = 'closest')

  return(p)
}

# Example usage: Compare Ruth, Williams, Bonds
comparison <- compare_players(
  c("Babe Ruth", "Ted Williams", "Barry Bonds"),
  min_year = 1914,
  max_year = 2007
)

comparison

Python

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import lahman

# Load historical batting data
batting = lahman.batting()

# Calculate league-wide statistics by year
league_evolution = batting[batting['yearID'] >= 1900].groupby('yearID').agg({
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'SO': 'sum'
}).reset_index()

league_evolution['avg'] = league_evolution['H'] / league_evolution['AB']
league_evolution['hr_rate'] = (league_evolution['HR'] / league_evolution['AB']) * 100
league_evolution['k_rate'] = (league_evolution['SO'] / league_evolution['AB']) * 100

# Add era classifications
def classify_era(year):
    if year < 1920:
        return "Dead Ball"
    elif year < 1947:
        return "Live Ball (Pre-Integration)"
    elif year < 1961:
        return "Integration Era"
    elif year < 1993:
        return "Expansion Era"
    elif year < 2006:
        return "Steroid Era"
    else:
        return "Modern Era"

league_evolution['era'] = league_evolution['yearID'].apply(classify_era)
league_evolution['decade'] = (league_evolution['yearID'] // 10) * 10

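# Create animated line chart with Plotly
# Each frame needs every season up to that year; a single row per frame
# would give px.line only one point and no visible line
frames = pd.concat(
    [league_evolution[league_evolution['yearID'] <= year].assign(frame=year)
     for year in league_evolution['yearID']],
    ignore_index=True
)

fig = px.line(frames,
              x='yearID',
              y='avg',
              animation_frame='frame',
              range_x=[1900, 2023],
              range_y=[0.23, 0.31],
              title='MLB Batting Average Evolution Over Time',
              labels={'yearID': 'Year', 'avg': 'Batting Average', 'frame': 'Year'})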

fig.update_traces(line=dict(color='darkblue', width=3))
fig.update_layout(
    xaxis_title='Year',
    yaxis_title='Batting Average',
    hovermode='x unified',
    showlegend=False
)

# Show the animation
fig.show()

# Create multi-metric visualization (static: one panel per metric)
fig_multi = make_subplots(
    rows=3, cols=1,
    subplot_titles=('Batting Average', 'HR Rate (per 100 AB)', 'K Rate (per 100 AB)'),
    vertical_spacing=0.1
)

# Add one trace per metric; redrawing the line for every year would stack
# hundreds of overlapping traces without producing any animation
metrics = [('avg', 'AVG', 'blue'), ('hr_rate', 'HR', 'red'), ('k_rate', 'K', 'orange')]
for row, (column, name, color) in enumerate(metrics, start=1):
    fig_multi.add_trace(
        go.Scatter(x=league_evolution['yearID'], y=league_evolution[column],
                   mode='lines', name=name, line=dict(color=color)),
        row=row, col=1
    )

fig_multi.update_xaxes(title_text="Year", row=3, col=1)
fig_multi.update_layout(
    height=900,
    title_text="Evolution of MLB Statistics Over Time",
    showlegend=False
)

fig_multi.show()

13.8 Exercises

Exercise 1: Calculate Era-Adjusted Statistics

Calculate OPS+ and ERA+ for the following player seasons and compare them:

Hitters:

Rogers Hornsby, 1924 (.424 AVG, .507 OBP, .696 SLG)

Tony Gwynn, 1994 (.394 AVG, .454 OBP, .568 SLG)

Ichiro Suzuki, 2004 (.372 AVG, .414 OBP, .455 SLG)

Pitchers:

Walter Johnson, 1913 (1.14 ERA)

Dwight Gooden, 1985 (1.53 ERA)

Jacob deGrom, 2018 (1.70 ERA)

# Solution for Exercise 1

library(Lahman)
library(dplyr)

# Hitter OPS+ calculations
# OPS+ = 100 * (OBP / lgOBP + SLG / lgSLG - 1), here without a park adjustment
# Note: You'll need to look up league averages for each year

# 1924 NL Averages: OBP ~.330, SLG ~.395
hornsby_ops_plus <- 100 * ((0.507 / 0.330) + (0.696 / 0.395) - 1)
print(paste("Hornsby 1924 OPS+:", round(hornsby_ops_plus, 0)))

# 1994 NL Averages: OBP ~.330, SLG ~.415
gwynn_ops_plus <- 100 * ((0.454 / 0.330) + (0.568 / 0.415) - 1)
print(paste("Gwynn 1994 OPS+:", round(gwynn_ops_plus, 0)))

# 2004 AL Averages: OBP ~.333, SLG ~.423
ichiro_ops_plus <- 100 * ((0.414 / 0.333) + (0.455 / 0.423) - 1)
print(paste("Ichiro 2004 OPS+:", round(ichiro_ops_plus, 0)))

# Pitcher ERA+ calculations
# 1913 AL ERA: ~3.00
johnson_era_plus <- 100 * (3.00 / 1.14)
print(paste("Walter Johnson 1913 ERA+:", round(johnson_era_plus, 0)))

# 1985 NL ERA: ~3.58
gooden_era_plus <- 100 * (3.58 / 1.53)
print(paste("Dwight Gooden 1985 ERA+:", round(gooden_era_plus, 0)))

# 2018 NL ERA: ~4.04
degrom_era_plus <- 100 * (4.04 / 1.70)
print(paste("Jacob deGrom 2018 ERA+:", round(degrom_era_plus, 0)))

# Solution for Exercise 1

# Hitter OPS+ calculations
# 1924 NL Averages: OBP ~.330, SLG ~.395
hornsby_ops_plus = 100 * ((0.507 / 0.330) + (0.696 / 0.395) - 1)
print(f"Hornsby 1924 OPS+: {round(hornsby_ops_plus, 0)}")

# 1994 NL Averages: OBP ~.330, SLG ~.415
gwynn_ops_plus = 100 * ((0.454 / 0.330) + (0.568 / 0.415) - 1)
print(f"Gwynn 1994 OPS+: {round(gwynn_ops_plus, 0)}")

# 2004 AL Averages: OBP ~.333, SLG ~.423
ichiro_ops_plus = 100 * ((0.414 / 0.333) + (0.455 / 0.423) - 1)
print(f"Ichiro 2004 OPS+: {round(ichiro_ops_plus, 0)}")

# Pitcher ERA+ calculations
# 1913 AL ERA: ~3.00
johnson_era_plus = 100 * (3.00 / 1.14)
print(f"Walter Johnson 1913 ERA+: {round(johnson_era_plus, 0)}")

# 1985 NL ERA: ~3.58
gooden_era_plus = 100 * (3.58 / 1.53)
print(f"Dwight Gooden 1985 ERA+: {round(gooden_era_plus, 0)}")

# 2018 NL ERA: ~4.04
degrom_era_plus = 100 * (4.04 / 1.70)
print(f"Jacob deGrom 2018 ERA+: {round(degrom_era_plus, 0)}")

Exercise 2: Decade-by-Decade Trend Analysis

Using the Lahman database, create visualizations showing how the following statistics have changed by decade since 1900:

Stolen base rate (SB per game)
Complete game percentage
Batting average on balls in play (BABIP)
Walk rate (BB per PA)

# Solution for Exercise 2

library(Lahman)
library(dplyr)
library(ggplot2)
library(gridExtra)

# Calculate decade statistics
decade_trends <- Batting %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_SB = sum(SB, na.rm = TRUE),
    Total_G = sum(G, na.rm = TRUE),
    Total_AB = sum(AB, na.rm = TRUE),
    Total_H = sum(H, na.rm = TRUE),
    Total_HR = sum(HR, na.rm = TRUE),
    Total_BB = sum(BB, na.rm = TRUE),
    Total_SO = sum(SO, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
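  left_join(
    # League games per decade: Teams counts every game once per team, so halve the sum
    Teams %>%
      filter(yearID >= 1900) %>%
      mutate(decade = floor(yearID / 10) * 10) %>%
      group_by(decade) %>%
      summarise(Games = sum(G, na.rm = TRUE) / 2, .groups = 'drop'),
    by = "decade"
  ) %>%
  mutate(
    SB_per_Game = Total_SB / Games,  # Batting$G sums player appearances, so it cannot serve as a game count
    BIP = Total_AB - Total_SO - Total_HR,  # Sacrifice flies omitted (not recorded in early seasons)
    BABIP = (Total_H - Total_HR) / BIP,
    PA = Total_AB + Total_BB,  # Approximation: ignores HBP and sacrifices
    BB_Rate = Total_BB / PA
  )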

# Pitching complete games
pitching_trends <- Pitching %>%
  filter(yearID >= 1900) %>%
  mutate(decade = floor(yearID / 10) * 10) %>%
  group_by(decade) %>%
  summarise(
    Total_GS = sum(GS, na.rm = TRUE),
    Total_CG = sum(CG, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(CG_Pct = Total_CG / Total_GS)

# Create plots
p1 <- ggplot(decade_trends, aes(x = decade, y = SB_per_Game)) +
  geom_line(linewidth = 1.5, color = "blue") +
  geom_point(size = 3) +
  labs(title = "Stolen Bases per Game", y = "SB/Game") +
  theme_minimal()

p2 <- ggplot(pitching_trends, aes(x = decade, y = CG_Pct * 100)) +
  geom_line(linewidth = 1.5, color = "red") +
  geom_point(size = 3) +
  labs(title = "Complete Game Percentage", y = "CG %") +
  theme_minimal()

p3 <- ggplot(decade_trends, aes(x = decade, y = BABIP)) +
  geom_line(linewidth = 1.5, color = "green") +
  geom_point(size = 3) +
  labs(title = "BABIP by Decade", y = "BABIP") +
  theme_minimal()

p4 <- ggplot(decade_trends, aes(x = decade, y = BB_Rate * 100)) +
  geom_line(linewidth = 1.5, color = "purple") +
  geom_point(size = 3) +
  labs(title = "Walk Rate", y = "BB%") +
  theme_minimal()

grid.arrange(p1, p2, p3, p4, ncol = 2)

# Solution for Exercise 2

import pybaseball as pyb
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pyb.cache.enable()

batting = pyb.lahman.batting()
pitching = pyb.lahman.pitching()

# Calculate decade statistics
batting_modern = batting[batting['yearID'] >= 1900].copy()
batting_modern['decade'] = (batting_modern['yearID'] // 10) * 10

decade_trends = batting_modern.groupby('decade').agg({
    'SB': 'sum',
    'G': 'sum',
    'AB': 'sum',
    'H': 'sum',
    'HR': 'sum',
    'BB': 'sum',
    'SO': 'sum'
}).reset_index()

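# Calculate rate stats
# Batting 'G' sums player appearances, so count games from starting pitchers instead:
# every team-game has exactly one starter, making league GS / 2 the number of games
games_by_decade = pitching[pitching['yearID'] >= 1900].copy()
games_by_decade['decade'] = (games_by_decade['yearID'] // 10) * 10
games_by_decade = games_by_decade.groupby('decade')['GS'].sum() / 2
decade_trends['SB_per_Game'] = decade_trends['SB'] / decade_trends['decade'].map(games_by_decade)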
decade_trends['BIP'] = decade_trends['AB'] - decade_trends['SO'] - decade_trends['HR']
decade_trends['BABIP'] = (decade_trends['H'] - decade_trends['HR']) / decade_trends['BIP']
decade_trends['PA'] = decade_trends['AB'] + decade_trends['BB']
decade_trends['BB_Rate'] = decade_trends['BB'] / decade_trends['PA']

# Pitching complete games
pitching_modern = pitching[pitching['yearID'] >= 1900].copy()
pitching_modern['decade'] = (pitching_modern['yearID'] // 10) * 10

pitching_trends = pitching_modern.groupby('decade').agg({
    'GS': 'sum',
    'CG': 'sum'
}).reset_index()

pitching_trends['CG_Pct'] = pitching_trends['CG'] / pitching_trends['GS']

# Create plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

ax1.plot(decade_trends['decade'], decade_trends['SB_per_Game'],
         marker='o', linewidth=2, markersize=8, color='blue')
ax1.set_title('Stolen Bases per Game', fontsize=12, fontweight='bold')
ax1.set_ylabel('SB/Game')
ax1.grid(True, alpha=0.3)

ax2.plot(pitching_trends['decade'], pitching_trends['CG_Pct'] * 100,
         marker='o', linewidth=2, markersize=8, color='red')
ax2.set_title('Complete Game Percentage', fontsize=12, fontweight='bold')
ax2.set_ylabel('CG %')
ax2.grid(True, alpha=0.3)

ax3.plot(decade_trends['decade'], decade_trends['BABIP'],
         marker='o', linewidth=2, markersize=8, color='green')
ax3.set_title('BABIP by Decade', fontsize=12, fontweight='bold')
ax3.set_ylabel('BABIP')
ax3.grid(True, alpha=0.3)

ax4.plot(decade_trends['decade'], decade_trends['BB_Rate'] * 100,
         marker='o', linewidth=2, markersize=8, color='purple')
ax4.set_title('Walk Rate', fontsize=12, fontweight='bold')
ax4.set_ylabel('BB%')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('decade_trends_detailed.png', dpi=300, bbox_inches='tight')
plt.show()

Exercise 3: Cross-Era Player Comparison

Compare the following three shortstops from different eras using era-adjusted statistics:

Honus Wagner (career: 1897-1917)
Cal Ripken Jr. (career: 1981-2001)
Derek Jeter (career: 1995-2014)

Calculate and compare:

Career batting average relative to league average

Career OPS relative to league average

Best single-season OPS+

Career home runs relative to position average

# Solution for Exercise 3

library(Lahman)
library(dplyr)

compare_shortstops <- function() {
  # Get player IDs
  wagner_id <- People %>%
    filter(nameLast == "Wagner", nameFirst == "Honus") %>%
    pull(playerID)

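  ripken_id <- People %>%
    # Require batting records in case the name also matches a non-player such as a manager
    filter(nameLast == "Ripken", nameFirst == "Cal",
           playerID %in% Batting$playerID) %>%
    pull(playerID)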

  jeter_id <- People %>%
    filter(nameLast == "Jeter", nameFirst == "Derek") %>%
    pull(playerID)

  # Get career stats
  get_career <- function(pid, name) {
    career <- Batting %>%
      filter(playerID == pid) %>%
      summarise(
        Name = name,
        Years = paste(min(yearID), max(yearID), sep = "-"),
        AB = sum(AB, na.rm = TRUE),
        H = sum(H, na.rm = TRUE),
        HR = sum(HR, na.rm = TRUE),
        BB = sum(BB, na.rm = TRUE),
        X2B = sum(X2B, na.rm = TRUE),
        X3B = sum(X3B, na.rm = TRUE)
      ) %>%
      mutate(
        AVG = round(H / AB, 3),
        TB = H + X2B + 2*X3B + 3*HR,
        SLG = round(TB / AB, 3),
        OBP = round((H + BB) / (AB + BB), 3),
        OPS = round(OBP + SLG, 3)
      )

    return(career)
  }

  wagner <- get_career(wagner_id, "Honus Wagner")
  ripken <- get_career(ripken_id, "Cal Ripken Jr.")
  jeter <- get_career(jeter_id, "Derek Jeter")

  comparison <- bind_rows(wagner, ripken, jeter)
  return(comparison)
}

shortstop_comparison <- compare_shortstops()
print(shortstop_comparison[, c("Name", "Years", "AVG", "OBP", "SLG", "OPS", "HR")])

# Solution for Exercise 3

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

def compare_shortstops():
    """
    Compare three legendary shortstops
    """
    # Get player IDs
    wagner = people[
        (people['nameLast'] == 'Wagner') &
        (people['nameFirst'] == 'Honus')
    ]
    wagner_id = wagner['playerID'].values[0] if len(wagner) > 0 else None

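    ripken = people[
        (people['nameLast'] == 'Ripken') &
        (people['nameFirst'] == 'Cal') &
        # Require batting records in case the name also matches a non-player such as a manager
        (people['playerID'].isin(batting['playerID']))
    ]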
    ripken_id = ripken['playerID'].values[0] if len(ripken) > 0 else None

    jeter = people[
        (people['nameLast'] == 'Jeter') &
        (people['nameFirst'] == 'Derek')
    ]
    jeter_id = jeter['playerID'].values[0] if len(jeter) > 0 else None

    def get_career(pid, name):
        career = batting[batting['playerID'] == pid]

        stats = {
            'Name': name,
            'Years': f"{career['yearID'].min()}-{career['yearID'].max()}",
            'AB': career['AB'].sum(),
            'H': career['H'].sum(),
            'HR': career['HR'].sum(),
            'BB': career['BB'].sum(),
            '2B': career['2B'].sum(),
            '3B': career['3B'].sum()
        }

        stats['AVG'] = round(stats['H'] / stats['AB'], 3)
        stats['TB'] = stats['H'] + stats['2B'] + 2*stats['3B'] + 3*stats['HR']
        stats['SLG'] = round(stats['TB'] / stats['AB'], 3)
        # Simplified OBP: ignores hit-by-pitches and sacrifice flies
        stats['OBP'] = round((stats['H'] + stats['BB']) / (stats['AB'] + stats['BB']), 3)
        stats['OPS'] = round(stats['OBP'] + stats['SLG'], 3)

        return stats

    comparison = pd.DataFrame([
        get_career(wagner_id, "Honus Wagner"),
        get_career(ripken_id, "Cal Ripken Jr."),
        get_career(jeter_id, "Derek Jeter")
    ])

    return comparison

shortstop_comparison = compare_shortstops()
print(shortstop_comparison[['Name', 'Years', 'AVG', 'OBP', 'SLG', 'OPS', 'HR']])
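
The Exercise 3 solutions approximate OBP as (H + BB) / (AB + BB). The standard definition also counts hit-by-pitches in the numerator and adds hit-by-pitches and sacrifice flies to the denominator. The sketch below puts both forms in one helper so the size of the approximation can be checked. The function name `on_base_pct` and the sample season line are made up for illustration and are not taken from the Lahman data.

```python
def on_base_pct(h, bb, ab, hbp=0, sf=0):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF).

    With hbp=0 and sf=0 this reduces to the simplified form used in the
    Exercise 3 solutions.
    """
    denominator = ab + bb + hbp + sf
    if denominator == 0:
        return float("nan")  # no plate appearances to evaluate
    return (h + bb + hbp) / denominator


# Hypothetical season line, for illustration only
line = {"h": 180, "bb": 60, "ab": 600, "hbp": 5, "sf": 5}

simplified = on_base_pct(line["h"], line["bb"], line["ab"])
full = on_base_pct(**line)

print(f"Simplified OBP: {simplified:.3f}")
print(f"Full OBP:       {full:.3f}")
```

For a modern player the two values usually differ by only a few points. For early players the full formula needs a decision about missing values, since sacrifice flies in particular are not available for the earliest seasons.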

Exercise 4: Steroid Era Impact Analysis

Analyze the impact of the steroid era on career milestones:

Count how many players reached 500 career home runs in different eras:

Pre-steroid (before 1994)
Steroid era (1994-2007)
Post-steroid (after 2007)

Calculate the average age at which players hit their career peak (most HR in a season) for each era

Identify players whose late-career performance (ages 35+) was anomalously good compared to early career

# Solution for Exercise 4

library(Lahman)
library(dplyr)

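# Part 1: 500 HR club by era
# Note: each player is assigned to the era of his final season (Last_Year),
# which can be later than the season in which he actually reached 500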
hr_500_club <- Batting %>%
  group_by(playerID) %>%
  summarise(
    Career_HR = sum(HR, na.rm = TRUE),
    Last_Year = max(yearID),
    .groups = 'drop'
  ) %>%
  filter(Career_HR >= 500) %>%
  mutate(
    Era = case_when(
      Last_Year < 1994 ~ "Pre-Steroid",
      Last_Year >= 1994 & Last_Year <= 2007 ~ "Steroid",
      Last_Year > 2007 ~ "Post-Steroid"
    )
  )

# Join with names
hr_500_with_names <- hr_500_club %>%
  left_join(People, by = "playerID") %>%
  mutate(Name = paste(nameFirst, nameLast)) %>%
  select(Name, Career_HR, Last_Year, Era) %>%
  arrange(desc(Career_HR))

print(hr_500_with_names)

# Count by era
era_counts <- hr_500_with_names %>%
  group_by(Era) %>%
  summarise(Count = n(), .groups = 'drop')

print(era_counts)

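# Part 2: Peak age by era
# Combine stints first so a traded player's season counts once, and drop
# homerless seasons so players who never homered do not contribute a "peak"
peak_age_analysis <- Batting %>%
  group_by(playerID, yearID) %>%
  summarise(HR = sum(HR, na.rm = TRUE), .groups = 'drop') %>%
  left_join(select(People, playerID, birthYear), by = "playerID") %>%
  mutate(
    Age = yearID - birthYear,
    Era = case_when(
      yearID < 1994 ~ "Pre-Steroid",
      yearID >= 1994 & yearID <= 2007 ~ "Steroid",
      yearID > 2007 ~ "Post-Steroid"
    )
  ) %>%
  filter(!is.na(Age), Age >= 20, Age <= 45, HR > 0) %>%
  group_by(playerID) %>%
  # Keep exactly one peak season per player, even when seasons tie
  slice_max(HR, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  group_by(Era) %>%
  summarise(
    Avg_Peak_Age = mean(Age, na.rm = TRUE),
    .groups = 'drop'
  )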

print(peak_age_analysis)
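
Part 1 of the solution assigns each 500-home-run hitter to the era of his final season. The exercise asks when players reached the milestone, and those can differ: a player who hit his 500th home run in 2004 and retired in 2010 would be counted as "Post-Steroid". One possible alternative, sketched here in plain Python, is to find the first season in which cumulative home runs crossed the threshold. The helper `milestone_years` and the toy rows are hypothetical; the toy threshold of 50 only keeps the example small.

```python
from collections import defaultdict


def classify_era(year):
    """Same era boundaries as the exercise statement."""
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid"
    return "Post-Steroid"


def milestone_years(rows, threshold=500):
    """Map playerID -> first season in which cumulative HR reached threshold.

    rows: iterable of (playerID, yearID, HR) tuples. Several rows for the
    same player-season (stints) are summed before accumulating.
    """
    by_player = defaultdict(lambda: defaultdict(int))
    for player_id, year, hr in rows:
        by_player[player_id][year] += hr

    reached = {}
    for player_id, seasons in by_player.items():
        total = 0
        for year in sorted(seasons):
            total += seasons[year]
            if total >= threshold:
                reached[player_id] = year
                break  # only the first qualifying season matters
    return reached


# Toy data (made up): "slugger01" crosses the toy threshold in 2004
# but keeps playing through 2010; "veteran01" never reaches it.
toy_rows = [
    ("slugger01", 2003, 30),
    ("slugger01", 2004, 15),
    ("slugger01", 2004, 10),  # second stint in the same season
    ("slugger01", 2010, 10),
    ("veteran01", 1990, 30),
    ("veteran01", 1995, 15),
]

toy_result = milestone_years(toy_rows, threshold=50)
for player_id, year in toy_result.items():
    print(player_id, year, classify_era(year))
```

With the real data, the rows could come from `batting[['playerID', 'yearID', 'HR']].itertuples(index=False, name=None)`, keeping the default threshold of 500.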

# Solution for Exercise 4

import pybaseball as pyb
import pandas as pd

pyb.cache.enable()

batting = pyb.lahman.batting()
people = pyb.lahman.people()

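# Part 1: 500 HR club by era
# Note: each player is assigned to the era of his final season (Last_Year),
# which can be later than the season in which he actually reached 500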
career_hr = batting.groupby('playerID').agg({
    'HR': 'sum',
    'yearID': 'max'
}).reset_index()

career_hr.columns = ['playerID', 'Career_HR', 'Last_Year']
hr_500_club = career_hr[career_hr['Career_HR'] >= 500].copy()

# Define era
def classify_era(year):
    if year < 1994:
        return "Pre-Steroid"
    elif year <= 2007:
        return "Steroid"
    else:
        return "Post-Steroid"

hr_500_club['Era'] = hr_500_club['Last_Year'].apply(classify_era)

# Join with names
hr_500_with_names = hr_500_club.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID',
    how='left'
)

hr_500_with_names['Name'] = (
    hr_500_with_names['nameFirst'] + ' ' + hr_500_with_names['nameLast']
)

hr_500_with_names = hr_500_with_names[['Name', 'Career_HR', 'Last_Year', 'Era']]
hr_500_with_names = hr_500_with_names.sort_values('Career_HR', ascending=False)

print(hr_500_with_names)

# Count by era
era_counts = hr_500_with_names.groupby('Era').size().reset_index()
era_counts.columns = ['Era', 'Count']
print(era_counts)

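# Part 2: Peak age by era
# Combine stints first so a traded player's season counts once
season_hr = batting.groupby(['playerID', 'yearID'], as_index=False)['HR'].sum()

batting_with_age = season_hr.merge(
    people[['playerID', 'birthYear']],
    on='playerID',
    how='left'
)

batting_with_age['Age'] = batting_with_age['yearID'] - batting_with_age['birthYear']
batting_with_age['Era'] = batting_with_age['yearID'].apply(classify_era)

# Filter reasonable ages and drop homerless seasons, so players who never
# homered (mostly pitchers) do not contribute an arbitrary "peak"
batting_with_age = batting_with_age[
    (batting_with_age['Age'] >= 20) &
    (batting_with_age['Age'] <= 45) &
    (batting_with_age['HR'] > 0)
]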

# Find peak HR season for each player
peak_seasons = batting_with_age.loc[
    batting_with_age.groupby('playerID')['HR'].idxmax()
]

# Average peak age by era
peak_age_by_era = peak_seasons.groupby('Era')['Age'].mean().reset_index()
peak_age_by_era.columns = ['Era', 'Avg_Peak_Age']
print(peak_age_by_era)
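
The solutions above cover Parts 1 and 2 only. For Part 3, one possible approach is to compare each player's home run rate (HR per AB) from age 35 onward with his rate before 35 and flag large increases. The sketch below does this in plain Python on made-up rows. The helper name `late_career_flags` and the cutoffs (`min_ab=1000`, `min_ratio=1.25`) are arbitrary assumptions chosen for illustration, not values from the chapter.

```python
def late_career_flags(rows, split_age=35, min_ab=1000, min_ratio=1.25):
    """Flag players whose HR rate at split_age and older beats their earlier rate.

    rows: iterable of (playerID, age, AB, HR) tuples, one per season.
    Returns (playerID, early_rate, late_rate, ratio) tuples sorted by ratio,
    largest first. Players without min_ab at-bats in both phases are skipped.
    """
    totals = {}
    for player_id, age, ab, hr in rows:
        phase = "late" if age >= split_age else "early"
        player = totals.setdefault(player_id, {"early": [0, 0], "late": [0, 0]})
        player[phase][0] += ab
        player[phase][1] += hr

    flagged = []
    for player_id, player in totals.items():
        early_ab, early_hr = player["early"]
        late_ab, late_hr = player["late"]
        if early_ab < min_ab or late_ab < min_ab or early_hr == 0:
            continue  # too little data for a stable comparison
        early_rate = early_hr / early_ab
        late_rate = late_hr / late_ab
        ratio = late_rate / early_rate
        if ratio >= min_ratio:
            flagged.append((player_id, early_rate, late_rate, ratio))
    return sorted(flagged, key=lambda item: item[3], reverse=True)


# Toy data (made up): three hypothetical careers
toy_rows = (
    [("late01", age, 500, 20) for age in range(25, 35)]      # steady early years
    + [("late01", age, 500, 35) for age in range(35, 38)]    # power surge after 35
    + [("normal01", age, 500, 30) for age in range(25, 35)]
    + [("normal01", age, 500, 20) for age in range(35, 37)]  # typical decline
    + [("short01", age, 500, 10) for age in range(25, 35)]
    + [("short01", 35, 300, 15)]                             # too few late at-bats
)

flags = late_career_flags(toy_rows)
for player_id, early_rate, late_rate, ratio in flags:
    print(f"{player_id}: early {early_rate:.3f}, late {late_rate:.3f}, ratio {ratio:.2f}")
```

A raw HR-rate ratio does not adjust for era or park, so a flag from this kind of screen marks a statistical oddity worth a closer look and nothing more.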

Summary

Historical analysis and cross-era comparison represent some of the most intellectually challenging—and rewarding—aspects of baseball analytics. By understanding the contexts in which players competed, applying appropriate adjustments, and using sophisticated analytical frameworks, we can make meaningful comparisons across the decades.

Key takeaways from this chapter:

Context is Everything: Raw statistics are meaningless without understanding the era in which they were accumulated

Era-Adjusted Statistics: Tools like OPS+, ERA+, and wRC+ allow apples-to-apples comparisons across different offensive environments

The Lahman Database: This comprehensive historical database enables sophisticated analysis of baseball's entire history

Historical Trends: Baseball has evolved continuously, with clear patterns in offense, defense, and pitching across decades

Cross-Era Comparison: With appropriate methods, we can meaningfully compare players from different eras while respecting their unique contexts

The Steroid Era: This period presents special challenges but can be handled using standard era-adjustment techniques

Peak vs. Career: Different analytical frameworks favor different types of players; both perspectives have merit

As you continue your work in baseball analytics, always remember that numbers tell stories about human achievement in specific historical moments. Our job is to understand those achievements in context while still allowing meaningful comparison across time. The greatest players of the dead ball era, the integration era, the expansion era, and today's game all deserve to be evaluated fairly within their own contexts while still being measured against each other using sophisticated analytical tools.

The methods you've learned in this chapter will enable you to participate in the great debates of baseball history with statistical rigor and historical awareness—combining the best of both traditional and modern approaches to understanding this timeless game.
