Chapter 3: The Baseball Data Ecosystem

Baseball is blessed with an extraordinary wealth of publicly available data. From historical records dating back to 1871 in the Lahman database to modern Statcast measurements that track every pitch and batted ball, analysts have access to datasets that would be the envy of researchers in most other fields.

What You'll Learn
  • The baseballr Package (R)
  • The pybaseball Library (Python)
  • The Lahman Database
  • Other Data Sources

3.1 The baseballr Package (R)

The baseballr package, created by Bill Petti, is the Swiss Army knife of baseball data acquisition in R. It provides functions to access FanGraphs, Baseball Reference, Baseball Savant (Statcast), and the MLB Stats API, all through a consistent interface.

3.1.1 Package Overview and Installation {#baseballr-install}

Install the package from CRAN or the development version from GitHub:

# From CRAN (stable version)
install.packages("baseballr")

# From GitHub (development version with latest features)
# install.packages("devtools")
devtools::install_github("BillPetti/baseballr")

# Load the package
library(baseballr)
library(tidyverse)  # For data manipulation

The package is extensively documented: run ?baseballr in R, or visit https://billpetti.github.io/baseballr/

Key features:

  • Access to multiple data sources through a unified interface
  • Handles API rate limiting and pagination automatically
  • Returns data as tibbles (tidy data frames)
  • Includes player ID mapping across different systems
  • Regular updates to accommodate API changes

3.1.2 FanGraphs Data Access {#baseballr-fangraphs}

FanGraphs is the gold standard for advanced baseball statistics, featuring metrics like wOBA, wRC+, FIP, and WAR. The baseballr package provides several functions to access FanGraphs leaderboards.

Batting Leaders

library(baseballr)
library(tidyverse)

# Get 2024 batting leaders (qualified batters: min 3.1 PA per team game)
batters_2024 <- fg_batter_leaders(
  startseason = 2024,
  endseason = 2024,
  qual = 502  # 502 PA = 162 * 3.1
)

# Preview the data
glimpse(batters_2024)
dim(batters_2024)  # Check dimensions

# Key columns include:
# Name, Team, G, PA, HR, R, RBI, SB, BB%, K%, AVG, OBP, SLG, wOBA, wRC+, WAR, etc.

# View top performers by wRC+ (Weighted Runs Created Plus)
top_hitters <- batters_2024 %>%
  arrange(desc(`wRC+`)) %>%
  select(Name, Team, PA, AVG, OBP, SLG, wOBA, `wRC+`, WAR) %>%
  head(10)

print(top_hitters)

Output (example, first five rows shown):

# A tibble: 10 × 9
   Name              Team     PA   AVG   OBP   SLG  wOBA `wRC+`   WAR
   <chr>             <chr> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
 1 Aaron Judge       NYY     704 0.322 0.458 0.701 0.457    218   10.8
 2 Shohei Ohtani     LAD     731 0.310 0.390 0.646 0.435    194    9.2
 3 Juan Soto         NYY     713 0.288 0.419 0.569 0.419    178    8.1
 4 Bobby Witt Jr.    KC      708 0.332 0.389 0.588 0.399    162    8.5
 5 Francisco Lindor  NYM     702 0.273 0.344 0.500 0.362    127    6.9

# Get batting data for a specific range without qualification minimum
all_batters_2024 <- fg_batter_leaders(
  startseason = 2024,
  endseason = 2024,
  qual = 0  # 0 = no minimum, returns all players
)

# Filter for Yankees players
yankees_batters <- all_batters_2024 %>%
  filter(Team == "NYY") %>%
  arrange(desc(PA))

# Multi-year data: Career progression
judge_career <- fg_batter_leaders(
  startseason = 2016,  # Judge's debut
  endseason = 2024,
  qual = 0,
  ind = 1  # One row per season instead of a single multi-year aggregate
) %>%
  filter(Name == "Aaron Judge") %>%
  select(Season, Age, G, PA, HR, AVG, OBP, SLG, `wRC+`, WAR) %>%
  arrange(Season)

print(judge_career)

Advanced Batting Metrics

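# Get additional metrics: plate discipline, batted ball data
# Note: Available columns depend on the leaderboard type and package version;
# run names() on the result to confirm before selecting

# Standard batting with plate discipline
# (arguments are named because their positions differ between baseballr versions)
discipline_leaders <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 300) %>%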
  select(Name, Team, PA, `BB%`, `K%`, `BB/K`, `O-Swing%`, `Z-Swing%`, `SwStr%`) %>%
  arrange(`K%`)  # Lowest strikeout rates

print(discipline_leaders %>% head(10))

# Players with excellent plate discipline (low K%, high BB%)
elite_discipline <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 400) %>%
  filter(`K%` < 15, `BB%` > 12) %>%
  select(Name, Team, PA, `BB%`, `K%`, `BB/K`, OBP, `wRC+`) %>%
  arrange(desc(`BB/K`))

print(elite_discipline)

Pitching Leaders

# Get 2024 pitching leaders (qualified: 1 IP per team game = 162 IP)
pitchers_2024 <- fg_pitcher_leaders(
  startseason = 2024,
  endseason = 2024,
  qual = 162  # Qualified starters
)

# Preview
glimpse(pitchers_2024)

# Top pitchers by FIP (Fielding Independent Pitching)
top_pitchers_fip <- pitchers_2024 %>%
  arrange(FIP) %>%
  select(Name, Team, IP, ERA, FIP, `xFIP`, `K/9`, `BB/9`, `HR/9`, WAR) %>%
  head(10)

print(top_pitchers_fip)

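# Approximate relievers by workload: 50 to 100 IP (this range also catches
# starters who missed time, so treat it as a rough filter)
relievers_2024 <- fg_pitcher_leaders(startseason = 2024, endseason = 2024, qual = 0) %>%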
  filter(IP >= 50, IP < 100) %>%
  arrange(desc(WAR)) %>%
  select(Name, Team, IP, ERA, FIP, `K/9`, `BB/9`, `K%`, SV, WAR)

print(relievers_2024 %>% head(15))
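
The qualification minimums passed as qual in the examples above come from simple per-game arithmetic: 3.1 plate appearances per team game for batters and 1 inning pitched per team game for pitchers. The sketch below reproduces that arithmetic in Python (the language used for added examples in this chapter) so the thresholds can be recomputed for a schedule of a different length. The function name and the rounding rule are assumptions made for illustration, not part of baseballr or FanGraphs.

```python
# Rate-stat qualification thresholds, following the per-game rules quoted above.
PA_PER_TEAM_GAME = 3.1  # plate appearances per scheduled team game (batters)
IP_PER_TEAM_GAME = 1.0  # innings pitched per scheduled team game (pitchers)


def qualification_thresholds(team_games=162):
    """Return the minimum PA and IP needed to qualify over `team_games` games.

    Hypothetical helper: rounding to the nearest whole number is an assumption.
    """
    return {
        "min_pa": round(PA_PER_TEAM_GAME * team_games),
        "min_ip": round(IP_PER_TEAM_GAME * team_games),
    }


full_season = qualification_thresholds()           # standard 162-game schedule
shortened = qualification_thresholds(team_games=60)  # a 60-game schedule

print(full_season)
print(shortened)
```

The full-season values match the qual = 502 and qual = 162 arguments used in the R calls.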

Batted Ball Data

# Get batted ball statistics
batted_ball_leaders <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 300) %>%
  select(
    Name, Team, PA,
    `GB%`, `FB%`, `LD%`, `IFFB%`, `Pull%`, `Cent%`, `Oppo%`,
    `Soft%`, `Med%`, `Hard%`,
    `HR/FB`
  ) %>%
  arrange(desc(`Hard%`))  # Highest hard-hit rate

print(batted_ball_leaders %>% head(10))

# Extreme fly ball hitters with power
power_profile <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 400) %>%
  filter(`FB%` > 40, `HR/FB` > 15) %>%
  select(Name, Team, HR, `FB%`, `HR/FB`, `Pull%`, `Hard%`, ISO, SLG) %>%
  arrange(desc(HR))

print(power_profile)

3.1.3 Baseball Reference Data Access {#baseballr-bbref}

Baseball Reference is another premier source, particularly valued for its historical depth and park-adjusted statistics.

# Get Baseball Reference batting data
# Note: Function names and parameters have changed between baseballr versions;
# check the documentation for the version you have installed

# Batting statistics aggregated over a date range
bref_batting_2024 <- bref_daily_batter(t1 = "2024-04-01", t2 = "2024-09-30")

# Pitching statistics aggregated over a date range
bref_pitching_2024 <- bref_daily_pitcher(t1 = "2024-04-01", t2 = "2024-09-30")

# Game-by-game results for one team (schedule, scores, record), not batting totals
team_results_2024 <- bref_team_results(Tm = "NYY", year = 2024)

# Standings
standings_2024 <- bref_standings_on_date(date = "2024-09-30", division = "AL East")

print(standings_2024)

Note: Baseball Reference functions can be rate-limited. Be respectful of their servers and cache results locally for repeated analyses.
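
One way to follow the caching advice above is to wrap any download in a small disk cache, so a repeated analysis reads a local file instead of hitting the server again. The sketch below uses only the Python standard library. The helper name cached_fetch, the one-day freshness window, and the stand-in fetch function are all assumptions made for illustration.

```python
import pickle
import tempfile
import time
from pathlib import Path


def cached_fetch(cache_path, fetch_fn, max_age_seconds=24 * 3600):
    """Return cached data if it is fresh; otherwise call fetch_fn and store the result."""
    path = Path(cache_path)
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        with path.open("rb") as f:
            return pickle.load(f)
    data = fetch_fn()  # the only place the remote server would be contacted
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(data, f)
    return data


# Stand-in for a real download call (hypothetical placeholder data).
calls = []

def fake_fetch():
    calls.append(1)
    return [{"row": 1}, {"row": 2}]


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "standings.pkl"
    first = cached_fetch(target, fake_fetch)   # runs fake_fetch and writes the file
    second = cached_fetch(target, fake_fetch)  # served from disk, no second fetch

print(len(calls), first == second)
```

In practice fetch_fn would be a zero-argument function that performs the actual request, and the cache file would live in a project directory rather than a temporary one.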

3.1.4 Statcast Data Access (statcast_search) {#baseballr-statcast}

This is one of the most powerful features of baseballr: direct access to Baseball Savant's Statcast data. Statcast tracks every pitch and batted ball with high-precision measurements: exit velocity, launch angle, spin rate, pitch movement, and more.

Basic Statcast Search

library(baseballr)

# Get all Statcast data for a date range
# Note: Large date ranges will take time due to data volume

# Single day of data
statcast_single_day <- statcast_search(
  start_date = "2024-07-15",
  end_date = "2024-07-15",
  playerid = NULL  # NULL = all players
)

glimpse(statcast_single_day)
dim(statcast_single_day)  # Thousands of pitches

# Key columns:
# pitch_type, release_speed, release_pos_x/y/z, pfx_x/pfx_z (movement),
# plate_x, plate_z, vx0, vy0, vz0 (velocity components),
# ax, ay, az (acceleration), sz_top, sz_bot (strike zone),
# hit_distance_sc, launch_speed, launch_angle, barrel, events, description

Player-Specific Statcast Data

# Aaron Judge's player ID (MLB ID)
judge_mlbam_id <- 592450

# Get Judge's Statcast data for the 2024 season (in chunks)
# Note: Baseball Savant caps the number of rows returned per query, so long
# date ranges are safer to fetch in short windows and combine afterwards

# Helper function to query in two-week chunks
get_statcast_season <- function(player_id, year) {
  season_start <- as.Date(paste0(year, "-03-28"))
  season_end <- as.Date(paste0(year, "-09-30"))

  # Non-overlapping windows: each covers 14 days, the last is cut at season_end.
  # Both endpoints of a query are inclusive, so reusing one chunk's end date as
  # the next chunk's start date would download that day's pitches twice.
  chunk_starts <- seq(season_start, season_end, by = "14 days")
  chunk_ends <- pmin(chunk_starts + 13, season_end)

  all_data <- vector("list", length(chunk_starts))

  for (i in seq_along(chunk_starts)) {
    message(paste("Fetching data from", chunk_starts[i], "to", chunk_ends[i]))

    all_data[[i]] <- statcast_search(
      start_date = as.character(chunk_starts[i]),
      end_date = as.character(chunk_ends[i]),
      playerid = player_id,
      player_type = "batter"
    )

    Sys.sleep(2)  # Be nice to the API
  }

  bind_rows(all_data)
}

# Get Judge's 2024 Statcast data
judge_statcast_2024 <- get_statcast_season(judge_mlbam_id, 2024)

# Analyze Judge's batted balls
judge_batted_balls <- judge_statcast_2024 %>%
  filter(!is.na(launch_speed), !is.na(launch_angle)) %>%
  select(
    game_date, events, description,
    launch_speed, launch_angle, hit_distance_sc,
    barrel, babip_value, estimated_ba_using_speedangle, estimated_woba_using_speedangle
  )

# Summary statistics
judge_ev_summary <- judge_batted_balls %>%
  summarize(
    batted_balls = n(),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    max_exit_velo = max(launch_speed, na.rm = TRUE),
    avg_launch_angle = mean(launch_angle, na.rm = TRUE),
    barrel_rate = sum(barrel == 1, na.rm = TRUE) / n() * 100,
    hard_hit_rate = sum(launch_speed >= 95, na.rm = TRUE) / n() * 100
  )

print(judge_ev_summary)

Output (example):

  batted_balls avg_exit_velo max_exit_velo avg_launch_angle barrel_rate hard_hit_rate
1          503          92.3         121.4             14.2        15.3          52.1
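
The two-week chunking used by the helper above depends on one detail that is easy to get wrong: both ends of a date query are inclusive, so consecutive windows must not share a boundary day or that day's pitches are counted twice. The following Python sketch shows one way to build non-overlapping windows with the standard library. The function name date_windows is hypothetical.

```python
from datetime import date, timedelta


def date_windows(start, end, days=14):
    """Split [start, end] into consecutive, non-overlapping inclusive windows."""
    windows = []
    window_start = start
    while window_start <= end:
        # A window of `days` days ends days - 1 after it starts (inclusive ends).
        window_end = min(window_start + timedelta(days=days - 1), end)
        windows.append((window_start, window_end))
        # The next window begins the day after this one ends.
        window_start = window_end + timedelta(days=1)
    return windows


windows = date_windows(date(2024, 3, 28), date(2024, 9, 30))

for window_start, window_end in windows[:3]:
    print(window_start.isoformat(), "to", window_end.isoformat())
```

Each (start, end) pair can then be passed to a date-range query, with the results concatenated afterwards.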

Analyzing Pitch Types and Outcomes

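# Analyze pitches Judge faced
judge_pitches <- judge_statcast_2024 %>%
  filter(!is.na(pitch_type))

# Pitch descriptions that count as a swing, and the subset that are whiffs
swing_desc <- c("foul", "foul_tip", "hit_into_play",
                "swinging_strike", "swinging_strike_blocked")
whiff_desc <- c("swinging_strike", "swinging_strike_blocked")

# Plate appearance outcomes that do not count as an at-bat
non_ab_events <- c("walk", "intent_walk", "hit_by_pitch",
                   "sac_fly", "sac_bunt", "catcher_interf")
hit_events <- c("single", "double", "triple", "home_run")

# Performance by pitch type
pitch_type_performance <- judge_pitches %>%
  mutate(
    # events is only filled in on the pitch that ends a plate appearance
    is_ab = !is.na(events) & events != "" & !(events %in% non_ab_events)
  ) %>%
  group_by(pitch_type) %>%
  summarize(
    pitches_seen = n(),
    swing_rate = mean(description %in% swing_desc),
    whiff_rate = sum(description %in% whiff_desc) / sum(description %in% swing_desc),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    # AVG and SLG divide by at-bats, so walks, HBP, and sacrifices are excluded
    at_bats = sum(is_ab),
    batting_avg = sum(events %in% hit_events) / at_bats,
    slugging = (
      sum(events == "single", na.rm = TRUE) +
      2 * sum(events == "double", na.rm = TRUE) +
      3 * sum(events == "triple", na.rm = TRUE) +
      4 * sum(events == "home_run", na.rm = TRUE)
    ) / at_bats,
    .groups = "drop"
  ) %>%
  arrange(desc(pitches_seen))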

print(pitch_type_performance)
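
Batting average and slugging are both per-at-bat rates, so the denominator should leave out plate appearances that end in a walk, hit by pitch, sacrifice, or catcher's interference. The Python sketch below works through that calculation on a short, made-up list of Statcast-style events values. The list of non-at-bat events is an assumption based on commonly seen event labels and may need extending for rarer outcomes.

```python
from collections import Counter

# Plate appearance outcomes that do not count as an at-bat (assumed label set).
NON_AB_EVENTS = {"walk", "intent_walk", "hit_by_pitch",
                 "sac_fly", "sac_bunt", "catcher_interf"}
TOTAL_BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def avg_and_slg(events):
    """Return (batting average, slugging) from per-pitch event labels."""
    # Pitches that do not end a plate appearance carry None or "" and are skipped.
    counts = Counter(e for e in events if e)
    at_bats = sum(n for e, n in counts.items() if e not in NON_AB_EVENTS)
    if at_bats == 0:
        return None, None
    hits = sum(counts[e] for e in TOTAL_BASES)
    total_bases = sum(bases * counts[e] for e, bases in TOTAL_BASES.items())
    return hits / at_bats, total_bases / at_bats


# Made-up sample: 8 plate appearances, of which 6 are at-bats.
sample = ["single", None, "strikeout", "walk", "home_run", "",
          "field_out", "sac_fly", "double", "field_out"]
avg, slg = avg_and_slg(sample)
print(round(avg, 3), round(slg, 3))
```

Dividing by all plate appearances instead would understate both figures for any hitter who walks often.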

Zone Analysis

# Analyze performance by pitch location
# Divide strike zone into regions

judge_zone_analysis <- judge_pitches %>%
  filter(!is.na(plate_x), !is.na(plate_z)) %>%
  mutate(
    # Horizontal: inside, middle, outside (from catcher's perspective, Judge is RHB)
    zone_horizontal = case_when(
      plate_x < -0.5 ~ "Inside",
      plate_x > 0.5 ~ "Outside",
      TRUE ~ "Middle"
    ),
    # Vertical: high, middle, low
    zone_vertical = case_when(
      plate_z > 3.0 ~ "High",
      plate_z < 2.0 ~ "Low",
      TRUE ~ "Middle"
    ),
    zone = paste(zone_horizontal, zone_vertical, sep = "-")
  )

zone_results <- judge_zone_analysis %>%
  group_by(zone) %>%
  summarize(
    pitches = n(),
    swings = sum(description %in% c("foul", "foul_tip", "hit_into_play",
                                    "swinging_strike", "swinging_strike_blocked")),
    swing_rate = swings / pitches,
    # Contact rate is measured per swing, not per pitch
    contact_rate = sum(description %in% c("foul", "foul_tip", "hit_into_play")) / swings,
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pitches))

print(zone_results)
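
The inside/outside labels in the zone analysis above only hold for a right-handed batter, because plate_x is measured from the catcher's point of view. A minimal Python sketch of the same bucketing, extended to mirror the horizontal axis for left-handed batters, is shown below. The function name and the bats argument are assumptions for illustration; the cutoffs (in feet) are the ones used in the R code.

```python
def classify_zone(plate_x, plate_z, bats="R"):
    """Label a pitch location as '<horizontal>-<vertical>' for the given batter side.

    plate_x and plate_z are in feet; negative plate_x is the catcher's left,
    which is the inside part of the plate for a right-handed batter.
    """
    # Mirror the horizontal axis so negative always means inside.
    x = -plate_x if bats == "L" else plate_x

    if x < -0.5:
        horizontal = "Inside"
    elif x > 0.5:
        horizontal = "Outside"
    else:
        horizontal = "Middle"

    if plate_z > 3.0:
        vertical = "High"
    elif plate_z < 2.0:
        vertical = "Low"
    else:
        vertical = "Middle"

    return f"{horizontal}-{vertical}"


print(classify_zone(-0.8, 3.2))           # right-handed batter
print(classify_zone(0.8, 1.5, bats="L"))  # same side of the plate is inside for a lefty
```

Fixed vertical cutoffs ignore batter height; the sz_top and sz_bot columns mentioned earlier could be used instead for a per-batter zone.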

Pitcher-Specific Statcast Data

# Gerrit Cole's player ID
cole_mlbam_id <- 543037

# Get Cole's pitching data for a specific game or date range
cole_statcast <- statcast_search(
  start_date = "2024-06-01",
  end_date = "2024-06-30",
  playerid = cole_mlbam_id,
  player_type = "pitcher"
)

# Analyze Cole's arsenal
cole_arsenal <- cole_statcast %>%
  filter(!is.na(pitch_type)) %>%
  group_by(pitch_type) %>%
  summarize(
    pitches = n(),
    usage_rate = n() / nrow(cole_statcast) * 100,
    avg_velo = mean(release_speed, na.rm = TRUE),
    max_velo = max(release_speed, na.rm = TRUE),
    avg_spin = mean(release_spin_rate, na.rm = TRUE),
    whiff_rate = sum(description %in% c("swinging_strike", "swinging_strike_blocked")) / sum(description %in% c("foul", "hit_into_play", "swinging_strike", "swinging_strike_blocked")) * 100,
    zone_rate = sum(zone %in% 1:9) / n() * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(pitches))

print(cole_arsenal)

Output (example):

  pitch_type pitches usage_rate avg_velo max_velo avg_spin whiff_rate zone_rate
1 FF             345       45.2     97.8    100.2     2345       28.5      52.3
2 SL             198       26.0     86.4     89.1     2687       35.2      44.1
3 CH              89       11.7     88.2     90.5     1876       22.8      48.3
4 CU              76       10.0     81.5     84.2     2543       31.5      38.9
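
The usage and whiff columns in the arsenal table reduce to two ratios per pitch type: pitches of that type over all pitches, and swinging strikes over swings. The sketch below computes both in plain Python from made-up (pitch_type, description) pairs, using the same description labels as the R code above. The function name is hypothetical, and a pitch type with no swings gets None rather than a division error.

```python
from collections import defaultdict

# Same swing and whiff definitions as the arsenal summary above.
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}
SWINGS = WHIFFS | {"foul", "hit_into_play"}


def arsenal_summary(pitches):
    """Return {pitch_type: (count, usage %, whiff % or None)} from (type, description) pairs."""
    counts = defaultdict(lambda: {"n": 0, "swings": 0, "whiffs": 0})
    for pitch_type, description in pitches:
        row = counts[pitch_type]
        row["n"] += 1
        row["swings"] += description in SWINGS
        row["whiffs"] += description in WHIFFS

    total = len(pitches)
    summary = {}
    for pitch_type, row in counts.items():
        usage = row["n"] / total * 100
        whiff = row["whiffs"] / row["swings"] * 100 if row["swings"] else None
        summary[pitch_type] = (row["n"], usage, whiff)
    return summary


# Made-up sample of eight pitches.
sample_pitches = [
    ("FF", "swinging_strike"), ("FF", "ball"), ("FF", "foul"), ("FF", "called_strike"),
    ("SL", "swinging_strike_blocked"), ("SL", "hit_into_play"), ("SL", "ball"),
    ("CH", "ball"),
]
summary = arsenal_summary(sample_pitches)
print(summary)
```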

3.1.5 MLB Stats API Functions {#baseballr-mlb-api}

The MLB Stats API provides official data directly from MLB. It includes real-time game data, player information, team rosters, and more.

# Get current MLB teams
teams <- mlb_teams(season = 2024, sport_ids = 1)  # sport_ids = 1 selects MLB

print(teams %>% select(team_id, team_full_name, division_name, league_name))

# Get team roster (roster_type controls which roster is returned)
yankees_roster <- mlb_rosters(team_id = 147, season = 2024, roster_type = "fullSeason")  # 147 = Yankees

print(yankees_roster %>% select(person_full_name, position_name, jersey_number))

# Get player information
judge_info <- mlb_people(person_ids = 592450)

print(judge_info)

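# Get the 2024 MLB schedule, then narrow it to one team
# mlb_schedule() takes a season and a level, so team and game type are filtered afterwards
schedule_2024 <- mlb_schedule(season = 2024, level_ids = "1")  # level 1 = MLB

yankees_schedule <- schedule_2024 %>%
  filter(
    game_type == "R",  # R = Regular season
    teams_home_team_id == 147 | teams_away_team_id == 147
  )

# Games played in a date range
# Column names may differ between baseballr versions; check names(schedule_2024)
june_games <- yankees_schedule %>%
  filter(date >= "2024-06-01", date <= "2024-06-30") %>%
  select(date, game_pk, teams_home_team_name, teams_away_team_name,
         teams_home_score, teams_away_score)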

print(june_games)

# Get live game data (if a game is in progress)
# game_pk is the unique game identifier
# live_game_data <- mlb_game_linescore(game_pk = 747106)

3.2 The pybaseball Library (Python)

The pybaseball library, created by James LeDoux, is the Python equivalent of baseballr, providing comprehensive access to baseball data sources through a clean, Pythonic interface.

3.2.1 Installation and Setup {#pybaseball-install}

# Install from PyPI
# pip install pybaseball

# Import
import pybaseball as pyb
from pybaseball import (
    batting_stats, pitching_stats,
    statcast, statcast_batter, statcast_pitcher,
    playerid_lookup, playerid_reverse_lookup
)
import pandas as pd
import numpy as np

# Enable cache to avoid repeated API calls
pyb.cache.enable()

# Check version
print(pyb.__version__)

Key features:

  • Clean, consistent function names
  • Returns pandas DataFrames
  • Built-in caching system
  • Comprehensive FanGraphs, Statcast, and Lahman access
  • Active development and community support

3.2.2 Batting and Pitching Stats (batting_stats, pitching_stats) {#pybaseball-batting-pitching}

Batting Statistics from FanGraphs

from pybaseball import batting_stats
import pandas as pd

# Get 2024 batting statistics (qualified batters)
batters_2024 = batting_stats(2024, qual=502)

print(batters_2024.shape)
print(batters_2024.columns.tolist())

# Preview top performers
top_wrc_plus = batters_2024.nlargest(10, 'wRC+')[
    ['Name', 'Team', 'PA', 'AVG', 'OBP', 'SLG', 'wRC+', 'WAR']
]
print(top_wrc_plus)

Output (example, first rows shown):

              Name Team   PA    AVG    OBP    SLG  wRC+   WAR
0     Aaron Judge  NYY  704  0.322  0.458  0.701   218  10.8
1   Shohei Ohtani  LAD  731  0.310  0.390  0.646   194   9.2
2       Juan Soto  NYY  713  0.288  0.419  0.569   178   8.1
3  Bobby Witt Jr.   KC  708  0.332  0.389  0.588   162   8.5

# All batters (no qualification minimum)
all_batters_2024 = batting_stats(2024, qual=0)

# Filter for Yankees
yankees = all_batters_2024[all_batters_2024['Team'] == 'NYY'].sort_values('PA', ascending=False)
print(yankees[['Name', 'PA', 'HR', 'AVG', 'OPS', 'WAR']].head(10))

# Multi-year data for a player: one call covers the whole range and,
# by default, returns one row per player-season
judge_career = batting_stats(2016, 2024, qual=0).query('Name == "Aaron Judge"')

judge_career = judge_career[['Season', 'Age', 'G', 'PA', 'HR', 'AVG', 'OBP', 'SLG', 'wRC+', 'WAR']]
print(judge_career.sort_values('Season'))

Advanced Batting Metrics

# Elite plate discipline (low K%, high BB%)
disciplined_hitters = batters_2024[
    (batters_2024['K%'] < 15) &
    (batters_2024['BB%'] > 12)
][['Name', 'Team', 'PA', 'BB%', 'K%', 'BB/K', 'OBP', 'wRC+']].sort_values('BB/K', ascending=False)

print(disciplined_hitters)

# Batted ball leaders
batted_ball_leaders = batters_2024.nlargest(10, 'Hard%')[
    ['Name', 'Team', 'PA', 'Hard%', 'Barrel%', 'GB%', 'FB%', 'Pull%', 'HR', 'ISO']
]
print(batted_ball_leaders)

# Speed and base running
speed_leaders = batters_2024[batters_2024['PA'] >= 400].nlargest(15, 'SB')[
    ['Name', 'Team', 'SB', 'CS', 'SB%', 'Spd', 'wSB', 'UBR']
]
print(speed_leaders)

Pitching Statistics

from pybaseball import pitching_stats

# Get 2024 pitching statistics (qualified starters)
pitchers_2024 = pitching_stats(2024, qual=162)

print(pitchers_2024.shape)

# Top pitchers by FIP
top_fip = pitchers_2024.nsmallest(10, 'FIP')[
    ['Name', 'Team', 'IP', 'ERA', 'FIP', 'xFIP', 'K/9', 'BB/9', 'HR/9', 'WAR']
]
print(top_fip)

# Top strikeout pitchers
strikeout_leaders = pitchers_2024.nlargest(10, 'SO')[
    ['Name', 'Team', 'IP', 'SO', 'K/9', 'K%', 'SwStr%', 'ERA', 'FIP']
]
print(strikeout_leaders)

# Relief pitchers
all_pitchers = pitching_stats(2024, qual=0)
relievers = all_pitchers[
    (all_pitchers['IP'] >= 50) &
    (all_pitchers['IP'] < 100)
].nlargest(15, 'WAR')[
    ['Name', 'Team', 'IP', 'ERA', 'FIP', 'K/9', 'BB/9', 'SV', 'WAR']
]
print(relievers)

3.2.3 Statcast Data (statcast, statcast_batter, statcast_pitcher) {#pybaseball-statcast}

pybaseball provides three main functions for Statcast data: statcast() for date ranges, statcast_batter() for specific batters, and statcast_pitcher() for specific pitchers.

General Statcast Search

from pybaseball import statcast
import pandas as pd

# Get all Statcast data for a date range
# Note: Keep date ranges reasonably small (1-2 weeks) to avoid timeouts
statcast_data = statcast(start_dt='2024-07-15', end_dt='2024-07-20')

print(f"Total pitches: {len(statcast_data)}")
print(statcast_data.columns.tolist())

# Key columns:
# pitch_type, release_speed, release_spin_rate, pfx_x, pfx_z,
# plate_x, plate_z, launch_speed, launch_angle, hit_distance_sc,
# launch_speed_angle (6 = barrel), events, description, zone, balls, strikes, outs_when_up

Batter-Specific Statcast

from pybaseball import statcast_batter, playerid_lookup

# Look up player ID
judge_lookup = playerid_lookup('judge', 'aaron')
print(judge_lookup)

# Aaron Judge's MLBAM ID: 592450
judge_id = 592450

# Get Judge's Statcast data for 2024 season
judge_statcast = statcast_batter(
    start_dt='2024-03-28',
    end_dt='2024-09-30',
    player_id=judge_id
)

print(f"Judge pitches faced: {len(judge_statcast)}")
print(f"Judge batted balls: {judge_statcast['launch_speed'].notna().sum()}")

# Analyze batted balls
judge_batted_balls = judge_statcast.dropna(subset=['launch_speed', 'launch_angle'])

# Exit velocity summary
# Note: launch_speed_angle classifies each batted ball from 1 (weak) to 6 (barrel);
# the raw Statcast export has no separate 'barrel' column
ev_summary = {
    'batted_balls': len(judge_batted_balls),
    'avg_exit_velo': judge_batted_balls['launch_speed'].mean(),
    'max_exit_velo': judge_batted_balls['launch_speed'].max(),
    'min_exit_velo': judge_batted_balls['launch_speed'].min(),
    'avg_launch_angle': judge_batted_balls['launch_angle'].mean(),
    'barrel_rate': (judge_batted_balls['launch_speed_angle'] == 6).sum() / len(judge_batted_balls) * 100,
    'hard_hit_rate': (judge_batted_balls['launch_speed'] >= 95).sum() / len(judge_batted_balls) * 100
}

print(pd.Series(ev_summary))

Output (example):

batted_balls         503.000000
avg_exit_velo         92.345678
max_exit_velo        121.400000
min_exit_velo         52.300000
avg_launch_angle      14.234567
barrel_rate           15.308151
hard_hit_rate         52.087476
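
The barrel rate above relies on a precomputed classification. When only exit velocity and launch angle are available, a barrel flag can be approximated directly from those two numbers. The sketch below uses a widely circulated approximation of the barrel zone; it is an assumption for illustration, not MLB's official definition, and borderline batted balls may be classified differently by Statcast itself.

```python
def is_barrel(launch_speed, launch_angle):
    """Approximate barrel test from exit velocity (mph) and launch angle (degrees).

    Assumed rule of thumb: at least 98 mph, with the accepted launch-angle
    band widening as exit velocity increases, capped at 50 degrees.
    """
    return (
        launch_speed >= 98
        and launch_angle <= 50
        and launch_speed * 1.5 - launch_angle >= 117
        and launch_speed + launch_angle >= 124
    )


def barrel_rate(batted_balls):
    """Percentage of (launch_speed, launch_angle) pairs that pass is_barrel."""
    if not batted_balls:
        return None
    return sum(is_barrel(ev, la) for ev, la in batted_balls) / len(batted_balls) * 100


# Made-up batted balls: (exit velocity, launch angle)
sample_balls = [(98.0, 26.0), (95.0, 28.0), (100.0, 10.0), (108.0, 24.0)]
print(barrel_rate(sample_balls))
```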

Performance by Pitch Type

# Analyze Judge vs. different pitch types
def whiff_rate(d):
    # Swinging strikes divided by all swings (NaN when there were no swings)
    whiffs = d.isin(['swinging_strike', 'swinging_strike_blocked']).sum()
    swings = whiffs + d.isin(['foul', 'hit_into_play']).sum()
    return whiffs / swings if swings else float('nan')

pitch_type_analysis = (
    judge_statcast.dropna(subset=['pitch_type'])
    .groupby('pitch_type')
    .agg(
        pitches=('description', 'size'),
        whiff_rate=('description', whiff_rate),
        avg_ev=('launch_speed', 'mean'),
        avg_la=('launch_angle', 'mean'),
        xwOBA=('estimated_woba_using_speedangle', 'mean'),
    )
    .round(3)
    .sort_values('pitches', ascending=False)
)

print(pitch_type_analysis)

Spray Chart Data

# Get batted ball location data (hc_x, hc_y are coordinates)
judge_hits = judge_batted_balls[judge_batted_balls['events'].isin([
    'single', 'double', 'triple', 'home_run'
])].copy()

# Classify spray direction from the horizontal hit coordinate (simplified)
def classify_spray_direction(row):
    x = row['hc_x']
    # Home plate sits near hc_x = 125. For a right-handed batter such as Judge,
    # lower hc_x = left field (pull) and higher hc_x = right field (opposite).
    if pd.isna(x):
        return 'Unknown'
    elif x < 100:
        return 'Pull'
    elif x > 150:
        return 'Opposite'
    else:
        return 'Center'

judge_hits['spray_direction'] = judge_hits.apply(classify_spray_direction, axis=1)

spray_summary = judge_hits.groupby('spray_direction').agg({
    'events': 'count',
    'launch_speed': 'mean',
    'hit_distance_sc': 'mean'
}).round(2)

spray_summary.columns = ['hits', 'avg_ev', 'avg_distance']
print(spray_summary)
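
Classifying by hc_x alone ignores how deep the ball was hit and which side of the plate the batter stands on. A common refinement converts the hit coordinates into a spray angle measured from home plate. The sketch below does this in plain Python so it runs without downloading data. The home-plate location (125.42, 198.27) is a widely used approximation rather than an official constant, and the 15-degree width of the "Center" band and both helper names are assumptions made for illustration.

```python
import math

# Approximate home-plate location in Statcast hit-coordinate space (assumed constants)
HOME_X, HOME_Y = 125.42, 198.27


def spray_angle(hc_x, hc_y):
    """Angle in degrees from straightaway center; negative values point to left field."""
    # hc_y grows toward home plate, so distance into the field is HOME_Y - hc_y
    return math.degrees(math.atan2(hc_x - HOME_X, HOME_Y - hc_y))


def spray_direction(hc_x, hc_y, stand, width=15.0):
    """Label a batted ball Pull / Center / Opposite for a batter who stands 'R' or 'L'."""
    angle = spray_angle(hc_x, hc_y)
    if abs(angle) <= width:
        return "Center"
    to_left_field = angle < 0
    # Right-handed batters pull to left field, left-handed batters to right field
    pulled = to_left_field if stand == "R" else not to_left_field
    return "Pull" if pulled else "Opposite"


print(spray_direction(60.0, 120.0, "R"))   # Pull
print(spray_direction(60.0, 120.0, "L"))   # Opposite
```

With real data, the same function could be applied row by row using the hc_x, hc_y, and stand columns of a Statcast frame.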

Pitcher-Specific Statcast

from pybaseball import statcast_pitcher

# Gerrit Cole's MLBAM ID: 543037
cole_id = 543037

# Get Cole's pitching data
cole_statcast = statcast_pitcher(
    start_dt='2024-06-01',
    end_dt='2024-06-30',
    player_id=cole_id
)

print(f"Cole pitches thrown: {len(cole_statcast)}")

# Analyze pitch arsenal
def whiff_pct(d):
    # Swinging strikes as a percentage of all swings
    whiffs = d.isin(['swinging_strike', 'swinging_strike_blocked']).sum()
    swings = whiffs + d.isin(['foul', 'hit_into_play']).sum()
    return whiffs / swings * 100 if swings else float('nan')

cole_arsenal = cole_statcast.dropna(subset=['pitch_type']).groupby('pitch_type').agg(
    count=('release_speed', 'size'),
    avg_velo=('release_speed', 'mean'),
    max_velo=('release_speed', 'max'),
    avg_spin=('release_spin_rate', 'mean'),
    h_break=('pfx_x', 'mean'),  # Horizontal movement
    v_break=('pfx_z', 'mean'),  # Vertical movement (induced)
    whiff_pct=('description', whiff_pct),
)
# Statcast reports pfx_x / pfx_z in feet; convert to inches
cole_arsenal[['h_break', 'v_break']] *= 12
cole_arsenal = cole_arsenal.round(2)
cole_arsenal['usage'] = (cole_arsenal['count'] / cole_arsenal['count'].sum() * 100).round(1)
cole_arsenal = cole_arsenal.sort_values('count', ascending=False)

print(cole_arsenal)

Output (example):

           count  avg_velo  max_velo  avg_spin  h_break  v_break  whiff_pct  usage
pitch_type
FF           345     97.82    100.24   2345.23     7.84    15.32      28.45   45.2
SL           198     86.43     89.12   2687.45    -5.23     2.45      35.21   26.0
CH            89     88.23     90.51   1876.34     9.12    -3.21      22.78   11.7
CU            76     81.54     84.23   2543.12    -8.34    -6.78      31.52   10.0

Expected Statistics (xwOBA, xBA)

# Compare actual results to expected results based on batted ball quality
# (batted balls only, so the wOBA columns are on-contact values)
judge_xstats = judge_batted_balls.groupby(
    pd.to_datetime(judge_batted_balls['game_date']).dt.to_period('M')  # Group by month
).agg({
    'estimated_ba_using_speedangle': 'mean',
    'estimated_woba_using_speedangle': 'mean',
    'woba_value': 'mean',
    'launch_speed': 'mean',
    'launch_speed_angle': lambda x: (x == 6).sum() / len(x) * 100  # code 6 = barrel
}).round(3)

judge_xstats.columns = ['xBA', 'xwOBA', 'wOBA', 'avg_EV', 'barrel_rate']
print(judge_xstats)

# Calculate over/under-performance (kept as a separate value to avoid
# writing a new column into a filtered copy of the DataFrame)
woba_diff = (
    judge_batted_balls['woba_value']
    - judge_batted_balls['estimated_woba_using_speedangle']
).mean()

print(f"Average wOBA vs xwOBA difference: {woba_diff:.3f}")
print(f"Judge {'outperformed' if woba_diff > 0 else 'underperformed'} his expected stats")

3.2.4 Player ID Lookups (playerid_lookup) {#pybaseball-player-ids}

Different data sources use different player ID systems. pybaseball provides functions to look up and convert between them.

import pandas as pd
from pybaseball import playerid_lookup, playerid_reverse_lookup

# Look up a player by name
soto_ids = playerid_lookup('soto', 'juan')
print(soto_ids)

Output (example):

  name_first name_last  key_mlbam key_retro  key_bbref  key_fangraphs  mlb_played_first  mlb_played_last
0       juan      soto     665742  sotoj002  sotoju01          20123              2018             2024

# Access different ID systems
soto_mlbam = soto_ids.iloc[0]['key_mlbam']  # For Statcast
soto_fg = soto_ids.iloc[0]['key_fangraphs']  # For FanGraphs
soto_bbref = soto_ids.iloc[0]['key_bbref']  # For Baseball Reference

print(f"Juan Soto MLBAM ID: {soto_mlbam}")
print(f"Juan Soto FanGraphs ID: {soto_fg}")
print(f"Juan Soto Baseball-Reference ID: {soto_bbref}")

# Reverse lookup (by ID)
player_info = playerid_reverse_lookup([665742], key_type='mlbam')
print(player_info)

# Look up multiple players and stack the results into one table
players = pd.concat(
    [
        playerid_lookup('judge', 'aaron'),
        playerid_lookup('ohtani', 'shohei'),
        playerid_lookup('betts', 'mookie'),
    ],
    ignore_index=True,
)

print(players[['name_first', 'name_last', 'key_mlbam', 'key_fangraphs']])
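
Once several lookups have been collected, it is often convenient to index them by one ID system and translate to the others on demand. The following is a minimal sketch of that crosswalk idea in plain Python. The rows are shaped like playerid_lookup output, but the names and IDs are made up, and translate is a hypothetical helper, not part of pybaseball.

```python
# Hypothetical rows shaped like playerid_lookup() output (all IDs are made up)
rows = [
    {"name": "Player A", "key_mlbam": 100001, "key_fangraphs": 9001, "key_bbref": "playea01"},
    {"name": "Player B", "key_mlbam": 100002, "key_fangraphs": 9002, "key_bbref": "playeb01"},
]

# Index once by MLBAM ID so Statcast rows can be mapped to other systems
by_mlbam = {row["key_mlbam"]: row for row in rows}


def translate(mlbam_id, target="key_fangraphs"):
    """Return the ID in the target system, or None if the player is unknown."""
    row = by_mlbam.get(mlbam_id)
    return row[target] if row else None


print(translate(100001))                # 9001
print(translate(100002, "key_bbref"))   # playeb01
print(translate(1))                     # None
```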

3.2.5 Team and Schedule Data {#pybaseball-team-schedule}

import pandas as pd
from pybaseball import schedule_and_record

# Get team schedule and results
yankees_2024 = schedule_and_record(2024, 'NYY')

print(yankees_2024.head(10))
print(yankees_2024.columns.tolist())

# Analyze home vs. road performance
yankees_2024['is_home'] = yankees_2024['Home_Away'] == 'Home'
# Walk-off results appear as 'W-wo' / 'L-wo', so match on the prefix
yankees_2024['win'] = yankees_2024['W/L'].str.startswith('W', na=False)

home_away_split = yankees_2024.groupby('is_home').agg({
    'win': ['sum', 'count', 'mean']
})
home_away_split.columns = ['wins', 'games', 'win_pct']

print(home_away_split)

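# Performance by month
# 'Date' looks like 'Thursday, Mar 28' (no year; doubleheaders add '(1)'),
# so extract the month abbreviation instead of parsing the whole string
month_abbr = yankees_2024['Date'].str.extract(r'([A-Z][a-z]{2}) \d+', expand=False)
yankees_2024['month'] = pd.to_datetime(month_abbr, format='%b').dt.month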
monthly_performance = yankees_2024.groupby('month').agg({
    'win': ['sum', 'count', 'mean'],
    'R': 'mean',
    'RA': 'mean'
}).round(3)
monthly_performance.columns = ['wins', 'games', 'win_pct', 'runs_per_game', 'runs_allowed']

print(monthly_performance)

# Winning streaks
yankees_2024['streak'] = (yankees_2024['win'] != yankees_2024['win'].shift()).cumsum()
streaks = yankees_2024[yankees_2024['win']].groupby('streak').size()
longest_win_streak = streaks.max()

print(f"Longest winning streak: {longest_win_streak} games")
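
The streak calculation above labels each run of identical results and then measures the runs of wins. The same run-length idea can be sketched with itertools.groupby from the standard library. The results list below is made up, and longest_streak is a hypothetical helper. Matching on the prefix is an assumption that walk-off results are recorded as "W-wo" and "L-wo", as they are on Baseball Reference.

```python
from itertools import groupby

# Hypothetical W/L values; "-wo" marks a walk-off result
results = ["W", "W-wo", "L", "W", "W", "W", "L-wo", "L", "W"]


def longest_streak(results, prefix="W"):
    """Length of the longest run of consecutive results that start with prefix."""
    runs = (
        sum(1 for _ in run)
        for is_match, run in groupby(results, key=lambda r: r.startswith(prefix))
        if is_match
    )
    return max(runs, default=0)


print(longest_streak(results))              # 3
print(longest_streak(results, prefix="L"))  # 2
```

Passing default=0 keeps the function well defined for a team with no wins (or an empty schedule).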


3.3 The Lahman Database

The Lahman Database is the crown jewel of historical baseball data. Maintained by Sean Lahman, it contains complete batting, pitching, and fielding statistics from 1871 to the present, along with biographical data, awards, and team information.

3.3.1 History and Structure {#lahman-history}

History:


  • Started by Sean Lahman in the 1990s

  • Covers all of Major League Baseball from 1871-present

  • Free and open source

  • Updated annually

  • Used in books like Moneyball and by MLB teams

Key Features:


  • Complete player statistics (every player who ever appeared in MLB)

  • Biographical information (birthplace, birth date, death date)

  • Team records and franchise history

  • Awards (MVP, Cy Young, Gold Glove, Hall of Fame)

  • Post-season statistics

  • Salary data (1985-2016; the Salaries table has not been extended since)

  • Normalized structure with primary/foreign keys

3.3.2 Key Tables {#lahman-tables}

The database consists of multiple interconnected tables:

  • People: Biographical info (playerID, nameFirst, nameLast, birthYear, birthCountry, debut, finalGame)
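  • Batting: Season batting stats (playerID, yearID, stint, teamID, G, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, etc.). Rate stats such as AVG are not stored and must be computed. The R package names the doubles and triples columns X2B and X3B.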
  • Pitching: Season pitching stats (playerID, yearID, W, L, G, GS, CG, SHO, SV, IPouts, H, ER, HR, BB, SO, ERA, etc.)
  • Fielding: Defensive stats by position (POS, G, GS, InnOuts, PO, A, E, DP)
  • Teams: Team season records (yearID, lgID, teamID, W, L, R, RA, attendance)
  • Salaries: Player salaries (playerID, yearID, teamID, salary)
  • AwardsPlayers: Individual awards (playerID, awardID, yearID, lgID)
  • HallOfFame: Hall of Fame voting (playerID, yearID, votedBy, ballots, votes, inducted)
  • AllstarFull: All-Star game appearances
  • BattingPost/PitchingPost: Postseason statistics
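
Because the tables share keys such as playerID, most analyses follow the same pattern: aggregate a statistics table, then join to People for names. The sketch below illustrates that pattern with Python's built-in sqlite3 module and two tiny made-up tables. The column subset and the (playerID, yearID, stint) key are assumptions modeled on the table descriptions above, and the players are fictional.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE People (
        playerID  TEXT PRIMARY KEY,
        nameFirst TEXT,
        nameLast  TEXT
    );
    CREATE TABLE Batting (
        playerID TEXT REFERENCES People(playerID),
        yearID   INTEGER,
        stint    INTEGER,
        teamID   TEXT,
        HR       INTEGER,
        PRIMARY KEY (playerID, yearID, stint)
    );
    -- Made-up rows for illustration only
    INSERT INTO People VALUES ('doejo01', 'John', 'Doe'), ('roeja01', 'Jane', 'Roe');
    INSERT INTO Batting VALUES
        ('doejo01', 2001, 1, 'AAA', 10),
        ('doejo01', 2001, 2, 'BBB', 5),
        ('doejo01', 2002, 1, 'BBB', 20),
        ('roeja01', 2002, 1, 'AAA', 30);
""")

# Aggregate per player, then join to People for display names
career = con.execute("""
    SELECT p.nameFirst || ' ' || p.nameLast AS name,
           SUM(b.HR)                AS total_hr,
           COUNT(DISTINCT b.yearID) AS seasons
    FROM Batting AS b
    JOIN People  AS p ON p.playerID = b.playerID
    GROUP BY b.playerID
    ORDER BY total_hr DESC, name
""").fetchall()

print(career)  # [('John Doe', 35, 2), ('Jane Roe', 30, 1)]
```

Note that John Doe has two rows for 2001 (one per stint), which is why seasons are counted with COUNT(DISTINCT yearID) rather than a plain row count.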

3.3.3 Accessing in R (Lahman package) {#lahman-r}

# Install and load the Lahman package
# install.packages("Lahman")
library(Lahman)
library(tidyverse)

# The package loads tables as data frames
# Main tables: People, Batting, Pitching, Fielding, Teams, Salaries

# Explore available tables
data(package = "Lahman")

# View the People table
glimpse(People)
head(People %>% select(playerID, nameFirst, nameLast, birthYear, birthCountry, debut, finalGame))

# View Batting table
glimpse(Batting)
head(Batting)

Example: Career Home Run Leaders

# Calculate career home runs
career_hr <- Batting %>%
  group_by(playerID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    seasons = n_distinct(yearID),  # n() would count stints, not seasons
    first_year = min(yearID),
    last_year = max(yearID),
    .groups = "drop"
  ) %>%
  arrange(desc(total_hr)) %>%
  head(20)

# Join with People to get names
career_hr_leaders <- career_hr %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast)) %>%
  select(name, total_hr, seasons, first_year, last_year)

print(career_hr_leaders)

Output:

            name total_hr seasons first_year last_year
1    Barry Bonds      762      22       1986      2007
2     Hank Aaron      755      23       1954      1976
3      Babe Ruth      714      22       1914      1935
4  Albert Pujols      703      22       2001      2022
5 Alex Rodriguez      696      22       1994      2016
6    Willie Mays      660      22       1951      1973

Example: Single-Season Batting Records

# Best single seasons by various metrics
best_seasons <- Batting %>%
  filter(AB >= 400) %>%  # Qualified seasons
  mutate(
    AVG = H / AB,
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
    # The Lahman R package names doubles and triples X2B and X3B
    SLG = (H - X2B - X3B - HR + 2*X2B + 3*X3B + 4*HR) / AB,
    OPS = OBP + SLG
  ) %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast))

# Highest single-season batting average
best_avg <- best_seasons %>%
  arrange(desc(AVG)) %>%
  select(name, yearID, teamID, AVG, H, AB) %>%
  head(10)

print(best_avg)

# Most home runs in a season
most_hr <- best_seasons %>%
  arrange(desc(HR)) %>%
  select(name, yearID, teamID, HR, AB, AVG, OPS) %>%
  head(10)

print(most_hr)

Example: Team Analysis

# Yankees history
yankees_history <- Teams %>%
  filter(teamID == "NYA") %>%  # NYA = New York Yankees (AL)
  arrange(yearID) %>%
  mutate(
    win_pct = W / (W + L),
    run_diff = R - RA
  ) %>%
  select(yearID, W, L, win_pct, R, RA, run_diff, attendance, Rank, WSWin)

# Best Yankees seasons
best_yankees <- yankees_history %>%
  arrange(desc(win_pct)) %>%
  select(yearID, W, L, win_pct, run_diff, WSWin) %>%
  head(10)

print(best_yankees)

# World Series wins
yankees_championships <- yankees_history %>%
  filter(WSWin == "Y") %>%
  select(yearID, W, L, win_pct, R, RA)

print(paste("Yankees World Series titles:", nrow(yankees_championships)))

Example: Historical Trends

# Evolution of home runs over time
hr_by_era <- Batting %>%
  group_by(yearID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    total_ab = sum(AB, na.rm = TRUE),
    hr_rate = total_hr / total_ab * 100,
    .groups = "drop"
  ) %>%
  filter(yearID >= 1920)  # Deadball era ended around 1920

# Identify eras
hr_by_era <- hr_by_era %>%
  mutate(
    era = case_when(
      yearID < 1947 ~ "Pre-Integration",
      yearID < 1961 ~ "Integration Era",
      yearID < 1994 ~ "Pre-Strike",
      yearID < 2006 ~ "Steroid Era",
      TRUE ~ "Modern Era"
    )
  )

# Average HR rate by era
era_summary <- hr_by_era %>%
  group_by(era) %>%
  summarize(
    years = n(),
    avg_hr_rate = mean(hr_rate),
    .groups = "drop"
  )

print(era_summary)

# Visualize (requires ggplot2)
library(ggplot2)
ggplot(hr_by_era, aes(x = yearID, y = hr_rate, color = era)) +
  geom_line(linewidth = 1) +  # 'size' for lines is deprecated since ggplot2 3.4.0
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed") +
  labs(
    title = "Home Run Rate Over Time (1920-Present)",
    x = "Year",
    y = "HR Rate (% of AB)",
    color = "Era"
  ) +
  theme_minimal()

3.3.4 Accessing in Python {#lahman-python}

# The pybaseball library includes Lahman data
from pybaseball import lahman
import pandas as pd

# Load tables
people = lahman.people()
batting = lahman.batting()
pitching = lahman.pitching()
fielding = lahman.fielding()
teams = lahman.teams()
salaries = lahman.salaries()

# Explore structure
print(people.head())
print(people.columns.tolist())

print(batting.head())
print(batting.columns.tolist())

Example: Career Home Run Leaders

# Calculate career home runs
career_hr = batting.groupby('playerID').agg({
    'HR': 'sum',
    'yearID': ['nunique', 'min', 'max']  # nunique counts seasons, not stints
}).reset_index()

career_hr.columns = ['playerID', 'total_HR', 'seasons', 'first_year', 'last_year']
career_hr = career_hr.nlargest(20, 'total_HR')

# Merge with people for names
career_hr_leaders = career_hr.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
career_hr_leaders['name'] = career_hr_leaders['nameFirst'] + ' ' + career_hr_leaders['nameLast']
career_hr_leaders = career_hr_leaders[['name', 'total_HR', 'seasons', 'first_year', 'last_year']]

print(career_hr_leaders)

Example: Best Single Seasons

# Calculate rate statistics
batting['AVG'] = batting['H'] / batting['AB']
batting['OBP'] = (batting['H'] + batting['BB'] + batting['HBP']) / (
    batting['AB'] + batting['BB'] + batting['HBP'] + batting['SF']
)
batting['SLG'] = (
    batting['H'] - batting['2B'] - batting['3B'] - batting['HR'] +
    2*batting['2B'] + 3*batting['3B'] + 4*batting['HR']
) / batting['AB']
batting['OPS'] = batting['OBP'] + batting['SLG']

# Filter for qualified seasons
qualified = batting[batting['AB'] >= 400].copy()

# Merge with people
qualified = qualified.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
qualified['name'] = qualified['nameFirst'] + ' ' + qualified['nameLast']

# Best batting averages
best_avg = qualified.nlargest(10, 'AVG')[
    ['name', 'yearID', 'teamID', 'AVG', 'H', 'AB']
].round(3)

print(best_avg)

# Most home runs
most_hr = qualified.nlargest(10, 'HR')[
    ['name', 'yearID', 'teamID', 'HR', 'AB', 'AVG', 'OPS']
].round(3)

print(most_hr)
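
To make the rate-stat formulas above concrete, here is a small worked example in plain Python. The season line is hypothetical, not a real player's, and rate_stats is an illustrative helper rather than part of any library.

```python
def rate_stats(AB, H, doubles, triples, HR, BB, HBP, SF):
    """Return (AVG, OBP, SLG, OPS) from counting stats."""
    singles = H - doubles - triples - HR
    total_bases = singles + 2 * doubles + 3 * triples + 4 * HR
    avg = H / AB
    obp = (H + BB + HBP) / (AB + BB + HBP + SF)
    slg = total_bases / AB
    return avg, obp, slg, obp + slg


# Hypothetical season line
avg, obp, slg, ops = rate_stats(
    AB=500, H=150, doubles=30, triples=5, HR=20, BB=60, HBP=5, SF=5
)
print(f"AVG {avg:.3f}  OBP {obp:.3f}  SLG {slg:.3f}  OPS {ops:.3f}")
# AVG 0.300  OBP 0.377  SLG 0.500  OPS 0.877
```

Here the 150 hits break down into 95 singles, 30 doubles, 5 triples, and 20 home runs, for 250 total bases in 500 at-bats.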

Example: Salary Analysis

# Most recent salary seasons (the Salaries table ends in 2016; values are not inflation-adjusted)
recent_salaries = salaries[salaries['yearID'] >= 2010].copy()

# Top earners by year
top_salaries_by_year = recent_salaries.loc[
    recent_salaries.groupby('yearID')['salary'].idxmax()
]

top_salaries_with_names = top_salaries_by_year.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
top_salaries_with_names['name'] = top_salaries_with_names['nameFirst'] + ' ' + top_salaries_with_names['nameLast']

print(top_salaries_with_names[['yearID', 'name', 'teamID', 'salary']].sort_values('yearID'))

# Average salary by year
avg_salary_by_year = recent_salaries.groupby('yearID').agg({
    'salary': ['mean', 'median', 'max']
}).round(0)

avg_salary_by_year.columns = ['mean_salary', 'median_salary', 'max_salary']
print(avg_salary_by_year)

# Salary vs. performance
# Merge 2016 salaries (the last season in the Salaries table) with 2016 batting stats
salaries_2016 = salaries[salaries['yearID'] == 2016]
batting_2016 = batting[batting['yearID'] == 2016]

salary_performance = salaries_2016.merge(batting_2016, on=['playerID', 'yearID', 'teamID'])
salary_performance = salary_performance[salary_performance['AB'] >= 400].copy()

# Calculate a rough value metric
# Note: Lahman doesn't include WAR, so at-bats per $1M stands in for illustration
salary_performance['AB_per_million'] = salary_performance['AB'] / salary_performance['salary'] * 1_000_000

correlation = salary_performance[['salary', 'HR', 'H', 'AB']].corr()
print(correlation)

R
# Install and load the Lahman package
# install.packages("Lahman")
library(Lahman)
library(tidyverse)

# The package loads tables as data frames
# Main tables: People, Batting, Pitching, Fielding, Teams, Salaries

# Explore available tables
data(package = "Lahman")

# View the People table
glimpse(People)
head(People %>% select(playerID, nameFirst, nameLast, birthYear, birthCountry, debut, finalGame))

# View Batting table
glimpse(Batting)
head(Batting)
R
# Calculate career home runs
career_hr <- Batting %>%
  group_by(playerID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    seasons = n(),
    first_year = min(yearID),
    last_year = max(yearID),
    .groups = "drop"
  ) %>%
  arrange(desc(total_hr)) %>%
  head(20)

# Join with People to get names
career_hr_leaders <- career_hr %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast)) %>%
  select(name, total_hr, seasons, first_year, last_year)

print(career_hr_leaders)
R
name total_hr seasons first_year last_year
1         Barry Bonds      762      22       1986      2007
2         Hank Aaron      755      23       1954      1976
3          Babe Ruth      714      22       1914      1935
4       Alex Rodriguez      696      22       1994      2016
5        Albert Pujols      703      22       2001      2022
6        Willie Mays      660      22       1951      1973
R
# Best single seasons by various metrics
best_seasons <- Batting %>%
  filter(AB >= 400) %>%  # Qualified seasons
  mutate(
    AVG = H / AB,
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
    SLG = (H - `2B` - `3B` - HR + 2*`2B` + 3*`3B` + 4*HR) / AB,
    OPS = OBP + SLG
  ) %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast))

# Highest single-season batting average
best_avg <- best_seasons %>%
  arrange(desc(AVG)) %>%
  select(name, yearID, teamID, AVG, H, AB) %>%
  head(10)

print(best_avg)

# Most home runs in a season
most_hr <- best_seasons %>%
  arrange(desc(HR)) %>%
  select(name, yearID, teamID, HR, AB, AVG, OPS) %>%
  head(10)

print(most_hr)
R
# Yankees history
yankees_history <- Teams %>%
  filter(teamID == "NYA") %>%  # NYA = New York Yankees (AL)
  arrange(yearID) %>%
  mutate(
    win_pct = W / (W + L),
    run_diff = R - RA
  ) %>%
  select(yearID, W, L, win_pct, R, RA, run_diff, attendance, Rank, WSWin)

# Best Yankees seasons
best_yankees <- yankees_history %>%
  arrange(desc(win_pct)) %>%
  select(yearID, W, L, win_pct, run_diff, WSWin) %>%
  head(10)

print(best_yankees)

# World Series wins
yankees_championships <- yankees_history %>%
  filter(WSWin == "Y") %>%
  select(yearID, W, L, win_pct, R, RA)

print(paste("Yankees World Series titles:", nrow(yankees_championships)))
R
# Evolution of home runs over time
hr_by_era <- Batting %>%
  group_by(yearID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    total_ab = sum(AB, na.rm = TRUE),
    hr_rate = total_hr / total_ab * 100,
    .groups = "drop"
  ) %>%
  filter(yearID >= 1920)  # Deadball era ended around 1920

# Identify eras
hr_by_era <- hr_by_era %>%
  mutate(
    era = case_when(
      yearID < 1947 ~ "Pre-Integration",
      yearID < 1961 ~ "Integration Era",
      yearID < 1994 ~ "Pre-Strike",
      yearID < 2006 ~ "Steroid Era",
      TRUE ~ "Modern Era"
    )
  )

# Average HR rate by era
era_summary <- hr_by_era %>%
  group_by(era) %>%
  summarize(
    years = n(),
    avg_hr_rate = mean(hr_rate),
    .groups = "drop"
  )

print(era_summary)

# Visualize (requires ggplot2)
library(ggplot2)
ggplot(hr_by_era, aes(x = yearID, y = hr_rate, color = era)) +
  geom_line(size = 1) +
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed") +
  labs(
    title = "Home Run Rate Over Time (1920-Present)",
    x = "Year",
    y = "HR Rate (% of AB)",
    color = "Era"
  ) +
  theme_minimal()
Python
# The pybaseball library includes Lahman data
from pybaseball import lahman
import pandas as pd

# Load tables
people = lahman.people()
batting = lahman.batting()
pitching = lahman.pitching()
fielding = lahman.fielding()
teams = lahman.teams()
salaries = lahman.salaries()

# Explore structure
print(people.head())
print(people.columns.tolist())

print(batting.head())
print(batting.columns.tolist())
Python
# Calculate career home runs
career_hr = batting.groupby('playerID').agg({
    'HR': 'sum',
    'yearID': ['count', 'min', 'max']
}).reset_index()

career_hr.columns = ['playerID', 'total_HR', 'seasons', 'first_year', 'last_year']
career_hr = career_hr.nlargest(20, 'total_HR')

# Merge with people for names
career_hr_leaders = career_hr.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
career_hr_leaders['name'] = career_hr_leaders['nameFirst'] + ' ' + career_hr_leaders['nameLast']
career_hr_leaders = career_hr_leaders[['name', 'total_HR', 'seasons', 'first_year', 'last_year']]

print(career_hr_leaders)
Python
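# Calculate rate statistics
# HBP and SF are missing for early seasons; treat missing values as 0
# so that OBP and OPS are not NaN for those years
hbp = batting['HBP'].fillna(0)
sf = batting['SF'].fillna(0)
batting['AVG'] = batting['H'] / batting['AB']
batting['OBP'] = (batting['H'] + batting['BB'] + hbp) / (
    batting['AB'] + batting['BB'] + hbp + sf
)
# Total bases: singles + 2*doubles + 3*triples + 4*home runs
batting['SLG'] = (
    batting['H'] - batting['2B'] - batting['3B'] - batting['HR'] +
    2*batting['2B'] + 3*batting['3B'] + 4*batting['HR']
) / batting['AB']
batting['OPS'] = batting['OBP'] + batting['SLG']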

# Filter for qualified seasons
qualified = batting[batting['AB'] >= 400].copy()

# Merge with people
qualified = qualified.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
qualified['name'] = qualified['nameFirst'] + ' ' + qualified['nameLast']

# Best batting averages
best_avg = qualified.nlargest(10, 'AVG')[
    ['name', 'yearID', 'teamID', 'AVG', 'H', 'AB']
].round(3)

print(best_avg)

# Most home runs
most_hr = qualified.nlargest(10, 'HR')[
    ['name', 'yearID', 'teamID', 'HR', 'AB', 'AVG', 'OPS']
].round(3)

print(most_hr)
Python
# Recent salaries in nominal dollars (no inflation adjustment applied)
# Note: Lahman salary data covers 1985-2016
recent_salaries = salaries[salaries['yearID'] >= 2010].copy()

# Top earners by year
top_salaries_by_year = recent_salaries.loc[
    recent_salaries.groupby('yearID')['salary'].idxmax()
]

top_salaries_with_names = top_salaries_by_year.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
top_salaries_with_names['name'] = top_salaries_with_names['nameFirst'] + ' ' + top_salaries_with_names['nameLast']

print(top_salaries_with_names[['yearID', 'name', 'teamID', 'salary']].sort_values('yearID'))

# Average salary by year
avg_salary_by_year = recent_salaries.groupby('yearID').agg({
    'salary': ['mean', 'median', 'max']
}).round(0)

avg_salary_by_year.columns = ['mean_salary', 'median_salary', 'max_salary']
print(avg_salary_by_year)

# Salary vs. performance
# Lahman salary data ends in 2016, so merge that season's salaries and batting stats
salaries_2016 = salaries[salaries['yearID'] == 2016]
batting_2016 = batting[batting['yearID'] == 2016]

salary_performance = salaries_2016.merge(batting_2016, on=['playerID', 'yearID', 'teamID'])
salary_performance = salary_performance[salary_performance['AB'] >= 400].copy()

# Calculate an illustrative value metric
# Lahman doesn't include WAR, so at-bats per $1M of salary stands in here
salary_performance['AB_per_million'] = salary_performance['AB'] / salary_performance['salary'] * 1000000

correlation = salary_performance[['salary', 'HR', 'H', 'AB']].corr()
print(correlation)

3.4 Other Data Sources

Beyond the major packages, several other sources provide specialized data.

3.4.1 Retrosheet (play-by-play historical) {#retrosheet}

Retrosheet provides free play-by-play and box score data, covering most games from 1913 onward with complete coverage from 1974-present.

What Retrosheet offers:


  • Play-by-play event files

  • Game logs with detailed metadata

  • Box scores

  • Schedule and roster information

Accessing Retrosheet:

While there's no dedicated R/Python package as convenient as baseballr/pybaseball, you can:

  1. Download event files from https://www.retrosheet.org/game.htm
  2. Use Chadwick tools to parse event files
  3. Load parsed data into R/Python

R Example (manual parsing):

# After downloading and parsing with Chadwick tools
# You'd have CSV files to load

# Example structure (fictional path)
# events_2023 <- read_csv("retrosheet_data/all2023.csv")

# Retrosheet data includes detailed play-by-play
# Columns: GAME_ID, INN_CT, BAT_HOME_ID, OUTS_CT, BALLS_CT, STRIKES_CT,
#          PITCH_SEQ_TX, EVENT_CD, BATTEDBALL_CD, BUNT_FL, etc.

Python Example:

# Similarly, you'd parse and load CSV files
# import pandas as pd
# events_2023 = pd.read_csv('retrosheet_data/all2023.csv')

# Retrosheet enables granular analysis:
# - Win probability calculations
# - Leverage index
# - Re-examining historical games play-by-play
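
As a minimal sketch of what analysis on Chadwick-parsed output can look like, the snippet below tallies plate-appearance outcomes per game. The rows are made-up sample data standing in for a real parsed file, and the event-code labels are assumed to follow the Chadwick cwevent convention (3 = strikeout, 14 = walk, 20 through 23 = single through home run); verify them against the documentation for your Chadwick version.

```python
import io
import pandas as pd

# Made-up sample rows standing in for a Chadwick-parsed event file.
# In practice: events = pd.read_csv("retrosheet_data/all2023.csv")
sample_csv = io.StringIO(
    "GAME_ID,INN_CT,BAT_HOME_ID,OUTS_CT,EVENT_CD\n"
    "GAME001,1,0,0,3\n"
    "GAME001,1,0,1,20\n"
    "GAME001,1,0,1,23\n"
    "GAME001,1,1,0,14\n"
    "GAME002,1,0,0,2\n"
    "GAME002,1,0,1,23\n"
)
events = pd.read_csv(sample_csv)

# Assumed cwevent EVENT_CD labels (verify against the Chadwick docs).
EVENT_LABELS = {
    2: "generic_out", 3: "strikeout", 14: "walk",
    20: "single", 21: "double", 22: "triple", 23: "home_run",
}
events["event"] = events["EVENT_CD"].map(EVENT_LABELS).fillna("other")

# Count each outcome per game: one row per game, one column per event.
event_counts = (
    events.groupby(["GAME_ID", "event"]).size().unstack(fill_value=0)
)
print(event_counts)
```

The same groupby pattern extends to inning-level or batter-level tallies by swapping the grouping columns.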

Use cases:


  • Historical game recreation

  • Calculating leverage index and win probability

  • Detailed lineup and substitution analysis

  • Researching specific historical moments

3.4.2 Baseball Savant (direct access) {#baseball-savant}

Baseball Savant (https://baseballsavant.mlb.com/) is MLB's official Statcast interface. While baseballr and pybaseball access Statcast data, the website offers additional tools:

  • Search tools: Custom queries for specific situations
  • Visualizations: Spray charts, pitch movement diagrams, catch probability
  • Expected stats: xBA, xwOBA, xSLG leaderboards
  • Pitcher breakdowns: Detailed arsenal analysis
  • Umpire scorecards: Strike zone accuracy

Direct download:
You can export CSVs directly from Baseball Savant's leaderboards and search results.

# R: Load a manually downloaded CSV
# savant_data <- read_csv("Downloads/baseballsavant_data.csv")

# Python: Load a manually downloaded CSV
# savant_data = pd.read_csv('Downloads/baseballsavant_data.csv')

3.4.3 FanGraphs and Baseball Reference (web scraping considerations) {#web-scraping}

While baseballr and pybaseball provide API access, sometimes you need data that those packages do not expose. Web scraping is an option, but it carries legal and practical obligations.

Important considerations:


  • Always check the website's Terms of Service

  • Respect robots.txt

  • Use appropriate rate limiting

  • Consider reaching out to site maintainers for data access

  • Many sites have APIs or data exports (use those first)

  • FanGraphs offers a data export tool for some leaderboards
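
Two of the considerations above, respecting robots.txt and rate limiting, can be checked in code before any request is sent. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the robots.txt content, the user-agent string, and the site URL are all hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would load the site's own
# file with parser.set_url(...) followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "my-baseball-research-bot"  # hypothetical identifier
BASE_URL = "https://www.example-baseball-site.com"


def allowed(path: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch this path."""
    return parser.can_fetch(USER_AGENT, BASE_URL + path)


def polite_delay(default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay when it declares one, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


for path in ["/stats", "/private/salaries"]:
    if allowed(path):
        # In a real scraper, call time.sleep(polite_delay()) between requests.
        print(f"OK to fetch {path}; wait {polite_delay()}s between requests")
    else:
        print(f"Skipping {path}: disallowed by robots.txt")
```

A robots.txt check does not replace reading the Terms of Service; it only covers the machine-readable part of a site's policy.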

R Web Scraping (rvest):

# library(rvest)
# library(tidyverse)

# Example: Scraping a simple table (check TOS first!)
# url <- "https://www.example-baseball-site.com/stats"
# page <- read_html(url)
# table <- page %>%
#   html_element("table.stats-table") %>%
#   html_table()

Python Web Scraping (BeautifulSoup):

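# from io import StringIO
# from bs4 import BeautifulSoup
# import requests
# import pandas as pd

# url = "https://www.example-baseball-site.com/stats"
# response = requests.get(url, timeout=10)
# response.raise_for_status()  # stop early on HTTP errors
# soup = BeautifulSoup(response.content, 'html.parser')
# table = soup.find('table', class_='stats-table')
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
# df = pd.read_html(StringIO(str(table)))[0]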

Best practice: Use official packages (baseballr/pybaseball) whenever possible.

3.4.4 Chadwick Bureau and Open Source Projects {#chadwick-bureau}

Chadwick Bureau: A volunteer organization that maintains open-source baseball data and tools:


  • Player ID registers (cross-referencing different ID systems)

  • Historical corrections to Retrosheet and Lahman data

  • Open-source parsing tools

Notable projects:


  • chadwick: Command-line tools for parsing Retrosheet data

  • baseball_id: Player ID lookup tables

  • Baseball-Databank: GitHub repository with historical data

Accessing:

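# Note: the register's file layout on GitHub has changed over time (it may be
# split across several people-*.csv files); if this URL fails, check the
# repository's data/ directory. baseballr's chadwick_player_lu() and
# pybaseball's chadwick_register() fetch the same table for you.

# R: Load from GitHub
# id_mapping <- read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")

# Python: Load from GitHub
# id_mapping = pd.read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")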
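
To show why the ID register matters, here is a minimal sketch of joining a Statcast-style table (keyed by MLBAM ID) to a FanGraphs-style table (keyed by FanGraphs ID) through the register. The `key_mlbam`, `key_fangraphs`, and `key_bbref` column names match the register, but every ID, name, and statistic below is a made-up placeholder, and the two stat tables are hypothetical stand-ins for real query results.

```python
import pandas as pd

# Made-up stand-in for the Chadwick register; all values are placeholders.
register = pd.DataFrame({
    "key_mlbam": [100001, 100002, 100003],
    "key_fangraphs": [9001, 9002, 9003],
    "key_bbref": ["playea01", "playeb01", "playec01"],
    "name_first": ["Alex", "Blake", "Casey"],
    "name_last": ["Example", "Sample", "Placeholder"],
})

# Hypothetical Statcast-style summary keyed by MLBAM ID.
statcast = pd.DataFrame({
    "batter": [100001, 100003],
    "avg_launch_speed": [92.4, 88.1],
})

# Hypothetical FanGraphs-style table keyed by FanGraphs ID.
fangraphs = pd.DataFrame({"IDfg": [9001, 9002, 9003], "HR": [35, 22, 12]})

# Bridge the two ID systems through the register.
combined = (
    statcast
    .merge(register, left_on="batter", right_on="key_mlbam", how="left")
    .merge(fangraphs, left_on="key_fangraphs", right_on="IDfg", how="left")
)
print(combined[["name_first", "name_last", "HR", "avg_launch_speed"]])
```

Left joins keep every Statcast row even when a player is missing from the register, which makes unmatched IDs easy to spot as NaN values.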

3.5 Exercises

Exercise 3.1: Multi-Source Data Integration

Using both FanGraphs (via baseballr/pybaseball) and Statcast data:

  1. Retrieve 2024 batting statistics for players with 400+ PA
  2. For the top 10 hitters by wRC+, get their Statcast data
  3. Compare their wOBA (FanGraphs) to their xwOBA (Statcast)
  4. Which players are most over-performing or under-performing their expected stats?

R Solution Sketch:

library(baseballr)
library(tidyverse)

# 1. Get FanGraphs data
batters <- fg_batter_leaders(2024, 2024, qual = 400)
top10 <- batters %>% arrange(desc(`wRC+`)) %>% head(10)

# 2 & 3. Get Statcast data for each
# (Would loop through players and use statcast_search with their IDs)
# Compare wOBA vs xwOBA

# 4. Calculate differences
# top10 %>% mutate(woba_diff = wOBA - xwOBA)

Python Solution Sketch:

from pybaseball import batting_stats, statcast_batter, playerid_lookup
import pandas as pd

# 1. Get batting stats
batters = batting_stats(2024, qual=400)
top10 = batters.nlargest(10, 'wRC+')

# 2. Get Statcast data for top 10
# (Would need to lookup MLBAM IDs and query statcast_batter)

# 3 & 4. Compare wOBA vs xwOBA
# Calculate differences to identify over/under-performers
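
Once both metrics sit in one table, steps 3 and 4 reduce to a subtraction and a sort. The sketch below uses made-up hitters and illustrative numbers rather than real 2024 data, and it assumes the merged table has `wOBA` and `xwOBA` columns.

```python
import pandas as pd

# Illustrative numbers only; not real player statistics.
hitters = pd.DataFrame({
    "Name": ["Hitter A", "Hitter B", "Hitter C"],
    "wOBA": [0.410, 0.385, 0.360],
    "xwOBA": [0.380, 0.390, 0.395],
})

# Positive gap: results beat contact quality (over-performing).
# Negative gap: results lag contact quality (under-performing).
hitters["woba_diff"] = (hitters["wOBA"] - hitters["xwOBA"]).round(3)
ranked = hitters.sort_values("woba_diff", ascending=False).reset_index(drop=True)
print(ranked)
```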

Exercise 3.2: Historical Trends with Lahman

Using the Lahman database:

  1. Calculate the league-average batting average by decade (1920-present)
  2. Identify the decade with the highest and lowest scoring
  3. Plot the trend of strikeouts per game over time
  4. Compare the "Steroid Era" (1995-2005) to the "Modern Era" (2015-2024) in terms of HR rate, K rate, and BA
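
As a starting point for step 1, the sketch below buckets seasons into decades and computes a league batting average for each. The rows are made-up values that only borrow Lahman's Batting column names (`yearID`, `AB`, `H`); swap in the real table to get actual results.

```python
import pandas as pd

# Made-up rows using Lahman Batting column names; values are illustrative.
batting = pd.DataFrame({
    "yearID": [1925, 1928, 1968, 1969, 2021, 2022],
    "AB": [500, 100, 600, 400, 550, 450],
    "H": [160, 20, 140, 100, 140, 110],
})

# Bucket each season into its decade: 1925 -> 1920, 1968 -> 1960, ...
batting["decade"] = (batting["yearID"] // 10) * 10

# League BA is total hits over total at-bats, not the mean of individual
# averages, so part-time players do not count as much as regulars.
by_decade = batting.groupby("decade")[["H", "AB"]].sum()
by_decade["BA"] = (by_decade["H"] / by_decade["AB"]).round(3)
print(by_decade)
```

Summing before dividing is the design choice to note: averaging each player's AVG would give a 100-AB bench player the same weight as a 600-AB regular.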

Exercise 3.3: Team Performance Analysis

Combine multiple data sources:

  1. Get 2024 team standings (use mlb_standings from baseballr or schedule_and_record from pybaseball)
  2. Calculate team batting statistics (aggregate from individual player data)
  3. Join with team ERA (from pitching data)
  4. Create a Pythagorean expectation model: Expected W% = R^2 / (R^2 + RA^2)
  5. Compare actual wins to expected wins—which teams over-performed?
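
The Pythagorean formula in step 4 translates directly into a few lines of pandas. This is a minimal sketch with made-up team totals (not real standings); the column names `W`, `L`, `R`, and `RA` are assumptions about how you name your merged table.

```python
import pandas as pd

# Illustrative season totals; not real team data.
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "W": [95, 81, 70],
    "L": [67, 81, 92],
    "R": [800, 700, 650],
    "RA": [650, 700, 780],
})

EXPONENT = 2  # the exponent used in the exercise's formula

# Expected W% = R^2 / (R^2 + RA^2)
teams["exp_wpct"] = teams["R"] ** EXPONENT / (
    teams["R"] ** EXPONENT + teams["RA"] ** EXPONENT
)
teams["exp_W"] = (teams["exp_wpct"] * (teams["W"] + teams["L"])).round(1)

# Positive luck: the team won more games than its run totals predict.
teams["luck"] = teams["W"] - teams["exp_W"]
print(teams[["team", "W", "exp_W", "luck"]])
```

Keeping the exponent in a named constant makes it easy to experiment with other values once the basic model works.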

Exercise 3.4: Statcast Deep Dive

Pick your favorite pitcher and analyze their arsenal:

  1. Retrieve all Statcast data for their 2024 season
  2. For each pitch type:
     • Calculate average velocity, spin rate, and movement
     • Calculate whiff rate and zone rate
     • Identify which pitch generates the most swings and misses
  3. Analyze platoon splits: How do their pitches perform vs. LHB vs. RHB?
  4. Create a "pitch quality" ranking based on velocity, spin, and whiff rate

Deliverable: A comprehensive report with visualizations showing pitch usage, effectiveness, and recommendations for pitch selection.
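
As a starting point for step 2, the sketch below computes average velocity and whiff rate by pitch type. The pitches are made-up rows that borrow Statcast column names (`pitch_type`, `release_speed`, `description`), and the grouping of `description` values into swings and whiffs is an assumption; definitions vary, for example on whether foul tips count as whiffs.

```python
import pandas as pd

# Made-up pitches using Statcast column names; values are illustrative.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "FF", "SL", "SL", "SL", "SL"],
    "release_speed": [95.1, 94.8, 95.6, 95.0, 86.2, 85.9, 86.5, 86.0],
    "description": [
        "called_strike", "foul", "swinging_strike", "hit_into_play",
        "swinging_strike", "swinging_strike_blocked", "ball", "foul",
    ],
})

# Assumed groupings of Statcast descriptions; adjust to your definition.
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}
SWINGS = WHIFFS | {"foul", "foul_tip", "hit_into_play"}

pitches["is_swing"] = pitches["description"].isin(SWINGS)
pitches["is_whiff"] = pitches["description"].isin(WHIFFS)

arsenal = pitches.groupby("pitch_type").agg(
    n_pitches=("release_speed", "size"),
    avg_velo=("release_speed", "mean"),
    swings=("is_swing", "sum"),
    whiffs=("is_whiff", "sum"),
)
# Whiff rate = swings and misses divided by total swings.
arsenal["whiff_rate"] = arsenal["whiffs"] / arsenal["swings"]
print(arsenal.sort_values("whiff_rate", ascending=False))
```

Adding the batter-handedness column to the groupby keys would extend the same table to the platoon splits in step 3.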


This concludes Chapter 3. You now have the tools to access virtually any baseball dataset available. In the next chapters, we'll use these data sources to explore specific analytical techniques, from evaluating hitters and pitchers to building predictive models and crafting advanced visualizations.

The combination of R's baseballr package and Python's pybaseball library, along with the historical richness of the Lahman database, provides everything you need to conduct professional-grade baseball analysis. Master these tools, and you'll be equipped to answer almost any baseball question with data.

Chapter Summary

In this chapter, you learned about the baseball data ecosystem. Key topics covered:

  • The baseballr Package (R)
  • The pybaseball Library (Python)
  • The Lahman Database
  • Other Data Sources
  • Exercises