The baseballr package, created by Bill Petti, is the Swiss Army knife of baseball data acquisition in R. It provides functions to access FanGraphs, Baseball Reference, Baseball Savant (Statcast), and the MLB Stats API, all through a consistent interface.
3.1.1 Package Overview and Installation {#baseballr-install}
Install the package from CRAN or the development version from GitHub:
# From CRAN (stable version)
install.packages("baseballr")
# From GitHub (development version with latest features)
# install.packages("devtools")
devtools::install_github("BillPetti/baseballr")
# Load the package
library(baseballr)
library(tidyverse) # For data manipulation
The package documentation is extensive: run ?baseballr in R or visit https://billpetti.github.io/baseballr/.
Key features:
- Access to multiple data sources through a unified interface
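- Handles request construction and response parsing for each source (large queries still need to be chunked and paced, as the Statcast helper later in this section does)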
- Returns data as tibbles (tidy data frames)
- Includes player ID mapping across different systems
- Regular updates to accommodate API changes
3.1.2 FanGraphs Data Access {#baseballr-fangraphs}
FanGraphs is the gold standard for advanced baseball statistics, featuring metrics like wOBA, wRC+, FIP, and WAR. The baseballr package provides several functions to access FanGraphs leaderboards.
Batting Leaders
library(baseballr)
library(tidyverse)
# Get 2024 batting leaders (qualified batters: min 3.1 PA per team game)
batters_2024 <- fg_batter_leaders(
startseason = 2024,
endseason = 2024,
qual = 502 # 502 PA = 162 * 3.1
)
# Preview the data
glimpse(batters_2024)
dim(batters_2024) # Check dimensions
# Key columns include:
# Name, Team, G, PA, HR, R, RBI, SB, BB%, K%, AVG, OBP, SLG, wOBA, wRC+, WAR, etc.
# View top performers by wRC+ (Weighted Runs Created Plus)
top_hitters <- batters_2024 %>%
arrange(desc(`wRC+`)) %>%
select(Name, Team, PA, AVG, OBP, SLG, wOBA, `wRC+`, WAR) %>%
head(10)
print(top_hitters)
Output (example, first five rows):
# A tibble: 10 × 9
Name Team PA AVG OBP SLG wOBA `wRC+` WAR
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Aaron Judge NYY 704 0.322 0.458 0.701 0.457 218 10.8
2 Shohei Ohtani LAD 731 0.310 0.390 0.646 0.435 194 9.2
3 Juan Soto NYY 713 0.288 0.419 0.569 0.419 178 8.1
4 Bobby Witt Jr. KC 708 0.332 0.389 0.588 0.399 162 8.5
5 Francisco Lindor NYM 702 0.273 0.344 0.500 0.362 127 6.9
# Get batting data for a specific range without qualification minimum
all_batters_2024 <- fg_batter_leaders(
startseason = 2024,
endseason = 2024,
qual = 0 # 0 = no minimum, returns all players
)
# Filter for Yankees players
yankees_batters <- all_batters_2024 %>%
filter(Team == "NYY") %>%
arrange(desc(PA))
# Multi-year data: Career progression
# ind = 1 requests one row per player-season rather than a multi-year aggregate
judge_career <- fg_batter_leaders(
  startseason = 2016, # Judge's debut
  endseason = 2024,
  qual = 0,
  ind = 1
) %>%
  filter(Name == "Aaron Judge") %>%
  select(Season, Age, G, PA, HR, AVG, OBP, SLG, `wRC+`, WAR) %>%
  arrange(Season)
print(judge_career)
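The `qual = 502` threshold used for qualified batters above is simple arithmetic: 3.1 plate appearances per scheduled team game. As a minimal sketch (written in Python, the language of the pybaseball sections later in this chapter), the same rule can be applied to seasons of other lengths. The function name `qualifying_pa` is hypothetical, and rounding the product down is an assumption rather than a statement of the official rounding rule.

```python
import math

def qualifying_pa(team_games: int = 162, pa_per_game: float = 3.1) -> int:
    """Minimum plate appearances to qualify for rate-stat leaderboards."""
    # round() removes floating-point noise (e.g. 502.20000000000005) before flooring
    return math.floor(round(team_games * pa_per_game, 6))

full_season = qualifying_pa()      # standard 162-game schedule
short_season = qualifying_pa(60)   # a shortened 60-game schedule
print(full_season, short_season)
```

The same pattern with 1.0 innings per team game gives the 162 IP threshold used for qualified pitchers.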
Advanced Batting Metrics
# Get additional metrics: plate discipline, batted ball data
# Note: Different endpoints have different available columns
# Standard batting with plate discipline
discipline_leaders <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 300) %>%
select(Name, Team, PA, `BB%`, `K%`, `BB/K`, `O-Swing%`, `Z-Swing%`, `SwStr%`) %>%
arrange(`K%`) # Lowest strikeout rates
print(discipline_leaders %>% head(10))
# Players with excellent plate discipline (low K%, high BB%)
elite_discipline <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 400) %>%
filter(`K%` < 15, `BB%` > 12) %>%
select(Name, Team, PA, `BB%`, `K%`, `BB/K`, OBP, `wRC+`) %>%
arrange(desc(`BB/K`))
print(elite_discipline)
Pitching Leaders
# Get 2024 pitching leaders (qualified: 1 IP per team game = 162 IP)
pitchers_2024 <- fg_pitcher_leaders(
startseason = 2024,
endseason = 2024,
qual = 162 # Qualified starters
)
# Preview
glimpse(pitchers_2024)
# Top pitchers by FIP (Fielding Independent Pitching)
top_pitchers_fip <- pitchers_2024 %>%
arrange(FIP) %>%
select(Name, Team, IP, ERA, FIP, `xFIP`, `K/9`, `BB/9`, `HR/9`, WAR) %>%
head(10)
print(top_pitchers_fip)
# Approximate relievers by workload: 50 to just under 100 IP (a rough proxy, not a role flag)
relievers_2024 <- fg_pitcher_leaders(startseason = 2024, endseason = 2024, qual = 0) %>%
filter(IP >= 50, IP < 100) %>%
arrange(desc(WAR)) %>%
select(Name, Team, IP, ERA, FIP, `K/9`, `BB/9`, `K%`, SV, WAR)
print(relievers_2024 %>% head(15))
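FIP, used to rank pitchers above, is computed from home runs, walks, hit batters, and strikeouts per inning, plus a league constant. The sketch below (in Python, the language of the later pybaseball sections) shows the arithmetic, including the conversion of baseball innings notation, where 50.1 means 50 and one third innings. The helper names are hypothetical, the constant of 3.10 is a placeholder since the real value changes by season, and the pitcher line is made up for illustration.

```python
def ip_to_outs(ip: float) -> int:
    """Convert baseball innings notation (50.1 = 50 1/3 innings) to outs recorded."""
    whole = int(ip)
    thirds = round((ip - whole) * 10)  # digit after the decimal point: 0, 1, or 2
    return whole * 3 + thirds

def fip(hr: int, bb: int, hbp: int, k: int, ip: float, constant: float = 3.10) -> float:
    """Fielding Independent Pitching; `constant` is an assumed league constant."""
    innings = ip_to_outs(ip) / 3
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / innings + constant

# Hypothetical pitcher line: 20 HR, 50 BB, 5 HBP, 200 K in 180 innings
example_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=180.0)
print(round(example_fip, 2))
```

Converting to outs first matters for filters such as `IP >= 50`, because 50.2 in innings notation is two thirds of an inning, not two tenths.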
Batted Ball Data
# Get batted ball statistics
batted_ball_leaders <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 300) %>%
select(
Name, Team, PA,
`GB%`, `FB%`, `LD%`, `IFFB%`, `Pull%`, `Cent%`, `Oppo%`,
`Soft%`, `Med%`, `Hard%`,
`HR/FB`
) %>%
arrange(desc(`Hard%`)) # Highest hard-hit rate
print(batted_ball_leaders %>% head(10))
# Extreme fly ball hitters with power
power_profile <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 400) %>%
filter(`FB%` > 40, `HR/FB` > 15) %>%
select(Name, Team, HR, `FB%`, `HR/FB`, `Pull%`, `Hard%`, ISO, SLG) %>%
arrange(desc(HR))
print(power_profile)
3.1.3 Baseball Reference Data Access {#baseballr-bbref}
Baseball Reference is another premier source, particularly valued for its historical depth and park-adjusted statistics.
# Get Baseball Reference data
# Note: Function names and parameters can change between package versions
# Batting statistics aggregated over a date range
bref_batting_2024 <- bref_daily_batter(t1 = "2024-04-01", t2 = "2024-09-30")
# Pitching statistics aggregated over the same date range
bref_pitching_2024 <- bref_daily_pitcher(t1 = "2024-04-01", t2 = "2024-09-30")
# Game-by-game team results (schedule, scores, and record), not batting totals
team_results_2024 <- bref_team_results(Tm = "NYY", year = 2024)
# Standings
standings_2024 <- bref_standings_on_date(date = "2024-09-30", division = "AL East")
print(standings_2024)
Note: Baseball Reference functions can be rate-limited. Be respectful of their servers and cache results locally for repeated analyses.
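One way to follow the caching advice above is to write each response to disk the first time it is fetched and read it back afterwards. The sketch below shows the pattern in Python with the standard library only. The helper name `cached_fetch`, the file name, and the stand-in request function are all hypothetical, and the request function returns placeholder data instead of contacting any server.

```python
import json
import tempfile
from pathlib import Path

def cached_fetch(cache_path, fetch):
    """Return cached JSON data if present; otherwise call fetch() and store the result."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data

# Stand-in for a slow, rate-limited request (placeholder data, no network access)
calls = {"count": 0}

def fake_standings_request():
    calls["count"] += 1
    return [{"team": "NYY", "wins": 94}, {"team": "BAL", "wins": 91}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "standings_2024.json"
    first = cached_fetch(cache_file, fake_standings_request)   # triggers the request
    second = cached_fetch(cache_file, fake_standings_request)  # served from disk

print(calls["count"])
```

In a real project the cache directory would be persistent, and stale files would need to be deleted when fresher data is wanted.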
3.1.4 Statcast Data Access (statcast_search) {#baseballr-statcast}
This is one of the most powerful features of baseballr: direct access to Baseball Savant's Statcast data. Statcast tracks every pitch and batted ball with high-precision measurements: exit velocity, launch angle, spin rate, pitch movement, and more.
Basic Statcast Search
library(baseballr)
# Get all Statcast data for a date range
# Note: Large date ranges will take time due to data volume
# Single day of data
statcast_single_day <- statcast_search(
start_date = "2024-07-15",
end_date = "2024-07-15",
playerid = NULL # NULL = all players
)
glimpse(statcast_single_day)
dim(statcast_single_day) # Thousands of pitches
# Key columns:
# pitch_type, release_speed, release_pos_x/y/z, pfx_x/pfx_z (movement),
# plate_x, plate_z, vx0, vy0, vz0 (velocity components),
# ax, ay, az (acceleration), sz_top, sz_bot (strike zone),
# hit_distance_sc, launch_speed, launch_angle, barrel, events, description
Player-Specific Statcast Data
# Aaron Judge's player ID (MLB ID)
judge_mlbam_id <- 592450
# Get Judge's Statcast data for the 2024 season (in chunks)
# Note: Baseball Savant caps the number of rows returned per query, so long
# date ranges are safest when split into short windows and combined afterwards
# Helper function to query in non-overlapping two-week chunks
get_statcast_season <- function(player_id, year) {
  season_start <- as.Date(paste0(year, "-03-28"))
  season_end <- as.Date(paste0(year, "-09-30"))
  # Each window starts the day after the previous one ends, so no date is fetched twice
  starts <- seq(season_start, season_end, by = "14 days")
  ends <- pmin(starts + 13, season_end)
  all_data <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    message("Fetching data from ", starts[i], " to ", ends[i])
    all_data[[i]] <- statcast_search(
      start_date = as.character(starts[i]),
      end_date = as.character(ends[i]),
      playerid = player_id,
      player_type = "batter"
    )
    Sys.sleep(2) # Be nice to the API
  }
  bind_rows(all_data)
}
# Get Judge's 2024 Statcast data
judge_statcast_2024 <- get_statcast_season(judge_mlbam_id, 2024)
# Analyze Judge's batted balls
judge_batted_balls <- judge_statcast_2024 %>%
filter(!is.na(launch_speed), !is.na(launch_angle)) %>%
select(
game_date, events, description,
launch_speed, launch_angle, hit_distance_sc,
barrel, babip_value, estimated_ba_using_speedangle, estimated_woba_using_speedangle
)
# Summary statistics
judge_ev_summary <- judge_batted_balls %>%
summarize(
batted_balls = n(),
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
max_exit_velo = max(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
barrel_rate = sum(barrel == 1, na.rm = TRUE) / n() * 100,
hard_hit_rate = sum(launch_speed >= 95, na.rm = TRUE) / n() * 100
)
print(judge_ev_summary)
Output (example):
batted_balls avg_exit_velo max_exit_velo avg_launch_angle barrel_rate hard_hit_rate
1 503 92.3 121.4 14.2 15.3 52.1
Analyzing Pitch Types and Outcomes
# Analyze pitches Judge faced
judge_pitches <- judge_statcast_2024 %>%
filter(!is.na(pitch_type))
# Performance by pitch type
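# Pitch descriptions that count as swings, and the subset that are whiffs
swing_desc <- c("foul", "foul_tip", "hit_into_play", "swinging_strike", "swinging_strike_blocked")
whiff_desc <- c("swinging_strike", "swinging_strike_blocked")
hit_events <- c("single", "double", "triple", "home_run")
# Plate-appearance outcomes that are not at-bats (this list may not be exhaustive)
non_ab_events <- c("walk", "intent_walk", "hit_by_pitch", "sac_fly", "sac_bunt", "catcher_interf")
pitch_type_performance <- judge_pitches %>%
  mutate(is_ab = !is.na(events) & events != "" & !(events %in% non_ab_events)) %>%
  group_by(pitch_type) %>%
  summarize(
    pitches_seen = n(),
    swing_rate = mean(description %in% swing_desc),
    whiff_rate = sum(description %in% whiff_desc) / sum(description %in% swing_desc),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    at_bats = sum(is_ab), # at-bats that ended on this pitch type
    batting_avg = sum(events %in% hit_events) / at_bats,
    slugging = (
      sum(events == "single", na.rm = TRUE) +
        2 * sum(events == "double", na.rm = TRUE) +
        3 * sum(events == "triple", na.rm = TRUE) +
        4 * sum(events == "home_run", na.rm = TRUE)
    ) / at_bats,
    .groups = "drop"
  ) %>%
  arrange(desc(pitches_seen))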
print(pitch_type_performance)
Zone Analysis
# Analyze performance by pitch location
# Divide strike zone into regions
judge_zone_analysis <- judge_pitches %>%
filter(!is.na(plate_x), !is.na(plate_z)) %>%
mutate(
# Horizontal: inside, middle, outside (from catcher's perspective, Judge is RHB)
zone_horizontal = case_when(
plate_x < -0.5 ~ "Inside",
plate_x > 0.5 ~ "Outside",
TRUE ~ "Middle"
),
# Vertical: high, middle, low
zone_vertical = case_when(
plate_z > 3.0 ~ "High",
plate_z < 2.0 ~ "Low",
TRUE ~ "Middle"
),
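    # Named zone_label so it does not overwrite Statcast's own numeric `zone` column
    zone_label = paste(zone_horizontal, zone_vertical, sep = "-")
  )
zone_results <- judge_zone_analysis %>%
  group_by(zone_label) %>%
  summarize(
    pitches = n(),
    swing_rate = mean(description %in% c("foul", "foul_tip", "hit_into_play", "swinging_strike", "swinging_strike_blocked")),
    # Contact rate is measured per swing, not per pitch
    contact_rate = sum(description %in% c("foul", "hit_into_play")) /
      sum(description %in% c("foul", "foul_tip", "hit_into_play", "swinging_strike", "swinging_strike_blocked")),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(pitches))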
print(zone_results)
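The zone bucketing above is written for a right-handed batter. As a sketch of how the same rule could handle either side of the plate, the Python function below flips the horizontal test when the batter is left-handed. The function name `classify_zone` is hypothetical; `stand` mirrors the Statcast column of that name ("R" or "L"), and the 0.5 ft and 2.0/3.0 ft cutoffs are simply the ones used in the R code.

```python
def classify_zone(plate_x: float, plate_z: float, stand: str = "R") -> str:
    """Label a pitch location as '<horizontal>-<vertical>' from the batter's view."""
    # plate_x is measured from the catcher's perspective: negative values are on
    # the right-handed batter's side of the plate, so mirror it for left-handers.
    x = plate_x if stand == "R" else -plate_x
    if x < -0.5:
        horizontal = "Inside"
    elif x > 0.5:
        horizontal = "Outside"
    else:
        horizontal = "Middle"

    if plate_z > 3.0:
        vertical = "High"
    elif plate_z < 2.0:
        vertical = "Low"
    else:
        vertical = "Middle"
    return f"{horizontal}-{vertical}"

print(classify_zone(-0.8, 3.2))             # right-handed batter
print(classify_zone(-0.8, 3.2, stand="L"))  # same pitch to a left-handed batter
```

Fixed vertical cutoffs ignore batter height; the `sz_top` and `sz_bot` columns listed earlier could replace them in a more careful version.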
Pitcher-Specific Statcast Data
# Gerrit Cole's player ID
cole_mlbam_id <- 543037
# Get Cole's pitching data for a specific game or date range
cole_statcast <- statcast_search(
start_date = "2024-06-01",
end_date = "2024-06-30",
playerid = cole_mlbam_id,
player_type = "pitcher"
)
# Analyze Cole's arsenal
cole_arsenal <- cole_statcast %>%
filter(!is.na(pitch_type)) %>%
group_by(pitch_type) %>%
summarize(
pitches = n(),
usage_rate = n() / nrow(cole_statcast) * 100,
avg_velo = mean(release_speed, na.rm = TRUE),
max_velo = max(release_speed, na.rm = TRUE),
avg_spin = mean(release_spin_rate, na.rm = TRUE),
whiff_rate = sum(description %in% c("swinging_strike", "swinging_strike_blocked")) / sum(description %in% c("foul", "hit_into_play", "swinging_strike", "swinging_strike_blocked")) * 100,
zone_rate = sum(zone %in% 1:9) / n() * 100,
.groups = "drop"
) %>%
arrange(desc(pitches))
print(cole_arsenal)
Output (example):
pitch_type pitches usage_rate avg_velo max_velo avg_spin whiff_rate zone_rate
1 FF 345 45.2 97.8 100.2 2345 28.5 52.3
2 SL 198 26.0 86.4 89.1 2687 35.2 44.1
3 CH 89 11.7 88.2 90.5 1876 22.8 48.3
4 CU 76 10.0 81.5 84.2 2543 31.5 38.9
3.1.5 MLB Stats API Functions {#baseballr-mlb-api}
The MLB Stats API provides official data directly from MLB. It includes real-time game data, player information, team rosters, and more.
# Get current MLB teams
teams <- mlb_teams(season = 2024, sport_ids = 1) # sport_ids 1 = MLB
print(teams %>% select(team_id, team_full_name, division_name, league_name))
# Get team roster
yankees_roster <- mlb_rosters(team_id = 147, season = 2024, roster_type = "active") # 147 = Yankees
print(yankees_roster %>% select(person_full_name, position_name, jersey_number))
# Get player information
judge_info <- mlb_people(person_ids = 592450)
print(judge_info)
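# Get the league schedule, then filter to one team
# mlb_schedule() takes a season and level; column names below may differ by
# package version, so check names(schedule_2024) first
schedule_2024 <- mlb_schedule(season = 2024, level_ids = "1") # level 1 = MLB
yankees_schedule <- schedule_2024 %>%
  filter(
    game_type == "R", # R = Regular season
    teams_home_team_id == 147 | teams_away_team_id == 147
  )
# Games played in a date range
june_games <- yankees_schedule %>%
  filter(date >= "2024-06-01", date <= "2024-06-30") %>%
  select(date, game_pk, teams_home_team_name, teams_away_team_name, teams_home_score, teams_away_score)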
print(june_games)
# Get live game data (if a game is in progress)
# game_pk is the unique game identifier
# live_game_data <- mlb_game_linescore(game_pk = 747106)
The pybaseball library, created by James LeDoux, is the Python equivalent of baseballr, providing comprehensive access to baseball data sources through a clean, Pythonic interface.
3.2.1 Installation and Setup {#pybaseball-install}
# Install from PyPI
# pip install pybaseball
# Import
import pybaseball as pyb
from pybaseball import (
batting_stats, pitching_stats,
statcast, statcast_batter, statcast_pitcher,
playerid_lookup, playerid_reverse_lookup
)
import pandas as pd
import numpy as np
# Enable cache to avoid repeated API calls
pyb.cache.enable()
# Check version
print(pyb.__version__)
Key features:
- Clean, consistent function names
- Returns pandas DataFrames
- Built-in caching system
- Comprehensive FanGraphs, Statcast, and Lahman access
- Active development and community support
3.2.2 Batting and Pitching Stats (batting_stats, pitching_stats) {#pybaseball-batting-pitching}
Batting Statistics from FanGraphs
from pybaseball import batting_stats
import pandas as pd
# Get 2024 batting statistics (qualified batters)
batters_2024 = batting_stats(2024, qual=502)
print(batters_2024.shape)
print(batters_2024.columns.tolist())
# Preview top performers
top_wrc_plus = batters_2024.nlargest(10, 'wRC+')[
['Name', 'Team', 'PA', 'AVG', 'OBP', 'SLG', 'wRC+', 'WAR']
]
print(top_wrc_plus)
Output (example):
Name Team PA AVG OBP SLG wRC+ WAR
0 Aaron Judge NYY 704 0.322 0.458 0.701 218 10.8
1 Shohei Ohtani LAD 731 0.310 0.390 0.646 194 9.2
2 Juan Soto NYY 713 0.288 0.419 0.569 178 8.1
3 Bobby Witt Jr. KC 708 0.332 0.389 0.588 162 8.5
# All batters (no qualification minimum)
all_batters_2024 = batting_stats(2024, qual=0)
# Filter for Yankees
yankees = all_batters_2024[all_batters_2024['Team'] == 'NYY'].sort_values('PA', ascending=False)
print(yankees[['Name', 'PA', 'HR', 'AVG', 'OPS', 'WAR']].head(10))
# Multi-year data for a player (one call; rows are split by season by default)
judge_career = batting_stats(2016, 2024, qual=0).query('Name == "Aaron Judge"')
judge_career = judge_career[['Season', 'Age', 'G', 'PA', 'HR', 'AVG', 'OBP', 'SLG', 'wRC+', 'WAR']]
print(judge_career.sort_values('Season'))
Advanced Batting Metrics
# Elite plate discipline (low K%, high BB%)
disciplined_hitters = batters_2024[
(batters_2024['K%'] < 15) &
(batters_2024['BB%'] > 12)
][['Name', 'Team', 'PA', 'BB%', 'K%', 'BB/K', 'OBP', 'wRC+']].sort_values('BB/K', ascending=False)
print(disciplined_hitters)
# Batted ball leaders
batted_ball_leaders = batters_2024.nlargest(10, 'Hard%')[
['Name', 'Team', 'PA', 'Hard%', 'Barrel%', 'GB%', 'FB%', 'Pull%', 'HR', 'ISO']
]
print(batted_ball_leaders)
# Speed and base running
speed_leaders = batters_2024[batters_2024['PA'] >= 400].nlargest(15, 'SB')[
['Name', 'Team', 'SB', 'CS', 'SB%', 'Spd', 'wSB', 'UBR']
]
print(speed_leaders)
Pitching Statistics
from pybaseball import pitching_stats
# Get 2024 pitching statistics (qualified starters)
pitchers_2024 = pitching_stats(2024, qual=162)
print(pitchers_2024.shape)
# Top pitchers by FIP
top_fip = pitchers_2024.nsmallest(10, 'FIP')[
['Name', 'Team', 'IP', 'ERA', 'FIP', 'xFIP', 'K/9', 'BB/9', 'HR/9', 'WAR']
]
print(top_fip)
# Top strikeout pitchers
strikeout_leaders = pitchers_2024.nlargest(10, 'SO')[
['Name', 'Team', 'IP', 'SO', 'K/9', 'K%', 'SwStr%', 'ERA', 'FIP']
]
print(strikeout_leaders)
# Relief pitchers
all_pitchers = pitching_stats(2024, qual=0)
relievers = all_pitchers[
(all_pitchers['IP'] >= 50) &
(all_pitchers['IP'] < 100)
].nlargest(15, 'WAR')[
['Name', 'Team', 'IP', 'ERA', 'FIP', 'K/9', 'BB/9', 'SV', 'WAR']
]
print(relievers)
3.2.3 Statcast Data (statcast, statcast_batter, statcast_pitcher) {#pybaseball-statcast}
pybaseball provides three main functions for Statcast data: statcast() for date ranges, statcast_batter() for specific batters, and statcast_pitcher() for specific pitchers.
General Statcast Search
from pybaseball import statcast
import pandas as pd
# Get all Statcast data for a date range
# Note: Keep date ranges reasonably small (1-2 weeks) to avoid timeouts
statcast_data = statcast(start_dt='2024-07-15', end_dt='2024-07-20')
print(f"Total pitches: {len(statcast_data)}")
print(statcast_data.columns.tolist())
# Key columns:
# pitch_type, release_speed, release_spin_rate, pfx_x, pfx_z,
# plate_x, plate_z, launch_speed, launch_angle, hit_distance_sc,
# launch_speed_angle (6 = barrel), events, description, zone, balls, strikes, outs_when_up
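If you do split a long period into short requests, as the note above suggests, the windows should not overlap, or the boundary dates will be fetched twice. The sketch below builds non-overlapping windows with the standard library. The helper name `date_windows` is hypothetical, and the 14-day length is just the upper end of the range suggested above.

```python
from datetime import date, timedelta

def date_windows(start: date, end: date, days: int = 14):
    """Split [start, end] into consecutive, non-overlapping (start, end) ISO date pairs."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor.isoformat(), window_end.isoformat()))
        cursor = window_end + timedelta(days=1)  # next window starts the following day
    return windows

windows = date_windows(date(2024, 3, 28), date(2024, 9, 30))
for start_dt, end_dt in windows[:3]:
    print(start_dt, end_dt)
```

Each pair can be passed as `start_dt` and `end_dt` to the Statcast functions shown in this section, with the resulting DataFrames concatenated afterwards.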
Batter-Specific Statcast
from pybaseball import statcast_batter, playerid_lookup
# Look up player ID
judge_lookup = playerid_lookup('judge', 'aaron')
print(judge_lookup)
# Aaron Judge's MLBAM ID: 592450
judge_id = 592450
# Get Judge's Statcast data for 2024 season
judge_statcast = statcast_batter(
start_dt='2024-03-28',
end_dt='2024-09-30',
player_id=judge_id
)
print(f"Judge pitches faced: {len(judge_statcast)}")
print(f"Judge batted balls: {judge_statcast['launch_speed'].notna().sum()}")
# Analyze batted balls
judge_batted_balls = judge_statcast.dropna(subset=['launch_speed', 'launch_angle'])
# Exit velocity summary
ev_summary = {
'batted_balls': len(judge_batted_balls),
'avg_exit_velo': judge_batted_balls['launch_speed'].mean(),
'max_exit_velo': judge_batted_balls['launch_speed'].max(),
'min_exit_velo': judge_batted_balls['launch_speed'].min(),
'avg_launch_angle': judge_batted_balls['launch_angle'].mean(),
'barrel_rate': (judge_batted_balls['launch_speed_angle'] == 6).sum() / len(judge_batted_balls) * 100,  # 6 = barrel
'hard_hit_rate': (judge_batted_balls['launch_speed'] >= 95).sum() / len(judge_batted_balls) * 100
}
print(pd.Series(ev_summary))
Output (example):
batted_balls 503.000000
avg_exit_velo 92.345678
max_exit_velo 121.400000
min_exit_velo 52.300000
avg_launch_angle 14.234567
barrel_rate 15.308151
hard_hit_rate 52.087476
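The rate statistics in this summary are all the same calculation: the share of batted balls that meet a condition. The sketch below works through it on five made-up batted balls, using the 95 mph hard-hit threshold from the code above and the 8 to 32 degree launch-angle band that Statcast calls the sweet spot. The data and the helper name `rate` are illustrative only.

```python
# (exit velocity in mph, launch angle in degrees); made-up values for illustration
batted_balls = [(110.2, 24.0), (87.5, -3.0), (101.3, 12.0), (95.0, 35.0), (72.1, 48.0)]

def rate(flags):
    """Percentage of True values in a list of booleans."""
    return 100 * sum(flags) / len(flags)

avg_ev = sum(ev for ev, _ in batted_balls) / len(batted_balls)
hard_hit_rate = rate([ev >= 95 for ev, _ in batted_balls])
sweet_spot_rate = rate([8 <= la <= 32 for _, la in batted_balls])

print(f"avg EV {avg_ev:.2f}, hard-hit {hard_hit_rate:.1f}%, sweet spot {sweet_spot_rate:.1f}%")
```

Note that the denominator is batted balls, not plate appearances, so these rates are not comparable to per-PA statistics such as K%.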
Performance by Pitch Type
# Analyze Judge vs. different pitch types
SWINGS = ['foul', 'hit_into_play', 'swinging_strike', 'swinging_strike_blocked']
WHIFFS = ['swinging_strike', 'swinging_strike_blocked']
def whiff_rate(descriptions):
    swings = descriptions.isin(SWINGS).sum()
    return descriptions.isin(WHIFFS).sum() / swings if swings else float('nan')
pitch_type_analysis = (
    judge_statcast[judge_statcast['pitch_type'].notna()]
    .groupby('pitch_type')
    .agg(
        pitches=('description', 'size'),
        whiff_rate=('description', whiff_rate),
        avg_ev=('launch_speed', 'mean'),
        avg_la=('launch_angle', 'mean'),
        xwOBA=('estimated_woba_using_speedangle', 'mean'),
    )
    .round(3)
    .sort_values('pitches', ascending=False)
)
print(pitch_type_analysis)
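The whiff rate above divides swings and misses by all swings, and both counts come from the `description` column. The sketch below shows that bookkeeping on a short made-up pitch sequence without pandas. The function name is hypothetical, and which descriptions count as a swing (for example, whether to include `foul_tip`) is a definitional choice that varies between analysts.

```python
SWINGS = {"foul", "foul_tip", "hit_into_play", "swinging_strike", "swinging_strike_blocked"}
WHIFFS = {"swinging_strike", "swinging_strike_blocked"}

def swing_and_whiff_rates(descriptions):
    """Return (swing rate per pitch, whiff rate per swing) as fractions."""
    swings = sum(d in SWINGS for d in descriptions)
    whiffs = sum(d in WHIFFS for d in descriptions)
    swing_rate = swings / len(descriptions) if descriptions else float("nan")
    whiff_rate = whiffs / swings if swings else float("nan")
    return swing_rate, whiff_rate

# Made-up sequence of eight pitches
sequence = [
    "ball", "called_strike", "swinging_strike", "foul",
    "hit_into_play", "swinging_strike_blocked", "ball", "foul",
]
swing_rate, whiff_rate = swing_and_whiff_rates(sequence)
print(swing_rate, whiff_rate)
```

Guarding the division matters in practice: a rarely seen pitch type may have no swings at all in a sample.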
Spray Chart Data
# Get batted ball location data (hc_x, hc_y are coordinates)
judge_hits = judge_batted_balls[judge_batted_balls['events'].isin([
'single', 'double', 'triple', 'home_run'
])].copy()
# Classify hit direction by horizontal coordinate (simplified)
def classify_spray_direction(row):
    x = row['hc_x']
    # Home plate sits near hc_x = 125; lower values are toward left field.
    # For a RHB like Judge: lower hc_x = pull (left field), higher = opposite (right field)
    if pd.isna(x):
        return 'Unknown'
    elif x < 100:
        return 'Pull'
    elif x > 150:
        return 'Opposite'
    else:
        return 'Center'
judge_hits['spray_direction'] = judge_hits.apply(classify_spray_direction, axis=1)
spray_summary = judge_hits.groupby('spray_direction').agg({
'events': 'count',
'launch_speed': 'mean',
'hit_distance_sc': 'mean'
}).round(2)
spray_summary.columns = ['hits', 'avg_ev', 'avg_distance']
print(spray_summary)
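Classifying by `hc_x` alone ignores how deep the ball was hit, so a shallow and a deep ball on the same line can land in different buckets. An alternative is to convert the coordinates to a spray angle measured from home plate. The sketch below assumes home plate sits at roughly (125.42, 198.27) in the `hc_x`/`hc_y` coordinate system, a commonly used approximation and not an official constant; the function names and the 15-degree cutoffs (which split fair territory into thirds) are likewise illustrative choices.

```python
import math

HOME_X, HOME_Y = 125.42, 198.27  # assumed home plate location in hc_x / hc_y units

def spray_angle(hc_x: float, hc_y: float) -> float:
    """Angle in degrees from straightaway center; negative = left field side."""
    return math.degrees(math.atan2(hc_x - HOME_X, HOME_Y - hc_y))

def spray_direction(hc_x: float, hc_y: float, stand: str = "R") -> str:
    angle = spray_angle(hc_x, hc_y)
    if stand == "L":
        angle = -angle  # mirror so that negative always means the pull side
    if angle < -15:
        return "Pull"
    if angle > 15:
        return "Opposite"
    return "Center"

print(spray_direction(60, 100))             # ball toward left field, right-handed batter
print(spray_direction(60, 100, stand="L"))  # same ball, left-handed batter
```

The `stand` argument mirrors the Statcast column of the same name, which lets one function serve both right- and left-handed batters.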
Pitcher-Specific Statcast
from pybaseball import statcast_pitcher
# Gerrit Cole's MLBAM ID: 543037
cole_id = 543037
# Get Cole's pitching data
cole_statcast = statcast_pitcher(
start_dt='2024-06-01',
end_dt='2024-06-30',
player_id=cole_id
)
print(f"Cole pitches thrown: {len(cole_statcast)}")
# Analyze pitch arsenal
def whiff_pct(descriptions):
    swings = descriptions.isin(['foul', 'hit_into_play', 'swinging_strike', 'swinging_strike_blocked']).sum()
    whiffs = descriptions.isin(['swinging_strike', 'swinging_strike_blocked']).sum()
    return whiffs / swings * 100 if swings else float('nan')
cole_arsenal = (
    cole_statcast[cole_statcast['pitch_type'].notna()]
    .groupby('pitch_type')
    .agg(
        count=('release_speed', 'size'),
        avg_velo=('release_speed', 'mean'),
        max_velo=('release_speed', 'max'),
        avg_spin=('release_spin_rate', 'mean'),
        # pfx_x / pfx_z are reported in feet; multiply by 12 for inches
        h_break=('pfx_x', lambda s: s.mean() * 12),  # Horizontal movement
        v_break=('pfx_z', lambda s: s.mean() * 12),  # Vertical movement (induced)
        whiff_pct=('description', whiff_pct),
    )
    .round(2)
)
cole_arsenal['usage'] = (cole_arsenal['count'] / cole_arsenal['count'].sum() * 100).round(1)
cole_arsenal = cole_arsenal.sort_values('count', ascending=False)
print(cole_arsenal)
Output (example):
count avg_velo max_velo avg_spin h_break v_break whiff_pct usage
pitch_type
FF 345 97.82 100.24 2345.23 7.84 15.32 28.45 45.2
SL 198 86.43 89.12 2687.45 -5.23 2.45 35.21 26.0
CH 89 88.23 90.51 1876.34 9.12 -3.21 22.78 11.7
CU 76 81.54 84.23 2543.12 -8.34 -6.78 31.52 10.0
Expected Statistics (xwOBA, xBA)
# Compare actual results to expected results based on batted ball quality
# (batted balls only, so the 'wOBA' column here is wOBA on contact)
month = pd.to_datetime(judge_batted_balls['game_date']).dt.to_period('M')  # Group by month
judge_xstats = judge_batted_balls.groupby(month).agg(
    xBA=('estimated_ba_using_speedangle', 'mean'),
    xwOBA=('estimated_woba_using_speedangle', 'mean'),
    wOBA=('woba_value', 'mean'),
    avg_EV=('launch_speed', 'mean'),
    # launch_speed_angle == 6 is Statcast's barrel classification
    barrel_rate=('launch_speed_angle', lambda x: (x == 6).mean() * 100),
).round(3)
print(judge_xstats)
# Calculate over/under-performance
judge_batted_balls['woba_diff'] = judge_batted_balls['woba_value'] - judge_batted_balls['estimated_woba_using_speedangle']
print(f"Average wOBA vs xwOBA difference: {judge_batted_balls['woba_diff'].mean():.3f}")
print(f"Judge {'outperformed' if judge_batted_balls['woba_diff'].mean() > 0 else 'underperformed'} his expected stats")
3.2.4 Player ID Lookups (playerid_lookup) {#pybaseball-player-ids}
Different data sources use different player ID systems. pybaseball provides functions to look up and convert between them.
from pybaseball import playerid_lookup, playerid_reverse_lookup
# Look up a player by name
soto_ids = playerid_lookup('soto', 'juan')
print(soto_ids)
Output (example):
name_first name_last key_mlbam key_retro key_bbref key_fangraphs mlb_played_first mlb_played_last
0 juan soto 665742 sotoj002 sotoju01 19611 2018 2024
# Access different ID systems
soto_mlbam = soto_ids.iloc[0]['key_mlbam'] # For Statcast
soto_fg = soto_ids.iloc[0]['key_fangraphs'] # For FanGraphs
soto_bbref = soto_ids.iloc[0]['key_bbref'] # For Baseball Reference
print(f"Juan Soto MLBAM ID: {soto_mlbam}")
print(f"Juan Soto FanGraphs ID: {soto_fg}")
print(f"Juan Soto Baseball-Reference ID: {soto_bbref}")
# Reverse lookup (by ID)
player_info = playerid_reverse_lookup([665742], key_type='mlbam')
print(player_info)
# Lookup multiple players
players = pd.concat([
    playerid_lookup('judge', 'aaron'),
    playerid_lookup('ohtani', 'shohei'),
    playerid_lookup('betts', 'mookie')
], ignore_index=True)  # Renumber rows instead of repeating index 0
print(players[['name_first', 'name_last', 'key_mlbam', 'key_fangraphs']])
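Once a lookup table is in hand, converting between ID systems is a keyed search. The sketch below uses plain Python and a single crosswalk row copied from the `playerid_lookup` output shown in this section; the `translate_id` helper is a hypothetical name introduced for illustration, not part of pybaseball.

```python
# One row of a player ID crosswalk, copied from the lookup output in this section
CROSSWALK = [
    {
        "name": "Juan Soto",
        "key_mlbam": 665742,
        "key_retro": "sotoj002",
        "key_bbref": "sotoju01",
        "key_fangraphs": 19611,
    },
]


def translate_id(value, from_key, to_key, table=CROSSWALK):
    """Map an ID in one system to the matching ID in another, or None if absent."""
    for row in table:
        if row[from_key] == value:
            return row[to_key]
    return None


# Statcast (MLBAM) ID -> FanGraphs ID
print(translate_id(665742, "key_mlbam", "key_fangraphs"))
# Baseball Reference ID -> MLBAM ID
print(translate_id("sotoju01", "key_bbref", "key_mlbam"))
```

With a full register of thousands of players, a dictionary keyed on the source ID (or a pandas merge) would replace the linear scan.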
3.2.5 Team and Schedule Data {#pybaseball-team-schedule}
from pybaseball import schedule_and_record
# Get team schedule and results
yankees_2024 = schedule_and_record(2024, 'NYY')
print(yankees_2024.head(10))
print(yankees_2024.columns.tolist())
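# Analyze home vs. road performance
# Drop unplayed or postponed games, which have no result
yankees_2024 = yankees_2024.dropna(subset=['W/L']).copy()
yankees_2024['is_home'] = yankees_2024['Home_Away'] == 'Home'
# Walk-offs are recorded as 'W-wo' / 'L-wo', so match on the prefix
yankees_2024['win'] = yankees_2024['W/L'].str.startswith('W')
home_away_split = yankees_2024.groupby('is_home').agg({
    'win': ['sum', 'count', 'mean']
})
home_away_split.columns = ['wins', 'games', 'win_pct']
print(home_away_split)
# Performance by month
# Date strings carry no year (e.g. 'Thursday, Mar 28') and doubleheaders
# append '(1)' or '(2)', so strip the suffix and add the season before parsing
dates = yankees_2024['Date'].str.replace(r'\s*\(\d\)$', '', regex=True) + ', 2024'
yankees_2024['month'] = pd.to_datetime(dates, format='%A, %b %d, %Y', errors='coerce').dt.month
monthly_performance = yankees_2024.groupby('month').agg({
    'win': ['sum', 'count', 'mean'],
    'R': 'mean',
    'RA': 'mean'
}).round(3)
monthly_performance.columns = ['wins', 'games', 'win_pct', 'runs_per_game', 'runs_allowed']
print(monthly_performance)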
# Winning streaks
yankees_2024['streak'] = (yankees_2024['win'] != yankees_2024['win'].shift()).cumsum()
streaks = yankees_2024[yankees_2024['win']].groupby('streak').size()
longest_win_streak = streaks.max()
print(f"Longest winning streak: {longest_win_streak} games")
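The streak calculation labels each run of identical results and then measures the runs. The same idea can be sketched in plain Python with `itertools.groupby`. The result list is hypothetical, and the sketch assumes walk-off results appear as `W-wo` and `L-wo`, which is why it matches on the first letter rather than the whole string.

```python
from itertools import groupby


def longest_streak(results, prefix="W"):
    """Length of the longest run of consecutive results starting with `prefix`."""
    best = 0
    # groupby yields one group per run of consecutive equal key values
    for is_match, run in groupby(results, key=lambda r: r.startswith(prefix)):
        if is_match:
            best = max(best, sum(1 for _ in run))
    return best


# Hypothetical W/L column; 'W-wo' and 'L-wo' mark walk-off results
results = ["W", "L", "W", "W-wo", "W", "L-wo", "W", "W"]
print(longest_streak(results))       # longest winning streak
print(longest_streak(results, "L"))  # longest losing streak
```

Matching on the prefix keeps a walk-off win from splitting a streak in two, which an exact comparison against 'W' would do.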
The Lahman Database is the crown jewel of historical baseball data. Maintained by Sean Lahman, it contains complete batting, pitching, and fielding statistics from 1871 to the present, along with biographical data, awards, and team information.
3.3.1 History and Structure {#lahman-history}
History:
- Started by Sean Lahman in the 1990s
- Covers all of Major League Baseball from 1871-present
- Free and open source
- Updated annually
- Used in books like Moneyball and by MLB teams
Key Features:
- Complete player statistics (every player who ever appeared in MLB)
- Biographical information (birthplace, birth date, death date)
- Team records and franchise history
- Awards (MVP, Cy Young, Gold Glove, Hall of Fame)
- Post-season statistics
- Salary data (1985-2016; the Salaries table has not been updated since)
- Normalized structure with primary/foreign keys
3.3.2 Key Tables {#lahman-tables}
The database consists of multiple interconnected tables:
- People: Biographical info (playerID, nameFirst, nameLast, birthYear, birthCountry, debut, finalGame)
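- Batting: Season batting stats, one row per player per stint (playerID, yearID, stint, teamID, G, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, etc.). Rate stats such as AVG are not stored and must be computed. The R package names the doubles and triples columns X2B and X3B.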
- Pitching: Season pitching stats (playerID, yearID, W, L, G, GS, CG, SHO, SV, IPouts, H, ER, HR, BB, SO, ERA, etc.)
- Fielding: Defensive stats by position (POS, G, GS, InnOuts, PO, A, E, DP)
- Teams: Team season records (yearID, lgID, teamID, W, L, R, RA, attendance)
- Salaries: Player salaries (playerID, yearID, teamID, salary)
- AwardsPlayers: Individual awards (playerID, awardID, yearID, lgID)
- HallOfFame: Hall of Fame voting (playerID, yearID, votedBy, ballots, votes, inducted)
- AllstarFull: All-Star game appearances
- BattingPost/PitchingPost: Postseason statistics
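The tables link through shared keys: playerID identifies a person in People and appears as a foreign key in Batting. The sketch below shows that relationship with an in-memory SQLite database. The schema is a trimmed-down assumption for illustration, not the official Lahman schema, and only three batting rows are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified two-table schema (an assumption; real Lahman tables have many more columns)
conn.executescript("""
    CREATE TABLE People (
        playerID  TEXT PRIMARY KEY,
        nameFirst TEXT,
        nameLast  TEXT
    );
    CREATE TABLE Batting (
        playerID TEXT REFERENCES People(playerID),
        yearID   INTEGER,
        stint    INTEGER,
        teamID   TEXT,
        HR       INTEGER,
        PRIMARY KEY (playerID, yearID, stint)
    );
""")
conn.executemany("INSERT INTO People VALUES (?, ?, ?)", [
    ("ruthba01", "Babe", "Ruth"),
    ("bondsba01", "Barry", "Bonds"),
])
conn.executemany("INSERT INTO Batting VALUES (?, ?, ?, ?, ?)", [
    ("ruthba01", 1927, 1, "NYA", 60),
    ("bondsba01", 2001, 1, "SFN", 73),
    ("bondsba01", 2002, 1, "SFN", 46),
])
# Join the season rows to names and total home runs per player
rows = conn.execute("""
    SELECT p.nameFirst || ' ' || p.nameLast AS name, SUM(b.HR) AS hr
    FROM Batting AS b
    JOIN People AS p ON p.playerID = b.playerID
    GROUP BY b.playerID
    ORDER BY hr DESC
""").fetchall()
print(rows)
```

The composite key (playerID, yearID, stint) reflects that a player traded mid-season has more than one Batting row for that year.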
3.3.3 Accessing in R (Lahman package) {#lahman-r}
# Install and load the Lahman package
# install.packages("Lahman")
library(Lahman)
library(tidyverse)
# The package loads tables as data frames
# Main tables: People, Batting, Pitching, Fielding, Teams, Salaries
# Explore available tables
data(package = "Lahman")
# View the People table
glimpse(People)
head(People %>% select(playerID, nameFirst, nameLast, birthYear, birthCountry, debut, finalGame))
# View Batting table
glimpse(Batting)
head(Batting)
Example: Career Home Run Leaders
# Calculate career home runs
career_hr <- Batting %>%
  group_by(playerID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    seasons = n_distinct(yearID), # n() would count stints, not seasons
    first_year = min(yearID),
    last_year = max(yearID),
    .groups = "drop"
  ) %>%
  arrange(desc(total_hr)) %>%
  head(20)
# Join with People to get names
career_hr_leaders <- career_hr %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast)) %>%
  select(name, total_hr, seasons, first_year, last_year)
print(career_hr_leaders)
Output:
            name total_hr seasons first_year last_year
1    Barry Bonds      762      22       1986      2007
2     Hank Aaron      755      23       1954      1976
3      Babe Ruth      714      22       1914      1935
4  Albert Pujols      703      22       2001      2022
5 Alex Rodriguez      696      22       1994      2016
6    Willie Mays      660      22       1951      1973
Example: Single-Season Batting Records
# Best single seasons by various metrics
best_seasons <- Batting %>%
  filter(AB >= 400) %>% # Rough proxy for a qualified season
  mutate(
    # HBP and SF are missing for early seasons; treat them as 0 so OBP is defined
    HBP = coalesce(HBP, 0L),
    SF = coalesce(SF, 0L),
    AVG = H / AB,
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),
    # The Lahman package names the doubles and triples columns X2B and X3B
    SLG = (H + X2B + 2 * X3B + 3 * HR) / AB,
    OPS = OBP + SLG
  ) %>%
  left_join(
    People %>% select(playerID, nameFirst, nameLast),
    by = "playerID"
  ) %>%
  mutate(name = paste(nameFirst, nameLast))
# Highest single-season batting average
best_avg <- best_seasons %>%
  arrange(desc(AVG)) %>%
  select(name, yearID, teamID, AVG, H, AB) %>%
  head(10)
print(best_avg)
# Most home runs in a season
most_hr <- best_seasons %>%
  arrange(desc(HR)) %>%
  select(name, yearID, teamID, HR, AB, AVG, OPS) %>%
  head(10)
print(most_hr)
Example: Team Analysis
# Yankees history
yankees_history <- Teams %>%
  filter(teamID == "NYA") %>% # NYA = New York Yankees (AL)
  arrange(yearID) %>%
  mutate(
    win_pct = W / (W + L),
    run_diff = R - RA
  ) %>%
  select(yearID, W, L, win_pct, R, RA, run_diff, attendance, Rank, WSWin)
# Best Yankees seasons
best_yankees <- yankees_history %>%
  arrange(desc(win_pct)) %>%
  select(yearID, W, L, win_pct, run_diff, WSWin) %>%
  head(10)
print(best_yankees)
# World Series wins
yankees_championships <- yankees_history %>%
  filter(WSWin == "Y") %>%
  select(yearID, W, L, win_pct, R, RA)
print(paste("Yankees World Series titles:", nrow(yankees_championships)))
Example: Historical Trends
# Evolution of home runs over time
hr_by_era <- Batting %>%
  group_by(yearID) %>%
  summarize(
    total_hr = sum(HR, na.rm = TRUE),
    total_ab = sum(AB, na.rm = TRUE),
    hr_rate = total_hr / total_ab * 100,
    .groups = "drop"
  ) %>%
  filter(yearID >= 1920) # Deadball era ended around 1920
# Identify eras
hr_by_era <- hr_by_era %>%
  mutate(
    era = case_when(
      yearID < 1947 ~ "Pre-Integration",
      yearID < 1961 ~ "Integration Era",
      yearID < 1994 ~ "Pre-Strike",
      yearID < 2006 ~ "Steroid Era",
      TRUE ~ "Modern Era"
    )
  )
# Average HR rate by era
era_summary <- hr_by_era %>%
  group_by(era) %>%
  summarize(
    years = n(),
    avg_hr_rate = mean(hr_rate),
    .groups = "drop"
  )
print(era_summary)
# Visualize (requires ggplot2)
library(ggplot2)
ggplot(hr_by_era, aes(x = yearID, y = hr_rate, color = era)) +
  geom_line(linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_smooth(method = "loess", se = FALSE, linetype = "dashed") +
  labs(
    title = "Home Run Rate Over Time (1920-Present)",
    x = "Year",
    y = "HR Rate (% of AB)",
    color = "Era"
  ) +
  theme_minimal()
3.3.4 Accessing in Python {#lahman-python}
# The pybaseball library includes Lahman data
from pybaseball import lahman
import pandas as pd
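# Load tables
# Loader names have changed between pybaseball releases; run dir(lahman) to see what your version provides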
people = lahman.people()
batting = lahman.batting()
pitching = lahman.pitching()
fielding = lahman.fielding()
teams = lahman.teams()
salaries = lahman.salaries()
# Explore structure
print(people.head())
print(people.columns.tolist())
print(batting.head())
print(batting.columns.tolist())
Example: Career Home Run Leaders
# Calculate career home runs
career_hr = batting.groupby('playerID').agg({
    'HR': 'sum',
    'yearID': ['nunique', 'min', 'max']  # nunique counts seasons; count would count stints
}).reset_index()
career_hr.columns = ['playerID', 'total_HR', 'seasons', 'first_year', 'last_year']
career_hr = career_hr.nlargest(20, 'total_HR')
# Merge with people for names
career_hr_leaders = career_hr.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
career_hr_leaders['name'] = career_hr_leaders['nameFirst'] + ' ' + career_hr_leaders['nameLast']
career_hr_leaders = career_hr_leaders[['name', 'total_HR', 'seasons', 'first_year', 'last_year']]
print(career_hr_leaders)
Example: Best Single Seasons
# Calculate rate statistics
# HBP and SF are missing for early seasons; treat them as 0 so OBP is defined
hbp = batting['HBP'].fillna(0)
sf = batting['SF'].fillna(0)
batting['AVG'] = batting['H'] / batting['AB']
batting['OBP'] = (batting['H'] + batting['BB'] + hbp) / (
    batting['AB'] + batting['BB'] + hbp + sf
)
# Total bases = H + 2B + 2*3B + 3*HR (each hit already counts one base in H)
batting['SLG'] = (
    batting['H'] + batting['2B'] + 2*batting['3B'] + 3*batting['HR']
) / batting['AB']
batting['OPS'] = batting['OBP'] + batting['SLG']
# Filter for qualified seasons
qualified = batting[batting['AB'] >= 400].copy()
# Merge with people
qualified = qualified.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
qualified['name'] = qualified['nameFirst'] + ' ' + qualified['nameLast']
# Best batting averages
best_avg = qualified.nlargest(10, 'AVG')[
    ['name', 'yearID', 'teamID', 'AVG', 'H', 'AB']
].round(3)
print(best_avg)
# Most home runs
most_hr = qualified.nlargest(10, 'HR')[
    ['name', 'yearID', 'teamID', 'HR', 'AB', 'AVG', 'OPS']
].round(3)
print(most_hr)
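The rate-stat formulas are easy to check by hand on a single season. This sketch wraps them in a small function (`rate_stats` is a hypothetical helper, not part of any package) and applies it to Babe Ruth's 1927 line. It treats a missing HBP or SF as 0, which is an assumption suited to early seasons where those columns were not recorded.

```python
def rate_stats(ab, h, doubles, triples, hr, bb, hbp=None, sf=None):
    """Compute AVG, OBP, SLG, and OPS from counting stats.

    A missing HBP or SF (None) is treated as 0.
    """
    hbp = hbp or 0
    sf = sf or 0
    avg = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    # Each hit already counts one base in h, so add the extra bases only
    total_bases = h + doubles + 2 * triples + 3 * hr
    slg = total_bases / ab
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Babe Ruth, 1927: 540 AB, 192 H, 29 2B, 8 3B, 60 HR, 137 BB, 0 HBP, SF not recorded
ruth_1927 = rate_stats(ab=540, h=192, doubles=29, triples=8, hr=60, bb=137, hbp=0, sf=None)
print({name: round(value, 3) for name, value in ruth_1927.items()})
```

Without the missing-value handling, OBP and OPS would be undefined for this season, which is why the most-home-runs table can otherwise show empty OPS values for early players.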
Example: Salary Analysis
# Modern era salaries (adjusted for inflation could be added)
# Lahman's Salaries table ends in 2016, so this covers 2010-2016
recent_salaries = salaries[salaries['yearID'] >= 2010].copy()
# Top earners by year
top_salaries_by_year = recent_salaries.loc[
    recent_salaries.groupby('yearID')['salary'].idxmax()
]
top_salaries_with_names = top_salaries_by_year.merge(
    people[['playerID', 'nameFirst', 'nameLast']],
    on='playerID'
)
top_salaries_with_names['name'] = top_salaries_with_names['nameFirst'] + ' ' + top_salaries_with_names['nameLast']
print(top_salaries_with_names[['yearID', 'name', 'teamID', 'salary']].sort_values('yearID'))
# Average salary by year
avg_salary_by_year = recent_salaries.groupby('yearID').agg({
    'salary': ['mean', 'median', 'max']
}).round(0)
avg_salary_by_year.columns = ['mean_salary', 'median_salary', 'max_salary']
print(avg_salary_by_year)
# Salary vs. performance
# The Salaries table ends in 2016, so pair 2016 salaries with 2016 batting stats
salaries_2016 = salaries[salaries['yearID'] == 2016]
batting_2016 = batting[batting['yearID'] == 2016]
salary_performance = salaries_2016.merge(batting_2016, on=['playerID', 'yearID', 'teamID'])
salary_performance = salary_performance[salary_performance['AB'] >= 400].copy()
# Calculate an illustrative value metric
# Note: Lahman doesn't include WAR, so at-bats per $1M of salary stands in here
salary_performance['AB_per_million'] = salary_performance['AB'] / salary_performance['salary'] * 1_000_000
correlation = salary_performance[['salary', 'HR', 'H', 'AB']].corr()
print(correlation)
Beyond the major packages, several other sources provide specialized data.
3.4.1 Retrosheet (play-by-play historical) {#retrosheet}
Retrosheet provides free play-by-play and box score data, covering most games from 1913 onward with complete coverage from 1974-present.
What Retrosheet offers:
- Play-by-play event files
- Game logs with detailed metadata
- Box scores
- Schedule and roster information
Accessing Retrosheet:
While there's no dedicated R/Python package as convenient as baseballr/pybaseball, you can:
- Download event files from https://www.retrosheet.org/game.htm
- Use Chadwick tools to parse event files
- Load parsed data into R/Python
R Example (manual parsing):
# After downloading and parsing with Chadwick tools
# You'd have CSV files to load
# Example structure (fictional path)
# events_2023 <- read_csv("retrosheet_data/all2023.csv")
# Retrosheet data includes detailed play-by-play
# Columns: GAME_ID, INN_CT, BAT_HOME_ID, OUTS_CT, BALLS_CT, STRIKES_CT,
# PITCH_SEQ_TX, EVENT_CD, BATTEDBALL_CD, BUNT_FL, etc.
Python Example:
# Similarly, you'd parse and load CSV files
# import pandas as pd
# events_2023 = pd.read_csv('retrosheet_data/all2023.csv')
# Retrosheet enables granular analysis:
# - Win probability calculations
# - Leverage index
# - Re-examining historical games play-by-play
Use cases:
- Historical game recreation
- Calculating leverage index and win probability
- Detailed lineup and substitution analysis
- Researching specific historical moments
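Once Chadwick has converted event files to CSV, analysis starts with reading rows and decoding event codes. A minimal sketch with the standard library follows. The rows and GAME_ID values are made up, the column names follow the list given earlier in this section, and the code table is only a subset of Retrosheet's EVENT_CD values.

```python
import csv
import io
from collections import Counter

# Subset of Retrosheet event codes (EVENT_CD); anything else is tallied as "other"
EVENT_NAMES = {3: "strikeout", 14: "walk", 20: "single",
               21: "double", 22: "triple", 23: "home_run"}

# Hypothetical parsed rows; a real file would come from Chadwick's cwevent tool
sample_csv = """GAME_ID,INN_CT,OUTS_CT,EVENT_CD
GAME0001,1,0,3
GAME0001,1,1,20
GAME0001,1,1,23
GAME0001,2,0,14
GAME0002,1,0,23
"""


def count_events(handle):
    """Tally decoded events per game from an open CSV file or file-like object."""
    counts = Counter()
    for row in csv.DictReader(handle):
        name = EVENT_NAMES.get(int(row["EVENT_CD"]), "other")
        counts[(row["GAME_ID"], name)] += 1
    return counts


counts = count_events(io.StringIO(sample_csv))
print(counts[("GAME0001", "home_run")])
print(counts[("GAME0002", "home_run")])
```

For a real season file, replace the StringIO object with `open(path, newline="")`; the counting logic stays the same.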
3.4.2 Baseball Savant (direct access) {#baseball-savant}
Baseball Savant (https://baseballsavant.mlb.com/) is MLB's official Statcast interface. While baseballr and pybaseball access Statcast data, the website offers additional tools:
- Search tools: Custom queries for specific situations
- Visualizations: Spray charts, pitch movement diagrams, catch probability
- Expected stats: xBA, xwOBA, xSLG leaderboards
- Pitcher breakdowns: Detailed arsenal analysis
- Umpire scorecards: Strike zone accuracy
Direct download:
You can export CSVs directly from Baseball Savant's leaderboards and search results.
# R: Load a manually downloaded CSV
# savant_data <- read_csv("Downloads/baseballsavant_data.csv")
# Python: Load manually downloaded CSV
# savant_data = pd.read_csv('Downloads/baseballsavant_data.csv')
3.4.3 FanGraphs and Baseball Reference (web scraping considerations) {#web-scraping}
baseballr and pybaseball cover the most common requests, but sometimes you need data that neither package exposes. Web scraping is an option, with caveats.
Important considerations:
- Always check the website's Terms of Service
- Respect robots.txt
- Use appropriate rate limiting
- Consider reaching out to site maintainers for data access
- Many sites have APIs or data exports (use those first)
- FanGraphs offers a data export tool for some leaderboards
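Two of the considerations above, respecting robots.txt and rate limiting, can be sketched with Python's standard library. The robots.txt content, the user agent string, and the helper names `allowed_urls` and `polite_fetch` are all hypothetical; the rules are parsed from a local string so the sketch runs without touching any real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, parsed locally so no network access is needed.
# For a real site you would call set_url(...) and read() instead.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

USER_AGENT = "my-research-bot"  # hypothetical user agent string
BASE = "https://www.example-baseball-site.com"


def allowed_urls(paths):
    """Return only the URLs that robots.txt permits for USER_AGENT."""
    return [BASE + p for p in paths if parser.can_fetch(USER_AGENT, BASE + p)]


def polite_fetch(urls, fetch, delay):
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)
        results.append(fetch(url))
    return results


urls = allowed_urls(["/stats", "/private/admin"])
# Use the site's requested delay if it specifies one, otherwise wait 1 second.
delay = parser.crawl_delay(USER_AGENT) or 1
```

Passing the fetch function in as an argument keeps the pacing logic separate from the HTTP client, so you can swap in `requests.get` or a test stub.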
R Web Scraping (rvest):
# library(rvest)
# library(tidyverse)
# Example: Scraping a simple table (check TOS first!)
# url <- "https://www.example-baseball-site.com/stats"
# page <- read_html(url)
# table <- page %>%
# html_element("table.stats-table") %>%
# html_table()
Python Web Scraping (BeautifulSoup):
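# from io import StringIO
# from bs4 import BeautifulSoup
# import requests
# import pandas as pd
# url = "https://www.example-baseball-site.com/stats"
# response = requests.get(url, timeout=30)
# response.raise_for_status()  # stop early on HTTP errors
# soup = BeautifulSoup(response.content, 'html.parser')
# table = soup.find('table', class_='stats-table')
# # Wrap the HTML in StringIO; newer pandas versions deprecate passing raw HTML strings
# df = pd.read_html(StringIO(str(table)))[0]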
Best practice: Use official packages (baseballr/pybaseball) whenever possible.
3.4.4 Chadwick Bureau and Open Source Projects {#chadwick-bureau}
Chadwick Bureau: A volunteer organization that maintains open-source baseball data and tools:
- Player ID registers (cross-referencing different ID systems)
- Historical corrections to Retrosheet and Lahman data
- Open-source parsing tools
Notable projects:
- chadwick: Command-line tools for parsing Retrosheet data
- baseball_id: Player ID lookup tables
- Baseball-Databank: GitHub repository with historical data
Accessing:
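# R: Load from GitHub
# id_mapping <- read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")
# Python: Load from GitHub
# id_mapping = pd.read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")
# Note: if this path returns a 404, check the repository's data/ folder;
# the register has been split across several people-*.csv files.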
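Once a register table is loaded, cross-referencing ID systems comes down to a join. The sketch below uses register-style column names (`key_mlbam`, `key_fangraphs`, `key_bbref`); the players and ID values are made up, and one FanGraphs ID is deliberately missing to show how unmatched players can be flagged.

```python
import pandas as pd

# Hypothetical slice of a register-style table; names and IDs are invented.
register = pd.DataFrame({
    "name_last": ["Example", "Sample"],
    "name_first": ["Alex", "Jordan"],
    "key_mlbam": [100001, 100002],
    "key_fangraphs": [9001, 9002],
    "key_bbref": ["exampal01", "sampljo01"],
})

# FanGraphs IDs we want to translate; 9999 is not in the register.
wanted = pd.DataFrame({"key_fangraphs": [9002, 9001, 9999]})

# A left join keeps every requested ID; indicator=True adds a _merge column
# that records whether a match was found.
matched = wanted.merge(register, on="key_fangraphs", how="left", indicator=True)

# Collect the IDs that had no counterpart in the register.
unmatched = matched.loc[matched["_merge"] == "left_only", "key_fangraphs"].tolist()
print(matched[["key_fangraphs", "key_mlbam", "key_bbref"]])
print("Unmatched FanGraphs IDs:", unmatched)
```

Checking for unmatched IDs before joining stat tables helps catch players who would otherwise silently drop out of the analysis.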
Exercise 3.1: Multi-Source Data Integration
Using both FanGraphs (via baseballr/pybaseball) and Statcast data:
- Retrieve 2024 batting statistics for players with 400+ PA
- For the top 10 hitters by wRC+, get their Statcast data
- Compare their wOBA (FanGraphs) to their xwOBA (Statcast)
- Which players are most over-performing or under-performing their expected stats?
R Solution Sketch:
library(baseballr)
library(tidyverse)
# 1. Get FanGraphs data
batters <- fg_batter_leaders(startseason = 2024, endseason = 2024, qual = 400)
top10 <- batters %>% arrange(desc(`wRC+`)) %>% head(10)
# 2 & 3. Get Statcast data for each
# (Would loop through players and use statcast_search with their IDs)
# Compare wOBA vs xwOBA
# 4. Calculate differences
# top10 %>% mutate(woba_diff = wOBA - xwOBA)
Python Solution Sketch:
from pybaseball import batting_stats, statcast_batter, playerid_lookup
import pandas as pd
# 1. Get batting stats
batters = batting_stats(2024, qual=400)
top10 = batters.nlargest(10, 'wRC+')
# 2. Get Statcast data for top 10
# (Would need to lookup MLBAM IDs and query statcast_batter)
# 3 & 4. Compare wOBA vs xwOBA
# Calculate differences to identify over/under-performers
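For steps 3 and 4, a minimal sketch of the comparison itself, assuming you have already merged the FanGraphs and Statcast figures into one table. The player names and numbers below are invented purely to show the mechanics.

```python
import pandas as pd

# Made-up numbers for illustration only; not real player data.
merged = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "wOBA": [0.400, 0.350, 0.370],
    "xwOBA": [0.370, 0.380, 0.372],
})

# Positive values: results ran ahead of contact quality.
# Negative values: results lagged behind contact quality.
merged["woba_diff"] = (merged["wOBA"] - merged["xwOBA"]).round(3)

# Sort so the biggest over-performers come first.
ranked = merged.sort_values("woba_diff", ascending=False).reset_index(drop=True)
print(ranked)
```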
Exercise 3.2: Historical Trends with Lahman
Using the Lahman database:
- Calculate the league-average batting average by decade (1920-present)
- Identify the decade with the highest and lowest scoring
- Plot the trend of strikeouts per game over time
- Compare the "Steroid Era" (1995-2005) to the "Modern Era" (2015-2024) in terms of HR rate, K rate, and BA
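One detail worth getting right in the first task is the aggregation order. The sketch below uses Lahman-style column names (`yearID`, `AB`, `H`) on made-up rows, and sums hits and at-bats within each decade before dividing.

```python
import pandas as pd

# Made-up team-season rows using Lahman-style column names.
teams = pd.DataFrame({
    "yearID": [1925, 1928, 1968, 1969, 2019],
    "AB": [5000, 5000, 5000, 5000, 5000],
    "H": [1450, 1400, 1150, 1250, 1260],
})

# Floor each year to its decade: 1925 -> 1920, 2019 -> 2010.
teams["decade"] = (teams["yearID"] // 10) * 10

# Sum first, then divide, so the result is weighted by at-bats
# rather than being an unweighted mean of team averages.
by_decade = teams.groupby("decade")[["H", "AB"]].sum()
by_decade["BA"] = (by_decade["H"] / by_decade["AB"]).round(3)
print(by_decade)
```

Averaging the per-team batting averages instead would give every team equal weight regardless of how many at-bats it had.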
Exercise 3.3: Team Performance Analysis
Combine multiple data sources:
- Get 2024 team standings (use mlb_standings from baseballr or schedule_and_record from pybaseball)
- Calculate team batting statistics (aggregate from individual player data)
- Join with team ERA (from pitching data)
- Create a Pythagorean expectation model: Expected W% = R^2 / (R^2 + RA^2)
- Compare actual wins to expected wins—which teams over-performed?
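The Pythagorean formula above can be sketched in a few lines. The team names and run totals are invented, and the helper `pythag_wpct` is a hypothetical name; the exponent is left as a parameter so you can try values other than 2.

```python
import pandas as pd

# Invented team totals for illustration only.
teams = pd.DataFrame({
    "team": ["AAA", "BBB", "CCC"],
    "W": [95, 81, 70],
    "L": [67, 81, 92],
    "R": [800, 700, 650],
    "RA": [650, 700, 780],
})


def pythag_wpct(runs, runs_allowed, exponent=2.0):
    """Expected W% = R^exp / (R^exp + RA^exp)."""
    r = runs ** exponent
    ra = runs_allowed ** exponent
    return r / (r + ra)


teams["games"] = teams["W"] + teams["L"]
teams["exp_wpct"] = pythag_wpct(teams["R"], teams["RA"])
teams["exp_W"] = teams["exp_wpct"] * teams["games"]

# Positive W_diff: the team won more games than its run totals suggest.
teams["W_diff"] = teams["W"] - teams["exp_W"]
print(teams[["team", "W", "exp_W", "W_diff"]].round(1))
```

A team that scores exactly as many runs as it allows gets an expected winning percentage of .500, which is a quick sanity check on any implementation.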
Exercise 3.4: Statcast Deep Dive
Pick your favorite pitcher and analyze their arsenal:
- Retrieve all Statcast data for their 2024 season
- For each pitch type:
  - Calculate average velocity, spin rate, and movement
  - Calculate whiff rate and zone rate
  - Identify which pitch generates the most swings and misses
- Analyze platoon splits: How do their pitches perform vs. LHB vs. RHB?
- Create a "pitch quality" ranking based on velocity, spin, and whiff rate
Deliverable: A comprehensive report with visualizations showing pitch usage, effectiveness, and recommendations for pitch selection.
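Whiff rate is usually defined as swinging strikes divided by total swings. The sketch below shows one way to compute it per pitch type, using Statcast-style column names (`pitch_type`, `release_speed`, `description`) on made-up pitches. Which `description` values count as a swing is an assumption here; definitions vary, so adjust the lists to match the convention you adopt.

```python
import pandas as pd

# Made-up pitches for illustration only.
pitches = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "FF", "SL", "SL", "SL", "SL"],
    "release_speed": [95.1, 96.0, 94.8, 95.5, 86.2, 85.9, 87.0, 86.5],
    "description": [
        "called_strike", "foul", "swinging_strike", "hit_into_play",
        "swinging_strike", "swinging_strike_blocked", "ball", "foul",
    ],
})

# Assumed convention: whiffs are swinging strikes; swings add fouls and balls in play.
WHIFFS = ["swinging_strike", "swinging_strike_blocked"]
SWINGS = WHIFFS + ["foul", "foul_tip", "hit_into_play"]

pitches["is_whiff"] = pitches["description"].isin(WHIFFS)
pitches["is_swing"] = pitches["description"].isin(SWINGS)

# One row per pitch type with counts, average velocity, and swing totals.
summary = pitches.groupby("pitch_type").agg(
    n=("release_speed", "size"),
    avg_velo=("release_speed", "mean"),
    swings=("is_swing", "sum"),
    whiffs=("is_whiff", "sum"),
)
summary["whiff_rate"] = summary["whiffs"] / summary["swings"]
print(summary)
```

Adding the batter-handedness column to the groupby keys would extend the same summary to platoon splits.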
This concludes Chapter 3. You now have the tools to access the major public baseball data sources: FanGraphs, Baseball Reference, Statcast, the MLB Stats API, Lahman, and Retrosheet. In the next chapters, we'll use these data sources to explore specific analytical techniques, from evaluating hitters and pitchers to building predictive models and crafting advanced visualizations.
The combination of R's baseballr package and Python's pybaseball library, along with the historical richness of the Lahman database, provides everything you need to conduct professional-grade baseball analysis. Master these tools, and you'll be equipped to answer almost any baseball question with data.