The tidyverse is a collection of R packages designed for data science. Created by Hadley Wickham and maintained by RStudio, it provides a consistent, human-readable approach to data manipulation that has become the standard in modern R programming.
2.1.1 The Philosophy of Tidy Data {#tidy-data-philosophy}
Tidy data follows three fundamental principles:
- Each variable is a column: Player name, home runs, batting average—each gets its own column
- Each observation is a row: Each player-season, each game, each pitch is a separate row
- Each type of observational unit is a table: Batting stats in one table, pitching stats in another
Consider this untidy baseball data:
# Untidy: Statistics spread across multiple columns by year
untidy_data <- data.frame(
player = c("Aaron Judge", "Shohei Ohtani"),
hr_2023 = c(37, 44),
hr_2024 = c(58, 54),
avg_2023 = c(.267, .304),
avg_2024 = c(.322, .310)
)
The tidy version separates years into rows:
library(tidyverse)
tidy_data <- data.frame(
player = rep(c("Aaron Judge", "Shohei Ohtani"), each = 2),
year = rep(c(2023, 2024), times = 2),
home_runs = c(37, 58, 44, 54),
batting_avg = c(.267, .322, .304, .310)
)
print(tidy_data)
Output:
player year home_runs batting_avg
1 Aaron Judge 2023 37 0.267
2 Aaron Judge 2024 58 0.322
3 Shohei Ohtani 2023 44 0.304
4 Shohei Ohtani 2024 54 0.310
This structure makes it trivial to calculate year-over-year changes, filter by season, or group by player.
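For example, with the data in this shape, a year-over-year change is a single grouped mutate. A minimal sketch using the tidy_data frame above (hr_change is just an illustrative column name):
tidy_data %>%
  arrange(player, year) %>%
  group_by(player) %>%
  mutate(hr_change = home_runs - lag(home_runs)) %>%  # NA for each player's first season
  ungroup()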
2.1.2 The Pipe Operator: %>% and |> {#pipe-operator}
The pipe operator chains operations together, passing the result of one function as the first argument to the next. Think of it as "then" in plain English: "Take the data, then filter it, then select columns, then arrange by value."
R now has two pipe operators:
- %>%: from the magrittr package (loaded with the tidyverse)
- |>: the native R pipe (available in R 4.1+)
Both work similarly for most cases:
library(tidyverse)
# Without pipes - nested and hard to read
result <- arrange(
select(
filter(tidy_data, year == 2024),
player, home_runs
),
desc(home_runs)
)
# With pipes - reads left to right, top to bottom
result <- tidy_data %>%
filter(year == 2024) %>%
select(player, home_runs) %>%
arrange(desc(home_runs))
# Native pipe (equivalent)
result <- tidy_data |>
filter(year == 2024) |>
select(player, home_runs) |>
arrange(desc(home_runs))
print(result)
Output:
player home_runs
1 Aaron Judge 58
2 Shohei Ohtani 54
We'll use %>% throughout this book as it's more widely used, but |> works identically in most contexts.
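One practical difference worth knowing: when the piped data should not land in the first argument, the two pipes use different placeholders. A quick sketch (the lm() call is purely illustrative):
# magrittr pipe: "." stands for the piped data anywhere in the call
model1 <- tidy_data %>% lm(home_runs ~ year, data = .)
# native pipe: "_" (R 4.2+) must be passed to a named argument
model2 <- tidy_data |> lm(home_runs ~ year, data = _)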
2.1.3 The Core dplyr Verbs {#dplyr-verbs}
The dplyr package (part of the tidyverse) provides six essential verbs for data manipulation. Let's explore each with baseball data.
filter(): Keep Rows Based on Conditions
Filter selects rows that meet specific criteria.
library(baseballr)
library(tidyverse)
# Get 2024 batting leaders (requires baseballr package)
# Note: This fetches real data from FanGraphs
batters_2024 <- fg_batter_leaders(2024, 2024, qual = 200)
# Filter for players with 30+ home runs
power_hitters <- batters_2024 %>%
filter(HR >= 30)
# Multiple conditions: 30+ HR AND .280+ AVG
elite_hitters <- batters_2024 %>%
filter(HR >= 30, AVG >= .280)
# OR condition: Either 30+ HR or 100+ RBI
production <- batters_2024 %>%
filter(HR >= 30 | RBI >= 100)
# Filter by player type
pull_hitters <- batters_2024 %>%
filter(Pull_percent > 40)
select(): Choose Columns
Select picks specific columns or removes unwanted ones.
# Select basic offensive stats
basic_stats <- batters_2024 %>%
select(Name, Team, G, PA, HR, RBI, AVG, OBP, SLG)
# Select columns by pattern
rate_stats <- batters_2024 %>%
select(Name, ends_with("_percent"))
# Remove columns
no_ids <- batters_2024 %>%
select(-playerid, -xMLBAMID)
# Reorder and rename
clean_stats <- batters_2024 %>%
select(
player = Name,
team = Team,
games = G,
home_runs = HR,
avg = AVG
)
mutate(): Create New Columns
Mutate adds new variables or transforms existing ones.
# Calculate isolated power (SLG - AVG)
batters_enhanced <- batters_2024 %>%
mutate(
ISO = SLG - AVG,
wOBA_minus_xwOBA = wOBA - xwOBA,
AB = PA - BB - HBP - SF, # Approximate at-bats
singles = H - `2B` - `3B` - HR
)
# Conditional mutations with case_when
batters_categorized <- batters_2024 %>%
mutate(
power_category = case_when(
HR >= 40 ~ "Elite",
HR >= 30 ~ "Excellent",
HR >= 20 ~ "Above Average",
HR >= 10 ~ "Average",
TRUE ~ "Below Average"
),
contact_skill = case_when(
`K_percent` < 15 ~ "Excellent",
`K_percent` < 20 ~ "Good",
`K_percent` < 25 ~ "Average",
TRUE ~ "Poor"
)
)
# View the new categories
batters_categorized %>%
select(Name, HR, power_category, `K_percent`, contact_skill) %>%
head(10)
arrange(): Sort Rows
Arrange sorts data by one or more columns.
# Sort by home runs (ascending)
by_hr <- batters_2024 %>%
arrange(HR)
# Sort descending (most HR first)
hr_leaders <- batters_2024 %>%
arrange(desc(HR)) %>%
head(10)
# Multiple columns: sort by team, then by HR within team
by_team_hr <- batters_2024 %>%
arrange(Team, desc(HR))
# Most balanced hitters (low strikeout rate, high walk rate)
disciplined <- batters_2024 %>%
arrange(`K_percent`, desc(`BB_percent`)) %>%
select(Name, `K_percent`, `BB_percent`, AVG, OBP) %>%
head(10)
print(disciplined)
summarize(): Aggregate Data
Summarize (or summarise) reduces data to summary statistics.
# League-wide statistics
league_summary <- batters_2024 %>%
summarize(
total_players = n(),
total_hr = sum(HR),
avg_hr = mean(HR),
median_hr = median(HR),
avg_ba = mean(AVG),
avg_obp = mean(OBP),
avg_slg = mean(SLG)
)
print(league_summary)
Output:
total_players total_hr avg_hr median_hr avg_ba avg_obp avg_slg
1 147 3247 22.088 19.0 0.2536 0.3245 0.4182
group_by(): Group Data for Split Analysis
The true power of summarize emerges when combined with group_by, which splits data into groups before aggregating.
# Statistics by team
team_summary <- batters_2024 %>%
group_by(Team) %>%
summarize(
players = n(),
total_hr = sum(HR),
avg_hr_per_player = mean(HR),
team_avg = mean(AVG),
team_obp = mean(OBP),
.groups = "drop"
) %>%
arrange(desc(total_hr))
print(team_summary)
Output:
Team players total_hr avg_hr_per_player team_avg team_obp
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 NYY 5 183 36.6 0.266 0.351
2 ATL 6 152 25.3 0.258 0.342
3 LAD 6 148 24.7 0.271 0.355
More complex grouping:
# Analyze power vs. contact by handedness (if available)
# Create position groups
position_analysis <- batters_2024 %>%
mutate(
pos_group = case_when(
grepl("C", Pos) ~ "Catcher",
grepl("1B", Pos) ~ "1B/DH",
grepl("2B|SS", Pos) ~ "Middle Infield",
grepl("3B", Pos) ~ "Corner Infield",
grepl("OF", Pos) ~ "Outfield",
TRUE ~ "Other"
)
) %>%
group_by(pos_group) %>%
summarize(
count = n(),
avg_hr = mean(HR),
avg_sb = mean(SB),
avg_avg = mean(AVG),
avg_k_pct = mean(`K_percent`),
.groups = "drop"
) %>%
arrange(desc(avg_hr))
print(position_analysis)
2.1.4 Advanced dplyr Operations {#advanced-dplyr}
Joins: Combining Multiple Data Sources
Joins merge two datasets based on common columns. This is essential when combining batting and pitching data, or merging FanGraphs stats with Statcast metrics.
# Create sample datasets
batting <- data.frame(
player_id = c(1001, 1002, 1003),
name = c("Aaron Judge", "Juan Soto", "Mookie Betts"),
hr = c(58, 41, 19),
avg = c(.322, .288, .289)
)
fielding <- data.frame(
player_id = c(1001, 1002, 1004),
name = c("Aaron Judge", "Juan Soto", "Bobby Witt Jr."),
position = c("RF", "RF", "SS"),
drs = c(8, -5, 12)
)
# Inner join: only players in both datasets
inner_result <- batting %>%
inner_join(fielding, by = "player_id")
# Left join: all batting players, add fielding where available
left_result <- batting %>%
left_join(fielding, by = "player_id", suffix = c("", "_field"))
# Full join: all players from both datasets
full_result <- batting %>%
full_join(fielding, by = "player_id", suffix = c("_bat", "_field"))
print(left_result)
Output:
player_id name hr avg name_field position drs
1 1001 Aaron Judge 58 0.322 Aaron Judge RF 8
2 1002 Juan Soto 41 0.288 Juan Soto RF -5
3 1003 Mookie Betts 19 0.289 <NA> <NA> NA
Real-world example combining FanGraphs and custom data:
# Suppose we have team payroll data
payroll <- data.frame(
Team = c("NYY", "LAD", "NYM", "SD", "PHI"),
payroll_millions = c(306, 285, 275, 251, 243)
)
# Join with team performance
team_performance <- batters_2024 %>%
group_by(Team) %>%
summarize(
team_war = sum(WAR),
team_hr = sum(HR),
.groups = "drop"
) %>%
inner_join(payroll, by = "Team") %>%
mutate(war_per_million = team_war / payroll_millions) %>%
arrange(desc(war_per_million))
Pivots: Reshaping Data
Pivoting transforms data between wide and long formats.
# Long format (tidy)
player_stats_long <- data.frame(
player = rep(c("Judge", "Ohtani"), each = 3),
stat = rep(c("HR", "RBI", "AVG"), times = 2),
value = c(58, 144, .322, 54, 130, .310)
)
# Pivot wider: stats as columns
player_stats_wide <- player_stats_long %>%
pivot_wider(names_from = stat, values_from = value)
print(player_stats_wide)
Output:
player HR RBI AVG
1 Judge 58 144 0.322
2 Ohtani 54 130 0.310
# Pivot longer: years as rows (common for time series)
yearly_hr <- data.frame(
player = c("Judge", "Ohtani"),
HR_2022 = c(62, 34),
HR_2023 = c(37, 44),
HR_2024 = c(58, 54)
)
yearly_hr_long <- yearly_hr %>%
pivot_longer(
cols = starts_with("HR_"),
names_to = "year",
values_to = "home_runs",
names_prefix = "HR_"
) %>%
mutate(year = as.integer(year))
print(yearly_hr_long)
Output:
player year home_runs
1 Judge 2022 62
2 Judge 2023 37
3 Judge 2024 58
4 Ohtani 2022 34
5 Ohtani 2023 44
6 Ohtani 2024 54
Window Functions: Calculations Within Groups
Window functions perform calculations across rows related to the current row, without collapsing the data.
# Rank players by HR within each team
ranked_batters <- batters_2024 %>%
group_by(Team) %>%
mutate(
team_hr_rank = rank(desc(HR)),
team_hr_dense_rank = dense_rank(desc(HR)),
pct_of_team_hr = HR / sum(HR)
) %>%
ungroup() %>%
select(Name, Team, HR, team_hr_rank, pct_of_team_hr) %>%
arrange(Team, team_hr_rank)
# Calculate cumulative statistics
cumulative_stats <- yearly_hr_long %>%
arrange(player, year) %>%
group_by(player) %>%
mutate(
career_hr = cumsum(home_runs),
career_avg_hr = cummean(home_runs),
hr_from_last_year = home_runs - lag(home_runs)
) %>%
ungroup()
print(cumulative_stats)
Output:
player year home_runs career_hr career_avg_hr hr_from_last_year
1 Judge 2022 62 62 62.0 NA
2 Judge 2023 37 99 49.5 -25
3 Judge 2024 58 157 52.3 21
4 Ohtani 2022 34 34 34.0 NA
5 Ohtani 2023 44 78 39.0 10
6 Ohtani 2024 54 132 44.0 10
Pandas is the cornerstone of data analysis in Python. Built on NumPy, it provides DataFrame and Series objects that make data manipulation intuitive and efficient. If you're coming from R, pandas DataFrames are similar to data.frames or tibbles, with some syntactic differences.
2.2.1 DataFrames and Series {#dataframes-series}
A Series is a one-dimensional array with labels. A DataFrame is a two-dimensional table with labeled rows and columns.
import pandas as pd
import numpy as np
# Create a Series (one column of data)
home_runs = pd.Series(
[58, 54, 44, 41],
index=['Judge', 'Ohtani', 'Schwarber', 'Soto'],
name='HR'
)
print(home_runs)
Output:
Judge 58
Ohtani 54
Schwarber 44
Soto 41
Name: HR, dtype: int64
# Create a DataFrame
players_df = pd.DataFrame({
'player': ['Aaron Judge', 'Shohei Ohtani', 'Juan Soto', 'Kyle Schwarber'],
'team': ['NYY', 'LAD', 'NYY', 'PHI'],
'hr': [58, 54, 41, 38],
'avg': [.322, .310, .288, .250],
'obp': [.458, .390, .419, .373],
'slg': [.701, .646, .569, .488]
})
print(players_df)
Output:
player team hr avg obp slg
0 Aaron Judge NYY 58 0.322 0.458 0.701
1 Shohei Ohtani LAD 54 0.310 0.390 0.646
2 Juan Soto NYY 41 0.288 0.419 0.569
3 Kyle Schwarber PHI 38 0.250 0.373 0.488
# Inspecting DataFrames
print(players_df.shape) # (4, 6) - 4 rows, 6 columns
print(players_df.columns) # Column names
print(players_df.dtypes) # Data types
print(players_df.head()) # First 5 rows
print(players_df.info()) # Summary information
print(players_df.describe()) # Summary statistics for numeric columns
2.2.2 Selecting and Filtering Data {#pandas-selecting}
Pandas offers multiple ways to select data: bracket notation, .loc[] (label-based), and .iloc[] (integer position-based).
# Select a single column (returns Series)
hrs = players_df['hr']
teams = players_df.team # Alternative dot notation
# Select multiple columns (returns DataFrame)
basic_stats = players_df[['player', 'team', 'hr', 'avg']]
# Select rows by position with iloc
first_two_players = players_df.iloc[0:2] # First two rows
last_player = players_df.iloc[-1] # Last row
# Select rows by label with loc
# (Requires setting an index first)
players_indexed = players_df.set_index('player')
judge_stats = players_indexed.loc['Aaron Judge']
print(judge_stats)
Output:
team NYY
hr 58
avg 0.322
obp 0.458
slg 0.701
Name: Aaron Judge, dtype: object
Filtering (Boolean Indexing)
# Filter for players with 40+ home runs
power_hitters = players_df[players_df['hr'] >= 40]
# Multiple conditions with & (and) and | (or)
# Note: Each condition must be in parentheses
elite_hitters = players_df[(players_df['hr'] >= 40) & (players_df['avg'] >= .280)]
yankees = players_df[players_df['team'] == 'NYY']
# Complex filtering
qualified = players_df[
(players_df['hr'] >= 30) &
(players_df['obp'] >= .350) &
(players_df['slg'] >= .500)
]
print(qualified)
Output:
player team hr avg obp slg
0 Aaron Judge NYY 58 0.322 0.458 0.701
1 Shohei Ohtani LAD 54 0.310 0.390 0.646
2 Juan Soto NYY 41 0.288 0.419 0.569
# Using query() for more readable filtering
power_hitters_query = players_df.query('hr >= 40')
yankees_power = players_df.query('team == "NYY" and hr >= 40')
# Using isin() for multiple values
east_coast = players_df[players_df['team'].isin(['NYY', 'PHI', 'NYM', 'BOS'])]
2.2.3 Creating New Columns {#pandas-new-columns}
# Simple calculation
players_df['iso'] = players_df['slg'] - players_df['avg']
# Multiple operations
players_df['ops'] = players_df['obp'] + players_df['slg']
players_df['hr_per_1000_pa'] = (players_df['hr'] / 600) * 1000 # Assuming 600 PA
# Conditional column with np.where
players_df['power_tier'] = np.where(
players_df['hr'] >= 50, 'Elite',
np.where(players_df['hr'] >= 40, 'Excellent', 'Good')
)
# More complex conditions with pd.cut (binning)
players_df['hr_category'] = pd.cut(
players_df['hr'],
bins=[0, 30, 40, 50, 100],
labels=['Below 30', '30-40', '40-50', '50+']
)
# Apply custom functions
def classify_hitter(row):
if row['hr'] >= 40 and row['avg'] >= .280:
return 'Power & Average'
elif row['hr'] >= 40:
return 'Power'
elif row['avg'] >= .280:
return 'Average'
else:
return 'Other'
players_df['hitter_type'] = players_df.apply(classify_hitter, axis=1)
print(players_df[['player', 'hr', 'avg', 'hitter_type']])
Output:
player hr avg hitter_type
0 Aaron Judge 58 0.322 Power & Average
1 Shohei Ohtani 54 0.310 Power & Average
2 Juan Soto 41 0.288 Power & Average
3 Kyle Schwarber 38 0.250 Other
2.2.4 Grouping and Aggregating {#pandas-groupby}
The groupby() method splits data into groups, applies a function, and combines results—similar to dplyr's group_by() + summarize().
# Create larger dataset with multiple teams
batters_data = pd.DataFrame({
'player': ['Judge', 'Soto', 'Ohtani', 'Betts', 'Freeman', 'Schwarber', 'Harper', 'Turner'],
'team': ['NYY', 'NYY', 'LAD', 'LAD', 'LAD', 'PHI', 'PHI', 'PHI'],
'hr': [58, 41, 54, 19, 22, 38, 26, 21],
'rbi': [144, 109, 130, 75, 89, 104, 87, 86],
'avg': [.322, .288, .310, .289, .282, .250, .285, .298]
})
# Group by team and calculate totals
team_totals = batters_data.groupby('team').agg({
'hr': 'sum',
'rbi': 'sum',
'avg': 'mean'
})
print(team_totals)
Output:
hr rbi avg
team
LAD 95 294 0.293667
NYY 99 253 0.305000
PHI 85 277 0.277667
# Multiple aggregations per column
team_summary = batters_data.groupby('team').agg({
'hr': ['sum', 'mean', 'max'],
'avg': ['mean', 'min', 'max'],
'player': 'count'
}).round(3)
# Named aggregations (pandas 0.25+)
team_stats = batters_data.groupby('team').agg(
total_hr=('hr', 'sum'),
avg_hr=('hr', 'mean'),
top_hr=('hr', 'max'),
team_avg=('avg', 'mean'),
player_count=('player', 'count')
).round(2)
print(team_stats)
Output:
total_hr avg_hr top_hr team_avg player_count
team
LAD 95 31.67 54 0.29 3
NYY 99 49.50 58 0.31 2
PHI 85 28.33 38 0.28 3
# Group by multiple columns
# Add league information
batters_data['league'] = batters_data['team'].map({
'NYY': 'AL', 'LAD': 'NL', 'PHI': 'NL'
})
league_team_stats = batters_data.groupby(['league', 'team']).agg({
'hr': 'sum',
'player': 'count'
}).reset_index()
print(league_team_stats)
Output:
league team hr player
0 AL NYY 99 2
1 NL LAD 95 3
2 NL PHI 85 3
2.2.5 Merging DataFrames {#pandas-merging}
Pandas provides several ways to combine DataFrames: merge() (SQL-style joins), concat() (stacking), and join() (index-based).
# Create batting and fielding datasets
batting = pd.DataFrame({
'player_id': [1001, 1002, 1003],
'name': ['Aaron Judge', 'Juan Soto', 'Mookie Betts'],
'hr': [58, 41, 19],
'avg': [.322, .288, .289]
})
fielding = pd.DataFrame({
'player_id': [1001, 1002, 1004],
'name': ['Aaron Judge', 'Juan Soto', 'Bobby Witt Jr.'],
'position': ['RF', 'RF', 'SS'],
'drs': [8, -5, 12]
})
# Inner join (default)
inner = pd.merge(batting, fielding, on='player_id', suffixes=('_bat', '_field'))
# Left join (keep all batting records)
left = pd.merge(batting, fielding, on='player_id', how='left', suffixes=('_bat', '_field'))
# Outer join (keep all records from both)
outer = pd.merge(batting, fielding, on='player_id', how='outer', suffixes=('_bat', '_field'))
print(left)
Output:
player_id name_bat hr avg name_field position drs
0 1001 Aaron Judge 58 0.322 Aaron Judge RF 8.0
1 1002 Juan Soto 41 0.288 Juan Soto RF -5.0
2 1003 Mookie Betts 19 0.289 NaN NaN NaN
# Concatenate DataFrames vertically (stack rows from two frames)
al_batters = pd.DataFrame({
'player': ['Judge', 'Witt'],
'team': ['NYY', 'KC'],
'hr': [58, 32]
})
nl_batters = pd.DataFrame({
'player': ['Ohtani', 'Betts'],
'team': ['LAD', 'LAD'],
'hr': [54, 19]
})
all_batters = pd.concat([al_batters, nl_batters], ignore_index=True)
print(all_batters)
Output:
player team hr
0 Judge NYY 58
1 Witt KC 32
2 Ohtani LAD 54
3 Betts LAD 19
Now that we understand the core tools, let's apply them to common baseball analytics tasks.
2.3.1 Calculating Rate Stats (per PA, per IP) {#rate-stats}
Raw counting stats (HR, RBI, SO) depend on playing time. Rate statistics normalize these for comparison.
R Example: Batting Rates
library(baseballr)
library(tidyverse)
# Get 2024 data
batters <- fg_batter_leaders(2024, 2024, qual = 100)
# Calculate per-PA and per-game rates
batters_rates <- batters %>%
mutate(
# Runs produced (R + RBI - HR) per 27 outs, a crude stand-in for RC/27
RC27 = (R + RBI - HR) / (AB - H + CS + GIDP) * 27,
# Home runs per plate appearance
HR_per_PA = HR / PA,
# Walks per plate appearance
BB_per_PA = BB / PA,
# Extra base hits
XBH = `2B` + `3B` + HR,
XBH_per_PA = XBH / PA,
# Total bases per plate appearance (TB = H + 2B + 2x3B + 3xHR)
TB = H + `2B` + (`3B` * 2) + (HR * 3),
TB_per_PA = TB / PA,
# Speed score components
SB_per_game = SB / G,
# Production: OBP + SLG (quick OPS equivalent)
Production = OBP + SLG
)
# Top home run hitters by rate
hr_rate_leaders <- batters_rates %>%
filter(PA >= 400) %>%
select(Name, Team, PA, HR, HR_per_PA) %>%
arrange(desc(HR_per_PA)) %>%
head(10)
print(hr_rate_leaders)
Python Example: Pitching Rates
from pybaseball import pitching_stats
import pandas as pd
# Get 2024 pitching data
pitchers = pitching_stats(2024, qual=100)
# Calculate rate statistics
# Note: if IP is reported in thirds (e.g., 180.1 = 180 1/3 innings),
# convert to true innings before dividing; we use the raw value here
pitchers['K_per_9'] = (pitchers['SO'] / pitchers['IP']) * 9
pitchers['BB_per_9'] = (pitchers['BB'] / pitchers['IP']) * 9
pitchers['HR_per_9'] = (pitchers['HR'] / pitchers['IP']) * 9
pitchers['H_per_9'] = (pitchers['H'] / pitchers['IP']) * 9
# K/BB ratio
pitchers['K_BB_ratio'] = pitchers['SO'] / pitchers['BB']
# Strikeout rate and walk rate (per batter faced)
# Note: Estimate BF if not available
pitchers['est_BF'] = pitchers['IP'] * 3 + pitchers['H'] + pitchers['BB'] + pitchers['HBP']
pitchers['K_rate'] = pitchers['SO'] / pitchers['est_BF']
pitchers['BB_rate'] = pitchers['BB'] / pitchers['est_BF']
# WHIP (Walks + Hits per Inning Pitched)
pitchers['WHIP'] = (pitchers['BB'] + pitchers['H']) / pitchers['IP']
# Quality starts rate (if available)
if 'QS' in pitchers.columns:
pitchers['QS_rate'] = pitchers['QS'] / pitchers['GS']
# View elite strikeout pitchers
elite_k = pitchers.nlargest(10, 'K_per_9')[
['Name', 'Team', 'IP', 'SO', 'K_per_9', 'K_BB_ratio', 'ERA']
]
print(elite_k)
2.3.2 Rolling Averages and Moving Windows {#rolling-averages}
Rolling statistics smooth out noise and identify trends over time, such as hot and cold streaks.
R Example: 10-Game Rolling Average
# Simulate game-by-game data for a player
set.seed(123)
game_log <- data.frame(
game_num = 1:100,
date = seq(as.Date("2024-04-01"), by = "day", length.out = 100),
hits = rpois(100, lambda = 1.2),
ab = sample(3:5, 100, replace = TRUE)
) %>%
mutate(
batting_avg = hits / ab,
# 10-game rolling average
rolling_avg_10 = zoo::rollmean(batting_avg, k = 10, fill = NA, align = "right"),
# 20-game rolling average
rolling_avg_20 = zoo::rollmean(batting_avg, k = 20, fill = NA, align = "right"),
# Cumulative season average
season_avg = cumsum(hits) / cumsum(ab)
)
# Identify hot streaks (10-game avg > .300)
hot_streaks <- game_log %>%
filter(rolling_avg_10 > .300) %>%
select(game_num, date, rolling_avg_10, season_avg)
print(hot_streaks)
# Visualize (if plotting)
library(ggplot2)
ggplot(game_log, aes(x = game_num)) +
geom_line(aes(y = batting_avg), alpha = 0.3, color = "gray") +
geom_line(aes(y = rolling_avg_10), color = "blue", linewidth = 1) +
geom_line(aes(y = season_avg), color = "red", linetype = "dashed") +
geom_hline(yintercept = 0.250, linetype = "dotted") +
labs(
title = "Batting Average: Game-by-Game vs. Rolling Average",
x = "Game Number",
y = "Batting Average",
caption = "Blue = 10-game rolling avg, Red = Season avg"
) +
theme_minimal()
Python Example: Rolling Statcast Metrics
import pandas as pd
import numpy as np
# Simulate pitch-level data aggregated by game
np.random.seed(42)
game_log_py = pd.DataFrame({
'game_num': range(1, 101),
'date': pd.date_range('2024-04-01', periods=100),
'exit_velo': np.random.normal(90, 5, 100), # Average exit velocity
'launch_angle': np.random.normal(15, 10, 100),
'hard_hit_pct': np.random.uniform(0.3, 0.6, 100),
'barrel_pct': np.random.uniform(0.05, 0.15, 100)
})
# Calculate rolling averages
game_log_py['ev_rolling_10'] = game_log_py['exit_velo'].rolling(window=10).mean()
game_log_py['ev_rolling_20'] = game_log_py['exit_velo'].rolling(window=20).mean()
game_log_py['hard_hit_rolling'] = game_log_py['hard_hit_pct'].rolling(window=10).mean()
game_log_py['barrel_rolling'] = game_log_py['barrel_pct'].rolling(window=10).mean()
# Expanding window (cumulative)
game_log_py['ev_season_avg'] = game_log_py['exit_velo'].expanding().mean()
# Identify games with elite contact quality
elite_contact = game_log_py[
(game_log_py['ev_rolling_10'] > 92) &
(game_log_py['hard_hit_rolling'] > 0.45)
][['game_num', 'date', 'ev_rolling_10', 'hard_hit_rolling']]
print(elite_contact.head(10))
2.3.3 Year-over-Year Comparisons {#year-over-year}
Comparing performance across seasons reveals development trends and aging curves.
R Example: YoY HR Comparison
library(baseballr)
library(tidyverse)
# Get multiple years of data
batters_2023 <- fg_batter_leaders(2023, 2023, qual = 200)
batters_2024 <- fg_batter_leaders(2024, 2024, qual = 200)
# Prepare datasets
batters_23 <- batters_2023 %>%
select(Name, HR_2023 = HR, AVG_2023 = AVG, OPS_2023 = OPS, WAR_2023 = WAR)
batters_24 <- batters_2024 %>%
select(Name, HR_2024 = HR, AVG_2024 = AVG, OPS_2024 = OPS, WAR_2024 = WAR)
# Join and calculate changes
yoy_comparison <- batters_23 %>%
inner_join(batters_24, by = "Name") %>%
mutate(
HR_change = HR_2024 - HR_2023,
AVG_change = AVG_2024 - AVG_2023,
OPS_change = OPS_2024 - OPS_2023,
WAR_change = WAR_2024 - WAR_2023,
HR_pct_change = (HR_change / HR_2023) * 100
)
# Biggest improvements
biggest_improvers <- yoy_comparison %>%
arrange(desc(HR_change)) %>%
select(Name, HR_2023, HR_2024, HR_change, HR_pct_change) %>%
head(10)
print(biggest_improvers)
# Players who regressed
biggest_decliners <- yoy_comparison %>%
arrange(HR_change) %>%
select(Name, HR_2023, HR_2024, HR_change) %>%
head(10)
print(biggest_decliners)
Python Example: Multi-Year Career Trajectory
from pybaseball import batting_stats
import pandas as pd
# Get multiple years
years = [2022, 2023, 2024]
all_years = []
for year in years:
df = batting_stats(year, qual=300)
df['Season'] = year
all_years.append(df)
multi_year = pd.concat(all_years, ignore_index=True)
# Calculate career trajectory for specific players
target_players = ['Aaron Judge', 'Juan Soto', 'Mookie Betts']
trajectories = multi_year[multi_year['Name'].isin(target_players)].copy()
trajectories = trajectories.sort_values(['Name', 'Season'])
# Calculate year-over-year changes
trajectories['HR_yoy'] = trajectories.groupby('Name')['HR'].diff()
trajectories['AVG_yoy'] = trajectories.groupby('Name')['AVG'].diff()
trajectories['OPS_yoy'] = trajectories.groupby('Name')['OPS'].diff()
# View trajectories
trajectory_summary = trajectories[[
'Name', 'Season', 'HR', 'HR_yoy', 'AVG', 'AVG_yoy', 'OPS', 'OPS_yoy'
]].round(3)
print(trajectory_summary)
2.3.4 Split Analysis (home/away, vs LHP/RHP) {#split-analysis}
Splits reveal situational performance patterns crucial for lineup construction and matchup decisions.
R Example: Home/Road Splits
# Using baseballr to get splits (if available in your version)
# Alternative: simulate or load from Baseball Reference
# Simulated home/road data
set.seed(456)
player_splits <- expand.grid(
player = c("Aaron Judge", "Juan Soto", "Shohei Ohtani"),
split = c("Home", "Away")
) %>%
mutate(
PA = sample(250:350, n(), replace = TRUE),
HR = case_when(
split == "Home" ~ rpois(n(), lambda = 25),
TRUE ~ rpois(n(), lambda = 18)
),
AVG = case_when(
split == "Home" ~ runif(n(), .270, .340),
TRUE ~ runif(n(), .240, .310)
),
OPS = case_when(
split == "Home" ~ runif(n(), .850, 1.100),
TRUE ~ runif(n(), .750, .950)
)
)
# Calculate home/away differentials
split_comparison <- player_splits %>%
pivot_wider(
id_cols = player,
names_from = split,
values_from = c(PA, HR, AVG, OPS)
) %>%
mutate(
HR_diff = HR_Home - HR_Away,
AVG_diff = AVG_Home - AVG_Away,
OPS_diff = OPS_Home - OPS_Away,
home_advantage = case_when(
OPS_diff > 0.100 ~ "Strong home hitter",
OPS_diff > 0.050 ~ "Slight home advantage",
OPS_diff > -0.050 ~ "Neutral",
TRUE ~ "Better away"
)
)
print(split_comparison)
Python Example: LHP/RHP Splits
import pandas as pd
import numpy as np
# Simulate platoon splits data
np.random.seed(789)
players = ['Judge', 'Soto', 'Ohtani', 'Betts', 'Freeman']
handedness = ['vs_LHP', 'vs_RHP']
split_data = []
for player in players:
for hand in handedness:
# RHB typically better vs LHP
if hand == 'vs_LHP':
avg = np.random.uniform(0.280, 0.340)
slg = np.random.uniform(0.520, 0.720)
pa = np.random.randint(150, 250)
else:
avg = np.random.uniform(0.250, 0.310)
slg = np.random.uniform(0.450, 0.620)
pa = np.random.randint(350, 450)
split_data.append({
'player': player,
'pitcher_hand': hand,
'PA': pa,
'AVG': round(avg, 3),
'SLG': round(slg, 3),
'OPS': round(avg + slg, 3)  # crude OPS proxy: AVG + SLG (no OBP simulated)
})
splits_df = pd.DataFrame(split_data)
# Pivot to compare
splits_wide = splits_df.pivot(
index='player',
columns='pitcher_hand',
values=['PA', 'AVG', 'SLG', 'OPS']
)
# Calculate platoon differential
splits_wide['AVG_diff'] = splits_wide[('AVG', 'vs_LHP')] - splits_wide[('AVG', 'vs_RHP')]
splits_wide['OPS_diff'] = splits_wide[('OPS', 'vs_LHP')] - splits_wide[('OPS', 'vs_RHP')]
# Identify severe platoon splits
splits_wide['platoon_severity'] = pd.cut(
splits_wide['OPS_diff'],
bins=[-1, 0.050, 0.150, 0.300, 2],
labels=['Reverse', 'Mild', 'Moderate', 'Severe']
)
print(splits_wide)
Baseball is inherently temporal: seasons progress, players age, trends emerge. Effective date handling is essential for time-based analysis.
2.4.1 Game Dates and Scheduling {#game-dates}
R Example: Date Manipulation
library(lubridate)
library(tidyverse)
# Create a game schedule
schedule <- data.frame(
game_date = seq(as.Date("2024-03-28"), as.Date("2024-09-29"), by = "day")
) %>%
# Filter to realistic game dates (not every day)
filter(wday(game_date) %in% c(1, 3, 4, 5, 6, 7)) %>% # Drop Mondays (wday 2), a common off day
mutate(
year = year(game_date),
month = month(game_date, label = TRUE),
day_of_week = wday(game_date, label = TRUE),
week_of_season = week(game_date) - week(as.Date("2024-03-28")) + 1,
is_weekend = wday(game_date) %in% c(1, 7),
season_phase = case_when(
month %in% c("Mar", "Apr") ~ "Early",
month %in% c("May", "Jun", "Jul") ~ "Mid",
month %in% c("Aug", "Sep") ~ "Late"
)
)
# Games by month
games_by_month <- schedule %>%
count(month, season_phase) %>%
arrange(month)
print(games_by_month)
# Weekend vs. weekday
schedule %>%
count(is_weekend) %>%
mutate(pct = n / sum(n) * 100)
Python Example: Date Parsing and Filtering
import pandas as pd
from datetime import datetime, timedelta
# Create schedule
start_date = pd.to_datetime('2024-03-28')
end_date = pd.to_datetime('2024-09-29')
date_range = pd.date_range(start_date, end_date, freq='D')
schedule_py = pd.DataFrame({'game_date': date_range})
# Extract date components
schedule_py['year'] = schedule_py['game_date'].dt.year
schedule_py['month'] = schedule_py['game_date'].dt.month
schedule_py['month_name'] = schedule_py['game_date'].dt.month_name()
schedule_py['day_of_week'] = schedule_py['game_date'].dt.day_name()
schedule_py['week_num'] = schedule_py['game_date'].dt.isocalendar().week
schedule_py['is_weekend'] = schedule_py['game_date'].dt.dayofweek >= 5
# Season phases
def get_season_phase(month):
if month in [3, 4]:
return 'Early'
elif month in [5, 6, 7]:
return 'Mid'
else:
return 'Late'
schedule_py['season_phase'] = schedule_py['month'].apply(get_season_phase)
# Summary statistics
phase_summary = schedule_py.groupby('season_phase').agg({
'game_date': 'count',
'is_weekend': 'sum'
}).rename(columns={'game_date': 'total_days', 'is_weekend': 'weekend_days'})
print(phase_summary)
2.4.2 Time-Based Filtering {#time-filtering}
R Example: Date Filtering
# Filter by date range
early_season <- schedule %>%
filter(game_date >= as.Date("2024-03-28"),
game_date <= as.Date("2024-05-31"))
# Filter by month
june_games <- schedule %>%
filter(month == "Jun")
# Last 30 days
recent_games <- schedule %>%
filter(game_date >= Sys.Date() - days(30))
# Specific day of week
weekend_games <- schedule %>%
filter(day_of_week %in% c("Sat", "Sun"))
Python Example: Date Filtering
# Filter by date range
early_season_py = schedule_py[
(schedule_py['game_date'] >= '2024-03-28') &
(schedule_py['game_date'] <= '2024-05-31')
]
# Filter by month
june_games_py = schedule_py[schedule_py['month'] == 6]
# Last 30 days (from today)
cutoff_date = pd.Timestamp.today() - pd.Timedelta(days=30)
recent_games_py = schedule_py[schedule_py['game_date'] >= cutoff_date]
# Weekend games
weekend_games_py = schedule_py[schedule_py['is_weekend']]
2.4.3 Seasonal Aggregations {#seasonal-aggregations}
R Example: Monthly Aggregation
# Aggregate performance by month (with simulated data)
set.seed(100)
monthly_performance <- schedule %>%
sample_n(100) %>% # Sample 100 games
mutate(
hits = rpois(n(), lambda = 9),
runs = rpois(n(), lambda = 4.5),
hr = rpois(n(), lambda = 1.2)
) %>%
group_by(month, season_phase) %>%
summarize(
games = n(),
total_hits = sum(hits),
total_runs = sum(runs),
total_hr = sum(hr),
avg_runs_per_game = mean(runs),
.groups = "drop"
)
print(monthly_performance)
Python Example: Seasonal Aggregations
import numpy as np

np.random.seed(100)
sample_games = schedule_py.sample(100).copy()
sample_games['hits'] = np.random.poisson(9, 100)
sample_games['runs'] = np.random.poisson(4.5, 100)
sample_games['hr'] = np.random.poisson(1.2, 100)
monthly_perf_py = sample_games.groupby(['month_name', 'season_phase']).agg({
'game_date': 'count',
'hits': 'sum',
'runs': ['sum', 'mean'],
'hr': 'sum'
}).round(2)
monthly_perf_py.columns = ['_'.join(col).strip('_') for col in monthly_perf_py.columns]
print(monthly_perf_py)
library(lubridate)
library(tidyverse)
# Create a game schedule
schedule <- data.frame(
game_date = seq(as.Date("2024-03-28"), as.Date("2024-09-29"), by = "day")
) %>%
# Filter to realistic game dates (not every day)
filter(wday(game_date) %in% c(1, 3, 4, 5, 6, 7)) %>% # Mostly no off days on Tuesday
mutate(
year = year(game_date),
month = month(game_date, label = TRUE),
day_of_week = wday(game_date, label = TRUE),
week_of_season = week(game_date) - week(as.Date("2024-03-28")) + 1,
is_weekend = wday(game_date) %in% c(1, 7),
season_phase = case_when(
month %in% c("Mar", "Apr") ~ "Early",
month %in% c("May", "Jun", "Jul") ~ "Mid",
month %in% c("Aug", "Sep") ~ "Late"
)
)
# Games by month
games_by_month <- schedule %>%
count(month, season_phase) %>%
arrange(month)
print(games_by_month)
# Weekend vs. weekday
schedule %>%
count(is_weekend) %>%
mutate(pct = n / sum(n) * 100)
# Filter by date range
early_season <- schedule %>%
filter(game_date >= as.Date("2024-03-28"),
game_date <= as.Date("2024-05-31"))
# Filter by month
june_games <- schedule %>%
filter(month == "Jun")
# Last 30 days
recent_games <- schedule %>%
filter(game_date >= Sys.Date() - days(30))
# Specific day of week
weekend_games <- schedule %>%
filter(day_of_week %in% c("Sat", "Sun"))
# Aggregate performance by month (with simulated data)
set.seed(100)
monthly_performance <- schedule %>%
sample_n(100) %>% # Sample 100 games
mutate(
hits = rpois(n(), lambda = 9),
runs = rpois(n(), lambda = 4.5),
hr = rpois(n(), lambda = 1.2)
) %>%
group_by(month, season_phase) %>%
summarize(
games = n(),
total_hits = sum(hits),
total_runs = sum(runs),
total_hr = sum(hr),
avg_runs_per_game = mean(runs),
.groups = "drop"
)
print(monthly_performance)
2.5 Handling Missing Data {#missing-data}
Missing data is ubiquitous in baseball: injured players have no stats, newer metrics weren't tracked historically, and data collection errors occur. How you handle missingness can significantly impact your conclusions.
2.5.1 Types of Missing Data in Baseball {#missing-data-types}
- Missing Completely at Random (MCAR): Data entry errors, random system failures
- Missing at Random (MAR): Older seasons lack Statcast data, but missingness depends on year, not the value itself
- Missing Not at Random (MNAR): Players with poor performance stop playing (survivor bias)
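To make the last two categories concrete, here is a small simulation; the variable names (sprint_speed, war) and the 2015 cutoff are illustrative, not real data.
library(tidyverse)
set.seed(42)
player_seasons <- data.frame(
year = sample(2005:2024, 200, replace = TRUE),
war = rnorm(200, mean = 2, sd = 1.5)
) %>%
mutate(
# MAR: sprint speed wasn't tracked before 2015, so missingness
# depends on an observed variable (year), not on the speed itself
sprint_speed = ifelse(year < 2015, NA, rnorm(n(), mean = 27, sd = 1)),
# MNAR: weak seasons are more likely to vanish from the record as
# the player loses playing time, so missingness depends on the value
war_recorded = ifelse(war < 0 & runif(n()) < 0.7, NA, war)
)
player_seasons %>%
summarize(
pct_speed_missing = mean(is.na(sprint_speed)) * 100,
pct_war_missing = mean(is.na(war_recorded)) * 100
)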
2.5.2 Imputation Strategies {#imputation}
R Example: Detecting and Imputing
# Simulate data with missing values
set.seed(321)
batters_missing <- data.frame(
player = paste("Player", 1:50),
hr = sample(c(rpois(45, lambda = 20), rep(NA, 5))),
avg = sample(c(runif(43, .230, .310), rep(NA, 7))),
obp = sample(c(runif(46, .300, .400), rep(NA, 4))),
age = sample(22:38, 50, replace = TRUE)
)
# Detect missing data
missing_summary <- batters_missing %>%
summarize(
total_rows = n(),
hr_missing = sum(is.na(hr)),
avg_missing = sum(is.na(avg)),
obp_missing = sum(is.na(obp))
)
print(missing_summary)
# Visualize missingness pattern
library(naniar)
vis_miss(batters_missing) # Requires naniar package
# Imputation strategies
# 1. Remove rows with any missing data
complete_cases <- batters_missing %>%
filter(complete.cases(.))
# 2. Mean imputation (simple, but reduces variance)
batters_mean_impute <- batters_missing %>%
mutate(
hr = ifelse(is.na(hr), mean(hr, na.rm = TRUE), hr),
avg = ifelse(is.na(avg), mean(avg, na.rm = TRUE), avg),
obp = ifelse(is.na(obp), mean(obp, na.rm = TRUE), obp)
)
# 3. Median imputation (robust to outliers)
batters_median_impute <- batters_missing %>%
mutate(
hr = ifelse(is.na(hr), median(hr, na.rm = TRUE), hr)
)
# 4. Group-based imputation (by age group)
batters_group_impute <- batters_missing %>%
mutate(
age_group = cut(age, breaks = c(20, 25, 30, 35, 40), labels = c("Young", "Prime", "Veteran", "Older"))
) %>%
group_by(age_group) %>%
mutate(
hr = ifelse(is.na(hr), mean(hr, na.rm = TRUE), hr)
) %>%
ungroup() %>%
select(-age_group)
# 5. Forward fill (carry last observation forward - useful for cumulative stats)
batters_ffill <- batters_missing %>%
arrange(player) %>%
fill(hr, avg, obp, .direction = "down")
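The caveat on strategy 2 is worth seeing in numbers: mean imputation preserves the mean but shrinks the spread, because every filled-in value sits exactly at the center of the distribution.
# Compare the hr column before and after mean imputation
data.frame(
version = c("original", "mean-imputed"),
mean_hr = c(mean(batters_missing$hr, na.rm = TRUE), mean(batters_mean_impute$hr)),
sd_hr = c(sd(batters_missing$hr, na.rm = TRUE), sd(batters_mean_impute$hr))
)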
Python Example: Advanced Imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Simulate data with missing values
np.random.seed(321)
batters_missing_py = pd.DataFrame({
'player': [f'Player {i}' for i in range(1, 51)],
'hr': np.random.poisson(20, 50).astype(float),
'avg': np.random.uniform(0.230, 0.310, 50),
'obp': np.random.uniform(0.300, 0.400, 50),
'age': np.random.randint(22, 39, 50)
})
# Introduce missing values
batters_missing_py.loc[np.random.choice(50, 5, replace=False), 'hr'] = np.nan
batters_missing_py.loc[np.random.choice(50, 7, replace=False), 'avg'] = np.nan
batters_missing_py.loc[np.random.choice(50, 4, replace=False), 'obp'] = np.nan
# Check missing data
print(batters_missing_py.isnull().sum())
print(batters_missing_py.isnull().sum() / len(batters_missing_py) * 100) # Percentages
# 1. Drop rows with any missing data
complete_cases_py = batters_missing_py.dropna()
# 2. Drop columns with too many missing values
threshold = 0.2 # 20%
batters_dropped_cols = batters_missing_py.dropna(
axis=1,
thresh=int((1-threshold) * len(batters_missing_py))
)
# 3. Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
numeric_cols = ['hr', 'avg', 'obp', 'age']
batters_mean_imputed = batters_missing_py.copy()
batters_mean_imputed[numeric_cols] = imputer_mean.fit_transform(batters_missing_py[numeric_cols])
# 4. Median imputation
imputer_median = SimpleImputer(strategy='median')
batters_median_imputed = batters_missing_py.copy()
batters_median_imputed[numeric_cols] = imputer_median.fit_transform(batters_missing_py[numeric_cols])
# 5. KNN imputation (uses similar records)
imputer_knn = KNNImputer(n_neighbors=5)
batters_knn_imputed = batters_missing_py.copy()
batters_knn_imputed[numeric_cols] = imputer_knn.fit_transform(batters_missing_py[numeric_cols])
# 6. Forward fill
batters_ffill_py = batters_missing_py.ffill()  # fillna(method='ffill') is deprecated in modern pandas
# 7. Group-based imputation
batters_missing_py['age_group'] = pd.cut(
batters_missing_py['age'],
bins=[20, 25, 30, 35, 40],
labels=['Young', 'Prime', 'Veteran', 'Older']
)
batters_group_imputed = batters_missing_py.groupby('age_group').transform(
lambda x: x.fillna(x.mean()) if x.dtype != 'object' else x
)
batters_group_imputed['player'] = batters_missing_py['player']
print("\nOriginal missing counts:", batters_missing_py.isnull().sum().sum())
print("After KNN imputation:", batters_knn_imputed.isnull().sum().sum())
2.5.3 When to Exclude vs. Impute {#exclude-vs-impute}
Exclude (listwise deletion) when:
- Missing data is MCAR and you have abundant data
- The proportion of missing data is very small (<5%)
- Imputation might introduce more bias than deletion
Impute when:
- Missing data is MAR and related to other observed variables
- You have moderate missingness (5-20%) and need to preserve sample size
- The missing data follows a pattern you can model
Example decision framework:
# R: Assess missingness and decide
assess_missingness <- function(df, threshold = 0.05) {
missing_rates <- df %>%
summarize(across(everything(), ~sum(is.na(.)) / n())) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing_rate")
recommendations <- missing_rates %>%
mutate(
recommendation = case_when(
missing_rate == 0 ~ "No action needed",
missing_rate < threshold ~ "Consider listwise deletion",
missing_rate < 0.20 ~ "Imputation recommended",
TRUE ~ "Consider dropping variable or collecting more data"
)
)
return(recommendations)
}
recommendations <- assess_missingness(batters_missing, threshold = 0.05)
print(recommendations)
# Python: Assess and decide
def assess_missingness(df, threshold=0.05):
missing_rates = df.isnull().sum() / len(df)
recommendations = []
for col, rate in missing_rates.items():
if rate == 0:
rec = "No action needed"
elif rate < threshold:
rec = "Consider listwise deletion"
elif rate < 0.20:
rec = "Imputation recommended"
else:
rec = "Consider dropping variable"
recommendations.append({
'variable': col,
'missing_rate': rate,
'recommendation': rec
})
return pd.DataFrame(recommendations)
recommendations_py = assess_missingness(batters_missing_py, threshold=0.05)
print(recommendations_py)
Exercise 2.1: Advanced Filtering and Selection
Using the baseballr package (R) or pybaseball (Python), retrieve 2024 batting data and answer:
- How many players hit 30+ home runs with a strikeout rate below 20%?
- Which players had an OBP of at least .350 and stole 20+ bases?
- Create a new metric, "Three True Outcomes Rate" = (HR + BB + SO) / PA, and identify the top 10 players
R Solution Sketch:
library(baseballr)
library(tidyverse)
batters <- fg_batter_leaders(2024, 2024, qual = 300)
# 1.
q1 <- batters %>%
filter(HR >= 30, `K_percent` < 20) %>%
nrow()
# 2.
q2 <- batters %>%
filter(OBP >= .350, SB >= 20) %>%
select(Name, OBP, SB)
# 3.
q3 <- batters %>%
mutate(TTO_rate = (HR + BB + SO) / PA) %>%
arrange(desc(TTO_rate)) %>%
select(Name, HR, BB, SO, PA, TTO_rate) %>%
head(10)
Python Solution Sketch:
from pybaseball import batting_stats
batters_py = batting_stats(2024, qual=300)
# 1.
q1_py = batters_py[(batters_py['HR'] >= 30) & (batters_py['K%'] < 20)]
print(f"Count: {len(q1_py)}")
# 2.
q2_py = batters_py[(batters_py['OBP'] >= .350) & (batters_py['SB'] >= 20)][['Name', 'OBP', 'SB']]
# 3.
batters_py['TTO_rate'] = (batters_py['HR'] + batters_py['BB'] + batters_py['SO']) / batters_py['PA']
q3_py = batters_py.nlargest(10, 'TTO_rate')[['Name', 'HR', 'BB', 'SO', 'PA', 'TTO_rate']]
Exercise 2.2: Grouping and Team Analysis
Calculate team-level statistics from individual player data:
- Which team had the highest average OPS among qualified batters? (a starter sketch follows this list)
- What's the correlation between team home runs and team wins (you'll need to join with standings data)?
- Calculate each team's offensive balance: standard deviation of WAR among their top 5 hitters
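A starting point for part 1, reusing the batters data frame from Exercise 2.1 (column names follow the FanGraphs leaderboard; compute OPS as OBP + SLG if your source returns no OPS column):
team_ops <- batters %>%
group_by(Team) %>%
summarize(
avg_ops = mean(OBP + SLG), # OPS = OBP + SLG
n_qualified = n(),
.groups = "drop"
) %>%
arrange(desc(avg_ops))
print(team_ops)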
Exercise 2.3: Time Series Analysis
Create a 162-game simulated season for a player and:
- Calculate 15-game rolling averages for BA, OBP, and SLG (a rolling-window hint follows this list)
- Identify the longest streak above .300 BA
- Compare first-half vs. second-half performance
- Visualize performance trends over the season
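One way into part 1 is rolling sums from the zoo package. This sketch simulates a bare-bones game log and computes a true 15-game rolling batting average as rolling hits over rolling at-bats; the simulation parameters are arbitrary:
library(tidyverse)
library(zoo) # rollsum()
set.seed(7)
game_log <- data.frame(game = 1:162) %>%
mutate(
ab = rpois(n(), lambda = 4) + 1, # at-bats per game
hits = rbinom(n(), size = ab, prob = .270) # a .270 true-talent hitter
) %>%
mutate(
ba_roll15 = rollsum(hits, 15, fill = NA, align = "right") /
rollsum(ab, 15, fill = NA, align = "right")
)
tail(game_log)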
Exercise 2.4: Joins and Data Integration
Combine batting and pitching data:
- Identify two-way players (appear in both datasets); see the join sketch after this list
- For two-way players, calculate their combined WAR
- Compare offensive WAR vs. pitching WAR
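The core move for part 1 is an inner join on player name, which keeps only rows present in both tables. The inputs here are toy frames with made-up WAR values, for illustration only:
library(tidyverse)
batting_war <- data.frame(
Name = c("Shohei Ohtani", "Aaron Judge"),
WAR = c(6.0, 7.0) # illustrative numbers, not real stats
)
pitching_war <- data.frame(
Name = c("Shohei Ohtani", "Zack Wheeler"),
WAR = c(3.0, 5.5) # illustrative numbers, not real stats
)
two_way <- batting_war %>%
inner_join(pitching_war, by = "Name", suffix = c("_bat", "_pit")) %>%
mutate(combined_war = WAR_bat + WAR_pit)
print(two_way)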
Exercise 2.5: Missing Data Challenge
Using real-world data with missing Statcast metrics:
- Assess the extent and pattern of missing data
- Implement three different imputation strategies
- Compare the impact on correlation between exit velocity and hard-hit rate
- Justify which imputation method is most appropriate and why
This concludes Chapter 2. In the next chapter, we'll explore the rich ecosystem of baseball data sources, learning how to access FanGraphs, Baseball Reference, Statcast, and the Lahman database through R and Python packages.