The tidyverse is a collection of R packages designed for data science. Created by Hadley Wickham and maintained by RStudio, it provides a consistent, human-readable approach to data manipulation that has become the standard in modern R programming.
2.1.1 The Philosophy of Tidy Data {#tidy-data-philosophy}
Tidy data follows three fundamental principles:
- Each variable is a column: Player name, home runs, batting average—each gets its own column
- Each observation is a row: Each player-season, each game, each pitch is a separate row
- Each type of observational unit is a table: Batting stats in one table, pitching stats in another
Consider this untidy baseball data:
# Untidy: Statistics spread across multiple columns by year
untidy_data <- data.frame(
player = c("Aaron Judge", "Shohei Ohtani"),
hr_2023 = c(37, 44),
hr_2024 = c(58, 54),
avg_2023 = c(.267, .304),
avg_2024 = c(.322, .310)
)
The tidy version separates years into rows:
library(tidyverse)
tidy_data <- data.frame(
player = rep(c("Aaron Judge", "Shohei Ohtani"), each = 2),
year = rep(c(2023, 2024), times = 2),
home_runs = c(37, 58, 44, 54),
batting_avg = c(.267, .322, .304, .310)
)
print(tidy_data)
Output:
player year home_runs batting_avg
1 Aaron Judge 2023 37 0.267
2 Aaron Judge 2024 58 0.322
3 Shohei Ohtani 2023 44 0.304
4 Shohei Ohtani 2024 54 0.310
This structure makes it trivial to calculate year-over-year changes, filter by season, or group by player.
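For example, with the data in this shape, a year-over-year change is a single grouped mutate. A minimal sketch using the tidy_data frame above (hr_change is just an illustrative column name):
tidy_data %>%
  arrange(player, year) %>%
  group_by(player) %>%
  mutate(hr_change = home_runs - lag(home_runs)) %>%  # NA for each player's first season
  ungroup()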
2.1.2 The Pipe Operator: %>% and |> {#pipe-operator}
The pipe operator chains operations together, passing the result of one function as the first argument to the next. Think of it as "then" in plain English: "Take the data, then filter it, then select columns, then arrange by value."
R now has two pipe operators:
- %>%: from the magrittr package (loaded with the tidyverse)
- |>: the native R pipe (available in R 4.1+)
Both work similarly for most cases:
library(tidyverse)
# Without pipes - nested and hard to read
result <- arrange(
select(
filter(tidy_data, year == 2024),
player, home_runs
),
desc(home_runs)
)
# With pipes - reads left to right, top to bottom
result <- tidy_data %>%
filter(year == 2024) %>%
select(player, home_runs) %>%
arrange(desc(home_runs))
# Native pipe (equivalent)
result <- tidy_data |>
filter(year == 2024) |>
select(player, home_runs) |>
arrange(desc(home_runs))
print(result)
Output:
player home_runs
1 Aaron Judge 58
2 Shohei Ohtani 54
We'll use %>% throughout this book as it's more widely used, but |> works identically in most contexts.
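One practical difference worth knowing: when the piped data should not land in the first argument, the two pipes use different placeholders. A quick sketch (the lm() call is purely illustrative):
# magrittr pipe: "." stands for the piped data anywhere in the call
model1 <- tidy_data %>% lm(home_runs ~ year, data = .)
# native pipe: "_" (R 4.2+) must be passed to a named argument
model2 <- tidy_data |> lm(home_runs ~ year, data = _)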
2.1.3 The Core dplyr Verbs {#dplyr-verbs}
The dplyr package (part of the tidyverse) provides six essential verbs for data manipulation. Let's explore each with baseball data.
filter(): Keep Rows Based on Conditions
Filter selects rows that meet specific criteria.
library(baseballr)
library(tidyverse)
# Get 2024 batting leaders (requires baseballr package)
# Note: This fetches real data from FanGraphs
batters_2024 <- fg_batter_leaders(2024, 2024, qual = 200)
# Filter for players with 30+ home runs
power_hitters <- batters_2024 %>%
filter(HR >= 30)
# Multiple conditions: 30+ HR AND .280+ AVG
elite_hitters <- batters_2024 %>%
filter(HR >= 30, AVG >= .280)
# OR condition: Either 30+ HR or 100+ RBI
production <- batters_2024 %>%
filter(HR >= 30 | RBI >= 100)
# Filter by player type
pull_hitters <- batters_2024 %>%
filter(Pull_percent > 40)
select(): Choose Columns
Select picks specific columns or removes unwanted ones.
# Select basic offensive stats
basic_stats <- batters_2024 %>%
select(Name, Team, G, PA, HR, RBI, AVG, OBP, SLG)
# Select columns by pattern
rate_stats <- batters_2024 %>%
select(Name, ends_with("_percent"))
# Remove columns
no_ids <- batters_2024 %>%
select(-playerid, -xMLBAMID)
# Reorder and rename
clean_stats <- batters_2024 %>%
select(
player = Name,
team = Team,
games = G,
home_runs = HR,
avg = AVG
)
mutate(): Create New Columns
Mutate adds new variables or transforms existing ones.
# Calculate isolated power (SLG - AVG)
batters_enhanced <- batters_2024 %>%
mutate(
ISO = SLG - AVG,
wOBA_minus_xwOBA = wOBA - xwOBA,
AB = PA - BB - HBP - SF, # Approximate at-bats
singles = H - `2B` - `3B` - HR
)
# Conditional mutations with case_when
batters_categorized <- batters_2024 %>%
mutate(
power_category = case_when(
HR >= 40 ~ "Elite",
HR >= 30 ~ "Excellent",
HR >= 20 ~ "Above Average",
HR >= 10 ~ "Average",
TRUE ~ "Below Average"
),
contact_skill = case_when(
`K_percent` < 15 ~ "Excellent",
`K_percent` < 20 ~ "Good",
`K_percent` < 25 ~ "Average",
TRUE ~ "Poor"
)
)
# View the new categories
batters_categorized %>%
select(Name, HR, power_category, `K_percent`, contact_skill) %>%
head(10)
arrange(): Sort Rows
Arrange sorts data by one or more columns.
# Sort by home runs (ascending)
by_hr <- batters_2024 %>%
arrange(HR)
# Sort descending (most HR first)
hr_leaders <- batters_2024 %>%
arrange(desc(HR)) %>%
head(10)
# Multiple columns: sort by team, then by HR within team
by_team_hr <- batters_2024 %>%
arrange(Team, desc(HR))
# Most balanced hitters (low strikeout rate, high walk rate)
disciplined <- batters_2024 %>%
arrange(`K_percent`, desc(`BB_percent`)) %>%
select(Name, `K_percent`, `BB_percent`, AVG, OBP) %>%
head(10)
print(disciplined)
summarize(): Aggregate Data
Summarize (or summarise) reduces data to summary statistics.
# League-wide statistics
league_summary <- batters_2024 %>%
summarize(
total_players = n(),
total_hr = sum(HR),
avg_hr = mean(HR),
median_hr = median(HR),
avg_ba = mean(AVG),
avg_obp = mean(OBP),
avg_slg = mean(SLG)
)
print(league_summary)
Output:
total_players total_hr avg_hr median_hr avg_ba avg_obp avg_slg
1 147 3247 22.088 19.0 0.2536 0.3245 0.4182
group_by(): Group Data for Split Analysis
The true power of summarize emerges when combined with group_by, which splits data into groups before aggregating.
# Statistics by team
team_summary <- batters_2024 %>%
group_by(Team) %>%
summarize(
players = n(),
total_hr = sum(HR),
avg_hr_per_player = mean(HR),
team_avg = mean(AVG),
team_obp = mean(OBP),
.groups = "drop"
) %>%
arrange(desc(total_hr))
print(team_summary)
Output:
Team players total_hr avg_hr_per_player team_avg team_obp
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 NYY 5 183 36.6 0.266 0.351
2 ATL 6 152 25.3 0.258 0.342
3 LAD 6 148 24.7 0.271 0.355
More complex grouping:
# Analyze power vs. contact by handedness (if available)
# Create position groups
position_analysis <- batters_2024 %>%
mutate(
pos_group = case_when(
grepl("C", Pos) ~ "Catcher",
grepl("1B", Pos) ~ "1B/DH",
grepl("2B|SS", Pos) ~ "Middle Infield",
grepl("3B", Pos) ~ "Corner Infield",
grepl("OF", Pos) ~ "Outfield",
TRUE ~ "Other"
)
) %>%
group_by(pos_group) %>%
summarize(
count = n(),
avg_hr = mean(HR),
avg_sb = mean(SB),
avg_avg = mean(AVG),
avg_k_pct = mean(`K_percent`),
.groups = "drop"
) %>%
arrange(desc(avg_hr))
print(position_analysis)
2.1.4 Advanced dplyr Operations {#advanced-dplyr}
Joins: Combining Multiple Data Sources
Joins merge two datasets based on common columns. This is essential when combining batting and pitching data, or merging FanGraphs stats with Statcast metrics.
# Create sample datasets
batting <- data.frame(
player_id = c(1001, 1002, 1003),
name = c("Aaron Judge", "Juan Soto", "Mookie Betts"),
hr = c(58, 41, 19),
avg = c(.322, .288, .289)
)
fielding <- data.frame(
player_id = c(1001, 1002, 1004),
name = c("Aaron Judge", "Juan Soto", "Bobby Witt Jr."),
position = c("RF", "RF", "SS"),
drs = c(8, -5, 12)
)
# Inner join: only players in both datasets
inner_result <- batting %>%
inner_join(fielding, by = "player_id")
# Left join: all batting players, add fielding where available
left_result <- batting %>%
left_join(fielding, by = "player_id", suffix = c("", "_field"))
# Full join: all players from both datasets
full_result <- batting %>%
full_join(fielding, by = "player_id", suffix = c("_bat", "_field"))
print(left_result)
Output:
player_id name hr avg name_field position drs
1 1001 Aaron Judge 58 0.322 Aaron Judge RF 8
2 1002 Juan Soto 41 0.288 Juan Soto RF -5
3 1003 Mookie Betts 19 0.289 <NA> <NA> NA
Real-world example combining FanGraphs and custom data:
# Suppose we have team payroll data
payroll <- data.frame(
Team = c("NYY", "LAD", "NYM", "SD", "PHI"),
payroll_millions = c(306, 285, 275, 251, 243)
)
# Join with team performance
team_performance <- batters_2024 %>%
group_by(Team) %>%
summarize(
team_war = sum(WAR),
team_hr = sum(HR),
.groups = "drop"
) %>%
inner_join(payroll, by = "Team") %>%
mutate(war_per_million = team_war / payroll_millions) %>%
arrange(desc(war_per_million))
Pivots: Reshaping Data
Pivoting transforms data between wide and long formats.
# Long format (tidy)
player_stats_long <- data.frame(
player = rep(c("Judge", "Ohtani"), each = 3),
stat = rep(c("HR", "RBI", "AVG"), times = 2),
value = c(58, 144, .322, 54, 130, .310)
)
# Pivot wider: stats as columns
player_stats_wide <- player_stats_long %>%
pivot_wider(names_from = stat, values_from = value)
print(player_stats_wide)
Output:
player HR RBI AVG
1 Judge 58 144 0.322
2 Ohtani 54 130 0.310
# Pivot longer: years as rows (common for time series)
yearly_hr <- data.frame(
player = c("Judge", "Ohtani"),
HR_2022 = c(62, 34),
HR_2023 = c(37, 44),
HR_2024 = c(58, 54)
)
yearly_hr_long <- yearly_hr %>%
pivot_longer(
cols = starts_with("HR_"),
names_to = "year",
values_to = "home_runs",
names_prefix = "HR_"
) %>%
mutate(year = as.integer(year))
print(yearly_hr_long)
Output:
player year home_runs
1 Judge 2022 62
2 Judge 2023 37
3 Judge 2024 58
4 Ohtani 2022 34
5 Ohtani 2023 44
6 Ohtani 2024 54
Window Functions: Calculations Within Groups
Window functions perform calculations across rows related to the current row, without collapsing the data.
# Rank players by HR within each team
ranked_batters <- batters_2024 %>%
group_by(Team) %>%
mutate(
team_hr_rank = rank(desc(HR)),
team_hr_dense_rank = dense_rank(desc(HR)),
pct_of_team_hr = HR / sum(HR)
) %>%
ungroup() %>%
select(Name, Team, HR, team_hr_rank, pct_of_team_hr) %>%
arrange(Team, team_hr_rank)
# Calculate cumulative statistics
cumulative_stats <- yearly_hr_long %>%
arrange(player, year) %>%
group_by(player) %>%
mutate(
career_hr = cumsum(home_runs),
career_avg_hr = cummean(home_runs),
hr_from_last_year = home_runs - lag(home_runs)
) %>%
ungroup()
print(cumulative_stats)
Output:
player year home_runs career_hr career_avg_hr hr_from_last_year
1 Judge 2022 62 62 62.0 NA
2 Judge 2023 37 99 49.5 -25
3 Judge 2024 58 157 52.3 21
4 Ohtani 2022 34 34 34.0 NA
5 Ohtani 2023 44 78 39.0 10
6 Ohtani 2024 54 132 44.0 10
Pandas is the cornerstone of data analysis in Python. Built on NumPy, it provides DataFrame and Series objects that make data manipulation intuitive and efficient. If you're coming from R, pandas DataFrames are similar to data.frames or tibbles, with some syntactic differences.
2.2.1 DataFrames and Series {#dataframes-series}
A Series is a one-dimensional array with labels. A DataFrame is a two-dimensional table with labeled rows and columns.
import pandas as pd
import numpy as np
# Create a Series (one column of data)
home_runs = pd.Series(
[58, 54, 44, 41],
index=['Judge', 'Ohtani', 'Schwarber', 'Soto'],
name='HR'
)
print(home_runs)
Output:
Judge 58
Ohtani 54
Schwarber 44
Soto 41
Name: HR, dtype: int64
# Create a DataFrame
players_df = pd.DataFrame({
'player': ['Aaron Judge', 'Shohei Ohtani', 'Juan Soto', 'Kyle Schwarber'],
'team': ['NYY', 'LAD', 'NYY', 'PHI'],
'hr': [58, 54, 41, 38],
'avg': [.322, .310, .288, .250],
'obp': [.458, .390, .419, .373],
'slg': [.701, .646, .569, .488]
})
print(players_df)
Output:
player team hr avg obp slg
0 Aaron Judge NYY 58 0.322 0.458 0.701
1 Shohei Ohtani LAD 54 0.310 0.390 0.646
2 Juan Soto NYY 41 0.288 0.419 0.569
3 Kyle Schwarber PHI 38 0.250 0.373 0.488
# Inspecting DataFrames
print(players_df.shape) # (4, 6) - 4 rows, 6 columns
print(players_df.columns) # Column names
print(players_df.dtypes) # Data types
print(players_df.head()) # First 5 rows
print(players_df.info()) # Summary information
print(players_df.describe()) # Summary statistics for numeric columns
2.2.2 Selecting and Filtering Data {#pandas-selecting}
Pandas offers multiple ways to select data: bracket notation, .loc[] (label-based), and .iloc[] (integer position-based).
# Select a single column (returns Series)
hrs = players_df['hr']
teams = players_df.team # Alternative dot notation
# Select multiple columns (returns DataFrame)
basic_stats = players_df[['player', 'team', 'hr', 'avg']]
# Select rows by position with iloc
first_two_players = players_df.iloc[0:2] # First two rows
last_player = players_df.iloc[-1] # Last row
# Select rows by label with loc
# (Requires setting an index first)
players_indexed = players_df.set_index('player')
judge_stats = players_indexed.loc['Aaron Judge']
print(judge_stats)
Output:
team NYY
hr 58
avg 0.322
obp 0.458
slg 0.701
Name: Aaron Judge, dtype: object
Filtering (Boolean Indexing)
# Filter for players with 40+ home runs
power_hitters = players_df[players_df['hr'] >= 40]
# Multiple conditions with & (and) and | (or)
# Note: Each condition must be in parentheses
elite_hitters = players_df[(players_df['hr'] >= 40) & (players_df['avg'] >= .280)]
yankees = players_df[players_df['team'] == 'NYY']
# Complex filtering
qualified = players_df[
(players_df['hr'] >= 30) &
(players_df['obp'] >= .350) &
(players_df['slg'] >= .500)
]
print(qualified)
Output:
player team hr avg obp slg
0 Aaron Judge NYY 58 0.322 0.458 0.701
1 Shohei Ohtani LAD 54 0.310 0.390 0.646
2 Juan Soto NYY 41 0.288 0.419 0.569
# Using query() for more readable filtering
power_hitters_query = players_df.query('hr >= 40')
yankees_power = players_df.query('team == "NYY" and hr >= 40')
# Using isin() for multiple values
east_coast = players_df[players_df['team'].isin(['NYY', 'PHI', 'NYM', 'BOS'])]
2.2.3 Creating New Columns {#pandas-new-columns}
# Simple calculation
players_df['iso'] = players_df['slg'] - players_df['avg']
# Multiple operations
players_df['ops'] = players_df['obp'] + players_df['slg']
players_df['hr_per_1000_pa'] = (players_df['hr'] / 600) * 1000 # Assuming 600 PA
# Conditional column with np.where
players_df['power_tier'] = np.where(
players_df['hr'] >= 50, 'Elite',
np.where(players_df['hr'] >= 40, 'Excellent', 'Good')
)
# More complex conditions with pd.cut (binning)
players_df['hr_category'] = pd.cut(
players_df['hr'],
bins=[0, 30, 40, 50, 100],
labels=['Below 30', '30-40', '40-50', '50+']
)
# Apply custom functions
def classify_hitter(row):
if row['hr'] >= 40 and row['avg'] >= .280:
return 'Power & Average'
elif row['hr'] >= 40:
return 'Power'
elif row['avg'] >= .280:
return 'Average'
else:
return 'Other'
players_df['hitter_type'] = players_df.apply(classify_hitter, axis=1)
print(players_df[['player', 'hr', 'avg', 'hitter_type']])
Output:
player hr avg hitter_type
0 Aaron Judge 58 0.322 Power & Average
1 Shohei Ohtani 54 0.310 Power & Average
2 Juan Soto 41 0.288 Power & Average
3 Kyle Schwarber 38 0.250 Other
2.2.4 Grouping and Aggregating {#pandas-groupby}
The groupby() method splits data into groups, applies a function, and combines results—similar to dplyr's group_by() + summarize().
# Create larger dataset with multiple teams
batters_data = pd.DataFrame({
'player': ['Judge', 'Soto', 'Ohtani', 'Betts', 'Freeman', 'Schwarber', 'Harper', 'Turner'],
'team': ['NYY', 'NYY', 'LAD', 'LAD', 'LAD', 'PHI', 'PHI', 'PHI'],
'hr': [58, 41, 54, 19, 22, 38, 26, 21],
'rbi': [144, 109, 130, 75, 89, 104, 87, 86],
'avg': [.322, .288, .310, .289, .282, .250, .285, .298]
})
# Group by team and calculate totals
team_totals = batters_data.groupby('team').agg({
'hr': 'sum',
'rbi': 'sum',
'avg': 'mean'
})
print(team_totals)
Output:
hr rbi avg
team
LAD 95 294 0.293667
NYY 99 253 0.305000
PHI 85 277 0.277667
# Multiple aggregations per column
team_summary = batters_data.groupby('team').agg({
'hr': ['sum', 'mean', 'max'],
'avg': ['mean', 'min', 'max'],
'player': 'count'
}).round(3)
# Named aggregations (pandas 0.25+)
team_stats = batters_data.groupby('team').agg(
total_hr=('hr', 'sum'),
avg_hr=('hr', 'mean'),
top_hr=('hr', 'max'),
team_avg=('avg', 'mean'),
player_count=('player', 'count')
).round(2)
print(team_stats)
Output:
total_hr avg_hr top_hr team_avg player_count
team
LAD 95 31.67 54 0.29 3
NYY 99 49.50 58 0.31 2
PHI 85 28.33 38 0.28 3
# Group by multiple columns
# Add league information
batters_data['league'] = batters_data['team'].map({
'NYY': 'AL', 'LAD': 'NL', 'PHI': 'NL'
})
league_team_stats = batters_data.groupby(['league', 'team']).agg({
'hr': 'sum',
'player': 'count'
}).reset_index()
print(league_team_stats)
Output:
league team hr player
0 AL NYY 99 2
1 NL LAD 95 3
2 NL PHI 85 3
2.2.5 Merging DataFrames {#pandas-merging}
Pandas provides several ways to combine DataFrames: merge() (SQL-style joins), concat() (stacking), and join() (index-based).
# Create batting and fielding datasets
batting = pd.DataFrame({
'player_id': [1001, 1002, 1003],
'name': ['Aaron Judge', 'Juan Soto', 'Mookie Betts'],
'hr': [58, 41, 19],
'avg': [.322, .288, .289]
})
fielding = pd.DataFrame({
'player_id': [1001, 1002, 1004],
'name': ['Aaron Judge', 'Juan Soto', 'Bobby Witt Jr.'],
'position': ['RF', 'RF', 'SS'],
'drs': [8, -5, 12]
})
# Inner join (default)
inner = pd.merge(batting, fielding, on='player_id', suffixes=('_bat', '_field'))
# Left join (keep all batting records)
left = pd.merge(batting, fielding, on='player_id', how='left', suffixes=('_bat', '_field'))
# Outer join (keep all records from both)
outer = pd.merge(batting, fielding, on='player_id', how='outer', suffixes=('_bat', '_field'))
print(left)
Output:
player_id name_bat hr avg name_field position drs
0 1001 Aaron Judge 58 0.322 Aaron Judge RF 8.0
1 1002 Juan Soto 41 0.288 Juan Soto RF -5.0
2 1003 Mookie Betts 19 0.289 NaN NaN NaN
# Concatenate DataFrames vertically (stack rows from two frames)
al_batters = pd.DataFrame({
'player': ['Judge', 'Witt'],
'team': ['NYY', 'KC'],
'hr': [58, 32]
})
nl_batters = pd.DataFrame({
'player': ['Ohtani', 'Betts'],
'team': ['LAD', 'LAD'],
'hr': [54, 19]
})
all_batters = pd.concat([al_batters, nl_batters], ignore_index=True)
print(all_batters)
Output:
player team hr
0 Judge NYY 58
1 Witt KC 32
2 Ohtani LAD 54
3 Betts LAD 19
Now that we understand the core tools, let's apply them to common baseball analytics tasks.
2.3.1 Calculating Rate Stats (per PA, per IP) {#rate-stats}
Raw counting stats (HR, RBI, SO) depend on playing time. Rate statistics normalize these for comparison.
R Example: Batting Rates
library(baseballr)
library(tidyverse)
# Get 2024 data
batters <- fg_batter_leaders(2024, 2024, qual = 100)
# Calculate per-PA and per-game rates
batters_rates <- batters %>%
mutate(
# Runs produced (R + RBI - HR) per 27 outs, a crude stand-in for RC/27
RC27 = (R + RBI - HR) / (AB - H + CS + GIDP) * 27,
# Home runs per plate appearance
HR_per_PA = HR / PA,
# Walks per plate appearance
BB_per_PA = BB / PA,
# Extra base hits
XBH = `2B` + `3B` + HR,
XBH_per_PA = XBH / PA,
# Total bases per plate appearance (TB = H + 2B + 2x3B + 3xHR)
TB = H + `2B` + (`3B` * 2) + (HR * 3),
TB_per_PA = TB / PA,
# Speed score components
SB_per_game = SB / G,
# Production: OBP + SLG (quick OPS equivalent)
Production = OBP + SLG
)
# Top home run hitters by rate
hr_rate_leaders <- batters_rates %>%
filter(PA >= 400) %>%
select(Name, Team, PA, HR, HR_per_PA) %>%
arrange(desc(HR_per_PA)) %>%
head(10)
print(hr_rate_leaders)
Python Example: Pitching Rates
from pybaseball import pitching_stats
import pandas as pd
# Get 2024 pitching data
pitchers = pitching_stats(2024, qual=100)
# Calculate rate statistics
# Note: if IP is reported in thirds (e.g., 180.1 = 180 1/3 innings),
# convert to true innings before dividing; we use the raw value here
pitchers['K_per_9'] = (pitchers['SO'] / pitchers['IP']) * 9
pitchers['BB_per_9'] = (pitchers['BB'] / pitchers['IP']) * 9
pitchers['HR_per_9'] = (pitchers['HR'] / pitchers['IP']) * 9
pitchers['H_per_9'] = (pitchers['H'] / pitchers['IP']) * 9
# K/BB ratio
pitchers['K_BB_ratio'] = pitchers['SO'] / pitchers['BB']
# Strikeout rate and walk rate (per batter faced)
# Note: Estimate BF if not available
pitchers['est_BF'] = pitchers['IP'] * 3 + pitchers['H'] + pitchers['BB'] + pitchers['HBP']
pitchers['K_rate'] = pitchers['SO'] / pitchers['est_BF']
pitchers['BB_rate'] = pitchers['BB'] / pitchers['est_BF']
# WHIP (Walks + Hits per Inning Pitched)
pitchers['WHIP'] = (pitchers['BB'] + pitchers['H']) / pitchers['IP']
# Quality starts rate (if available)
if 'QS' in pitchers.columns:
pitchers['QS_rate'] = pitchers['QS'] / pitchers['GS']
# View elite strikeout pitchers
elite_k = pitchers.nlargest(10, 'K_per_9')[
['Name', 'Team', 'IP', 'SO', 'K_per_9', 'K_BB_ratio', 'ERA']
]
print(elite_k)
2.3.2 Rolling Averages and Moving Windows {#rolling-averages}
Rolling statistics smooth out noise and identify trends over time, such as hot and cold streaks.
R Example: 10-Game Rolling Average
# Simulate game-by-game data for a player
set.seed(123)
game_log <- data.frame(
game_num = 1:100,
date = seq(as.Date("2024-04-01"), by = "day", length.out = 100),
hits = rpois(100, lambda = 1.2),
ab = sample(3:5, 100, replace = TRUE)
) %>%
mutate(
batting_avg = hits / ab,
# 10-game rolling average
rolling_avg_10 = zoo::rollmean(batting_avg, k = 10, fill = NA, align = "right"),
# 20-game rolling average
rolling_avg_20 = zoo::rollmean(batting_avg, k = 20, fill = NA, align = "right"),
# Cumulative season average
season_avg = cumsum(hits) / cumsum(ab)
)
# Identify hot streaks (10-game avg > .300)
hot_streaks <- game_log %>%
filter(rolling_avg_10 > .300) %>%
select(game_num, date, rolling_avg_10, season_avg)
print(hot_streaks)
# Visualize (if plotting)
library(ggplot2)
ggplot(game_log, aes(x = game_num)) +
geom_line(aes(y = batting_avg), alpha = 0.3, color = "gray") +
geom_line(aes(y = rolling_avg_10), color = "blue", linewidth = 1) +
geom_line(aes(y = season_avg), color = "red", linetype = "dashed") +
geom_hline(yintercept = 0.250, linetype = "dotted") +
labs(
title = "Batting Average: Game-by-Game vs. Rolling Average",
x = "Game Number",
y = "Batting Average",
caption = "Blue = 10-game rolling avg, Red = Season avg"
) +
theme_minimal()
Python Example: Rolling Statcast Metrics
import pandas as pd
import numpy as np
# Simulate pitch-level data aggregated by game
np.random.seed(42)
game_log_py = pd.DataFrame({
'game_num': range(1, 101),
'date': pd.date_range('2024-04-01', periods=100),
'exit_velo': np.random.normal(90, 5, 100), # Average exit velocity
'launch_angle': np.random.normal(15, 10, 100),
'hard_hit_pct': np.random.uniform(0.3, 0.6, 100),
'barrel_pct': np.random.uniform(0.05, 0.15, 100)
})
# Calculate rolling averages
game_log_py['ev_rolling_10'] = game_log_py['exit_velo'].rolling(window=10).mean()
game_log_py['ev_rolling_20'] = game_log_py['exit_velo'].rolling(window=20).mean()
game_log_py['hard_hit_rolling'] = game_log_py['hard_hit_pct'].rolling(window=10).mean()
game_log_py['barrel_rolling'] = game_log_py['barrel_pct'].rolling(window=10).mean()
# Expanding window (cumulative)
game_log_py['ev_season_avg'] = game_log_py['exit_velo'].expanding().mean()
# Identify games with elite contact quality
elite_contact = game_log_py[
(game_log_py['ev_rolling_10'] > 92) &
(game_log_py['hard_hit_rolling'] > 0.45)
][['game_num', 'date', 'ev_rolling_10', 'hard_hit_rolling']]
print(elite_contact.head(10))
2.3.3 Year-over-Year Comparisons {#year-over-year}
Comparing performance across seasons reveals development trends and aging curves.
R Example: YoY HR Comparison
library(baseballr)
library(tidyverse)
# Get multiple years of data
batters_2023 <- fg_batter_leaders(2023, 2023, qual = 200)
batters_2024 <- fg_batter_leaders(2024, 2024, qual = 200)
# Prepare datasets
batters_23 <- batters_2023 %>%
select(Name, HR_2023 = HR, AVG_2023 = AVG, OPS_2023 = OPS, WAR_2023 = WAR)
batters_24 <- batters_2024 %>%
select(Name, HR_2024 = HR, AVG_2024 = AVG, OPS_2024 = OPS, WAR_2024 = WAR)
# Join and calculate changes
yoy_comparison <- batters_23 %>%
inner_join(batters_24, by = "Name") %>%
mutate(
HR_change = HR_2024 - HR_2023,
AVG_change = AVG_2024 - AVG_2023,
OPS_change = OPS_2024 - OPS_2023,
WAR_change = WAR_2024 - WAR_2023,
HR_pct_change = (HR_change / HR_2023) * 100
)
# Biggest improvements
biggest_improvers <- yoy_comparison %>%
arrange(desc(HR_change)) %>%
select(Name, HR_2023, HR_2024, HR_change, HR_pct_change) %>%
head(10)
print(biggest_improvers)
# Players who regressed
biggest_decliners <- yoy_comparison %>%
arrange(HR_change) %>%
select(Name, HR_2023, HR_2024, HR_change) %>%
head(10)
print(biggest_decliners)
Python Example: Multi-Year Career Trajectory
from pybaseball import batting_stats
import pandas as pd
# Get multiple years
years = [2022, 2023, 2024]
all_years = []
for year in years:
df = batting_stats(year, qual=300)
df['Season'] = year
all_years.append(df)
multi_year = pd.concat(all_years, ignore_index=True)
# Calculate career trajectory for specific players
target_players = ['Aaron Judge', 'Juan Soto', 'Mookie Betts']
trajectories = multi_year[multi_year['Name'].isin(target_players)].copy()
trajectories = trajectories.sort_values(['Name', 'Season'])
# Calculate year-over-year changes
trajectories['HR_yoy'] = trajectories.groupby('Name')['HR'].diff()
trajectories['AVG_yoy'] = trajectories.groupby('Name')['AVG'].diff()
trajectories['OPS_yoy'] = trajectories.groupby('Name')['OPS'].diff()
# View trajectories
trajectory_summary = trajectories[[
'Name', 'Season', 'HR', 'HR_yoy', 'AVG', 'AVG_yoy', 'OPS', 'OPS_yoy'
]].round(3)
print(trajectory_summary)
2.3.4 Split Analysis (home/away, vs LHP/RHP) {#split-analysis}
Splits reveal situational performance patterns crucial for lineup construction and matchup decisions.
R Example: Home/Road Splits
# Using baseballr to get splits (if available in your version)
# Alternative: simulate or load from Baseball Reference
# Simulated home/road data
set.seed(456)
player_splits <- expand.grid(
player = c("Aaron Judge", "Juan Soto", "Shohei Ohtani"),
split = c("Home", "Away")
) %>%
mutate(
PA = sample(250:350, n(), replace = TRUE),
HR = case_when(
split == "Home" ~ rpois(n(), lambda = 25),
TRUE ~ rpois(n(), lambda = 18)
),
AVG = case_when(
split == "Home" ~ runif(n(), .270, .340),
TRUE ~ runif(n(), .240, .310)
),
OPS = case_when(
split == "Home" ~ runif(n(), .850, 1.100),
TRUE ~ runif(n(), .750, .950)
)
)
# Calculate home/away differentials
split_comparison <- player_splits %>%
pivot_wider(
id_cols = player,
names_from = split,
values_from = c(PA, HR, AVG, OPS)
) %>%
mutate(
HR_diff = HR_Home - HR_Away,
AVG_diff = AVG_Home - AVG_Away,
OPS_diff = OPS_Home - OPS_Away,
home_advantage = case_when(
OPS_diff > 0.100 ~ "Strong home hitter",
OPS_diff > 0.050 ~ "Slight home advantage",
OPS_diff > -0.050 ~ "Neutral",
TRUE ~ "Better away"
)
)
print(split_comparison)
Python Example: LHP/RHP Splits
import pandas as pd
import numpy as np
# Simulate platoon splits data
np.random.seed(789)
players = ['Judge', 'Soto', 'Ohtani', 'Betts', 'Freeman']
handedness = ['vs_LHP', 'vs_RHP']
split_data = []
for player in players:
for hand in handedness:
# RHB typically better vs LHP
if hand == 'vs_LHP':
avg = np.random.uniform(0.280, 0.340)
slg = np.random.uniform(0.520, 0.720)
pa = np.random.randint(150, 250)
else:
avg = np.random.uniform(0.250, 0.310)
slg = np.random.uniform(0.450, 0.620)
pa = np.random.randint(350, 450)
split_data.append({
'player': player,
'pitcher_hand': hand,
'PA': pa,
'AVG': round(avg, 3),
'SLG': round(slg, 3),
'OPS': round(avg + slg, 3)  # crude OPS proxy: AVG + SLG (no OBP simulated)
})
splits_df = pd.DataFrame(split_data)
# Pivot to compare
splits_wide = splits_df.pivot(
index='player',
columns='pitcher_hand',
values=['PA', 'AVG', 'SLG', 'OPS']
)
# Calculate platoon differential
splits_wide['AVG_diff'] = splits_wide[('AVG', 'vs_LHP')] - splits_wide[('AVG', 'vs_RHP')]
splits_wide['OPS_diff'] = splits_wide[('OPS', 'vs_LHP')] - splits_wide[('OPS', 'vs_RHP')]
# Identify severe platoon splits
splits_wide['platoon_severity'] = pd.cut(
splits_wide['OPS_diff'],
bins=[-1, 0.050, 0.150, 0.300, 2],
labels=['Reverse', 'Mild', 'Moderate', 'Severe']
)
print(splits_wide)
Baseball is inherently temporal: seasons progress, players age, trends emerge. Effective date handling is essential for time-based analysis.
2.4.1 Game Dates and Scheduling {#game-dates}
R Example: Date Manipulation
library(lubridate)
library(tidyverse)
# Create a game schedule
schedule <- data.frame(
game_date = seq(as.Date("2024-03-28"), as.Date("2024-09-29"), by = "day")
) %>%
# Filter to realistic game dates (not every day)
filter(wday(game_date) %in% c(1, 3, 4, 5, 6, 7)) %>% # Drop Mondays (wday 2), a common off day
mutate(
year = year(game_date),
month = month(game_date, label = TRUE),
day_of_week = wday(game_date, label = TRUE),
week_of_season = week(game_date) - week(as.Date("2024-03-28")) + 1,
is_weekend = wday(game_date) %in% c(1, 7),
season_phase = case_when(
month %in% c("Mar", "Apr") ~ "Early",
month %in% c("May", "Jun", "Jul") ~ "Mid",
month %in% c("Aug", "Sep") ~ "Late"
)
)
# Games by month
games_by_month <- schedule %>%
count(month, season_phase) %>%
arrange(month)
print(games_by_month)
# Weekend vs. weekday
schedule %>%
count(is_weekend) %>%
mutate(pct = n / sum(n) * 100)
Python Example: Date Parsing and Filtering
import pandas as pd
from datetime import datetime, timedelta
# Create schedule
start_date = pd.to_datetime('2024-03-28')
end_date = pd.to_datetime('2024-09-29')
date_range = pd.date_range(start_date, end_date, freq='D')
schedule_py = pd.DataFrame({'game_date': date_range})
# Extract date components
schedule_py['year'] = schedule_py['game_date'].dt.year
schedule_py['month'] = schedule_py['game_date'].dt.month
schedule_py['month_name'] = schedule_py['game_date'].dt.month_name()
schedule_py['day_of_week'] = schedule_py['game_date'].dt.day_name()
schedule_py['week_num'] = schedule_py['game_date'].dt.isocalendar().week
schedule_py['is_weekend'] = schedule_py['game_date'].dt.dayofweek >= 5
# Season phases
def get_season_phase(month):
if month in [3, 4]:
return 'Early'
elif month in [5, 6, 7]:
return 'Mid'
else:
return 'Late'
schedule_py['season_phase'] = schedule_py['month'].apply(get_season_phase)
# Summary statistics
phase_summary = schedule_py.groupby('season_phase').agg({
'game_date': 'count',
'is_weekend': 'sum'
}).rename(columns={'game_date': 'total_days', 'is_weekend': 'weekend_days'})
print(phase_summary)
2.4.2 Time-Based Filtering {#time-filtering}
R Example: Date Filtering
# Filter by date range
early_season <- schedule %>%
filter(game_date >= as.Date("2024-03-28"),
game_date <= as.Date("2024-05-31"))
# Filter by month
june_games <- schedule %>%
filter(month == "Jun")
# Last 30 days
recent_games <- schedule %>%
filter(game_date >= Sys.Date() - days(30))
# Specific day of week
weekend_games <- schedule %>%
filter(day_of_week %in% c("Sat", "Sun"))
Python Example: Date Filtering
# Filter by date range
early_season_py = schedule_py[
(schedule_py['game_date'] >= '2024-03-28') &
(schedule_py['game_date'] <= '2024-05-31')
]
# Filter by month
june_games_py = schedule_py[schedule_py['month'] == 6]
# Last 30 days (from today)
cutoff_date = pd.Timestamp.today() - pd.Timedelta(days=30)
recent_games_py = schedule_py[schedule_py['game_date'] >= cutoff_date]
# Weekend games
weekend_games_py = schedule_py[schedule_py['is_weekend']]
2.4.3 Seasonal Aggregations {#seasonal-aggregations}
R Example: Monthly Aggregation
# Aggregate performance by month (with simulated data)
set.seed(100)
monthly_performance <- schedule %>%
sample_n(100) %>% # Sample 100 games
mutate(
hits = rpois(n(), lambda = 9),
runs = rpois(n(), lambda = 4.5),
hr = rpois(n(), lambda = 1.2)
) %>%
group_by(month, season_phase) %>%
summarize(
games = n(),
total_hits = sum(hits),
total_runs = sum(runs),
total_hr = sum(hr),
avg_runs_per_game = mean(runs),
.groups = "drop"
)
print(monthly_performance)
Python Example: Seasonal Aggregations
import numpy as np

np.random.seed(100)
sample_games = schedule_py.sample(100).copy()
sample_games['hits'] = np.random.poisson(9, 100)
sample_games['runs'] = np.random.poisson(4.5, 100)
sample_games['hr'] = np.random.poisson(1.2, 100)
monthly_perf_py = sample_games.groupby(['month_name', 'season_phase']).agg({
'game_date': 'count',
'hits': 'sum',
'runs': ['sum', 'mean'],
'hr': 'sum'
}).round(2)
monthly_perf_py.columns = ['_'.join(col).strip('_') for col in monthly_perf_py.columns]
print(monthly_perf_py)
library(lubridate)
library(tidyverse)
# Create a game schedule
schedule <- data.frame(
game_date = seq(as.Date("2024-03-28"), as.Date("2024-09-29"), by = "day")
) %>%
# Filter to realistic game dates (not every day)
filter(wday(game_date) %in% c(1, 3, 4, 5, 6, 7)) %>% # Mostly no off days on Tuesday
mutate(
year = year(game_date),
month = month(game_date, label = TRUE),
day_of_week = wday(game_date, label = TRUE),
week_of_season = week(game_date) - week(as.Date("2024-03-28")) + 1,
is_weekend = wday(game_date) %in% c(1, 7),
season_phase = case_when(
month %in% c("Mar", "Apr") ~ "Early",
month %in% c("May", "Jun", "Jul") ~ "Mid",
month %in% c("Aug", "Sep") ~ "Late"
)
)
# Games by month
games_by_month <- schedule %>%
count(month, season_phase) %>%
arrange(month)
print(games_by_month)
# Weekend vs. weekday
schedule %>%
count(is_weekend) %>%
mutate(pct = n / sum(n) * 100)
# Filter by date range
early_season <- schedule %>%
filter(game_date >= as.Date("2024-03-28"),
game_date <= as.Date("2024-05-31"))
# Filter by month
june_games <- schedule %>%
filter(month == "Jun")
# Last 30 days
recent_games <- schedule %>%
filter(game_date >= Sys.Date() - days(30))
# Specific day of week
weekend_games <- schedule %>%
filter(day_of_week %in% c("Sat", "Sun"))
# Aggregate performance by month (with simulated data)
set.seed(100)
monthly_performance <- schedule %>%
sample_n(100) %>% # Sample 100 games
mutate(
hits = rpois(n(), lambda = 9),
runs = rpois(n(), lambda = 4.5),
hr = rpois(n(), lambda = 1.2)
) %>%
group_by(month, season_phase) %>%
summarize(
games = n(),
total_hits = sum(hits),
total_runs = sum(runs),
total_hr = sum(hr),
avg_runs_per_game = mean(runs),
.groups = "drop"
)
print(monthly_performance)
2.5 Handling Missing Data {#missing-data}
Missing data is ubiquitous in baseball: injured players have no stats, newer metrics weren't tracked historically, and data collection errors occur. How you handle missingness can significantly impact your conclusions.
2.5.1 Types of Missing Data in Baseball {#missing-data-types}
- Missing Completely at Random (MCAR): Data entry errors, random system failures
- Missing at Random (MAR): Older seasons lack Statcast data, but missingness depends on year, not the value itself
- Missing Not at Random (MNAR): Players with poor performance stop playing (survivor bias)
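To make the last two categories concrete, here is a small simulation; the variable names (sprint_speed, war) and the 2015 cutoff are illustrative, not real data.
library(tidyverse)
set.seed(42)
player_seasons <- data.frame(
year = sample(2005:2024, 200, replace = TRUE),
war = rnorm(200, mean = 2, sd = 1.5)
) %>%
mutate(
# MAR: sprint speed wasn't tracked before 2015, so missingness
# depends on an observed variable (year), not on the speed itself
sprint_speed = ifelse(year < 2015, NA, rnorm(n(), mean = 27, sd = 1)),
# MNAR: weak seasons are more likely to vanish from the record as
# the player loses playing time, so missingness depends on the value
war_recorded = ifelse(war < 0 & runif(n()) < 0.7, NA, war)
)
player_seasons %>%
summarize(
pct_speed_missing = mean(is.na(sprint_speed)) * 100,
pct_war_missing = mean(is.na(war_recorded)) * 100
)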
2.5.2 Imputation Strategies {#imputation}
R Example: Detecting and Imputing
# Simulate data with missing values
set.seed(321)
batters_missing <- data.frame(
player = paste("Player", 1:50),
hr = sample(c(rpois(45, lambda = 20), rep(NA, 5))),
avg = sample(c(runif(43, .230, .310), rep(NA, 7))),
obp = sample(c(runif(46, .300, .400), rep(NA, 4))),
age = sample(22:38, 50, replace = TRUE)
)
# Detect missing data
missing_summary <- batters_missing %>%
summarize(
total_rows = n(),
hr_missing = sum(is.na(hr)),
avg_missing = sum(is.na(avg)),
obp_missing = sum(is.na(obp))
)
print(missing_summary)
# Visualize missingness pattern
library(naniar)
vis_miss(batters_missing) # Requires naniar package
# Imputation strategies
# 1. Remove rows with any missing data
complete_cases <- batters_missing %>%
filter(complete.cases(.))
# 2. Mean imputation (simple, but reduces variance)
batters_mean_impute <- batters_missing %>%
mutate(
hr = ifelse(is.na(hr), mean(hr, na.rm = TRUE), hr),
avg = ifelse(is.na(avg), mean(avg, na.rm = TRUE), avg),
obp = ifelse(is.na(obp), mean(obp, na.rm = TRUE), obp)
)
# 3. Median imputation (robust to outliers)
batters_median_impute <- batters_missing %>%
mutate(
hr = ifelse(is.na(hr), median(hr, na.rm = TRUE), hr)
)
# 4. Group-based imputation (by age group)
batters_group_impute <- batters_missing %>%
mutate(
age_group = cut(age, breaks = c(20, 25, 30, 35, 40), labels = c("Young", "Prime", "Veteran", "Older"))
) %>%
group_by(age_group) %>%
mutate(
hr = ifelse(is.na(hr), mean(hr, na.rm = TRUE), hr)
) %>%
ungroup() %>%
select(-age_group)
# 5. Forward fill (carry last observation forward - useful for cumulative stats)
batters_ffill <- batters_missing %>%
arrange(player) %>%
fill(hr, avg, obp, .direction = "down")
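The caveat on strategy 2 is worth seeing in numbers: mean imputation preserves the mean but shrinks the spread, because every filled-in value sits exactly at the center of the distribution.
# Compare the hr column before and after mean imputation
data.frame(
version = c("original", "mean-imputed"),
mean_hr = c(mean(batters_missing$hr, na.rm = TRUE), mean(batters_mean_impute$hr)),
sd_hr = c(sd(batters_missing$hr, na.rm = TRUE), sd(batters_mean_impute$hr))
)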
Python Example: Advanced Imputation
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# Simulate data with missing values
np.random.seed(321)
batters_missing_py = pd.DataFrame({
'player': [f'Player {i}' for i in range(1, 51)],
'hr': np.random.poisson(20, 50).astype(float),
'avg': np.random.uniform(0.230, 0.310, 50),
'obp': np.random.uniform(0.300, 0.400, 50),
'age': np.random.randint(22, 39, 50)
})
# Introduce missing values
batters_missing_py.loc[np.random.choice(50, 5, replace=False), 'hr'] = np.nan
batters_missing_py.loc[np.random.choice(50, 7, replace=False), 'avg'] = np.nan
batters_missing_py.loc[np.random.choice(50, 4, replace=False), 'obp'] = np.nan
# Check missing data
print(batters_missing_py.isnull().sum())
print(batters_missing_py.isnull().sum() / len(batters_missing_py) * 100) # Percentages
# 1. Drop rows with any missing data
complete_cases_py = batters_missing_py.dropna()
# 2. Drop columns with too many missing values
threshold = 0.2 # 20%
batters_dropped_cols = batters_missing_py.dropna(
axis=1,
thresh=int((1-threshold) * len(batters_missing_py))
)
# 3. Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
numeric_cols = ['hr', 'avg', 'obp', 'age']
batters_mean_imputed = batters_missing_py.copy()
batters_mean_imputed[numeric_cols] = imputer_mean.fit_transform(batters_missing_py[numeric_cols])
# 4. Median imputation
imputer_median = SimpleImputer(strategy='median')
batters_median_imputed = batters_missing_py.copy()
batters_median_imputed[numeric_cols] = imputer_median.fit_transform(batters_missing_py[numeric_cols])
# 5. KNN imputation (uses similar records)
imputer_knn = KNNImputer(n_neighbors=5)
batters_knn_imputed = batters_missing_py.copy()
batters_knn_imputed[numeric_cols] = imputer_knn.fit_transform(batters_missing_py[numeric_cols])
# 6. Forward fill
batters_ffill_py = batters_missing_py.ffill()  # fillna(method='ffill') is deprecated in modern pandas
# 7. Group-based imputation
batters_missing_py['age_group'] = pd.cut(
batters_missing_py['age'],
bins=[20, 25, 30, 35, 40],
labels=['Young', 'Prime', 'Veteran', 'Older']
)
batters_group_imputed = batters_missing_py.groupby('age_group').transform(
lambda x: x.fillna(x.mean()) if x.dtype != 'object' else x
)
batters_group_imputed['player'] = batters_missing_py['player']
print("\nOriginal missing counts:", batters_missing_py.isnull().sum().sum())
print("After KNN imputation:", batters_knn_imputed.isnull().sum().sum())
2.5.3 When to Exclude vs. Impute {#exclude-vs-impute}
Exclude (listwise deletion) when:
- Missing data is MCAR and you have abundant data
- The proportion of missing data is very small (<5%)
- Imputation might introduce more bias than deletion
Impute when:
- Missing data is MAR and related to other observed variables
- You have moderate missingness (5-20%) and need to preserve sample size
- The missing data follows a pattern you can model
Example decision framework:
# R: Assess missingness and decide
assess_missingness <- function(df, threshold = 0.05) {
missing_rates <- df %>%
summarize(across(everything(), ~sum(is.na(.)) / n())) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing_rate")
recommendations <- missing_rates %>%
mutate(
recommendation = case_when(
missing_rate == 0 ~ "No action needed",
missing_rate < threshold ~ "Consider listwise deletion",
missing_rate < 0.20 ~ "Imputation recommended",
TRUE ~ "Consider dropping variable or collecting more data"
)
)
return(recommendations)
}
recommendations <- assess_missingness(batters_missing, threshold = 0.05)
print(recommendations)
# Python: Assess and decide
def assess_missingness(df, threshold=0.05):
missing_rates = df.isnull().sum() / len(df)
recommendations = []
for col, rate in missing_rates.items():
if rate == 0:
rec = "No action needed"
elif rate < threshold:
rec = "Consider listwise deletion"
elif rate < 0.20:
rec = "Imputation recommended"
else:
rec = "Consider dropping variable"
recommendations.append({
'variable': col,
'missing_rate': rate,
'recommendation': rec
})
return pd.DataFrame(recommendations)
recommendations_py = assess_missingness(batters_missing_py, threshold=0.05)
print(recommendations_py)
Exercise 2.1: Advanced Filtering and Selection
Using the baseballr package (R) or pybaseball (Python), retrieve 2024 batting data and answer:
- How many players hit 30+ home runs with a strikeout rate below 20%?
- Which players had an OBP of at least .350 and stole 20+ bases?
- Create a new metric, "Three True Outcomes Rate" = (HR + BB + SO) / PA, and identify the top 10 players
R Solution Sketch:
library(baseballr)
library(tidyverse)
batters <- fg_batter_leaders(2024, 2024, qual = 300)
# 1.
q1 <- batters %>%
filter(HR >= 30, `K_percent` < 20) %>%
nrow()
# 2.
q2 <- batters %>%
filter(OBP >= .350, SB >= 20) %>%
select(Name, OBP, SB)
# 3.
q3 <- batters %>%
mutate(TTO_rate = (HR + BB + SO) / PA) %>%
arrange(desc(TTO_rate)) %>%
select(Name, HR, BB, SO, PA, TTO_rate) %>%
head(10)
Python Solution Sketch:
from pybaseball import batting_stats
batters_py = batting_stats(2024, qual=300)
# 1.
q1_py = batters_py[(batters_py['HR'] >= 30) & (batters_py['K%'] < 20)]
print(f"Count: {len(q1_py)}")
# 2.
q2_py = batters_py[(batters_py['OBP'] >= .350) & (batters_py['SB'] >= 20)][['Name', 'OBP', 'SB']]
# 3.
batters_py['TTO_rate'] = (batters_py['HR'] + batters_py['BB'] + batters_py['SO']) / batters_py['PA']
q3_py = batters_py.nlargest(10, 'TTO_rate')[['Name', 'HR', 'BB', 'SO', 'PA', 'TTO_rate']]
Exercise 2.2: Grouping and Team Analysis
Calculate team-level statistics from individual player data:
- Which team had the highest average OPS among qualified batters? (a starter sketch follows this list)
- What's the correlation between team home runs and team wins (you'll need to join with standings data)?
- Calculate each team's offensive balance: standard deviation of WAR among their top 5 hitters
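A starting point for part 1, reusing the batters data frame from Exercise 2.1 (column names follow the FanGraphs leaderboard; compute OPS as OBP + SLG if your source returns no OPS column):
team_ops <- batters %>%
group_by(Team) %>%
summarize(
avg_ops = mean(OBP + SLG), # OPS = OBP + SLG
n_qualified = n(),
.groups = "drop"
) %>%
arrange(desc(avg_ops))
print(team_ops)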
Exercise 2.3: Time Series Analysis
Create a 162-game simulated season for a player and:
- Calculate 15-game rolling averages for BA, OBP, and SLG (a rolling-window hint follows this list)
- Identify the longest streak above .300 BA
- Compare first-half vs. second-half performance
- Visualize performance trends over the season
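One way into part 1 is rolling sums from the zoo package. This sketch simulates a bare-bones game log and computes a true 15-game rolling batting average as rolling hits over rolling at-bats; the simulation parameters are arbitrary:
library(tidyverse)
library(zoo) # rollsum()
set.seed(7)
game_log <- data.frame(game = 1:162) %>%
mutate(
ab = rpois(n(), lambda = 4) + 1, # at-bats per game
hits = rbinom(n(), size = ab, prob = .270) # a .270 true-talent hitter
) %>%
mutate(
ba_roll15 = rollsum(hits, 15, fill = NA, align = "right") /
rollsum(ab, 15, fill = NA, align = "right")
)
tail(game_log)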
Exercise 2.4: Joins and Data Integration
Combine batting and pitching data:
- Identify two-way players (appear in both datasets); see the join sketch after this list
- For two-way players, calculate their combined WAR
- Compare offensive WAR vs. pitching WAR
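The core move for part 1 is an inner join on player name, which keeps only rows present in both tables. The inputs here are toy frames with made-up WAR values, for illustration only:
library(tidyverse)
batting_war <- data.frame(
Name = c("Shohei Ohtani", "Aaron Judge"),
WAR = c(6.0, 7.0) # illustrative numbers, not real stats
)
pitching_war <- data.frame(
Name = c("Shohei Ohtani", "Zack Wheeler"),
WAR = c(3.0, 5.5) # illustrative numbers, not real stats
)
two_way <- batting_war %>%
inner_join(pitching_war, by = "Name", suffix = c("_bat", "_pit")) %>%
mutate(combined_war = WAR_bat + WAR_pit)
print(two_way)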
Exercise 2.5: Missing Data Challenge
Using real-world data with missing Statcast metrics:
- Assess the extent and pattern of missing data
- Implement three different imputation strategies
- Compare the impact on correlation between exit velocity and hard-hit rate
- Justify which imputation method is most appropriate and why
This concludes Chapter 2. In the next chapter, we'll explore the rich ecosystem of baseball data sources, learning how to access FanGraphs, Baseball Reference, Statcast, and the Lahman database through R and Python packages.