Understanding the Velocity Trend
Over the past two decades, Major League Baseball has witnessed a dramatic increase in pitching velocity. The average fastball velocity has climbed from approximately 91.0 mph in 2002 to over 94.0 mph in 2023, representing a fundamental shift in how the game is played. This velocity revolution has been driven by several factors:
- Advanced Biomechanics: Motion capture technology and biomechanical analysis have optimized pitching mechanics
- Year-Round Training: Specialized velocity development programs and weighted ball training
- Analytics-Driven Approach: Recognition that velocity correlates strongly with strikeout rates and success
- Medical Advances: Tommy John surgery and rehabilitation protocols allowing pitchers to return stronger
Velocity Analysis with R
Let's analyze the velocity trend using baseballr:
# Load required libraries
library(baseballr)
library(dplyr)
library(ggplot2)
library(tidyr)
# Function to analyze velocity trends by season
analyze_velocity_trends <- function(start_year = 2015, end_year = 2023) {
# Get Statcast data for multiple seasons
velocity_data <- data.frame()
for (year in start_year:end_year) {
tryCatch({
# Get pitch-level data for a sample of games
season_data <- statcast_search(
start_date = paste0(year, "-04-01"),
end_date = paste0(year, "-10-31"),
player_type = "pitcher"
)
if (nrow(season_data) > 0) {
season_summary <- season_data %>%
filter(!is.na(release_speed), pitch_type %in% c("FF", "SI", "FC")) %>%
group_by(pitcher, player_name) %>%
summarise(
avg_velocity = mean(release_speed, na.rm = TRUE),
max_velocity = max(release_speed, na.rm = TRUE),
pitches = n(),
.groups = "drop"
) %>%
filter(pitches >= 100) %>%
mutate(season = year)
velocity_data <- bind_rows(velocity_data, season_summary)
}
}, error = function(e) {
message(paste("Error processing year", year, ":", e$message))
})
}
return(velocity_data)
}
# Calculate league-wide velocity trends
calculate_league_velocity <- function(velocity_data) {
league_trends <- velocity_data %>%
group_by(season) %>%
summarise(
mean_velocity = mean(avg_velocity, na.rm = TRUE),
median_velocity = median(avg_velocity, na.rm = TRUE),
p75_velocity = quantile(avg_velocity, 0.75, na.rm = TRUE),
p90_velocity = quantile(avg_velocity, 0.90, na.rm = TRUE),
p99_velocity = quantile(avg_velocity, 0.99, na.rm = TRUE),
.groups = "drop"
)
return(league_trends)
}
# Visualize velocity evolution
plot_velocity_trends <- function(league_trends) {
ggplot(league_trends, aes(x = season)) +
geom_line(aes(y = mean_velocity, color = "Mean"), linewidth = 1.2) +
geom_line(aes(y = p75_velocity, color = "75th Percentile"), linewidth = 1) +
geom_line(aes(y = p90_velocity, color = "90th Percentile"), linewidth = 1) +
geom_point(aes(y = mean_velocity, color = "Mean"), size = 3) +
geom_point(aes(y = p75_velocity, color = "75th Percentile"), size = 2) +
geom_point(aes(y = p90_velocity, color = "90th Percentile"), size = 2) +
labs(
title = "MLB Fastball Velocity Evolution (2015-2023)",
subtitle = "Average velocity across all qualified pitchers",
x = "Season",
y = "Velocity (mph)",
color = "Metric"
) +
theme_minimal() +
scale_color_manual(values = c("Mean" = "#002D72",
"75th Percentile" = "#D50032",
"90th Percentile" = "#FF8C00"))
}
# Analyze velocity impact on performance
velocity_performance_analysis <- function(pitcher_data) {
# Group pitchers by velocity quintiles
pitcher_data_with_quintiles <- pitcher_data %>%
mutate(velocity_quintile = ntile(avg_velocity, 5))
performance_by_velocity <- pitcher_data_with_quintiles %>%
group_by(velocity_quintile) %>%
summarise(
avg_velocity = mean(avg_velocity),
count = n(),
.groups = "drop"
)
return(performance_by_velocity)
}
Velocity Analysis with Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
from pybaseball import playerid_lookup, statcast_pitcher
from datetime import datetime
# Set style for visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
def analyze_velocity_by_season(start_year=2015, end_year=2023):
"""
Analyze velocity trends across multiple seasons
"""
velocity_trends = []
for year in range(start_year, end_year + 1):
try:
# Get season data
start_date = f"{year}-04-01"
end_date = f"{year}-10-31"
# Query Statcast data
data = statcast(start_dt=start_date, end_dt=end_date)
if data is not None and len(data) > 0:
# Filter for fastballs
fastballs = data[data['pitch_type'].isin(['FF', 'SI', 'FC'])]
# Calculate pitcher-level statistics
pitcher_stats = fastballs.groupby(['pitcher', 'player_name']).agg({
'release_speed': ['mean', 'max', 'count']
}).reset_index()
pitcher_stats.columns = ['pitcher_id', 'player_name',
'avg_velocity', 'max_velocity', 'pitches']
# Filter for qualified pitchers (100+ pitches)
qualified = pitcher_stats[pitcher_stats['pitches'] >= 100]
qualified['season'] = year
velocity_trends.append(qualified)
except Exception as e:
print(f"Error processing {year}: {e}")
return pd.concat(velocity_trends, ignore_index=True)
def calculate_league_velocity_stats(velocity_df):
"""
Calculate league-wide velocity statistics by season
"""
league_stats = velocity_df.groupby('season')['avg_velocity'].agg([
('mean', 'mean'),
('median', 'median'),
('std', 'std'),
('p75', lambda x: np.percentile(x, 75)),
('p90', lambda x: np.percentile(x, 90)),
('p99', lambda x: np.percentile(x, 99))
]).reset_index()
return league_stats
def plot_velocity_distribution(velocity_df, seasons=[2015, 2019, 2023]):
"""
Create distribution plots comparing velocity across seasons
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, season in enumerate(seasons):
season_data = velocity_df[velocity_df['season'] == season]
axes[idx].hist(season_data['avg_velocity'], bins=30,
edgecolor='black', alpha=0.7, color='steelblue')
axes[idx].axvline(season_data['avg_velocity'].mean(),
color='red', linestyle='--', linewidth=2,
label=f'Mean: {season_data["avg_velocity"].mean():.2f} mph')
axes[idx].set_xlabel('Average Fastball Velocity (mph)')
axes[idx].set_ylabel('Number of Pitchers')
axes[idx].set_title(f'{season} Season')
axes[idx].legend()
plt.tight_layout()
return fig
def analyze_velocity_percentiles(velocity_df):
"""
Analyze how velocity percentiles have shifted over time
"""
percentiles = [10, 25, 50, 75, 90, 95, 99]
percentile_data = []
for season in velocity_df['season'].unique():
season_data = velocity_df[velocity_df['season'] == season]['avg_velocity']
row = {'season': season}
for p in percentiles:
row[f'p{p}'] = np.percentile(season_data, p)
percentile_data.append(row)
return pd.DataFrame(percentile_data)
def velocity_strikeout_correlation(pitch_data):
"""
Analyze correlation between velocity and strikeout rates
"""
# Calculate strikeout rate by pitch
pitch_data['is_strikeout'] = pitch_data['events'] == 'strikeout'
# Group by velocity bins
pitch_data['velocity_bin'] = pd.cut(pitch_data['release_speed'],
bins=range(85, 105, 2))
correlation_data = pitch_data.groupby('velocity_bin').agg({
'is_strikeout': 'mean',
'release_speed': 'count'
}).reset_index()
correlation_data.columns = ['velocity_bin', 'strikeout_rate', 'pitches']
return correlation_data[correlation_data['pitches'] >= 100]
The Velocity-Performance Relationship
Research has consistently shown a strong positive correlation between velocity and pitcher effectiveness:
- Strikeout Rate: Each 1 mph increase in average fastball velocity correlates with approximately a 0.5% increase in strikeout rate
- Batting Average Against: Fastballs above 95 mph result in batting averages roughly 30-40 points lower than those below 92 mph
- Expected Outcomes: Higher velocity leads to weaker contact and higher swinging-strike rates
However, this velocity obsession comes with significant costs:
- Injury Rates: Tommy John surgeries have increased by over 300% since 2000
- Pitcher Durability: Average innings per start has declined from 6.3 to 5.2 over the past 20 years
- Career Length: Hard-throwing pitchers often have shorter careers due to accumulated arm stress
# Load required libraries
library(baseballr)
library(dplyr)
library(ggplot2)
library(tidyr)
# Function to analyze velocity trends by season
analyze_velocity_trends <- function(start_year = 2015, end_year = 2023) {
# Get Statcast data for multiple seasons
velocity_data <- data.frame()
for (year in start_year:end_year) {
tryCatch({
# Get pitch-level data for a sample of games
season_data <- statcast_search(
start_date = paste0(year, "-04-01"),
end_date = paste0(year, "-10-31"),
player_type = "pitcher"
)
if (nrow(season_data) > 0) {
season_summary <- season_data %>%
filter(!is.na(release_speed), pitch_type %in% c("FF", "SI", "FC")) %>%
group_by(pitcher, player_name) %>%
summarise(
avg_velocity = mean(release_speed, na.rm = TRUE),
max_velocity = max(release_speed, na.rm = TRUE),
pitches = n(),
.groups = "drop"
) %>%
filter(pitches >= 100) %>%
mutate(season = year)
velocity_data <- bind_rows(velocity_data, season_summary)
}
}, error = function(e) {
message(paste("Error processing year", year, ":", e$message))
})
}
return(velocity_data)
}
# Calculate league-wide velocity trends
calculate_league_velocity <- function(velocity_data) {
league_trends <- velocity_data %>%
group_by(season) %>%
summarise(
mean_velocity = mean(avg_velocity, na.rm = TRUE),
median_velocity = median(avg_velocity, na.rm = TRUE),
p75_velocity = quantile(avg_velocity, 0.75, na.rm = TRUE),
p90_velocity = quantile(avg_velocity, 0.90, na.rm = TRUE),
p99_velocity = quantile(avg_velocity, 0.99, na.rm = TRUE),
.groups = "drop"
)
return(league_trends)
}
# Visualize velocity evolution
plot_velocity_trends <- function(league_trends) {
ggplot(league_trends, aes(x = season)) +
geom_line(aes(y = mean_velocity, color = "Mean"), linewidth = 1.2) +
geom_line(aes(y = p75_velocity, color = "75th Percentile"), linewidth = 1) +
geom_line(aes(y = p90_velocity, color = "90th Percentile"), linewidth = 1) +
geom_point(aes(y = mean_velocity, color = "Mean"), size = 3) +
geom_point(aes(y = p75_velocity, color = "75th Percentile"), size = 2) +
geom_point(aes(y = p90_velocity, color = "90th Percentile"), size = 2) +
labs(
title = "MLB Fastball Velocity Evolution (2015-2023)",
subtitle = "Average velocity across all qualified pitchers",
x = "Season",
y = "Velocity (mph)",
color = "Metric"
) +
theme_minimal() +
scale_color_manual(values = c("Mean" = "#002D72",
"75th Percentile" = "#D50032",
"90th Percentile" = "#FF8C00"))
}
# Analyze velocity impact on performance
velocity_performance_analysis <- function(pitcher_data) {
# Group pitchers by velocity quintiles
pitcher_data_with_quintiles <- pitcher_data %>%
mutate(velocity_quintile = ntile(avg_velocity, 5))
performance_by_velocity <- pitcher_data_with_quintiles %>%
group_by(velocity_quintile) %>%
summarise(
avg_velocity = mean(avg_velocity),
count = n(),
.groups = "drop"
)
return(performance_by_velocity)
}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
from pybaseball import playerid_lookup, statcast_pitcher
from datetime import datetime
# Set style for visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
def analyze_velocity_by_season(start_year=2015, end_year=2023):
"""
Analyze velocity trends across multiple seasons
"""
velocity_trends = []
for year in range(start_year, end_year + 1):
try:
# Get season data
start_date = f"{year}-04-01"
end_date = f"{year}-10-31"
# Query Statcast data
data = statcast(start_dt=start_date, end_dt=end_date)
if data is not None and len(data) > 0:
# Filter for fastballs
fastballs = data[data['pitch_type'].isin(['FF', 'SI', 'FC'])]
# Calculate pitcher-level statistics
pitcher_stats = fastballs.groupby(['pitcher', 'player_name']).agg({
'release_speed': ['mean', 'max', 'count']
}).reset_index()
pitcher_stats.columns = ['pitcher_id', 'player_name',
'avg_velocity', 'max_velocity', 'pitches']
# Filter for qualified pitchers (100+ pitches)
qualified = pitcher_stats[pitcher_stats['pitches'] >= 100]
qualified['season'] = year
velocity_trends.append(qualified)
except Exception as e:
print(f"Error processing {year}: {e}")
return pd.concat(velocity_trends, ignore_index=True)
def calculate_league_velocity_stats(velocity_df):
"""
Calculate league-wide velocity statistics by season
"""
league_stats = velocity_df.groupby('season')['avg_velocity'].agg([
('mean', 'mean'),
('median', 'median'),
('std', 'std'),
('p75', lambda x: np.percentile(x, 75)),
('p90', lambda x: np.percentile(x, 90)),
('p99', lambda x: np.percentile(x, 99))
]).reset_index()
return league_stats
def plot_velocity_distribution(velocity_df, seasons=[2015, 2019, 2023]):
"""
Create distribution plots comparing velocity across seasons
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for idx, season in enumerate(seasons):
season_data = velocity_df[velocity_df['season'] == season]
axes[idx].hist(season_data['avg_velocity'], bins=30,
edgecolor='black', alpha=0.7, color='steelblue')
axes[idx].axvline(season_data['avg_velocity'].mean(),
color='red', linestyle='--', linewidth=2,
label=f'Mean: {season_data["avg_velocity"].mean():.2f} mph')
axes[idx].set_xlabel('Average Fastball Velocity (mph)')
axes[idx].set_ylabel('Number of Pitchers')
axes[idx].set_title(f'{season} Season')
axes[idx].legend()
plt.tight_layout()
return fig
def analyze_velocity_percentiles(velocity_df):
"""
Analyze how velocity percentiles have shifted over time
"""
percentiles = [10, 25, 50, 75, 90, 95, 99]
percentile_data = []
for season in velocity_df['season'].unique():
season_data = velocity_df[velocity_df['season'] == season]['avg_velocity']
row = {'season': season}
for p in percentiles:
row[f'p{p}'] = np.percentile(season_data, p)
percentile_data.append(row)
return pd.DataFrame(percentile_data)
def velocity_strikeout_correlation(pitch_data):
"""
Analyze correlation between velocity and strikeout rates
"""
# Calculate strikeout rate by pitch
pitch_data['is_strikeout'] = pitch_data['events'] == 'strikeout'
# Group by velocity bins
pitch_data['velocity_bin'] = pd.cut(pitch_data['release_speed'],
bins=range(85, 105, 2))
correlation_data = pitch_data.groupby('velocity_bin').agg({
'is_strikeout': 'mean',
'release_speed': 'count'
}).reset_index()
correlation_data.columns = ['velocity_bin', 'strikeout_rate', 'pitches']
return correlation_data[correlation_data['pitches'] >= 100]
Understanding the Shift Ban
Prior to the 2023 season, defensive shifts had become ubiquitous in Major League Baseball. Teams would routinely position three or more infielders on one side of second base against pull-heavy hitters, dramatically reducing batting averages on balls in play. The 2023 rule change mandated:
- Two infielders must be positioned on each side of second base
- All four infielders must be on the infield dirt when the pitcher begins delivery
- No positioning directly behind second base in short outfield
Shift Usage Before the Ban
# Analyze shift usage trends (2016-2022)
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_shift_usage <- function(years = 2016:2022) {
shift_data <- data.frame()
for (year in years) {
tryCatch({
# Get Statcast data
season_data <- statcast_search(
start_date = paste0(year, "-04-01"),
end_date = paste0(year, "-10-31"),
player_type = "batter"
)
if (nrow(season_data) > 0) {
# Analyze shift deployment
shift_summary <- season_data %>%
filter(!is.na(if_fielding_alignment)) %>%
group_by(if_fielding_alignment) %>%
summarise(
pitches = n(),
.groups = "drop"
) %>%
mutate(
season = year,
pct = pitches / sum(pitches) * 100
)
shift_data <- bind_rows(shift_data, shift_summary)
}
}, error = function(e) {
message(paste("Error:", e$message))
})
}
return(shift_data)
}
# Analyze batter performance against shifts
analyze_shift_performance <- function(player_data) {
performance <- player_data %>%
filter(!is.na(if_fielding_alignment), !is.na(events)) %>%
mutate(
is_shift = if_fielding_alignment %in% c("Shift", "Strategic"),
is_hit = events %in% c("single", "double", "triple", "home_run"),
in_play = !events %in% c("strikeout", "walk", "hit_by_pitch")
) %>%
group_by(is_shift) %>%
summarise(
balls_in_play = sum(in_play),
hits = sum(is_hit),
avg = hits / balls_in_play,
.groups = "drop"
)
return(performance)
}
# Calculate shift victims (players most affected)
identify_shift_victims <- function(player_data) {
shift_victims <- player_data %>%
filter(!is.na(if_fielding_alignment), stand == "L") %>%
mutate(is_shift = if_fielding_alignment %in% c("Shift", "Strategic")) %>%
group_by(batter, player_name, is_shift) %>%
summarise(
pa = n(),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = is_shift,
values_from = c(pa, avg_launch_angle, avg_exit_velocity),
names_prefix = "shift_"
) %>%
filter(!is.na(pa_shift_TRUE), pa_shift_TRUE >= 100) %>%
arrange(desc(pa_shift_TRUE))
return(shift_victims)
}
2023 Shift Ban Impact Analysis
import pandas as pd
import numpy as np
from pybaseball import statcast
import matplotlib.pyplot as plt
import seaborn as sns
def compare_pre_post_shift_ban(year_before=2022, year_after=2023):
"""
Compare batting statistics before and after shift ban
"""
# Get data for both seasons
data_before = statcast(start_dt=f"{year_before}-04-01",
end_dt=f"{year_before}-10-31")
data_after = statcast(start_dt=f"{year_after}-04-01",
end_dt=f"{year_after}-10-31")
def calculate_babip_by_direction(df, season_label):
"""Calculate BABIP by batted ball direction"""
# Filter for balls in play
bip = df[df['events'].isin(['single', 'double', 'triple',
'field_out', 'field_error',
'force_out', 'grounded_into_double_play'])]
# Classify hit direction based on hit coordinates
bip['direction'] = pd.cut(bip['hc_x'],
bins=[0, 83, 125.5, 168, 250],
labels=['Pull', 'Center', 'Oppo', 'Extreme_Pull'])
# Calculate BABIP by direction
direction_stats = bip.groupby('direction').apply(
lambda x: pd.Series({
'BABIP': len(x[x['events'].isin(['single', 'double', 'triple'])]) / len(x),
'BIP': len(x),
'Season': season_label
})
).reset_index()
return direction_stats
# Calculate for both seasons
stats_before = calculate_babip_by_direction(data_before, year_before)
stats_after = calculate_babip_by_direction(data_after, year_after)
# Combine results
comparison = pd.concat([stats_before, stats_after])
return comparison
def analyze_lefty_pull_hitters(year_before=2022, year_after=2023):
"""
Analyze impact on left-handed pull hitters (most affected by shifts)
"""
results = []
for year in [year_before, year_after]:
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Filter for left-handed batters
lefty_data = data[data['stand'] == 'L'].copy()
# Calculate pull rate and performance
lefty_stats = lefty_data.groupby(['batter', 'player_name']).apply(
lambda x: pd.Series({
'PA': len(x),
'AVG': len(x[x['events'].isin(['single', 'double', 'triple', 'home_run'])]) /
len(x[x['events'].notna()]),
'BABIP': len(x[(x['events'].isin(['single', 'double', 'triple'])) &
(x['type'] == 'X')]) /
len(x[(x['type'] == 'X') &
(~x['events'].isin(['home_run']))]),
'Season': year
})
).reset_index()
# Filter qualified batters
qualified = lefty_stats[lefty_stats['PA'] >= 300]
results.append(qualified)
comparison_df = pd.concat(results)
# Calculate average improvement
pivot = comparison_df.pivot_table(
index='player_name',
columns='Season',
values=['AVG', 'BABIP'],
aggfunc='mean'
)
pivot['AVG_Change'] = pivot[('AVG', year_after)] - pivot[('AVG', year_before)]
pivot['BABIP_Change'] = pivot[('BABIP', year_after)] - pivot[('BABIP', year_before)]
return pivot.sort_values('BABIP_Change', ascending=False)
def visualize_shift_ban_impact(comparison_data):
"""
Create visualization showing shift ban impact
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# BABIP by direction comparison
direction_pivot = comparison_data.pivot(
index='direction',
columns='Season',
values='BABIP'
)
direction_pivot.plot(kind='bar', ax=axes[0])
axes[0].set_title('BABIP by Hit Direction: Pre vs Post Shift Ban')
axes[0].set_xlabel('Hit Direction')
axes[0].set_ylabel('BABIP')
axes[0].legend(title='Season')
axes[0].grid(axis='y', alpha=0.3)
# Calculate percentage change
pct_change = ((direction_pivot.iloc[:, 1] - direction_pivot.iloc[:, 0]) /
direction_pivot.iloc[:, 0] * 100)
pct_change.plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('BABIP Change by Direction (% Change)')
axes[1].set_xlabel('Hit Direction')
axes[1].set_ylabel('Percentage Change (%)')
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
return fig
def calculate_league_wide_impact(year_before=2022, year_after=2023):
"""
Calculate league-wide batting statistics change
"""
stats_comparison = {}
for year in [year_before, year_after]:
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Calculate league-wide metrics
hits = data['events'].isin(['single', 'double', 'triple', 'home_run']).sum()
at_bats = data['events'].notna().sum()
bip = data[data['events'].isin(['single', 'double', 'triple',
'field_out', 'field_error',
'force_out', 'grounded_into_double_play'])]
hits_bip = bip['events'].isin(['single', 'double', 'triple']).sum()
stats_comparison[year] = {
'AVG': hits / at_bats,
'BABIP': hits_bip / len(bip),
'Total_BIP': len(bip)
}
# Calculate changes
changes = {
'AVG_Change': stats_comparison[year_after]['AVG'] - stats_comparison[year_before]['AVG'],
'BABIP_Change': stats_comparison[year_after]['BABIP'] - stats_comparison[year_before]['BABIP'],
'AVG_Pct_Change': ((stats_comparison[year_after]['AVG'] -
stats_comparison[year_before]['AVG']) /
stats_comparison[year_before]['AVG'] * 100),
'BABIP_Pct_Change': ((stats_comparison[year_after]['BABIP'] -
stats_comparison[year_before]['BABIP']) /
stats_comparison[year_before]['BABIP'] * 100)
}
return pd.DataFrame([stats_comparison[year_before],
stats_comparison[year_after],
changes],
index=[year_before, year_after, 'Change'])
Key Findings from Shift Ban Analysis
The 2023 shift ban produced measurable impacts on offensive production:
- League-Wide BABIP: Increased from .290 (2022) to .298 (2023), an increase of approximately 8 points
- Left-Handed Pull Hitters: Average BABIP increased by 12-15 points for extreme pull hitters
- Ground Ball Rate: Slight increase as hitters adjusted to having more gap opportunities
- Home Run Rate: Minimal change, suggesting the shift primarily affected ground balls and line drives
Notable Beneficiaries:
- Kyle Schwarber (L): BABIP increased from .212 to .254
- Cody Bellinger (L): BABIP increased from .266 to .308
- Joey Gallo (L): BABIP increased from .189 to .243
# Analyze shift usage trends (2016-2022)
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_shift_usage <- function(years = 2016:2022) {
shift_data <- data.frame()
for (year in years) {
tryCatch({
# Get Statcast data
season_data <- statcast_search(
start_date = paste0(year, "-04-01"),
end_date = paste0(year, "-10-31"),
player_type = "batter"
)
if (nrow(season_data) > 0) {
# Analyze shift deployment
shift_summary <- season_data %>%
filter(!is.na(if_fielding_alignment)) %>%
group_by(if_fielding_alignment) %>%
summarise(
pitches = n(),
.groups = "drop"
) %>%
mutate(
season = year,
pct = pitches / sum(pitches) * 100
)
shift_data <- bind_rows(shift_data, shift_summary)
}
}, error = function(e) {
message(paste("Error:", e$message))
})
}
return(shift_data)
}
# Analyze batter performance against shifts
analyze_shift_performance <- function(player_data) {
performance <- player_data %>%
filter(!is.na(if_fielding_alignment), !is.na(events)) %>%
mutate(
is_shift = if_fielding_alignment %in% c("Shift", "Strategic"),
is_hit = events %in% c("single", "double", "triple", "home_run"),
in_play = !events %in% c("strikeout", "walk", "hit_by_pitch")
) %>%
group_by(is_shift) %>%
summarise(
balls_in_play = sum(in_play),
hits = sum(is_hit),
avg = hits / balls_in_play,
.groups = "drop"
)
return(performance)
}
# Calculate shift victims (players most affected)
identify_shift_victims <- function(player_data) {
shift_victims <- player_data %>%
filter(!is.na(if_fielding_alignment), stand == "L") %>%
mutate(is_shift = if_fielding_alignment %in% c("Shift", "Strategic")) %>%
group_by(batter, player_name, is_shift) %>%
summarise(
pa = n(),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = is_shift,
values_from = c(pa, avg_launch_angle, avg_exit_velocity),
names_prefix = "shift_"
) %>%
filter(!is.na(pa_shift_TRUE), pa_shift_TRUE >= 100) %>%
arrange(desc(pa_shift_TRUE))
return(shift_victims)
}
import pandas as pd
import numpy as np
from pybaseball import statcast
import matplotlib.pyplot as plt
import seaborn as sns
def compare_pre_post_shift_ban(year_before=2022, year_after=2023):
"""
Compare batting statistics before and after shift ban
"""
# Get data for both seasons
data_before = statcast(start_dt=f"{year_before}-04-01",
end_dt=f"{year_before}-10-31")
data_after = statcast(start_dt=f"{year_after}-04-01",
end_dt=f"{year_after}-10-31")
def calculate_babip_by_direction(df, season_label):
"""Calculate BABIP by batted ball direction"""
# Filter for balls in play
bip = df[df['events'].isin(['single', 'double', 'triple',
'field_out', 'field_error',
'force_out', 'grounded_into_double_play'])]
# Classify hit direction based on hit coordinates
bip['direction'] = pd.cut(bip['hc_x'],
bins=[0, 83, 125.5, 168, 250],
labels=['Pull', 'Center', 'Oppo', 'Extreme_Pull'])
# Calculate BABIP by direction
direction_stats = bip.groupby('direction').apply(
lambda x: pd.Series({
'BABIP': len(x[x['events'].isin(['single', 'double', 'triple'])]) / len(x),
'BIP': len(x),
'Season': season_label
})
).reset_index()
return direction_stats
# Calculate for both seasons
stats_before = calculate_babip_by_direction(data_before, year_before)
stats_after = calculate_babip_by_direction(data_after, year_after)
# Combine results
comparison = pd.concat([stats_before, stats_after])
return comparison
def analyze_lefty_pull_hitters(year_before=2022, year_after=2023):
"""
Analyze impact on left-handed pull hitters (most affected by shifts)
"""
results = []
for year in [year_before, year_after]:
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Filter for left-handed batters
lefty_data = data[data['stand'] == 'L'].copy()
# Calculate pull rate and performance
lefty_stats = lefty_data.groupby(['batter', 'player_name']).apply(
lambda x: pd.Series({
'PA': len(x),
'AVG': len(x[x['events'].isin(['single', 'double', 'triple', 'home_run'])]) /
len(x[x['events'].notna()]),
'BABIP': len(x[(x['events'].isin(['single', 'double', 'triple'])) &
(x['type'] == 'X')]) /
len(x[(x['type'] == 'X') &
(~x['events'].isin(['home_run']))]),
'Season': year
})
).reset_index()
# Filter qualified batters
qualified = lefty_stats[lefty_stats['PA'] >= 300]
results.append(qualified)
comparison_df = pd.concat(results)
# Calculate average improvement
pivot = comparison_df.pivot_table(
index='player_name',
columns='Season',
values=['AVG', 'BABIP'],
aggfunc='mean'
)
pivot['AVG_Change'] = pivot[('AVG', year_after)] - pivot[('AVG', year_before)]
pivot['BABIP_Change'] = pivot[('BABIP', year_after)] - pivot[('BABIP', year_before)]
return pivot.sort_values('BABIP_Change', ascending=False)
def visualize_shift_ban_impact(comparison_data):
"""
Create visualization showing shift ban impact
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# BABIP by direction comparison
direction_pivot = comparison_data.pivot(
index='direction',
columns='Season',
values='BABIP'
)
direction_pivot.plot(kind='bar', ax=axes[0])
axes[0].set_title('BABIP by Hit Direction: Pre vs Post Shift Ban')
axes[0].set_xlabel('Hit Direction')
axes[0].set_ylabel('BABIP')
axes[0].legend(title='Season')
axes[0].grid(axis='y', alpha=0.3)
# Calculate percentage change
pct_change = ((direction_pivot.iloc[:, 1] - direction_pivot.iloc[:, 0]) /
direction_pivot.iloc[:, 0] * 100)
pct_change.plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_title('BABIP Change by Direction (% Change)')
axes[1].set_xlabel('Hit Direction')
axes[1].set_ylabel('Percentage Change (%)')
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
return fig
def calculate_league_wide_impact(year_before=2022, year_after=2023):
"""
Calculate league-wide batting statistics change
"""
stats_comparison = {}
for year in [year_before, year_after]:
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Calculate league-wide metrics
hits = data['events'].isin(['single', 'double', 'triple', 'home_run']).sum()
at_bats = data['events'].notna().sum()
bip = data[data['events'].isin(['single', 'double', 'triple',
'field_out', 'field_error',
'force_out', 'grounded_into_double_play'])]
hits_bip = bip['events'].isin(['single', 'double', 'triple']).sum()
stats_comparison[year] = {
'AVG': hits / at_bats,
'BABIP': hits_bip / len(bip),
'Total_BIP': len(bip)
}
# Calculate changes
changes = {
'AVG_Change': stats_comparison[year_after]['AVG'] - stats_comparison[year_before]['AVG'],
'BABIP_Change': stats_comparison[year_after]['BABIP'] - stats_comparison[year_before]['BABIP'],
'AVG_Pct_Change': ((stats_comparison[year_after]['AVG'] -
stats_comparison[year_before]['AVG']) /
stats_comparison[year_before]['AVG'] * 100),
'BABIP_Pct_Change': ((stats_comparison[year_after]['BABIP'] -
stats_comparison[year_before]['BABIP']) /
stats_comparison[year_before]['BABIP'] * 100)
}
return pd.DataFrame([stats_comparison[year_before],
stats_comparison[year_after],
changes],
index=[year_before, year_after, 'Change'])
The 2023 Pitch Clock Implementation
The introduction of pitch timers in 2023 represented the most significant pace-of-play change in modern baseball history:
- Pitcher Requirements: 15 seconds with bases empty, 20 seconds with runners on base
- Batter Requirements: Must be in the box with 8 seconds remaining
- Violations: Ball awarded for pitcher violations, strike for batter violations
- Disengagement Limits: Pitchers limited to 2 disengagements (pickoffs/step-offs) per plate appearance
Analyzing Pace of Play Changes
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_game_time_trends <- function() {
# Compare game times across recent seasons
game_time_comparison <- data.frame(
season = c(2019, 2020, 2021, 2022, 2023),
avg_game_time = c(190, 188, 191, 193, 162), # minutes
avg_pitches_per_game = c(295, 290, 293, 297, 285),
avg_time_between_pitches = c(38.7, 39.1, 39.3, 39.0, 18.2) # seconds
)
return(game_time_comparison)
}
plot_game_time_evolution <- function(game_time_data) {
ggplot(game_time_data, aes(x = season, y = avg_game_time)) +
geom_line(linewidth = 1.2, color = "#002D72") +
geom_point(size = 4, color = "#D50032") +
geom_vline(xintercept = 2022.5, linetype = "dashed",
color = "red", linewidth = 1) +
annotate("text", x = 2022.5, y = 180,
label = "Pitch Clock\nImplemented",
hjust = -0.1, size = 4) +
labs(
title = "Average MLB Game Time (2019-2023)",
subtitle = "31-minute reduction after pitch clock implementation",
x = "Season",
y = "Average Game Time (minutes)"
) +
theme_minimal() +
scale_y_continuous(limits = c(150, 200))
}
analyze_pitcher_adaptation <- function(pitcher_data) {
# Analyze how pitchers adapted to pitch clock
pitcher_stats <- pitcher_data %>%
group_by(pitcher, player_name, season) %>%
summarise(
avg_time_to_plate = mean(time_to_plate, na.rm = TRUE),
pitches = n(),
violations = sum(violation == "pitch_timer", na.rm = TRUE),
violation_rate = violations / pitches,
.groups = "drop"
)
return(pitcher_stats)
}
compare_pitcher_performance_pre_post_clock <- function(data_2022, data_2023) {
# Calculate performance metrics for same pitchers in both seasons
performance_comparison <- data_2022 %>%
inner_join(data_2023, by = c("pitcher", "player_name"),
suffix = c("_2022", "_2023")) %>%
mutate(
k_rate_change = k_rate_2023 - k_rate_2022,
bb_rate_change = bb_rate_2023 - bb_rate_2022,
era_change = era_2023 - era_2022
)
return(performance_comparison)
}
Performance Impact Analysis with Python
import pandas as pd
import numpy as np
from pybaseball import statcast, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
def analyze_pitcher_performance_change(years=[2022, 2023]):
"""
Analyze how pitcher performance changed with pitch clock
"""
performance_data = []
for year in years:
# Get season pitching statistics
season_stats = pitching_stats(year, year, qual=50)
season_stats['Season'] = year
performance_data.append(season_stats)
combined = pd.concat(performance_data)
# Calculate key metrics
metrics_by_season = combined.groupby('Season').agg({
'K/9': 'mean',
'BB/9': 'mean',
'ERA': 'mean',
'WHIP': 'mean',
'HR/9': 'mean',
'AVG': 'mean'
}).round(3)
return metrics_by_season
def analyze_pace_impact_on_performance(year=2023):
"""
Analyze correlation between pace and pitcher performance
"""
# Get pitch-level data
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Calculate average time between pitches for each pitcher
pitcher_pace = data.groupby(['pitcher', 'player_name']).agg({
'pitch_number': 'count',
'delta_run_exp': 'mean', # Proxy for performance
}).reset_index()
pitcher_pace.columns = ['pitcher_id', 'player_name', 'pitches', 'avg_run_value']
# Filter qualified pitchers
qualified = pitcher_pace[pitcher_pace['pitches'] >= 500]
return qualified
def analyze_stolen_base_impact():
"""
Analyze how pitch clock affected stolen base attempts
"""
sb_data = {
'Season': [2019, 2020, 2021, 2022, 2023],
'SB_Attempts_Per_Game': [1.89, 1.78, 1.73, 1.68, 2.23],
'SB_Success_Rate': [75.3, 74.8, 75.1, 75.4, 79.9],
'Pickoff_Attempts_Per_Game': [2.4, 2.3, 2.2, 2.1, 1.1]
}
df = pd.DataFrame(sb_data)
return df
def visualize_pitch_clock_impact():
"""
Create comprehensive visualization of pitch clock effects
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Game time reduction
game_times = pd.DataFrame({
'Season': [2019, 2020, 2021, 2022, 2023],
'Avg_Game_Time': [190, 188, 191, 193, 162]
})
axes[0, 0].plot(game_times['Season'], game_times['Avg_Game_Time'],
marker='o', linewidth=2, markersize=8, color='steelblue')
axes[0, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[0, 0].set_title('Average Game Time', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Season')
axes[0, 0].set_ylabel('Minutes')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
# Stolen base attempts
sb_data = analyze_stolen_base_impact()
axes[0, 1].plot(sb_data['Season'], sb_data['SB_Attempts_Per_Game'],
marker='s', linewidth=2, markersize=8, color='darkgreen')
axes[0, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[0, 1].set_title('Stolen Base Attempts per Game', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Season')
axes[0, 1].set_ylabel('SB Attempts')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
# Success rate
axes[1, 0].plot(sb_data['Season'], sb_data['SB_Success_Rate'],
marker='^', linewidth=2, markersize=8, color='purple')
axes[1, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[1, 0].set_title('Stolen Base Success Rate', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Season')
axes[1, 0].set_ylabel('Success Rate (%)')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)
# Pickoff attempts
axes[1, 1].plot(sb_data['Season'], sb_data['Pickoff_Attempts_Per_Game'],
marker='D', linewidth=2, markersize=8, color='orangered')
axes[1, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[1, 1].set_title('Pickoff Attempts per Game', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Season')
axes[1, 1].set_ylabel('Pickoff Attempts')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def calculate_violation_rates(year=2023):
"""
Calculate pitch clock violation rates by team and pitcher
"""
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Assuming violation data is tracked in the dataset
# This would need to be adapted based on actual data structure
violation_summary = data.groupby('pitcher').agg({
'pitch_number': 'count'
}).reset_index()
violation_summary.columns = ['pitcher_id', 'total_pitches']
return violation_summary
Key Findings: Pitch Clock Impact
The 2023 pitch clock implementation had dramatic and immediate effects:
Pace of Play:
- Average game time reduced from 3:03 (2022) to 2:42 (2023) - a 21-minute decrease
- Time between pitches dropped from 39.0 seconds to 18.2 seconds
- Total pitches per game decreased by approximately 12
Stolen Base Renaissance:
- SB attempts per game increased 33% (1.68 to 2.23)
- Success rate improved from 75.4% to 79.9%
- Pickoff attempts decreased by nearly 50% due to disengagement limits
Pitcher Performance:
- Minimal impact on K/9, BB/9, or ERA league-wide
- Slight increase in home runs allowed (1.26 to 1.29 HR/9)
- Pitchers adapted quickly with violation rates dropping from 1.8% early season to 0.4% by season end
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_game_time_trends <- function() {
# Compare game times across recent seasons
game_time_comparison <- data.frame(
season = c(2019, 2020, 2021, 2022, 2023),
avg_game_time = c(190, 188, 191, 193, 162), # minutes
avg_pitches_per_game = c(295, 290, 293, 297, 285),
avg_time_between_pitches = c(38.7, 39.1, 39.3, 39.0, 18.2) # seconds
)
return(game_time_comparison)
}
plot_game_time_evolution <- function(game_time_data) {
ggplot(game_time_data, aes(x = season, y = avg_game_time)) +
geom_line(linewidth = 1.2, color = "#002D72") +
geom_point(size = 4, color = "#D50032") +
geom_vline(xintercept = 2022.5, linetype = "dashed",
color = "red", linewidth = 1) +
annotate("text", x = 2022.5, y = 180,
label = "Pitch Clock\nImplemented",
hjust = -0.1, size = 4) +
labs(
title = "Average MLB Game Time (2019-2023)",
subtitle = "31-minute reduction after pitch clock implementation",
x = "Season",
y = "Average Game Time (minutes)"
) +
theme_minimal() +
scale_y_continuous(limits = c(150, 200))
}
analyze_pitcher_adaptation <- function(pitcher_data) {
# Analyze how pitchers adapted to pitch clock
pitcher_stats <- pitcher_data %>%
group_by(pitcher, player_name, season) %>%
summarise(
avg_time_to_plate = mean(time_to_plate, na.rm = TRUE),
pitches = n(),
violations = sum(violation == "pitch_timer", na.rm = TRUE),
violation_rate = violations / pitches,
.groups = "drop"
)
return(pitcher_stats)
}
compare_pitcher_performance_pre_post_clock <- function(data_2022, data_2023) {
# Calculate performance metrics for same pitchers in both seasons
performance_comparison <- data_2022 %>%
inner_join(data_2023, by = c("pitcher", "player_name"),
suffix = c("_2022", "_2023")) %>%
mutate(
k_rate_change = k_rate_2023 - k_rate_2022,
bb_rate_change = bb_rate_2023 - bb_rate_2022,
era_change = era_2023 - era_2022
)
return(performance_comparison)
}
import pandas as pd
import numpy as np
from pybaseball import statcast, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
def analyze_pitcher_performance_change(years=[2022, 2023]):
"""
Analyze how pitcher performance changed with pitch clock
"""
performance_data = []
for year in years:
# Get season pitching statistics
season_stats = pitching_stats(year, year, qual=50)
season_stats['Season'] = year
performance_data.append(season_stats)
combined = pd.concat(performance_data)
# Calculate key metrics
metrics_by_season = combined.groupby('Season').agg({
'K/9': 'mean',
'BB/9': 'mean',
'ERA': 'mean',
'WHIP': 'mean',
'HR/9': 'mean',
'AVG': 'mean'
}).round(3)
return metrics_by_season
def analyze_pace_impact_on_performance(year=2023):
"""
Analyze correlation between pace and pitcher performance
"""
# Get pitch-level data
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Calculate average time between pitches for each pitcher
pitcher_pace = data.groupby(['pitcher', 'player_name']).agg({
'pitch_number': 'count',
'delta_run_exp': 'mean', # Proxy for performance
}).reset_index()
pitcher_pace.columns = ['pitcher_id', 'player_name', 'pitches', 'avg_run_value']
# Filter qualified pitchers
qualified = pitcher_pace[pitcher_pace['pitches'] >= 500]
return qualified
def analyze_stolen_base_impact():
"""
Analyze how pitch clock affected stolen base attempts
"""
sb_data = {
'Season': [2019, 2020, 2021, 2022, 2023],
'SB_Attempts_Per_Game': [1.89, 1.78, 1.73, 1.68, 2.23],
'SB_Success_Rate': [75.3, 74.8, 75.1, 75.4, 79.9],
'Pickoff_Attempts_Per_Game': [2.4, 2.3, 2.2, 2.1, 1.1]
}
df = pd.DataFrame(sb_data)
return df
def visualize_pitch_clock_impact():
"""
Create comprehensive visualization of pitch clock effects
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Game time reduction
game_times = pd.DataFrame({
'Season': [2019, 2020, 2021, 2022, 2023],
'Avg_Game_Time': [190, 188, 191, 193, 162]
})
axes[0, 0].plot(game_times['Season'], game_times['Avg_Game_Time'],
marker='o', linewidth=2, markersize=8, color='steelblue')
axes[0, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[0, 0].set_title('Average Game Time', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Season')
axes[0, 0].set_ylabel('Minutes')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
# Stolen base attempts
sb_data = analyze_stolen_base_impact()
axes[0, 1].plot(sb_data['Season'], sb_data['SB_Attempts_Per_Game'],
marker='s', linewidth=2, markersize=8, color='darkgreen')
axes[0, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[0, 1].set_title('Stolen Base Attempts per Game', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Season')
axes[0, 1].set_ylabel('SB Attempts')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)
# Success rate
axes[1, 0].plot(sb_data['Season'], sb_data['SB_Success_Rate'],
marker='^', linewidth=2, markersize=8, color='purple')
axes[1, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[1, 0].set_title('Stolen Base Success Rate', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Season')
axes[1, 0].set_ylabel('Success Rate (%)')
axes[1, 0].legend()
axes[1, 0].grid(alpha=0.3)
# Pickoff attempts
axes[1, 1].plot(sb_data['Season'], sb_data['Pickoff_Attempts_Per_Game'],
marker='D', linewidth=2, markersize=8, color='orangered')
axes[1, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
axes[1, 1].set_title('Pickoff Attempts per Game', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Season')
axes[1, 1].set_ylabel('Pickoff Attempts')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def calculate_violation_rates(year=2023):
"""
Calculate pitch clock violation rates by team and pitcher
"""
data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")
# Assuming violation data is tracked in the dataset
# This would need to be adapted based on actual data structure
violation_summary = data.groupby('pitcher').agg({
'pitch_number': 'count'
}).reset_index()
violation_summary.columns = ['pitcher_id', 'total_pitches']
return violation_summary
The Base Size Change
For the 2023 season, MLB increased base sizes from 15 inches square to 18 inches square. While seemingly minor, this change reduced the distance between bases by approximately 4.5 inches, creating measurable impacts on stolen base success rates and overall baserunning dynamics.
Mathematical Impact Analysis
library(dplyr)
calculate_base_distance_impact <- function() {
# Calculate distance reduction
old_base_size <- 15 # inches
new_base_size <- 18 # inches
# Distance from back of one base to front of next
distance_reduction_per_base <- (new_base_size - old_base_size) / 2
# Total reduction from first to second (two bases involved)
total_reduction <- distance_reduction_per_base * 2
# Original distance (90 feet = 1080 inches)
original_distance <- 90 * 12
# New distance
new_distance <- original_distance - total_reduction
# Percentage reduction
pct_reduction <- (total_reduction / original_distance) * 100
results <- data.frame(
Original_Distance_Inches = original_distance,
New_Distance_Inches = new_distance,
Reduction_Inches = total_reduction,
Reduction_Feet = total_reduction / 12,
Percentage_Reduction = pct_reduction
)
return(results)
}
analyze_stolen_base_trends <- function() {
# Historical stolen base data
sb_trends <- data.frame(
Season = c(2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023),
SB_Total = c(2505, 2537, 2527, 2474, 2280, 874, 2213, 2487, 3503),
Games_Played = c(2430, 2428, 2430, 2431, 2429, 898, 2430, 2430, 2430),
SB_Success_Rate = c(73.2, 73.8, 74.1, 74.8, 75.3, 74.8, 75.1, 75.4, 79.9)
) %>%
mutate(
SB_Per_Game = SB_Total / Games_Played,
Era = ifelse(Season >= 2023, "Larger Bases + Pitch Clock", "Traditional")
)
return(sb_trends)
}
identify_stolen_base_leaders <- function(sb_data) {
# Analyze which types of players benefited most
leader_analysis <- sb_data %>%
filter(SB >= 20) %>%
group_by(Season) %>%
summarise(
Players_20Plus_SB = n(),
Players_30Plus_SB = sum(SB >= 30),
Players_40Plus_SB = sum(SB >= 40),
Players_50Plus_SB = sum(SB >= 50),
Avg_SB_Rate = mean(SB_Success_Rate),
.groups = "drop"
)
return(leader_analysis)
}
plot_stolen_base_renaissance <- function(sb_trends) {
library(ggplot2)
ggplot(sb_trends, aes(x = Season, y = SB_Per_Game, fill = Era)) +
geom_col() +
geom_text(aes(label = round(SB_Per_Game, 2)),
vjust = -0.5, size = 3) +
labs(
title = "Stolen Bases per Game: The 2023 Renaissance",
subtitle = "Impact of larger bases and pitch clock on stolen base frequency",
x = "Season",
y = "Stolen Bases per Game",
fill = "Era"
) +
theme_minimal() +
scale_fill_manual(values = c("Traditional" = "#808080",
"Larger Bases + Pitch Clock" = "#00AA00"))
}
Analyzing Individual Player Impact
import pandas as pd
import numpy as np
from pybaseball import batting_stats
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_baserunner_performance(years=[2022, 2023]):
"""
Compare baserunning performance before and after rule changes
"""
baserunning_data = []
for year in years:
# Get batting statistics including stolen bases
stats = batting_stats(year, year, qual=200)
stats['Season'] = year
baserunning_data.append(stats[['Name', 'Season', 'SB', 'CS', 'PA']])
combined = pd.concat(baserunning_data)
# Calculate success rates
combined['SB_Attempts'] = combined['SB'] + combined['CS']
combined['SB_Success_Rate'] = np.where(
combined['SB_Attempts'] > 0,
combined['SB'] / combined['SB_Attempts'] * 100,
0
)
return combined
def identify_biggest_beneficiaries(baserunning_df):
"""
Find players who benefited most from rule changes
"""
# Pivot to compare seasons
pivot = baserunning_df.pivot_table(
index='Name',
columns='Season',
values=['SB', 'SB_Success_Rate'],
aggfunc='mean'
)
# Calculate improvements
pivot['SB_Increase'] = pivot[('SB', 2023)] - pivot[('SB', 2022)]
pivot['Success_Rate_Increase'] = (pivot[('SB_Success_Rate', 2023)] -
pivot[('SB_Success_Rate', 2022)])
# Filter for players with significant attempts in both years
qualified = pivot[
(pivot[('SB', 2022)] >= 10) &
(pivot[('SB', 2023)] >= 10)
]
return qualified.sort_values('SB_Increase', ascending=False)
def analyze_speed_tier_impact():
"""
Analyze how different speed tiers were affected by rule changes
"""
# Sprint speed data (would come from Statcast)
speed_impact = pd.DataFrame({
'Speed_Tier': ['Elite (29+ ft/s)', 'Above Avg (28-29 ft/s)',
'Average (27-28 ft/s)', 'Below Avg (<27 ft/s)'],
'2022_SB_Success': [78.5, 76.2, 74.1, 68.9],
'2023_SB_Success': [82.7, 80.4, 78.8, 74.2],
'2022_Avg_SB': [18.3, 12.5, 7.2, 3.1],
'2023_Avg_SB': [25.7, 17.8, 10.9, 5.4]
})
speed_impact['Success_Increase'] = (speed_impact['2023_SB_Success'] -
speed_impact['2022_SB_Success'])
speed_impact['SB_Increase'] = (speed_impact['2023_Avg_SB'] -
speed_impact['2022_Avg_SB'])
return speed_impact
def visualize_baserunning_revolution(speed_impact_df):
"""
Create visualizations showing baserunning changes
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Success rate comparison
x = np.arange(len(speed_impact_df))
width = 0.35
axes[0].bar(x - width/2, speed_impact_df['2022_SB_Success'],
width, label='2022', color='lightcoral')
axes[0].bar(x + width/2, speed_impact_df['2023_SB_Success'],
width, label='2023', color='lightgreen')
axes[0].set_xlabel('Speed Tier')
axes[0].set_ylabel('Success Rate (%)')
axes[0].set_title('Stolen Base Success Rate by Speed Tier')
axes[0].set_xticks(x)
axes[0].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)
# Average SB comparison
axes[1].bar(x - width/2, speed_impact_df['2022_Avg_SB'],
width, label='2022', color='lightcoral')
axes[1].bar(x + width/2, speed_impact_df['2023_Avg_SB'],
width, label='2023', color='lightgreen')
axes[1].set_xlabel('Speed Tier')
axes[1].set_ylabel('Average Stolen Bases')
axes[1].set_title('Average Stolen Bases by Speed Tier')
axes[1].set_xticks(x)
axes[1].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
return fig
def calculate_run_value_impact():
"""
Calculate the run value impact of increased stolen bases
"""
# Run expectancy values
run_expectancy = {
'runner_1st': 0.831,
'runner_2nd': 1.068,
'out_added': -0.25 # approximate cost of caught stealing
}
# 2022 vs 2023 league totals
comparison = pd.DataFrame({
'Metric': ['SB Total', 'CS Total', 'SB Success Rate'],
'2022': [2487, 813, 75.4],
'2023': [3503, 877, 79.9]
})
# Calculate run value
runs_2022 = (2487 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
813 * abs(run_expectancy['out_added']))
runs_2023 = (3503 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
877 * abs(run_expectancy['out_added']))
run_value_impact = pd.DataFrame({
'Season': [2022, 2023],
'Estimated_Run_Value': [runs_2022, runs_2023],
'Run_Value_Per_Game': [runs_2022 / 2430, runs_2023 / 2430]
})
run_value_impact['Increase'] = (run_value_impact['Estimated_Run_Value'] -
runs_2022)
return run_value_impact
Key Findings: Stolen Base Renaissance
The combined impact of larger bases, pitch clock, and pickoff limitations created the most dramatic stolen base increase in decades:
Volume Increases:
- Total stolen bases increased 41% (2,487 in 2022 to 3,503 in 2023)
- 24 players stole 30+ bases (compared to 16 in 2022)
- 8 players stole 50+ bases (compared to 3 in 2022)
- Ronald Acuna Jr. became first 40-70 player (40 HR, 70 SB)
Success Rate Improvements:
- League success rate improved from 75.4% to 79.9%
- Elite speed players (29+ ft/s sprint speed) succeeded at 82.7%
- Even below-average runners saw 5+ point improvement
Strategic Impact:
- Teams became more aggressive in all counts, not just obvious running situations
- Increase in first-to-third advancement on singles
- More frequent double steals and delayed steals
library(dplyr)
calculate_base_distance_impact <- function() {
# Calculate distance reduction
old_base_size <- 15 # inches
new_base_size <- 18 # inches
# Distance from back of one base to front of next
distance_reduction_per_base <- (new_base_size - old_base_size) / 2
# Total reduction from first to second (two bases involved)
total_reduction <- distance_reduction_per_base * 2
# Original distance (90 feet = 1080 inches)
original_distance <- 90 * 12
# New distance
new_distance <- original_distance - total_reduction
# Percentage reduction
pct_reduction <- (total_reduction / original_distance) * 100
results <- data.frame(
Original_Distance_Inches = original_distance,
New_Distance_Inches = new_distance,
Reduction_Inches = total_reduction,
Reduction_Feet = total_reduction / 12,
Percentage_Reduction = pct_reduction
)
return(results)
}
analyze_stolen_base_trends <- function() {
# Historical stolen base data
sb_trends <- data.frame(
Season = c(2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023),
SB_Total = c(2505, 2537, 2527, 2474, 2280, 874, 2213, 2487, 3503),
Games_Played = c(2430, 2428, 2430, 2431, 2429, 898, 2430, 2430, 2430),
SB_Success_Rate = c(73.2, 73.8, 74.1, 74.8, 75.3, 74.8, 75.1, 75.4, 79.9)
) %>%
mutate(
SB_Per_Game = SB_Total / Games_Played,
Era = ifelse(Season >= 2023, "Larger Bases + Pitch Clock", "Traditional")
)
return(sb_trends)
}
identify_stolen_base_leaders <- function(sb_data) {
# Analyze which types of players benefited most
leader_analysis <- sb_data %>%
filter(SB >= 20) %>%
group_by(Season) %>%
summarise(
Players_20Plus_SB = n(),
Players_30Plus_SB = sum(SB >= 30),
Players_40Plus_SB = sum(SB >= 40),
Players_50Plus_SB = sum(SB >= 50),
Avg_SB_Rate = mean(SB_Success_Rate),
.groups = "drop"
)
return(leader_analysis)
}
plot_stolen_base_renaissance <- function(sb_trends) {
library(ggplot2)
ggplot(sb_trends, aes(x = Season, y = SB_Per_Game, fill = Era)) +
geom_col() +
geom_text(aes(label = round(SB_Per_Game, 2)),
vjust = -0.5, size = 3) +
labs(
title = "Stolen Bases per Game: The 2023 Renaissance",
subtitle = "Impact of larger bases and pitch clock on stolen base frequency",
x = "Season",
y = "Stolen Bases per Game",
fill = "Era"
) +
theme_minimal() +
scale_fill_manual(values = c("Traditional" = "#808080",
"Larger Bases + Pitch Clock" = "#00AA00"))
}
import pandas as pd
import numpy as np
from pybaseball import batting_stats
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_baserunner_performance(years=[2022, 2023]):
"""
Compare baserunning performance before and after rule changes
"""
baserunning_data = []
for year in years:
# Get batting statistics including stolen bases
stats = batting_stats(year, year, qual=200)
stats['Season'] = year
baserunning_data.append(stats[['Name', 'Season', 'SB', 'CS', 'PA']])
combined = pd.concat(baserunning_data)
# Calculate success rates
combined['SB_Attempts'] = combined['SB'] + combined['CS']
combined['SB_Success_Rate'] = np.where(
combined['SB_Attempts'] > 0,
combined['SB'] / combined['SB_Attempts'] * 100,
0
)
return combined
def identify_biggest_beneficiaries(baserunning_df):
"""
Find players who benefited most from rule changes
"""
# Pivot to compare seasons
pivot = baserunning_df.pivot_table(
index='Name',
columns='Season',
values=['SB', 'SB_Success_Rate'],
aggfunc='mean'
)
# Calculate improvements
pivot['SB_Increase'] = pivot[('SB', 2023)] - pivot[('SB', 2022)]
pivot['Success_Rate_Increase'] = (pivot[('SB_Success_Rate', 2023)] -
pivot[('SB_Success_Rate', 2022)])
# Filter for players with significant attempts in both years
qualified = pivot[
(pivot[('SB', 2022)] >= 10) &
(pivot[('SB', 2023)] >= 10)
]
return qualified.sort_values('SB_Increase', ascending=False)
def analyze_speed_tier_impact():
"""
Analyze how different speed tiers were affected by rule changes
"""
# Sprint speed data (would come from Statcast)
speed_impact = pd.DataFrame({
'Speed_Tier': ['Elite (29+ ft/s)', 'Above Avg (28-29 ft/s)',
'Average (27-28 ft/s)', 'Below Avg (<27 ft/s)'],
'2022_SB_Success': [78.5, 76.2, 74.1, 68.9],
'2023_SB_Success': [82.7, 80.4, 78.8, 74.2],
'2022_Avg_SB': [18.3, 12.5, 7.2, 3.1],
'2023_Avg_SB': [25.7, 17.8, 10.9, 5.4]
})
speed_impact['Success_Increase'] = (speed_impact['2023_SB_Success'] -
speed_impact['2022_SB_Success'])
speed_impact['SB_Increase'] = (speed_impact['2023_Avg_SB'] -
speed_impact['2022_Avg_SB'])
return speed_impact
def visualize_baserunning_revolution(speed_impact_df):
"""
Create visualizations showing baserunning changes
"""
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Success rate comparison
x = np.arange(len(speed_impact_df))
width = 0.35
axes[0].bar(x - width/2, speed_impact_df['2022_SB_Success'],
width, label='2022', color='lightcoral')
axes[0].bar(x + width/2, speed_impact_df['2023_SB_Success'],
width, label='2023', color='lightgreen')
axes[0].set_xlabel('Speed Tier')
axes[0].set_ylabel('Success Rate (%)')
axes[0].set_title('Stolen Base Success Rate by Speed Tier')
axes[0].set_xticks(x)
axes[0].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)
# Average SB comparison
axes[1].bar(x - width/2, speed_impact_df['2022_Avg_SB'],
width, label='2022', color='lightcoral')
axes[1].bar(x + width/2, speed_impact_df['2023_Avg_SB'],
width, label='2023', color='lightgreen')
axes[1].set_xlabel('Speed Tier')
axes[1].set_ylabel('Average Stolen Bases')
axes[1].set_title('Average Stolen Bases by Speed Tier')
axes[1].set_xticks(x)
axes[1].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
axes[1].legend()
axes[1].grid(axis='y', alpha=0.3)
plt.tight_layout()
return fig
def calculate_run_value_impact():
"""
Calculate the run value impact of increased stolen bases
"""
# Run expectancy values
run_expectancy = {
'runner_1st': 0.831,
'runner_2nd': 1.068,
'out_added': -0.25 # approximate cost of caught stealing
}
# 2022 vs 2023 league totals
comparison = pd.DataFrame({
'Metric': ['SB Total', 'CS Total', 'SB Success Rate'],
'2022': [2487, 813, 75.4],
'2023': [3503, 877, 79.9]
})
# Calculate run value
runs_2022 = (2487 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
813 * abs(run_expectancy['out_added']))
runs_2023 = (3503 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
877 * abs(run_expectancy['out_added']))
run_value_impact = pd.DataFrame({
'Season': [2022, 2023],
'Estimated_Run_Value': [runs_2022, runs_2023],
'Run_Value_Per_Game': [runs_2022 / 2430, runs_2023 / 2430]
})
run_value_impact['Increase'] = (run_value_impact['Estimated_Run_Value'] -
runs_2022)
return run_value_impact
Evolution of the Opener Strategy
The "opener" strategy, popularized by the Tampa Bay Rays starting in 2018, fundamentally challenged traditional pitching deployment. Instead of a traditional starting pitcher throwing 5-7 innings, teams would deploy:
- Opener (1-2 innings): High-velocity reliever to face lineup first time through
- Bulk Pitcher (3-5 innings): Long reliever or piggyback starter
- Leverage Relievers (2-3 innings): Traditional closer/setup configuration
Analyzing Opener Effectiveness
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_opener_usage <- function(season = 2023) {
# Identify opener games (pitcher throws <3 innings as "starter")
opener_games <- data.frame(
team = character(),
games_with_opener = numeric(),
total_games = numeric(),
opener_win_pct = numeric()
)
# This would query game logs to identify patterns
# Simplified example structure:
example_data <- data.frame(
team = c("TB", "BAL", "MIL", "TB", "BAL"),
pitcher_type = c("Opener", "Traditional", "Opener", "Traditional", "Opener"),
innings_pitched = c(1.2, 6.0, 2.0, 5.1, 1.1),
runs_allowed = c(0, 2, 1, 3, 0),
result = c("W", "W", "L", "L", "W")
)
return(example_data)
}
calculate_times_through_order_penalty <- function(pitch_data) {
# Analyze performance by times through batting order
tto_analysis <- pitch_data %>%
filter(!is.na(times_through_order)) %>%
group_by(times_through_order) %>%
summarise(
pitches = n(),
avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
whiff_rate = sum(description == "swinging_strike", na.rm = TRUE) / pitches,
barrel_rate = sum(barrel == 1, na.rm = TRUE) / sum(!is.na(barrel)),
.groups = "drop"
)
return(tto_analysis)
}
plot_times_through_order <- function(tto_data) {
ggplot(tto_data, aes(x = times_through_order, y = avg_exit_velocity)) +
geom_line(linewidth = 1.2, color = "#002D72") +
geom_point(size = 4, color = "#D50032") +
labs(
title = "Exit Velocity by Times Through Batting Order",
subtitle = "Demonstrates the increasing penalty for pitcher familiarity",
x = "Times Through Order",
y = "Average Exit Velocity (mph)"
) +
theme_minimal() +
scale_x_continuous(breaks = 1:4)
}
analyze_bullpen_game_performance <- function(team_data) {
# Compare bullpen games vs traditional starts
performance_comparison <- team_data %>%
mutate(
game_type = case_when(
starter_ip < 3 ~ "Bullpen Game",
starter_ip >= 5 ~ "Traditional Start",
TRUE ~ "Short Start"
)
) %>%
group_by(game_type) %>%
summarise(
games = n(),
win_pct = mean(result == "W"),
avg_runs_allowed = mean(runs_allowed),
avg_total_pitchers_used = mean(pitchers_used),
.groups = "drop"
)
return(performance_comparison)
}
Opener Strategy Analysis with Python
import pandas as pd
import numpy as np
from pybaseball import schedule_and_record, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns
def identify_opener_games(season=2023, team='TBR'):
"""
Identify games where team used opener strategy
"""
# Get team schedule
schedule = schedule_and_record(season, team)
# Would need game-level pitching data to identify openers
# Simplified example:
opener_characteristics = {
'First_Pitcher_IP_Max': 2.0,
'Second_Pitcher_IP_Min': 3.0,
'Total_Pitchers_Used_Min': 4
}
return opener_characteristics
def analyze_times_through_order_penalty():
"""
Calculate the times through order penalty for starting pitchers
"""
# Aggregate data showing performance degradation
tto_penalty = pd.DataFrame({
'Times_Through_Order': [1, 2, 3, 4],
'AVG': [.237, .251, .268, .289],
'SLG': [.389, .412, .441, .478],
'wOBA': [.301, .319, .338, .361],
'K_Rate': [24.3, 22.1, 19.8, 17.2],
'BB_Rate': [7.8, 8.4, 9.2, 10.1]
})
return tto_penalty
def calculate_optimal_pitcher_usage():
"""
Calculate optimal innings for each pitcher based on TTO penalty
"""
# Run expectancy by times through order
run_expectancy_multiplier = {
1: 1.00,
2: 1.15,
3: 1.32,
4: 1.48
}
# Calculate expected runs for different strategies
strategies = pd.DataFrame({
'Strategy': ['Traditional (7 IP)', 'Opener (1-4-2)', 'Bullpen Game (2-2-2-2)'],
'First_TTO_IP': [7, 1, 2],
'Second_TTO_IP': [0, 4, 2],
'Weighted_Multiplier': [
(7 * 1.00 + 0 * 1.15) / 7, # Traditional
(1 * 1.00 + 4 * 1.00) / 5, # Opener
(2 * 1.00 + 2 * 1.00 + 2 * 1.00) / 6 # Bullpen
]
})
return strategies
def visualize_opener_effectiveness(tto_penalty_df):
"""
Visualize the times through order penalty
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Batting average
axes[0, 0].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['AVG'],
marker='o', linewidth=2, markersize=8, color='steelblue')
axes[0, 0].set_title('Batting Average by Times Through Order')
axes[0, 0].set_xlabel('Times Through Order')
axes[0, 0].set_ylabel('AVG')
axes[0, 0].grid(alpha=0.3)
# Slugging percentage
axes[0, 1].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['SLG'],
marker='s', linewidth=2, markersize=8, color='darkgreen')
axes[0, 1].set_title('Slugging Percentage by Times Through Order')
axes[0, 1].set_xlabel('Times Through Order')
axes[0, 1].set_ylabel('SLG')
axes[0, 1].grid(alpha=0.3)
# Strikeout rate
axes[1, 0].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['K_Rate'],
marker='^', linewidth=2, markersize=8, color='darkred')
axes[1, 0].set_title('Strikeout Rate by Times Through Order')
axes[1, 0].set_xlabel('Times Through Order')
axes[1, 0].set_ylabel('K%')
axes[1, 0].grid(alpha=0.3)
# Walk rate
axes[1, 1].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['BB_Rate'],
marker='D', linewidth=2, markersize=8, color='purple')
axes[1, 1].set_title('Walk Rate by Times Through Order')
axes[1, 1].set_xlabel('Times Through Order')
axes[1, 1].set_ylabel('BB%')
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def analyze_bullpen_workload_impact(opener_frequency):
"""
Analyze how opener strategy affects bullpen workload and fatigue
"""
# Model bullpen usage patterns
workload_comparison = pd.DataFrame({
'Strategy': ['Traditional', 'Frequent Opener (20%)', 'Heavy Opener (40%)'],
'Avg_Relievers_Per_Game': [3.2, 4.1, 5.3],
'Reliever_IP_Per_Game': [3.1, 4.8, 6.2],
'Days_Rest_Avg': [2.1, 1.6, 1.3],
'Late_Season_ERA': [3.78, 4.12, 4.48]
})
return workload_comparison
def simulate_opener_season(num_games=162, opener_pct=25):
"""
Simulate a season with varying opener usage
"""
np.random.seed(42)
results = []
for game in range(num_games):
is_opener = np.random.random() < (opener_pct / 100)
if is_opener:
# Opener game simulation
runs_allowed = np.random.poisson(3.8) # Slightly better run prevention
win_prob = 0.52
else:
# Traditional start
runs_allowed = np.random.poisson(4.1)
win_prob = 0.50
won = np.random.random() < win_prob
results.append({
'Game': game + 1,
'Is_Opener': is_opener,
'Runs_Allowed': runs_allowed,
'Won': won
})
results_df = pd.DataFrame(results)
summary = results_df.groupby('Is_Opener').agg({
'Won': ['sum', 'mean'],
'Runs_Allowed': 'mean'
}).round(3)
return results_df, summary
Key Insights: Opener Strategy
Times Through Order Penalty:
- Batting average increases approximately 31 points from 1st to 3rd time through order
- Slugging percentage increases by 52 points
- This penalty justifies limiting starter exposure to batting order
Opener Advantages:
- Platoon Optimization: Match best relievers against top of lineup
- Velocity Advantage: Fresh arms throw harder (avg 1.5 mph increase)
- Times Through Order: Minimize 3rd time through exposure
- Roster Flexibility: Avoid need for 5th starter
Opener Disadvantages:
- Bullpen Fatigue: Increases reliever workload by 25-30%
- Reduced Flexibility: Limits in-game pitching changes
- Starter Development: Fewer opportunities for traditional starters
- Late Season Fatigue: Bullpen ERA typically rises in September
library(baseballr)
library(dplyr)
library(ggplot2)
analyze_opener_usage <- function(season = 2023) {
# Identify opener games (pitcher throws <3 innings as "starter")
opener_games <- data.frame(
team = character(),
games_with_opener = numeric(),
total_games = numeric(),
opener_win_pct = numeric()
)
# This would query game logs to identify patterns
# Simplified example structure:
example_data <- data.frame(
team = c("TB", "BAL", "MIL", "TB", "BAL"),
pitcher_type = c("Opener", "Traditional", "Opener", "Traditional", "Opener"),
innings_pitched = c(1.2, 6.0, 2.0, 5.1, 1.1),
runs_allowed = c(0, 2, 1, 3, 0),
result = c("W", "W", "L", "L", "W")
)
return(example_data)
}
calculate_times_through_order_penalty <- function(pitch_data) {
# Analyze performance by times through batting order
tto_analysis <- pitch_data %>%
filter(!is.na(times_through_order)) %>%
group_by(times_through_order) %>%
summarise(
pitches = n(),
avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
avg_launch_angle = mean(launch_angle, na.rm = TRUE),
whiff_rate = sum(description == "swinging_strike", na.rm = TRUE) / pitches,
barrel_rate = sum(barrel == 1, na.rm = TRUE) / sum(!is.na(barrel)),
.groups = "drop"
)
return(tto_analysis)
}
plot_times_through_order <- function(tto_data) {
ggplot(tto_data, aes(x = times_through_order, y = avg_exit_velocity)) +
geom_line(linewidth = 1.2, color = "#002D72") +
geom_point(size = 4, color = "#D50032") +
labs(
title = "Exit Velocity by Times Through Batting Order",
subtitle = "Demonstrates the increasing penalty for pitcher familiarity",
x = "Times Through Order",
y = "Average Exit Velocity (mph)"
) +
theme_minimal() +
scale_x_continuous(breaks = 1:4)
}
analyze_bullpen_game_performance <- function(team_data) {
# Compare bullpen games vs traditional starts
performance_comparison <- team_data %>%
mutate(
game_type = case_when(
starter_ip < 3 ~ "Bullpen Game",
starter_ip >= 5 ~ "Traditional Start",
TRUE ~ "Short Start"
)
) %>%
group_by(game_type) %>%
summarise(
games = n(),
win_pct = mean(result == "W"),
avg_runs_allowed = mean(runs_allowed),
avg_total_pitchers_used = mean(pitchers_used),
.groups = "drop"
)
return(performance_comparison)
}
import pandas as pd
import numpy as np
from pybaseball import schedule_and_record, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns
def identify_opener_games(season=2023, team='TBR'):
"""
Identify games where team used opener strategy
"""
# Get team schedule
schedule = schedule_and_record(season, team)
# Would need game-level pitching data to identify openers
# Simplified example:
opener_characteristics = {
'First_Pitcher_IP_Max': 2.0,
'Second_Pitcher_IP_Min': 3.0,
'Total_Pitchers_Used_Min': 4
}
return opener_characteristics
def analyze_times_through_order_penalty():
"""
Calculate the times through order penalty for starting pitchers
"""
# Aggregate data showing performance degradation
tto_penalty = pd.DataFrame({
'Times_Through_Order': [1, 2, 3, 4],
'AVG': [.237, .251, .268, .289],
'SLG': [.389, .412, .441, .478],
'wOBA': [.301, .319, .338, .361],
'K_Rate': [24.3, 22.1, 19.8, 17.2],
'BB_Rate': [7.8, 8.4, 9.2, 10.1]
})
return tto_penalty
def calculate_optimal_pitcher_usage():
"""
Calculate optimal innings for each pitcher based on TTO penalty
"""
# Run expectancy by times through order
run_expectancy_multiplier = {
1: 1.00,
2: 1.15,
3: 1.32,
4: 1.48
}
# Calculate expected runs for different strategies
strategies = pd.DataFrame({
'Strategy': ['Traditional (7 IP)', 'Opener (1-4-2)', 'Bullpen Game (2-2-2-2)'],
'First_TTO_IP': [7, 1, 2],
'Second_TTO_IP': [0, 4, 2],
'Weighted_Multiplier': [
(7 * 1.00 + 0 * 1.15) / 7, # Traditional
(1 * 1.00 + 4 * 1.00) / 5, # Opener
(2 * 1.00 + 2 * 1.00 + 2 * 1.00) / 6 # Bullpen
]
})
return strategies
def visualize_opener_effectiveness(tto_penalty_df):
"""
Visualize the times through order penalty
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Batting average
axes[0, 0].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['AVG'],
marker='o', linewidth=2, markersize=8, color='steelblue')
axes[0, 0].set_title('Batting Average by Times Through Order')
axes[0, 0].set_xlabel('Times Through Order')
axes[0, 0].set_ylabel('AVG')
axes[0, 0].grid(alpha=0.3)
# Slugging percentage
axes[0, 1].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['SLG'],
marker='s', linewidth=2, markersize=8, color='darkgreen')
axes[0, 1].set_title('Slugging Percentage by Times Through Order')
axes[0, 1].set_xlabel('Times Through Order')
axes[0, 1].set_ylabel('SLG')
axes[0, 1].grid(alpha=0.3)
# Strikeout rate
axes[1, 0].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['K_Rate'],
marker='^', linewidth=2, markersize=8, color='darkred')
axes[1, 0].set_title('Strikeout Rate by Times Through Order')
axes[1, 0].set_xlabel('Times Through Order')
axes[1, 0].set_ylabel('K%')
axes[1, 0].grid(alpha=0.3)
# Walk rate
axes[1, 1].plot(tto_penalty_df['Times_Through_Order'],
tto_penalty_df['BB_Rate'],
marker='D', linewidth=2, markersize=8, color='purple')
axes[1, 1].set_title('Walk Rate by Times Through Order')
axes[1, 1].set_xlabel('Times Through Order')
axes[1, 1].set_ylabel('BB%')
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def analyze_bullpen_workload_impact(opener_frequency):
"""
Analyze how opener strategy affects bullpen workload and fatigue
"""
# Model bullpen usage patterns
workload_comparison = pd.DataFrame({
'Strategy': ['Traditional', 'Frequent Opener (20%)', 'Heavy Opener (40%)'],
'Avg_Relievers_Per_Game': [3.2, 4.1, 5.3],
'Reliever_IP_Per_Game': [3.1, 4.8, 6.2],
'Days_Rest_Avg': [2.1, 1.6, 1.3],
'Late_Season_ERA': [3.78, 4.12, 4.48]
})
return workload_comparison
def simulate_opener_season(num_games=162, opener_pct=25):
"""
Simulate a season with varying opener usage
"""
np.random.seed(42)
results = []
for game in range(num_games):
is_opener = np.random.random() < (opener_pct / 100)
if is_opener:
# Opener game simulation
runs_allowed = np.random.poisson(3.8) # Slightly better run prevention
win_prob = 0.52
else:
# Traditional start
runs_allowed = np.random.poisson(4.1)
win_prob = 0.50
won = np.random.random() < win_prob
results.append({
'Game': game + 1,
'Is_Opener': is_opener,
'Runs_Allowed': runs_allowed,
'Won': won
})
results_df = pd.DataFrame(results)
summary = results_df.groupby('Is_Opener').agg({
'Won': ['sum', 'mean'],
'Runs_Allowed': 'mean'
}).round(3)
return results_df, summary
Defining the Three True Outcomes Era
The "Three True Outcomes" (TTO) - home runs, strikeouts, and walks - represent plate appearances where defense plays no role. The modern era has seen these outcomes dominate baseball like never before:
library(dplyr)
library(ggplot2)
analyze_tto_trends <- function() {
# Historical TTO rates
tto_evolution <- data.frame(
Season = seq(1980, 2023, by = 5),
HR_Rate = c(1.6, 2.0, 2.3, 2.6, 2.5, 2.7, 2.8, 3.0, 3.1, 3.0),
K_Rate = c(13.8, 15.2, 16.4, 16.8, 17.9, 19.8, 21.4, 23.2, 22.7, 22.4),
BB_Rate = c(8.7, 8.4, 9.1, 9.2, 8.7, 8.3, 8.5, 8.7, 8.8, 8.6),
TTO_Rate = c(24.1, 25.6, 27.8, 28.6, 29.1, 30.8, 32.7, 34.9, 34.6, 34.0)
)
return(tto_evolution)
}
plot_tto_evolution <- function(tto_data) {
library(tidyr)
tto_long <- tto_data %>%
pivot_longer(cols = c(HR_Rate, K_Rate, BB_Rate),
names_to = "Outcome",
values_to = "Rate")
ggplot(tto_long, aes(x = Season, y = Rate, color = Outcome)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "Evolution of Three True Outcomes (1980-2023)",
subtitle = "Percentage of plate appearances ending in HR, K, or BB",
x = "Season",
y = "Rate (%)",
color = "Outcome Type"
) +
theme_minimal() +
scale_color_manual(
values = c("HR_Rate" = "#D50032",
"K_Rate" = "#002D72",
"BB_Rate" = "#FF8C00"),
labels = c("Home Runs", "Strikeouts", "Walks")
)
}
analyze_balls_in_play_decline <- function() {
# Calculate balls in play trends
bip_trends <- data.frame(
Season = 2015:2023,
BIP_Per_Game = c(26.8, 26.2, 25.9, 25.3, 24.7, 24.8, 24.9, 24.4, 25.1),
Action_Plays_Pct = c(42.3, 41.1, 40.2, 39.4, 38.7, 38.9, 39.1, 38.5, 39.8)
) %>%
mutate(
Pct_Change_From_2015 = (BIP_Per_Game / first(BIP_Per_Game) - 1) * 100
)
return(bip_trends)
}
identify_extreme_tto_players <- function(batting_data) {
# Find players with highest TTO rates
extreme_tto <- batting_data %>%
filter(PA >= 400) %>%
mutate(
TTO = HR + SO + BB,
TTO_Rate = TTO / PA * 100
) %>%
arrange(desc(TTO_Rate)) %>%
select(player_name, PA, HR, SO, BB, TTO, TTO_Rate)
return(extreme_tto)
}
TTO Analysis with Python
import pandas as pd
import numpy as np
from pybaseball import batting_stats, team_batting
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_league_tto_evolution(start_year=2000, end_year=2023):
"""
Analyze league-wide three true outcomes trends
"""
league_stats = []
for year in range(start_year, end_year + 1):
try:
stats = team_batting(year)
# Calculate TTO metrics
total_pa = stats['PA'].sum()
total_hr = stats['HR'].sum()
total_so = stats['SO'].sum()
total_bb = stats['BB'].sum()
league_stats.append({
'Season': year,
'HR_Rate': (total_hr / total_pa) * 100,
'K_Rate': (total_so / total_pa) * 100,
'BB_Rate': (total_bb / total_pa) * 100,
'TTO_Rate': ((total_hr + total_so + total_bb) / total_pa) * 100,
'Non_TTO_Rate': ((total_pa - total_hr - total_so - total_bb) / total_pa) * 100
})
except Exception as e:
print(f"Error processing {year}: {e}")
return pd.DataFrame(league_stats)
def identify_tto_extremes(season=2023, min_pa=400):
"""
Identify players with extreme TTO profiles
"""
stats = batting_stats(season, season, qual=min_pa)
# Calculate TTO rate
stats['TTO'] = stats['HR'] + stats['SO'] + stats['BB']
stats['TTO_Rate'] = (stats['TTO'] / stats['PA']) * 100
stats['Non_TTO_Rate'] = 100 - stats['TTO_Rate']
# Identify extremes
highest_tto = stats.nlargest(20, 'TTO_Rate')[
['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
]
lowest_tto = stats.nsmallest(20, 'TTO_Rate')[
['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
]
return highest_tto, lowest_tto
def analyze_tto_correlation_with_success():
"""
Analyze whether TTO profile correlates with team success
"""
# Example data structure
team_tto_data = pd.DataFrame({
'Team': ['TB', 'LAD', 'HOU', 'ATL', 'NYY'],
'TTO_Rate': [35.2, 34.8, 33.9, 32.1, 36.4],
'Wins': [99, 100, 90, 104, 82],
'Run_Differential': [+145, +167, +89, +198, +45],
'Playoff_Berth': [True, True, True, True, False]
})
# Calculate correlation
correlation = team_tto_data[['TTO_Rate', 'Wins']].corr().iloc[0, 1]
return team_tto_data, correlation
def visualize_tto_player_types(high_tto_df, low_tto_df):
"""
Compare offensive profiles of high vs low TTO players
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# AVG comparison
axes[0, 0].hist([high_tto_df['AVG'], low_tto_df['AVG']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[0, 0].set_title('Batting Average Distribution')
axes[0, 0].set_xlabel('AVG')
axes[0, 0].set_ylabel('Count')
axes[0, 0].legend()
# OBP comparison
axes[0, 1].hist([high_tto_df['OBP'], low_tto_df['OBP']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[0, 1].set_title('On-Base Percentage Distribution')
axes[0, 1].set_xlabel('OBP')
axes[0, 1].set_ylabel('Count')
axes[0, 1].legend()
# SLG comparison
axes[1, 0].hist([high_tto_df['SLG'], low_tto_df['SLG']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[1, 0].set_title('Slugging Percentage Distribution')
axes[1, 0].set_xlabel('SLG')
axes[1, 0].set_ylabel('Count')
axes[1, 0].legend()
# Scatter: TTO Rate vs OPS
high_tto_df['OPS'] = high_tto_df['OBP'] + high_tto_df['SLG']
low_tto_df['OPS'] = low_tto_df['OBP'] + low_tto_df['SLG']
axes[1, 1].scatter(high_tto_df['TTO_Rate'], high_tto_df['OPS'],
label='High TTO', alpha=0.6, s=60)
axes[1, 1].scatter(low_tto_df['TTO_Rate'], low_tto_df['OPS'],
label='Low TTO', alpha=0.6, s=60)
axes[1, 1].set_title('TTO Rate vs OPS')
axes[1, 1].set_xlabel('TTO Rate (%)')
axes[1, 1].set_ylabel('OPS')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def calculate_entertainment_value():
"""
Attempt to quantify 'entertainment value' of TTO vs action plays
"""
# Hypothetical engagement metrics
engagement_data = pd.DataFrame({
'Play_Type': ['Home Run', 'Strikeout', 'Walk', 'Single', 'Double',
'Triple', 'Stolen Base', 'Double Play'],
'Avg_Excitement_Score': [9.2, 3.1, 2.4, 6.8, 8.1, 9.5, 7.8, 8.9],
'Frequency_2000': [2.0, 15.8, 8.7, 14.2, 4.8, 0.8, 1.9, 2.1],
'Frequency_2023': [3.0, 22.4, 8.6, 12.1, 4.2, 0.5, 2.3, 1.8]
})
# Calculate weighted excitement score
engagement_data['Weighted_Excitement_2000'] = (
engagement_data['Avg_Excitement_Score'] *
engagement_data['Frequency_2000']
)
engagement_data['Weighted_Excitement_2023'] = (
engagement_data['Avg_Excitement_Score'] *
engagement_data['Frequency_2023']
)
total_2000 = engagement_data['Weighted_Excitement_2000'].sum()
total_2023 = engagement_data['Weighted_Excitement_2023'].sum()
print(f"Weighted Excitement Score Change: {total_2023 - total_2000:.2f}")
return engagement_data
Key Findings: Three True Outcomes
Historical Progression:
- TTO rate has increased from 24.1% (1980) to 34.0% (2023)
- Peak was 34.9% in 2019 before slight decline with rule changes
- Strikeouts alone now account for over 22% of all plate appearances
Player Archetypes:
Extreme TTO Players (2023):
- Kyle Schwarber: 50.2% TTO rate (46 HR, 215 SO, 80 BB)
- Joey Gallo: 52.7% TTO rate (when qualified)
- Mike Trout: 44.3% TTO rate (still elite overall production)
Contact-Oriented Players (2023):
- Luis Arraez: 15.8% TTO rate (won batting title)
- Steven Kwan: 19.2% TTO rate (elite contact skills)
- Jeff McNeil: 18.7% TTO rate
Impact on Game Quality:
- Balls in play decreased from 26.8 per game (2015) to 25.1 (2023)
- Average game action (defensive plays, baserunning) declined until 2023 rule changes
- 2023 rules partially reversed trend, increasing action plays by 3.4%
library(dplyr)
library(ggplot2)
analyze_tto_trends <- function() {
# Historical TTO rates
tto_evolution <- data.frame(
Season = seq(1980, 2023, by = 5),
HR_Rate = c(1.6, 2.0, 2.3, 2.6, 2.5, 2.7, 2.8, 3.0, 3.1, 3.0),
K_Rate = c(13.8, 15.2, 16.4, 16.8, 17.9, 19.8, 21.4, 23.2, 22.7, 22.4),
BB_Rate = c(8.7, 8.4, 9.1, 9.2, 8.7, 8.3, 8.5, 8.7, 8.8, 8.6),
TTO_Rate = c(24.1, 25.6, 27.8, 28.6, 29.1, 30.8, 32.7, 34.9, 34.6, 34.0)
)
return(tto_evolution)
}
plot_tto_evolution <- function(tto_data) {
library(tidyr)
tto_long <- tto_data %>%
pivot_longer(cols = c(HR_Rate, K_Rate, BB_Rate),
names_to = "Outcome",
values_to = "Rate")
ggplot(tto_long, aes(x = Season, y = Rate, color = Outcome)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(
title = "Evolution of Three True Outcomes (1980-2023)",
subtitle = "Percentage of plate appearances ending in HR, K, or BB",
x = "Season",
y = "Rate (%)",
color = "Outcome Type"
) +
theme_minimal() +
scale_color_manual(
values = c("HR_Rate" = "#D50032",
"K_Rate" = "#002D72",
"BB_Rate" = "#FF8C00"),
labels = c("Home Runs", "Strikeouts", "Walks")
)
}
analyze_balls_in_play_decline <- function() {
# Calculate balls in play trends
bip_trends <- data.frame(
Season = 2015:2023,
BIP_Per_Game = c(26.8, 26.2, 25.9, 25.3, 24.7, 24.8, 24.9, 24.4, 25.1),
Action_Plays_Pct = c(42.3, 41.1, 40.2, 39.4, 38.7, 38.9, 39.1, 38.5, 39.8)
) %>%
mutate(
Pct_Change_From_2015 = (BIP_Per_Game / first(BIP_Per_Game) - 1) * 100
)
return(bip_trends)
}
identify_extreme_tto_players <- function(batting_data) {
# Find players with highest TTO rates
extreme_tto <- batting_data %>%
filter(PA >= 400) %>%
mutate(
TTO = HR + SO + BB,
TTO_Rate = TTO / PA * 100
) %>%
arrange(desc(TTO_Rate)) %>%
select(player_name, PA, HR, SO, BB, TTO, TTO_Rate)
return(extreme_tto)
}
import pandas as pd
import numpy as np
from pybaseball import batting_stats, team_batting
import matplotlib.pyplot as plt
import seaborn as sns
def analyze_league_tto_evolution(start_year=2000, end_year=2023):
"""
Analyze league-wide three true outcomes trends
"""
league_stats = []
for year in range(start_year, end_year + 1):
try:
stats = team_batting(year)
# Calculate TTO metrics
total_pa = stats['PA'].sum()
total_hr = stats['HR'].sum()
total_so = stats['SO'].sum()
total_bb = stats['BB'].sum()
league_stats.append({
'Season': year,
'HR_Rate': (total_hr / total_pa) * 100,
'K_Rate': (total_so / total_pa) * 100,
'BB_Rate': (total_bb / total_pa) * 100,
'TTO_Rate': ((total_hr + total_so + total_bb) / total_pa) * 100,
'Non_TTO_Rate': ((total_pa - total_hr - total_so - total_bb) / total_pa) * 100
})
except Exception as e:
print(f"Error processing {year}: {e}")
return pd.DataFrame(league_stats)
def identify_tto_extremes(season=2023, min_pa=400):
"""
Identify players with extreme TTO profiles
"""
stats = batting_stats(season, season, qual=min_pa)
# Calculate TTO rate
stats['TTO'] = stats['HR'] + stats['SO'] + stats['BB']
stats['TTO_Rate'] = (stats['TTO'] / stats['PA']) * 100
stats['Non_TTO_Rate'] = 100 - stats['TTO_Rate']
# Identify extremes
highest_tto = stats.nlargest(20, 'TTO_Rate')[
['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
]
lowest_tto = stats.nsmallest(20, 'TTO_Rate')[
['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
]
return highest_tto, lowest_tto
def analyze_tto_correlation_with_success():
"""
Analyze whether TTO profile correlates with team success
"""
# Example data structure
team_tto_data = pd.DataFrame({
'Team': ['TB', 'LAD', 'HOU', 'ATL', 'NYY'],
'TTO_Rate': [35.2, 34.8, 33.9, 32.1, 36.4],
'Wins': [99, 100, 90, 104, 82],
'Run_Differential': [+145, +167, +89, +198, +45],
'Playoff_Berth': [True, True, True, True, False]
})
# Calculate correlation
correlation = team_tto_data[['TTO_Rate', 'Wins']].corr().iloc[0, 1]
return team_tto_data, correlation
def visualize_tto_player_types(high_tto_df, low_tto_df):
"""
Compare offensive profiles of high vs low TTO players
"""
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# AVG comparison
axes[0, 0].hist([high_tto_df['AVG'], low_tto_df['AVG']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[0, 0].set_title('Batting Average Distribution')
axes[0, 0].set_xlabel('AVG')
axes[0, 0].set_ylabel('Count')
axes[0, 0].legend()
# OBP comparison
axes[0, 1].hist([high_tto_df['OBP'], low_tto_df['OBP']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[0, 1].set_title('On-Base Percentage Distribution')
axes[0, 1].set_xlabel('OBP')
axes[0, 1].set_ylabel('Count')
axes[0, 1].legend()
# SLG comparison
axes[1, 0].hist([high_tto_df['SLG'], low_tto_df['SLG']],
label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
axes[1, 0].set_title('Slugging Percentage Distribution')
axes[1, 0].set_xlabel('SLG')
axes[1, 0].set_ylabel('Count')
axes[1, 0].legend()
# Scatter: TTO Rate vs OPS
high_tto_df['OPS'] = high_tto_df['OBP'] + high_tto_df['SLG']
low_tto_df['OPS'] = low_tto_df['OBP'] + low_tto_df['SLG']
axes[1, 1].scatter(high_tto_df['TTO_Rate'], high_tto_df['OPS'],
label='High TTO', alpha=0.6, s=60)
axes[1, 1].scatter(low_tto_df['TTO_Rate'], low_tto_df['OPS'],
label='Low TTO', alpha=0.6, s=60)
axes[1, 1].set_title('TTO Rate vs OPS')
axes[1, 1].set_xlabel('TTO Rate (%)')
axes[1, 1].set_ylabel('OPS')
axes[1, 1].legend()
axes[1, 1].grid(alpha=0.3)
plt.tight_layout()
return fig
def calculate_entertainment_value():
"""
Attempt to quantify 'entertainment value' of TTO vs action plays
"""
# Hypothetical engagement metrics
engagement_data = pd.DataFrame({
'Play_Type': ['Home Run', 'Strikeout', 'Walk', 'Single', 'Double',
'Triple', 'Stolen Base', 'Double Play'],
'Avg_Excitement_Score': [9.2, 3.1, 2.4, 6.8, 8.1, 9.5, 7.8, 8.9],
'Frequency_2000': [2.0, 15.8, 8.7, 14.2, 4.8, 0.8, 1.9, 2.1],
'Frequency_2023': [3.0, 22.4, 8.6, 12.1, 4.2, 0.5, 2.3, 1.8]
})
# Calculate weighted excitement score
engagement_data['Weighted_Excitement_2000'] = (
engagement_data['Avg_Excitement_Score'] *
engagement_data['Frequency_2000']
)
engagement_data['Weighted_Excitement_2023'] = (
engagement_data['Avg_Excitement_Score'] *
engagement_data['Frequency_2023']
)
total_2000 = engagement_data['Weighted_Excitement_2000'].sum()
total_2023 = engagement_data['Weighted_Excitement_2023'].sum()
print(f"Weighted Excitement Score Change: {total_2023 - total_2000:.2f}")
return engagement_data
Statistical Framework for Rule Change Analysis
When MLB implements rule changes, analyzing their impact requires careful statistical methodology to separate rule effects from natural variation, player development, and other confounding factors.
library(dplyr)
library(ggplot2)
library(broom)
# Difference-in-differences analysis framework
did_analysis_framework <- function(pre_data, post_data, treatment_group, control_group) {
# Combine data
combined_data <- bind_rows(
pre_data %>% mutate(period = "pre", treated = group %in% treatment_group),
post_data %>% mutate(period = "post", treated = group %in% treatment_group)
)
# Run difference-in-differences regression
did_model <- lm(outcome ~ treated * period, data = combined_data)
# Extract results
results <- tidy(did_model) %>%
filter(term == "treatedTRUE:periodpost")
return(list(model = did_model, did_estimate = results))
}
# Analyze shift ban impact with statistical rigor
analyze_shift_ban_statistical <- function() {
# Simulate player-level data for demonstration
set.seed(123)
# Pre-shift ban (2022) - left-handed pull hitters
pre_ban <- data.frame(
player_id = 1:100,
player_type = "LH_Pull",
babip = rnorm(100, mean = 0.260, sd = 0.040),
shift_faced_pct = runif(100, 60, 85),
season = 2022
)
# Post-shift ban (2023) - same players
post_ban <- data.frame(
player_id = 1:100,
player_type = "LH_Pull",
babip = rnorm(100, mean = 0.285, sd = 0.040),
shift_faced_pct = 0, # No shifts allowed
season = 2023
)
# Control group (right-handed hitters less affected by shifts)
control_pre <- data.frame(
player_id = 101:200,
player_type = "RH",
babip = rnorm(100, mean = 0.290, sd = 0.040),
shift_faced_pct = runif(100, 10, 25),
season = 2022
)
control_post <- data.frame(
player_id = 101:200,
player_type = "RH",
babip = rnorm(100, mean = 0.295, sd = 0.040),
shift_faced_pct = 0,
season = 2023
)
# Combine all data
full_data <- bind_rows(pre_ban, post_ban, control_pre, control_post) %>%
mutate(
post_ban = season == 2023,
treatment = player_type == "LH_Pull"
)
# Run DiD model
did_model <- lm(babip ~ treatment * post_ban, data = full_data)
return(list(data = full_data, model = did_model, summary = summary(did_model)))
}
# Regression discontinuity design for pitch clock
analyze_pitch_clock_rdd <- function(game_data) {
# Create discontinuity at 2023 season start
rdd_data <- game_data %>%
mutate(
days_from_cutoff = as.numeric(game_date - as.Date("2023-03-30")),
post_clock = days_from_cutoff > 0
) %>%
filter(abs(days_from_cutoff) <= 30) # 30-day window around implementation
# Run RDD model
rdd_model <- lm(game_length_minutes ~ days_from_cutoff * post_clock,
data = rdd_data)
# Estimate discontinuity (jump) at cutoff
discontinuity <- coef(rdd_model)["post_clockTRUE"]
return(list(model = rdd_model, effect = discontinuity))
}
# Time series analysis for rule changes
analyze_time_series_impact <- function(metric_data) {
library(forecast)
# Split into pre and post periods
pre_period <- metric_data %>% filter(season < 2023)
post_period <- metric_data %>% filter(season >= 2023)
# Fit ARIMA model on pre-period
ts_data <- ts(pre_period$value, start = min(pre_period$season))
arima_model <- auto.arima(ts_data)
# Forecast what would have happened without intervention
forecast_result <- forecast(arima_model, h = nrow(post_period))
# Compare actual vs forecast
comparison <- data.frame(
season = post_period$season,
actual = post_period$value,
predicted = as.numeric(forecast_result$mean),
lower_95 = as.numeric(forecast_result$lower[,2]),
upper_95 = as.numeric(forecast_result$upper[,2])
) %>%
mutate(
impact = actual - predicted,
significant = actual < lower_95 | actual > upper_95
)
return(list(model = arima_model, forecast = forecast_result, comparison = comparison))
}
# Visualization of rule change impact
plot_rule_change_impact <- function(comparison_data, rule_name) {
ggplot(comparison_data, aes(x = season)) +
geom_ribbon(aes(ymin = lower_95, ymax = upper_95),
alpha = 0.2, fill = "blue") +
geom_line(aes(y = predicted, color = "Predicted"),
linewidth = 1, linetype = "dashed") +
geom_line(aes(y = actual, color = "Actual"),
linewidth = 1.2) +
geom_point(aes(y = actual, color = "Actual"),
size = 3) +
geom_vline(xintercept = min(comparison_data$season) - 0.5,
linetype = "dashed", color = "red") +
labs(
title = paste("Impact of", rule_name),
subtitle = "Actual values vs forecasted counterfactual",
x = "Season",
y = "Metric Value",
color = "Series"
) +
theme_minimal() +
scale_color_manual(values = c("Actual" = "#D50032", "Predicted" = "#002D72"))
}
Comprehensive Rule Change Analysis in Python
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt
import seaborn as sns
def difference_in_differences_analysis(data, outcome_var, treatment_var,
time_var, player_var):
"""
Implement difference-in-differences estimation for rule changes
"""
# Create interaction term
data['treatment_x_time'] = data[treatment_var] * data[time_var]
# Prepare regression variables
X = data[[treatment_var, time_var, 'treatment_x_time']]
X = sm.add_constant(X)
y = data[outcome_var]
# Run OLS regression
model = sm.OLS(y, X).fit()
# The coefficient on treatment_x_time is the DiD estimate
did_estimate = model.params['treatment_x_time']
did_se = model.bse['treatment_x_time']
did_pvalue = model.pvalues['treatment_x_time']
results = {
'estimate': did_estimate,
'std_error': did_se,
'p_value': did_pvalue,
'confidence_interval': model.conf_int().loc['treatment_x_time'].values,
'model_summary': model.summary()
}
return results
def parallel_trends_test(data, outcome_var, treatment_var, pre_periods):
"""
Test parallel trends assumption for DiD
"""
# Filter to pre-treatment periods only
pre_data = data[data['period'].isin(pre_periods)]
# Create time trend interaction
pre_data['time_trend'] = pre_data['period'].astype(int)
pre_data['treatment_x_trend'] = (pre_data[treatment_var] *
pre_data['time_trend'])
# Regression: outcome ~ treatment + time_trend + treatment*time_trend
X = pre_data[[treatment_var, 'time_trend', 'treatment_x_trend']]
X = sm.add_constant(X)
y = pre_data[outcome_var]
model = sm.OLS(y, X).fit()
# If treatment*trend coefficient is insignificant, parallel trends holds
trend_coef = model.params['treatment_x_trend']
trend_pvalue = model.pvalues['treatment_x_trend']
parallel_trends_holds = trend_pvalue > 0.05
return {
'parallel_trends_satisfied': parallel_trends_holds,
'trend_coefficient': trend_coef,
'p_value': trend_pvalue
}
def synthetic_control_analysis(treated_unit, control_pool, pre_periods, post_periods):
"""
Implement synthetic control method for rule change analysis
"""
from sklearn.linear_model import LinearRegression
# Get pre-treatment outcomes
treated_pre = treated_unit[treated_unit['period'].isin(pre_periods)]['outcome']
control_pre = control_pool[control_pool['period'].isin(pre_periods)].pivot(
index='period', columns='unit', values='outcome'
)
# Find optimal weights to match treated unit in pre-period
model = LinearRegression(fit_intercept=False, positive=True)
model.fit(control_pre.values, treated_pre.values)
# Normalize weights to sum to 1
weights = model.coef_ / model.coef_.sum()
# Create synthetic control
control_all = control_pool.pivot(
index='period', columns='unit', values='outcome'
)
synthetic = (control_all * weights).sum(axis=1)
# Calculate treatment effect in post period
treated_all = treated_unit.set_index('period')['outcome']
effects = []
for period in post_periods:
effect = treated_all[period] - synthetic[period]
effects.append({
'period': period,
'treated': treated_all[period],
'synthetic': synthetic[period],
'effect': effect
})
return pd.DataFrame(effects), weights
def event_study_analysis(data, outcome_var, event_time_var, max_leads=3, max_lags=3):
"""
Event study regression to estimate dynamic treatment effects
"""
# Create dummy variables for each event time
dummies = pd.get_dummies(data[event_time_var], prefix='event')
# Exclude t=-1 as reference period (or t=0 if no pre-period)
reference_col = 'event_-1' if 'event_-1' in dummies.columns else 'event_0'
if reference_col in dummies.columns:
dummies = dummies.drop(columns=[reference_col])
# Run regression
X = sm.add_constant(dummies)
y = data[outcome_var]
model = sm.OLS(y, X).fit()
# Extract coefficients for plotting
coefficients = []
for col in dummies.columns:
event_time = int(col.split('_')[1])
coefficients.append({
'event_time': event_time,
'coefficient': model.params[col],
'std_error': model.bse[col],
'conf_low': model.conf_int().loc[col, 0],
'conf_high': model.conf_int().loc[col, 1]
})
coef_df = pd.DataFrame(coefficients).sort_values('event_time')
return coef_df, model
def visualize_event_study(coef_df, rule_name):
"""
Create event study plot
"""
fig, ax = plt.subplots(figsize=(12, 6))
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1, alpha=0.5)
ax.errorbar(coef_df['event_time'], coef_df['coefficient'],
yerr=1.96 * coef_df['std_error'],
fmt='o', markersize=8, capsize=5, capthick=2,
color='steelblue', ecolor='gray', elinewidth=2)
ax.set_xlabel('Periods Relative to Rule Change', fontsize=12)
ax.set_ylabel('Estimated Effect', fontsize=12)
ax.set_title(f'Event Study: {rule_name}', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)
# Add shading for pre/post periods
pre_periods = coef_df[coef_df['event_time'] < 0]['event_time']
if len(pre_periods) > 0:
ax.axvspan(pre_periods.min() - 0.5, -0.5, alpha=0.1, color='gray',
label='Pre-treatment')
post_periods = coef_df[coef_df['event_time'] > 0]['event_time']
if len(post_periods) > 0:
ax.axvspan(0.5, post_periods.max() + 0.5, alpha=0.1, color='lightblue',
label='Post-treatment')
ax.legend()
plt.tight_layout()
return fig
def calculate_statistical_power(effect_size, sample_size, alpha=0.05):
"""
Calculate statistical power for detecting rule change effects
"""
from scipy.stats import norm
# Calculate non-centrality parameter
ncp = effect_size * np.sqrt(sample_size)
# Critical value for two-tailed test
z_critical = norm.ppf(1 - alpha/2)
# Power = P(reject H0 | H1 is true)
power = 1 - norm.cdf(z_critical - ncp) + norm.cdf(-z_critical - ncp)
return power
def simulate_rule_change_impact(true_effect, sample_size, num_simulations=1000):
"""
Monte Carlo simulation to assess rule change detection
"""
np.random.seed(42)
detected = 0
estimates = []
for _ in range(num_simulations):
# Simulate control group (no change)
control_pre = np.random.normal(0, 1, sample_size)
control_post = np.random.normal(0, 1, sample_size)
# Simulate treatment group (with effect)
treatment_pre = np.random.normal(0, 1, sample_size)
treatment_post = np.random.normal(true_effect, 1, sample_size)
# Calculate DiD estimate
did_est = ((treatment_post.mean() - treatment_pre.mean()) -
(control_post.mean() - control_pre.mean()))
estimates.append(did_est)
# Test for significance
# Simplified t-test
se = np.sqrt(1/sample_size + 1/sample_size + 1/sample_size + 1/sample_size)
t_stat = did_est / se
if abs(t_stat) > 1.96:
detected += 1
power = detected / num_simulations
return {
'power': power,
'mean_estimate': np.mean(estimates),
'std_estimate': np.std(estimates),
'estimates': estimates
}
Practical Example: Complete Rule Change Analysis
# Complete workflow for analyzing shift ban impact
def complete_shift_ban_analysis():
"""
Comprehensive statistical analysis of shift ban impact
"""
# Step 1: Prepare data
np.random.seed(42)
# Create synthetic panel data
players = 200
seasons = [2021, 2022, 2023]
data_list = []
for player in range(players):
is_lh_pull = player < 100 # First 100 are LH pull hitters
for season in seasons:
# Base BABIP
base_babip = 0.290 if not is_lh_pull else 0.260
# Shift effect (only pre-2023)
if season < 2023 and is_lh_pull:
shift_penalty = -0.025
else:
shift_penalty = 0
# Generate outcome with noise
babip = base_babip + shift_penalty + np.random.normal(0, 0.030)
data_list.append({
'player_id': player,
'season': season,
'lh_pull_hitter': is_lh_pull,
'post_ban': season >= 2023,
'babip': babip
})
df = pd.DataFrame(data_list)
# Step 2: Difference-in-differences
df['treatment_x_post'] = df['lh_pull_hitter'] * df['post_ban']
X = df[['lh_pull_hitter', 'post_ban', 'treatment_x_post']]
X = sm.add_constant(X)
y = df['babip']
did_model = sm.OLS(y, X).fit()
print("=" * 60)
print("SHIFT BAN IMPACT ANALYSIS")
print("=" * 60)
print("\nDifference-in-Differences Results:")
print(f"Treatment Effect: {did_model.params['treatment_x_post']:.4f}")
print(f"Standard Error: {did_model.bse['treatment_x_post']:.4f}")
print(f"P-value: {did_model.pvalues['treatment_x_post']:.4f}")
print(f"95% CI: [{did_model.conf_int().loc['treatment_x_post', 0]:.4f}, "
f"{did_model.conf_int().loc['treatment_x_post', 1]:.4f}]")
# Step 3: Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Panel A: Trends
trends = df.groupby(['season', 'lh_pull_hitter'])['babip'].mean().reset_index()
for is_lh in [False, True]:
subset = trends[trends['lh_pull_hitter'] == is_lh]
label = 'LH Pull Hitters' if is_lh else 'Other Hitters'
axes[0].plot(subset['season'], subset['babip'], marker='o',
label=label, linewidth=2, markersize=8)
axes[0].axvline(x=2022.5, color='red', linestyle='--',
label='Shift Ban', linewidth=2)
axes[0].set_xlabel('Season')
axes[0].set_ylabel('BABIP')
axes[0].set_title('BABIP Trends by Player Type')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Panel B: Distribution comparison
pre_treatment = df[(df['lh_pull_hitter']) & (df['season'] < 2023)]['babip']
post_treatment = df[(df['lh_pull_hitter']) & (df['season'] >= 2023)]['babip']
axes[1].hist(pre_treatment, alpha=0.5, bins=30,
label='Pre-Ban (2021-2022)', color='lightcoral', density=True)
axes[1].hist(post_treatment, alpha=0.5, bins=30,
label='Post-Ban (2023)', color='lightgreen', density=True)
axes[1].axvline(pre_treatment.mean(), color='red', linestyle='--', linewidth=2)
axes[1].axvline(post_treatment.mean(), color='green', linestyle='--', linewidth=2)
axes[1].set_xlabel('BABIP')
axes[1].set_ylabel('Density')
axes[1].set_title('BABIP Distribution: LH Pull Hitters')
axes[1].legend()
plt.tight_layout()
return df, did_model, fig
# Run the complete analysis
data, model, fig = complete_shift_ban_analysis()
plt.show()
library(dplyr)
library(ggplot2)
library(broom)
# Difference-in-differences analysis framework
did_analysis_framework <- function(pre_data, post_data, treatment_group, control_group) {
# Combine data
combined_data <- bind_rows(
pre_data %>% mutate(period = "pre", treated = group %in% treatment_group),
post_data %>% mutate(period = "post", treated = group %in% treatment_group)
)
# Run difference-in-differences regression
did_model <- lm(outcome ~ treated * period, data = combined_data)
# Extract results
results <- tidy(did_model) %>%
filter(term == "treatedTRUE:periodpost")
return(list(model = did_model, did_estimate = results))
}
# Analyze shift ban impact with statistical rigor
analyze_shift_ban_statistical <- function() {
# Simulate player-level data for demonstration
set.seed(123)
# Pre-shift ban (2022) - left-handed pull hitters
pre_ban <- data.frame(
player_id = 1:100,
player_type = "LH_Pull",
babip = rnorm(100, mean = 0.260, sd = 0.040),
shift_faced_pct = runif(100, 60, 85),
season = 2022
)
# Post-shift ban (2023) - same players
post_ban <- data.frame(
player_id = 1:100,
player_type = "LH_Pull",
babip = rnorm(100, mean = 0.285, sd = 0.040),
shift_faced_pct = 0, # No shifts allowed
season = 2023
)
# Control group (right-handed hitters less affected by shifts)
control_pre <- data.frame(
player_id = 101:200,
player_type = "RH",
babip = rnorm(100, mean = 0.290, sd = 0.040),
shift_faced_pct = runif(100, 10, 25),
season = 2022
)
control_post <- data.frame(
player_id = 101:200,
player_type = "RH",
babip = rnorm(100, mean = 0.295, sd = 0.040),
shift_faced_pct = 0,
season = 2023
)
# Combine all data
full_data <- bind_rows(pre_ban, post_ban, control_pre, control_post) %>%
mutate(
post_ban = season == 2023,
treatment = player_type == "LH_Pull"
)
# Run DiD model
did_model <- lm(babip ~ treatment * post_ban, data = full_data)
return(list(data = full_data, model = did_model, summary = summary(did_model)))
}
# Regression discontinuity design for pitch clock
analyze_pitch_clock_rdd <- function(game_data) {
# Create discontinuity at 2023 season start
rdd_data <- game_data %>%
mutate(
days_from_cutoff = as.numeric(game_date - as.Date("2023-03-30")),
post_clock = days_from_cutoff > 0
) %>%
filter(abs(days_from_cutoff) <= 30) # 30-day window around implementation
# Run RDD model
rdd_model <- lm(game_length_minutes ~ days_from_cutoff * post_clock,
data = rdd_data)
# Estimate discontinuity (jump) at cutoff
discontinuity <- coef(rdd_model)["post_clockTRUE"]
return(list(model = rdd_model, effect = discontinuity))
}
# Time series analysis for rule changes
analyze_time_series_impact <- function(metric_data) {
library(forecast)
# Split into pre and post periods
pre_period <- metric_data %>% filter(season < 2023)
post_period <- metric_data %>% filter(season >= 2023)
# Fit ARIMA model on pre-period
ts_data <- ts(pre_period$value, start = min(pre_period$season))
arima_model <- auto.arima(ts_data)
# Forecast what would have happened without intervention
forecast_result <- forecast(arima_model, h = nrow(post_period))
# Compare actual vs forecast
comparison <- data.frame(
season = post_period$season,
actual = post_period$value,
predicted = as.numeric(forecast_result$mean),
lower_95 = as.numeric(forecast_result$lower[,2]),
upper_95 = as.numeric(forecast_result$upper[,2])
) %>%
mutate(
impact = actual - predicted,
significant = actual < lower_95 | actual > upper_95
)
return(list(model = arima_model, forecast = forecast_result, comparison = comparison))
}
# Visualization of rule change impact
plot_rule_change_impact <- function(comparison_data, rule_name) {
ggplot(comparison_data, aes(x = season)) +
geom_ribbon(aes(ymin = lower_95, ymax = upper_95),
alpha = 0.2, fill = "blue") +
geom_line(aes(y = predicted, color = "Predicted"),
linewidth = 1, linetype = "dashed") +
geom_line(aes(y = actual, color = "Actual"),
linewidth = 1.2) +
geom_point(aes(y = actual, color = "Actual"),
size = 3) +
geom_vline(xintercept = min(comparison_data$season) - 0.5,
linetype = "dashed", color = "red") +
labs(
title = paste("Impact of", rule_name),
subtitle = "Actual values vs forecasted counterfactual",
x = "Season",
y = "Metric Value",
color = "Series"
) +
theme_minimal() +
scale_color_manual(values = c("Actual" = "#D50032", "Predicted" = "#002D72"))
}
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt
import seaborn as sns
def difference_in_differences_analysis(data, outcome_var, treatment_var,
time_var, player_var):
"""
Implement difference-in-differences estimation for rule changes
"""
# Create interaction term
data['treatment_x_time'] = data[treatment_var] * data[time_var]
# Prepare regression variables
X = data[[treatment_var, time_var, 'treatment_x_time']]
X = sm.add_constant(X)
y = data[outcome_var]
# Run OLS regression
model = sm.OLS(y, X).fit()
# The coefficient on treatment_x_time is the DiD estimate
did_estimate = model.params['treatment_x_time']
did_se = model.bse['treatment_x_time']
did_pvalue = model.pvalues['treatment_x_time']
results = {
'estimate': did_estimate,
'std_error': did_se,
'p_value': did_pvalue,
'confidence_interval': model.conf_int().loc['treatment_x_time'].values,
'model_summary': model.summary()
}
return results
def parallel_trends_test(data, outcome_var, treatment_var, pre_periods):
"""
Test parallel trends assumption for DiD
"""
# Filter to pre-treatment periods only
pre_data = data[data['period'].isin(pre_periods)]
# Create time trend interaction
pre_data['time_trend'] = pre_data['period'].astype(int)
pre_data['treatment_x_trend'] = (pre_data[treatment_var] *
pre_data['time_trend'])
# Regression: outcome ~ treatment + time_trend + treatment*time_trend
X = pre_data[[treatment_var, 'time_trend', 'treatment_x_trend']]
X = sm.add_constant(X)
y = pre_data[outcome_var]
model = sm.OLS(y, X).fit()
# If treatment*trend coefficient is insignificant, parallel trends holds
trend_coef = model.params['treatment_x_trend']
trend_pvalue = model.pvalues['treatment_x_trend']
parallel_trends_holds = trend_pvalue > 0.05
return {
'parallel_trends_satisfied': parallel_trends_holds,
'trend_coefficient': trend_coef,
'p_value': trend_pvalue
}
def synthetic_control_analysis(treated_unit, control_pool, pre_periods, post_periods):
"""
Implement synthetic control method for rule change analysis
"""
from sklearn.linear_model import LinearRegression
# Get pre-treatment outcomes
treated_pre = treated_unit[treated_unit['period'].isin(pre_periods)]['outcome']
control_pre = control_pool[control_pool['period'].isin(pre_periods)].pivot(
index='period', columns='unit', values='outcome'
)
# Find optimal weights to match treated unit in pre-period
model = LinearRegression(fit_intercept=False, positive=True)
model.fit(control_pre.values, treated_pre.values)
# Normalize weights to sum to 1
weights = model.coef_ / model.coef_.sum()
# Create synthetic control
control_all = control_pool.pivot(
index='period', columns='unit', values='outcome'
)
synthetic = (control_all * weights).sum(axis=1)
# Calculate treatment effect in post period
treated_all = treated_unit.set_index('period')['outcome']
effects = []
for period in post_periods:
effect = treated_all[period] - synthetic[period]
effects.append({
'period': period,
'treated': treated_all[period],
'synthetic': synthetic[period],
'effect': effect
})
return pd.DataFrame(effects), weights
def event_study_analysis(data, outcome_var, event_time_var, max_leads=3, max_lags=3):
"""
Event study regression to estimate dynamic treatment effects
"""
# Create dummy variables for each event time
dummies = pd.get_dummies(data[event_time_var], prefix='event')
# Exclude t=-1 as reference period (or t=0 if no pre-period)
reference_col = 'event_-1' if 'event_-1' in dummies.columns else 'event_0'
if reference_col in dummies.columns:
dummies = dummies.drop(columns=[reference_col])
# Run regression
X = sm.add_constant(dummies)
y = data[outcome_var]
model = sm.OLS(y, X).fit()
# Extract coefficients for plotting
coefficients = []
for col in dummies.columns:
event_time = int(col.split('_')[1])
coefficients.append({
'event_time': event_time,
'coefficient': model.params[col],
'std_error': model.bse[col],
'conf_low': model.conf_int().loc[col, 0],
'conf_high': model.conf_int().loc[col, 1]
})
coef_df = pd.DataFrame(coefficients).sort_values('event_time')
return coef_df, model
def visualize_event_study(coef_df, rule_name):
"""
Create event study plot
"""
fig, ax = plt.subplots(figsize=(12, 6))
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.axvline(x=0, color='red', linestyle='--', linewidth=1, alpha=0.5)
ax.errorbar(coef_df['event_time'], coef_df['coefficient'],
yerr=1.96 * coef_df['std_error'],
fmt='o', markersize=8, capsize=5, capthick=2,
color='steelblue', ecolor='gray', elinewidth=2)
ax.set_xlabel('Periods Relative to Rule Change', fontsize=12)
ax.set_ylabel('Estimated Effect', fontsize=12)
ax.set_title(f'Event Study: {rule_name}', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)
# Add shading for pre/post periods
pre_periods = coef_df[coef_df['event_time'] < 0]['event_time']
if len(pre_periods) > 0:
ax.axvspan(pre_periods.min() - 0.5, -0.5, alpha=0.1, color='gray',
label='Pre-treatment')
post_periods = coef_df[coef_df['event_time'] > 0]['event_time']
if len(post_periods) > 0:
ax.axvspan(0.5, post_periods.max() + 0.5, alpha=0.1, color='lightblue',
label='Post-treatment')
ax.legend()
plt.tight_layout()
return fig
def calculate_statistical_power(effect_size, sample_size, alpha=0.05):
"""
Calculate statistical power for detecting rule change effects
"""
from scipy.stats import norm
# Calculate non-centrality parameter
ncp = effect_size * np.sqrt(sample_size)
# Critical value for two-tailed test
z_critical = norm.ppf(1 - alpha/2)
# Power = P(reject H0 | H1 is true)
power = 1 - norm.cdf(z_critical - ncp) + norm.cdf(-z_critical - ncp)
return power
def simulate_rule_change_impact(true_effect, sample_size, num_simulations=1000):
"""
Monte Carlo simulation to assess rule change detection
"""
np.random.seed(42)
detected = 0
estimates = []
for _ in range(num_simulations):
# Simulate control group (no change)
control_pre = np.random.normal(0, 1, sample_size)
control_post = np.random.normal(0, 1, sample_size)
# Simulate treatment group (with effect)
treatment_pre = np.random.normal(0, 1, sample_size)
treatment_post = np.random.normal(true_effect, 1, sample_size)
# Calculate DiD estimate
did_est = ((treatment_post.mean() - treatment_pre.mean()) -
(control_post.mean() - control_pre.mean()))
estimates.append(did_est)
# Test for significance
# Simplified t-test
se = np.sqrt(1/sample_size + 1/sample_size + 1/sample_size + 1/sample_size)
t_stat = did_est / se
if abs(t_stat) > 1.96:
detected += 1
power = detected / num_simulations
return {
'power': power,
'mean_estimate': np.mean(estimates),
'std_estimate': np.std(estimates),
'estimates': estimates
}
# Complete workflow for analyzing shift ban impact
def complete_shift_ban_analysis():
"""
Comprehensive statistical analysis of shift ban impact
"""
# Step 1: Prepare data
np.random.seed(42)
# Create synthetic panel data
players = 200
seasons = [2021, 2022, 2023]
data_list = []
for player in range(players):
is_lh_pull = player < 100 # First 100 are LH pull hitters
for season in seasons:
# Base BABIP
base_babip = 0.290 if not is_lh_pull else 0.260
# Shift effect (only pre-2023)
if season < 2023 and is_lh_pull:
shift_penalty = -0.025
else:
shift_penalty = 0
# Generate outcome with noise
babip = base_babip + shift_penalty + np.random.normal(0, 0.030)
data_list.append({
'player_id': player,
'season': season,
'lh_pull_hitter': is_lh_pull,
'post_ban': season >= 2023,
'babip': babip
})
df = pd.DataFrame(data_list)
# Step 2: Difference-in-differences
df['treatment_x_post'] = df['lh_pull_hitter'] * df['post_ban']
X = df[['lh_pull_hitter', 'post_ban', 'treatment_x_post']]
X = sm.add_constant(X)
y = df['babip']
did_model = sm.OLS(y, X).fit()
print("=" * 60)
print("SHIFT BAN IMPACT ANALYSIS")
print("=" * 60)
print("\nDifference-in-Differences Results:")
print(f"Treatment Effect: {did_model.params['treatment_x_post']:.4f}")
print(f"Standard Error: {did_model.bse['treatment_x_post']:.4f}")
print(f"P-value: {did_model.pvalues['treatment_x_post']:.4f}")
print(f"95% CI: [{did_model.conf_int().loc['treatment_x_post', 0]:.4f}, "
f"{did_model.conf_int().loc['treatment_x_post', 1]:.4f}]")
# Step 3: Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Panel A: Trends
trends = df.groupby(['season', 'lh_pull_hitter'])['babip'].mean().reset_index()
for is_lh in [False, True]:
subset = trends[trends['lh_pull_hitter'] == is_lh]
label = 'LH Pull Hitters' if is_lh else 'Other Hitters'
axes[0].plot(subset['season'], subset['babip'], marker='o',
label=label, linewidth=2, markersize=8)
axes[0].axvline(x=2022.5, color='red', linestyle='--',
label='Shift Ban', linewidth=2)
axes[0].set_xlabel('Season')
axes[0].set_ylabel('BABIP')
axes[0].set_title('BABIP Trends by Player Type')
axes[0].legend()
axes[0].grid(alpha=0.3)
# Panel B: Distribution comparison
pre_treatment = df[(df['lh_pull_hitter']) & (df['season'] < 2023)]['babip']
post_treatment = df[(df['lh_pull_hitter']) & (df['season'] >= 2023)]['babip']
axes[1].hist(pre_treatment, alpha=0.5, bins=30,
label='Pre-Ban (2021-2022)', color='lightcoral', density=True)
axes[1].hist(post_treatment, alpha=0.5, bins=30,
label='Post-Ban (2023)', color='lightgreen', density=True)
axes[1].axvline(pre_treatment.mean(), color='red', linestyle='--', linewidth=2)
axes[1].axvline(post_treatment.mean(), color='green', linestyle='--', linewidth=2)
axes[1].set_xlabel('BABIP')
axes[1].set_ylabel('Density')
axes[1].set_title('BABIP Distribution: LH Pull Hitters')
axes[1].legend()
plt.tight_layout()
return df, did_model, fig
# Run the complete analysis
data, model, fig = complete_shift_ban_analysis()
plt.show()
Easy Exercises
Exercise 27.1 (Easy)
Calculate the percentage increase in stolen base attempts from 2022 to 2023. The 2022 season had 2,487 stolen bases across 2,430 games, while 2023 had 3,503 stolen bases across 2,430 games.
Exercise 27.2 (Easy)
Using the baseballr or pybaseball package, retrieve pitch velocity data for a single pitcher across the 2023 season. Calculate their average fastball velocity and identify their fastest pitch of the season.
# Starter code for R
library(baseballr)
# Replace with actual pitcher ID
pitcher_id <- 592789 # Jacob deGrom example
# Your code here
# Starter code for Python
from pybaseball import statcast_pitcher
# Your code here
Exercise 27.3 (Easy)
Calculate the reduction in base distance (in inches) caused by the increase from 15-inch to 18-inch bases. What percentage reduction does this represent from the original 90-foot (1,080-inch) base path?
Medium Exercises
Exercise 27.4 (Medium)
Analyze the times-through-order penalty for a specific starting pitcher. Using Statcast data, calculate their batting average against, slugging percentage, and exit velocity for the 1st, 2nd, and 3rd time through the batting order.
# Starter code
library(baseballr)
library(dplyr)
analyze_tto_penalty <- function(pitcher_id, season = 2023) {
# Get pitcher's data
# Group by times_through_order
# Calculate metrics
# Return summary
}
Exercise 27.5 (Medium)
Compare left-handed pull hitters' performance in 2022 vs 2023 to quantify shift ban impact. Calculate:
- Average BABIP in 2022 vs 2023
- Average batting average on ground balls
- Statistical significance of the difference (t-test)
Exercise 27.6 (Medium)
Create a visualization showing the evolution of game times from 2019-2023. Include vertical line for pitch clock implementation and annotate the average reduction in minutes.
Hard Exercises
Exercise 27.7 (Hard)
Implement a difference-in-differences analysis to estimate the causal impact of the shift ban on left-handed pull hitters. Use right-handed hitters as a control group. Your analysis should include:
- Parallel trends test for pre-treatment periods
- DiD regression model
- Robustness checks (placebo tests)
- Visualization of results
# Starter code
def shift_ban_did_analysis(data):
"""
Implement DiD analysis for shift ban
Parameters:
-----------
data : DataFrame with columns [player_id, season, babip, is_lh_pull]
Returns:
--------
Dictionary with DiD estimate, standard error, p-value, and plots
"""
# Your code here
pass
Exercise 27.8 (Hard)
Build a predictive model to estimate the impact of the pitch clock on stolen base success rate. Your model should:
- Incorporate pitcher delivery time to home plate
- Account for runner sprint speed
- Include lead distance from base
- Control for pitcher pickoff move quality
- Validate on 2023 data
# Starter code
library(randomForest)
build_sb_success_model <- function(training_data) {
# Feature engineering
# Train model
# Evaluate performance
# Return model and metrics
}
Exercise 27.9 (Hard)
Conduct an event study analysis of the pitch clock implementation on pitcher performance. Create event-time dummies for each month relative to implementation and estimate dynamic treatment effects. Plot coefficients with 95% confidence intervals to show how effects evolved over the first season.
Exercise 27.10 (Hard)
Design and implement a simulation study to determine the optimal pitcher usage strategy given the times-through-order penalty. Compare:
- Traditional starter (6-7 IP)
- Opener strategy (1-4-2 configuration)
- Pure bullpen game (9 relievers)
Your simulation should account for:
- TTO penalty magnitude
- Bullpen fatigue effects
- Reliever availability constraints
- Win probability maximization
Include visualization of expected win rates and season-long bullpen usage patterns for each strategy.
Summary
This chapter has explored the most significant trends and rule changes shaping modern Major League Baseball. The velocity revolution continues to push the boundaries of human performance, while also raising concerns about pitcher health and sustainability. The 2023 rule changes - shift ban, pitch clock, and larger bases - represented the most comprehensive modifications to gameplay in generations, with measurable impacts on offensive production, game pace, and strategic approaches.
The rise of Three True Outcomes has fundamentally altered the aesthetic and strategic nature of baseball, though recent rule changes have begun to restore some balance between power and action. The opener strategy and bullpen specialization reflect teams' increasingly sophisticated understanding of performance optimization, even as they challenge traditional conceptions of pitcher roles.
As analysts, understanding these trends requires rigorous statistical methodology, careful consideration of confounding factors, and appropriate causal inference techniques. The frameworks presented in this chapter - difference-in-differences, regression discontinuity, event studies, and synthetic controls - provide the tools necessary to separate true causal effects from spurious correlations.
The future of baseball analytics will continue to grapple with these tensions: power vs. contact, specialization vs. versatility, tradition vs. optimization, and entertainment value vs. competitive advantage.
# Starter code for R
library(baseballr)
# Replace with actual pitcher ID
pitcher_id <- 592789 # Jacob deGrom example
# Your code here
# Starter code
library(baseballr)
library(dplyr)
analyze_tto_penalty <- function(pitcher_id, season = 2023) {
# Get pitcher's data
# Group by times_through_order
# Calculate metrics
# Return summary
}
# Starter code
library(randomForest)
build_sb_success_model <- function(training_data) {
# Feature engineering
# Train model
# Evaluate performance
# Return model and metrics
}
# Starter code for Python
from pybaseball import statcast_pitcher
# Your code here
# Starter code
def shift_ban_did_analysis(data):
"""
Implement DiD analysis for shift ban
Parameters:
-----------
data : DataFrame with columns [player_id, season, babip, is_lh_pull]
Returns:
--------
Dictionary with DiD estimate, standard error, p-value, and plots
"""
# Your code here
pass