The Need for Park Adjustment
Park factors quantify how much a ballpark increases or decreases various outcomes compared to a neutral environment. A park factor of 100 is neutral, above 100 favors a particular outcome (e.g., hitting), and below 100 suppresses it.
Why Park Factors Matter:
- Player Evaluation: A .280 hitter in Colorado might be equivalent to a .260 hitter elsewhere
- Projections: Future performance depends partly on home ballpark
- Contract Negotiations: Teams discount stats accumulated in extreme parks
- Strategic Decisions: Roster construction should account for home environment
- Historical Comparison: Comparing players across eras requires park context
Calculating Basic Park Factors
The fundamental park factor formula compares outcomes at home versus road:
Park Factor = 100 × (Home Rate / Road Rate)
However, this naive approach has issues. A better method accounts for the fact that the team plays in both environments:
Park Factor = 100 × ((Home Games) / (Road Games)) / (((Home Games + Road Games) / 2) / Road Games)
Simplified:
Park Factor = 100 × (Home Rate × Road Rate) / ((Home Rate + Road Rate) / 2)
R Implementation: Multi-Year Park Factors
# R: Calculate Park Factors
library(tidyverse)
library(Lahman)
# Calculate park factors for runs scored (2019-2023)
calculate_park_factors <- function(start_year = 2019, end_year = 2023) {
# Get team game logs
team_stats <- Teams %>%
filter(yearID >= start_year, yearID <= end_year) %>%
select(yearID, teamID, G, R, RA, HR, H, BB, SO) %>%
mutate(
home_games = G / 2, # Approximation
road_games = G / 2
)
# Get park-specific data
# For demonstration, we'll calculate from team splits
# In practice, you'd use game-by-game data
park_factors <- team_stats %>%
group_by(teamID) %>%
summarise(
years = n(),
total_games = sum(G),
total_runs = sum(R),
total_hr = sum(HR),
runs_per_game = sum(R) / sum(G),
hr_per_game = sum(HR) / sum(G),
.groups = 'drop'
)
# Calculate league averages
lg_avg_rpg <- sum(team_stats$R) / sum(team_stats$G)
lg_avg_hrpg <- sum(team_stats$HR) / sum(team_stats$G)
park_factors <- park_factors %>%
mutate(
runs_park_factor = 100 * (runs_per_game / lg_avg_rpg),
hr_park_factor = 100 * (hr_per_game / lg_avg_hrpg)
) %>%
arrange(desc(runs_park_factor))
return(park_factors)
}
# Calculate and display park factors
pf <- calculate_park_factors()
print(pf %>% select(teamID, runs_park_factor, hr_park_factor))
# Visualize park factors
ggplot(pf, aes(x = reorder(teamID, runs_park_factor), y = runs_park_factor)) +
geom_col(aes(fill = runs_park_factor > 100), alpha = 0.8) +
geom_hline(yintercept = 100, linetype = "dashed", color = "black", size = 1) +
scale_fill_manual(values = c("steelblue", "coral"), guide = "none") +
coord_flip() +
labs(title = "MLB Park Factors for Runs (2019-2023)",
subtitle = "100 = League Average",
x = "Team",
y = "Park Factor") +
theme_minimal() +
theme(axis.text = element_text(size = 10))
Python Implementation: Comprehensive Park Factors
# Python: Advanced Park Factor Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast, team_batting, team_pitching
# Define park characteristics (approximate dimensions)
park_dimensions = pd.DataFrame({
'park': ['Coors Field', 'Yankee Stadium', 'Fenway Park', 'Oracle Park',
'Petco Park', 'Dodger Stadium', 'Wrigley Field', 'Minute Maid Park'],
'team': ['COL', 'NYY', 'BOS', 'SFG', 'SDP', 'LAD', 'CHC', 'HOU'],
'lf_distance': [347, 318, 310, 339, 334, 330, 355, 315],
'cf_distance': [415, 408, 390, 399, 396, 395, 400, 409],
'rf_distance': [350, 314, 302, 309, 322, 330, 353, 326],
'lf_wall_height': [8, 8, 37, 8, 8, 3.75, 11, 19],
'elevation': [5200, 55, 20, 0, 20, 340, 595, 50],
'known_factor': ['Extreme hitter', 'RH hitter', 'LH hitter', 'Extreme pitcher',
'Pitcher friendly', 'Pitcher friendly', 'Neutral', 'LH hitter']
})
print("MLB Park Dimensions and Characteristics:")
print(park_dimensions.to_string(index=False))
# Simulate park factors based on real tendencies
np.random.seed(42)
park_factors_data = {
'park': ['Coors Field', 'Globe Life Field', 'Great American Ball Park',
'Rogers Centre', 'Yankee Stadium', 'Camden Yards', 'Fenway Park',
'Wrigley Field', 'Chase Field', 'Minute Maid Park',
'T-Mobile Park', 'Target Field', 'Guaranteed Rate Field',
'Kauffman Stadium', 'Dodger Stadium', 'Petco Park', 'Oracle Park',
'Truist Park', 'Nationals Park', 'Progressive Field'],
'team': ['COL', 'TEX', 'CIN', 'TOR', 'NYY', 'BAL', 'BOS', 'CHC', 'ARI', 'HOU',
'SEA', 'MIN', 'CHW', 'KCR', 'LAD', 'SDP', 'SFG', 'ATL', 'WSN', 'CLE'],
'runs_pf': [115, 108, 107, 104, 103, 103, 102, 101, 100, 100,
99, 98, 98, 97, 96, 94, 92, 100, 99, 97],
'hr_pf': [122, 110, 112, 105, 108, 106, 104, 102, 98, 103,
96, 97, 99, 95, 97, 89, 85, 101, 100, 98],
'hits_pf': [112, 106, 104, 103, 102, 102, 103, 101, 101, 99,
100, 99, 98, 98, 97, 95, 94, 100, 99, 97],
'doubles_pf': [109, 105, 103, 104, 100, 103, 107, 102, 99, 98,
102, 98, 97, 99, 96, 96, 92, 101, 100, 99],
'triples_pf': [118, 95, 98, 92, 88, 97, 85, 103, 105, 94,
108, 102, 101, 112, 99, 102, 104, 96, 98, 100]
}
park_factors_df = pd.DataFrame(park_factors_data)
# Calculate aggregate offensive park factor
park_factors_df['overall_pf'] = (
park_factors_df['runs_pf'] * 0.4 +
park_factors_df['hr_pf'] * 0.3 +
park_factors_df['hits_pf'] * 0.3
)
park_factors_df = park_factors_df.sort_values('overall_pf', ascending=False)
print("\nPark Factors Summary (2019-2023):")
print(park_factors_df[['park', 'runs_pf', 'hr_pf', 'overall_pf']].to_string(index=False))
# Visualize multi-dimensional park factors
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Runs vs HR park factors
axes[0].scatter(park_factors_df['runs_pf'], park_factors_df['hr_pf'],
s=100, alpha=0.6, c=park_factors_df['overall_pf'],
cmap='RdYlGn_r')
axes[0].axhline(y=100, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=100, color='gray', linestyle='--', alpha=0.5)
# Annotate extreme parks
for idx, row in park_factors_df.iterrows():
if row['runs_pf'] > 108 or row['runs_pf'] < 94:
axes[0].annotate(row['team'], (row['runs_pf'], row['hr_pf']),
fontsize=8, alpha=0.7)
axes[0].set_xlabel('Runs Park Factor')
axes[0].set_ylabel('Home Runs Park Factor')
axes[0].set_title('Park Factors: Runs vs Home Runs')
axes[0].grid(True, alpha=0.3)
# Overall park factor ranking
park_factors_df_sorted = park_factors_df.sort_values('overall_pf')
colors = ['green' if x < 100 else 'red' for x in park_factors_df_sorted['overall_pf']]
axes[1].barh(range(len(park_factors_df_sorted)),
park_factors_df_sorted['overall_pf'],
color=colors, alpha=0.6)
axes[1].set_yticks(range(len(park_factors_df_sorted)))
axes[1].set_yticklabels(park_factors_df_sorted['team'], fontsize=8)
axes[1].axvline(x=100, color='black', linestyle='--', linewidth=2)
axes[1].set_xlabel('Overall Park Factor')
axes[1].set_title('Overall Offensive Park Factors')
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('park_factors_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Statistical significance of park effects
from scipy import stats
def assess_park_significance(home_rpg, road_rpg, games=81):
"""
Test if park effect is statistically significant
"""
# Assuming Poisson distribution for runs
# Standard error for rate difference
se = np.sqrt((home_rpg + road_rpg) / games)
z_score = (home_rpg - road_rpg) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
return {
'z_score': z_score,
'p_value': p_value,
'significant': p_value < 0.05
}
# Example: Coors Field
coors_home_rpg = 5.8
coors_road_rpg = 4.7
significance = assess_park_significance(coors_home_rpg, coors_road_rpg)
print(f"\nCoors Field Park Effect Significance:")
print(f"Z-score: {significance['z_score']:.2f}")
print(f"P-value: {significance['p_value']:.4f}")
print(f"Statistically significant: {significance['significant']}")
Advanced Park Factors: Handedness and Batted Ball Type
Modern park factors account for batter handedness and batted ball outcomes:
# Python: Handedness-Specific Park Factors
def calculate_handedness_park_factors():
"""
Calculate park factors by batter handedness
"""
# Example data structure
handedness_pf = pd.DataFrame({
'park': ['Yankee Stadium', 'Fenway Park', 'Oracle Park', 'Coors Field',
'Petco Park', 'Dodger Stadium'],
'team': ['NYY', 'BOS', 'SFG', 'COL', 'SDP', 'LAD'],
'rhb_hr_pf': [125, 95, 82, 118, 88, 95],
'lhb_hr_pf': [88, 115, 90, 126, 92, 98],
'rhb_overall_pf': [108, 98, 90, 113, 92, 97],
'lhb_overall_pf': [95, 107, 93, 117, 95, 96]
})
print("Park Factors by Batter Handedness:")
print(handedness_pf.to_string(index=False))
# Visualize the handedness effect
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(handedness_pf))
width = 0.35
ax.bar(x - width/2, handedness_pf['rhb_hr_pf'], width,
label='RHB HR PF', alpha=0.8, color='steelblue')
ax.bar(x + width/2, handedness_pf['lhb_hr_pf'], width,
label='LHB HR PF', alpha=0.8, color='coral')
ax.axhline(y=100, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Ballpark')
ax.set_ylabel('Home Run Park Factor')
ax.set_title('Home Run Park Factors by Batter Handedness')
ax.set_xticks(x)
ax.set_xticklabels(handedness_pf['park'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
return handedness_pf
handedness_pf = calculate_handedness_park_factors()
# Identify parks with largest handedness splits
handedness_pf['hr_split'] = abs(handedness_pf['rhb_hr_pf'] -
handedness_pf['lhb_hr_pf'])
print("\nParks with Largest Handedness Effects:")
print(handedness_pf.nlargest(5, 'hr_split')[['park', 'rhb_hr_pf',
'lhb_hr_pf', 'hr_split']])
Time-Varying Park Factors
Park factors change over time due to:
- Ballpark modifications (fence distances, wall heights)
- Climate changes
- Surrounding construction affecting wind patterns
- Ball specifications (deadened ball era 2021-2022)
# R: Track Park Factor Changes Over Time
library(tidyverse)
# Simulate park factor changes for select ballparks
years <- 2015:2023
parks <- c("COL", "BOS", "NYY", "SFG", "LAD")
set.seed(123)
park_trends <- expand_grid(year = years, park = parks) %>%
mutate(
base_pf = case_when(
park == "COL" ~ 115,
park == "BOS" ~ 102,
park == "NYY" ~ 103,
park == "SFG" ~ 92,
park == "LAD" ~ 96
),
# Add trend and noise
trend = case_when(
park == "COL" & year >= 2019 ~ -1.5, # Humidor effect
park == "SFG" & year >= 2020 ~ -2, # Oracle Park settling
TRUE ~ 0
),
noise = rnorm(n(), 0, 2),
runs_pf = base_pf + (year - 2015) * trend + noise
)
# Visualize trends
ggplot(park_trends, aes(x = year, y = runs_pf, color = park)) +
geom_line(size = 1.2) +
geom_point(size = 2.5) +
geom_hline(yintercept = 100, linetype = "dashed", color = "gray50") +
scale_color_brewer(palette = "Set1", name = "Ballpark") +
labs(title = "Park Factor Evolution Over Time",
subtitle = "Notable: Coors Field humidor impact (2019)",
x = "Year",
y = "Runs Park Factor") +
theme_minimal() +
theme(legend.position = "right")
# Statistical test for trend
coors_trend <- park_trends %>%
filter(park == "COL")
coors_model <- lm(runs_pf ~ year, data = coors_trend)
summary(coors_model)
cat("\nCoors Field Trend Analysis:\n")
cat(sprintf("Slope: %.2f runs PF per year\n", coef(coors_model)[2]))
cat(sprintf("P-value: %.4f\n", summary(coors_model)$coefficients[2, 4]))
Park Factor = 100 × (Home Rate / Road Rate)
Park Factor = 100 × ((Home Games) / (Road Games)) / (((Home Games + Road Games) / 2) / Road Games)
Park Factor = 100 × (Home Rate × Road Rate) / ((Home Rate + Road Rate) / 2)
# R: Calculate Park Factors
library(tidyverse)
library(Lahman)
# Calculate park factors for runs scored (2019-2023)
calculate_park_factors <- function(start_year = 2019, end_year = 2023) {
# Get team game logs
team_stats <- Teams %>%
filter(yearID >= start_year, yearID <= end_year) %>%
select(yearID, teamID, G, R, RA, HR, H, BB, SO) %>%
mutate(
home_games = G / 2, # Approximation
road_games = G / 2
)
# Get park-specific data
# For demonstration, we'll calculate from team splits
# In practice, you'd use game-by-game data
park_factors <- team_stats %>%
group_by(teamID) %>%
summarise(
years = n(),
total_games = sum(G),
total_runs = sum(R),
total_hr = sum(HR),
runs_per_game = sum(R) / sum(G),
hr_per_game = sum(HR) / sum(G),
.groups = 'drop'
)
# Calculate league averages
lg_avg_rpg <- sum(team_stats$R) / sum(team_stats$G)
lg_avg_hrpg <- sum(team_stats$HR) / sum(team_stats$G)
park_factors <- park_factors %>%
mutate(
runs_park_factor = 100 * (runs_per_game / lg_avg_rpg),
hr_park_factor = 100 * (hr_per_game / lg_avg_hrpg)
) %>%
arrange(desc(runs_park_factor))
return(park_factors)
}
# Calculate and display park factors
pf <- calculate_park_factors()
print(pf %>% select(teamID, runs_park_factor, hr_park_factor))
# Visualize park factors
ggplot(pf, aes(x = reorder(teamID, runs_park_factor), y = runs_park_factor)) +
geom_col(aes(fill = runs_park_factor > 100), alpha = 0.8) +
geom_hline(yintercept = 100, linetype = "dashed", color = "black", size = 1) +
scale_fill_manual(values = c("steelblue", "coral"), guide = "none") +
coord_flip() +
labs(title = "MLB Park Factors for Runs (2019-2023)",
subtitle = "100 = League Average",
x = "Team",
y = "Park Factor") +
theme_minimal() +
theme(axis.text = element_text(size = 10))
# R: Track Park Factor Changes Over Time
library(tidyverse)
# Simulate park factor changes for select ballparks
years <- 2015:2023
parks <- c("COL", "BOS", "NYY", "SFG", "LAD")
set.seed(123)
park_trends <- expand_grid(year = years, park = parks) %>%
mutate(
base_pf = case_when(
park == "COL" ~ 115,
park == "BOS" ~ 102,
park == "NYY" ~ 103,
park == "SFG" ~ 92,
park == "LAD" ~ 96
),
# Add trend and noise
trend = case_when(
park == "COL" & year >= 2019 ~ -1.5, # Humidor effect
park == "SFG" & year >= 2020 ~ -2, # Oracle Park settling
TRUE ~ 0
),
noise = rnorm(n(), 0, 2),
runs_pf = base_pf + (year - 2015) * trend + noise
)
# Visualize trends
ggplot(park_trends, aes(x = year, y = runs_pf, color = park)) +
geom_line(size = 1.2) +
geom_point(size = 2.5) +
geom_hline(yintercept = 100, linetype = "dashed", color = "gray50") +
scale_color_brewer(palette = "Set1", name = "Ballpark") +
labs(title = "Park Factor Evolution Over Time",
subtitle = "Notable: Coors Field humidor impact (2019)",
x = "Year",
y = "Runs Park Factor") +
theme_minimal() +
theme(legend.position = "right")
# Statistical test for trend
coors_trend <- park_trends %>%
filter(park == "COL")
coors_model <- lm(runs_pf ~ year, data = coors_trend)
summary(coors_model)
cat("\nCoors Field Trend Analysis:\n")
cat(sprintf("Slope: %.2f runs PF per year\n", coef(coors_model)[2]))
cat(sprintf("P-value: %.4f\n", summary(coors_model)$coefficients[2, 4]))
# Python: Advanced Park Factor Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast, team_batting, team_pitching
# Define park characteristics (approximate dimensions)
park_dimensions = pd.DataFrame({
'park': ['Coors Field', 'Yankee Stadium', 'Fenway Park', 'Oracle Park',
'Petco Park', 'Dodger Stadium', 'Wrigley Field', 'Minute Maid Park'],
'team': ['COL', 'NYY', 'BOS', 'SFG', 'SDP', 'LAD', 'CHC', 'HOU'],
'lf_distance': [347, 318, 310, 339, 334, 330, 355, 315],
'cf_distance': [415, 408, 390, 399, 396, 395, 400, 409],
'rf_distance': [350, 314, 302, 309, 322, 330, 353, 326],
'lf_wall_height': [8, 8, 37, 8, 8, 3.75, 11, 19],
'elevation': [5200, 55, 20, 0, 20, 340, 595, 50],
'known_factor': ['Extreme hitter', 'RH hitter', 'LH hitter', 'Extreme pitcher',
'Pitcher friendly', 'Pitcher friendly', 'Neutral', 'LH hitter']
})
print("MLB Park Dimensions and Characteristics:")
print(park_dimensions.to_string(index=False))
# Simulate park factors based on real tendencies
np.random.seed(42)
park_factors_data = {
'park': ['Coors Field', 'Globe Life Field', 'Great American Ball Park',
'Rogers Centre', 'Yankee Stadium', 'Camden Yards', 'Fenway Park',
'Wrigley Field', 'Chase Field', 'Minute Maid Park',
'T-Mobile Park', 'Target Field', 'Guaranteed Rate Field',
'Kauffman Stadium', 'Dodger Stadium', 'Petco Park', 'Oracle Park',
'Truist Park', 'Nationals Park', 'Progressive Field'],
'team': ['COL', 'TEX', 'CIN', 'TOR', 'NYY', 'BAL', 'BOS', 'CHC', 'ARI', 'HOU',
'SEA', 'MIN', 'CHW', 'KCR', 'LAD', 'SDP', 'SFG', 'ATL', 'WSN', 'CLE'],
'runs_pf': [115, 108, 107, 104, 103, 103, 102, 101, 100, 100,
99, 98, 98, 97, 96, 94, 92, 100, 99, 97],
'hr_pf': [122, 110, 112, 105, 108, 106, 104, 102, 98, 103,
96, 97, 99, 95, 97, 89, 85, 101, 100, 98],
'hits_pf': [112, 106, 104, 103, 102, 102, 103, 101, 101, 99,
100, 99, 98, 98, 97, 95, 94, 100, 99, 97],
'doubles_pf': [109, 105, 103, 104, 100, 103, 107, 102, 99, 98,
102, 98, 97, 99, 96, 96, 92, 101, 100, 99],
'triples_pf': [118, 95, 98, 92, 88, 97, 85, 103, 105, 94,
108, 102, 101, 112, 99, 102, 104, 96, 98, 100]
}
park_factors_df = pd.DataFrame(park_factors_data)
# Calculate aggregate offensive park factor
park_factors_df['overall_pf'] = (
park_factors_df['runs_pf'] * 0.4 +
park_factors_df['hr_pf'] * 0.3 +
park_factors_df['hits_pf'] * 0.3
)
park_factors_df = park_factors_df.sort_values('overall_pf', ascending=False)
print("\nPark Factors Summary (2019-2023):")
print(park_factors_df[['park', 'runs_pf', 'hr_pf', 'overall_pf']].to_string(index=False))
# Visualize multi-dimensional park factors
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Runs vs HR park factors
axes[0].scatter(park_factors_df['runs_pf'], park_factors_df['hr_pf'],
s=100, alpha=0.6, c=park_factors_df['overall_pf'],
cmap='RdYlGn_r')
axes[0].axhline(y=100, color='gray', linestyle='--', alpha=0.5)
axes[0].axvline(x=100, color='gray', linestyle='--', alpha=0.5)
# Annotate extreme parks
for idx, row in park_factors_df.iterrows():
if row['runs_pf'] > 108 or row['runs_pf'] < 94:
axes[0].annotate(row['team'], (row['runs_pf'], row['hr_pf']),
fontsize=8, alpha=0.7)
axes[0].set_xlabel('Runs Park Factor')
axes[0].set_ylabel('Home Runs Park Factor')
axes[0].set_title('Park Factors: Runs vs Home Runs')
axes[0].grid(True, alpha=0.3)
# Overall park factor ranking
park_factors_df_sorted = park_factors_df.sort_values('overall_pf')
colors = ['green' if x < 100 else 'red' for x in park_factors_df_sorted['overall_pf']]
axes[1].barh(range(len(park_factors_df_sorted)),
park_factors_df_sorted['overall_pf'],
color=colors, alpha=0.6)
axes[1].set_yticks(range(len(park_factors_df_sorted)))
axes[1].set_yticklabels(park_factors_df_sorted['team'], fontsize=8)
axes[1].axvline(x=100, color='black', linestyle='--', linewidth=2)
axes[1].set_xlabel('Overall Park Factor')
axes[1].set_title('Overall Offensive Park Factors')
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('park_factors_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Statistical significance of park effects
from scipy import stats
def assess_park_significance(home_rpg, road_rpg, games=81):
"""
Test if park effect is statistically significant
"""
# Assuming Poisson distribution for runs
# Standard error for rate difference
se = np.sqrt((home_rpg + road_rpg) / games)
z_score = (home_rpg - road_rpg) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
return {
'z_score': z_score,
'p_value': p_value,
'significant': p_value < 0.05
}
# Example: Coors Field
coors_home_rpg = 5.8
coors_road_rpg = 4.7
significance = assess_park_significance(coors_home_rpg, coors_road_rpg)
print(f"\nCoors Field Park Effect Significance:")
print(f"Z-score: {significance['z_score']:.2f}")
print(f"P-value: {significance['p_value']:.4f}")
print(f"Statistically significant: {significance['significant']}")
# Python: Handedness-Specific Park Factors
def calculate_handedness_park_factors():
"""
Calculate park factors by batter handedness
"""
# Example data structure
handedness_pf = pd.DataFrame({
'park': ['Yankee Stadium', 'Fenway Park', 'Oracle Park', 'Coors Field',
'Petco Park', 'Dodger Stadium'],
'team': ['NYY', 'BOS', 'SFG', 'COL', 'SDP', 'LAD'],
'rhb_hr_pf': [125, 95, 82, 118, 88, 95],
'lhb_hr_pf': [88, 115, 90, 126, 92, 98],
'rhb_overall_pf': [108, 98, 90, 113, 92, 97],
'lhb_overall_pf': [95, 107, 93, 117, 95, 96]
})
print("Park Factors by Batter Handedness:")
print(handedness_pf.to_string(index=False))
# Visualize the handedness effect
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(handedness_pf))
width = 0.35
ax.bar(x - width/2, handedness_pf['rhb_hr_pf'], width,
label='RHB HR PF', alpha=0.8, color='steelblue')
ax.bar(x + width/2, handedness_pf['lhb_hr_pf'], width,
label='LHB HR PF', alpha=0.8, color='coral')
ax.axhline(y=100, color='black', linestyle='--', linewidth=1)
ax.set_xlabel('Ballpark')
ax.set_ylabel('Home Run Park Factor')
ax.set_title('Home Run Park Factors by Batter Handedness')
ax.set_xticks(x)
ax.set_xticklabels(handedness_pf['park'], rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
return handedness_pf
handedness_pf = calculate_handedness_park_factors()
# Identify parks with largest handedness splits
handedness_pf['hr_split'] = abs(handedness_pf['rhb_hr_pf'] -
handedness_pf['lhb_hr_pf'])
print("\nParks with Largest Handedness Effects:")
print(handedness_pf.nlargest(5, 'hr_split')[['park', 'rhb_hr_pf',
'lhb_hr_pf', 'hr_split']])
Physics of Batted Ball Flight
A baseball in flight experiences three primary forces:
- Gravity: Constant downward acceleration (9.8 m/s²)
- Drag: Air resistance proportional to velocity squared
- Magnus Force: Spin-induced lift (or drop)
The drag force equation:
F_drag = 0.5 × ρ × C_d × A × v²
Where:
- ρ = air density (kg/m³)
- C_d = drag coefficient (~0.3-0.5 for baseball)
- A = cross-sectional area (0.00426 m² for baseball)
- v = velocity (m/s)
Air density depends on temperature, pressure, and humidity:
ρ = (P × M) / (R × T)
Where:
- P = pressure (Pa)
- M = molar mass of air (kg/mol)
- R = gas constant (8.314 J/(mol·K))
- T = temperature (K)
R Implementation: Ball Flight Simulator
# R: Baseball Flight Physics Simulator
library(tidyverse)
# Constants
g <- 9.8 # gravity (m/s^2)
R_gas <- 8.314 # gas constant
M_air <- 0.029 # molar mass of dry air (kg/mol)
baseball_area <- 0.00426 # cross-sectional area (m^2)
baseball_mass <- 0.145 # mass (kg)
C_d <- 0.35 # drag coefficient
# Calculate air density
calculate_air_density <- function(temp_f, pressure_mb, humidity_pct) {
# Convert to SI units
temp_k <- (temp_f - 32) * 5/9 + 273.15
pressure_pa <- pressure_mb * 100
# Calculate saturation vapor pressure (Clausius-Clapeyron)
e_s <- 611.2 * exp(17.67 * (temp_k - 273.15) / (temp_k - 29.65))
# Actual vapor pressure
e <- (humidity_pct / 100) * e_s
# Partial pressure of dry air
p_d <- pressure_pa - e
# Air density (accounting for humidity)
rho <- (p_d * M_air + e * 0.018) / (R_gas * temp_k)
return(rho)
}
# Simulate ball flight
simulate_batted_ball <- function(exit_velo_mph, launch_angle_deg,
temp_f, pressure_mb = 1013,
humidity_pct = 50, dt = 0.01) {
# Convert inputs
v0 <- exit_velo_mph * 0.44704 # mph to m/s
angle <- launch_angle_deg * pi / 180
# Initial conditions
vx <- v0 * cos(angle)
vy <- v0 * sin(angle)
x <- 0
y <- 1 # bat height
# Air density
rho <- calculate_air_density(temp_f, pressure_mb, humidity_pct)
# Storage
trajectory <- data.frame(t = 0, x = x, y = y, vx = vx, vy = vy)
t <- 0
while (y > 0 && t < 10) {
# Velocity magnitude
v <- sqrt(vx^2 + vy^2)
# Drag force
F_drag <- 0.5 * rho * C_d * baseball_area * v^2
# Acceleration components
ax <- -(F_drag / baseball_mass) * (vx / v)
ay <- -g - (F_drag / baseball_mass) * (vy / v)
# Update velocity
vx <- vx + ax * dt
vy <- vy + ay * dt
# Update position
x <- x + vx * dt
y <- y + vy * dt
t <- t + dt
# Store
trajectory <- rbind(trajectory, data.frame(t = t, x = x, y = y,
vx = vx, vy = vy))
}
# Convert distance to feet
trajectory$x_ft <- trajectory$x * 3.28084
trajectory$y_ft <- trajectory$y * 3.28084
return(trajectory)
}
# Simulate the same batted ball at different temperatures
exit_velo <- 105 # mph
launch_angle <- 30 # degrees
temps <- c(50, 60, 70, 80, 90, 100)
results <- list()
for (temp in temps) {
traj <- simulate_batted_ball(exit_velo, launch_angle, temp)
max_dist <- max(traj$x_ft)
results[[as.character(temp)]] <- data.frame(
temp = temp,
distance = max_dist
)
}
distance_vs_temp <- bind_rows(results)
print("Distance vs Temperature:")
print(distance_vs_temp)
# Visualize impact
ggplot(distance_vs_temp, aes(x = temp, y = distance)) +
geom_line(color = "red", size = 1.5) +
geom_point(size = 3, color = "darkred") +
labs(title = "Impact of Temperature on Batted Ball Distance",
subtitle = "105 mph exit velocity, 30° launch angle",
x = "Temperature (°F)",
y = "Distance (feet)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
# Calculate temperature effect
temp_effect <- lm(distance ~ temp, data = distance_vs_temp)
cat(sprintf("\nEstimated effect: %.2f feet per 10°F increase\n",
coef(temp_effect)[2] * 10))
Python Implementation: Comprehensive Weather Effects
# Python: Advanced Weather Impact Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
class BallFlightSimulator:
"""
Simulate batted ball flight with weather effects
"""
def __init__(self):
self.g = 9.8 # m/s^2
self.m = 0.145 # kg
self.A = 0.00426 # m^2
self.Cd = 0.35
self.R = 8.314
self.M_air = 0.029
def air_density(self, temp_f, pressure_mb, humidity_pct):
"""Calculate air density accounting for temperature, pressure, humidity"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Saturation vapor pressure (Magnus formula)
e_s = 611.2 * np.exp(17.67 * temp_c / (temp_c + 243.5))
# Actual vapor pressure
e = (humidity_pct / 100) * e_s
# Pressure in Pa
pressure_pa = pressure_mb * 100
# Partial pressure of dry air
p_d = pressure_pa - e
# Density (accounting for water vapor)
rho = (p_d * self.M_air + e * 0.018) / (self.R * temp_k)
return rho
def simulate(self, exit_velo_mph, launch_angle_deg,
temp_f, pressure_mb=1013.25, humidity_pct=50,
wind_mph=0, backspin_rpm=2000):
"""
Simulate ball flight trajectory
Parameters:
- exit_velo_mph: Exit velocity in mph
- launch_angle_deg: Launch angle in degrees
- temp_f: Temperature in Fahrenheit
- pressure_mb: Atmospheric pressure in millibars
- humidity_pct: Relative humidity percentage
- wind_mph: Wind speed (positive = helping wind)
- backspin_rpm: Backspin in RPM
"""
# Convert units
v0 = exit_velo_mph * 0.44704 # to m/s
angle = np.radians(launch_angle_deg)
wind = wind_mph * 0.44704 # to m/s
omega = backspin_rpm * 2 * np.pi / 60 # to rad/s
# Initial conditions
vx0 = v0 * np.cos(angle)
vy0 = v0 * np.sin(angle)
# Air density
rho = self.air_density(temp_f, pressure_mb, humidity_pct)
# Time vector
t = np.linspace(0, 10, 1000)
# Differential equations
def derivatives(state, t):
x, y, vx, vy = state
# Relative velocity (accounting for wind)
vx_rel = vx - wind
v = np.sqrt(vx_rel**2 + vy**2)
if v == 0:
return [vx, vy, 0, -self.g]
# Drag force
F_drag = 0.5 * rho * self.Cd * self.A * v**2
# Magnus force (lift from backspin)
# Simplified: F_magnus = S * omega * v
S = 4.1e-4 # Magnus coefficient
F_magnus = S * omega * v
# Accelerations
ax = -(F_drag / self.m) * (vx_rel / v)
ay = -self.g - (F_drag / self.m) * (vy / v) + (F_magnus / self.m)
return [vx, vy, ax, ay]
# Solve ODE
initial_state = [0, 1, vx0, vy0]
solution = odeint(derivatives, initial_state, t)
# Extract trajectory
x = solution[:, 0] * 3.28084 # to feet
y = solution[:, 1] * 3.28084 # to feet
# Find landing point
landing_idx = np.where(y < 0)[0]
if len(landing_idx) > 0:
landing_idx = landing_idx[0]
distance = x[landing_idx]
hang_time = t[landing_idx]
else:
distance = x[-1]
hang_time = t[-1]
return {
'distance': distance,
'hang_time': hang_time,
'trajectory': pd.DataFrame({'x': x[:landing_idx+1],
'y': y[:landing_idx+1],
't': t[:landing_idx+1]})
}
# Initialize simulator
sim = BallFlightSimulator()
# Standard batted ball: 105 mph, 30 degrees, 2000 rpm backspin
print("Weather Impact on Identical Batted Ball (105 mph, 30°, 2000 RPM):\n")
weather_conditions = [
{'name': 'Cold April', 'temp': 45, 'pressure': 1020, 'humidity': 40},
{'name': 'Cool Spring', 'temp': 60, 'pressure': 1015, 'humidity': 50},
{'name': 'Mild Summer', 'temp': 75, 'pressure': 1013, 'humidity': 60},
{'name': 'Hot & Humid', 'temp': 90, 'pressure': 1010, 'humidity': 80},
{'name': 'Extreme Heat', 'temp': 105, 'pressure': 1005, 'humidity': 30}
]
results = []
for condition in weather_conditions:
result = sim.simulate(105, 30, condition['temp'],
condition['pressure'], condition['humidity'])
results.append({
'condition': condition['name'],
'temp': condition['temp'],
'distance': result['distance'],
'hang_time': result['hang_time']
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
# Calculate distance gained from cold to hot
cold_dist = results_df[results_df['temp'] == 45]['distance'].values[0]
hot_dist = results_df[results_df['temp'] == 105]['distance'].values[0]
print(f"\nDistance gained from 45°F to 105°F: {hot_dist - cold_dist:.1f} feet")
print(f"Percentage increase: {((hot_dist - cold_dist) / cold_dist * 100):.1f}%")
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Distance vs temperature
axes[0].plot(results_df['temp'], results_df['distance'],
'o-', linewidth=2, markersize=8, color='orangered')
axes[0].set_xlabel('Temperature (°F)', fontsize=12)
axes[0].set_ylabel('Distance (feet)', fontsize=12)
axes[0].set_title('Ball Flight Distance vs Temperature', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Add reference line
axes[0].axhline(y=400, color='gray', linestyle='--', alpha=0.5, label='Home Run Distance')
axes[0].legend()
# Trajectory comparison (extreme conditions)
cold_traj = sim.simulate(105, 30, 45, 1020, 40)['trajectory']
hot_traj = sim.simulate(105, 30, 105, 1005, 30)['trajectory']
axes[1].plot(cold_traj['x'], cold_traj['y'], label='45°F (Cold)', linewidth=2)
axes[1].plot(hot_traj['x'], hot_traj['y'], label='105°F (Hot)', linewidth=2)
axes[1].set_xlabel('Distance (feet)', fontsize=12)
axes[1].set_ylabel('Height (feet)', fontsize=12)
axes[1].set_title('Trajectory Comparison', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 450)
plt.tight_layout()
plt.savefig('weather_ball_flight.png', dpi=300, bbox_inches='tight')
plt.show()
Real-World Weather Impact on Game Outcomes
# Python: Analyze actual weather effects on scoring
def analyze_weather_scoring_relationship():
"""
Analyze relationship between weather and runs scored
Using hypothetical but realistic data
"""
# Simulate realistic game data with weather
np.random.seed(42)
n_games = 500
game_data = pd.DataFrame({
'game_id': range(n_games),
'temperature': np.random.normal(72, 15, n_games),
'humidity': np.random.normal(55, 20, n_games),
'wind_speed': np.abs(np.random.normal(5, 5, n_games)),
'pressure': np.random.normal(1013, 10, n_games)
})
# Clip to realistic ranges
game_data['temperature'] = game_data['temperature'].clip(35, 105)
game_data['humidity'] = game_data['humidity'].clip(10, 95)
game_data['wind_speed'] = game_data['wind_speed'].clip(0, 25)
# Generate runs with weather dependencies
base_runs = 4.5
temp_effect = (game_data['temperature'] - 70) * 0.03 # 0.3 runs per 10°F
wind_effect = game_data['wind_speed'] * 0.05
noise = np.random.normal(0, 1.5, n_games)
game_data['total_runs'] = (base_runs + temp_effect +
wind_effect + noise).clip(0, None)
game_data['total_runs'] = game_data['total_runs'].round().astype(int)
# Statistical analysis
from scipy.stats import pearsonr, spearmanr
temp_corr, temp_p = pearsonr(game_data['temperature'], game_data['total_runs'])
wind_corr, wind_p = pearsonr(game_data['wind_speed'], game_data['total_runs'])
print("Weather-Runs Correlation Analysis:")
print(f"Temperature: r = {temp_corr:.3f}, p = {temp_p:.4f}")
print(f"Wind Speed: r = {wind_corr:.3f}, p = {wind_p:.4f}")
# Regression model
from sklearn.linear_model import LinearRegression
X = game_data[['temperature', 'wind_speed', 'humidity', 'pressure']]
y = game_data['total_runs']
model = LinearRegression()
model.fit(X, y)
print("\nRegression Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f"{feature}: {coef:.4f}")
print(f"R-squared: {model.score(X, y):.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Temperature vs runs
axes[0, 0].scatter(game_data['temperature'], game_data['total_runs'],
alpha=0.5, s=20)
axes[0, 0].plot(np.unique(game_data['temperature']),
np.poly1d(np.polyfit(game_data['temperature'],
game_data['total_runs'], 1))(
np.unique(game_data['temperature'])),
color='red', linewidth=2)
axes[0, 0].set_xlabel('Temperature (°F)')
axes[0, 0].set_ylabel('Total Runs')
axes[0, 0].set_title('Temperature vs Runs Scored')
axes[0, 0].grid(True, alpha=0.3)
# Wind vs runs
axes[0, 1].scatter(game_data['wind_speed'], game_data['total_runs'],
alpha=0.5, s=20, color='green')
axes[0, 1].plot(np.unique(game_data['wind_speed']),
np.poly1d(np.polyfit(game_data['wind_speed'],
game_data['total_runs'], 1))(
np.unique(game_data['wind_speed'])),
color='darkgreen', linewidth=2)
axes[0, 1].set_xlabel('Wind Speed (mph)')
axes[0, 1].set_ylabel('Total Runs')
axes[0, 1].set_title('Wind Speed vs Runs Scored')
axes[0, 1].grid(True, alpha=0.3)
# Humidity vs runs
axes[1, 0].scatter(game_data['humidity'], game_data['total_runs'],
alpha=0.5, s=20, color='blue')
axes[1, 0].set_xlabel('Humidity (%)')
axes[1, 0].set_ylabel('Total Runs')
axes[1, 0].set_title('Humidity vs Runs Scored')
axes[1, 0].grid(True, alpha=0.3)
# Temperature distribution by scoring level
game_data['scoring_level'] = pd.cut(game_data['total_runs'],
bins=[0, 3, 6, 100],
labels=['Low (0-3)', 'Med (4-6)', 'High (7+)'])
for level in ['Low (0-3)', 'Med (4-6)', 'High (7+)']:
data = game_data[game_data['scoring_level'] == level]['temperature']
axes[1, 1].hist(data, alpha=0.5, bins=20, label=level)
axes[1, 1].set_xlabel('Temperature (°F)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Temperature Distribution by Scoring Level')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('weather_scoring_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return game_data, model
game_data, weather_model = analyze_weather_scoring_relationship()
F_drag = 0.5 × ρ × C_d × A × v²
ρ = (P × M) / (R × T)
# R: Baseball Flight Physics Simulator
library(tidyverse)
# Constants
g <- 9.8 # gravity (m/s^2)
R_gas <- 8.314 # gas constant
M_air <- 0.029 # molar mass of dry air (kg/mol)
baseball_area <- 0.00426 # cross-sectional area (m^2)
baseball_mass <- 0.145 # mass (kg)
C_d <- 0.35 # drag coefficient
# Calculate air density
calculate_air_density <- function(temp_f, pressure_mb, humidity_pct) {
# Convert to SI units
temp_k <- (temp_f - 32) * 5/9 + 273.15
pressure_pa <- pressure_mb * 100
# Calculate saturation vapor pressure (Clausius-Clapeyron)
e_s <- 611.2 * exp(17.67 * (temp_k - 273.15) / (temp_k - 29.65))
# Actual vapor pressure
e <- (humidity_pct / 100) * e_s
# Partial pressure of dry air
p_d <- pressure_pa - e
# Air density (accounting for humidity)
rho <- (p_d * M_air + e * 0.018) / (R_gas * temp_k)
return(rho)
}
# Simulate ball flight
simulate_batted_ball <- function(exit_velo_mph, launch_angle_deg,
temp_f, pressure_mb = 1013,
humidity_pct = 50, dt = 0.01) {
# Convert inputs
v0 <- exit_velo_mph * 0.44704 # mph to m/s
angle <- launch_angle_deg * pi / 180
# Initial conditions
vx <- v0 * cos(angle)
vy <- v0 * sin(angle)
x <- 0
y <- 1 # bat height
# Air density
rho <- calculate_air_density(temp_f, pressure_mb, humidity_pct)
# Storage
trajectory <- data.frame(t = 0, x = x, y = y, vx = vx, vy = vy)
t <- 0
while (y > 0 && t < 10) {
# Velocity magnitude
v <- sqrt(vx^2 + vy^2)
# Drag force
F_drag <- 0.5 * rho * C_d * baseball_area * v^2
# Acceleration components
ax <- -(F_drag / baseball_mass) * (vx / v)
ay <- -g - (F_drag / baseball_mass) * (vy / v)
# Update velocity
vx <- vx + ax * dt
vy <- vy + ay * dt
# Update position
x <- x + vx * dt
y <- y + vy * dt
t <- t + dt
# Store
trajectory <- rbind(trajectory, data.frame(t = t, x = x, y = y,
vx = vx, vy = vy))
}
# Convert distance to feet
trajectory$x_ft <- trajectory$x * 3.28084
trajectory$y_ft <- trajectory$y * 3.28084
return(trajectory)
}
# Simulate the same batted ball at different temperatures
exit_velo <- 105 # mph
launch_angle <- 30 # degrees
temps <- c(50, 60, 70, 80, 90, 100)
results <- list()
for (temp in temps) {
traj <- simulate_batted_ball(exit_velo, launch_angle, temp)
max_dist <- max(traj$x_ft)
results[[as.character(temp)]] <- data.frame(
temp = temp,
distance = max_dist
)
}
distance_vs_temp <- bind_rows(results)
print("Distance vs Temperature:")
print(distance_vs_temp)
# Visualize impact
ggplot(distance_vs_temp, aes(x = temp, y = distance)) +
geom_line(color = "red", size = 1.5) +
geom_point(size = 3, color = "darkred") +
labs(title = "Impact of Temperature on Batted Ball Distance",
subtitle = "105 mph exit velocity, 30° launch angle",
x = "Temperature (°F)",
y = "Distance (feet)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
# Calculate temperature effect
temp_effect <- lm(distance ~ temp, data = distance_vs_temp)
cat(sprintf("\nEstimated effect: %.2f feet per 10°F increase\n",
coef(temp_effect)[2] * 10))
# Python: Advanced Weather Impact Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
class BallFlightSimulator:
"""
Simulate batted ball flight with weather effects
"""
def __init__(self):
self.g = 9.8 # m/s^2
self.m = 0.145 # kg
self.A = 0.00426 # m^2
self.Cd = 0.35
self.R = 8.314
self.M_air = 0.029
def air_density(self, temp_f, pressure_mb, humidity_pct):
"""Calculate air density accounting for temperature, pressure, humidity"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Saturation vapor pressure (Magnus formula)
e_s = 611.2 * np.exp(17.67 * temp_c / (temp_c + 243.5))
# Actual vapor pressure
e = (humidity_pct / 100) * e_s
# Pressure in Pa
pressure_pa = pressure_mb * 100
# Partial pressure of dry air
p_d = pressure_pa - e
# Density (accounting for water vapor)
rho = (p_d * self.M_air + e * 0.018) / (self.R * temp_k)
return rho
def simulate(self, exit_velo_mph, launch_angle_deg,
temp_f, pressure_mb=1013.25, humidity_pct=50,
wind_mph=0, backspin_rpm=2000):
"""
Simulate ball flight trajectory
Parameters:
- exit_velo_mph: Exit velocity in mph
- launch_angle_deg: Launch angle in degrees
- temp_f: Temperature in Fahrenheit
- pressure_mb: Atmospheric pressure in millibars
- humidity_pct: Relative humidity percentage
- wind_mph: Wind speed (positive = helping wind)
- backspin_rpm: Backspin in RPM
"""
# Convert units
v0 = exit_velo_mph * 0.44704 # to m/s
angle = np.radians(launch_angle_deg)
wind = wind_mph * 0.44704 # to m/s
omega = backspin_rpm * 2 * np.pi / 60 # to rad/s
# Initial conditions
vx0 = v0 * np.cos(angle)
vy0 = v0 * np.sin(angle)
# Air density
rho = self.air_density(temp_f, pressure_mb, humidity_pct)
# Time vector
t = np.linspace(0, 10, 1000)
# Differential equations
def derivatives(state, t):
x, y, vx, vy = state
# Relative velocity (accounting for wind)
vx_rel = vx - wind
v = np.sqrt(vx_rel**2 + vy**2)
if v == 0:
return [vx, vy, 0, -self.g]
# Drag force
F_drag = 0.5 * rho * self.Cd * self.A * v**2
# Magnus force (lift from backspin)
# Simplified: F_magnus = S * omega * v
S = 4.1e-4 # Magnus coefficient
F_magnus = S * omega * v
# Accelerations
ax = -(F_drag / self.m) * (vx_rel / v)
ay = -self.g - (F_drag / self.m) * (vy / v) + (F_magnus / self.m)
return [vx, vy, ax, ay]
# Solve ODE
initial_state = [0, 1, vx0, vy0]
solution = odeint(derivatives, initial_state, t)
# Extract trajectory
x = solution[:, 0] * 3.28084 # to feet
y = solution[:, 1] * 3.28084 # to feet
# Find landing point
landing_idx = np.where(y < 0)[0]
if len(landing_idx) > 0:
landing_idx = landing_idx[0]
distance = x[landing_idx]
hang_time = t[landing_idx]
else:
distance = x[-1]
hang_time = t[-1]
return {
'distance': distance,
'hang_time': hang_time,
'trajectory': pd.DataFrame({'x': x[:landing_idx+1],
'y': y[:landing_idx+1],
't': t[:landing_idx+1]})
}
# Initialize simulator
sim = BallFlightSimulator()
# Standard batted ball: 105 mph, 30 degrees, 2000 rpm backspin
print("Weather Impact on Identical Batted Ball (105 mph, 30°, 2000 RPM):\n")
weather_conditions = [
{'name': 'Cold April', 'temp': 45, 'pressure': 1020, 'humidity': 40},
{'name': 'Cool Spring', 'temp': 60, 'pressure': 1015, 'humidity': 50},
{'name': 'Mild Summer', 'temp': 75, 'pressure': 1013, 'humidity': 60},
{'name': 'Hot & Humid', 'temp': 90, 'pressure': 1010, 'humidity': 80},
{'name': 'Extreme Heat', 'temp': 105, 'pressure': 1005, 'humidity': 30}
]
results = []
for condition in weather_conditions:
result = sim.simulate(105, 30, condition['temp'],
condition['pressure'], condition['humidity'])
results.append({
'condition': condition['name'],
'temp': condition['temp'],
'distance': result['distance'],
'hang_time': result['hang_time']
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
# Calculate distance gained from cold to hot
cold_dist = results_df[results_df['temp'] == 45]['distance'].values[0]
hot_dist = results_df[results_df['temp'] == 105]['distance'].values[0]
print(f"\nDistance gained from 45°F to 105°F: {hot_dist - cold_dist:.1f} feet")
print(f"Percentage increase: {((hot_dist - cold_dist) / cold_dist * 100):.1f}%")
# Visualize results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Distance vs temperature
axes[0].plot(results_df['temp'], results_df['distance'],
'o-', linewidth=2, markersize=8, color='orangered')
axes[0].set_xlabel('Temperature (°F)', fontsize=12)
axes[0].set_ylabel('Distance (feet)', fontsize=12)
axes[0].set_title('Ball Flight Distance vs Temperature', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Add reference line
axes[0].axhline(y=400, color='gray', linestyle='--', alpha=0.5, label='Home Run Distance')
axes[0].legend()
# Trajectory comparison (extreme conditions)
cold_traj = sim.simulate(105, 30, 45, 1020, 40)['trajectory']
hot_traj = sim.simulate(105, 30, 105, 1005, 30)['trajectory']
axes[1].plot(cold_traj['x'], cold_traj['y'], label='45°F (Cold)', linewidth=2)
axes[1].plot(hot_traj['x'], hot_traj['y'], label='105°F (Hot)', linewidth=2)
axes[1].set_xlabel('Distance (feet)', fontsize=12)
axes[1].set_ylabel('Height (feet)', fontsize=12)
axes[1].set_title('Trajectory Comparison', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 450)
plt.tight_layout()
plt.savefig('weather_ball_flight.png', dpi=300, bbox_inches='tight')
plt.show()
# Python: Analyze actual weather effects on scoring
def analyze_weather_scoring_relationship():
"""
Analyze relationship between weather and runs scored
Using hypothetical but realistic data
"""
# Simulate realistic game data with weather
np.random.seed(42)
n_games = 500
game_data = pd.DataFrame({
'game_id': range(n_games),
'temperature': np.random.normal(72, 15, n_games),
'humidity': np.random.normal(55, 20, n_games),
'wind_speed': np.abs(np.random.normal(5, 5, n_games)),
'pressure': np.random.normal(1013, 10, n_games)
})
# Clip to realistic ranges
game_data['temperature'] = game_data['temperature'].clip(35, 105)
game_data['humidity'] = game_data['humidity'].clip(10, 95)
game_data['wind_speed'] = game_data['wind_speed'].clip(0, 25)
# Generate runs with weather dependencies
base_runs = 4.5
temp_effect = (game_data['temperature'] - 70) * 0.03 # 0.3 runs per 10°F
wind_effect = game_data['wind_speed'] * 0.05
noise = np.random.normal(0, 1.5, n_games)
game_data['total_runs'] = (base_runs + temp_effect +
wind_effect + noise).clip(0, None)
game_data['total_runs'] = game_data['total_runs'].round().astype(int)
# Statistical analysis
from scipy.stats import pearsonr, spearmanr
temp_corr, temp_p = pearsonr(game_data['temperature'], game_data['total_runs'])
wind_corr, wind_p = pearsonr(game_data['wind_speed'], game_data['total_runs'])
print("Weather-Runs Correlation Analysis:")
print(f"Temperature: r = {temp_corr:.3f}, p = {temp_p:.4f}")
print(f"Wind Speed: r = {wind_corr:.3f}, p = {wind_p:.4f}")
# Regression model
from sklearn.linear_model import LinearRegression
X = game_data[['temperature', 'wind_speed', 'humidity', 'pressure']]
y = game_data['total_runs']
model = LinearRegression()
model.fit(X, y)
print("\nRegression Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f"{feature}: {coef:.4f}")
print(f"R-squared: {model.score(X, y):.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Temperature vs runs
axes[0, 0].scatter(game_data['temperature'], game_data['total_runs'],
alpha=0.5, s=20)
axes[0, 0].plot(np.unique(game_data['temperature']),
np.poly1d(np.polyfit(game_data['temperature'],
game_data['total_runs'], 1))(
np.unique(game_data['temperature'])),
color='red', linewidth=2)
axes[0, 0].set_xlabel('Temperature (°F)')
axes[0, 0].set_ylabel('Total Runs')
axes[0, 0].set_title('Temperature vs Runs Scored')
axes[0, 0].grid(True, alpha=0.3)
# Wind vs runs
axes[0, 1].scatter(game_data['wind_speed'], game_data['total_runs'],
alpha=0.5, s=20, color='green')
axes[0, 1].plot(np.unique(game_data['wind_speed']),
np.poly1d(np.polyfit(game_data['wind_speed'],
game_data['total_runs'], 1))(
np.unique(game_data['wind_speed'])),
color='darkgreen', linewidth=2)
axes[0, 1].set_xlabel('Wind Speed (mph)')
axes[0, 1].set_ylabel('Total Runs')
axes[0, 1].set_title('Wind Speed vs Runs Scored')
axes[0, 1].grid(True, alpha=0.3)
# Humidity vs runs
axes[1, 0].scatter(game_data['humidity'], game_data['total_runs'],
alpha=0.5, s=20, color='blue')
axes[1, 0].set_xlabel('Humidity (%)')
axes[1, 0].set_ylabel('Total Runs')
axes[1, 0].set_title('Humidity vs Runs Scored')
axes[1, 0].grid(True, alpha=0.3)
# Temperature distribution by scoring level
game_data['scoring_level'] = pd.cut(game_data['total_runs'],
bins=[0, 3, 6, 100],
labels=['Low (0-3)', 'Med (4-6)', 'High (7+)'])
for level in ['Low (0-3)', 'Med (4-6)', 'High (7+)']:
data = game_data[game_data['scoring_level'] == level]['temperature']
axes[1, 1].hist(data, alpha=0.5, bins=20, label=level)
axes[1, 1].set_xlabel('Temperature (°F)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Temperature Distribution by Scoring Level')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('weather_scoring_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return game_data, model
game_data, weather_model = analyze_weather_scoring_relationship()
The Coors Field Effect
Coors Field in Denver sits at 5,200 feet elevation, where air density is approximately 17% lower than sea level. This dramatically affects baseball flight:
Physical Effects:
- Reduced drag force (proportional to air density)
- Balls travel 5-10% farther
- Curveballs break less (reduced Magnus force)
- Fastballs have less rise
- Harder for pitchers to control breaking pitches
Historical Impact:
- 1995-2001: Extreme offense (park factor ~125)
- 2002: Humidor installed
- 2002-2018: Reduced but still high offense (PF ~115)
- 2019: Enhanced humidor
- 2019-present: More suppressed offense (PF ~108-110)
R Analysis: Quantifying Coors Effect
# R: Coors Field Deep Dive
library(tidyverse)
library(Lahman)
# Analyze Coors Field impact across eras
coors_analysis <- Teams %>%
filter(teamID == "COL") %>%
select(yearID, teamID, G, R, RA, HR, H, BB, SO) %>%
mutate(
rpg = R / G,
hrpg = HR / G,
era = case_when(
yearID >= 1993 & yearID <= 2001 ~ "Pre-Humidor (1993-2001)",
yearID >= 2002 & yearID <= 2018 ~ "Humidor v1 (2002-2018)",
yearID >= 2019 ~ "Enhanced Humidor (2019+)"
)
) %>%
filter(!is.na(era))
# Compare eras
era_summary <- coors_analysis %>%
group_by(era) %>%
summarise(
years = n(),
avg_rpg = mean(rpg),
avg_hrpg = mean(hrpg),
.groups = 'drop'
)
print("Coors Field by Era:")
print(era_summary)
# Visualize the change
ggplot(coors_analysis, aes(x = yearID, y = rpg)) +
geom_line(size = 1.2, color = "purple") +
geom_point(size = 2.5, color = "purple") +
geom_vline(xintercept = 2002, linetype = "dashed", color = "blue", size = 1) +
geom_vline(xintercept = 2019, linetype = "dashed", color = "red", size = 1) +
annotate("text", x = 2002, y = 6.5, label = "Humidor\nInstalled",
hjust = -0.1, color = "blue") +
annotate("text", x = 2019, y = 6.5, label = "Enhanced\nHumidor",
hjust = -0.1, color = "red") +
labs(title = "Colorado Rockies Runs Per Game Over Time",
subtitle = "Impact of Humidor Installation",
x = "Year",
y = "Runs Per Game") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
# Calculate altitude-adjusted metrics for Coors players
# Example: Player home/road splits
# Simulate player data at Coors
player_coors <- data.frame(
player = c("Larry Walker", "Todd Helton", "Charlie Blackmon",
"Nolan Arenado", "Trevor Story"),
years_coors = c(10, 17, 9, 8, 7),
home_avg = c(.381, .345, .331, .315, .294),
road_avg = c(.282, .287, .281, .267, .251),
home_hr_rate = c(.075, .062, .048, .055, .057),
road_hr_rate = c(.048, .041, .029, .038, .039)
)
player_coors <- player_coors %>%
mutate(
avg_split = home_avg - road_avg,
hr_rate_split = home_hr_rate - road_hr_rate,
avg_ratio = home_avg / road_avg,
hr_ratio = home_hr_rate / road_hr_rate
)
print("\nCoors Field Player Splits:")
print(player_coors %>%
select(player, avg_split, hr_rate_split, avg_ratio, hr_ratio))
# Visualize splits
player_coors_long <- player_coors %>%
select(player, home_avg, road_avg) %>%
pivot_longer(cols = c(home_avg, road_avg),
names_to = "location",
values_to = "batting_avg") %>%
mutate(location = ifelse(location == "home_avg", "Home (Coors)", "Road"))
ggplot(player_coors_long, aes(x = player, y = batting_avg, fill = location)) +
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("purple", "gray60"), name = "") +
coord_flip() +
labs(title = "Home vs Road Batting Average",
subtitle = "Notable Rockies Hitters",
x = "",
y = "Batting Average") +
theme_minimal() +
theme(legend.position = "top")
Python Implementation: Physics-Based Altitude Modeling
# Python: Altitude Effects on Ball Flight
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
class AltitudeEffects:
"""
Model altitude effects on baseball physics
"""
def __init__(self):
self.g = 9.8 # m/s^2
self.sea_level_pressure = 101325 # Pa
self.sea_level_temp = 288.15 # K
self.lapse_rate = 0.0065 # K/m
def pressure_at_altitude(self, altitude_ft):
"""Calculate atmospheric pressure at altitude"""
altitude_m = altitude_ft * 0.3048
# Barometric formula
temp = self.sea_level_temp - self.lapse_rate * altitude_m
pressure = self.sea_level_pressure * (
(temp / self.sea_level_temp) ** 5.255
)
return pressure / 100 # Convert to mb
def air_density_ratio(self, altitude_ft, temp_f=70):
"""
Calculate air density relative to sea level
"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Pressure at altitude
pressure_alt = self.pressure_at_altitude(altitude_ft)
# Density ratio (simplified)
altitude_m = altitude_ft * 0.3048
temp_alt = temp_k - self.lapse_rate * altitude_m
rho_ratio = (pressure_alt / (self.sea_level_pressure / 100)) * (
self.sea_level_temp / temp_alt
)
return rho_ratio
def distance_multiplier(self, altitude_ft, temp_f=70):
"""
Estimate distance multiplier due to altitude
Empirical relationship: distance increases ~4% per 1000 ft
"""
rho_ratio = self.air_density_ratio(altitude_ft, temp_f)
# Distance is inversely related to drag, which is proportional to density
# Approximately: distance_ratio = rho_ratio^(-0.5)
# More accurate empirical fit: distance_ratio = 1 / (rho_ratio^0.55)
distance_ratio = 1 / (rho_ratio ** 0.55)
return distance_ratio
# Initialize
alt_model = AltitudeEffects()
# Compare major league ballparks by elevation
ballparks = pd.DataFrame({
'park': ['Coors Field', 'Chase Field', 'Kauffman Stadium',
'Dodger Stadium', 'Wrigley Field', 'Fenway Park',
'Yankee Stadium', 'Oracle Park'],
'city': ['Denver', 'Phoenix', 'Kansas City', 'Los Angeles',
'Chicago', 'Boston', 'New York', 'San Francisco'],
'elevation_ft': [5200, 1090, 750, 340, 595, 20, 55, 0],
'avg_temp_summer': [73, 95, 79, 75, 73, 73, 77, 65]
})
# Calculate effects
ballparks['pressure_mb'] = ballparks['elevation_ft'].apply(
alt_model.pressure_at_altitude
)
ballparks['density_ratio'] = ballparks.apply(
lambda row: alt_model.air_density_ratio(row['elevation_ft'],
row['avg_temp_summer']),
axis=1
)
ballparks['distance_multiplier'] = ballparks.apply(
lambda row: alt_model.distance_multiplier(row['elevation_ft'],
row['avg_temp_summer']),
axis=1
)
# Calculate distance for standard 400-foot fly ball
ballparks['actual_distance'] = 400 * ballparks['distance_multiplier']
ballparks['distance_gain'] = ballparks['actual_distance'] - 400
print("Altitude Effects by Ballpark:\n")
print(ballparks[['park', 'elevation_ft', 'density_ratio',
'distance_multiplier', 'actual_distance',
'distance_gain']].to_string(index=False))
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Elevation vs distance gain
axes[0, 0].scatter(ballparks['elevation_ft'], ballparks['distance_gain'],
s=100, alpha=0.6, c=ballparks['elevation_ft'],
cmap='terrain')
for idx, row in ballparks.iterrows():
if row['elevation_ft'] > 500:
axes[0, 0].annotate(row['city'],
(row['elevation_ft'], row['distance_gain']),
fontsize=9, alpha=0.8)
axes[0, 0].set_xlabel('Elevation (feet)')
axes[0, 0].set_ylabel('Distance Gained (feet)')
axes[0, 0].set_title('Altitude Effect on 400-ft Fly Ball')
axes[0, 0].grid(True, alpha=0.3)
# Air density ratio by park
parks_sorted = ballparks.sort_values('density_ratio')
colors = ['red' if x > 5000 else 'blue' if x > 1000 else 'green'
for x in parks_sorted['elevation_ft']]
axes[0, 1].barh(range(len(parks_sorted)), parks_sorted['density_ratio'],
color=colors, alpha=0.6)
axes[0, 1].set_yticks(range(len(parks_sorted)))
axes[0, 1].set_yticklabels(parks_sorted['city'])
axes[0, 1].set_xlabel('Air Density Ratio (vs sea level)')
axes[0, 1].set_title('Relative Air Density by Ballpark')
axes[0, 1].grid(True, alpha=0.3, axis='x')
# Distance multiplier across range of elevations
elevations = np.linspace(0, 6000, 100)
multipliers = [alt_model.distance_multiplier(e) for e in elevations]
axes[1, 0].plot(elevations, multipliers, linewidth=2, color='purple')
axes[1, 0].axhline(y=1.0, linestyle='--', color='gray', alpha=0.5)
axes[1, 0].scatter([5200], [alt_model.distance_multiplier(5200)],
s=200, c='red', zorder=5, label='Coors Field')
axes[1, 0].set_xlabel('Elevation (feet)')
axes[1, 0].set_ylabel('Distance Multiplier')
axes[1, 0].set_title('Ball Flight Distance vs Elevation')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Expected vs actual Coors park factor
# Theoretical prediction vs observed
years = np.arange(1995, 2024)
theoretical_pf = 100 * alt_model.distance_multiplier(5200)
# Simulate observed park factors (with humidor effects)
observed_pf = []
for year in years:
if year < 2002:
pf = np.random.normal(125, 3) # Pre-humidor
elif year < 2019:
pf = np.random.normal(115, 3) # Humidor v1
else:
pf = np.random.normal(110, 2) # Enhanced humidor
observed_pf.append(pf)
axes[1, 1].plot(years, observed_pf, label='Observed', linewidth=2)
axes[1, 1].axhline(y=theoretical_pf, linestyle='--', color='red',
linewidth=2, label=f'Theoretical ({theoretical_pf:.0f})')
axes[1, 1].axvline(x=2002, linestyle=':', color='blue', alpha=0.5)
axes[1, 1].axvline(x=2019, linestyle=':', color='green', alpha=0.5)
axes[1, 1].annotate('Humidor', xy=(2002, 120), fontsize=9)
axes[1, 1].annotate('Enhanced', xy=(2019, 120), fontsize=9)
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Park Factor')
axes[1, 1].set_title('Coors Field Park Factor Over Time')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('altitude_effects_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
Breaking Ball Analysis at Altitude
# Python: Analyze breaking ball effectiveness at different elevations
def breaking_ball_altitude_effect():
"""
Model breaking ball movement at different altitudes
"""
# Magnus force depends on air density
# F_magnus = S * rho * A * v * omega
# Where omega is spin rate
alt_model = AltitudeEffects()
# Standard curveball parameters
velocity_mph = 78
spin_rpm = 2400
# Convert units
v = velocity_mph * 0.44704 # m/s
omega = spin_rpm * 2 * np.pi / 60 # rad/s
elevations = [0, 1000, 2000, 3000, 4000, 5200]
results = []
for elevation in elevations:
rho_ratio = alt_model.air_density_ratio(elevation)
# Break is proportional to Magnus force, which is proportional to density
break_ratio = rho_ratio
# Assuming sea level break is 12 inches
sea_level_break = 12
actual_break = sea_level_break * break_ratio
break_reduction = sea_level_break - actual_break
results.append({
'elevation': elevation,
'density_ratio': rho_ratio,
'break_inches': actual_break,
'break_reduction': break_reduction,
'break_pct_reduction': (break_reduction / sea_level_break) * 100
})
results_df = pd.DataFrame(results)
print("Breaking Ball Movement at Altitude:")
print("(78 mph curveball with 2400 RPM)\n")
print(results_df.to_string(index=False))
# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(results_df['elevation'], results_df['break_inches'],
'o-', linewidth=2, markersize=8, color='red')
ax.axhline(y=12, linestyle='--', color='gray', alpha=0.5,
label='Sea Level Break')
ax.scatter([5200], [results_df[results_df['elevation']==5200]['break_inches'].values[0]],
s=300, c='darkred', zorder=5, marker='*', label='Coors Field')
ax.set_xlabel('Elevation (feet)', fontsize=12)
ax.set_ylabel('Curveball Break (inches)', fontsize=12)
ax.set_title('Breaking Ball Movement vs Altitude', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()
return results_df
breaking_df = breaking_ball_altitude_effect()
# Calculate pitcher disadvantage at Coors
coors_break = breaking_df[breaking_df['elevation']==5200]['break_inches'].values[0]
sea_level_break = breaking_df[breaking_df['elevation']==0]['break_inches'].values[0]
print(f"\nCurveball at Coors: {coors_break:.1f} inches break")
print(f"Curveball at sea level: {sea_level_break:.1f} inches break")
print(f"Reduction: {sea_level_break - coors_break:.1f} inches ({((sea_level_break - coors_break)/sea_level_break*100):.1f}%)")
# R: Coors Field Deep Dive
library(tidyverse)
library(Lahman)
# Analyze Coors Field impact across eras
coors_analysis <- Teams %>%
filter(teamID == "COL") %>%
select(yearID, teamID, G, R, RA, HR, H, BB, SO) %>%
mutate(
rpg = R / G,
hrpg = HR / G,
era = case_when(
yearID >= 1993 & yearID <= 2001 ~ "Pre-Humidor (1993-2001)",
yearID >= 2002 & yearID <= 2018 ~ "Humidor v1 (2002-2018)",
yearID >= 2019 ~ "Enhanced Humidor (2019+)"
)
) %>%
filter(!is.na(era))
# Compare eras
era_summary <- coors_analysis %>%
group_by(era) %>%
summarise(
years = n(),
avg_rpg = mean(rpg),
avg_hrpg = mean(hrpg),
.groups = 'drop'
)
print("Coors Field by Era:")
print(era_summary)
# Visualize the change
ggplot(coors_analysis, aes(x = yearID, y = rpg)) +
geom_line(size = 1.2, color = "purple") +
geom_point(size = 2.5, color = "purple") +
geom_vline(xintercept = 2002, linetype = "dashed", color = "blue", size = 1) +
geom_vline(xintercept = 2019, linetype = "dashed", color = "red", size = 1) +
annotate("text", x = 2002, y = 6.5, label = "Humidor\nInstalled",
hjust = -0.1, color = "blue") +
annotate("text", x = 2019, y = 6.5, label = "Enhanced\nHumidor",
hjust = -0.1, color = "red") +
labs(title = "Colorado Rockies Runs Per Game Over Time",
subtitle = "Impact of Humidor Installation",
x = "Year",
y = "Runs Per Game") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 14))
# Calculate altitude-adjusted metrics for Coors players
# Example: Player home/road splits
# Simulate player data at Coors
player_coors <- data.frame(
player = c("Larry Walker", "Todd Helton", "Charlie Blackmon",
"Nolan Arenado", "Trevor Story"),
years_coors = c(10, 17, 9, 8, 7),
home_avg = c(.381, .345, .331, .315, .294),
road_avg = c(.282, .287, .281, .267, .251),
home_hr_rate = c(.075, .062, .048, .055, .057),
road_hr_rate = c(.048, .041, .029, .038, .039)
)
player_coors <- player_coors %>%
mutate(
avg_split = home_avg - road_avg,
hr_rate_split = home_hr_rate - road_hr_rate,
avg_ratio = home_avg / road_avg,
hr_ratio = home_hr_rate / road_hr_rate
)
print("\nCoors Field Player Splits:")
print(player_coors %>%
select(player, avg_split, hr_rate_split, avg_ratio, hr_ratio))
# Visualize splits
player_coors_long <- player_coors %>%
select(player, home_avg, road_avg) %>%
pivot_longer(cols = c(home_avg, road_avg),
names_to = "location",
values_to = "batting_avg") %>%
mutate(location = ifelse(location == "home_avg", "Home (Coors)", "Road"))
ggplot(player_coors_long, aes(x = player, y = batting_avg, fill = location)) +
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("purple", "gray60"), name = "") +
coord_flip() +
labs(title = "Home vs Road Batting Average",
subtitle = "Notable Rockies Hitters",
x = "",
y = "Batting Average") +
theme_minimal() +
theme(legend.position = "top")
# Python: Altitude Effects on Ball Flight
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
class AltitudeEffects:
"""
Model altitude effects on baseball physics
"""
def __init__(self):
self.g = 9.8 # m/s^2
self.sea_level_pressure = 101325 # Pa
self.sea_level_temp = 288.15 # K
self.lapse_rate = 0.0065 # K/m
def pressure_at_altitude(self, altitude_ft):
"""Calculate atmospheric pressure at altitude"""
altitude_m = altitude_ft * 0.3048
# Barometric formula
temp = self.sea_level_temp - self.lapse_rate * altitude_m
pressure = self.sea_level_pressure * (
(temp / self.sea_level_temp) ** 5.255
)
return pressure / 100 # Convert to mb
def air_density_ratio(self, altitude_ft, temp_f=70):
"""
Calculate air density relative to sea level
"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Pressure at altitude
pressure_alt = self.pressure_at_altitude(altitude_ft)
# Density ratio (simplified)
altitude_m = altitude_ft * 0.3048
temp_alt = temp_k - self.lapse_rate * altitude_m
rho_ratio = (pressure_alt / (self.sea_level_pressure / 100)) * (
self.sea_level_temp / temp_alt
)
return rho_ratio
def distance_multiplier(self, altitude_ft, temp_f=70):
"""
Estimate distance multiplier due to altitude
Empirical relationship: distance increases ~4% per 1000 ft
"""
rho_ratio = self.air_density_ratio(altitude_ft, temp_f)
# Distance is inversely related to drag, which is proportional to density
# Approximately: distance_ratio = rho_ratio^(-0.5)
# More accurate empirical fit: distance_ratio = 1 / (rho_ratio^0.55)
distance_ratio = 1 / (rho_ratio ** 0.55)
return distance_ratio
# Initialize
alt_model = AltitudeEffects()
# Compare major league ballparks by elevation
ballparks = pd.DataFrame({
'park': ['Coors Field', 'Chase Field', 'Kauffman Stadium',
'Dodger Stadium', 'Wrigley Field', 'Fenway Park',
'Yankee Stadium', 'Oracle Park'],
'city': ['Denver', 'Phoenix', 'Kansas City', 'Los Angeles',
'Chicago', 'Boston', 'New York', 'San Francisco'],
'elevation_ft': [5200, 1090, 750, 340, 595, 20, 55, 0],
'avg_temp_summer': [73, 95, 79, 75, 73, 73, 77, 65]
})
# Calculate effects
ballparks['pressure_mb'] = ballparks['elevation_ft'].apply(
alt_model.pressure_at_altitude
)
ballparks['density_ratio'] = ballparks.apply(
lambda row: alt_model.air_density_ratio(row['elevation_ft'],
row['avg_temp_summer']),
axis=1
)
ballparks['distance_multiplier'] = ballparks.apply(
lambda row: alt_model.distance_multiplier(row['elevation_ft'],
row['avg_temp_summer']),
axis=1
)
# Calculate distance for standard 400-foot fly ball
ballparks['actual_distance'] = 400 * ballparks['distance_multiplier']
ballparks['distance_gain'] = ballparks['actual_distance'] - 400
print("Altitude Effects by Ballpark:\n")
print(ballparks[['park', 'elevation_ft', 'density_ratio',
'distance_multiplier', 'actual_distance',
'distance_gain']].to_string(index=False))
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Elevation vs distance gain
axes[0, 0].scatter(ballparks['elevation_ft'], ballparks['distance_gain'],
s=100, alpha=0.6, c=ballparks['elevation_ft'],
cmap='terrain')
for idx, row in ballparks.iterrows():
if row['elevation_ft'] > 500:
axes[0, 0].annotate(row['city'],
(row['elevation_ft'], row['distance_gain']),
fontsize=9, alpha=0.8)
axes[0, 0].set_xlabel('Elevation (feet)')
axes[0, 0].set_ylabel('Distance Gained (feet)')
axes[0, 0].set_title('Altitude Effect on 400-ft Fly Ball')
axes[0, 0].grid(True, alpha=0.3)
# Air density ratio by park
parks_sorted = ballparks.sort_values('density_ratio')
colors = ['red' if x > 5000 else 'blue' if x > 1000 else 'green'
for x in parks_sorted['elevation_ft']]
axes[0, 1].barh(range(len(parks_sorted)), parks_sorted['density_ratio'],
color=colors, alpha=0.6)
axes[0, 1].set_yticks(range(len(parks_sorted)))
axes[0, 1].set_yticklabels(parks_sorted['city'])
axes[0, 1].set_xlabel('Air Density Ratio (vs sea level)')
axes[0, 1].set_title('Relative Air Density by Ballpark')
axes[0, 1].grid(True, alpha=0.3, axis='x')
# Distance multiplier across range of elevations
elevations = np.linspace(0, 6000, 100)
multipliers = [alt_model.distance_multiplier(e) for e in elevations]
axes[1, 0].plot(elevations, multipliers, linewidth=2, color='purple')
axes[1, 0].axhline(y=1.0, linestyle='--', color='gray', alpha=0.5)
axes[1, 0].scatter([5200], [alt_model.distance_multiplier(5200)],
s=200, c='red', zorder=5, label='Coors Field')
axes[1, 0].set_xlabel('Elevation (feet)')
axes[1, 0].set_ylabel('Distance Multiplier')
axes[1, 0].set_title('Ball Flight Distance vs Elevation')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Expected vs actual Coors park factor
# Theoretical prediction vs observed
years = np.arange(1995, 2024)
theoretical_pf = 100 * alt_model.distance_multiplier(5200)
# Simulate observed park factors (with humidor effects)
observed_pf = []
for year in years:
if year < 2002:
pf = np.random.normal(125, 3) # Pre-humidor
elif year < 2019:
pf = np.random.normal(115, 3) # Humidor v1
else:
pf = np.random.normal(110, 2) # Enhanced humidor
observed_pf.append(pf)
axes[1, 1].plot(years, observed_pf, label='Observed', linewidth=2)
axes[1, 1].axhline(y=theoretical_pf, linestyle='--', color='red',
linewidth=2, label=f'Theoretical ({theoretical_pf:.0f})')
axes[1, 1].axvline(x=2002, linestyle=':', color='blue', alpha=0.5)
axes[1, 1].axvline(x=2019, linestyle=':', color='green', alpha=0.5)
axes[1, 1].annotate('Humidor', xy=(2002, 120), fontsize=9)
axes[1, 1].annotate('Enhanced', xy=(2019, 120), fontsize=9)
axes[1, 1].set_xlabel('Year')
axes[1, 1].set_ylabel('Park Factor')
axes[1, 1].set_title('Coors Field Park Factor Over Time')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('altitude_effects_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Python: Analyze breaking ball effectiveness at different elevations
def breaking_ball_altitude_effect():
"""
Model breaking ball movement at different altitudes
"""
# Magnus force depends on air density
# F_magnus = S * rho * A * v * omega
# Where omega is spin rate
alt_model = AltitudeEffects()
# Standard curveball parameters
velocity_mph = 78
spin_rpm = 2400
# Convert units
v = velocity_mph * 0.44704 # m/s
omega = spin_rpm * 2 * np.pi / 60 # rad/s
elevations = [0, 1000, 2000, 3000, 4000, 5200]
results = []
for elevation in elevations:
rho_ratio = alt_model.air_density_ratio(elevation)
# Break is proportional to Magnus force, which is proportional to density
break_ratio = rho_ratio
# Assuming sea level break is 12 inches
sea_level_break = 12
actual_break = sea_level_break * break_ratio
break_reduction = sea_level_break - actual_break
results.append({
'elevation': elevation,
'density_ratio': rho_ratio,
'break_inches': actual_break,
'break_reduction': break_reduction,
'break_pct_reduction': (break_reduction / sea_level_break) * 100
})
results_df = pd.DataFrame(results)
print("Breaking Ball Movement at Altitude:")
print("(78 mph curveball with 2400 RPM)\n")
print(results_df.to_string(index=False))
# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(results_df['elevation'], results_df['break_inches'],
'o-', linewidth=2, markersize=8, color='red')
ax.axhline(y=12, linestyle='--', color='gray', alpha=0.5,
label='Sea Level Break')
ax.scatter([5200], [results_df[results_df['elevation']==5200]['break_inches'].values[0]],
s=300, c='darkred', zorder=5, marker='*', label='Coors Field')
ax.set_xlabel('Elevation (feet)', fontsize=12)
ax.set_ylabel('Curveball Break (inches)', fontsize=12)
ax.set_title('Breaking Ball Movement vs Altitude', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()
return results_df
breaking_df = breaking_ball_altitude_effect()
# Calculate pitcher disadvantage at Coors
coors_break = breaking_df[breaking_df['elevation']==5200]['break_inches'].values[0]
sea_level_break = breaking_df[breaking_df['elevation']==0]['break_inches'].values[0]
print(f"\nCurveball at Coors: {coors_break:.1f} inches break")
print(f"Curveball at sea level: {sea_level_break:.1f} inches break")
print(f"Reduction: {sea_level_break - coors_break:.1f} inches ({((sea_level_break - coors_break)/sea_level_break*100):.1f}%)")
Temperature's Multi-Faceted Impact
Temperature affects baseball through several mechanisms:
- Ball Elasticity: Warmer balls are more elastic, bouncing off bats with higher exit velocity
- Air Density: Warmer air is less dense, reducing drag
- Player Physiology: Muscles function better in moderate warmth
Empirical Rule of Thumb: Every 10°F increase adds approximately 2-3 feet to fly ball distance.
R Analysis: Temperature and Offense
# R: Temperature Impact on MLB Offense
library(tidyverse)
# Simulate realistic temperature-performance data
set.seed(456)
n_games <- 1000
temp_data <- tibble(
game_id = 1:n_games,
temperature = round(rnorm(n_games, 75, 12)),
# Add realistic constraints
temperature = pmin(pmax(temperature, 40), 105),
# Generate runs with temperature effect
# Base: 4.5 runs, +0.25 runs per 10°F above 70
base_runs = 4.5,
temp_effect = (temperature - 70) * 0.025,
random_effect = rnorm(n_games, 0, 1.5),
total_runs = round(pmax(base_runs + temp_effect + random_effect, 0)),
# Home runs with stronger temperature effect
base_hr = 1.2,
hr_temp_effect = (temperature - 70) * 0.015,
hr_random = rnorm(n_games, 0, 0.5),
home_runs = round(pmax(base_hr + hr_temp_effect + hr_random, 0))
)
# Statistical analysis
temp_model <- lm(total_runs ~ temperature, data = temp_data)
hr_model <- lm(home_runs ~ temperature, data = temp_data)
cat("Temperature Effect on Runs:\n")
cat(sprintf("Coefficient: %.4f runs per degree F\n", coef(temp_model)[2]))
cat(sprintf("Per 10°F: %.2f runs\n", coef(temp_model)[2] * 10))
cat(sprintf("R-squared: %.3f\n", summary(temp_model)$r.squared))
cat(sprintf("P-value: %.4f\n\n", summary(temp_model)$coefficients[2, 4]))
cat("Temperature Effect on Home Runs:\n")
cat(sprintf("Coefficient: %.4f HR per degree F\n", coef(hr_model)[2]))
cat(sprintf("Per 10°F: %.2f HR\n", coef(hr_model)[2] * 10))
cat(sprintf("R-squared: %.3f\n", summary(hr_model)$r.squared))
# Bin by temperature ranges
temp_bins <- temp_data %>%
mutate(
temp_category = cut(temperature,
breaks = c(0, 55, 65, 75, 85, 110),
labels = c("Cold (<55)", "Cool (55-65)",
"Mild (65-75)", "Warm (75-85)",
"Hot (85+)"))
) %>%
group_by(temp_category) %>%
summarise(
n_games = n(),
avg_runs = mean(total_runs),
avg_hr = mean(home_runs),
sd_runs = sd(total_runs),
se_runs = sd_runs / sqrt(n_games),
.groups = 'drop'
)
print("\nScoring by Temperature Range:")
print(temp_bins)
# Visualizations
p1 <- ggplot(temp_data, aes(x = temperature, y = total_runs)) +
geom_point(alpha = 0.3, color = "coral") +
geom_smooth(method = "lm", se = TRUE, color = "darkred", size = 1.5) +
labs(title = "Temperature vs Total Runs",
x = "Temperature (°F)",
y = "Total Runs") +
theme_minimal()
p2 <- ggplot(temp_bins, aes(x = temp_category, y = avg_runs)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_errorbar(aes(ymin = avg_runs - se_runs, ymax = avg_runs + se_runs),
width = 0.2) +
labs(title = "Average Runs by Temperature Category",
x = "Temperature Range",
y = "Average Runs per Game") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(patchwork)
p1 + p2
# Predict scoring for extreme temperatures
new_temps <- data.frame(temperature = c(45, 60, 75, 90, 100))
predictions <- predict(temp_model, new_temps, interval = "confidence")
cat("\nPredicted Runs at Different Temperatures:\n")
print(cbind(new_temps, predictions))
Python Implementation: Humidity Analysis
# Python: Humidity Effects on Baseball
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
class HumidityEffect:
"""
Model humidity effects on air density and ball flight
"""
def __init__(self):
self.R = 8.314 # Gas constant
self.M_dry = 0.029 # Molar mass dry air
self.M_water = 0.018 # Molar mass water vapor
def air_density_with_humidity(self, temp_f, pressure_mb, humidity_pct):
"""
Calculate air density accounting for humidity
Counterintuitive: Higher humidity = LOWER density
Water molecules (H2O, M=18) displace nitrogen/oxygen (M=28/32)
"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Saturation vapor pressure (Magnus formula)
e_s = 611.2 * np.exp(17.67 * temp_c / (temp_c + 243.5))
# Actual vapor pressure
e = (humidity_pct / 100) * e_s
# Pressure in Pa
P = pressure_mb * 100
# Partial pressure of dry air
P_d = P - e
# Air density
rho = (P_d * self.M_dry + e * self.M_water) / (self.R * temp_k)
return rho
def humidity_distance_effect(self, temp_f, base_humidity=50):
"""
Calculate distance change due to humidity
Higher humidity = less dense air = farther distance
"""
pressure = 1013.25 # Standard pressure
# Base density at 50% humidity
rho_base = self.air_density_with_humidity(temp_f, pressure, base_humidity)
# Calculate for range of humidities
humidities = np.arange(10, 100, 10)
densities = [self.air_density_with_humidity(temp_f, pressure, h)
for h in humidities]
# Distance is inversely proportional to density^0.55
distance_ratios = [(rho_base / rho) ** 0.55 for rho in densities]
return pd.DataFrame({
'humidity': humidities,
'density': densities,
'distance_ratio': distance_ratios,
'distance_400ft': [400 * ratio for ratio in distance_ratios]
})
# Analysis
humidity_model = HumidityEffect()
# Compare different temperatures
temps = [50, 70, 90]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, temp in enumerate(temps):
df = humidity_model.humidity_distance_effect(temp)
axes[idx].plot(df['humidity'], df['distance_400ft'], 'o-',
linewidth=2, markersize=6)
axes[idx].axhline(y=400, linestyle='--', color='gray', alpha=0.5)
axes[idx].set_xlabel('Humidity (%)', fontsize=11)
axes[idx].set_ylabel('Distance (feet)', fontsize=11)
axes[idx].set_title(f'Temperature: {temp}°F', fontsize=12, fontweight='bold')
axes[idx].grid(True, alpha=0.3)
axes[idx].set_ylim([399, 406])
plt.tight_layout()
plt.savefig('humidity_effects.png', dpi=300, bbox_inches='tight')
plt.show()
# Calculate typical effect size
df_70 = humidity_model.humidity_distance_effect(70)
dry_distance = df_70[df_70['humidity']==10]['distance_400ft'].values[0]
humid_distance = df_70[df_70['humidity']==90]['distance_400ft'].values[0]
print(f"At 70°F, 400-foot fly ball:")
print(f"At 10% humidity: {dry_distance:.1f} feet")
print(f"At 90% humidity: {humid_distance:.1f} feet")
print(f"Difference: {humid_distance - dry_distance:.1f} feet")
print(f"\nCounterIntuitive: Higher humidity HELPS hitting!")
print("Water vapor is lighter than air, reducing air density")
# Combined temperature-humidity effect
def combined_weather_effect():
"""Model combined temperature and humidity effects"""
temps = np.linspace(50, 95, 10)
humidities = np.linspace(20, 80, 10)
T, H = np.meshgrid(temps, humidities)
# Calculate distance for each combination
distances = np.zeros_like(T)
for i in range(len(humidities)):
for j in range(len(temps)):
rho = humidity_model.air_density_with_humidity(
T[i, j], 1013.25, H[i, j]
)
# Reference: sea level, 70°F, 50% humidity
rho_ref = humidity_model.air_density_with_humidity(
70, 1013.25, 50
)
distances[i, j] = 400 * (rho_ref / rho) ** 0.55
# Create contour plot
fig, ax = plt.subplots(figsize=(10, 7))
contour = ax.contourf(T, H, distances, levels=15, cmap='RdYlGn')
contour_lines = ax.contour(T, H, distances, levels=10, colors='black',
alpha=0.3, linewidths=0.5)
ax.clabel(contour_lines, inline=True, fontsize=8, fmt='%.0f ft')
cbar = plt.colorbar(contour, ax=ax, label='Distance (feet)')
ax.set_xlabel('Temperature (°F)', fontsize=12)
ax.set_ylabel('Humidity (%)', fontsize=12)
ax.set_title('Combined Temperature-Humidity Effect on 400-ft Fly Ball',
fontsize=14, fontweight='bold')
# Mark some notable conditions
ax.plot(50, 30, 'b*', markersize=15, label='Cold & Dry')
ax.plot(95, 80, 'r*', markersize=15, label='Hot & Humid')
ax.legend(loc='upper left')
plt.tight_layout()
plt.savefig('temp_humidity_combined.png', dpi=300, bbox_inches='tight')
plt.show()
combined_weather_effect()
Real-World Application: Weather-Based Betting
# Python: Using weather data for run total predictions
def weather_betting_model():
"""
Build model to predict over/under based on weather
"""
# Simulate historical game data with weather
np.random.seed(789)
n = 2000
data = pd.DataFrame({
'temp': np.random.normal(72, 13, n),
'humidity': np.random.normal(55, 18, n),
'wind_speed': np.abs(np.random.normal(6, 4, n)),
'wind_direction': np.random.choice(['In', 'Out', 'Cross', 'Calm'], n,
p=[0.25, 0.25, 0.3, 0.2]),
'park_factor': np.random.normal(100, 8, n)
})
# Generate runs with realistic relationships
base_runs = 8.5
temp_effect = (data['temp'] - 70) * 0.04
humidity_effect = (data['humidity'] - 50) * 0.01
wind_effect = np.where(data['wind_direction'] == 'Out',
data['wind_speed'] * 0.08,
-data['wind_speed'] * 0.03)
park_effect = (data['park_factor'] - 100) * 0.03
noise = np.random.normal(0, 1.8, n)
data['total_runs'] = (base_runs + temp_effect + humidity_effect +
wind_effect + park_effect + noise).clip(2, 20)
data['total_runs'] = data['total_runs'].round()
# Build regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
# Encode wind direction
data_encoded = pd.get_dummies(data, columns=['wind_direction'],
drop_first=True)
X = data_encoded.drop('total_runs', axis=1)
y = data_encoded['total_runs']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Weather-Based Run Prediction Model")
print("=" * 50)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f} runs")
print(f"\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.4f}")
# Example prediction
example_game = pd.DataFrame({
'temp': [85],
'humidity': [70],
'wind_speed': [10],
'park_factor': [110],
'wind_direction_Out': [1],
'wind_direction_Calm': [0],
'wind_direction_Cross': [0]
})
prediction = model.predict(example_game)[0]
print(f"\nExample Prediction:")
print(f"Conditions: 85°F, 70% humidity, 10mph wind OUT, PF=110")
print(f"Predicted Total Runs: {prediction:.1f}")
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(y_test, y_pred, alpha=0.3, s=20)
axes[0].plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
'r--', linewidth=2)
axes[0].set_xlabel('Actual Total Runs')
axes[0].set_ylabel('Predicted Total Runs')
axes[0].set_title('Prediction Accuracy')
axes[0].grid(True, alpha=0.3)
residuals = y_test - y_pred
axes[1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Prediction Error (runs)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Errors')
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
return model, data_encoded
weather_model, weather_data = weather_betting_model()
# R: Temperature Impact on MLB Offense
library(tidyverse)
# Simulate realistic temperature-performance data
set.seed(456)
n_games <- 1000
temp_data <- tibble(
game_id = 1:n_games,
temperature = round(rnorm(n_games, 75, 12)),
# Add realistic constraints
temperature = pmin(pmax(temperature, 40), 105),
# Generate runs with temperature effect
# Base: 4.5 runs, +0.25 runs per 10°F above 70
base_runs = 4.5,
temp_effect = (temperature - 70) * 0.025,
random_effect = rnorm(n_games, 0, 1.5),
total_runs = round(pmax(base_runs + temp_effect + random_effect, 0)),
# Home runs with stronger temperature effect
base_hr = 1.2,
hr_temp_effect = (temperature - 70) * 0.015,
hr_random = rnorm(n_games, 0, 0.5),
home_runs = round(pmax(base_hr + hr_temp_effect + hr_random, 0))
)
# Statistical analysis
temp_model <- lm(total_runs ~ temperature, data = temp_data)
hr_model <- lm(home_runs ~ temperature, data = temp_data)
cat("Temperature Effect on Runs:\n")
cat(sprintf("Coefficient: %.4f runs per degree F\n", coef(temp_model)[2]))
cat(sprintf("Per 10°F: %.2f runs\n", coef(temp_model)[2] * 10))
cat(sprintf("R-squared: %.3f\n", summary(temp_model)$r.squared))
cat(sprintf("P-value: %.4f\n\n", summary(temp_model)$coefficients[2, 4]))
cat("Temperature Effect on Home Runs:\n")
cat(sprintf("Coefficient: %.4f HR per degree F\n", coef(hr_model)[2]))
cat(sprintf("Per 10°F: %.2f HR\n", coef(hr_model)[2] * 10))
cat(sprintf("R-squared: %.3f\n", summary(hr_model)$r.squared))
# Bin by temperature ranges
temp_bins <- temp_data %>%
mutate(
temp_category = cut(temperature,
breaks = c(0, 55, 65, 75, 85, 110),
labels = c("Cold (<55)", "Cool (55-65)",
"Mild (65-75)", "Warm (75-85)",
"Hot (85+)"))
) %>%
group_by(temp_category) %>%
summarise(
n_games = n(),
avg_runs = mean(total_runs),
avg_hr = mean(home_runs),
sd_runs = sd(total_runs),
se_runs = sd_runs / sqrt(n_games),
.groups = 'drop'
)
print("\nScoring by Temperature Range:")
print(temp_bins)
# Visualizations
p1 <- ggplot(temp_data, aes(x = temperature, y = total_runs)) +
geom_point(alpha = 0.3, color = "coral") +
geom_smooth(method = "lm", se = TRUE, color = "darkred", size = 1.5) +
labs(title = "Temperature vs Total Runs",
x = "Temperature (°F)",
y = "Total Runs") +
theme_minimal()
p2 <- ggplot(temp_bins, aes(x = temp_category, y = avg_runs)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_errorbar(aes(ymin = avg_runs - se_runs, ymax = avg_runs + se_runs),
width = 0.2) +
labs(title = "Average Runs by Temperature Category",
x = "Temperature Range",
y = "Average Runs per Game") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
library(patchwork)
p1 + p2
# Predict scoring for extreme temperatures
new_temps <- data.frame(temperature = c(45, 60, 75, 90, 100))
predictions <- predict(temp_model, new_temps, interval = "confidence")
cat("\nPredicted Runs at Different Temperatures:\n")
print(cbind(new_temps, predictions))
# Python: Humidity Effects on Baseball
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
class HumidityEffect:
"""
Model humidity effects on air density and ball flight
"""
def __init__(self):
self.R = 8.314 # Gas constant
self.M_dry = 0.029 # Molar mass dry air
self.M_water = 0.018 # Molar mass water vapor
def air_density_with_humidity(self, temp_f, pressure_mb, humidity_pct):
"""
Calculate air density accounting for humidity
Counterintuitive: Higher humidity = LOWER density
Water molecules (H2O, M=18) displace nitrogen/oxygen (M=28/32)
"""
# Convert temperature
temp_c = (temp_f - 32) * 5/9
temp_k = temp_c + 273.15
# Saturation vapor pressure (Magnus formula)
e_s = 611.2 * np.exp(17.67 * temp_c / (temp_c + 243.5))
# Actual vapor pressure
e = (humidity_pct / 100) * e_s
# Pressure in Pa
P = pressure_mb * 100
# Partial pressure of dry air
P_d = P - e
# Air density
rho = (P_d * self.M_dry + e * self.M_water) / (self.R * temp_k)
return rho
def humidity_distance_effect(self, temp_f, base_humidity=50):
"""
Calculate distance change due to humidity
Higher humidity = less dense air = farther distance
"""
pressure = 1013.25 # Standard pressure
# Base density at 50% humidity
rho_base = self.air_density_with_humidity(temp_f, pressure, base_humidity)
# Calculate for range of humidities
humidities = np.arange(10, 100, 10)
densities = [self.air_density_with_humidity(temp_f, pressure, h)
for h in humidities]
# Distance is inversely proportional to density^0.55
distance_ratios = [(rho_base / rho) ** 0.55 for rho in densities]
return pd.DataFrame({
'humidity': humidities,
'density': densities,
'distance_ratio': distance_ratios,
'distance_400ft': [400 * ratio for ratio in distance_ratios]
})
# Analysis
humidity_model = HumidityEffect()
# Compare different temperatures
temps = [50, 70, 90]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, temp in enumerate(temps):
df = humidity_model.humidity_distance_effect(temp)
axes[idx].plot(df['humidity'], df['distance_400ft'], 'o-',
linewidth=2, markersize=6)
axes[idx].axhline(y=400, linestyle='--', color='gray', alpha=0.5)
axes[idx].set_xlabel('Humidity (%)', fontsize=11)
axes[idx].set_ylabel('Distance (feet)', fontsize=11)
axes[idx].set_title(f'Temperature: {temp}°F', fontsize=12, fontweight='bold')
axes[idx].grid(True, alpha=0.3)
axes[idx].set_ylim([399, 406])
plt.tight_layout()
plt.savefig('humidity_effects.png', dpi=300, bbox_inches='tight')
plt.show()
# Calculate typical effect size
df_70 = humidity_model.humidity_distance_effect(70)
dry_distance = df_70[df_70['humidity']==10]['distance_400ft'].values[0]
humid_distance = df_70[df_70['humidity']==90]['distance_400ft'].values[0]
print(f"At 70°F, 400-foot fly ball:")
print(f"At 10% humidity: {dry_distance:.1f} feet")
print(f"At 90% humidity: {humid_distance:.1f} feet")
print(f"Difference: {humid_distance - dry_distance:.1f} feet")
print(f"\nCounterIntuitive: Higher humidity HELPS hitting!")
print("Water vapor is lighter than air, reducing air density")
# Combined temperature-humidity effect
def combined_weather_effect():
"""Model combined temperature and humidity effects"""
temps = np.linspace(50, 95, 10)
humidities = np.linspace(20, 80, 10)
T, H = np.meshgrid(temps, humidities)
# Calculate distance for each combination
distances = np.zeros_like(T)
for i in range(len(humidities)):
for j in range(len(temps)):
rho = humidity_model.air_density_with_humidity(
T[i, j], 1013.25, H[i, j]
)
# Reference: sea level, 70°F, 50% humidity
rho_ref = humidity_model.air_density_with_humidity(
70, 1013.25, 50
)
distances[i, j] = 400 * (rho_ref / rho) ** 0.55
# Create contour plot
fig, ax = plt.subplots(figsize=(10, 7))
contour = ax.contourf(T, H, distances, levels=15, cmap='RdYlGn')
contour_lines = ax.contour(T, H, distances, levels=10, colors='black',
alpha=0.3, linewidths=0.5)
ax.clabel(contour_lines, inline=True, fontsize=8, fmt='%.0f ft')
cbar = plt.colorbar(contour, ax=ax, label='Distance (feet)')
ax.set_xlabel('Temperature (°F)', fontsize=12)
ax.set_ylabel('Humidity (%)', fontsize=12)
ax.set_title('Combined Temperature-Humidity Effect on 400-ft Fly Ball',
fontsize=14, fontweight='bold')
# Mark some notable conditions
ax.plot(50, 30, 'b*', markersize=15, label='Cold & Dry')
ax.plot(95, 80, 'r*', markersize=15, label='Hot & Humid')
ax.legend(loc='upper left')
plt.tight_layout()
plt.savefig('temp_humidity_combined.png', dpi=300, bbox_inches='tight')
plt.show()
combined_weather_effect()
# Python: Using weather data for run total predictions
def weather_betting_model():
"""
Build model to predict over/under based on weather
"""
# Simulate historical game data with weather
np.random.seed(789)
n = 2000
data = pd.DataFrame({
'temp': np.random.normal(72, 13, n),
'humidity': np.random.normal(55, 18, n),
'wind_speed': np.abs(np.random.normal(6, 4, n)),
'wind_direction': np.random.choice(['In', 'Out', 'Cross', 'Calm'], n,
p=[0.25, 0.25, 0.3, 0.2]),
'park_factor': np.random.normal(100, 8, n)
})
# Generate runs with realistic relationships
base_runs = 8.5
temp_effect = (data['temp'] - 70) * 0.04
humidity_effect = (data['humidity'] - 50) * 0.01
wind_effect = np.where(data['wind_direction'] == 'Out',
data['wind_speed'] * 0.08,
-data['wind_speed'] * 0.03)
park_effect = (data['park_factor'] - 100) * 0.03
noise = np.random.normal(0, 1.8, n)
data['total_runs'] = (base_runs + temp_effect + humidity_effect +
wind_effect + park_effect + noise).clip(2, 20)
data['total_runs'] = data['total_runs'].round()
# Build regression model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
# Encode wind direction
data_encoded = pd.get_dummies(data, columns=['wind_direction'],
drop_first=True)
X = data_encoded.drop('total_runs', axis=1)
y = data_encoded['total_runs']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Weather-Based Run Prediction Model")
print("=" * 50)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f} runs")
print(f"\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.4f}")
# Example prediction
example_game = pd.DataFrame({
'temp': [85],
'humidity': [70],
'wind_speed': [10],
'park_factor': [110],
'wind_direction_Out': [1],
'wind_direction_Calm': [0],
'wind_direction_Cross': [0]
})
prediction = model.predict(example_game)[0]
print(f"\nExample Prediction:")
print(f"Conditions: 85°F, 70% humidity, 10mph wind OUT, PF=110")
print(f"Predicted Total Runs: {prediction:.1f}")
# Visualize predictions vs actual
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(y_test, y_pred, alpha=0.3, s=20)
axes[0].plot([y_test.min(), y_test.max()],
[y_test.min(), y_test.max()],
'r--', linewidth=2)
axes[0].set_xlabel('Actual Total Runs')
axes[0].set_ylabel('Predicted Total Runs')
axes[0].set_title('Prediction Accuracy')
axes[0].grid(True, alpha=0.3)
residuals = y_test - y_pred
axes[1].hist(residuals, bins=30, edgecolor='black', alpha=0.7)
axes[1].set_xlabel('Prediction Error (runs)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Errors')
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
return model, data_encoded
weather_model, weather_data = weather_betting_model()
Wind's Directional Impact
Wind affects batted balls differently based on direction:
- Blowing Out: Helps fly balls, especially to the outfield wind is blowing toward
- Blowing In: Suppresses fly balls
- Cross Wind: Affects lateral movement, can turn home runs foul
- Swirling: Creates unpredictability, especially at Wrigley Field
Rule of Thumb: 10 mph wind blowing out adds ~15-25 feet to fly balls; blowing in subtracts similar amounts.
R Implementation: Wind Effect Analysis
# R: Analyze Wind Effects
library(tidyverse)
# Simulate wind and batted ball outcomes
set.seed(321)
wind_data <- expand_grid(
wind_direction = c("Out to CF", "Out to LF", "Out to RF",
"In from CF", "Cross L-to-R", "Cross R-to-L", "Calm"),
wind_speed = c(0, 5, 10, 15, 20),
exit_velo = seq(95, 110, 5),
launch_angle = seq(20, 35, 5)
) %>%
rowwise() %>%
mutate(
# Base distance
base_distance = -140 + 4.5 * exit_velo + 6 * launch_angle -
0.08 * launch_angle^2,
# Wind effect
wind_effect = case_when(
wind_direction == "Calm" ~ 0,
grepl("Out", wind_direction) ~ wind_speed * 1.8,
grepl("In", wind_direction) ~ -wind_speed * 2.0,
grepl("Cross", wind_direction) ~ wind_speed * 0.3
),
# Actual distance
actual_distance = base_distance + wind_effect + rnorm(1, 0, 5),
# Outcome
is_hr = actual_distance >= 400
) %>%
ungroup()
# Analyze wind impact on HR rate
hr_by_wind <- wind_data %>%
group_by(wind_direction, wind_speed) %>%
summarise(
batted_balls = n(),
home_runs = sum(is_hr),
hr_rate = mean(is_hr),
avg_distance = mean(actual_distance),
.groups = 'drop'
)
# Focus on meaningful wind speeds
hr_summary <- wind_data %>%
filter(wind_speed %in% c(0, 10, 20)) %>%
mutate(wind_label = paste(wind_direction, wind_speed, "mph")) %>%
group_by(wind_label) %>%
summarise(
hr_rate = mean(is_hr),
avg_distance = mean(actual_distance),
.groups = 'drop'
) %>%
arrange(desc(hr_rate))
print("Home Run Rate by Wind Conditions:")
print(hr_summary, n = 25)
# Visualize
wind_out <- wind_data %>%
filter(wind_direction %in% c("Out to CF", "In from CF", "Calm")) %>%
mutate(wind_direction = factor(wind_direction,
levels = c("In from CF", "Calm", "Out to CF")))
ggplot(wind_out, aes(x = wind_speed, y = actual_distance,
color = wind_direction)) +
geom_smooth(method = "lm", se = TRUE, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
name = "Wind Direction") +
labs(title = "Wind Impact on Batted Ball Distance",
subtitle = "All exit velocities and launch angles",
x = "Wind Speed (mph)",
y = "Distance (feet)") +
theme_minimal() +
theme(legend.position = "top")
# Statistical model
wind_model <- lm(actual_distance ~ wind_speed * wind_direction +
exit_velo + launch_angle,
data = wind_data)
cat("\nWind Effect Model:\n")
print(summary(wind_model)$coefficients[2:3, ])
# Calculate expected distance change
wind_predictions <- expand_grid(
wind_speed = c(0, 10, 20),
wind_direction = "Out to CF",
exit_velo = 105,
launch_angle = 28
)
wind_predictions$predicted_distance <- predict(wind_model, wind_predictions)
cat("\nPredicted Distance (105 mph EV, 28° LA):\n")
print(wind_predictions %>% select(wind_speed, wind_direction, predicted_distance))
Python Implementation: Wrigley Field Wind Analysis
# Python: Wrigley Field Wind Patterns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
# Wrigley Field is famous for wind effects
# Lake Michigan creates unique patterns
def wrigley_wind_analysis():
"""
Analyze characteristic wind patterns at Wrigley Field
"""
# Simulate Wrigley wind data by month
months = ['April', 'May', 'June', 'July', 'August', 'September', 'October']
# Wrigley: wind often blows IN in April/May, OUT in summer
wind_patterns = pd.DataFrame({
'month': months,
'avg_wind_speed': [12, 11, 9, 8, 8, 9, 11],
'pct_blowing_out': [25, 35, 45, 50, 48, 40, 30],
'pct_blowing_in': [45, 40, 30, 25, 27, 35, 40],
'pct_cross': [30, 25, 25, 25, 25, 25, 30],
'avg_runs_per_game': [8.2, 8.9, 9.4, 9.7, 9.5, 9.1, 8.5],
'avg_hr_per_game': [2.3, 2.7, 3.1, 3.4, 3.2, 2.9, 2.5]
})
print("Wrigley Field Wind Patterns by Month:")
print(wind_patterns.to_string(index=False))
# Correlation between wind direction and offense
from scipy.stats import pearsonr
corr_runs, p_runs = pearsonr(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'])
corr_hr, p_hr = pearsonr(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'])
print(f"\nCorrelation Analysis:")
print(f"Wind Out % vs Runs/Game: r = {corr_runs:.3f}, p = {p_runs:.3f}")
print(f"Wind Out % vs HR/Game: r = {corr_hr:.3f}, p = {p_hr:.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Wind direction by month
x = np.arange(len(months))
width = 0.25
axes[0, 0].bar(x - width, wind_patterns['pct_blowing_out'], width,
label='Blowing Out', color='red', alpha=0.7)
axes[0, 0].bar(x, wind_patterns['pct_blowing_in'], width,
label='Blowing In', color='blue', alpha=0.7)
axes[0, 0].bar(x + width, wind_patterns['pct_cross'], width,
label='Cross Wind', color='gray', alpha=0.7)
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Percentage of Games (%)')
axes[0, 0].set_title('Wrigley Field Wind Direction by Month')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(months, rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3, axis='y')
# Scoring vs wind out percentage
axes[0, 1].scatter(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'],
s=100, alpha=0.6)
for idx, month in enumerate(months):
axes[0, 1].annotate(month,
(wind_patterns.iloc[idx]['pct_blowing_out'],
wind_patterns.iloc[idx]['avg_runs_per_game']),
fontsize=9, alpha=0.7)
# Add trend line
z = np.polyfit(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'], 1)
p = np.poly1d(z)
axes[0, 1].plot(wind_patterns['pct_blowing_out'],
p(wind_patterns['pct_blowing_out']),
"r--", alpha=0.8, linewidth=2)
axes[0, 1].set_xlabel('% Games with Wind Blowing Out')
axes[0, 1].set_ylabel('Average Runs per Game')
axes[0, 1].set_title('Wind Direction vs Scoring')
axes[0, 1].grid(True, alpha=0.3)
# Home runs vs wind
axes[1, 0].scatter(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'],
s=100, alpha=0.6, color='orange')
for idx, month in enumerate(months):
axes[1, 0].annotate(month,
(wind_patterns.iloc[idx]['pct_blowing_out'],
wind_patterns.iloc[idx]['avg_hr_per_game']),
fontsize=9, alpha=0.7)
z = np.polyfit(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'], 1)
p = np.poly1d(z)
axes[1, 0].plot(wind_patterns['pct_blowing_out'],
p(wind_patterns['pct_blowing_out']),
"r--", alpha=0.8, linewidth=2)
axes[1, 0].set_xlabel('% Games with Wind Blowing Out')
axes[1, 0].set_ylabel('Average Home Runs per Game')
axes[1, 0].set_title('Wind Direction vs Home Runs')
axes[1, 0].grid(True, alpha=0.3)
# Monthly trends
axes[1, 1].plot(months, wind_patterns['avg_runs_per_game'],
'o-', linewidth=2, markersize=8, label='Runs/Game')
ax2 = axes[1, 1].twinx()
ax2.plot(months, wind_patterns['avg_hr_per_game'],
'o-', linewidth=2, markersize=8, color='orange', label='HR/Game')
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Runs per Game', color='blue')
ax2.set_ylabel('Home Runs per Game', color='orange')
axes[1, 1].set_title('Seasonal Offensive Trends at Wrigley')
axes[1, 1].tick_params(axis='y', labelcolor='blue')
ax2.tick_params(axis='y', labelcolor='orange')
axes[1, 1].set_xticklabels(months, rotation=45)
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('wrigley_wind_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return wind_patterns
wrigley_wind = wrigley_wind_analysis()
# Wind rose visualization
def create_wind_rose():
"""
Create wind rose showing typical Wrigley wind patterns
"""
# Simulate wind direction data (degrees from North)
# Wrigley: Lake Michigan is to the east
np.random.seed(42)
# Common directions at Wrigley
# North (0°): In from CF
# South (180°): Out to CF
# East (90°): From lake, cross wind
# West (270°): Cross wind
directions = []
speeds = []
# Generate realistic distribution
for _ in range(1000):
# More common: SW winds (blowing out), NE winds (blowing in)
direction_type = np.random.choice(['SW', 'NE', 'E', 'W', 'Variable'],
p=[0.35, 0.30, 0.15, 0.15, 0.05])
if direction_type == 'SW':
dir_deg = np.random.normal(225, 20) # Southwest
speed = np.random.normal(10, 3)
elif direction_type == 'NE':
dir_deg = np.random.normal(45, 20) # Northeast
speed = np.random.normal(12, 3)
elif direction_type == 'E':
dir_deg = np.random.normal(90, 15) # East
speed = np.random.normal(8, 2)
elif direction_type == 'W':
dir_deg = np.random.normal(270, 15) # West
speed = np.random.normal(8, 2)
else:
dir_deg = np.random.uniform(0, 360)
speed = np.random.normal(5, 2)
directions.append(dir_deg % 360)
speeds.append(max(0, speed))
# Create polar histogram
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='polar')
# Convert to radians and bin
dir_radians = np.array(directions) * np.pi / 180
bins = np.linspace(0, 2*np.pi, 17)
hist, bin_edges = np.histogram(dir_radians, bins=bins)
# Plot
width = 2*np.pi / 16
bars = ax.bar(bin_edges[:-1], hist, width=width, alpha=0.7)
# Color by impact (SW = red for wind out, NE = blue for wind in)
colors = []
for edge in bin_edges[:-1]:
deg = edge * 180 / np.pi
if 135 <= deg <= 315: # SW quadrant (wind blowing out)
colors.append('red')
elif deg < 135 or deg > 315: # NE quadrant (wind blowing in)
colors.append('blue')
else:
colors.append('gray')
for bar, color in zip(bars, colors):
bar.set_facecolor(color)
bar.set_alpha(0.6)
ax.set_theta_zero_location('N')
ax.set_theta_direction(-1)
ax.set_title('Wrigley Field Wind Rose\\n(Red = Blowing Out, Blue = Blowing In)',
fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
create_wind_rose()
# R: Analyze Wind Effects
library(tidyverse)
# Simulate wind and batted ball outcomes
set.seed(321)
wind_data <- expand_grid(
wind_direction = c("Out to CF", "Out to LF", "Out to RF",
"In from CF", "Cross L-to-R", "Cross R-to-L", "Calm"),
wind_speed = c(0, 5, 10, 15, 20),
exit_velo = seq(95, 110, 5),
launch_angle = seq(20, 35, 5)
) %>%
rowwise() %>%
mutate(
# Base distance
base_distance = -140 + 4.5 * exit_velo + 6 * launch_angle -
0.08 * launch_angle^2,
# Wind effect
wind_effect = case_when(
wind_direction == "Calm" ~ 0,
grepl("Out", wind_direction) ~ wind_speed * 1.8,
grepl("In", wind_direction) ~ -wind_speed * 2.0,
grepl("Cross", wind_direction) ~ wind_speed * 0.3
),
# Actual distance
actual_distance = base_distance + wind_effect + rnorm(1, 0, 5),
# Outcome
is_hr = actual_distance >= 400
) %>%
ungroup()
# Analyze wind impact on HR rate
hr_by_wind <- wind_data %>%
group_by(wind_direction, wind_speed) %>%
summarise(
batted_balls = n(),
home_runs = sum(is_hr),
hr_rate = mean(is_hr),
avg_distance = mean(actual_distance),
.groups = 'drop'
)
# Focus on meaningful wind speeds
hr_summary <- wind_data %>%
filter(wind_speed %in% c(0, 10, 20)) %>%
mutate(wind_label = paste(wind_direction, wind_speed, "mph")) %>%
group_by(wind_label) %>%
summarise(
hr_rate = mean(is_hr),
avg_distance = mean(actual_distance),
.groups = 'drop'
) %>%
arrange(desc(hr_rate))
print("Home Run Rate by Wind Conditions:")
print(hr_summary, n = 25)
# Visualize
wind_out <- wind_data %>%
filter(wind_direction %in% c("Out to CF", "In from CF", "Calm")) %>%
mutate(wind_direction = factor(wind_direction,
levels = c("In from CF", "Calm", "Out to CF")))
ggplot(wind_out, aes(x = wind_speed, y = actual_distance,
color = wind_direction)) +
geom_smooth(method = "lm", se = TRUE, size = 1.5) +
scale_color_manual(values = c("blue", "gray", "red"),
name = "Wind Direction") +
labs(title = "Wind Impact on Batted Ball Distance",
subtitle = "All exit velocities and launch angles",
x = "Wind Speed (mph)",
y = "Distance (feet)") +
theme_minimal() +
theme(legend.position = "top")
# Statistical model
wind_model <- lm(actual_distance ~ wind_speed * wind_direction +
exit_velo + launch_angle,
data = wind_data)
cat("\nWind Effect Model:\n")
print(summary(wind_model)$coefficients[2:3, ])
# Calculate expected distance change
wind_predictions <- expand_grid(
wind_speed = c(0, 10, 20),
wind_direction = "Out to CF",
exit_velo = 105,
launch_angle = 28
)
wind_predictions$predicted_distance <- predict(wind_model, wind_predictions)
cat("\nPredicted Distance (105 mph EV, 28° LA):\n")
print(wind_predictions %>% select(wind_speed, wind_direction, predicted_distance))
# Python: Wrigley Field Wind Patterns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
# Wrigley Field is famous for wind effects
# Lake Michigan creates unique patterns
def wrigley_wind_analysis():
"""
Analyze characteristic wind patterns at Wrigley Field
"""
# Simulate Wrigley wind data by month
months = ['April', 'May', 'June', 'July', 'August', 'September', 'October']
# Wrigley: wind often blows IN in April/May, OUT in summer
wind_patterns = pd.DataFrame({
'month': months,
'avg_wind_speed': [12, 11, 9, 8, 8, 9, 11],
'pct_blowing_out': [25, 35, 45, 50, 48, 40, 30],
'pct_blowing_in': [45, 40, 30, 25, 27, 35, 40],
'pct_cross': [30, 25, 25, 25, 25, 25, 30],
'avg_runs_per_game': [8.2, 8.9, 9.4, 9.7, 9.5, 9.1, 8.5],
'avg_hr_per_game': [2.3, 2.7, 3.1, 3.4, 3.2, 2.9, 2.5]
})
print("Wrigley Field Wind Patterns by Month:")
print(wind_patterns.to_string(index=False))
# Correlation between wind direction and offense
from scipy.stats import pearsonr
corr_runs, p_runs = pearsonr(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'])
corr_hr, p_hr = pearsonr(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'])
print(f"\nCorrelation Analysis:")
print(f"Wind Out % vs Runs/Game: r = {corr_runs:.3f}, p = {p_runs:.3f}")
print(f"Wind Out % vs HR/Game: r = {corr_hr:.3f}, p = {p_hr:.3f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Wind direction by month
x = np.arange(len(months))
width = 0.25
axes[0, 0].bar(x - width, wind_patterns['pct_blowing_out'], width,
label='Blowing Out', color='red', alpha=0.7)
axes[0, 0].bar(x, wind_patterns['pct_blowing_in'], width,
label='Blowing In', color='blue', alpha=0.7)
axes[0, 0].bar(x + width, wind_patterns['pct_cross'], width,
label='Cross Wind', color='gray', alpha=0.7)
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Percentage of Games (%)')
axes[0, 0].set_title('Wrigley Field Wind Direction by Month')
axes[0, 0].set_xticks(x)
axes[0, 0].set_xticklabels(months, rotation=45)
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3, axis='y')
# Scoring vs wind out percentage
axes[0, 1].scatter(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'],
s=100, alpha=0.6)
for idx, month in enumerate(months):
axes[0, 1].annotate(month,
(wind_patterns.iloc[idx]['pct_blowing_out'],
wind_patterns.iloc[idx]['avg_runs_per_game']),
fontsize=9, alpha=0.7)
# Add trend line
z = np.polyfit(wind_patterns['pct_blowing_out'],
wind_patterns['avg_runs_per_game'], 1)
p = np.poly1d(z)
axes[0, 1].plot(wind_patterns['pct_blowing_out'],
p(wind_patterns['pct_blowing_out']),
"r--", alpha=0.8, linewidth=2)
axes[0, 1].set_xlabel('% Games with Wind Blowing Out')
axes[0, 1].set_ylabel('Average Runs per Game')
axes[0, 1].set_title('Wind Direction vs Scoring')
axes[0, 1].grid(True, alpha=0.3)
# Home runs vs wind
axes[1, 0].scatter(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'],
s=100, alpha=0.6, color='orange')
for idx, month in enumerate(months):
axes[1, 0].annotate(month,
(wind_patterns.iloc[idx]['pct_blowing_out'],
wind_patterns.iloc[idx]['avg_hr_per_game']),
fontsize=9, alpha=0.7)
z = np.polyfit(wind_patterns['pct_blowing_out'],
wind_patterns['avg_hr_per_game'], 1)
p = np.poly1d(z)
axes[1, 0].plot(wind_patterns['pct_blowing_out'],
p(wind_patterns['pct_blowing_out']),
"r--", alpha=0.8, linewidth=2)
axes[1, 0].set_xlabel('% Games with Wind Blowing Out')
axes[1, 0].set_ylabel('Average Home Runs per Game')
axes[1, 0].set_title('Wind Direction vs Home Runs')
axes[1, 0].grid(True, alpha=0.3)
# Monthly trends
axes[1, 1].plot(months, wind_patterns['avg_runs_per_game'],
'o-', linewidth=2, markersize=8, label='Runs/Game')
ax2 = axes[1, 1].twinx()
ax2.plot(months, wind_patterns['avg_hr_per_game'],
'o-', linewidth=2, markersize=8, color='orange', label='HR/Game')
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Runs per Game', color='blue')
ax2.set_ylabel('Home Runs per Game', color='orange')
axes[1, 1].set_title('Seasonal Offensive Trends at Wrigley')
axes[1, 1].tick_params(axis='y', labelcolor='blue')
ax2.tick_params(axis='y', labelcolor='orange')
axes[1, 1].set_xticklabels(months, rotation=45)
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('wrigley_wind_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return wind_patterns
wrigley_wind = wrigley_wind_analysis()
# Wind rose visualization
def create_wind_rose():
"""
Create wind rose showing typical Wrigley wind patterns
"""
# Simulate wind direction data (degrees from North)
# Wrigley: Lake Michigan is to the east
np.random.seed(42)
# Common directions at Wrigley
# North (0°): In from CF
# South (180°): Out to CF
# East (90°): From lake, cross wind
# West (270°): Cross wind
directions = []
speeds = []
# Generate realistic distribution
for _ in range(1000):
# More common: SW winds (blowing out), NE winds (blowing in)
direction_type = np.random.choice(['SW', 'NE', 'E', 'W', 'Variable'],
p=[0.35, 0.30, 0.15, 0.15, 0.05])
if direction_type == 'SW':
dir_deg = np.random.normal(225, 20) # Southwest
speed = np.random.normal(10, 3)
elif direction_type == 'NE':
dir_deg = np.random.normal(45, 20) # Northeast
speed = np.random.normal(12, 3)
elif direction_type == 'E':
dir_deg = np.random.normal(90, 15) # East
speed = np.random.normal(8, 2)
elif direction_type == 'W':
dir_deg = np.random.normal(270, 15) # West
speed = np.random.normal(8, 2)
else:
dir_deg = np.random.uniform(0, 360)
speed = np.random.normal(5, 2)
directions.append(dir_deg % 360)
speeds.append(max(0, speed))
# Create polar histogram
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='polar')
# Convert to radians and bin
dir_radians = np.array(directions) * np.pi / 180
bins = np.linspace(0, 2*np.pi, 17)
hist, bin_edges = np.histogram(dir_radians, bins=bins)
# Plot
width = 2*np.pi / 16
bars = ax.bar(bin_edges[:-1], hist, width=width, alpha=0.7)
# Color by impact (SW = red for wind out, NE = blue for wind in)
colors = []
for edge in bin_edges[:-1]:
deg = edge * 180 / np.pi
if 135 <= deg <= 315: # SW quadrant (wind blowing out)
colors.append('red')
elif deg < 135 or deg > 315: # NE quadrant (wind blowing in)
colors.append('blue')
else:
colors.append('gray')
for bar, color in zip(bars, colors):
bar.set_facecolor(color)
bar.set_alpha(0.6)
ax.set_theta_zero_location('N')
ax.set_theta_direction(-1)
ax.set_title('Wrigley Field Wind Rose\\n(Red = Blowing Out, Blue = Blowing In)',
fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
create_wind_rose()
Day/Night Game Differences
Day games show systematically different outcomes than night games:
Day Game Characteristics:
- Higher temperatures (especially afternoon games)
- More wind variability
- Sun interference (fielding difficulty)
- Player fatigue (unusual game time for body clock)
Night Game Characteristics:
- Cooler temperatures
- Better lighting consistency
- Standard game time for players
- Potentially higher humidity
R Analysis: Day/Night Splits
# R: Day vs Night Game Analysis
library(tidyverse)
library(Lahman)
# Simulate realistic day/night data
set.seed(654)
games_data <- tibble(
game_id = 1:2430, # Full season of games
game_type = sample(c("Day", "Night"), 2430, replace = TRUE,
prob = c(0.35, 0.65)), # ~35% day games
temperature = ifelse(game_type == "Day",
rnorm(2430, 78, 10),
rnorm(2430, 68, 8)),
month = sample(c("April", "May", "June", "July",
"August", "September"), 2430, replace = TRUE)
) %>%
mutate(
# Day games have higher scoring due to temperature
base_runs = 8.5,
temp_effect = (temperature - 70) * 0.03,
day_effect = ifelse(game_type == "Day", 0.4, 0), # Extra boost
noise = rnorm(n(), 0, 1.6),
total_runs = pmax(base_runs + temp_effect + day_effect + noise, 2),
total_runs = round(total_runs),
# Home runs
hr_base = 2.2,
hr_temp = (temperature - 70) * 0.015,
hr_day = ifelse(game_type == "Day", 0.25, 0),
hr_noise = rnorm(n(), 0, 0.7),
home_runs = pmax(hr_base + hr_temp + hr_day + hr_noise, 0),
home_runs = round(home_runs),
# Errors (sun field at day games)
errors = rpois(n(), ifelse(game_type == "Day", 0.85, 0.65))
)
# Compare day vs night
day_night_summary <- games_data %>%
group_by(game_type) %>%
summarise(
n_games = n(),
avg_temp = mean(temperature),
avg_runs = mean(total_runs),
avg_hr = mean(home_runs),
avg_errors = mean(errors),
.groups = 'drop'
)
print("Day vs Night Game Statistics:")
print(day_night_summary)
# Statistical tests
t_test_runs <- t.test(total_runs ~ game_type, data = games_data)
t_test_hr <- t.test(home_runs ~ game_type, data = games_data)
t_test_errors <- t.test(errors ~ game_type, data = games_data)
cat("\nStatistical Significance Tests:\n")
cat(sprintf("Runs - Day vs Night: t = %.2f, p = %.4f\n",
t_test_runs$statistic, t_test_runs$p.value))
cat(sprintf("HR - Day vs Night: t = %.2f, p = %.4f\n",
t_test_hr$statistic, t_test_hr$p.value))
cat(sprintf("Errors - Day vs Night: t = %.2f, p = %.4f\n",
t_test_errors$statistic, t_test_errors$p.value))
# Visualizations
p1 <- ggplot(games_data, aes(x = game_type, y = total_runs, fill = game_type)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("Day" = "gold", "Night" = "navy")) +
labs(title = "Total Runs: Day vs Night Games",
x = "Game Type",
y = "Total Runs") +
theme_minimal() +
theme(legend.position = "none")
p2 <- ggplot(games_data, aes(x = temperature, y = total_runs,
color = game_type)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE) +
scale_color_manual(values = c("Day" = "orange", "Night" = "darkblue"),
name = "Game Type") +
labs(title = "Temperature Effect by Game Type",
x = "Temperature (°F)",
y = "Total Runs") +
theme_minimal()
library(patchwork)
p1 + p2
# Monthly breakdown
monthly_summary <- games_data %>%
group_by(month, game_type) %>%
summarise(
avg_runs = mean(total_runs),
n = n(),
.groups = 'drop'
) %>%
mutate(month = factor(month, levels = c("April", "May", "June",
"July", "August", "September")))
ggplot(monthly_summary, aes(x = month, y = avg_runs,
fill = game_type, group = game_type)) +
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("Day" = "gold", "Night" = "navy"),
name = "Game Type") +
labs(title = "Scoring by Month and Game Time",
x = "Month",
y = "Average Runs per Game") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python Implementation: Comprehensive Scheduling Analysis
# Python: Advanced Day/Night and Scheduling Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def comprehensive_scheduling_analysis():
"""
Analyze various scheduling factors affecting performance
"""
np.random.seed(999)
n_games = 2430
# Generate comprehensive game data
games = pd.DataFrame({
'game_id': range(n_games),
'game_type': np.random.choice(['Day', 'Night'], n_games, p=[0.35, 0.65]),
'day_of_week': np.random.choice(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
n_games, p=[0.10, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15]),
'month': np.random.choice(range(4, 10), n_games), # April-Sept (4-9)
'is_weekend': np.random.choice([0, 1], n_games, p=[0.7, 0.3]),
'is_doubleheader': np.random.choice([0, 1], n_games, p=[0.97, 0.03]),
'rest_days': np.random.choice([0, 1, 2, 3], n_games, p=[0.7, 0.2, 0.08, 0.02])
})
# Temperature varies by time and month
games['temperature'] = np.where(
games['game_type'] == 'Day',
np.random.normal(75 + games['month'] * 1.5, 11),
np.random.normal(67 + games['month'] * 1.2, 9)
)
# Generate runs with multiple factors
base_runs = 8.5
day_effect = np.where(games['game_type'] == 'Day', 0.45, 0)
temp_effect = (games['temperature'] - 70) * 0.03
dh_effect = np.where(games['is_doubleheader'] == 1, -0.6, 0) # Fatigue
rest_effect = games['rest_days'] * 0.15 # Fresh teams score more
noise = np.random.normal(0, 1.5, n_games)
games['total_runs'] = (base_runs + day_effect + temp_effect +
dh_effect + rest_effect + noise).clip(1, 25)
games['total_runs'] = games['total_runs'].round()
# Analysis 1: Day vs Night
day_night = games.groupby('game_type').agg({
'total_runs': ['mean', 'std', 'count'],
'temperature': 'mean'
}).round(2)
print("Day vs Night Game Analysis:")
print(day_night)
print()
# Analysis 2: Day of week effects
dow_effects = games.groupby('day_of_week').agg({
'total_runs': ['mean', 'count']
}).round(2)
dow_effects = dow_effects.reindex(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
print("Day of Week Effects:")
print(dow_effects)
print()
# Analysis 3: Rest days impact
rest_effects = games.groupby('rest_days').agg({
'total_runs': ['mean', 'std', 'count']
}).round(2)
print("Rest Days Impact:")
print(rest_effects)
print()
# Analysis 4: Doubleheader effects
dh_effects = games.groupby('is_doubleheader').agg({
'total_runs': ['mean', 'std', 'count']
}).round(2)
print("Doubleheader Effects:")
print(dh_effects)
print()
# Statistical modeling
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
# Encode categorical variables
games_encoded = games.copy()
games_encoded['game_type_encoded'] = (games_encoded['game_type'] == 'Day').astype(int)
le_dow = LabelEncoder()
games_encoded['dow_encoded'] = le_dow.fit_transform(games_encoded['day_of_week'])
X = games_encoded[['game_type_encoded', 'temperature', 'month',
'is_doubleheader', 'rest_days', 'dow_encoded']]
y = games_encoded['total_runs']
model = LinearRegression()
model.fit(X, y)
print("\nRegression Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.4f}")
print(f"\nR-squared: {model.score(X, y):.4f}")
# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# 1. Day vs Night boxplot
sns.boxplot(data=games, x='game_type', y='total_runs', ax=axes[0, 0],
palette={'Day': 'gold', 'Night': 'navy'})
axes[0, 0].set_title('Day vs Night Scoring', fontweight='bold')
axes[0, 0].set_ylabel('Total Runs')
# 2. Temperature distribution by game type
games[games['game_type']=='Day']['temperature'].hist(
bins=30, alpha=0.6, ax=axes[0, 1], label='Day', color='gold'
)
games[games['game_type']=='Night']['temperature'].hist(
bins=30, alpha=0.6, ax=axes[0, 1], label='Night', color='navy'
)
axes[0, 1].set_xlabel('Temperature (°F)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Temperature Distribution', fontweight='bold')
axes[0, 1].legend()
# 3. Day of week scoring
dow_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_data = games.groupby('day_of_week')['total_runs'].mean().reindex(dow_order)
axes[0, 2].bar(range(7), dow_data.values, color='steelblue', alpha=0.7)
axes[0, 2].set_xticks(range(7))
axes[0, 2].set_xticklabels(dow_order)
axes[0, 2].set_ylabel('Average Runs')
axes[0, 2].set_title('Scoring by Day of Week', fontweight='bold')
axes[0, 2].grid(True, alpha=0.3, axis='y')
# 4. Rest days effect
rest_data = games.groupby('rest_days')['total_runs'].mean()
axes[1, 0].plot(rest_data.index, rest_data.values, 'o-',
linewidth=2, markersize=10, color='green')
axes[1, 0].set_xlabel('Rest Days Before Game')
axes[1, 0].set_ylabel('Average Runs')
axes[1, 0].set_title('Impact of Rest on Scoring', fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
# 5. Monthly trends by game type
monthly_by_type = games.groupby(['month', 'game_type'])['total_runs'].mean().unstack()
monthly_by_type.plot(ax=axes[1, 1], marker='o', linewidth=2)
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Average Runs')
axes[1, 1].set_title('Monthly Trends by Game Type', fontweight='bold')
axes[1, 1].set_xticks(range(4, 10))
axes[1, 1].set_xticklabels(['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep'])
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend(title='Game Type')
# 6. Feature importance
feature_importance = pd.DataFrame({
'feature': ['Day Game', 'Temperature', 'Month',
'Doubleheader', 'Rest Days', 'Day of Week'],
'coefficient': model.coef_
}).sort_values('coefficient')
colors = ['red' if x < 0 else 'green' for x in feature_importance['coefficient']]
axes[1, 2].barh(range(len(feature_importance)),
feature_importance['coefficient'],
color=colors, alpha=0.6)
axes[1, 2].set_yticks(range(len(feature_importance)))
axes[1, 2].set_yticklabels(feature_importance['feature'])
axes[1, 2].set_xlabel('Effect on Runs')
axes[1, 2].set_title('Factor Impact on Scoring', fontweight='bold')
axes[1, 2].axvline(x=0, color='black', linestyle='-', linewidth=0.8)
axes[1, 2].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('scheduling_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return games, model
games_df, sched_model = comprehensive_scheduling_analysis()
# Practical application: Predict expected runs for specific game
print("\n" + "="*60)
print("Example Prediction:")
print("="*60)
example_game = pd.DataFrame({
'game_type_encoded': [1], # Day game
'temperature': [88],
'month': [7], # July
'is_doubleheader': [0],
'rest_days': [1],
'dow_encoded': [5] # Saturday
})
predicted_runs = sched_model.predict(example_game)[0]
print(f"Day game in July, 88°F, Saturday, 1 day rest:")
print(f"Expected total runs: {predicted_runs:.1f}")
# R: Day vs Night Game Analysis
library(tidyverse)
library(Lahman)
# Simulate realistic day/night data
set.seed(654)
games_data <- tibble(
game_id = 1:2430, # Full season of games
game_type = sample(c("Day", "Night"), 2430, replace = TRUE,
prob = c(0.35, 0.65)), # ~35% day games
temperature = ifelse(game_type == "Day",
rnorm(2430, 78, 10),
rnorm(2430, 68, 8)),
month = sample(c("April", "May", "June", "July",
"August", "September"), 2430, replace = TRUE)
) %>%
mutate(
# Day games have higher scoring due to temperature
base_runs = 8.5,
temp_effect = (temperature - 70) * 0.03,
day_effect = ifelse(game_type == "Day", 0.4, 0), # Extra boost
noise = rnorm(n(), 0, 1.6),
total_runs = pmax(base_runs + temp_effect + day_effect + noise, 2),
total_runs = round(total_runs),
# Home runs
hr_base = 2.2,
hr_temp = (temperature - 70) * 0.015,
hr_day = ifelse(game_type == "Day", 0.25, 0),
hr_noise = rnorm(n(), 0, 0.7),
home_runs = pmax(hr_base + hr_temp + hr_day + hr_noise, 0),
home_runs = round(home_runs),
# Errors (sun field at day games)
errors = rpois(n(), ifelse(game_type == "Day", 0.85, 0.65))
)
# Compare day vs night
day_night_summary <- games_data %>%
group_by(game_type) %>%
summarise(
n_games = n(),
avg_temp = mean(temperature),
avg_runs = mean(total_runs),
avg_hr = mean(home_runs),
avg_errors = mean(errors),
.groups = 'drop'
)
print("Day vs Night Game Statistics:")
print(day_night_summary)
# Statistical tests
t_test_runs <- t.test(total_runs ~ game_type, data = games_data)
t_test_hr <- t.test(home_runs ~ game_type, data = games_data)
t_test_errors <- t.test(errors ~ game_type, data = games_data)
cat("\nStatistical Significance Tests:\n")
cat(sprintf("Runs - Day vs Night: t = %.2f, p = %.4f\n",
t_test_runs$statistic, t_test_runs$p.value))
cat(sprintf("HR - Day vs Night: t = %.2f, p = %.4f\n",
t_test_hr$statistic, t_test_hr$p.value))
cat(sprintf("Errors - Day vs Night: t = %.2f, p = %.4f\n",
t_test_errors$statistic, t_test_errors$p.value))
# Visualizations
p1 <- ggplot(games_data, aes(x = game_type, y = total_runs, fill = game_type)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("Day" = "gold", "Night" = "navy")) +
labs(title = "Total Runs: Day vs Night Games",
x = "Game Type",
y = "Total Runs") +
theme_minimal() +
theme(legend.position = "none")
p2 <- ggplot(games_data, aes(x = temperature, y = total_runs,
color = game_type)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE) +
scale_color_manual(values = c("Day" = "orange", "Night" = "darkblue"),
name = "Game Type") +
labs(title = "Temperature Effect by Game Type",
x = "Temperature (°F)",
y = "Total Runs") +
theme_minimal()
library(patchwork)
p1 + p2
# Monthly breakdown
monthly_summary <- games_data %>%
group_by(month, game_type) %>%
summarise(
avg_runs = mean(total_runs),
n = n(),
.groups = 'drop'
) %>%
mutate(month = factor(month, levels = c("April", "May", "June",
"July", "August", "September")))
ggplot(monthly_summary, aes(x = month, y = avg_runs,
fill = game_type, group = game_type)) +
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("Day" = "gold", "Night" = "navy"),
name = "Game Type") +
labs(title = "Scoring by Month and Game Time",
x = "Month",
y = "Average Runs per Game") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Python: Advanced Day/Night and Scheduling Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def comprehensive_scheduling_analysis():
"""
Analyze various scheduling factors affecting performance
"""
np.random.seed(999)
n_games = 2430
# Generate comprehensive game data
games = pd.DataFrame({
'game_id': range(n_games),
'game_type': np.random.choice(['Day', 'Night'], n_games, p=[0.35, 0.65]),
'day_of_week': np.random.choice(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
n_games, p=[0.10, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15]),
'month': np.random.choice(range(4, 10), n_games), # April-Sept (4-9)
'is_weekend': np.random.choice([0, 1], n_games, p=[0.7, 0.3]),
'is_doubleheader': np.random.choice([0, 1], n_games, p=[0.97, 0.03]),
'rest_days': np.random.choice([0, 1, 2, 3], n_games, p=[0.7, 0.2, 0.08, 0.02])
})
# Temperature varies by time and month
games['temperature'] = np.where(
games['game_type'] == 'Day',
np.random.normal(75 + games['month'] * 1.5, 11),
np.random.normal(67 + games['month'] * 1.2, 9)
)
# Generate runs with multiple factors
base_runs = 8.5
day_effect = np.where(games['game_type'] == 'Day', 0.45, 0)
temp_effect = (games['temperature'] - 70) * 0.03
dh_effect = np.where(games['is_doubleheader'] == 1, -0.6, 0) # Fatigue
rest_effect = games['rest_days'] * 0.15 # Fresh teams score more
noise = np.random.normal(0, 1.5, n_games)
games['total_runs'] = (base_runs + day_effect + temp_effect +
dh_effect + rest_effect + noise).clip(1, 25)
games['total_runs'] = games['total_runs'].round()
# Analysis 1: Day vs Night
day_night = games.groupby('game_type').agg({
'total_runs': ['mean', 'std', 'count'],
'temperature': 'mean'
}).round(2)
print("Day vs Night Game Analysis:")
print(day_night)
print()
# Analysis 2: Day of week effects
dow_effects = games.groupby('day_of_week').agg({
'total_runs': ['mean', 'count']
}).round(2)
dow_effects = dow_effects.reindex(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
print("Day of Week Effects:")
print(dow_effects)
print()
# Analysis 3: Rest days impact
rest_effects = games.groupby('rest_days').agg({
'total_runs': ['mean', 'std', 'count']
}).round(2)
print("Rest Days Impact:")
print(rest_effects)
print()
# Analysis 4: Doubleheader effects
dh_effects = games.groupby('is_doubleheader').agg({
'total_runs': ['mean', 'std', 'count']
}).round(2)
print("Doubleheader Effects:")
print(dh_effects)
print()
# Statistical modeling
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
# Encode categorical variables
games_encoded = games.copy()
games_encoded['game_type_encoded'] = (games_encoded['game_type'] == 'Day').astype(int)
le_dow = LabelEncoder()
games_encoded['dow_encoded'] = le_dow.fit_transform(games_encoded['day_of_week'])
X = games_encoded[['game_type_encoded', 'temperature', 'month',
'is_doubleheader', 'rest_days', 'dow_encoded']]
y = games_encoded['total_runs']
model = LinearRegression()
model.fit(X, y)
print("\nRegression Model Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
print(f" {feature}: {coef:.4f}")
print(f"\nR-squared: {model.score(X, y):.4f}")
# Visualizations
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# 1. Day vs Night boxplot
sns.boxplot(data=games, x='game_type', y='total_runs', ax=axes[0, 0],
palette={'Day': 'gold', 'Night': 'navy'})
axes[0, 0].set_title('Day vs Night Scoring', fontweight='bold')
axes[0, 0].set_ylabel('Total Runs')
# 2. Temperature distribution by game type
games[games['game_type']=='Day']['temperature'].hist(
bins=30, alpha=0.6, ax=axes[0, 1], label='Day', color='gold'
)
games[games['game_type']=='Night']['temperature'].hist(
bins=30, alpha=0.6, ax=axes[0, 1], label='Night', color='navy'
)
axes[0, 1].set_xlabel('Temperature (°F)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Temperature Distribution', fontweight='bold')
axes[0, 1].legend()
# 3. Day of week scoring
dow_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_data = games.groupby('day_of_week')['total_runs'].mean().reindex(dow_order)
axes[0, 2].bar(range(7), dow_data.values, color='steelblue', alpha=0.7)
axes[0, 2].set_xticks(range(7))
axes[0, 2].set_xticklabels(dow_order)
axes[0, 2].set_ylabel('Average Runs')
axes[0, 2].set_title('Scoring by Day of Week', fontweight='bold')
axes[0, 2].grid(True, alpha=0.3, axis='y')
# 4. Rest days effect
rest_data = games.groupby('rest_days')['total_runs'].mean()
axes[1, 0].plot(rest_data.index, rest_data.values, 'o-',
linewidth=2, markersize=10, color='green')
axes[1, 0].set_xlabel('Rest Days Before Game')
axes[1, 0].set_ylabel('Average Runs')
axes[1, 0].set_title('Impact of Rest on Scoring', fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)
# 5. Monthly trends by game type
monthly_by_type = games.groupby(['month', 'game_type'])['total_runs'].mean().unstack()
monthly_by_type.plot(ax=axes[1, 1], marker='o', linewidth=2)
axes[1, 1].set_xlabel('Month')
axes[1, 1].set_ylabel('Average Runs')
axes[1, 1].set_title('Monthly Trends by Game Type', fontweight='bold')
axes[1, 1].set_xticks(range(4, 10))
axes[1, 1].set_xticklabels(['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep'])
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend(title='Game Type')
# 6. Feature importance
feature_importance = pd.DataFrame({
'feature': ['Day Game', 'Temperature', 'Month',
'Doubleheader', 'Rest Days', 'Day of Week'],
'coefficient': model.coef_
}).sort_values('coefficient')
colors = ['red' if x < 0 else 'green' for x in feature_importance['coefficient']]
axes[1, 2].barh(range(len(feature_importance)),
feature_importance['coefficient'],
color=colors, alpha=0.6)
axes[1, 2].set_yticks(range(len(feature_importance)))
axes[1, 2].set_yticklabels(feature_importance['feature'])
axes[1, 2].set_xlabel('Effect on Runs')
axes[1, 2].set_title('Factor Impact on Scoring', fontweight='bold')
axes[1, 2].axvline(x=0, color='black', linestyle='-', linewidth=0.8)
axes[1, 2].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('scheduling_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
return games, model
games_df, sched_model = comprehensive_scheduling_analysis()
# Practical application: Predict expected runs for specific game
print("\n" + "="*60)
print("Example Prediction:")
print("="*60)
example_game = pd.DataFrame({
'game_type_encoded': [1], # Day game
'temperature': [88],
'month': [7], # July
'is_doubleheader': [0],
'rest_days': [1],
'dow_encoded': [5] # Saturday
})
predicted_runs = sched_model.predict(example_game)[0]
print(f"Day game in July, 88°F, Saturday, 1 day rest:")
print(f"Expected total runs: {predicted_runs:.1f}")
Exercise 1: Calculate Custom Park Factors
Using the Lahman database, calculate three-year park factors (2021-2023) for all MLB ballparks. Include separate factors for runs, home runs, hits, and strikeouts. Identify the top 5 most extreme parks in each category.
Steps:
- Extract team-level statistics for home and road games
- Calculate park factors using the proper formula
- Visualize the results with a multi-faceted plot
- Discuss which parks favor which types of outcomes
Exercise 2: Weather Impact Simulation
Build a physics-based simulator that predicts whether identical batted balls (105 mph exit velocity, 28-degree launch angle) result in home runs under different weather conditions.
Requirements:
- Test combinations of temperature (50-100°F), humidity (20-90%), and wind (-15 to +15 mph)
- Account for air density changes
- Calculate the probability of a home run for each condition
- Create a heat map showing HR probability across weather conditions
- Determine the weather conditions that maximize and minimize HR probability
Exercise 3: Coors Field Player Adjustment
Analyze the career statistics of five players who spent significant time with the Colorado Rockies. Calculate their park-adjusted statistics and compare them to their raw numbers.
Analysis Points:
- Home vs road splits for batting average, OBP, SLG, and HR rate
- Career totals adjusted for Coors Field effects
- Comparison to league-average park adjustments
- Discussion of whether Coors Field affects player evaluation and contract value
Exercise 4: Day/Night Betting Model
Build a predictive model for game total runs that incorporates game time (day/night), weather conditions, park factors, and team strength.
Model Requirements:
- Use regression or machine learning approach
- Include interaction terms (e.g., day games × temperature)
- Validate with train/test split
- Calculate expected value for over/under bets
- Backtest strategy: when would betting based on weather conditions be profitable?
Bonus: Identify specific situations where the model predicts the largest deviations from betting market totals.
Summary
Environmental factors substantially affect baseball outcomes in quantifiable ways:
Key Findings:
- Park Factors: Coors Field (PF ~110-115) and other extreme parks require statistical adjustment
- Temperature: Every 10°F adds ~2-3 feet to fly ball distance; correlates strongly with offense
- Altitude: Coors Field's 5,200-foot elevation reduces air density 17%, adding 15-25 feet to fly balls and reducing breaking ball movement
- Humidity: Counterintuitively helps offense (water vapor is lighter than air)
- Wind: 10 mph helping wind adds ~20 feet; headwind subtracts similar amounts
- Day vs Night: Day games average 0.4-0.5 more runs due to higher temperatures
- Scheduling: Rest days, doubleheaders, and day-of-week show measurable effects
Practical Applications:
- Player Evaluation: Adjust statistics for park and weather contexts
- Projections: Account for home ballpark changes (trades, free agency)
- Betting: Weather creates exploitable edges in run totals markets
- Strategy: Lineup construction can optimize for home park characteristics
- Historical Analysis: Era and park adjustments essential for fair comparisons
Understanding these environmental effects allows analysts to separate signal from noise, fairly evaluate players across contexts, and identify strategic opportunities. While we cannot control the weather, we can quantify its impact and make better-informed decisions accordingly.