Introduction to MLB Analytics (4 exercises)
Environment Setup Verification
Write a script that:
1. Loads the necessary baseball analytics packages
2. Prints your R or Python version
3. Queries basic information about a player of your choice using the baseball data package
4. Creates a simple plot (any plot) to verify visualization works
This confirms your environment is properly configured.
Data Hierarchy Exploration
1. Get game-level data for your favorite team
2. Calculate the team's wins and losses
3. Aggregate to find total runs scored and allowed across the season
4. Calculate the team's Pythagorean winning percentage using the formula from the Preface
5. Compare actual winning percentage to Pythagorean expectation
**R Version Hint:**
```r
# Use baseballr to get team game logs
library(baseballr)
library(tidyverse)
# Example for Yankees (team_id = 147)
yankees_games <- mlb_team_schedule(season = 2023,
                                   team_id = 147)
# Then aggregate and calculate...
```
**Python Version Hint:**
```python
# Use pybaseball to get team game logs
from pybaseball import schedule_and_record
# Example for Yankees
yankees_games = schedule_and_record(2023, 'NYY')
# Then aggregate and calculate...
```
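To close the loop on the "aggregate and calculate" step, here is a minimal Python sketch, assuming the `R` (runs scored), `RA` (runs allowed), and `W/L` columns that pybaseball's `schedule_and_record` returns:

```python
import pandas as pd

def pythagorean_check(games: pd.DataFrame) -> dict:
    """Actual vs. Pythagorean W% from a schedule_and_record frame."""
    played = games.dropna(subset=["W/L"])           # drop unplayed games
    rs, ra = played["R"].sum(), played["RA"].sum()
    wins = played["W/L"].str.startswith("W").sum()  # counts "W" and "W-wo"
    return {
        "actual_wpct": wins / len(played),
        "pythag_wpct": rs**2 / (rs**2 + ra**2),
    }
```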
Hit Type Analysis
1. Choose four players representing different offensive profiles
2. Get their 2023 statistics including hit breakdowns
3. Calculate what percentage of their hits were singles, doubles, triples, and home runs
4. Create a visualization comparing these distributions
5. Write a brief interpretation: How do power hitters' hit distributions differ from contact hitters?
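A hedged Python sketch of step 3, assuming a FanGraphs-style stats frame (e.g., from `pybaseball.batting_stats`) with `H`, `2B`, `3B`, and `HR` columns:

```python
import pandas as pd

def hit_distribution(row: pd.Series) -> pd.Series:
    """Share of each hit type out of total hits."""
    singles = row["H"] - row["2B"] - row["3B"] - row["HR"]
    return pd.Series({
        "1B%": singles / row["H"],
        "2B%": row["2B"] / row["H"],
        "3B%": row["3B"] / row["H"],
        "HR%": row["HR"] / row["H"],
    })

# Usage: stats.set_index("Name").loc[your_four_players].apply(hit_distribution, axis=1)
```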
Monthly Performance Trends
1. Get plate appearance data for a player across the 2023 season
2. Group plate appearances by month (April, May, June, July, August, September)
3. Calculate monthly batting average, OBP, and slugging percentage
4. Create a line plot showing how these metrics changed throughout the season
5. Discuss: Do you see evidence of the player getting hot or cold? How confident are you given sample sizes?
**Challenge extension:** Calculate monthly sample sizes and add confidence intervals to your plot to visualize uncertainty.
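One way to approach steps 2-3 and the challenge extension, sketched in Python; the `pa` frame and its `game_date` / `on_base` column names are assumptions standing in for your plate-appearance data:

```python
import numpy as np
import pandas as pd

pa["month"] = pd.to_datetime(pa["game_date"]).dt.month_name()
monthly = pa.groupby("month").agg(n=("on_base", "size"),
                                  obp=("on_base", "mean"))
# Normal-approximation 95% CI to make sample-size uncertainty visible
monthly["se"] = np.sqrt(monthly["obp"] * (1 - monthly["obp"]) / monthly["n"])
monthly["lo"] = monthly["obp"] - 1.96 * monthly["se"]
monthly["hi"] = monthly["obp"] + 1.96 * monthly["se"]
```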
---
You've now completed your introduction to baseball analytics. You understand what baseball analytics is, what questions it addresses, have a working analytical environment, understand baseball's data structure, and have completed a real analysis examining the shift ban's impact. The foundation is laid.
Subsequent chapters build systematically on these foundations. Chapter 2 covers data acquisition—how to get data from various sources. Chapter 3 teaches data manipulation and transformation. Chapters 4-8 introduce core baseball metrics, showing both what they mean conceptually and how to calculate them from raw data. Later chapters cover visualization, modeling, prediction, and specialized topics.
Baseball analytics is ultimately about answering questions that matter. As you work through this book, always return to the questions: What am I trying to understand? What evidence would help answer that question? How confident should I be in my conclusions? The technical skills you'll develop are tools in service of clear thinking about meaningful questions.
Let's continue to Chapter 2, where we'll learn to acquire data from multiple sources, building the datasets that fuel every analysis.
Data Wrangling for Baseball (5 exercises)
Advanced Filtering and Selection
1. How many players hit 30+ home runs with a strikeout rate below 20%?
2. Which players had an OBP of at least .350 and stole 20+ bases?
3. Create a new metric, "Three True Outcomes Rate" = (HR + BB + SO) / PA, and identify the top 10 players
**R Solution Sketch:**
```r
library(baseballr)
library(tidyverse)
batters <- fg_batter_leaders(2024, 2024, qual = 300)
# 1.
q1 <- batters %>%
  filter(HR >= 30, K_percent < 20) %>%
  nrow()
# 2.
q2 <- batters %>%
  filter(OBP >= .350, SB >= 20) %>%
  select(Name, OBP, SB)
# 3.
q3 <- batters %>%
  mutate(TTO_rate = (HR + BB + SO) / PA) %>%
  arrange(desc(TTO_rate)) %>%
  select(Name, HR, BB, SO, PA, TTO_rate) %>%
  head(10)
```
**Python Solution Sketch:**
```python
from pybaseball import batting_stats
batters_py = batting_stats(2024, qual=300)
# 1.
q1_py = batters_py[(batters_py['HR'] >= 30) & (batters_py['K%'] < 20)]
print(f"Count: {len(q1_py)}")
# 2.
q2_py = batters_py[(batters_py['OBP'] >= .350) & (batters_py['SB'] >= 20)][['Name', 'OBP', 'SB']]
# 3.
batters_py['TTO_rate'] = (batters_py['HR'] + batters_py['BB'] + batters_py['SO']) / batters_py['PA']
q3_py = batters_py.nlargest(10, 'TTO_rate')[['Name', 'HR', 'BB', 'SO', 'PA', 'TTO_rate']]
```
Grouping and Team Analysis
1. Which team had the highest average OPS among qualified batters?
2. What's the correlation between team home runs and team wins (you'll need to join with standings data)?
3. Calculate each team's offensive balance: standard deviation of WAR among their top 5 hitters
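A sketch of question 1 in Python, assuming the `Team` and `OPS` columns that `pybaseball.batting_stats` returns:

```python
from pybaseball import batting_stats

batters = batting_stats(2024, qual=502)  # 502 PA = qualified
team_ops = (batters.groupby("Team")["OPS"].mean()
                   .sort_values(ascending=False))
print(team_ops.head())
```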
Time Series Analysis
1. Calculate 15-game rolling averages for BA, OBP, and SLG
2. Identify the longest streak above .300 BA
3. Compare first-half vs. second-half performance
4. Visualize performance trends over the season
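Step 1 reduces to a rolling window over sorted game logs. A sketch, with the `game_logs` frame and its column names as assumptions:

```python
import pandas as pd

game_logs = game_logs.sort_values("Date")
rolling = game_logs[["H", "AB"]].rolling(window=15).sum()
game_logs["BA_15"] = rolling["H"] / rolling["AB"]  # 15-game rolling BA
# OBP and SLG follow the same pattern from their component counts
```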
Joins and Data Integration
1. Identify two-way players (appear in both datasets)
2. For two-way players, calculate their combined WAR
3. Compare offensive WAR vs. pitching WAR
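The core of this exercise is an inner join. A sketch, assuming `bat` and `pit` are batting and pitching leaderboards that share a `Name` column (player IDs are the safer join key when available):

```python
two_way = bat.merge(pit, on="Name", suffixes=("_bat", "_pit"))
two_way["combined_WAR"] = two_way["WAR_bat"] + two_way["WAR_pit"]
print(two_way[["Name", "WAR_bat", "WAR_pit", "combined_WAR"]])
```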
Missing Data Challenge
1. Assess the extent and pattern of missing data
2. Implement three different imputation strategies
3. Compare the impact on correlation between exit velocity and hard-hit rate
4. Justify which imputation method is most appropriate and why
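A sketch of steps 2-3, assuming a frame `df` with `exit_velocity` and `hard_hit_rate` columns containing missing values:

```python
strategies = {
    "mean": df["exit_velocity"].fillna(df["exit_velocity"].mean()),
    "median": df["exit_velocity"].fillna(df["exit_velocity"].median()),
    "listwise_drop": df["exit_velocity"].dropna(),
}
for name, ev in strategies.items():
    hh = df.loc[ev.index, "hard_hit_rate"]  # align rows with each strategy
    print(name, round(ev.corr(hh), 3))
```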
---
This concludes Chapter 2. In the next chapter, we'll explore the rich ecosystem of baseball data sources, learning how to access FanGraphs, Baseball Reference, Statcast, and the Lahman database through R and Python packages.
The Baseball Data Ecosystem (4 exercises)
Multi-Source Data Integration
1. Retrieve 2024 batting statistics for players with 400+ PA
2. For the top 10 hitters by wRC+, get their Statcast data
3. Compare their wOBA (FanGraphs) to their xwOBA (Statcast)
4. Which players are most over-performing or under-performing their expected stats?
**R Solution Sketch:**
```r
library(baseballr)
library(tidyverse)
# 1. Get FanGraphs data
batters <- fg_batter_leaders(2024, 2024, qual = 400)
top10 <- batters %>% arrange(desc(`wRC+`)) %>% head(10)
# 2 & 3. Get Statcast data for each
# (Would loop through players and use statcast_search with their IDs)
# Compare wOBA vs xwOBA
# 4. Calculate differences
# top10 %>% mutate(woba_diff = wOBA - xwOBA)
```
**Python Solution Sketch:**
```python
from pybaseball import batting_stats, statcast_batter, playerid_lookup
import pandas as pd
# 1. Get batting stats
batters = batting_stats(2024, qual=400)
top10 = batters.nlargest(10, 'wRC+')
# 2. Get Statcast data for top 10
# (Would need to lookup MLBAM IDs and query statcast_batter)
# 3 & 4. Compare wOBA vs xwOBA
# Calculate differences to identify over/under-performers
```
Historical Trends with Lahman
1. Calculate the league-average batting average by decade (1920-present)
2. Identify the decade with the highest and lowest scoring
3. Plot the trend of strikeouts per game over time
4. Compare the "Steroid Era" (1995-2005) to the "Modern Era" (2015-2024) in terms of HR rate, K rate, and BA
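A sketch of step 1 using the Lahman tables bundled with pybaseball:

```python
from pybaseball.lahman import batting

bat = batting()
bat = bat[bat["yearID"] >= 1920]
bat["decade"] = (bat["yearID"] // 10) * 10
totals = bat.groupby("decade")[["H", "AB"]].sum()
print(totals["H"] / totals["AB"])  # league BA by decade
```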
Team Performance Analysis
1. Get 2024 team standings (use `mlb_standings` from baseballr or `schedule_and_record` from pybaseball)
2. Calculate team batting statistics (aggregate from individual player data)
3. Join with team ERA (from pitching data)
4. Create a Pythagorean expectation model: Expected W% = R^2 / (R^2 + RA^2)
5. Compare actual wins to expected wins—which teams over-performed?
Statcast Deep Dive
1. Choose a pitcher and retrieve all of their Statcast data for the 2024 season
2. For each pitch type:
- Calculate average velocity, spin rate, and movement
- Calculate whiff rate and zone rate
- Identify which pitch generates the most swings and misses
3. Analyze platoon splits: How do their pitches perform vs. LHB vs. RHB?
4. Create a "pitch quality" ranking based on velocity, spin, and whiff rate
**Deliverable**: A comprehensive report with visualizations showing pitch usage, effectiveness, and recommendations for pitch selection.
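For step 2, most of the work is a grouped aggregation over the pitcher's Statcast rows. A sketch, where `MLBAM_ID` stands in for the ID you look up with `playerid_lookup`:

```python
from pybaseball import statcast_pitcher

df = statcast_pitcher("2024-03-28", "2024-09-29", player_id=MLBAM_ID)
summary = df.groupby("pitch_type").agg(
    velo=("release_speed", "mean"),
    spin=("release_spin_rate", "mean"),
    # Simplified whiff rate; a fuller version divides by swings only
    whiff_rate=("description", lambda s: (s == "swinging_strike").mean()),
)
print(summary.sort_values("whiff_rate", ascending=False))
```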
---
This concludes Chapter 3. You now have the tools to access virtually any baseball dataset available. In the next chapters, we'll use these data sources to explore specific analytical techniques, from evaluating hitters and pitchers to building predictive models and crafting advanced visualizations.
The combination of R's `baseballr` package and Python's `pybaseball` library, along with the historical richness of the Lahman database, provides everything you need to conduct professional-grade baseball analysis. Master these tools, and you'll be equipped to answer almost any baseball question with data.
Statcast Analytics - Pitching (3 exercises)
Pitcher Comparison Analysis
1. Pull data for both pitchers for the 2024 season
2. Calculate velocity, spin rate, and movement profiles for each pitch type
3. Compare their arsenals: usage rates, average velocity, and whiff rates
4. Create a pitch movement chart for each pitcher
5. Write a brief scouting report comparing their arsenals
**Suggested pitchers**: Spencer Strider (ATL) and Shota Imanaga (CHC)
Command and Location Analysis
1. Choose a pitcher and calculate zone%, edge%, heart%, and chase rate by pitch type
2. Analyze CSW% overall and by pitch type
3. Create a heatmap showing pitch locations for their primary pitch (four-seam fastball)
4. Compare location patterns by count (ahead vs. behind in the count)
5. Assess: Is this pitcher's success driven more by stuff or command?
Arsenal Effectiveness Study
1. Choose a pitcher. For each of their pitch types, calculate:
- Whiff rate
- xwOBA against
- Hard hit rate against
- CSW%
- Usage rate
2. Identify the best and worst pitches in the arsenal
3. Analyze if usage rate aligns with effectiveness (do they throw their best pitches most?)
4. Calculate pitch values: Run Value per 100 pitches for each pitch type
5. Make a recommendation: Should they adjust their pitch usage?
**Challenge Extension**: Compare the pitcher's arsenal effectiveness against left-handed vs. right-handed batters. Do they have platoon splits? Which pitches drive those splits?
---
You've now completed your deep dive into Statcast pitching analytics. You understand how modern tracking systems measure every pitch, what those measurements reveal about pitcher performance, and how to analyze arsenals, command, and expected outcomes. These skills form the foundation for evaluating pitchers in the modern game, whether you're building projection models, designing development plans, or making strategic decisions.
The next chapter will explore park factors and environmental effects - understanding how context affects the statistics we've been analyzing throughout this book.
Fielding & Baserunning Analytics (4 exercises)
Calculating Simple OAA
1. Get batted ball data for one month (July 2024 recommended)
2. Filter for outfield fly balls and line drives
3. Calculate catch probability based on distance and hang time (use simplified model from section 8.2.2)
4. Determine actual outcomes (caught or not)
5. Calculate OAA for each outfielder as sum of (actual - expected)
6. Identify the top 5 and bottom 5 defenders
**Bonus**: Compare your simplified OAA to Baseball Savant's official OAA for the same players and time period. How close are they?
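Steps 4-6 reduce to a grouped sum once each batted ball carries an expected catch probability from your simplified model. A sketch with assumed column names:

```python
df["oaa_contrib"] = df["caught"].astype(int) - df["catch_prob"]
oaa = df.groupby("fielder")["oaa_contrib"].sum().sort_values()
print(oaa.tail(5))  # top 5 defenders
print(oaa.head(5))  # bottom 5
```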
Shift Impact Analysis
1. Get Statcast data for ground balls hit by left-handed batters for:
- May 2022 (shifts allowed)
- May 2023 (shifts banned)
2. Calculate ground ball hit rates for each month
3. Perform a statistical test for the difference
4. Create a visualization comparing the two periods
5. Calculate how many extra hits occurred in 2023 vs expected based on 2022 rates
**Challenge**: Identify which individual players benefited most from the shift ban by comparing their 2022 vs 2023 ground ball BABIP.
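Step 3 is a two-proportion test. A sketch using statsmodels, where the hit and ground-ball counts come from your step 2 aggregation:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

stat, pval = proportions_ztest(count=np.array([hits_2023, hits_2022]),
                               nobs=np.array([gb_2023, gb_2022]))
print(f"z = {stat:.2f}, p = {pval:.4f}")
```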
Sprint Speed and Stolen Base Efficiency
1. Get sprint speed data for all qualified players (2024)
2. Get stolen base attempts and success rates
3. Calculate stolen base success rate for players with 10+ attempts
4. Create a scatter plot of sprint speed vs SB success rate
5. Fit a regression model and interpret the relationship
6. Identify players who over/under-perform their expected SB rate based on speed
**Question**: What sprint speed corresponds to 75% stolen base success (break-even point)?
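A sketch of steps 5-6 and the break-even question, modeling success counts with a binomial GLM; the `df` layout (sprint_speed, SB, CS columns) is an assumption:

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(df["sprint_speed"])
fit = sm.GLM(df[["SB", "CS"]], X, family=sm.families.Binomial()).fit()
df["expected_rate"] = fit.predict(X)
df["over_expected"] = df["SB"] / (df["SB"] + df["CS"]) - df["expected_rate"]
# Speed where the model predicts 75% success: logit(0.75) = log(3)
b0, b1 = fit.params
print((np.log(3) - b0) / b1)
```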
Defensive Value Comparison
1. Select 20 players across multiple positions (2024 season)
2. Collect their OAA, UZR, and DRS values
3. Standardize all metrics to same scale (z-scores)
4. Calculate correlation between metrics
5. Identify players where metrics disagree significantly
6. Create a visualization showing agreement/disagreement
**Question**: For which positions do the metrics agree most? Where do they diverge most? Why might this be?
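Steps 3-4 in sketch form, assuming `df` holds one row per player with OAA, UZR, and DRS columns:

```python
for col in ["OAA", "UZR", "DRS"]:
    df[f"{col}_z"] = (df[col] - df[col].mean()) / df[col].std()
print(df[["OAA_z", "UZR_z", "DRS_z"]].corr())
# Largest spread across the standardized metrics flags disagreement
z = df[["OAA_z", "UZR_z", "DRS_z"]]
df["disagreement"] = z.max(axis=1) - z.min(axis=1)
```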
---
You've now completed your introduction to fielding and baserunning analytics. You understand why defense is challenging to measure, how modern metrics like OAA work, the value of positioning and shifts, and how to evaluate baserunning contribution. Defense and baserunning combined can account for 2-3 WAR per season for elite players—real, measurable value that traditional statistics completely missed.
The Statcast revolution has transformed defensive evaluation from subjective ("he looks good") to objective ("he made 73% of plays with 65% average probability"). We can now properly credit players like Kevin Kiermaier, Yadier Molina, and Matt Chapman for defensive excellence that was previously unrecognized in conventional statistics.
In Chapter 9, we'll turn to win probability and leveraged situations, understanding how context affects player and managerial decisions. The technical skills you've developed throughout this book will combine to help you evaluate complete players—offense, defense, baserunning, and situational performance—using the full arsenal of modern analytics.
Fantasy Baseball & Sports Betting Analytics (4 exercises)
Fantasy Player Valuation
1. Calculate replacement-level statistics (use the worst player's projections)
2. Define per-point denominators for each category (assume reasonable league spreads)
3. Calculate SGP for each player
4. Convert SGP to auction values (12-team league, $260 budget)
5. Visualize the relationship between a player's projected home runs and their auction value
**Extension**: How does the value of a player with extreme stolen base totals (50+) change if the league adds OBP as a sixth category?
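A sketch of steps 1-3. The projections frame `proj` and the per-category SGP denominators are assumptions; in practice you estimate the denominators from your league's historical standings:

```python
denoms = {"HR": 9.5, "R": 22.0, "RBI": 21.0, "SB": 7.0}  # illustrative only
repl = proj.loc[proj["proj_points"].idxmin()]  # worst player = replacement
proj["SGP"] = sum((proj[c] - repl[c]) / d for c, d in denoms.items())
```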
DFS Optimization
1. Generate random projections (points) and salaries for 20 players across positions
2. Implement a greedy algorithm that selects players by value-per-dollar
3. Compare the greedy solution to a random selection
4. Calculate what percentage of random lineups beat the greedy lineup
5. Discuss: Why might the optimal lineup differ from highest value-per-dollar?
**Extension**: Add stack constraints (at least 3 batters from the same team must be selected).
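A minimal greedy selector for step 2, ignoring position constraints for brevity; players are (name, salary, projected_points) tuples and the cap is illustrative:

```python
def greedy_lineup(players, cap=50_000, size=9):
    """Pick by points-per-dollar until the lineup or cap is exhausted."""
    lineup, spent = [], 0
    for p in sorted(players, key=lambda p: p[2] / p[1], reverse=True):
        if len(lineup) < size and spent + p[1] <= cap:
            lineup.append(p)
            spent += p[1]
    return lineup
```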
Implied Probability and Vig Analysis
You are given the following moneyline odds for five games:
```
Game 1: Team A -180, Team B +160
Game 2: Team C -110, Team D -110
Game 3: Team E +140, Team F -160
Game 4: Team G -125, Team H +105
Game 5: Team I -200, Team J +175
```
1. Calculate implied probability for each team
2. Calculate the vig for each game
3. Identify which game has the highest and lowest vig
4. If you believe Team B has a 42% chance to win, calculate the EV of betting $100 on them
5. Visualize implied probabilities vs. your model probabilities (create hypothetical model probabilities)
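The conversions behind steps 1, 2, and 4, sketched in Python:

```python
def implied_prob(american: int) -> float:
    """Implied win probability from American moneyline odds."""
    if american < 0:
        return -american / (-american + 100)
    return 100 / (american + 100)

# Game 1: the vig is the amount the implied probabilities exceed 1
vig = implied_prob(-180) + implied_prob(160) - 1

# Step 4: EV of $100 on Team B at +160 with a 42% win probability
ev = 0.42 * 160 - 0.58 * 100
```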
Bankroll Simulation
Simulate a season of betting under the following parameters:
1. Starting bankroll: $1,000
2. 100 bets over the season
3. Each bet has 53% win probability (representing edge over 50%)
4. Odds: -110 for all bets
5. Three bet sizing strategies: flat $50 per bet, 5% Kelly, 2% Kelly
For each strategy:
- Simulate 1,000 seasons (1,000 × 100 bets)
- Calculate median ending bankroll
- Calculate probability of ruin (ending bankroll < $100)
- Calculate 90th percentile outcome
- Visualize distribution of ending bankrolls
Discuss which strategy you'd recommend and why.
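A simulation skeleton covering all three strategies; the payout for -110 odds is 100/110 per dollar staked, and each sizing rule is passed in as a function:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(sizing, n_seasons=1000, n_bets=100, p=0.53, payout=100 / 110):
    """Ending bankrolls across simulated seasons for one sizing rule."""
    finals = []
    for _ in range(n_seasons):
        bank = 1000.0
        for _ in range(n_bets):
            stake = sizing(bank)
            if stake <= 0 or stake > bank:
                break  # effectively busted
            bank += stake * payout if rng.random() < p else -stake
        finals.append(bank)
    return np.array(finals)

flat = simulate(lambda b: 50.0)
frac = simulate(lambda b: 0.02 * b)  # e.g., a 2%-of-bankroll rule
print(np.median(flat), (flat < 100).mean(), np.percentile(flat, 90))
```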
---
Fantasy baseball and sports betting analytics demonstrate how rigorous quantitative methods inform decisions under uncertainty. Whether valuing players across multiple performance dimensions, optimizing roster construction against salary constraints, or calculating expected value for betting opportunities, the principles of probability, statistics, and optimization provide frameworks for systematic decision-making.
These applications also illustrate analytics' limitations. Fantasy success depends on projection accuracy—but player performance includes irreducible randomness. Sports betting models require edge over sophisticated market prices—but even small edges require disciplined bankroll management to survive variance. No amount of analytical sophistication eliminates uncertainty or guarantees profits.
The skills developed in this chapter—converting probabilities to decisions, optimizing under constraints, managing risk—transfer to countless domains beyond sports. Data scientists in any field face similar challenges: building predictive models, quantifying uncertainty, and making optimal choices given limited information. Baseball provides a rich, accessible environment for developing these capabilities.
Chapter 12 will explore advanced topics in baseball analytics, including machine learning applications, deep learning for pitch classification, and cutting-edge research areas shaping the field's future.
Team Building & Roster Construction (4 exercises)
Free Agent Cost Analysis
a) Compare cost per WAR across different position groups (pitchers vs hitters, premium positions vs corner positions)
b) Analyze whether older players (33+) cost more or less per WAR than younger free agents (28-30)
c) Identify which signing appears most efficient (best value) and least efficient (worst value)
**Data to collect:**
- Player name, position, age
- Contract terms (years, AAV)
- Projected WAR for first year (use Steamer or ZiPS)
**Hint:** Check the FanGraphs or Baseball Prospectus free-agent trackers for contract terms and projections.
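Once the data are collected, part (a) is a one-line grouped summary. A sketch with assumed column names:

```python
signings["cost_per_war"] = signings["AAV"] / signings["proj_WAR"]
print(signings.groupby("position_group")["cost_per_war"].median())
```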
Trade Surplus Value
**Your analysis should:**
a) Calculate total surplus value for each side of the trade (projected WAR × market rate - expected salary over years of control)
b) Apply discount rates to future value (use 5-10%)
c) Determine which team "won" the trade based on surplus value
d) Discuss how competitive windows might make the trade beneficial for both sides despite unequal surplus value
**Suggested trades:**
- Juan Soto to Padres (2022)
- Tyler Glasnow to Dodgers (2023)
- Dylan Cease to Padres (2024)
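The surplus-value arithmetic in parts (a)-(b), sketched with illustrative inputs; the $/WAR market rate and salaries are placeholders, not figures from the chapter:

```python
def surplus_value(war_by_year, salaries_m, dollars_per_war_m=8.0, rate=0.08):
    """Discounted surplus in $M: (WAR x market rate - salary), year by year."""
    return sum((war * dollars_per_war_m - sal) / (1 + rate) ** t
               for t, (war, sal) in enumerate(zip(war_by_year, salaries_m)))

print(surplus_value([4.5, 4.0, 3.5], salaries_m=[10, 12, 14]))
```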
Draft Pick Value Curve
**For picks 1-30:**
a) Calculate what percentage reached MLB (100+ PA or 50+ IP)
b) For those who reached MLB, calculate total career WAR through 2023
c) Build a value curve showing expected WAR by draft position
d) Identify which picks outperformed or underperformed expectations
**Extension:** Compare college vs high school players. Do high school picks have higher variance? Higher ceiling?
Competitive Window Modeling
a) Identify core players and project their WAR trajectory over next 5 years using aging curves
b) Estimate prospect contribution (consult top prospect lists)
c) Calculate total projected WAR and expected wins for each season
d) Determine the team's optimal strategy: compete now, rebuild, or middle ground
e) Recommend specific roster moves (trades, free agent signings, or sell-offs) that align with your recommended strategy
**Teams with interesting situations:**
- Baltimore Orioles (young core, rising)
- St. Louis Cardinals (aging core, crossroads)
- Los Angeles Angels (Trout aging, weak farm)
- Tampa Bay Rays (perennial contender, low payroll)
---
**Chapter Summary**
Team building combines economics, player valuation, strategic timing, and organizational philosophy. Key takeaways:
1. **Economic Efficiency**: Pre-arbitration players provide 40x ROI vs free agents; exploit this arbitrage
2. **Positional Value**: Premium defensive positions (C, SS, CF) allow lower offensive standards
3. **Free Agent Markets**: Account for aging curves, apply discount rates, avoid winner's curse
4. **Trade Strategy**: Exchange surplus value across different timelines; align with competitive windows
5. **Draft Philosophy**: Balance upside (high school) vs safety (college) based on organizational timeline
6. **Strategic Clarity**: Commit fully to competing or rebuilding; avoid mediocre middle ground
Successful team building requires analytical rigor, clear strategic vision, and disciplined execution. The best front offices combine quantitative analysis with qualitative evaluation, organizational development, and adaptive strategy. As analytics evolve, teams that integrate new methods while maintaining coherent long-term plans will sustain competitive advantage.
Player Development & Minor League Analytics (4 exercises)
Age-Adjusted Performance Analysis
**Data**:
```
Prospect: SS, Age 20
Level: High-A (League Avg Age: 22.8)
Stats: .275 AVG, .345 OBP, .485 SLG, 15 HR, 285 PA
12.2% BB%, 24.5% K%, .210 ISO
```
**Questions**:
1. Calculate the prospect's age-adjusted wRC+ (assume league average is 100)
2. How does the strikeout rate compare when adjusted for age?
3. Based on age-adjusted metrics, is this prospect ahead or behind the development curve?
4. What level should this prospect be promoted to next, and why?
Breakout Candidate Identification
**Prospect Comparison**:
| Metric | Prospect A | Prospect B | Prospect C |
|--------|-----------|-----------|-----------|
| Current wRC+ | 105 | 118 | 98 |
| Chase Rate Change | -4.5% | -1.2% | +2.1% |
| Zone Contact Change | +3.2% | +1.8% | -0.5% |
| Avg EV Change | +2.1 mph | +0.8 mph | +3.5 mph |
| Barrel Rate | 8.5% | 11.2% | 6.8% |
| Age | 22 | 24 | 21 |
**Questions**:
1. Calculate a composite breakout score for each prospect
2. Which prospect shows the most promising leading indicators?
3. What specific improvements drive your choice?
4. What realistic wRC+ would you project for each prospect next season?
International Prospect Translation
**Player Data**:
```
Player: OF, Age 26
League: KBO (Korean Baseball Organization)
Stats: .318 AVG, .385 OBP, .538 SLG, 28 HR, 550 PA
9.5% BB%, 15.2% K%, .220 ISO
Previous MLB exposure: None
```
**Questions**:
1. Translate the KBO statistics to MLB equivalents using appropriate league factors
2. What MLB slash line would you project for Year 1?
3. What is the biggest risk factor in this projection?
4. How would your projection change if the player were age 23 instead of 26?
Call-Up Decision Analysis
**Prospect Profile**:
```
Position: 3B, Age 22
AAA Stats (145 PA): .298/.375/.512, 6 HR, 12.4% BB%, 21.2% K%
AA Stats (425 PA): .285/.360/.485, 18 HR, 10.1% BB%, 24.5% K%
Defensive Grade: 55 (above average)
Future WAR Projection: 4.0 WAR annually (ages 25-29)
```
**Team Context**:
```
Current 3B Production: 85 wRC+ (below average)
Team Record: 15-18 (below .500)
Payroll Situation: Middle of pack
Days until full year service time: 12 days (mid-April)
Estimated Super Two cutoff: Already passed
```
**Questions**:
1. Calculate the financial value of delaying the call-up until mid-June
2. What is the estimated WAR cost of keeping an 85 wRC+ player at 3B for 6 more weeks?
3. Make your recommendation: Call up now or wait? Justify with analysis.
4. What performance threshold would make you change your decision?
---
**Exercise Solutions**: Solutions to these exercises involve combining multiple analytical techniques from the chapter. Students should use the code frameworks provided to build their own analysis pipelines, applying appropriate age adjustments, translation factors, and decision models. The exercises emphasize practical decision-making under uncertainty, mirroring real-world front office challenges.
Advanced Statistical Methods (4 exercises)
Bayesian Rookie Evaluation
1. Calculate a Bayesian estimate of his true ERA using a prior of N(4.00, 0.80^2)
2. Construct a 95% credible interval
3. What's the probability his true ERA is below 3.50?
4. How many innings would he need to pitch at 2.52 ERA before we're 90% confident his true ERA is below 3.50?
Implement your solution in R or Python and visualize the posterior distribution.
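A normal-normal conjugate sketch for steps 1-3. The observation noise `sigma_obs` (how much a sample ERA varies around true talent over the rookie's workload) is an assumption you must estimate:

```python
import numpy as np
from scipy import stats

prior_mu, prior_sd = 4.00, 0.80
obs_era, sigma_obs = 2.52, 1.20  # sigma_obs is illustrative

post_var = 1 / (1 / prior_sd**2 + 1 / sigma_obs**2)
post_mu = post_var * (prior_mu / prior_sd**2 + obs_era / sigma_obs**2)
post_sd = np.sqrt(post_var)

print(stats.norm.interval(0.95, loc=post_mu, scale=post_sd))  # credible interval
print(stats.norm.cdf(3.50, loc=post_mu, scale=post_sd))       # P(true ERA < 3.50)
```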
Hierarchical Team-Player Model
1. Fit a hierarchical model where player OPS varies by player and team
2. Extract team-level random effects
3. Compare team effects to actual team winning percentages
4. Identify players whose performance is most "above" or "below" what the hierarchical model predicts based on their team
5. Discuss: What does it mean when a player substantially outperforms his hierarchical prediction?
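A random-intercept model is a simple entry point for step 1. A sketch with statsmodels; the `df` layout (OPS and Team columns) is assumed:

```python
import statsmodels.formula.api as smf

fit = smf.mixedlm("OPS ~ 1", data=df, groups=df["Team"]).fit()
# Team-level deviations from the overall intercept
team_effects = {team: re.iloc[0] for team, re in fit.random_effects.items()}
```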
Detecting Velocity Changes
1. Plot velocity over time (by game or pitch number)
2. Fit a smooth trend to identify any systematic velocity decline
3. Use changepoint detection to identify if there was an abrupt velocity drop (potentially indicating injury)
4. Calculate the probability that any detected changepoint is real vs noise
5. If you detect a changepoint, investigate whether the pitcher's effectiveness (measured by wOBA allowed or K-BB%) changed simultaneously
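For step 3, a dedicated changepoint library is one option. A sketch with the third-party `ruptures` package; `velo` is your pitch-ordered velocity series:

```python
import ruptures as rpt

signal = velo.to_numpy().reshape(-1, 1)
algo = rpt.Pelt(model="l2", min_size=50).fit(signal)
breakpoints = algo.predict(pen=10)  # the penalty tunes sensitivity
```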
Causal Effect of Designated Hitter
1. Compare the career length of pitchers in AL vs NL (before universal DH in 2022)
2. Account for confounding variables: pitcher quality (career ERA+), usage pattern (starter vs reliever), age at debut
3. Estimate the causal effect of playing in the AL (with DH) vs NL (pitchers bat) on pitcher career length
4. Discuss: Why might the DH extend (or shorten) pitcher careers? What are alternative explanations for any observed difference?
---
You've now explored six advanced statistical techniques with particular application to baseball analytics. Bayesian methods help us update beliefs rationally in the face of limited data. Hierarchical models share information across groups to improve estimates. Time series methods model temporal trends and detect change points. Survival analysis quantifies time-to-event processes. Causal inference attempts to establish causation from observational data. Monte Carlo simulation quantifies uncertainty and probabilities in complex systems.
These methods represent the frontier of baseball analytics. While traditional statistics remain valuable for description and basic inference, these advanced techniques enable analysts to answer more sophisticated questions: not just "what happened?" but "what's the player's true talent?", "what caused this change?", "what will happen?", and "how certain should we be?"
As you apply these methods to real baseball questions, remember that sophisticated techniques don't guarantee correct answers. Always question your assumptions: Are your priors reasonable? Are your groups truly comparable? Is your model appropriate for your data structure? The best analysts combine advanced statistical methods with deep baseball knowledge and healthy skepticism about their own conclusions.
In the next chapter, we'll explore machine learning methods that build on these statistical foundations to create predictive models for player performance, game outcomes, and strategic decisions.
Umpire Analysis & Strike Zone Modeling (4 exercises)
Umpire Accuracy Analysis
a) Calculate the overall accuracy rate for each umpire (minimum 1,000 called pitches)
b) Identify the five most accurate and five least accurate umpires
c) Create a visualization comparing each umpire's accuracy on pitches inside vs. outside the strike zone
d) Test whether there is a statistically significant difference in accuracy between the most and least accurate umpires
**Hint:** Use a two-sample t-test or permutation test to assess statistical significance. Consider whether accuracy rates are normally distributed.
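A permutation-test sketch for part (d), given boolean arrays `a` and `b` of correct/incorrect calls for the two umpires:

```python
import numpy as np

rng = np.random.default_rng(1)
observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])
diffs = np.empty(10_000)
for i in range(10_000):
    rng.shuffle(pooled)
    diffs[i] = pooled[:len(a)].mean() - pooled[len(a):].mean()
pval = (np.abs(diffs) >= abs(observed)).mean()
```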
Strike Zone Visualization
a) Create a heat map showing the probability of a called strike at different locations
b) Overlay the rulebook strike zone on your visualization
c) Identify regions where the umpire's zone significantly differs from the rulebook (>20 percentage points)
d) Create a similar visualization for the league average and place them side-by-side for comparison
**Hint:** Use 2D binning or kernel density estimation to create smooth probability surfaces. The `stat_summary_2d()` function in ggplot2 or `scipy.stats.binned_statistic_2d()` in Python are helpful.
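The hint's Python route, sketched; `plate_x` and `plate_z` are standard Statcast fields, and `is_called_strike` is a 0/1 column you derive:

```python
from scipy.stats import binned_statistic_2d

stat, xedges, yedges, _ = binned_statistic_2d(
    df["plate_x"], df["plate_z"], df["is_called_strike"],
    statistic="mean", bins=[20, 20],
)  # stat[i, j] = called-strike probability in that location bin
```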
Predicting Called Strikes
a) Train a logistic regression model using pitch location, count, batter handedness, and pitcher handedness as features
b) Train a random forest model with the same features
c) Add umpire identity as a feature to both models (use one-hot encoding)
d) Compare the models using AUC, accuracy, and calibration plots
e) Identify which features are most important in each model
f) Use the best model to identify the 10 most surprising calls from the 2024 season (largest difference between predicted probability and actual call)
**Hint:** Feature importance can be extracted from logistic regression coefficients and random forest's `feature_importances_` attribute. For surprising calls, look for high-probability strikes called balls and vice versa.
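A compact sketch of parts (a)-(d); the `calls` frame layout is an assumption:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = pd.get_dummies(calls[["plate_x", "plate_z", "balls", "strikes",
                          "stand", "p_throws"]], drop_first=True)
y = calls["is_called_strike"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```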
ABS Impact Simulation
a) For each pitch in your dataset, determine whether the human umpire's call matches what ABS would call
b) Calculate the overall agreement rate and identify systematic biases (e.g., do human umpires call more strikes or fewer strikes than ABS?)
c) Estimate how strikeout rates and walk rates would change under full ABS (focus on pitches with 2 strikes and 3 balls respectively)
d) Calculate the expected number of calls that would be overturned per game
e) Analyze whether certain types of pitchers (high strikeout, high walk, etc.) would be helped or hurt more by ABS
**Hint:** You'll need to define the ABS zone precisely using the sz_top and sz_bot variables. Consider grouping pitchers by strikeout and walk rates to assess differential impacts.
---
This chapter has covered the fundamentals of umpire analysis and strike zone modeling, from defining accuracy metrics to building predictive models and evaluating the potential impact of automated systems. As MLB continues to consider the role of technology in officiating, these analytical tools will remain essential for understanding how umpires influence the game and how changes to ball-strike calling might affect gameplay and strategy. The combination of granular pitch-tracking data and sophisticated statistical modeling allows us to evaluate umpire performance with unprecedented precision while also informing important decisions about the future of the sport.
International Baseball Analytics (4 exercises)
NPB Translation Model
**Player Profile:**
- Age: 25
- Final NPB season: .305/.380/.520, 28 HR in 550 PA
- Position: Corner OF
- Exit velocity: 106 mph (NPB measurement)
**Tasks:**
1. Apply the translation factors from Section 23.2
2. Calculate projected MLB slash line and HR total
3. Estimate first-year WAR using the projection
4. Assess confidence level and identify key uncertainties
**Bonus:** Compare your projection to actual performance of similar NPB players (e.g., Seiya Suzuki, Masataka Yoshida).
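The mechanics of task 1, with loudly placeholder factors; substitute the actual translation factors from Section 23.2:

```python
factors = {"AVG": 0.88, "OBP": 0.90, "SLG": 0.82}  # placeholders, NOT the book's
npb = {"AVG": 0.305, "OBP": 0.380, "SLG": 0.520}
print({k: round(npb[k] * factors[k], 3) for k in npb})
```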
KBO Pitcher Projection
**Pitcher Profile:**
- 2.65 ERA, 1.15 WHIP
- 9.8 K/9, 2.8 BB/9, 0.75 HR/9
- 175 IP, 15-6 record
**Tasks:**
1. Using the KBO pitcher translation model from Section 23.3, project his MLB stats
2. Calculate projected FIP and ERA
3. Build a confidence interval for your ERA projection
4. Compare to similar KBO pitchers (e.g., Hyun-Jin Ryu, Kwang-Hyun Kim)
5. Recommend a contract structure based on projection and risk
Latin American Tool-Based Valuation
**Prospect A:**
- Age: 17
- Hit: 60, Power: 70, Speed: 55, Field: 55, Arm: 60
- Exit velocity: 108 mph
- Asking bonus: $3.5M
**Prospect B:**
- Age: 16
- Hit: 65, Power: 60, Speed: 70, Field: 65, Arm: 60
- Exit velocity: 104 mph
- Asking bonus: $4.0M
**Prospect C:**
- Age: 18
- Hit: 55, Power: 75, Speed: 45, Field: 50, Arm: 55
- Exit velocity: 112 mph
- Asking bonus: $2.5M
**Tasks:**
1. Use the Latin American projection model from Section 23.4
2. Project WAR through age 23 for each prospect
3. Calculate value per dollar of bonus
4. Rank the prospects considering both ceiling and floor outcomes
5. Recommend which prospect(s) to sign and at what price
**Advanced:** Simulate 1,000 career paths for each prospect incorporating uncertainty and injury risk.
International League Environment Analysis
**Tasks:**
1. Calculate park-adjusted metrics for each league
2. Estimate "true talent" translation factors using regression to the mean
3. Build a Bayesian updating system that improves projections as players accumulate MLB PA
4. Create visualizations comparing league offensive environments over time (2015-2023)
5. Develop recommendations for adjusting scouting priorities based on league trends
**Data Required:**
- League-wide statistics (provided in section)
- Park factors (research or estimate)
- Historical translation success rates
**Deliverables:**
- R or Python code implementing your analysis
- Report summarizing findings
- Recommendations for international scouting departments
---
Front Office & Analytics Career Guide (4 exercises)
Pitcher Arsenal Analysis and Optimization
**Skills Demonstrated**: Data acquisition, exploratory analysis, visualization, strategic thinking
**Project Steps**:
1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
2. Analyze pitch characteristics (velocity, movement, spin)
3. Evaluate pitch effectiveness by count and situation
4. Identify optimization opportunities
5. Create compelling visualizations
6. Write executive summary with recommendations
**R Implementation**:
```r
# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns
library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)
# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data
  set.seed(123)
  n_pitches <- 2500
  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}
# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")
# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
n = n(),
pct = n() / nrow(pitcher_data),
avg_velo = mean(release_speed, na.rm = TRUE),
avg_spin = mean(release_spin_rate, na.rm = TRUE)
) %>%
arrange(desc(n))
print("Pitch Mix:")
print(pitch_mix)
# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
usage = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
na.rm = TRUE),
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
# NOTE: a proxy only; true chase rate needs pitch location (out-of-zone swings)
chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
) %>%
arrange(desc(csw_rate))
print("\nPitch Effectiveness:")
print(pitch_effectiveness)
# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
mutate(count = paste0(balls, "-", strikes)) %>%
group_by(count, pitch_type) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
.groups = "drop"
) %>%
group_by(count) %>%
mutate(usage_pct = n / sum(n)) %>%
arrange(count, desc(usage_pct))
# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
group_by(pitch_type, stand) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = stand,
values_from = c(n, whiff_rate, avg_xwoba),
names_sep = "_"
)
print("\nPlatoon Splits:")
print(platoon_splits)
# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
"FF" = "#d22d49", "SI" = "#FE9D00",
"SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)
movement_plot <- ggplot(pitcher_data,
aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
geom_point(alpha = 0.3, size = 2) +
stat_ellipse(level = 0.75, size = 1.2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Pitch Movement Profile",
subtitle = "Catcher's perspective (RHP)",
x = "Horizontal Break (inches)",
y = "Induced Vertical Break (inches)",
color = "Pitch Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
) +
coord_fixed()
# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
geom_point(alpha = 0.4, size = 2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Velocity vs. Spin Rate",
x = "Release Speed (mph)",
y = "Spin Rate (rpm)",
color = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")
# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
geom_col(position = "stack") +
scale_fill_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Pitch Usage by Count",
x = "Count",
y = "Usage %",
fill = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")
# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
plot_annotation(
title = "Comprehensive Pitcher Arsenal Analysis",
subtitle = "Example Pitcher - 2024 Season",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)
print(combined_plot)
# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")
  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)
  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")
  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)
  if (nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for (i in 1:nrow(underused)) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }
  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)
  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")
  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}
generate_recommendations(pitcher_data, pitch_effectiveness)
# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")
```
**Portfolio Presentation Tips**:
- Include interactive visualizations (consider using plotly)
- Compare pitcher to league averages
- Add context about pitcher role and team strategy
- Discuss limitations (sample size, park factors, etc.)
- Provide actionable recommendations
Player Aging Curves and Performance Projection
**Skills Demonstrated**: Statistical modeling, time series analysis, predictive analytics, data visualization
**Python Implementation**:
```python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)
# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)
    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)
        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)
        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(
                0, years_range[1] - years_range[0])
            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor
            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))
            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)
            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })
    return pd.DataFrame(players)
# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)
print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")
# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])
    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]
    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()
    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()
    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']
    return aging_curve
# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")
metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}
for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")
# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])
    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)
    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)
    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")
# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]
    # Plot raw deltas, sized by sample count
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n'] * 2, alpha=0.6, label='Observed')
    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                            curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)
    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)
    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")
# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters
        ----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns
        -------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (1:2:3, most recent highest),
        # aligned with .iloc[-3:], which runs oldest to newest
        recent = player_history.iloc[-3:]
        weights = np.array([1, 2, 3])[-len(recent):]
        weights = weights / weights.sum()
        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]
        projections = {}
        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue
            # Weighted average of recent performance
            baseline = np.average(recent[metric], weights=weights)
            # Apply aging curve year by year
            model, poly = self.aging_models[metric]
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment
            projections[metric] = projected_value
        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward
        return projections
# 5. Test Projection System
projector = PlayerProjector(fitted_models)
# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]
test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')
print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")
print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))
# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)
for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")
# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate one-year-ahead projection accuracy on historical data.
    """
    results = []
    for player_id in data['player_id'].unique():
        history = data[data['player_id'] == player_id].sort_values('age')
        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(history) < 4:
            continue
        # Use all but the last season for projection
        train_data = history.iloc[:-1]
        actual_data = history.iloc[-1]
        if len(train_data) < 3:
            continue
        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)
        except Exception:
            continue
        for metric in ['wRC_plus', 'ISO', 'AVG']:
            if metric in projection:
                results.append({
                    'player_id': player_id,
                    'metric': metric,
                    'actual': actual_data[metric],
                    'projected': projection[metric],
                    'error': projection[metric] - actual_data[metric]
                })
    return pd.DataFrame(results)
print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)
print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)
for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]
    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())
        # Calculate R-squared
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0
        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")
# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]
    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                   alpha=0.4, s=30)
        # Add y = x reference line
        min_val = min(metric_eval['actual'].min(),
                      metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(),
                      metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
                'r--', linewidth=2, label='Perfect Projection')
        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                     fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")
```
**Extension Ideas**:
- Incorporate minor league translation factors
- Add injury risk modeling
- Create playing time projections
- Develop position-specific aging curves
- Compare to established projection systems (Steamer, ZiPS)
Draft Value Analysis and Strategy Optimization
**Skills Demonstrated**: Data analysis, value modeling, strategic thinking, data visualization
**Key Analysis Components**:
```r
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy
library(tidyverse)
library(survival)
library(ggplot2)
library(scales)
# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)
  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round / 20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )
  return(drafts)
}
# Generate data
draft_data <- generate_draft_data()
print(sprintf("Generated %d draft picks from %d drafts",
nrow(draft_data), n_distinct(draft_data$year)))
# 1. Success Rate by Round
success_by_round <- draft_data %>%
group_by(round) %>%
summarize(
n_picks = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
total_war = sum(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
) %>%
filter(round <= 20) # Focus on first 20 rounds
print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))
# 2. Value Curve Estimation
value_curve <- draft_data %>%
group_by(overall_pick) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
) %>%
filter(overall_pick <= 300)
# Fit exponential decay model
value_model <- nls(
expected_war ~ a * exp(-b * overall_pick),
data = value_curve %>% filter(expected_war > 0),
start = list(a = 10, b = 0.01)
)
# Add fitted values
value_curve$fitted_war <- predict(
value_model,
newdata = data.frame(overall_pick = value_curve$overall_pick)
)
print("\nValue Curve Model:")
print(summary(value_model))
# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
geom_line(aes(y = fitted_war), color = "red", size = 1.2) +
geom_vline(xintercept = c(30, 60, 90),
linetype = "dashed", alpha = 0.3) +
annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
label = "Round 1", size = 3.5) +
annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
label = "Round 2", size = 3.5) +
labs(
title = "MLB Draft Pick Value Curve",
subtitle = "Expected career WAR by draft position",
x = "Overall Pick",
y = "Expected Career WAR",
caption = "Exponential decay model fitted to historical data"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11)
)
print(value_plot)
# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(position, round) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
.groups = "drop"
) %>%
group_by(position) %>%
summarize(
total_picks = sum(n),
avg_mlb_rate = mean(mlb_rate),
avg_war = mean(avg_war)
) %>%
arrange(desc(avg_war))
print("\nPosition-Specific Success Rates:")
print(position_analysis)
# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(player_type) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
)
print("\nCollege vs High School Performance:")
print(player_type_analysis)
# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
mutate(
war_per_million = war_if_mlb / (signing_bonus / 1000000),
pick_group = case_when(
overall_pick <= 30 ~ "Top 30",
overall_pick <= 60 ~ "31-60",
overall_pick <= 100 ~ "61-100",
TRUE ~ "100+"
)
) %>%
group_by(pick_group) %>%
summarize(
n = n(),
avg_bonus = mean(signing_bonus),
avg_war = mean(war_if_mlb),
war_per_million = mean(war_per_million)
)
print("\nReturn on Investment by Pick Range:")
print(roi_analysis)
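# Caveat: filtering to reached_mlb == 1 before computing WAR per dollar
# drops every bust, so later pick ranges look like better bargains than
# they are. A full ROI analysis would divide total WAR by bonuses paid
# to ALL picks in each range, busts included.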
# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
"""
Simple optimization: maximize expected WAR given bonus pool constraints
"""
# Get expected value for each pick
pick_values <- value_curve %>%
filter(overall_pick %in% available_picks) %>%
left_join(
draft_data %>%
group_by(overall_pick) %>%
summarize(avg_slot = mean(slot_value)),
by = "overall_pick"
)
# Greedy algorithm: pick highest value/cost ratio within budget
selected <- tibble()
remaining_budget <- budget
remaining_picks <- pick_values
  while (nrow(remaining_picks) > 0 && remaining_budget > 0) {
# Calculate value per dollar
remaining_picks <- remaining_picks %>%
mutate(value_per_dollar = expected_war / avg_slot)
# Select best value pick we can afford
best_pick <- remaining_picks %>%
filter(avg_slot <= remaining_budget) %>%
slice_max(value_per_dollar, n = 1)
if(nrow(best_pick) == 0) break
selected <- bind_rows(selected, best_pick)
remaining_budget <- remaining_budget - best_pick$avg_slot
remaining_picks <- remaining_picks %>%
filter(overall_pick != best_pick$overall_pick)
}
return(selected)
}
# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000
optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)
print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
select(overall_pick, expected_war, avg_slot, value_per_dollar))
# 8. Comprehensive Dashboard Visualization
library(patchwork)
library(scales)  # percent() and percent_format() used below
# Plot 1: Success rate by round
p1 <- success_by_round %>%
filter(round <= 10) %>%
ggplot(aes(x = round, y = mlb_rate)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
vjust = -0.5, size = 3) +
scale_y_continuous(labels = percent_format()) +
labs(title = "MLB Success Rate by Round",
x = "Draft Round", y = "% Reaching MLB") +
theme_minimal()
# Plot 2: WAR distribution
p2 <- draft_data %>%
filter(reached_mlb == 1, overall_pick <= 100) %>%
ggplot(aes(x = war_if_mlb)) +
geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
labs(title = "Career WAR Distribution (MLB Players)",
x = "Career WAR", y = "Count") +
theme_minimal()
# Plot 3: Position comparison
p3 <- draft_data %>%
filter(reached_mlb == 1, round <= 5) %>%
ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
geom_boxplot(alpha = 0.7) +
labs(title = "WAR by Position (Rounds 1-5)",
x = "Position", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")
# Plot 4: College vs HS
p4 <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
labs(title = "College vs HS Performance",
x = "Player Type", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")
# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
plot_annotation(
title = "MLB Draft Analysis Dashboard",
subtitle = "Historical performance metrics and value analysis",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)
print(combined)
# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")
cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf(" - First round produces %.1f%% of total draft WAR\n",
100 * first_round_war / total_war))
cat("\n2. SUCCESS RATES\n")
cat(sprintf(" - Round 1: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[1]))
cat(sprintf(" - Round 5: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[5]))
cat(sprintf(" - Round 10: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[10]))
cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf(" - Average time to debut: %.1f years\n",
mean(draft_data$years_to_debut, na.rm = TRUE)))
cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat(" - Prioritize early picks; value drops exponentially\n")
cat(" - Consider college players for faster development\n")
cat(" - High school players have higher variance in outcomes\n")
cat(" - Pitchers dominate draft but consider positional scarcity\n")
cat(" - Later rounds: focus on high-ceiling, high-risk players\n")
cat("\n=== ANALYSIS COMPLETE ===\n")
```
**Portfolio Enhancements**:
- Add international signing analysis
- Compare team draft performance (see the sketch after this list)
- Analyze specific draft classes
- Include financial constraints modeling
- Compare to prospect ranking systems
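One of these enhancements, comparing team draft performance, is sketched below against real draft results. This assumes your installed pybaseball version exposes `amateur_draft_by_team` and that the returned frame carries `Name` and `WAR` columns; verify against the pybaseball documentation before building on it.
```python
# Team draft comparison sketch (the amateur_draft_by_team call and the
# column names are assumptions about pybaseball's API)
import pandas as pd
from pybaseball import amateur_draft_by_team

teams = ['TBD', 'HOU', 'LAD']          # sample teams to compare
records = []
for team in teams:
    for year in range(2008, 2013):     # older classes have mature careers
        picks = amateur_draft_by_team(team, year)
        records.append(picks.assign(team=team, year=year))

drafts = pd.concat(records, ignore_index=True)
# Rank teams by total career WAR produced by their picks
summary = (drafts.groupby('team')
                 .agg(n_picks=('Name', 'count'), total_war=('WAR', 'sum'))
                 .sort_values('total_war', ascending=False))
print(summary)
```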
Defensive Positioning and Shift Analysis
**Skills Demonstrated**: Spatial analysis, causal inference, strategic analysis, data visualization
**Implementation Framework**:
```python
# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)
# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
"""
Simulate batted ball locations and outcomes.
Coordinates in feet from home plate.
"""
np.random.seed(42)
data = []
for _ in range(n_balls):
# Batter handedness
stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])
# Shift decision (more common vs pull hitters)
is_shifter = np.random.random() < 0.3
shift_on = is_shifter and (np.random.random() < 0.7)
# Hit location (pull tendency varies)
if stand == 'R':
# Righties pull left
if is_shifter:
angle = np.random.normal(-25, 35) # Pull-heavy
else:
angle = np.random.normal(-10, 45) # Balanced
else:
# Lefties pull right
if is_shifter:
angle = np.random.normal(25, 35)
else:
angle = np.random.normal(10, 45)
# Distance based on exit velo and launch angle
exit_velo = np.random.normal(88, 8)
launch_angle = np.random.normal(12, 18)
# Simplified distance calculation
distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
distance = max(50, min(400, distance + np.random.normal(0, 20)))
# Convert to x, y coordinates
angle_rad = np.radians(angle)
x = distance * np.sin(angle_rad)
y = distance * np.cos(angle_rad)
# Hit outcome (shift effectiveness)
if shift_on:
# Shift reduces hits in pull direction
if stand == 'R' and x < -50:
prob_hit = 0.18 # Reduced by shift
elif stand == 'L' and x > 50:
prob_hit = 0.18
else:
prob_hit = 0.28 # Normal rate
else:
prob_hit = 0.25
        # Deeper balls are harder to field, so scale hit probability
        # upward with distance (capped at 0.95)
        prob_hit = min(0.95, prob_hit * (distance / 250))
is_hit = np.random.random() < prob_hit
data.append({
'x': x,
'y': y,
'distance': distance,
'angle': angle,
'exit_velo': exit_velo,
'launch_angle': launch_angle,
'stand': stand,
'shift_on': shift_on,
'is_hit': is_hit,
'is_shifter': is_shifter
})
return pd.DataFrame(data)
# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)
print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")
# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
'is_hit': ['mean', 'count'],
'exit_velo': 'mean'
}).round(3)
print("\nShift Effectiveness:")
print(shift_analysis)
# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
"""
Estimate runs saved by shifting.
"""
results = []
for stand in ['R', 'L']:
for shifter in [True, False]:
subset = data[(data['stand'] == stand) &
(data['is_shifter'] == shifter)]
if len(subset) == 0:
continue
shifted = subset[subset['shift_on'] == True]
no_shift = subset[subset['shift_on'] == False]
if len(shifted) > 0 and len(no_shift) > 0:
babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
# Approximate run value per hit prevented: ~0.5 runs
runs_saved_per_pa = babip_diff * 0.5
results.append({
'stand': stand,
'is_shifter': shifter,
'shifted_babip': shifted['is_hit'].mean(),
'no_shift_babip': no_shift['is_hit'].mean(),
'babip_diff': babip_diff,
'runs_saved_per_100pa': runs_saved_per_pa * 100,
'n_shifted': len(shifted),
'n_no_shift': len(no_shift)
})
return pd.DataFrame(results)
shift_value = calculate_shift_value(bb_data)
print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))
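# Quick extrapolation from the per-100-PA numbers above. The volume figure
# is hypothetical (simulation-scale), chosen only to show the arithmetic.
shifted_bip_per_season = 1200  # assumed team-season shifted balls in play
for _, row in shift_value[shift_value['is_shifter']].iterrows():
    season_runs = row['runs_saved_per_100pa'] * shifted_bip_per_season / 100
    print(f"{row['stand']}HB pull hitters: ~{season_runs:.1f} runs/season "
          f"at {shifted_bip_per_season} shifted balls in play")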
# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
"""
Plot baseball field with hit locations.
"""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 10))
# Draw field outline
# Infield dirt
infield = patches.Wedge((0, 0), 95, 45, 135,
facecolor='tan', alpha=0.3)
ax.add_patch(infield)
# Outfield grass
outfield = patches.Wedge((0, 0), 400, 45, 135,
facecolor='green', alpha=0.1)
ax.add_patch(outfield)
# Foul lines
ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)
# Plot hits
hits = data[data['is_hit'] == True]
outs = data[data['is_hit'] == False]
ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
s=20, label='Out')
ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
s=30, label='Hit')
ax.set_xlim(-320, 320)
ax.set_ylim(0, 400)
ax.set_aspect('equal')
ax.set_xlabel('Distance from center (ft)', fontsize=11)
ax.set_ylabel('Distance from home (ft)', fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.2)
return ax
# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
(bb_data['is_shifter'] == True)]
plot_field_with_hits(
rhb_shifter[rhb_shifter['shift_on'] == False],
'RHB Pull Hitter - No Shift',
ax=ax1
)
plot_field_with_hits(
rhb_shifter[rhb_shifter['shift_on'] == True],
'RHB Pull Hitter - Shift On',
ax=ax2
)
plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")
# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
"""
Create BABIP heat map for given conditions.
"""
subset = data[(data['shift_on'] == shift_status) &
(data['stand'] == stand)]
# Create grid
x_bins = np.linspace(-250, 250, 25)
y_bins = np.linspace(50, 350, 20)
grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))
for i in range(len(y_bins)-1):
for j in range(len(x_bins)-1):
mask = ((subset['x'] >= x_bins[j]) &
(subset['x'] < x_bins[j+1]) &
(subset['y'] >= y_bins[i]) &
(subset['y'] < y_bins[i+1]))
cell_data = subset[mask]
if len(cell_data) >= 5: # Minimum sample
grid_babip[i, j] = cell_data['is_hit'].mean()
grid_count[i, j] = len(cell_data)
else:
grid_babip[i, j] = np.nan
return grid_babip, x_bins, y_bins, grid_count
# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
for i, stand in enumerate(['R', 'L']):
for j, shift_on in enumerate([False, True]):
ax = axes[i, j]
shifters = bb_data[bb_data['is_shifter'] == True]
grid, x_bins, y_bins, counts = create_babip_heatmap(
shifters, shift_on, stand
)
im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
y_bins[0], y_bins[-1]],
origin='lower', cmap='RdYlGn_r',
vmin=0, vmax=0.5, aspect='auto')
shift_text = "Shift On" if shift_on else "No Shift"
hand_text = "RHB" if stand == 'R' else "LHB"
ax.set_title(f'{hand_text} - {shift_text}',
fontsize=11, fontweight='bold')
ax.set_xlabel('Horizontal Position (ft)')
ax.set_ylabel('Distance from Home (ft)')
# Add colorbar
plt.colorbar(im, ax=ax, label='BABIP')
plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")
# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
# Prepare features for shift decision model
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)
X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # ROC-AUC needs scores, not hard labels
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
model.coef_[0]):
print(f" {feat}: {coef:.3f}")
# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)
print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
print(f" {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")
print("\n2. WHEN TO SHIFT")
print(" - Strong pull tendency (>70% pull rate)")
print(" - Ground ball hitters (LA < 10°)")
print(" - Extreme pull hitters benefit most from aggressive shifts")
print("\n3. SHIFT VARIATIONS")
print(" - Full shift: 3 infielders on pull side")
print(" - Partial shift: 2.5 infielders pull side")
print(" - No shift: Traditional alignment")
print(" - Decision should consider:")
print(" * Batter's spray chart")
print(" * Game situation (runners, outs)")
print(" * Pitcher's ground ball rate")
print("\n4. LIMITATIONS & CONSIDERATIONS")
print(" - Shift beaten by opposite field hits")
print(" - Bunt defense vulnerabilities")
print(" - Runner advancement opportunities")
print(" - Pitcher-specific adjustments")
print("\n5. FUTURE ANALYSIS")
print(" - Pitcher-specific positioning")
print(" - Count-based positioning adjustments")
print(" - Outfield positioning optimization")
print(" - Real-time adjustment algorithms")
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
```
**Portfolio Development Tips**:
- Use real Statcast spray chart data when possible (see the sketch after this list)
- Incorporate expected outcomes (xBA, xwOBA)
- Add video analysis component
- Compare to MLB team shift strategies
- Analyze shift effectiveness by ballpark
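The first tip is sketched below using pybaseball's Statcast functions. The coordinate conversion uses the constants the analytics community commonly cites for turning `hc_x`/`hc_y` into approximate feet from home plate; treat the exact values, and the example player, as assumptions.
```python
# Real spray chart sketch: swap the simulated generator above for
# Statcast batted balls from one hitter's season
from pybaseball import playerid_lookup, statcast_batter

pid = playerid_lookup('arraez', 'luis')['key_mlbam'].iloc[0]  # example hitter
raw = statcast_batter('2023-04-01', '2023-10-01', pid)
bip = raw.dropna(subset=['hc_x', 'hc_y']).copy()

# Commonly used approximate conversion to feet from home plate
bip['x'] = 2.5 * (bip['hc_x'] - 125.42)
bip['y'] = 2.5 * (198.27 - bip['hc_y'])
bip['is_hit'] = bip['events'].isin(['single', 'double', 'triple', 'home_run'])

# The plotting helper defined earlier reuses these columns directly
plot_field_with_hits(bip, 'Real Statcast Spray Chart')
```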
---