Practice Exercises

Master baseball analytics through hands-on practice. Work through exercises that reinforce concepts from each chapter, from basic data wrangling to advanced predictive modeling.

48 Total Exercises
12 Chapters
3 Difficulty Levels

How to Practice Effectively

Get the most out of these exercises with our recommended approach

1. Read the Chapter First

Make sure you've completed the corresponding chapter before attempting exercises. Understanding the concepts will make problem-solving much easier.

2. Try Before You Peek

Attempt each exercise on your own before looking at hints or solutions. Struggling through problems builds deeper understanding than copying answers.

3. Experiment & Extend

After solving an exercise, try modifying it. Use different players, seasons, or metrics. This reinforces learning and builds real analysis skills.

Understanding Difficulty Levels

Exercises are categorized by difficulty to help you build skills progressively. Start with easy exercises to build confidence, then work your way up.

Easy
Foundation Building

Basic operations like loading data, simple calculations, and fundamental visualizations. Perfect for beginners or warming up.

Medium
Skill Development

Multi-step problems requiring data manipulation, metric calculations, and combining multiple concepts. The core of your learning.

Hard
Advanced Application

Complex problems requiring creative problem-solving, advanced techniques, and often combining skills from multiple chapters.

Exercises by Difficulty
Easy 0 exercises
Medium 11 exercises
Hard 37 exercises

Total Exercises: 48

Hard Exercises

37 exercises
Exercise 1.1
Environment Setup Verification
Hard
Write code in your chosen language (R or Python) that:

1. Loads the necessary baseball analytics packages
2. Prints your R or Python version
3. Queries basic information about a player of your choice using the baseball data package
4. Creates a simple plot (any plot) to verify visualization works

This confirms your environment is properly configured.
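A minimal Python version of this check might look like the sketch below. It assumes `pybaseball` and `matplotlib` are installed for steps 3-4 and degrades gracefully if either is missing; the player looked up is arbitrary, and the lookup itself requires a network connection.

```python
import sys

# Steps 1-2: report the interpreter version
print(f"Python version: {sys.version.split()[0]}")

# Step 3: query a player (needs pybaseball and a network connection;
# the player chosen here is arbitrary)
try:
    from pybaseball import playerid_lookup
    info = playerid_lookup("judge", "aaron")
    print(info[["name_first", "name_last", "key_mlbam"]])
except Exception as exc:
    print(f"Player lookup skipped: {exc}")

# Step 4: any simple plot confirms visualization works
try:
    import matplotlib
    matplotlib.use("Agg")               # headless backend for scripts
    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3], [1, 4, 9])
    plt.title("Environment check")
    plt.savefig("env_check.png")
    print("Plot saved to env_check.png")
except Exception as exc:
    print(f"Plot skipped: {exc}")
```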
Exercise 1.3
Hit Type Analysis
Hard
Analyze the distribution of hit types (singles, doubles, triples, home runs) for two power hitters and two contact hitters in 2023.

1. Choose four players representing different offensive profiles
2. Get their 2023 statistics including hit breakdowns
3. Calculate what percentage of their hits were singles, doubles, triples, and home runs
4. Create a visualization comparing these distributions
5. Write a brief interpretation: How do power hitters' hit distributions differ from contact hitters'?
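One way to structure step 3 in Python, using illustrative (not real) 2023 hit totals that you would replace with values pulled via baseballr or pybaseball:

```python
import pandas as pd

# Illustrative hit totals (not real 2023 stats); replace with values
# retrieved from your baseball data package
hits = pd.DataFrame({
    "player": ["Power A", "Power B", "Contact A", "Contact B"],
    "singles": [60, 55, 120, 110],
    "doubles": [25, 30, 28, 25],
    "triples": [1, 0, 4, 6],
    "home_runs": [44, 40, 8, 5],
})

hit_cols = ["singles", "doubles", "triples", "home_runs"]
hits["total_hits"] = hits[hit_cols].sum(axis=1)

# Step 3: express each hit type as a share of total hits
for col in hit_cols:
    hits[f"{col}_pct"] = 100 * hits[col] / hits["total_hits"]

print(hits[["player"] + [f"{c}_pct" for c in hit_cols]].round(1))
```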
Exercise 1.4
Monthly Performance Trends
Hard
Investigate whether players perform differently early vs. late in the season.

1. Get plate appearance data for a player across the 2023 season
2. Group plate appearances by month (April, May, June, July, August, September)
3. Calculate monthly batting average, OBP, and slugging percentage
4. Create a line plot showing how these metrics changed throughout the season
5. Discuss: Do you see evidence of the player getting hot or cold? How confident are you given sample sizes?

**Challenge extension:** Calculate monthly sample sizes and add confidence intervals to your plot to visualize uncertainty.
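A sketch of the confidence-interval idea, using made-up monthly splits and a normal-approximation interval (a Wilson interval would be a reasonable upgrade):

```python
import math

# Illustrative monthly splits (not a real player's log)
months = ["Apr", "May", "Jun", "Jul", "Aug", "Sep"]
at_bats = [80, 95, 90, 85, 100, 92]
hit_log = [22, 30, 24, 20, 33, 25]

for m, ab, h in zip(months, at_bats, hit_log):
    ba = h / ab
    # Normal-approximation 95% CI for a proportion; with ~90 AB the
    # half-width is roughly +/- .09, which is why monthly BA is so noisy
    half = 1.96 * math.sqrt(ba * (1 - ba) / ab)
    print(f"{m}: BA {ba:.3f}  95% CI ({ba - half:.3f}, {ba + half:.3f})")
```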

---

You've now completed your introduction to baseball analytics. You understand what baseball analytics is, what questions it addresses, have a working analytical environment, understand baseball's data structure, and have completed a real analysis examining the shift ban's impact. The foundation is laid.

Subsequent chapters build systematically on these foundations. Chapter 2 covers data acquisition—how to get data from various sources. Chapter 3 teaches data manipulation and transformation. Chapters 4-8 introduce core baseball metrics, showing both what they mean conceptually and how to calculate them from raw data. Later chapters cover visualization, modeling, prediction, and specialized topics.

Baseball analytics is ultimately about answering questions that matter. As you work through this book, always return to the questions: What am I trying to understand? What evidence would help answer that question? How confident should I be in my conclusions? The technical skills you'll develop are tools in service of clear thinking about meaningful questions.

Let's continue to Chapter 2, where we'll learn to acquire data from multiple sources, building the datasets that fuel every analysis.
Exercise 2.1
Advanced Filtering and Selection
Hard
Using the baseballr package (R) or pybaseball (Python), retrieve 2024 batting data and answer:

1. How many players hit 30+ home runs with a strikeout rate below 20%?
2. Which players had an OBP of at least .350 and stole 20+ bases?
3. Create a new category "Three True Outcomes Rate" (HR + BB + SO) / PA and identify the top 10 players

**R Solution Sketch:**
```r
library(baseballr)
library(tidyverse)

batters <- fg_batter_leaders(2024, 2024, qual = 300)

# 1. 30+ HR with a strikeout rate below 20%
q1 <- batters %>%
  filter(HR >= 30, `K_percent` < 20) %>%
  nrow()

# 2. OBP of at least .350 with 20+ stolen bases
q2 <- batters %>%
  filter(OBP >= .350, SB >= 20) %>%
  select(Name, OBP, SB)

# 3. Top 10 by Three True Outcomes rate
q3 <- batters %>%
  mutate(TTO_rate = (HR + BB + SO) / PA) %>%
  arrange(desc(TTO_rate)) %>%
  select(Name, HR, BB, SO, PA, TTO_rate) %>%
  head(10)
```

**Python Solution Sketch:**
```python
from pybaseball import batting_stats

batters_py = batting_stats(2024, qual=300)

# 1. 30+ HR with a strikeout rate below 20%
q1_py = batters_py[(batters_py['HR'] >= 30) & (batters_py['K%'] < 20)]
print(f"Count: {len(q1_py)}")

# 2. OBP of at least .350 with 20+ stolen bases
q2_py = batters_py[(batters_py['OBP'] >= .350) & (batters_py['SB'] >= 20)][['Name', 'OBP', 'SB']]

# 3. Top 10 by Three True Outcomes rate
batters_py['TTO_rate'] = (batters_py['HR'] + batters_py['BB'] + batters_py['SO']) / batters_py['PA']
q3_py = batters_py.nlargest(10, 'TTO_rate')[['Name', 'HR', 'BB', 'SO', 'PA', 'TTO_rate']]
```
Exercise 2.3
Time Series Analysis
Hard
Create a 162-game simulated season for a player and:

1. Calculate 15-game rolling averages for BA, OBP, and SLG
2. Identify the longest streak above .300 BA
3. Compare first-half vs. second-half performance
4. Visualize performance trends over the season
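A starting point for the simulation, assuming a hypothetical .270 true-talent hitter. Note the rolling average is computed as a ratio of rolling sums rather than a mean of per-game averages, which would weight a 3-AB game the same as a 5-AB game:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Simulate 162 games for a hypothetical .270 true-talent hitter
games = pd.DataFrame({"game": np.arange(1, 163)})
games["ab"] = rng.integers(3, 6, size=162)        # 3-5 AB per game
games["hits"] = rng.binomial(games["ab"], 0.270)

# Step 1: 15-game rolling batting average (ratio of rolling sums)
roll_h = games["hits"].rolling(15).sum()
roll_ab = games["ab"].rolling(15).sum()
games["rolling_ba"] = roll_h / roll_ab

# Step 3: first-half vs second-half comparison
first = games.iloc[:81]
second = games.iloc[81:]
print(f"1st half BA: {first['hits'].sum() / first['ab'].sum():.3f}")
print(f"2nd half BA: {second['hits'].sum() / second['ab'].sum():.3f}")
```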
Exercise 2.5
Missing Data Challenge
Hard
Using real-world data with missing Statcast metrics:

1. Assess the extent and pattern of missing data
2. Implement three different imputation strategies
3. Compare the impact on correlation between exit velocity and hard-hit rate
4. Justify which imputation method is most appropriate and why

---

This concludes Chapter 2. In the next chapter, we'll explore the rich ecosystem of baseball data sources, learning how to access FanGraphs, Baseball Reference, Statcast, and the Lahman database through R and Python packages.
Exercise 3.3
Team Performance Analysis
Hard
Combine multiple data sources:

1. Get 2024 team standings (use mlb_standings from baseballr or schedule_and_record from pybaseball)
2. Calculate team batting statistics (aggregate from individual player data)
3. Join with team ERA (from pitching data)
4. Create a Pythagorean expectation model: Expected W% = R^2 / (R^2 + RA^2)
5. Compare actual wins to expected wins—which teams over-performed?
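Step 4 reduces to a few lines once runs scored and allowed are joined; the numbers below are placeholders, not actual 2024 totals:

```python
import pandas as pd

# Placeholder runs scored (R), runs allowed (RA), and wins (W)
teams = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C"],
    "R":  [724, 768, 637],
    "RA": [682, 699, 751],
    "W":  [89, 89, 62],
})

# Pythagorean expectation: Expected W% = R^2 / (R^2 + RA^2)
teams["exp_wpct"] = teams["R"]**2 / (teams["R"]**2 + teams["RA"]**2)
teams["exp_W"] = (162 * teams["exp_wpct"]).round(1)
teams["diff"] = teams["W"] - teams["exp_W"]   # positive = over-performed
print(teams[["team", "exp_wpct", "exp_W", "diff"]])
```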
Exercise 3.4
Statcast Deep Dive
Hard
Pick your favorite pitcher and analyze their arsenal:

1. Retrieve all Statcast data for their 2024 season
2. For each pitch type:
- Calculate average velocity, spin rate, and movement
- Calculate whiff rate and zone rate
- Identify which pitch generates the most swings and misses
3. Analyze platoon splits: How do their pitches perform vs. LHB vs. RHB?
4. Create a "pitch quality" ranking based on velocity, spin, and whiff rate

**Deliverable**: A comprehensive report with visualizations showing pitch usage, effectiveness, and recommendations for pitch selection.

---

This concludes Chapter 3. You now have the tools to access virtually any baseball dataset available. In the next chapters, we'll use these data sources to explore specific analytical techniques, from evaluating hitters and pitchers to building predictive models and crafting advanced visualizations.

The combination of R's `baseballr` package and Python's `pybaseball` library, along with the historical richness of the Lahman database, provides everything you need to conduct professional-grade baseball analysis. Master these tools, and you'll be equipped to answer almost any baseball question with data.
Exercise 7.1
Pitcher Comparison Analysis
Hard
Using Statcast data, compare two starting pitchers from different teams:

1. Pull data for both pitchers for the 2024 season
2. Calculate velocity, spin rate, and movement profiles for each pitch type
3. Compare their arsenals: usage rates, average velocity, and whiff rates
4. Create a pitch movement chart for each pitcher
5. Write a brief scouting report comparing their arsenals

**Suggested pitchers**: Spencer Strider (ATL) and Shota Imanaga (CHC)
Exercise 7.2
Command and Location Analysis
Hard
Analyze pitch location patterns for a pitcher of your choice:

1. Calculate zone%, edge%, heart%, and chase rate by pitch type
2. Analyze CSW% overall and by pitch type
3. Create a heatmap showing pitch locations for their primary pitch (four-seam fastball)
4. Compare location patterns by count (ahead vs. behind in the count)
5. Assess: Is this pitcher's success driven more by stuff or command?
Exercise 7.3
Arsenal Effectiveness Study
Hard
Investigate which pitch in a pitcher's arsenal is most/least effective:

1. For each pitch type, calculate:
- Whiff rate
- xwOBA against
- Hard hit rate against
- CSW%
- Usage rate
2. Identify the best and worst pitches in the arsenal
3. Analyze if usage rate aligns with effectiveness (do they throw their best pitches most?)
4. Calculate pitch values: Run Value per 100 pitches for each pitch type
5. Make a recommendation: Should they adjust their pitch usage?

**Challenge Extension**: Compare the pitcher's arsenal effectiveness against left-handed vs. right-handed batters. Do they have platoon splits? Which pitches drive those splits?

---

You've now completed your deep dive into Statcast pitching analytics. You understand how modern tracking systems measure every pitch, what those measurements reveal about pitcher performance, and how to analyze arsenals, command, and expected outcomes. These skills form the foundation for evaluating pitchers in the modern game, whether you're building projection models, designing development plans, or making strategic decisions.

The next chapter will explore park factors and environmental effects - understanding how context affects the statistics we've been analyzing throughout this book.
Exercise 8.1
Calculating Simple OAA
Hard
Using Statcast data for a single month, calculate a simplified version of OAA for outfielders:

1. Get batted ball data for one month (July 2024 recommended)
2. Filter for outfield fly balls and line drives
3. Calculate catch probability based on distance and hang time (use simplified model from section 8.2.2)
4. Determine actual outcomes (caught or not)
5. Calculate OAA for each outfielder as sum of (actual - expected)
6. Identify the top 5 and bottom 5 defenders

**Bonus**: Compare your simplified OAA to Baseball Savant's official OAA for the same players and time period. How close are they?
Exercise 8.2
Shift Impact Analysis
Hard
Replicate the shift ban analysis from section 8.7.2 using real data:

1. Get Statcast data for ground balls hit by left-handed batters for:
- May 2022 (shifts allowed)
- May 2023 (shifts banned)
2. Calculate ground ball hit rates for each month
3. Perform a statistical test for the difference
4. Create a visualization comparing the two periods
5. Calculate how many extra hits occurred in 2023 vs expected based on 2022 rates

**Challenge**: Identify which individual players benefited most from the shift ban by comparing their 2022 vs 2023 ground ball BABIP.
Exercise 8.3
Sprint Speed and Stolen Base Efficiency
Hard
Analyze the relationship between sprint speed and stolen base success:

1. Get sprint speed data for all qualified players (2024)
2. Get stolen base attempts and success rates
3. Calculate stolen base success rate for players with 10+ attempts
4. Create a scatter plot of sprint speed vs SB success rate
5. Fit a regression model and interpret the relationship
6. Identify players who over/under-perform their expected SB rate based on speed

**Question**: What sprint speed corresponds to 75% stolen base success (break-even point)?
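A simulated version of steps 4-6. The assumed relationship (success rising about 4 points per ft/s, crossing 75% at 28 ft/s) is invented for illustration, so trust the workflow here, not the numbers:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated league: sprint speed (ft/s) for 120 players with 10+ attempts.
# The assumed speed/success relationship is invented for illustration.
speed = rng.uniform(25.0, 30.5, size=120)
true_p = np.clip(0.75 + 0.04 * (speed - 28.0), 0.05, 0.95)
attempts = rng.integers(10, 40, size=120)
successes = rng.binomial(attempts, true_p)
sb_rate = successes / attempts

# Step 5: linear fit of SB success rate on sprint speed
slope, intercept = np.polyfit(speed, sb_rate, 1)

# Break-even question: the speed at which the fitted line crosses 75%
breakeven_speed = (0.75 - intercept) / slope
print(f"Fitted slope: {slope:.3f} per ft/s")
print(f"75% break-even near {breakeven_speed:.1f} ft/s")
```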
Exercise 8.4
Defensive Value Comparison
Hard
Compare defensive metrics across different systems:

1. Select 20 players across multiple positions (2024 season)
2. Collect their OAA, UZR, and DRS values
3. Standardize all metrics to same scale (z-scores)
4. Calculate correlation between metrics
5. Identify players where metrics disagree significantly
6. Create a visualization showing agreement/disagreement

**Question**: For which positions do the metrics agree most? Where do they diverge most? Why might this be?
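Steps 3-5 can be sketched with simulated metrics, where three noisy measurements of one underlying skill stand in for OAA, UZR, and DRS pulled from real sources:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# 20 players: three noisy reads of one underlying defensive skill
skill = rng.normal(0, 5, size=20)
metrics = pd.DataFrame({
    "OAA": skill + rng.normal(0, 2, 20),
    "UZR": skill + rng.normal(0, 3, 20),
    "DRS": skill + rng.normal(0, 3, 20),
})

# Step 3: z-scores put all three systems on the same scale
z = (metrics - metrics.mean()) / metrics.std()

# Step 4: pairwise agreement between systems
print(z.corr().round(2))

# Step 5: players where the systems disagree most
z["spread"] = z.max(axis=1) - z.min(axis=1)
print(z["spread"].nlargest(3).round(2))
```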

---

You've now completed your introduction to fielding and baserunning analytics. You understand why defense is challenging to measure, how modern metrics like OAA work, the value of positioning and shifts, and how to evaluate baserunning contribution. Defense and baserunning combined can account for 2-3 WAR per season for elite players—real, measurable value that traditional statistics completely missed.

The Statcast revolution has transformed defensive evaluation from subjective ("he looks good") to objective ("he made 73% of plays with 65% average probability"). We can now properly credit players like Kevin Kiermaier, Yadier Molina, and Matt Chapman for defensive excellence that was previously unrecognized in conventional statistics.

In Chapter 9, we'll turn to win probability and leveraged situations, understanding how context affects player and managerial decisions. The technical skills you've developed throughout this book will combine to help you evaluate complete players—offense, defense, baserunning, and situational performance—using the full arsenal of modern analytics.
Exercise 11.2
DFS Optimization
Hard
Create a simplified DFS optimizer for a slate of 20 players:

1. Generate random projections (points) and salaries for 20 players across positions
2. Implement a greedy algorithm that selects players by value-per-dollar
3. Compare the greedy solution to a random selection
4. Calculate what percentage of random lineups beat the greedy lineup
5. Discuss: Why might the optimal lineup differ from highest value-per-dollar?

**Extension**: Add stack constraints (at least 3 batters from the same team must be selected).
Exercise 11.3
Implied Probability and Vig Analysis
Hard
Given these betting lines for five games:

```
Game 1: Team A -180, Team B +160
Game 2: Team C -110, Team D -110
Game 3: Team E +140, Team F -160
Game 4: Team G -125, Team H +105
Game 5: Team I -200, Team J +175
```

1. Calculate implied probability for each team
2. Calculate the vig for each game
3. Identify which game has the highest and lowest vig
4. If you believe Team B has a 42% chance to win, calculate the EV of betting $100 on them
5. Visualize implied probabilities vs. your model probabilities (create hypothetical model probabilities)
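The odds arithmetic for steps 1, 2, and 4 can be checked with a small helper: `implied_prob` converts American odds, and the vig is the overround of the two implied probabilities in each game.

```python
def implied_prob(american_odds):
    """Convert American odds to the bookmaker's implied win probability."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

games = [("Team A", -180, "Team B", 160), ("Team C", -110, "Team D", -110),
         ("Team E", 140, "Team F", -160), ("Team G", -125, "Team H", 105),
         ("Team I", -200, "Team J", 175)]

for t1, o1, t2, o2 in games:
    p1, p2 = implied_prob(o1), implied_prob(o2)
    vig = p1 + p2 - 1       # overround: the book's margin on this game
    print(f"{t1} {p1:.3f} vs {t2} {p2:.3f}  vig {vig:.2%}")

# Step 4: EV of $100 on Team B (+160) if you estimate a 42% win chance
ev = 0.42 * 160 - 0.58 * 100
print(f"EV of $100 on Team B: ${ev:.2f}")
```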
Exercise 11.4
Bankroll Simulation
Hard
Simulate a betting season with the following parameters:

1. Starting bankroll: $1,000
2. 100 bets over the season
3. Each bet has 53% win probability (representing edge over 50%)
4. Odds: -110 for all bets
5. Three bet sizing strategies: flat $50 per bet, 5% Kelly, 2% Kelly

For each strategy:
- Simulate 1,000 seasons (1,000 × 100 bets)
- Calculate median ending bankroll
- Calculate probability of ruin (ending bankroll < $100)
- Calculate 90th percentile outcome
- Visualize distribution of ending bankrolls

Discuss which strategy you'd recommend and why.
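A compact simulation harness to build on, with one simplification: "5% Kelly" and "2% Kelly" are treated here as betting that fixed percentage of the current bankroll each time.

```python
import numpy as np

rng = np.random.default_rng(0)

WIN_P, N_BETS, N_SEASONS, START = 0.53, 100, 1000, 1000.0
PAYOUT = 100 / 110                  # profit per $1 staked at -110 odds

def simulate(stake_fn):
    """Simulate N_SEASONS seasons; stake_fn maps bankroll -> bet size."""
    endings = np.empty(N_SEASONS)
    for s in range(N_SEASONS):
        bank = START
        for won in rng.random(N_BETS) < WIN_P:
            stake = min(stake_fn(bank), bank)   # never bet more than you have
            bank += stake * PAYOUT if won else -stake
        endings[s] = bank
    return endings

flat   = simulate(lambda b: 50.0)       # flat $50 per bet
kelly5 = simulate(lambda b: 0.05 * b)   # 5% of current bankroll
kelly2 = simulate(lambda b: 0.02 * b)   # 2% of current bankroll

for name, e in [("flat $50", flat), ("5% bankroll", kelly5), ("2% bankroll", kelly2)]:
    print(f"{name:12s} median ${np.median(e):8,.0f}   "
          f"P(ruin) {np.mean(e < 100):.1%}   "
          f"90th pct ${np.percentile(e, 90):8,.0f}")
```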

---

Fantasy baseball and sports betting analytics demonstrate how rigorous quantitative methods inform decisions under uncertainty. Whether valuing players across multiple performance dimensions, optimizing roster construction against salary constraints, or calculating expected value for betting opportunities, the principles of probability, statistics, and optimization provide frameworks for systematic decision-making.

These applications also illustrate analytics' limitations. Fantasy success depends on projection accuracy—but player performance includes irreducible randomness. Sports betting models require edge over sophisticated market prices—but even small edges require disciplined bankroll management to survive variance. No amount of analytical sophistication eliminates uncertainty or guarantees profits.

The skills developed in this chapter—converting probabilities to decisions, optimizing under constraints, managing risk—transfer to countless domains beyond sports. Data scientists in any field face similar challenges: building predictive models, quantifying uncertainty, and making optimal choices given limited information. Baseball provides a rich, accessible environment for developing these capabilities.

Chapter 12 will explore advanced topics in baseball analytics, including machine learning applications, deep learning for pitch classification, and cutting-edge research areas shaping the field's future.
Exercise 14.3
Draft Pick Value Curve
Hard
Using Baseball Reference or FanGraphs, collect data on draft picks from a single year (suggest 2014-2016 to allow development time):

**For picks 1-30:**

a) Calculate what percentage reached MLB (100+ PA or 50+ IP)

b) For those who reached MLB, calculate total career WAR through 2023

c) Build a value curve showing expected WAR by draft position

d) Identify which picks outperformed or underperformed expectations

**Extension:** Compare college vs high school players. Do high school picks have higher variance? Higher ceiling?
Exercise 14.4
Competitive Window Modeling
Hard
Choose a current MLB team and project their competitive window:

a) Identify core players and project their WAR trajectory over next 5 years using aging curves

b) Estimate prospect contribution (consult top prospect lists)

c) Calculate total projected WAR and expected wins for each season

d) Determine the team's optimal strategy: compete now, rebuild, or middle ground

e) Recommend specific roster moves (trades, free agent signings, or sell-offs) that align with your recommended strategy

**Teams with interesting situations:**
- Baltimore Orioles (young core, rising)
- St. Louis Cardinals (aging core, crossroads)
- Los Angeles Angels (Trout aging, weak farm)
- Tampa Bay Rays (perennial contender, low payroll)

---

**Chapter Summary**

Team building combines economics, player valuation, strategic timing, and organizational philosophy. Key takeaways:

1. **Economic Efficiency**: Pre-arbitration players provide 40x ROI vs free agents; exploit this arbitrage
2. **Positional Value**: Premium defensive positions (C, SS, CF) allow lower offensive standards
3. **Free Agent Markets**: Account for aging curves, apply discount rates, avoid winner's curse
4. **Trade Strategy**: Exchange surplus value across different timelines; align with competitive windows
5. **Draft Philosophy**: Balance upside (high school) vs safety (college) based on organizational timeline
6. **Strategic Clarity**: Commit fully to competing or rebuilding; avoid mediocre middle ground

Successful team building requires analytical rigor, clear strategic vision, and disciplined execution. The best front offices combine quantitative analysis with qualitative evaluation, organizational development, and adaptive strategy. As analytics evolve, teams that integrate new methods while maintaining coherent long-term plans will sustain competitive advantage.
Exercise 15.4
Call-Up Decision Analysis
Hard
**Task**: You are the GM. Determine whether to call up your top prospect on May 1st or wait until mid-June for service time reasons.

**Prospect Profile**:
```
Position: 3B, Age 22
AAA Stats (145 PA): .298/.375/.512, 6 HR, 12.4% BB%, 21.2% K%
AA Stats (425 PA): .285/.360/.485, 18 HR, 10.1% BB%, 24.5% K%
Defensive Grade: 55 (above average)
Future WAR Projection: 4.0 WAR annually (ages 25-29)
```

**Team Context**:
```
Current 3B Production: 85 wRC+ (below average)
Team Record: 15-18 (below .500)
Payroll Situation: Middle of pack
Days until full year service time: 12 days (mid-April)
Estimated Super Two cutoff: Already passed
```

**Questions**:
1. Calculate the financial value of delaying the call-up until mid-June
2. What is the estimated WAR cost of keeping an 85 wRC+ player at 3B for 6 more weeks?
3. Make your recommendation: Call up now or wait? Justify with analysis.
4. What performance threshold would make you change your decision?

---

**Exercise Solutions**: Solutions to these exercises involve combining multiple analytical techniques from the chapter. Students should use the code frameworks provided to build their own analysis pipelines, applying appropriate age adjustments, translation factors, and decision models. The exercises emphasize practical decision-making under uncertainty, mirroring real-world front office challenges.
Exercise 17.1
Bayesian Rookie Evaluation
Hard
A rookie pitcher has thrown 25 innings with a 2.52 ERA. Using Bayesian methods:

1. Calculate a Bayesian estimate of his true ERA using a prior of N(4.00, 0.80^2)
2. Construct a 95% credible interval
3. What's the probability his true ERA is below 3.50?
4. How many innings would he need to pitch at 2.52 ERA before we're 90% confident his true ERA is below 3.50?

Implement your solution in R or Python and visualize the posterior distribution.
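A conjugate normal-normal sketch for parts 1-3. The observation-noise model (sd of a sample ERA of roughly 5.0/√IP) is an assumption you should calibrate against real data, and the interval in part 2 is the resulting credible interval:

```python
from math import sqrt, erf

def normal_cdf(x, mu, sd):
    """Normal CDF evaluated at x for mean mu and sd."""
    return 0.5 * (1 + erf((x - mu) / (sd * sqrt(2))))

prior_mu, prior_sd = 4.00, 0.80     # prior: N(4.00, 0.80^2)
obs_era, ip = 2.52, 25

# Assumed noise model: sd of a sample ERA ~ 5.0 / sqrt(IP)
obs_sd = 5.0 / sqrt(ip)

# Conjugate normal-normal update (precision-weighted average)
post_prec = 1 / prior_sd**2 + 1 / obs_sd**2
post_mu = (prior_mu / prior_sd**2 + obs_era / obs_sd**2) / post_prec
post_sd = sqrt(1 / post_prec)

lo, hi = post_mu - 1.96 * post_sd, post_mu + 1.96 * post_sd
print(f"Posterior mean ERA: {post_mu:.2f}")
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
print(f"P(true ERA < 3.50) = {normal_cdf(3.50, post_mu, post_sd):.2f}")
```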
Exercise 17.2
Hierarchical Team-Player Model
Hard
Using data for a season (real or simulated):

1. Fit a hierarchical model where player OPS varies by player and team
2. Extract team-level random effects
3. Compare team effects to actual team winning percentages
4. Identify players whose performance is most "above" or "below" what the hierarchical model predicts based on their team
5. Discuss: What does it mean when a player substantially outperforms his hierarchical prediction?
Exercise 17.3
Detecting Velocity Changes
Hard
Create or obtain pitch-by-pitch velocity data for a starting pitcher over a season:

1. Plot velocity over time (by game or pitch number)
2. Fit a smooth trend to identify any systematic velocity decline
3. Use changepoint detection to identify if there was an abrupt velocity drop (potentially indicating injury)
4. Calculate the probability that any detected changepoint is real vs noise
5. If you detect a changepoint, investigate whether the pitcher's effectiveness (measured by wOBA allowed or K-BB%) changed simultaneously
Exercise 17.4
Causal Effect of Designated Hitter
Hard
Using propensity score matching:

1. Compare the career length of pitchers in AL vs NL (before universal DH in 2022)
2. Account for confounding variables: pitcher quality (career ERA+), usage pattern (starter vs reliever), age at debut
3. Estimate the causal effect of playing in the AL (with DH) vs NL (pitchers bat) on pitcher career length
4. Discuss: Why might the DH extend (or shorten) pitcher careers? What are alternative explanations for any observed difference?

---

You've now explored six advanced statistical techniques with particular application to baseball analytics. Bayesian methods help us update beliefs rationally in the face of limited data. Hierarchical models share information across groups to improve estimates. Time series methods model temporal trends and detect change points. Survival analysis quantifies time-to-event processes. Causal inference attempts to establish causation from observational data. Monte Carlo simulation quantifies uncertainty and probabilities in complex systems.

These methods represent the frontier of baseball analytics. While traditional statistics remain valuable for description and basic inference, these advanced techniques enable analysts to answer more sophisticated questions: not just "what happened?" but "what's the player's true talent?", "what caused this change?", "what will happen?", and "how certain should we be?"

As you apply these methods to real baseball questions, remember that sophisticated techniques don't guarantee correct answers. Always question your assumptions: Are your priors reasonable? Are your groups truly comparable? Is your model appropriate for your data structure? The best analysts combine advanced statistical methods with deep baseball knowledge and healthy skepticism about their own conclusions.

In the next chapter, we'll explore machine learning methods that build on these statistical foundations to create predictive models for player performance, game outcomes, and strategic decisions.
Exercise 21.1
Umpire Accuracy Analysis
Hard
Using pitch-level data from the 2024 season:

a) Calculate the overall accuracy rate for each umpire (minimum 1,000 called pitches)
b) Identify the five most accurate and five least accurate umpires
c) Create a visualization comparing each umpire's accuracy on pitches inside vs. outside the strike zone
d) Test whether there is a statistically significant difference in accuracy between the most and least accurate umpires

**Hint:** Use a two-sample t-test or permutation test to assess statistical significance. Consider whether accuracy rates are normally distributed.
Exercise 21.2
Strike Zone Visualization
Hard
For a specific umpire of your choice:

a) Create a heat map showing the probability of a called strike at different locations
b) Overlay the rulebook strike zone on your visualization
c) Identify regions where the umpire's zone significantly differs from the rulebook (>20 percentage points)
d) Create a similar visualization for the league average and place them side-by-side for comparison

**Hint:** Use 2D binning or kernel density estimation to create smooth probability surfaces. The `stat_summary_2d()` function in ggplot2 or `scipy.stats.binned_statistic_2d()` in Python are helpful.
Exercise 21.3
Predicting Called Strikes
Hard
Build and compare predictive models for called strikes:

a) Train a logistic regression model using pitch location, count, batter handedness, and pitcher handedness as features
b) Train a random forest model with the same features
c) Add umpire identity as a feature to both models (use one-hot encoding)
d) Compare the models using AUC, accuracy, and calibration plots
e) Identify which features are most important in each model
f) Use the best model to identify the 10 most surprising calls from the 2024 season (largest difference between predicted probability and actual call)

**Hint:** Feature importance can be extracted from logistic regression coefficients and random forest's `feature_importances_` attribute. For surprising calls, look for high-probability strikes called balls and vice versa.
Exercise 21.4
ABS Impact Simulation
Hard
Simulate the impact of implementing full ABS:

a) For each pitch in your dataset, determine whether the human umpire's call matches what ABS would call
b) Calculate the overall agreement rate and identify systematic biases (e.g., do human umpires call more strikes or fewer strikes than ABS?)
c) Estimate how strikeout rates and walk rates would change under full ABS (focus on pitches with 2 strikes and 3 balls respectively)
d) Calculate the expected number of calls that would be overturned per game
e) Analyze whether certain types of pitchers (high strikeout, high walk, etc.) would be helped or hurt more by ABS

**Hint:** You'll need to define the ABS zone precisely using the sz_top and sz_bot variables. Consider grouping pitchers by strikeout and walk rates to assess differential impacts.
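A minimal sketch of the ABS zone check from the hint. The 0.83 ft half-width (half the 17-inch plate plus roughly a ball radius) is a common analytical convention, not an official ABS parameter:

```python
# plate_x / plate_z are in feet; sz_top / sz_bot are per-batter zone
# bounds. HALF_WIDTH counts a pitch as a strike if any part of the ball
# crosses the 17-inch plate (8.5 in + ~1.45 in ball radius ~= 0.83 ft).
HALF_WIDTH = 0.83

def abs_call(plate_x, plate_z, sz_bot, sz_top):
    """Return 'strike' if the pitch is inside the ABS zone, else 'ball'."""
    in_horiz = abs(plate_x) <= HALF_WIDTH
    in_vert = sz_bot <= plate_z <= sz_top
    return "strike" if in_horiz and in_vert else "ball"

# Compare to the human call pitch by pitch, then aggregate (toy data):
pitches = [
    {"plate_x": 0.10, "plate_z": 2.5, "sz_bot": 1.6, "sz_top": 3.4, "call": "strike"},
    {"plate_x": 1.10, "plate_z": 2.5, "sz_bot": 1.6, "sz_top": 3.4, "call": "strike"},
]
agree = sum(abs_call(p["plate_x"], p["plate_z"], p["sz_bot"], p["sz_top"]) == p["call"]
            for p in pitches)
print(f"Agreement: {agree}/{len(pitches)}")
```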

---

This chapter has covered the fundamentals of umpire analysis and strike zone modeling, from defining accuracy metrics to building predictive models and evaluating the potential impact of automated systems. As MLB continues to consider the role of technology in officiating, these analytical tools will remain essential for understanding how umpires influence the game and how changes to ball-strike calling might affect gameplay and strategy. The combination of granular pitch-tracking data and sophisticated statistical modeling allows us to evaluate umpire performance with unprecedented precision while also informing important decisions about the future of the sport.
Exercise 23.1
NPB Translation Model
Hard
Using the provided NPB-to-MLB translation data, build a regression model to project the first-year MLB performance of a hypothetical NPB player:

**Player Profile:**
- Age: 25
- Final NPB season: .305/.380/.520, 28 HR in 550 PA
- Position: Corner OF
- Exit velocity: 106 mph (NPB measurement)

**Tasks:**
1. Apply the translation factors from Section 23.2
2. Calculate projected MLB slash line and HR total
3. Estimate first-year WAR using the projection
4. Assess confidence level and identify key uncertainties

**Bonus:** Compare your projection to actual performance of similar NPB players (e.g., Seiya Suzuki, Masataka Yoshida).
Exercise 23.2
KBO Pitcher Projection
Hard
A 27-year-old KBO left-handed starter has the following final season:

- 2.65 ERA, 1.15 WHIP
- 9.8 K/9, 2.8 BB/9, 0.75 HR/9
- 175 IP, 15-6 record

**Tasks:**
1. Using the KBO pitcher translation model from Section 23.3, project his MLB stats
2. Calculate projected FIP and ERA
3. Build a confidence interval for your ERA projection
4. Compare to similar KBO pitchers (e.g., Hyun-Jin Ryu, Kwang-Hyun Kim)
5. Recommend a contract structure based on projection and risk
Exercise 23.3
Latin American Tool-Based Valuation
Hard
You are evaluating three Dominican Republic prospects for international signing:

**Prospect A:**
- Age: 17
- Hit: 60, Power: 70, Speed: 55, Field: 55, Arm: 60
- Exit velocity: 108 mph
- Asking bonus: $3.5M

**Prospect B:**
- Age: 16
- Hit: 65, Power: 60, Speed: 70, Field: 65, Arm: 60
- Exit velocity: 104 mph
- Asking bonus: $4.0M

**Prospect C:**
- Age: 18
- Hit: 55, Power: 75, Speed: 45, Field: 50, Arm: 55
- Exit velocity: 112 mph
- Asking bonus: $2.5M

**Tasks:**
1. Use the Latin American projection model from Section 23.4
2. Project WAR through age 23 for each prospect
3. Calculate value per dollar of bonus
4. Rank the prospects considering both ceiling and floor outcomes
5. Recommend which prospect(s) to sign and at what price

**Advanced:** Simulate 1,000 career paths for each prospect incorporating uncertainty and injury risk.
Exercise 23.4
International League Environment Analysis
Hard
Using the league comparison data from Section 23.1, conduct a comprehensive analysis:

**Tasks:**
1. Calculate park-adjusted metrics for each league
2. Estimate "true talent" translation factors using regression to the mean
3. Build a Bayesian updating system that improves projections as players accumulate MLB PA
4. Create visualizations comparing league offensive environments over time (2015-2023)
5. Develop recommendations for adjusting scouting priorities based on league trends

**Data Required:**
- League-wide statistics (provided in section)
- Park factors (research or estimate)
- Historical translation success rates

**Deliverables:**
- R or Python code implementing your analysis
- Report summarizing findings
- Recommendations for international scouting departments

---
Exercise 24.1
Pitcher Arsenal Analysis and Optimization
Hard
**Objective**: Analyze a pitcher's repertoire using Statcast data and provide recommendations for pitch usage optimization.

**Skills Demonstrated**: Data acquisition, exploratory analysis, visualization, strategic thinking

**Project Steps**:

1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
2. Analyze pitch characteristics (velocity, movement, spin)
3. Evaluate pitch effectiveness by count and situation
4. Identify optimization opportunities
5. Create compelling visualizations
6. Write executive summary with recommendations

**R Implementation**:

```r
# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns

library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)

# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
# In practice, use scrape_statcast_savant_pitcher()
# For this example, we'll simulate data

set.seed(123)
n_pitches <- 2500

tibble(
pitch_type = sample(
c("FF", "SI", "SL", "CH", "CU"),
n_pitches,
replace = TRUE,
prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
),
release_speed = case_when(
pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
),
pfx_x = case_when(
pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
),
pfx_z = case_when(
pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
),
release_spin_rate = case_when(
pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
),
balls = sample(0:3, n_pitches, replace = TRUE),
strikes = sample(0:2, n_pitches, replace = TRUE),
stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
description = sample(
c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
n_pitches,
replace = TRUE,
prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
),
launch_speed = ifelse(description == "hit_into_play",
rnorm(n_pitches, 87, 10), NA),
launch_angle = ifelse(description == "hit_into_play",
rnorm(n_pitches, 12, 20), NA),
estimated_woba_using_speedangle = ifelse(
description == "hit_into_play",
pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
NA
)
)
}

# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")

# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
n = n(),
pct = n() / nrow(pitcher_data),
avg_velo = mean(release_speed, na.rm = TRUE),
avg_spin = mean(release_spin_rate, na.rm = TRUE)
) %>%
arrange(desc(n))

print("Pitch Mix:")
print(pitch_mix)

# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
usage = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
na.rm = TRUE),
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
# Proxy only: a true chase rate needs pitch-location (zone) data
chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
) %>%
arrange(desc(csw_rate))

cat("\nPitch Effectiveness:\n")
print(pitch_effectiveness)

# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
mutate(count = paste0(balls, "-", strikes)) %>%
group_by(count, pitch_type) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
.groups = "drop"
) %>%
group_by(count) %>%
mutate(usage_pct = n / sum(n)) %>%
arrange(count, desc(usage_pct))

# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
group_by(pitch_type, stand) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = stand,
values_from = c(n, whiff_rate, avg_xwoba),
names_sep = "_"
)

cat("\nPlatoon Splits:\n")
print(platoon_splits)

# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
"FF" = "#d22d49", "SI" = "#FE9D00",
"SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)

movement_plot <- ggplot(pitcher_data,
aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
geom_point(alpha = 0.3, size = 2) +
stat_ellipse(level = 0.75, linewidth = 1.2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Pitch Movement Profile",
subtitle = "Catcher's perspective (RHP)",
x = "Horizontal Break (inches)",
y = "Induced Vertical Break (inches)",
color = "Pitch Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
) +
coord_fixed()

# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
geom_point(alpha = 0.4, size = 2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Velocity vs. Spin Rate",
x = "Release Speed (mph)",
y = "Spin Rate (rpm)",
color = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")

# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
geom_col(position = "stack") +
scale_fill_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Pitch Usage by Count",
x = "Count",
y = "Usage %",
fill = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")

# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
plot_annotation(
title = "Comprehensive Pitcher Arsenal Analysis",
subtitle = "Example Pitcher - 2024 Season",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)

print(combined_plot)

# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")

# Best pitch
best_pitch <- effectiveness %>%
filter(usage >= 50) %>%
slice_max(csw_rate, n = 1)

cat("1. PRIMARY WEAPON\n")
cat(sprintf(" - %s showing elite CSW rate of %.1f%%\n",
best_pitch$pitch_type, best_pitch$csw_rate * 100))
cat(" - Maintain high usage in favorable counts\n\n")

# Underused effective pitch
underused <- effectiveness %>%
filter(usage < quantile(effectiveness$usage, 0.33)) %>%
filter(csw_rate > 0.30)

if(nrow(underused) > 0) {
cat("2. USAGE OPTIMIZATION\n")
for(i in 1:nrow(underused)) {
cat(sprintf(" - Consider increasing %s usage (current: %d pitches)\n",
underused$pitch_type[i], underused$usage[i]))
cat(sprintf(" Shows strong CSW rate: %.1f%%\n",
underused$csw_rate[i] * 100))
}
cat("\n")
}

# Weak pitch
weak_pitch <- effectiveness %>%
filter(usage >= 50) %>%
slice_min(csw_rate, n = 1)

cat("3. PITCH DEVELOPMENT FOCUS\n")
cat(sprintf(" - %s showing below-average performance\n",
weak_pitch$pitch_type))
cat(sprintf(" - CSW rate: %.1f%% vs. league average ~28%%\n",
weak_pitch$csw_rate * 100))
cat(" - Consider: velocity increase, movement adjustment, or reduced usage\n\n")

cat("4. STRATEGIC ADJUSTMENTS\n")
cat(" - Review count-specific usage patterns\n")
cat(" - Analyze platoon splits for pitch selection\n")
cat(" - Consider sequencing effects (not shown in basic analysis)\n")
cat(" - Monitor fatigue impact on pitch quality\n")
}

generate_recommendations(pitcher_data, pitch_effectiveness)

# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")
```

**Portfolio Presentation Tips**:
- Include interactive visualizations (consider using plotly)
- Compare pitcher to league averages
- Add context about pitcher role and team strategy
- Discuss limitations (sample size, park factors, etc.)
- Provide actionable recommendations
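For the league-average comparison, one approach is to join a baseline table onto the per-pitch summary. A small pandas sketch; the league CSW% figures and the pitcher's numbers here are placeholders (real baselines would come from Baseball Savant leaderboards):

```python
import pandas as pd

# Illustrative league-average CSW% by pitch type (placeholder values)
league_csw = pd.DataFrame({
    "pitch_type": ["FF", "SI", "SL", "CH", "CU"],
    "lg_csw": [0.27, 0.25, 0.31, 0.28, 0.30],
})

# Per-pitch CSW% for the pitcher being profiled (hypothetical numbers)
pitcher_csw = pd.DataFrame({
    "pitch_type": ["FF", "SL", "CH"],
    "csw": [0.30, 0.36, 0.24],
})

compared = pitcher_csw.merge(league_csw, on="pitch_type", how="left")
compared["csw_vs_lg"] = compared["csw"] - compared["lg_csw"]
print(compared)
```

Framing each pitch as "CSW% above or below league average for that pitch type" gives the executive summary much more context than raw rates.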
Exercise 24.2
Player Aging Curves and Performance Projection
Hard
**Objective**: Build aging curves for different player skills and create a performance projection system.

**Skills Demonstrated**: Statistical modeling, time series analysis, predictive analytics, data visualization

**Python Implementation**:

```python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
"""
Generate simulated player career data.
In practice, this would come from Baseball Reference or FanGraphs.
"""
np.random.seed(42)

players = []
for player_id in range(n_players):
# Random career start age (20-25)
start_age = np.random.randint(20, 26)
# Random career length (2-15 years)
career_length = np.random.randint(2, 16)

# Peak age varies (26-30)
peak_age = np.random.randint(26, 31)
# Peak performance level
peak_wrc_plus = np.random.normal(110, 20)

for year_in_career in range(career_length):
age = start_age + year_in_career
season = years_range[0] + np.random.randint(0,
years_range[1] - years_range[0])

# Age-based performance (simplified aging curve)
age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
base_wrc = peak_wrc_plus * age_factor

# Add random variation
wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

# Other stats correlated with wRC+
pa = np.random.randint(300, 650)
avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
obp = avg + 0.060 + np.random.normal(0, 0.020)
slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

players.append({
'player_id': player_id,
'age': age,
'season': season,
'PA': pa,
'AVG': np.clip(avg, 0.150, 0.400),
'OBP': np.clip(obp, 0.250, 0.500),
'SLG': np.clip(slg, 0.300, 0.700),
'wRC_plus': wrc_plus,
'ISO': np.clip(slg - avg, 0.050, 0.350)
})

return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
"""
Calculate aging curve using year-to-year delta method.
This controls for selection bias better than simple averaging.
"""
# Keep qualified seasons, sorted within player by age; copy to avoid
# pandas SettingWithCopy warnings when adding columns below
df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age']).copy()

# Calculate year-to-year changes
df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

# Keep only consecutive seasons
df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

# Group by age and calculate average change
aging_curve = df_deltas.groupby('age').agg({
'metric_delta': ['mean', 'std', 'count'],
metric: 'mean'
}).reset_index()

aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
print(f" {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
"""
Fit a polynomial curve to aging data.
"""
# Use weighted regression (weight by sample size)
weights = np.sqrt(aging_data['n'])

# Polynomial features (degree 2)
X = aging_data[age_col].values.reshape(-1, 1)
y = aging_data[delta_col].values

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

model = Ridge(alpha=1.0)
model.fit(X_poly, y, sample_weight=weights)

return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
fitted_models[metric] = fit_aging_curve(aging_curves[metric])
print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
ax = axes[idx]
curve_data = aging_curves[metric]
model, poly = fitted_models[metric]

# Plot raw deltas
ax.scatter(curve_data['age'], curve_data['delta_mean'],
s=curve_data['n']*2, alpha=0.6, label='Observed')

# Plot fitted curve
age_range = np.linspace(curve_data['age'].min(),
curve_data['age'].max(), 100)
X_pred = poly.transform(age_range.reshape(-1, 1))
y_pred = model.predict(X_pred)

ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

ax.set_xlabel('Age', fontsize=11)
ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
"""
Project player performance based on recent history and aging curves.
"""

def __init__(self, aging_models):
self.aging_models = aging_models

def project_player(self, player_history, years_forward=1):
"""
Project player performance forward.

Parameters:
-----------
player_history : DataFrame
Recent seasons for player (last 3 years recommended)
years_forward : int
Number of years to project forward

Returns:
--------
dict : Projected statistics
"""
# Weight recent seasons more heavily: weights 1:2:3 from oldest to
# newest across (up to) the last three seasons
n_recent = min(len(player_history), 3)
weights = np.arange(1, n_recent + 1, dtype=float)
weights = weights / weights.sum()

# Current age and baseline performance
current_age = player_history['age'].iloc[-1]

projections = {}

for metric in self.aging_models.keys():
if metric not in player_history.columns:
continue

# Weighted average of recent performance, newest weighted most
baseline = np.average(player_history[metric].iloc[-n_recent:],
weights=weights)

# Apply aging curve
model, poly = self.aging_models[metric]

# Apply the expected year-over-year change at each intervening age;
# the delta at age a represents the change from age a to a + 1
projected_value = baseline
for year in range(years_forward):
age = current_age + year
X_age = poly.transform([[age]])
age_adjustment = model.predict(X_age)[0]
projected_value += age_adjustment

projections[metric] = projected_value

projections['age'] = current_age + years_forward
projections['projection_years'] = years_forward

return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
season_counts = player_data.groupby('player_id').size()
test_player_id = season_counts[season_counts >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
projection = projector.project_player(test_player_data, years_forward=year)
print(f"+{year:<5} {projection['age']:<5.0f} "
f"{projection.get('wRC_plus', 0):<8.1f} "
f"{projection.get('ISO', 0):<8.3f} "
f"{projection.get('AVG', 0):<8.3f} "
f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
"""
Evaluate projection accuracy on historical data.
"""
results = []

for player_id in data['player_id'].unique():
# Avoid shadowing the module-level player_data frame
history = data[data['player_id'] == player_id].sort_values('age')

# Need at least 4 seasons (3 to project, 1 to validate)
if len(history) < 4:
continue

# Use all but the last season for projection
train_data = history.iloc[:-1]
actual_data = history.iloc[-1]

if len(train_data) < 3:
continue

# Make projection
try:
projection = projector.project_player(train_data, years_forward=1)

for metric in ['wRC_plus', 'ISO', 'AVG']:
if metric in projection:
results.append({
'player_id': player_id,
'metric': metric,
'actual': actual_data[metric],
'projected': projection[metric],
'error': projection[metric] - actual_data[metric]
})
except Exception:
continue

return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
metric_eval = evaluation[evaluation['metric'] == metric]

if len(metric_eval) > 0:
mae = np.abs(metric_eval['error']).mean()
rmse = np.sqrt((metric_eval['error'] ** 2).mean())

# Calculate R²
actual = metric_eval['actual'].values
predicted = metric_eval['projected'].values
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
ax = axes[idx]
metric_eval = evaluation[evaluation['metric'] == metric]

if len(metric_eval) > 0:
ax.scatter(metric_eval['actual'], metric_eval['projected'],
alpha=0.4, s=30)

# Add y=x line
min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
ax.plot([min_val, max_val], [min_val, max_val],
'r--', linewidth=2, label='Perfect Projection')

ax.set_xlabel(f'Actual {metric}', fontsize=11)
ax.set_ylabel(f'Projected {metric}', fontsize=11)
ax.set_title(f'{metric} Projection Accuracy',
fontsize=12, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")
```

**Extension Ideas**:
- Incorporate minor league translation factors
- Add injury risk modeling
- Create playing time projections
- Develop position-specific aging curves
- Compare to established projection systems (Steamer, ZiPS)
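The script's closing recommendations (3-year weighted baseline, regression to the mean, aging adjustment) combine naturally into a Marcel-style projection. A compact sketch; the 3/4/5 weights, two-season regression strength, and flat post-30 penalty are illustrative constants, not the book's values:

```python
import numpy as np

def marcel_like(last3, ages_next, league_avg=100.0, reg_seasons=2.0):
    """Project next-season wRC+ from the last three seasons (oldest first):
    3/4/5 recency weights, regression toward the league average, and a
    flat post-30 aging penalty. All constants are illustrative."""
    baseline = np.average(last3, weights=[3.0, 4.0, 5.0])
    # blend three "seasons" of data with reg_seasons of league average
    regressed = (baseline * 3 + league_avg * reg_seasons) / (3 + reg_seasons)
    age_adj = -2.0 if ages_next > 30 else 0.0
    return regressed + age_adj

print(round(marcel_like([120, 115, 130], ages_next=29), 1))  # → 113.5
```

Comparing this simple baseline against the fitted aging-curve projector above is a useful sanity check: complex systems should beat it, not just match it.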
Exercise 24.3
Draft Value Analysis and Strategy Optimization
Hard
**Objective**: Analyze historical draft performance to quantify pick value and optimize draft strategy.

**Skills Demonstrated**: Data analysis, value modeling, strategic thinking, data visualization

**Key Analysis Components**:

```r
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(survival)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
set.seed(42)

drafts <- expand.grid(
year = 2008:2022,
round = 1:rounds,
pick = 1:30
) %>%
mutate(
overall_pick = (round - 1) * 30 + pick,
# Probability of reaching majors decreases with pick
p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
reached_mlb = rbinom(n(), 1, p_mlb),
# Career WAR conditional on reaching MLB
war_if_mlb = ifelse(
reached_mlb == 1,
pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
0
),
# Years to debut
years_to_debut = ifelse(
reached_mlb == 1,
pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
NA
),
# Position (simplified)
position = sample(
c("P", "C", "IF", "OF"),
n(),
replace = TRUE,
prob = c(0.45, 0.10, 0.25, 0.20)
),
# College vs HS
player_type = sample(
c("College", "HS", "International"),
n(),
replace = TRUE,
prob = c(0.55, 0.35, 0.10)
),
# Slot value (simplified formula)
slot_value = pmax(
200000,
12000000 * exp(-overall_pick / 15)
),
# Signing bonus (usually close to slot)
signing_bonus = slot_value * runif(n(), 0.85, 1.15)
)

return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

print(sprintf("Generated %d draft picks from %d drafts",
nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
group_by(round) %>%
summarize(
n_picks = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
total_war = sum(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
) %>%
filter(round <= 20) # Focus on first 20 rounds

cat("\nMLB Success Rate by Round:\n")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
group_by(overall_pick) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
) %>%
filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
expected_war ~ a * exp(-b * overall_pick),
data = value_curve %>% filter(expected_war > 0),
start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
value_model,
newdata = data.frame(overall_pick = value_curve$overall_pick)
)

cat("\nValue Curve Model:\n")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
geom_line(aes(y = fitted_war), color = "red", linewidth = 1.2) +
geom_vline(xintercept = c(30, 60, 90),
linetype = "dashed", alpha = 0.3) +
annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
label = "Round 1", size = 3.5) +
annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
label = "Round 2", size = 3.5) +
labs(
title = "MLB Draft Pick Value Curve",
subtitle = "Expected career WAR by draft position",
x = "Overall Pick",
y = "Expected Career WAR",
caption = "Exponential decay model fitted to historical data"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11)
)

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(position, round) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
.groups = "drop"
) %>%
group_by(position) %>%
summarize(
total_picks = sum(n),
avg_mlb_rate = mean(mlb_rate),
avg_war = mean(avg_war)
) %>%
arrange(desc(avg_war))

cat("\nPosition-Specific Success Rates:\n")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(player_type) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
)

cat("\nCollege vs High School Performance:\n")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
mutate(
war_per_million = war_if_mlb / (signing_bonus / 1000000),
pick_group = case_when(
overall_pick <= 30 ~ "Top 30",
overall_pick <= 60 ~ "31-60",
overall_pick <= 100 ~ "61-100",
TRUE ~ "100+"
)
) %>%
group_by(pick_group) %>%
summarize(
n = n(),
avg_bonus = mean(signing_bonus),
avg_war = mean(war_if_mlb),
war_per_million = mean(war_per_million)
)

cat("\nReturn on Investment by Pick Range:\n")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
# Simple greedy optimization: maximize expected WAR subject to the
# bonus-pool constraint

# Get expected value for each pick
pick_values <- value_curve %>%
filter(overall_pick %in% available_picks) %>%
left_join(
draft_data %>%
group_by(overall_pick) %>%
summarize(avg_slot = mean(slot_value)),
by = "overall_pick"
)

# Greedy algorithm: pick highest value/cost ratio within budget
selected <- tibble()
remaining_budget <- budget
remaining_picks <- pick_values

while (nrow(remaining_picks) > 0 && remaining_budget > 0) {
# Calculate value per dollar
remaining_picks <- remaining_picks %>%
mutate(value_per_dollar = expected_war / avg_slot)

# Select best value pick we can afford
best_pick <- remaining_picks %>%
filter(avg_slot <= remaining_budget) %>%
slice_max(value_per_dollar, n = 1)

if(nrow(best_pick) == 0) break

selected <- bind_rows(selected, best_pick)
remaining_budget <- remaining_budget - best_pick$avg_slot
remaining_picks <- remaining_picks %>%
filter(overall_pick != best_pick$overall_pick)
}

return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

cat("\n=== DRAFT STRATEGY OPTIMIZATION ===\n\n")
cat(sprintf("Available Picks: %s\n", paste(example_picks, collapse = ", ")))
cat(sprintf("Bonus Pool: $%.1fM\n\n", example_budget / 1000000))
cat("Optimized Selection:\n")
print(optimal_strategy %>%
select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
filter(round <= 10) %>%
ggplot(aes(x = round, y = mlb_rate)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
vjust = -0.5, size = 3) +
scale_y_continuous(labels = percent_format()) +
labs(title = "MLB Success Rate by Round",
x = "Draft Round", y = "% Reaching MLB") +
theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
filter(reached_mlb == 1, overall_pick <= 100) %>%
ggplot(aes(x = war_if_mlb)) +
geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
labs(title = "Career WAR Distribution (MLB Players)",
x = "Career WAR", y = "Count") +
theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
filter(reached_mlb == 1, round <= 5) %>%
ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
geom_boxplot(alpha = 0.7) +
labs(title = "WAR by Position (Rounds 1-5)",
x = "Position", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
labs(title = "College vs HS Performance",
x = "Player Type", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
plot_annotation(
title = "MLB Draft Analysis Dashboard",
subtitle = "Historical performance metrics and value analysis",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf(" - First round produces %.1f%% of total draft WAR\n",
100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf(" - Round 1: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[1]))
cat(sprintf(" - Round 5: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[5]))
cat(sprintf(" - Round 10: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf(" - Average time to debut: %.1f years\n",
mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat(" - Prioritize early picks; value drops exponentially\n")
cat(" - Consider college players for faster development\n")
cat(" - High school players have higher variance in outcomes\n")
cat(" - Pitchers dominate draft but consider positional scarcity\n")
cat(" - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")
```

**Portfolio Enhancement**:
- Add international signing analysis
- Compare team draft performance
- Analyze specific draft classes
- Include financial constraints modeling
- Compare to prospect ranking systems
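Financial-constraints modeling usually starts from a dollars-per-WAR conversion. A minimal surplus-value sketch; the $8M-per-WAR market rate is a commonly cited ballpark figure assumed here, not a value from the chapter:

```python
def surplus_value(projected_war, salary_millions, dollars_per_war=8.0):
    """Surplus = market value of projected WAR minus salary, in $M."""
    return projected_war * dollars_per_war - salary_millions

# A projected 2.5-WAR player on a $1M pre-arbitration salary
print(surplus_value(2.5, 1.0))  # → 19.0
```

The same conversion lets you express a draft slot's expected WAR as expected surplus dollars, making picks and signing bonuses directly comparable.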
Exercise 24.4
Defensive Positioning and Shift Analysis
Hard
**Objective**: Analyze defensive positioning effectiveness using batted ball data.

**Skills Demonstrated**: Spatial analysis, causal inference, strategic analysis, data visualization

**Implementation Framework**:

```python
# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial import ConvexHull
from sklearn.neighbors import KernelDensity
import matplotlib.patches as patches

# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)

# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
"""
Simulate batted ball locations and outcomes.
Coordinates in feet from home plate.
"""
np.random.seed(42)

data = []

for _ in range(n_balls):
# Batter handedness
stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])

# Shift decision (more common vs pull hitters)
is_shifter = np.random.random() < 0.3
shift_on = is_shifter and (np.random.random() < 0.7)

# Hit location (pull tendency varies)
if stand == 'R':
# Righties pull left
if is_shifter:
angle = np.random.normal(-25, 35) # Pull-heavy
else:
angle = np.random.normal(-10, 45) # Balanced
else:
# Lefties pull right
if is_shifter:
angle = np.random.normal(25, 35)
else:
angle = np.random.normal(10, 45)

# Distance based on exit velo and launch angle
exit_velo = np.random.normal(88, 8)
launch_angle = np.random.normal(12, 18)

# Simplified distance calculation
distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
distance = max(50, min(400, distance + np.random.normal(0, 20)))

# Convert to x, y coordinates
angle_rad = np.radians(angle)
x = distance * np.sin(angle_rad)
y = distance * np.cos(angle_rad)

# Hit outcome (shift effectiveness)
if shift_on:
# Shift reduces hits in pull direction
if stand == 'R' and x < -50:
prob_hit = 0.18 # Reduced by shift
elif stand == 'L' and x > 50:
prob_hit = 0.18
else:
prob_hit = 0.28 # Normal rate
else:
prob_hit = 0.25

# Deeper batted balls are harder to field, so hit probability scales up
prob_hit = min(0.95, prob_hit * (distance / 250))

is_hit = np.random.random() < prob_hit

data.append({
'x': x,
'y': y,
'distance': distance,
'angle': angle,
'exit_velo': exit_velo,
'launch_angle': launch_angle,
'stand': stand,
'shift_on': shift_on,
'is_hit': is_hit,
'is_shifter': is_shifter
})

return pd.DataFrame(data)

# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)

print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")

# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
'is_hit': ['mean', 'count'],
'exit_velo': 'mean'
}).round(3)

print("\nShift Effectiveness:")
print(shift_analysis)

# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
    """
    Estimate runs saved by shifting.
    """
    results = []

    for stand in ['R', 'L']:
        for shifter in [True, False]:
            subset = data[(data['stand'] == stand) &
                          (data['is_shifter'] == shifter)]

            if len(subset) == 0:
                continue

            shifted = subset[subset['shift_on'] == True]
            no_shift = subset[subset['shift_on'] == False]

            if len(shifted) > 0 and len(no_shift) > 0:
                babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
                # Approximate run value per hit prevented: ~0.5 runs
                runs_saved_per_pa = babip_diff * 0.5

                results.append({
                    'stand': stand,
                    'is_shifter': shifter,
                    'shifted_babip': shifted['is_hit'].mean(),
                    'no_shift_babip': no_shift['is_hit'].mean(),
                    'babip_diff': babip_diff,
                    'runs_saved_per_100pa': runs_saved_per_pa * 100,
                    'n_shifted': len(shifted),
                    'n_no_shift': len(no_shift)
                })

    return pd.DataFrame(results)

shift_value = calculate_shift_value(bb_data)

print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))

# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
    """
    Plot baseball field with hit locations.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Draw field outline
    # Infield dirt
    infield = patches.Wedge((0, 0), 95, 45, 135,
                            facecolor='tan', alpha=0.3)
    ax.add_patch(infield)

    # Outfield grass
    outfield = patches.Wedge((0, 0), 400, 45, 135,
                             facecolor='green', alpha=0.1)
    ax.add_patch(outfield)

    # Foul lines
    ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
    ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)

    # Plot hits
    hits = data[data['is_hit'] == True]
    outs = data[data['is_hit'] == False]

    ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
               s=20, label='Out')
    ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
               s=30, label='Hit')

    ax.set_xlim(-320, 320)
    ax.set_ylim(0, 400)
    ax.set_aspect('equal')
    ax.set_xlabel('Distance from center (ft)', fontsize=11)
    ax.set_ylabel('Distance from home (ft)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.2)

    return ax

# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
                      (bb_data['is_shifter'] == True)]

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == False],
    'RHB Pull Hitter - No Shift',
    ax=ax1
)

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == True],
    'RHB Pull Hitter - Shift On',
    ax=ax2
)

plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")

# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
    """
    Create BABIP heat map for given conditions.
    """
    subset = data[(data['shift_on'] == shift_status) &
                  (data['stand'] == stand)]

    # Create grid
    x_bins = np.linspace(-250, 250, 25)
    y_bins = np.linspace(50, 350, 20)

    grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
    grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))

    for i in range(len(y_bins)-1):
        for j in range(len(x_bins)-1):
            mask = ((subset['x'] >= x_bins[j]) &
                    (subset['x'] < x_bins[j+1]) &
                    (subset['y'] >= y_bins[i]) &
                    (subset['y'] < y_bins[i+1]))

            cell_data = subset[mask]
            if len(cell_data) >= 5:  # Minimum sample
                grid_babip[i, j] = cell_data['is_hit'].mean()
                grid_count[i, j] = len(cell_data)
            else:
                grid_babip[i, j] = np.nan

    return grid_babip, x_bins, y_bins, grid_count

# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stand in enumerate(['R', 'L']):
    for j, shift_on in enumerate([False, True]):
        ax = axes[i, j]

        shifters = bb_data[bb_data['is_shifter'] == True]
        grid, x_bins, y_bins, counts = create_babip_heatmap(
            shifters, shift_on, stand
        )

        im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
                                     y_bins[0], y_bins[-1]],
                       origin='lower', cmap='RdYlGn_r',
                       vmin=0, vmax=0.5, aspect='auto')

        shift_text = "Shift On" if shift_on else "No Shift"
        hand_text = "RHB" if stand == 'R' else "LHB"
        ax.set_title(f'{hand_text} - {shift_text}',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Horizontal Position (ft)')
        ax.set_ylabel('Distance from Home (ft)')

        # Add colorbar
        plt.colorbar(im, ax=ax, label='BABIP')

plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")

# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features for shift decision model
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
                      ((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)

X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate: use hard predictions for the classification report,
# but predicted probabilities for ROC-AUC (AUC on 0/1 labels understates performance)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
                      model.coef_[0]):
    print(f"  {feat}: {coef:.3f}")

# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)

print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
    print(f"  {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")

print("\n2. WHEN TO SHIFT")
print(" - Strong pull tendency (>70% pull rate)")
print(" - Ground ball hitters (LA < 10°)")
print(" - Extreme pull hitters benefit most from aggressive shifts")

print("\n3. SHIFT VARIATIONS")
print(" - Full shift: 3 infielders on pull side")
print(" - Partial shift: 2.5 infielders pull side")
print(" - No shift: Traditional alignment")
print(" - Decision should consider:")
print("     * Batter's spray chart")
print("     * Game situation (runners, outs)")
print("     * Pitcher's ground ball rate")

print("\n4. LIMITATIONS & CONSIDERATIONS")
print(" - Shift beaten by opposite field hits")
print(" - Bunt defense vulnerabilities")
print(" - Runner advancement opportunities")
print(" - Pitcher-specific adjustments")

print("\n5. FUTURE ANALYSIS")
print(" - Pitcher-specific positioning")
print(" - Count-based positioning adjustments")
print(" - Outfield positioning optimization")
print(" - Real-time adjustment algorithms")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
```

**Portfolio Development Tips**:
- Use real Statcast spray chart data when possible
- Incorporate expected outcomes (xBA, xwOBA)
- Add a video analysis component
- Compare your findings to actual MLB team shift strategies
- Analyze shift effectiveness by ballpark
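If you pursue the first tip, note that Statcast reports hit locations as image-style coordinates (`hc_x`, `hc_y`), not the feet-based x/y convention the solution above uses. A minimal conversion sketch is below; the home-plate origin (125.42, 198.27) and the ~2.5 ft-per-unit scale are community-derived approximations, not official MLB constants, and the sample coordinates are made up for illustration:

```python
import pandas as pd

def statcast_to_field_coords(df, home_x=125.42, home_y=198.27, scale=2.5):
    """Convert Statcast hit coordinates (hc_x, hc_y) to approximate feet,
    matching the convention above: x = horizontal offset from the
    home-to-center line (right field positive), y = distance from home.
    Constants are the commonly used community approximation."""
    out = df.copy()
    out['x'] = scale * (out['hc_x'] - home_x)  # hc_x grows toward right field
    out['y'] = scale * (home_y - out['hc_y'])  # hc_y grows back toward home
    return out

# Tiny example with made-up coordinates (no data download required)
sample = pd.DataFrame({'hc_x': [125.42, 180.0, 70.0],
                       'hc_y': [100.0, 120.0, 140.0]})
converted = statcast_to_field_coords(sample)
print(converted[['x', 'y']])
```

With real data (e.g. pulled via a package such as `pybaseball`), you could feed the converted frame straight into `plot_field_with_hits` or `create_babip_heatmap` in place of the simulated `bb_data`.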

---

Build Your Skills Progressively

Follow this recommended path through the exercises

Data Wrangling

Start here! Learn to load, clean, and manipulate baseball data.

Chapters 1-3
Visualization

Create compelling charts and visualizations of baseball data.

Chapter 4
Metrics & Analysis

Calculate and interpret sabermetric and Statcast metrics.

Chapters 5-8
Advanced Topics

Machine learning, custom metrics, and interactive apps.

Chapters 9-12

Need to Review the Material?

Head back to the chapters to refresh your understanding before tackling exercises.