Introduction to MLB Analytics (4 exercises)
Environment Setup Verification
Write a script that:
1. Loads the necessary baseball analytics packages
2. Prints your R or Python version
3. Queries basic information about a player of your choice using the baseball data package
4. Creates a simple plot (any plot) to verify visualization works
This confirms your environment is properly configured.
Data Hierarchy Exploration
1. Get game-level data for your favorite team
2. Calculate the team's wins and losses
3. Aggregate to find total runs scored and allowed across the season
4. Calculate the team's Pythagorean winning percentage using the formula from the Preface
5. Compare actual winning percentage to Pythagorean expectation
**R Version Hint:**
```r
# Use baseballr to get team game logs
library(baseballr)
library(tidyverse)
# Example for Yankees (team_id = 147)
yankees_games <- mlb_team_schedule(season = 2023,
                                   team_id = 147)
# Then aggregate and calculate...
```
**Python Version Hint:**
```python
# Use pybaseball to get team game logs
from pybaseball import schedule_and_record
# Example for Yankees
yankees_games = schedule_and_record(2023, 'NYY')
# Then aggregate and calculate...
```
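To close the loop on the "aggregate and calculate" step, here is a minimal Python sketch, assuming the `R` (runs scored), `RA` (runs allowed), and `W/L` columns that pybaseball's `schedule_and_record` returns:

```python
import pandas as pd

def pythagorean_check(games: pd.DataFrame) -> dict:
    """Actual vs. Pythagorean W% from a schedule_and_record frame."""
    played = games.dropna(subset=["W/L"])           # drop unplayed games
    rs, ra = played["R"].sum(), played["RA"].sum()
    wins = played["W/L"].str.startswith("W").sum()  # counts "W" and "W-wo"
    return {
        "actual_wpct": wins / len(played),
        "pythag_wpct": rs**2 / (rs**2 + ra**2),
    }
```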
Hit Type Analysis
1. Choose four players representing different offensive profiles
2. Get their 2023 statistics including hit breakdowns
3. Calculate what percentage of their hits were singles, doubles, triples, and home runs
4. Create a visualization comparing these distributions
5. Write a brief interpretation: How do power hitters' hit distributions differ from contact hitters?
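A hedged Python sketch of step 3, assuming a FanGraphs-style stats frame (e.g., from `pybaseball.batting_stats`) with `H`, `2B`, `3B`, and `HR` columns:

```python
import pandas as pd

def hit_distribution(row: pd.Series) -> pd.Series:
    """Share of each hit type out of total hits."""
    singles = row["H"] - row["2B"] - row["3B"] - row["HR"]
    return pd.Series({
        "1B%": singles / row["H"],
        "2B%": row["2B"] / row["H"],
        "3B%": row["3B"] / row["H"],
        "HR%": row["HR"] / row["H"],
    })

# Usage: stats.set_index("Name").loc[your_four_players].apply(hit_distribution, axis=1)
```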
Monthly Performance Trends
1. Get plate appearance data for a player across the 2023 season
2. Group plate appearances by month (April, May, June, July, August, September)
3. Calculate monthly batting average, OBP, and slugging percentage
4. Create a line plot showing how these metrics changed throughout the season
5. Discuss: Do you see evidence of the player getting hot or cold? How confident are you given sample sizes?
**Challenge extension:** Calculate monthly sample sizes and add confidence intervals to your plot to visualize uncertainty.
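One way to approach steps 2-3 and the challenge extension, sketched in Python; the `pa` frame and its `game_date` / `on_base` column names are assumptions standing in for your plate-appearance data:

```python
import numpy as np
import pandas as pd

pa["month"] = pd.to_datetime(pa["game_date"]).dt.month_name()
monthly = pa.groupby("month").agg(n=("on_base", "size"),
                                  obp=("on_base", "mean"))
# Normal-approximation 95% CI to make sample-size uncertainty visible
monthly["se"] = np.sqrt(monthly["obp"] * (1 - monthly["obp"]) / monthly["n"])
monthly["lo"] = monthly["obp"] - 1.96 * monthly["se"]
monthly["hi"] = monthly["obp"] + 1.96 * monthly["se"]
```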
---
You've now completed your introduction to baseball analytics. You understand what baseball analytics is, what questions it addresses, have a working analytical environment, understand baseball's data structure, and have completed a real analysis examining the shift ban's impact. The foundation is laid.
Subsequent chapters build systematically on these foundations. Chapter 2 covers data acquisition—how to get data from various sources. Chapter 3 teaches data manipulation and transformation. Chapters 4-8 introduce core baseball metrics, showing both what they mean conceptually and how to calculate them from raw data. Later chapters cover visualization, modeling, prediction, and specialized topics.
Baseball analytics is ultimately about answering questions that matter. As you work through this book, always return to the questions: What am I trying to understand? What evidence would help answer that question? How confident should I be in my conclusions? The technical skills you'll develop are tools in service of clear thinking about meaningful questions.
Let's continue to Chapter 2, where we'll learn to acquire data from multiple sources, building the datasets that fuel every analysis.
Data Wrangling for Baseball (5 exercises)
Advanced Filtering and Selection
1. How many players hit 30+ home runs with a strikeout rate below 20%?
2. Which players had an OBP of at least .350 and stole 20+ bases?
3. Create a new metric, "Three True Outcomes Rate" = (HR + BB + SO) / PA, and identify the top 10 players
**R Solution Sketch:**
```r
library(baseballr)
library(tidyverse)
batters <- fg_batter_leaders(2024, 2024, qual = 300)
# 1.
q1 <- batters %>%
  filter(HR >= 30, K_percent < 20) %>%
  nrow()
# 2.
q2 <- batters %>%
  filter(OBP >= .350, SB >= 20) %>%
  select(Name, OBP, SB)
# 3.
q3 <- batters %>%
  mutate(TTO_rate = (HR + BB + SO) / PA) %>%
  arrange(desc(TTO_rate)) %>%
  select(Name, HR, BB, SO, PA, TTO_rate) %>%
  head(10)
```
**Python Solution Sketch:**
```python
from pybaseball import batting_stats
batters_py = batting_stats(2024, qual=300)
# 1.
q1_py = batters_py[(batters_py['HR'] >= 30) & (batters_py['K%'] < 20)]
print(f"Count: {len(q1_py)}")
# 2.
q2_py = batters_py[(batters_py['OBP'] >= .350) & (batters_py['SB'] >= 20)][['Name', 'OBP', 'SB']]
# 3.
batters_py['TTO_rate'] = (batters_py['HR'] + batters_py['BB'] + batters_py['SO']) / batters_py['PA']
q3_py = batters_py.nlargest(10, 'TTO_rate')[['Name', 'HR', 'BB', 'SO', 'PA', 'TTO_rate']]
```
Grouping and Team Analysis
1. Which team had the highest average OPS among qualified batters?
2. What's the correlation between team home runs and team wins (you'll need to join with standings data)?
3. Calculate each team's offensive balance: standard deviation of WAR among their top 5 hitters
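A sketch of question 1 in Python, assuming the `Team` and `OPS` columns that `pybaseball.batting_stats` returns:

```python
from pybaseball import batting_stats

batters = batting_stats(2024, qual=502)  # 502 PA = qualified
team_ops = (batters.groupby("Team")["OPS"].mean()
                   .sort_values(ascending=False))
print(team_ops.head())
```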
Time Series Analysis
1. Calculate 15-game rolling averages for BA, OBP, and SLG
2. Identify the longest streak above .300 BA
3. Compare first-half vs. second-half performance
4. Visualize performance trends over the season
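Step 1 reduces to a rolling window over sorted game logs. A sketch, with the `game_logs` frame and its column names as assumptions:

```python
import pandas as pd

game_logs = game_logs.sort_values("Date")
rolling = game_logs[["H", "AB"]].rolling(window=15).sum()
game_logs["BA_15"] = rolling["H"] / rolling["AB"]  # 15-game rolling BA
# OBP and SLG follow the same pattern from their component counts
```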
Joins and Data Integration
1. Identify two-way players (appear in both datasets)
2. For two-way players, calculate their combined WAR
3. Compare offensive WAR vs. pitching WAR
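The core of this exercise is an inner join. A sketch, assuming `bat` and `pit` are batting and pitching leaderboards that share a `Name` column (player IDs are the safer join key when available):

```python
two_way = bat.merge(pit, on="Name", suffixes=("_bat", "_pit"))
two_way["combined_WAR"] = two_way["WAR_bat"] + two_way["WAR_pit"]
print(two_way[["Name", "WAR_bat", "WAR_pit", "combined_WAR"]])
```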
Missing Data Challenge
1. Assess the extent and pattern of missing data
2. Implement three different imputation strategies
3. Compare the impact on correlation between exit velocity and hard-hit rate
4. Justify which imputation method is most appropriate and why
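A sketch of steps 2-3, assuming a frame `df` with `exit_velocity` and `hard_hit_rate` columns containing missing values:

```python
strategies = {
    "mean": df["exit_velocity"].fillna(df["exit_velocity"].mean()),
    "median": df["exit_velocity"].fillna(df["exit_velocity"].median()),
    "listwise_drop": df["exit_velocity"].dropna(),
}
for name, ev in strategies.items():
    hh = df.loc[ev.index, "hard_hit_rate"]  # align rows with each strategy
    print(name, round(ev.corr(hh), 3))
```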
---
This concludes Chapter 2. In the next chapter, we'll explore the rich ecosystem of baseball data sources, learning how to access FanGraphs, Baseball Reference, Statcast, and the Lahman database through R and Python packages.
The Baseball Data Ecosystem (4 exercises)
Multi-Source Data Integration
1. Retrieve 2024 batting statistics for players with 400+ PA
2. For the top 10 hitters by wRC+, get their Statcast data
3. Compare their wOBA (FanGraphs) to their xwOBA (Statcast)
4. Which players are most over-performing or under-performing their expected stats?
**R Solution Sketch:**
```r
library(baseballr)
library(tidyverse)
# 1. Get FanGraphs data
batters <- fg_batter_leaders(2024, 2024, qual = 400)
top10 <- batters %>% arrange(desc(`wRC+`)) %>% head(10)
# 2 & 3. Get Statcast data for each
# (Would loop through players and use statcast_search with their IDs)
# Compare wOBA vs xwOBA
# 4. Calculate differences
# top10 %>% mutate(woba_diff = wOBA - xwOBA)
```
**Python Solution Sketch:**
```python
from pybaseball import batting_stats, statcast_batter, playerid_lookup
import pandas as pd
# 1. Get batting stats
batters = batting_stats(2024, qual=400)
top10 = batters.nlargest(10, 'wRC+')
# 2. Get Statcast data for top 10
# (Would need to lookup MLBAM IDs and query statcast_batter)
# 3 & 4. Compare wOBA vs xwOBA
# Calculate differences to identify over/under-performers
```
Historical Trends with Lahman
1. Calculate the league-average batting average by decade (1920-present)
2. Identify the decade with the highest and lowest scoring
3. Plot the trend of strikeouts per game over time
4. Compare the "Steroid Era" (1995-2005) to the "Modern Era" (2015-2024) in terms of HR rate, K rate, and BA
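A sketch of step 1 using the Lahman tables bundled with pybaseball:

```python
from pybaseball.lahman import batting

bat = batting()
bat = bat[bat["yearID"] >= 1920]
bat["decade"] = (bat["yearID"] // 10) * 10
totals = bat.groupby("decade")[["H", "AB"]].sum()
print(totals["H"] / totals["AB"])  # league BA by decade
```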
Team Performance Analysis
1. Get 2024 team standings (use `mlb_standings` from baseballr or `schedule_and_record` from pybaseball)
2. Calculate team batting statistics (aggregate from individual player data)
3. Join with team ERA (from pitching data)
4. Create a Pythagorean expectation model: Expected W% = R^2 / (R^2 + RA^2)
5. Compare actual wins to expected wins—which teams over-performed?
Statcast Deep Dive
1. Choose a pitcher and retrieve all of their Statcast data for the 2024 season
2. For each pitch type:
- Calculate average velocity, spin rate, and movement
- Calculate whiff rate and zone rate
- Identify which pitch generates the most swings and misses
3. Analyze platoon splits: How do their pitches perform vs. LHB vs. RHB?
4. Create a "pitch quality" ranking based on velocity, spin, and whiff rate
**Deliverable**: A comprehensive report with visualizations showing pitch usage, effectiveness, and recommendations for pitch selection.
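For step 2, most of the work is a grouped aggregation over the pitcher's Statcast rows. A sketch, where `MLBAM_ID` stands in for the ID you look up with `playerid_lookup`:

```python
from pybaseball import statcast_pitcher

df = statcast_pitcher("2024-03-28", "2024-09-29", player_id=MLBAM_ID)
summary = df.groupby("pitch_type").agg(
    velo=("release_speed", "mean"),
    spin=("release_spin_rate", "mean"),
    # Simplified whiff rate; a fuller version divides by swings only
    whiff_rate=("description", lambda s: (s == "swinging_strike").mean()),
)
print(summary.sort_values("whiff_rate", ascending=False))
```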
---
This concludes Chapter 3. You now have the tools to access virtually any baseball dataset available. In the next chapters, we'll use these data sources to explore specific analytical techniques, from evaluating hitters and pitchers to building predictive models and crafting advanced visualizations.
The combination of R's `baseballr` package and Python's `pybaseball` library, along with the historical richness of the Lahman database, provides everything you need to conduct professional-grade baseball analysis. Master these tools, and you'll be equipped to answer almost any baseball question with data.
Statcast Analytics - Pitching (3 exercises)
Pitcher Comparison Analysis
1. Pull data for both pitchers for the 2024 season
2. Calculate velocity, spin rate, and movement profiles for each pitch type
3. Compare their arsenals: usage rates, average velocity, and whiff rates
4. Create a pitch movement chart for each pitcher
5. Write a brief scouting report comparing their arsenals
**Suggested pitchers**: Spencer Strider (ATL) and Shota Imanaga (CHC)
Command and Location Analysis
1. Choose a pitcher and calculate zone%, edge%, heart%, and chase rate by pitch type
2. Analyze CSW% overall and by pitch type
3. Create a heatmap showing pitch locations for their primary pitch (four-seam fastball)
4. Compare location patterns by count (ahead vs. behind in the count)
5. Assess: Is this pitcher's success driven more by stuff or command?
Arsenal Effectiveness Study
1. Choose a pitcher. For each of their pitch types, calculate:
- Whiff rate
- xwOBA against
- Hard hit rate against
- CSW%
- Usage rate
2. Identify the best and worst pitches in the arsenal
3. Analyze if usage rate aligns with effectiveness (do they throw their best pitches most?)
4. Calculate pitch values: Run Value per 100 pitches for each pitch type
5. Make a recommendation: Should they adjust their pitch usage?
**Challenge Extension**: Compare the pitcher's arsenal effectiveness against left-handed vs. right-handed batters. Do they have platoon splits? Which pitches drive those splits?
---
You've now completed your deep dive into Statcast pitching analytics. You understand how modern tracking systems measure every pitch, what those measurements reveal about pitcher performance, and how to analyze arsenals, command, and expected outcomes. These skills form the foundation for evaluating pitchers in the modern game, whether you're building projection models, designing development plans, or making strategic decisions.
The next chapter will explore park factors and environmental effects - understanding how context affects the statistics we've been analyzing throughout this book.
Fielding & Baserunning Analytics (4 exercises)
Calculating Simple OAA
1. Get batted ball data for one month (July 2024 recommended)
2. Filter for outfield fly balls and line drives
3. Calculate catch probability based on distance and hang time (use simplified model from section 8.2.2)
4. Determine actual outcomes (caught or not)
5. Calculate OAA for each outfielder as sum of (actual - expected)
6. Identify the top 5 and bottom 5 defenders
**Bonus**: Compare your simplified OAA to Baseball Savant's official OAA for the same players and time period. How close are they?
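Steps 4-6 reduce to a grouped sum once each batted ball carries an expected catch probability from your simplified model. A sketch with assumed column names:

```python
df["oaa_contrib"] = df["caught"].astype(int) - df["catch_prob"]
oaa = df.groupby("fielder")["oaa_contrib"].sum().sort_values()
print(oaa.tail(5))  # top 5 defenders
print(oaa.head(5))  # bottom 5
```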
Shift Impact Analysis
1. Get Statcast data for ground balls hit by left-handed batters for:
- May 2022 (shifts allowed)
- May 2023 (shifts banned)
2. Calculate ground ball hit rates for each month
3. Perform a statistical test for the difference
4. Create a visualization comparing the two periods
5. Calculate how many extra hits occurred in 2023 vs expected based on 2022 rates
**Challenge**: Identify which individual players benefited most from the shift ban by comparing their 2022 vs 2023 ground ball BABIP.
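Step 3 is a two-proportion test. A sketch using statsmodels, where the hit and ground-ball counts come from your step 2 aggregation:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

stat, pval = proportions_ztest(count=np.array([hits_2023, hits_2022]),
                               nobs=np.array([gb_2023, gb_2022]))
print(f"z = {stat:.2f}, p = {pval:.4f}")
```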
Sprint Speed and Stolen Base Efficiency
1. Get sprint speed data for all qualified players (2024)
2. Get stolen base attempts and success rates
3. Calculate stolen base success rate for players with 10+ attempts
4. Create a scatter plot of sprint speed vs SB success rate
5. Fit a regression model and interpret the relationship
6. Identify players who over/under-perform their expected SB rate based on speed
**Question**: What sprint speed corresponds to 75% stolen base success (break-even point)?
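A sketch of steps 5-6 and the break-even question, modeling success counts with a binomial GLM; the `df` layout (sprint_speed, SB, CS columns) is an assumption:

```python
import numpy as np
import statsmodels.api as sm

X = sm.add_constant(df["sprint_speed"])
fit = sm.GLM(df[["SB", "CS"]], X, family=sm.families.Binomial()).fit()
df["expected_rate"] = fit.predict(X)
df["over_expected"] = df["SB"] / (df["SB"] + df["CS"]) - df["expected_rate"]
# Speed where the model predicts 75% success: logit(0.75) = log(3)
b0, b1 = fit.params
print((np.log(3) - b0) / b1)
```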
Defensive Value Comparison
1. Select 20 players across multiple positions (2024 season)
2. Collect their OAA, UZR, and DRS values
3. Standardize all metrics to same scale (z-scores)
4. Calculate correlation between metrics
5. Identify players where metrics disagree significantly
6. Create a visualization showing agreement/disagreement
**Question**: For which positions do the metrics agree most? Where do they diverge most? Why might this be?
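Steps 3-4 in sketch form, assuming `df` holds one row per player with OAA, UZR, and DRS columns:

```python
for col in ["OAA", "UZR", "DRS"]:
    df[f"{col}_z"] = (df[col] - df[col].mean()) / df[col].std()
print(df[["OAA_z", "UZR_z", "DRS_z"]].corr())
# Largest spread across the standardized metrics flags disagreement
z = df[["OAA_z", "UZR_z", "DRS_z"]]
df["disagreement"] = z.max(axis=1) - z.min(axis=1)
```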
---
You've now completed your introduction to fielding and baserunning analytics. You understand why defense is challenging to measure, how modern metrics like OAA work, the value of positioning and shifts, and how to evaluate baserunning contribution. Defense and baserunning combined can account for 2-3 WAR per season for elite players—real, measurable value that traditional statistics completely missed.
The Statcast revolution has transformed defensive evaluation from subjective ("he looks good") to objective ("he made 73% of plays with 65% average probability"). We can now properly credit players like Kevin Kiermaier, Yadier Molina, and Matt Chapman for defensive excellence that was previously unrecognized in conventional statistics.
In Chapter 9, we'll turn to win probability and leveraged situations, understanding how context affects player and managerial decisions. The technical skills you've developed throughout this book will combine to help you evaluate complete players—offense, defense, baserunning, and situational performance—using the full arsenal of modern analytics.
Fantasy Baseball & Sports Betting Analytics (4 exercises)
Fantasy Player Valuation
1. Calculate replacement-level statistics (use the worst player's projections)
2. Define per-point denominators for each category (assume reasonable league spreads)
3. Calculate SGP for each player
4. Convert SGP to auction values (12-team league, $260 budget)
5. Visualize the relationship between a player's projected home runs and their auction value
**Extension**: How does the value of a player with extreme stolen base totals (50+) change if the league adds OBP as a sixth category?
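A sketch of steps 1-3. The projections frame `proj` and the per-category SGP denominators are assumptions; in practice you estimate the denominators from your league's historical standings:

```python
denoms = {"HR": 9.5, "R": 22.0, "RBI": 21.0, "SB": 7.0}  # illustrative only
repl = proj.loc[proj["proj_points"].idxmin()]  # worst player = replacement
proj["SGP"] = sum((proj[c] - repl[c]) / d for c, d in denoms.items())
```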
DFS Optimization
1. Generate random projections (points) and salaries for 20 players across positions
2. Implement a greedy algorithm that selects players by value-per-dollar
3. Compare the greedy solution to a random selection
4. Calculate what percentage of random lineups beat the greedy lineup
5. Discuss: Why might the optimal lineup differ from highest value-per-dollar?
**Extension**: Add stack constraints (at least 3 batters from the same team must be selected).
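A minimal greedy selector for step 2, ignoring position constraints for brevity; players are (name, salary, projected_points) tuples and the cap is illustrative:

```python
def greedy_lineup(players, cap=50_000, size=9):
    """Pick by points-per-dollar until the lineup or cap is exhausted."""
    lineup, spent = [], 0
    for p in sorted(players, key=lambda p: p[2] / p[1], reverse=True):
        if len(lineup) < size and spent + p[1] <= cap:
            lineup.append(p)
            spent += p[1]
    return lineup
```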
Implied Probability and Vig Analysis
You are given the following moneyline odds for five games:
```
Game 1: Team A -180, Team B +160
Game 2: Team C -110, Team D -110
Game 3: Team E +140, Team F -160
Game 4: Team G -125, Team H +105
Game 5: Team I -200, Team J +175
```
1. Calculate implied probability for each team
2. Calculate the vig for each game
3. Identify which game has the highest and lowest vig
4. If you believe Team B has a 42% chance to win, calculate the EV of betting $100 on them
5. Visualize implied probabilities vs. your model probabilities (create hypothetical model probabilities)
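The conversions behind steps 1, 2, and 4, sketched in Python:

```python
def implied_prob(american: int) -> float:
    """Implied win probability from American moneyline odds."""
    if american < 0:
        return -american / (-american + 100)
    return 100 / (american + 100)

# Game 1: the vig is the amount the implied probabilities exceed 1
vig = implied_prob(-180) + implied_prob(160) - 1

# Step 4: EV of $100 on Team B at +160 with a 42% win probability
ev = 0.42 * 160 - 0.58 * 100
```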
Bankroll Simulation
Simulate a season of betting under the following parameters:
1. Starting bankroll: $1,000
2. 100 bets over the season
3. Each bet has 53% win probability (representing edge over 50%)
4. Odds: -110 for all bets
5. Three bet sizing strategies: flat $50 per bet, 5% Kelly, 2% Kelly
For each strategy:
- Simulate 1,000 seasons (1,000 × 100 bets)
- Calculate median ending bankroll
- Calculate probability of ruin (ending bankroll < $100)
- Calculate 90th percentile outcome
- Visualize distribution of ending bankrolls
Discuss which strategy you'd recommend and why.
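A simulation skeleton covering all three strategies; the payout for -110 odds is 100/110 per dollar staked, and each sizing rule is passed in as a function:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(sizing, n_seasons=1000, n_bets=100, p=0.53, payout=100 / 110):
    """Ending bankrolls across simulated seasons for one sizing rule."""
    finals = []
    for _ in range(n_seasons):
        bank = 1000.0
        for _ in range(n_bets):
            stake = sizing(bank)
            if stake <= 0 or stake > bank:
                break  # effectively busted
            bank += stake * payout if rng.random() < p else -stake
        finals.append(bank)
    return np.array(finals)

flat = simulate(lambda b: 50.0)
frac = simulate(lambda b: 0.02 * b)  # e.g., a 2%-of-bankroll rule
print(np.median(flat), (flat < 100).mean(), np.percentile(flat, 90))
```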
---
Fantasy baseball and sports betting analytics demonstrate how rigorous quantitative methods inform decisions under uncertainty. Whether valuing players across multiple performance dimensions, optimizing roster construction against salary constraints, or calculating expected value for betting opportunities, the principles of probability, statistics, and optimization provide frameworks for systematic decision-making.
These applications also illustrate analytics' limitations. Fantasy success depends on projection accuracy—but player performance includes irreducible randomness. Sports betting models require edge over sophisticated market prices—but even small edges require disciplined bankroll management to survive variance. No amount of analytical sophistication eliminates uncertainty or guarantees profits.
The skills developed in this chapter—converting probabilities to decisions, optimizing under constraints, managing risk—transfer to countless domains beyond sports. Data scientists in any field face similar challenges: building predictive models, quantifying uncertainty, and making optimal choices given limited information. Baseball provides a rich, accessible environment for developing these capabilities.
Chapter 12 will explore advanced topics in baseball analytics, including machine learning applications, deep learning for pitch classification, and cutting-edge research areas shaping the field's future.
Team Building & Roster Construction (4 exercises)
Free Agent Cost Analysis
a) Compare cost per WAR across different position groups (pitchers vs hitters, premium positions vs corner positions)
b) Analyze whether older players (33+) cost more or less per WAR than younger free agents (28-30)
c) Identify which signing appears most efficient (best value) and least efficient (worst value)
**Data to collect:**
- Player name, position, age
- Contract terms (years, AAV)
- Projected WAR for first year (use Steamer or ZiPS)
**Hint:** Check the FanGraphs or Baseball Prospectus free-agent trackers for contract terms and projections.
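Once the data are collected, part (a) is a one-line grouped summary. A sketch with assumed column names:

```python
signings["cost_per_war"] = signings["AAV"] / signings["proj_WAR"]
print(signings.groupby("position_group")["cost_per_war"].median())
```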
Trade Surplus Value
**Your analysis should:**
a) Calculate total surplus value for each side of the trade (projected WAR × market rate - expected salary over years of control)
b) Apply discount rates to future value (use 5-10%)
c) Determine which team "won" the trade based on surplus value
d) Discuss how competitive windows might make the trade beneficial for both sides despite unequal surplus value
**Suggested trades:**
- Juan Soto to Padres (2022)
- Tyler Glasnow to Dodgers (2023)
- Dylan Cease to Padres (2024)
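The surplus-value arithmetic in parts (a)-(b), sketched with illustrative inputs; the $/WAR market rate and salaries are placeholders, not figures from the chapter:

```python
def surplus_value(war_by_year, salaries_m, dollars_per_war_m=8.0, rate=0.08):
    """Discounted surplus in $M: (WAR x market rate - salary), year by year."""
    return sum((war * dollars_per_war_m - sal) / (1 + rate) ** t
               for t, (war, sal) in enumerate(zip(war_by_year, salaries_m)))

print(surplus_value([4.5, 4.0, 3.5], salaries_m=[10, 12, 14]))
```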
Draft Pick Value Curve
**For picks 1-30:**
a) Calculate what percentage reached MLB (100+ PA or 50+ IP)
b) For those who reached MLB, calculate total career WAR through 2023
c) Build a value curve showing expected WAR by draft position
d) Identify which picks outperformed or underperformed expectations
**Extension:** Compare college vs high school players. Do high school picks have higher variance? Higher ceiling?
Competitive Window Modeling
a) Identify core players and project their WAR trajectory over next 5 years using aging curves
b) Estimate prospect contribution (consult top prospect lists)
c) Calculate total projected WAR and expected wins for each season
d) Determine the team's optimal strategy: compete now, rebuild, or middle ground
e) Recommend specific roster moves (trades, free agent signings, or sell-offs) that align with your recommended strategy
**Teams with interesting situations:**
- Baltimore Orioles (young core, rising)
- St. Louis Cardinals (aging core, crossroads)
- Los Angeles Angels (Trout aging, weak farm)
- Tampa Bay Rays (perennial contender, low payroll)
---
**Chapter Summary**
Team building combines economics, player valuation, strategic timing, and organizational philosophy. Key takeaways:
1. **Economic Efficiency**: Pre-arbitration players provide 40x ROI vs free agents; exploit this arbitrage
2. **Positional Value**: Premium defensive positions (C, SS, CF) allow lower offensive standards
3. **Free Agent Markets**: Account for aging curves, apply discount rates, avoid winner's curse
4. **Trade Strategy**: Exchange surplus value across different timelines; align with competitive windows
5. **Draft Philosophy**: Balance upside (high school) vs safety (college) based on organizational timeline
6. **Strategic Clarity**: Commit fully to competing or rebuilding; avoid mediocre middle ground
Successful team building requires analytical rigor, clear strategic vision, and disciplined execution. The best front offices combine quantitative analysis with qualitative evaluation, organizational development, and adaptive strategy. As analytics evolve, teams that integrate new methods while maintaining coherent long-term plans will sustain competitive advantage.
Player Development & Minor League Analytics (4 exercises)
Age-Adjusted Performance Analysis
**Data**:
```
Prospect: SS, Age 20
Level: High-A (League Avg Age: 22.8)
Stats: .275 AVG, .345 OBP, .485 SLG, 15 HR, 285 PA
12.2% BB%, 24.5% K%, .210 ISO
```
**Questions**:
1. Calculate the prospect's age-adjusted wRC+ (assume league average is 100)
2. How does the strikeout rate compare when adjusted for age?
3. Based on age-adjusted metrics, is this prospect ahead or behind the development curve?
4. What level should this prospect be promoted to next, and why?
Breakout Candidate Identification
**Prospect Comparison**:
| Metric | Prospect A | Prospect B | Prospect C |
|--------|-----------|-----------|-----------|
| Current wRC+ | 105 | 118 | 98 |
| Chase Rate Change | -4.5% | -1.2% | +2.1% |
| Zone Contact Change | +3.2% | +1.8% | -0.5% |
| Avg EV Change | +2.1 mph | +0.8 mph | +3.5 mph |
| Barrel Rate | 8.5% | 11.2% | 6.8% |
| Age | 22 | 24 | 21 |
**Questions**:
1. Calculate a composite breakout score for each prospect
2. Which prospect shows the most promising leading indicators?
3. What specific improvements drive your choice?
4. What realistic wRC+ would you project for each prospect next season?
International Prospect Translation
**Player Data**:
```
Player: OF, Age 26
League: KBO (Korean Baseball Organization)
Stats: .318 AVG, .385 OBP, .538 SLG, 28 HR, 550 PA
9.5% BB%, 15.2% K%, .220 ISO
Previous MLB exposure: None
```
**Questions**:
1. Translate the KBO statistics to MLB equivalents using appropriate league factors
2. What MLB slash line would you project for Year 1?
3. What is the biggest risk factor in this projection?
4. How would your projection change if the player were age 23 instead of 26?
Call-Up Decision Analysis
**Prospect Profile**:
```
Position: 3B, Age 22
AAA Stats (145 PA): .298/.375/.512, 6 HR, 12.4% BB%, 21.2% K%
AA Stats (425 PA): .285/.360/.485, 18 HR, 10.1% BB%, 24.5% K%
Defensive Grade: 55 (above average)
Future WAR Projection: 4.0 WAR annually (ages 25-29)
```
**Team Context**:
```
Current 3B Production: 85 wRC+ (below average)
Team Record: 15-18 (below .500)
Payroll Situation: Middle of pack
Days until full year service time: 12 days (mid-April)
Estimated Super Two cutoff: Already passed
```
**Questions**:
1. Calculate the financial value of delaying the call-up until mid-June
2. What is the estimated WAR cost of keeping an 85 wRC+ player at 3B for 6 more weeks?
3. Make your recommendation: Call up now or wait? Justify with analysis.
4. What performance threshold would make you change your decision?
---
**Exercise Solutions**: Solutions to these exercises involve combining multiple analytical techniques from the chapter. Students should use the code frameworks provided to build their own analysis pipelines, applying appropriate age adjustments, translation factors, and decision models. The exercises emphasize practical decision-making under uncertainty, mirroring real-world front office challenges.
Advanced Statistical Methods (4 exercises)
Bayesian Rookie Evaluation
1. Calculate a Bayesian estimate of his true ERA using a prior of N(4.00, 0.80^2)
2. Construct a 95% credible interval
3. What's the probability his true ERA is below 3.50?
4. How many innings would he need to pitch at 2.52 ERA before we're 90% confident his true ERA is below 3.50?
Implement your solution in R or Python and visualize the posterior distribution.
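A normal-normal conjugate sketch for steps 1-3. The observation noise `sigma_obs` (how much a sample ERA varies around true talent over the rookie's workload) is an assumption you must estimate:

```python
import numpy as np
from scipy import stats

prior_mu, prior_sd = 4.00, 0.80
obs_era, sigma_obs = 2.52, 1.20  # sigma_obs is illustrative

post_var = 1 / (1 / prior_sd**2 + 1 / sigma_obs**2)
post_mu = post_var * (prior_mu / prior_sd**2 + obs_era / sigma_obs**2)
post_sd = np.sqrt(post_var)

print(stats.norm.interval(0.95, loc=post_mu, scale=post_sd))  # credible interval
print(stats.norm.cdf(3.50, loc=post_mu, scale=post_sd))       # P(true ERA < 3.50)
```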
Hierarchical Team-Player Model
1. Fit a hierarchical model where player OPS varies by player and team
2. Extract team-level random effects
3. Compare team effects to actual team winning percentages
4. Identify players whose performance is most "above" or "below" what the hierarchical model predicts based on their team
5. Discuss: What does it mean when a player substantially outperforms his hierarchical prediction?
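A random-intercept model is a simple entry point for step 1. A sketch with statsmodels; the `df` layout (OPS and Team columns) is assumed:

```python
import statsmodels.formula.api as smf

fit = smf.mixedlm("OPS ~ 1", data=df, groups=df["Team"]).fit()
# Team-level deviations from the overall intercept
team_effects = {team: re.iloc[0] for team, re in fit.random_effects.items()}
```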
Detecting Velocity Changes
1. Plot velocity over time (by game or pitch number)
2. Fit a smooth trend to identify any systematic velocity decline
3. Use changepoint detection to identify if there was an abrupt velocity drop (potentially indicating injury)
4. Calculate the probability that any detected changepoint is real vs noise
5. If you detect a changepoint, investigate whether the pitcher's effectiveness (measured by wOBA allowed or K-BB%) changed simultaneously
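For step 3, a dedicated changepoint library is one option. A sketch with the third-party `ruptures` package; `velo` is your pitch-ordered velocity series:

```python
import ruptures as rpt

signal = velo.to_numpy().reshape(-1, 1)
algo = rpt.Pelt(model="l2", min_size=50).fit(signal)
breakpoints = algo.predict(pen=10)  # the penalty tunes sensitivity
```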
Causal Effect of Designated Hitter
1. Compare the career length of pitchers in AL vs NL (before universal DH in 2022)
2. Account for confounding variables: pitcher quality (career ERA+), usage pattern (starter vs reliever), age at debut
3. Estimate the causal effect of playing in the AL (with DH) vs NL (pitchers bat) on pitcher career length
4. Discuss: Why might the DH extend (or shorten) pitcher careers? What are alternative explanations for any observed difference?
---
You've now explored six advanced statistical techniques with particular application to baseball analytics. Bayesian methods help us update beliefs rationally in the face of limited data. Hierarchical models share information across groups to improve estimates. Time series methods model temporal trends and detect change points. Survival analysis quantifies time-to-event processes. Causal inference attempts to establish causation from observational data. Monte Carlo simulation quantifies uncertainty and probabilities in complex systems.
These methods represent the frontier of baseball analytics. While traditional statistics remain valuable for description and basic inference, these advanced techniques enable analysts to answer more sophisticated questions: not just "what happened?" but "what's the player's true talent?", "what caused this change?", "what will happen?", and "how certain should we be?"
As you apply these methods to real baseball questions, remember that sophisticated techniques don't guarantee correct answers. Always question your assumptions: Are your priors reasonable? Are your groups truly comparable? Is your model appropriate for your data structure? The best analysts combine advanced statistical methods with deep baseball knowledge and healthy skepticism about their own conclusions.
In the next chapter, we'll explore machine learning methods that build on these statistical foundations to create predictive models for player performance, game outcomes, and strategic decisions.
Umpire Analysis & Strike Zone Modeling (4 exercises)
Umpire Accuracy Analysis
a) Calculate the overall accuracy rate for each umpire (minimum 1,000 called pitches)
b) Identify the five most accurate and five least accurate umpires
c) Create a visualization comparing each umpire's accuracy on pitches inside vs. outside the strike zone
d) Test whether there is a statistically significant difference in accuracy between the most and least accurate umpires
**Hint:** Use a two-sample t-test or permutation test to assess statistical significance. Consider whether accuracy rates are normally distributed.
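A permutation-test sketch for part (d), given boolean arrays `a` and `b` of correct/incorrect calls for the two umpires:

```python
import numpy as np

rng = np.random.default_rng(1)
observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])
diffs = np.empty(10_000)
for i in range(10_000):
    rng.shuffle(pooled)
    diffs[i] = pooled[:len(a)].mean() - pooled[len(a):].mean()
pval = (np.abs(diffs) >= abs(observed)).mean()
```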
Strike Zone Visualization
a) Create a heat map showing the probability of a called strike at different locations
b) Overlay the rulebook strike zone on your visualization
c) Identify regions where the umpire's zone significantly differs from the rulebook (>20 percentage points)
d) Create a similar visualization for the league average and place them side-by-side for comparison
**Hint:** Use 2D binning or kernel density estimation to create smooth probability surfaces. The `stat_summary_2d()` function in ggplot2 or `scipy.stats.binned_statistic_2d()` in Python are helpful.
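The hint's Python route, sketched; `plate_x` and `plate_z` are standard Statcast fields, and `is_called_strike` is a 0/1 column you derive:

```python
from scipy.stats import binned_statistic_2d

stat, xedges, yedges, _ = binned_statistic_2d(
    df["plate_x"], df["plate_z"], df["is_called_strike"],
    statistic="mean", bins=[20, 20],
)  # stat[i, j] = called-strike probability in that location bin
```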
Predicting Called Strikes
a) Train a logistic regression model using pitch location, count, batter handedness, and pitcher handedness as features
b) Train a random forest model with the same features
c) Add umpire identity as a feature to both models (use one-hot encoding)
d) Compare the models using AUC, accuracy, and calibration plots
e) Identify which features are most important in each model
f) Use the best model to identify the 10 most surprising calls from the 2024 season (largest difference between predicted probability and actual call)
**Hint:** Feature importance can be extracted from logistic regression coefficients and random forest's `feature_importances_` attribute. For surprising calls, look for high-probability strikes called balls and vice versa.
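A compact sketch of parts (a)-(d); the `calls` frame layout is an assumption:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X = pd.get_dummies(calls[["plate_x", "plate_z", "balls", "strikes",
                          "stand", "p_throws"]], drop_first=True)
y = calls["is_called_strike"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          round(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]), 3))
```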
ABS Impact Simulation
a) For each pitch in your dataset, determine whether the human umpire's call matches what ABS would call
b) Calculate the overall agreement rate and identify systematic biases (e.g., do human umpires call more strikes or fewer strikes than ABS?)
c) Estimate how strikeout rates and walk rates would change under full ABS (focus on pitches with 2 strikes and 3 balls respectively)
d) Calculate the expected number of calls that would be overturned per game
e) Analyze whether certain types of pitchers (high strikeout, high walk, etc.) would be helped or hurt more by ABS
**Hint:** You'll need to define the ABS zone precisely using the sz_top and sz_bot variables. Consider grouping pitchers by strikeout and walk rates to assess differential impacts.
---
This chapter has covered the fundamentals of umpire analysis and strike zone modeling, from defining accuracy metrics to building predictive models and evaluating the potential impact of automated systems. As MLB continues to consider the role of technology in officiating, these analytical tools will remain essential for understanding how umpires influence the game and how changes to ball-strike calling might affect gameplay and strategy. The combination of granular pitch-tracking data and sophisticated statistical modeling allows us to evaluate umpire performance with unprecedented precision while also informing important decisions about the future of the sport.
International Baseball Analytics (4 exercises)
NPB Translation Model
**Player Profile:**
- Age: 25
- Final NPB season: .305/.380/.520, 28 HR in 550 PA
- Position: Corner OF
- Exit velocity: 106 mph (NPB measurement)
**Tasks:**
1. Apply the translation factors from Section 23.2
2. Calculate projected MLB slash line and HR total
3. Estimate first-year WAR using the projection
4. Assess confidence level and identify key uncertainties
**Bonus:** Compare your projection to actual performance of similar NPB players (e.g., Seiya Suzuki, Masataka Yoshida).
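The mechanics of task 1, with loudly placeholder factors; substitute the actual translation factors from Section 23.2:

```python
factors = {"AVG": 0.88, "OBP": 0.90, "SLG": 0.82}  # placeholders, NOT the book's
npb = {"AVG": 0.305, "OBP": 0.380, "SLG": 0.520}
print({k: round(npb[k] * factors[k], 3) for k in npb})
```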
KBO Pitcher Projection
**Pitcher Profile:**
- 2.65 ERA, 1.15 WHIP
- 9.8 K/9, 2.8 BB/9, 0.75 HR/9
- 175 IP, 15-6 record
**Tasks:**
1. Using the KBO pitcher translation model from Section 23.3, project his MLB stats
2. Calculate projected FIP and ERA
3. Build a confidence interval for your ERA projection
4. Compare to similar KBO pitchers (e.g., Hyun-Jin Ryu, Kwang-Hyun Kim)
5. Recommend a contract structure based on projection and risk
Latin American Tool-Based Valuation
**Prospect A:**
- Age: 17
- Hit: 60, Power: 70, Speed: 55, Field: 55, Arm: 60
- Exit velocity: 108 mph
- Asking bonus: $3.5M
**Prospect B:**
- Age: 16
- Hit: 65, Power: 60, Speed: 70, Field: 65, Arm: 60
- Exit velocity: 104 mph
- Asking bonus: $4.0M
**Prospect C:**
- Age: 18
- Hit: 55, Power: 75, Speed: 45, Field: 50, Arm: 55
- Exit velocity: 112 mph
- Asking bonus: $2.5M
**Tasks:**
1. Use the Latin American projection model from Section 23.4
2. Project WAR through age 23 for each prospect
3. Calculate value per dollar of bonus
4. Rank the prospects considering both ceiling and floor outcomes
5. Recommend which prospect(s) to sign and at what price
**Advanced:** Simulate 1,000 career paths for each prospect incorporating uncertainty and injury risk.
International League Environment Analysis
**Tasks:**
1. Calculate park-adjusted metrics for each league
2. Estimate "true talent" translation factors using regression to the mean
3. Build a Bayesian updating system that improves projections as players accumulate MLB PA
4. Create visualizations comparing league offensive environments over time (2015-2023)
5. Develop recommendations for adjusting scouting priorities based on league trends
**Data Required:**
- League-wide statistics (provided in section)
- Park factors (research or estimate)
- Historical translation success rates
**Deliverables:**
- R or Python code implementing your analysis
- Report summarizing findings
- Recommendations for international scouting departments
---
Front Office & Analytics Career Guide (4 exercises)
Pitcher Arsenal Analysis and Optimization
**Skills Demonstrated**: Data acquisition, exploratory analysis, visualization, strategic thinking
**Project Steps**:
1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
2. Analyze pitch characteristics (velocity, movement, spin)
3. Evaluate pitch effectiveness by count and situation
4. Identify optimization opportunities
5. Create compelling visualizations
6. Write executive summary with recommendations
**R Implementation**:
```r
# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns
library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)
# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data
  set.seed(123)
  n_pitches <- 2500
  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}
# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")
# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
n = n(),
pct = n() / nrow(pitcher_data),
avg_velo = mean(release_speed, na.rm = TRUE),
avg_spin = mean(release_spin_rate, na.rm = TRUE)
) %>%
arrange(desc(n))
print("Pitch Mix:")
print(pitch_mix)
# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
group_by(pitch_type) %>%
summarize(
usage = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
na.rm = TRUE),
avg_exit_velo = mean(launch_speed, na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
# NOTE: a proxy only; true chase rate needs pitch location (out-of-zone swings)
chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
) %>%
arrange(desc(csw_rate))
print("\nPitch Effectiveness:")
print(pitch_effectiveness)
# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
mutate(count = paste0(balls, "-", strikes)) %>%
group_by(count, pitch_type) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
.groups = "drop"
) %>%
group_by(count) %>%
mutate(usage_pct = n / sum(n)) %>%
arrange(count, desc(usage_pct))
# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
group_by(pitch_type, stand) %>%
summarize(
n = n(),
whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_wider(
names_from = stand,
values_from = c(n, whiff_rate, avg_xwoba),
names_sep = "_"
)
print("\nPlatoon Splits:")
print(platoon_splits)
# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
"FF" = "#d22d49", "SI" = "#FE9D00",
"SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)
movement_plot <- ggplot(pitcher_data,
aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
geom_point(alpha = 0.3, size = 2) +
stat_ellipse(level = 0.75, size = 1.2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Pitch Movement Profile",
subtitle = "Catcher's perspective (RHP)",
x = "Horizontal Break (inches)",
y = "Induced Vertical Break (inches)",
color = "Pitch Type"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "right"
) +
coord_fixed()
# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
geom_point(alpha = 0.4, size = 2) +
scale_color_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
labs(
title = "Velocity vs. Spin Rate",
x = "Release Speed (mph)",
y = "Spin Rate (rpm)",
color = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")
# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
geom_col(position = "stack") +
scale_fill_manual(values = pitch_colors,
labels = c("FF" = "Four-Seam", "SI" = "Sinker",
"SL" = "Slider", "CH" = "Changeup",
"CU" = "Curveball")) +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Pitch Usage by Count",
x = "Count",
y = "Usage %",
fill = "Pitch Type"
) +
theme_minimal() +
theme(legend.position = "right")
# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
plot_annotation(
title = "Comprehensive Pitcher Arsenal Analysis",
subtitle = "Example Pitcher - 2024 Season",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)
print(combined_plot)
# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")
  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)
  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")
  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)
  if (nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for (i in 1:nrow(underused)) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }
  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)
  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")
  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}
generate_recommendations(pitcher_data, pitch_effectiveness)
# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")
```
**Portfolio Presentation Tips**:
- Include interactive visualizations (consider using plotly)
- Compare pitcher to league averages
- Add context about pitcher role and team strategy
- Discuss limitations (sample size, park factors, etc.)
- Provide actionable recommendations
Player Aging Curves and Performance Projection
**Skills Demonstrated**: Statistical modeling, time series analysis, predictive analytics, data visualization
**Python Implementation**:
```python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)
# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)
    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)
        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)
        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(
                0, years_range[1] - years_range[0])
            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor
            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))
            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)
            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })
    return pd.DataFrame(players)
# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)
print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")
# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])
    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]
    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()
    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()
    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']
    return aging_curve
# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")
metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}
for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")
# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])
    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)
    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)
    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")
# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]
    # Plot raw deltas, sized by sample count
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n'] * 2, alpha=0.6, label='Observed')
    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                            curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)
    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)
    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")
# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters
        ----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns
        -------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (1:2:3, most recent highest),
        # aligned with .iloc[-3:], which runs oldest to newest
        recent = player_history.iloc[-3:]
        weights = np.array([1, 2, 3])[-len(recent):]
        weights = weights / weights.sum()
        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]
        projections = {}
        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue
            # Weighted average of recent performance
            baseline = np.average(recent[metric], weights=weights)
            # Apply aging curve year by year
            model, poly = self.aging_models[metric]
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment
            projections[metric] = projected_value
        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward
        return projections
# 5. Test Projection System
projector = PlayerProjector(fitted_models)
# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]
test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')
print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")
print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))
# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)
for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")
# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate one-year-ahead projection accuracy on historical data.
    """
    results = []
    for player_id in data['player_id'].unique():
        history = data[data['player_id'] == player_id].sort_values('age')
        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(history) < 4:
            continue
        # Use all but the last season for projection
        train_data = history.iloc[:-1]
        actual_data = history.iloc[-1]
        if len(train_data) < 3:
            continue
        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)
        except Exception:
            continue
        for metric in ['wRC_plus', 'ISO', 'AVG']:
            if metric in projection:
                results.append({
                    'player_id': player_id,
                    'metric': metric,
                    'actual': actual_data[metric],
                    'projected': projection[metric],
                    'error': projection[metric] - actual_data[metric]
                })
    return pd.DataFrame(results)
print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)
print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)
for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]
    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())
        # Calculate R-squared
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0
        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")
# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]
    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                   alpha=0.4, s=30)
        # Add y = x reference line
        min_val = min(metric_eval['actual'].min(),
                      metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(),
                      metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
                'r--', linewidth=2, label='Perfect Projection')
        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                     fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")
```
**Extension Ideas**:
- Incorporate minor league translation factors
- Add injury risk modeling
- Create playing time projections
- Develop position-specific aging curves
- Compare to established projection systems (Steamer, ZiPS)
Draft Value Analysis and Strategy Optimization
**Skills Demonstrated**: Data analysis, value modeling, strategic thinking, data visualization
**Key Analysis Components**:
```r
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy
library(tidyverse)
library(survival)
library(ggplot2)
library(scales)
# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)
  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round / 20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )
  return(drafts)
}
# Generate data
draft_data <- generate_draft_data()
print(sprintf("Generated %d draft picks from %d drafts",
nrow(draft_data), n_distinct(draft_data$year)))
# 1. Success Rate by Round
success_by_round <- draft_data %>%
group_by(round) %>%
summarize(
n_picks = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
total_war = sum(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
) %>%
filter(round <= 20) # Focus on first 20 rounds
print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))
# 2. Value Curve Estimation
value_curve <- draft_data %>%
group_by(overall_pick) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
) %>%
filter(overall_pick <= 300)
# Fit exponential decay model
value_model <- nls(
expected_war ~ a * exp(-b * overall_pick),
data = value_curve %>% filter(expected_war > 0),
start = list(a = 10, b = 0.01)
)
# Add fitted values
value_curve$fitted_war <- predict(
value_model,
newdata = data.frame(overall_pick = value_curve$overall_pick)
)
print("\nValue Curve Model:")
print(summary(value_model))
# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
geom_line(aes(y = fitted_war), color = "red", size = 1.2) +
geom_vline(xintercept = c(30, 60, 90),
linetype = "dashed", alpha = 0.3) +
annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
label = "Round 1", size = 3.5) +
annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
label = "Round 2", size = 3.5) +
labs(
title = "MLB Draft Pick Value Curve",
subtitle = "Expected career WAR by draft position",
x = "Overall Pick",
y = "Expected Career WAR",
caption = "Exponential decay model fitted to historical data"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11)
)
print(value_plot)
# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(position, round) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
.groups = "drop"
) %>%
group_by(position) %>%
summarize(
total_picks = sum(n),
avg_mlb_rate = mean(mlb_rate),
avg_war = mean(avg_war)
) %>%
arrange(desc(avg_war))
print("\nPosition-Specific Success Rates:")
print(position_analysis)
# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
filter(round <= 10) %>%
group_by(player_type) %>%
summarize(
n = n(),
mlb_rate = mean(reached_mlb),
avg_war = mean(war_if_mlb),
avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
)
print("\nCollege vs High School Performance:")
print(player_type_analysis)
# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
mutate(
war_per_million = war_if_mlb / (signing_bonus / 1000000),
pick_group = case_when(
overall_pick <= 30 ~ "Top 30",
overall_pick <= 60 ~ "31-60",
overall_pick <= 100 ~ "61-100",
TRUE ~ "100+"
)
) %>%
group_by(pick_group) %>%
summarize(
n = n(),
avg_bonus = mean(signing_bonus),
avg_war = mean(war_if_mlb),
war_per_million = mean(war_per_million)
)
print("\nReturn on Investment by Pick Range:")
print(roi_analysis)
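# Caveat: filtering to reached_mlb == 1 before computing WAR per dollar
# drops every bust, so later pick ranges look like better bargains than
# they are. A full ROI analysis would divide total WAR by bonuses paid
# to ALL picks in each range, busts included.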
# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
"""
Simple optimization: maximize expected WAR given bonus pool constraints
"""
# Get expected value for each pick
pick_values <- value_curve %>%
filter(overall_pick %in% available_picks) %>%
left_join(
draft_data %>%
group_by(overall_pick) %>%
summarize(avg_slot = mean(slot_value)),
by = "overall_pick"
)
# Greedy algorithm: pick highest value/cost ratio within budget
selected <- tibble()
remaining_budget <- budget
remaining_picks <- pick_values
  while (nrow(remaining_picks) > 0 && remaining_budget > 0) {
# Calculate value per dollar
remaining_picks <- remaining_picks %>%
mutate(value_per_dollar = expected_war / avg_slot)
# Select best value pick we can afford
best_pick <- remaining_picks %>%
filter(avg_slot <= remaining_budget) %>%
slice_max(value_per_dollar, n = 1)
if(nrow(best_pick) == 0) break
selected <- bind_rows(selected, best_pick)
remaining_budget <- remaining_budget - best_pick$avg_slot
remaining_picks <- remaining_picks %>%
filter(overall_pick != best_pick$overall_pick)
}
return(selected)
}
# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000
optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)
print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
select(overall_pick, expected_war, avg_slot, value_per_dollar))
# 8. Comprehensive Dashboard Visualization
library(patchwork)
library(scales)  # percent() and percent_format() used below
# Plot 1: Success rate by round
p1 <- success_by_round %>%
filter(round <= 10) %>%
ggplot(aes(x = round, y = mlb_rate)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
vjust = -0.5, size = 3) +
scale_y_continuous(labels = percent_format()) +
labs(title = "MLB Success Rate by Round",
x = "Draft Round", y = "% Reaching MLB") +
theme_minimal()
# Plot 2: WAR distribution
p2 <- draft_data %>%
filter(reached_mlb == 1, overall_pick <= 100) %>%
ggplot(aes(x = war_if_mlb)) +
geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
labs(title = "Career WAR Distribution (MLB Players)",
x = "Career WAR", y = "Count") +
theme_minimal()
# Plot 3: Position comparison
p3 <- draft_data %>%
filter(reached_mlb == 1, round <= 5) %>%
ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
geom_boxplot(alpha = 0.7) +
labs(title = "WAR by Position (Rounds 1-5)",
x = "Position", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")
# Plot 4: College vs HS
p4 <- draft_data %>%
filter(reached_mlb == 1, round <= 10) %>%
ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
labs(title = "College vs HS Performance",
x = "Player Type", y = "Career WAR") +
theme_minimal() +
theme(legend.position = "none")
# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
plot_annotation(
title = "MLB Draft Analysis Dashboard",
subtitle = "Historical performance metrics and value analysis",
theme = theme(plot.title = element_text(size = 16, face = "bold"))
)
print(combined)
# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")
cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf(" - First round produces %.1f%% of total draft WAR\n",
100 * first_round_war / total_war))
cat("\n2. SUCCESS RATES\n")
cat(sprintf(" - Round 1: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[1]))
cat(sprintf(" - Round 5: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[5]))
cat(sprintf(" - Round 10: %.1f%% reach MLB\n",
100 * success_by_round$mlb_rate[10]))
cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf(" - Average time to debut: %.1f years\n",
mean(draft_data$years_to_debut, na.rm = TRUE)))
cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat(" - Prioritize early picks; value drops exponentially\n")
cat(" - Consider college players for faster development\n")
cat(" - High school players have higher variance in outcomes\n")
cat(" - Pitchers dominate draft but consider positional scarcity\n")
cat(" - Later rounds: focus on high-ceiling, high-risk players\n")
cat("\n=== ANALYSIS COMPLETE ===\n")
```
**Portfolio Enhancements**:
- Add international signing analysis
- Compare team draft performance (see the sketch after this list)
- Analyze specific draft classes
- Include financial constraints modeling
- Compare to prospect ranking systems
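One of these enhancements, comparing team draft performance, is sketched below against real draft results. This assumes your installed pybaseball version exposes `amateur_draft_by_team` and that the returned frame carries `Name` and `WAR` columns; verify against the pybaseball documentation before building on it.
```python
# Team draft comparison sketch (the amateur_draft_by_team call and the
# column names are assumptions about pybaseball's API)
import pandas as pd
from pybaseball import amateur_draft_by_team

teams = ['TBD', 'HOU', 'LAD']          # sample teams to compare
records = []
for team in teams:
    for year in range(2008, 2013):     # older classes have mature careers
        picks = amateur_draft_by_team(team, year)
        records.append(picks.assign(team=team, year=year))

drafts = pd.concat(records, ignore_index=True)
# Rank teams by total career WAR produced by their picks
summary = (drafts.groupby('team')
                 .agg(n_picks=('Name', 'count'), total_war=('WAR', 'sum'))
                 .sort_values('total_war', ascending=False))
print(summary)
```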
Defensive Positioning and Shift Analysis
**Skills Demonstrated**: Spatial analysis, causal inference, strategic analysis, data visualization
**Implementation Framework**:
```python
# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches
# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)
# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
"""
Simulate batted ball locations and outcomes.
Coordinates in feet from home plate.
"""
np.random.seed(42)
data = []
for _ in range(n_balls):
# Batter handedness
stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])
# Shift decision (more common vs pull hitters)
is_shifter = np.random.random() < 0.3
shift_on = is_shifter and (np.random.random() < 0.7)
# Hit location (pull tendency varies)
if stand == 'R':
# Righties pull left
if is_shifter:
angle = np.random.normal(-25, 35) # Pull-heavy
else:
angle = np.random.normal(-10, 45) # Balanced
else:
# Lefties pull right
if is_shifter:
angle = np.random.normal(25, 35)
else:
angle = np.random.normal(10, 45)
# Distance based on exit velo and launch angle
exit_velo = np.random.normal(88, 8)
launch_angle = np.random.normal(12, 18)
# Simplified distance calculation
distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
distance = max(50, min(400, distance + np.random.normal(0, 20)))
# Convert to x, y coordinates
angle_rad = np.radians(angle)
x = distance * np.sin(angle_rad)
y = distance * np.cos(angle_rad)
# Hit outcome (shift effectiveness)
if shift_on:
# Shift reduces hits in pull direction
if stand == 'R' and x < -50:
prob_hit = 0.18 # Reduced by shift
elif stand == 'L' and x > 50:
prob_hit = 0.18
else:
prob_hit = 0.28 # Normal rate
else:
prob_hit = 0.25
        # Deeper balls are harder to field, so scale hit probability
        # upward with distance (capped at 0.95)
        prob_hit = min(0.95, prob_hit * (distance / 250))
is_hit = np.random.random() < prob_hit
data.append({
'x': x,
'y': y,
'distance': distance,
'angle': angle,
'exit_velo': exit_velo,
'launch_angle': launch_angle,
'stand': stand,
'shift_on': shift_on,
'is_hit': is_hit,
'is_shifter': is_shifter
})
return pd.DataFrame(data)
# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)
print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")
# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
'is_hit': ['mean', 'count'],
'exit_velo': 'mean'
}).round(3)
print("\nShift Effectiveness:")
print(shift_analysis)
# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
"""
Estimate runs saved by shifting.
"""
results = []
for stand in ['R', 'L']:
for shifter in [True, False]:
subset = data[(data['stand'] == stand) &
(data['is_shifter'] == shifter)]
if len(subset) == 0:
continue
shifted = subset[subset['shift_on'] == True]
no_shift = subset[subset['shift_on'] == False]
if len(shifted) > 0 and len(no_shift) > 0:
babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
# Approximate run value per hit prevented: ~0.5 runs
runs_saved_per_pa = babip_diff * 0.5
results.append({
'stand': stand,
'is_shifter': shifter,
'shifted_babip': shifted['is_hit'].mean(),
'no_shift_babip': no_shift['is_hit'].mean(),
'babip_diff': babip_diff,
'runs_saved_per_100pa': runs_saved_per_pa * 100,
'n_shifted': len(shifted),
'n_no_shift': len(no_shift)
})
return pd.DataFrame(results)
shift_value = calculate_shift_value(bb_data)
print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))
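# Quick extrapolation from the per-100-PA numbers above. The volume figure
# is hypothetical (simulation-scale), chosen only to show the arithmetic.
shifted_bip_per_season = 1200  # assumed team-season shifted balls in play
for _, row in shift_value[shift_value['is_shifter']].iterrows():
    season_runs = row['runs_saved_per_100pa'] * shifted_bip_per_season / 100
    print(f"{row['stand']}HB pull hitters: ~{season_runs:.1f} runs/season "
          f"at {shifted_bip_per_season} shifted balls in play")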
# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
"""
Plot baseball field with hit locations.
"""
if ax is None:
fig, ax = plt.subplots(figsize=(10, 10))
# Draw field outline
# Infield dirt
infield = patches.Wedge((0, 0), 95, 45, 135,
facecolor='tan', alpha=0.3)
ax.add_patch(infield)
# Outfield grass
outfield = patches.Wedge((0, 0), 400, 45, 135,
facecolor='green', alpha=0.1)
ax.add_patch(outfield)
# Foul lines
ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)
# Plot hits
hits = data[data['is_hit'] == True]
outs = data[data['is_hit'] == False]
ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
s=20, label='Out')
ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
s=30, label='Hit')
ax.set_xlim(-320, 320)
ax.set_ylim(0, 400)
ax.set_aspect('equal')
ax.set_xlabel('Distance from center (ft)', fontsize=11)
ax.set_ylabel('Distance from home (ft)', fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.2)
return ax
# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
(bb_data['is_shifter'] == True)]
plot_field_with_hits(
rhb_shifter[rhb_shifter['shift_on'] == False],
'RHB Pull Hitter - No Shift',
ax=ax1
)
plot_field_with_hits(
rhb_shifter[rhb_shifter['shift_on'] == True],
'RHB Pull Hitter - Shift On',
ax=ax2
)
plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")
# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
"""
Create BABIP heat map for given conditions.
"""
subset = data[(data['shift_on'] == shift_status) &
(data['stand'] == stand)]
# Create grid
x_bins = np.linspace(-250, 250, 25)
y_bins = np.linspace(50, 350, 20)
grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))
for i in range(len(y_bins)-1):
for j in range(len(x_bins)-1):
mask = ((subset['x'] >= x_bins[j]) &
(subset['x'] < x_bins[j+1]) &
(subset['y'] >= y_bins[i]) &
(subset['y'] < y_bins[i+1]))
cell_data = subset[mask]
if len(cell_data) >= 5: # Minimum sample
grid_babip[i, j] = cell_data['is_hit'].mean()
grid_count[i, j] = len(cell_data)
else:
grid_babip[i, j] = np.nan
return grid_babip, x_bins, y_bins, grid_count
# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
for i, stand in enumerate(['R', 'L']):
for j, shift_on in enumerate([False, True]):
ax = axes[i, j]
shifters = bb_data[bb_data['is_shifter'] == True]
grid, x_bins, y_bins, counts = create_babip_heatmap(
shifters, shift_on, stand
)
im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
y_bins[0], y_bins[-1]],
origin='lower', cmap='RdYlGn_r',
vmin=0, vmax=0.5, aspect='auto')
shift_text = "Shift On" if shift_on else "No Shift"
hand_text = "RHB" if stand == 'R' else "LHB"
ax.set_title(f'{hand_text} - {shift_text}',
fontsize=11, fontweight='bold')
ax.set_xlabel('Horizontal Position (ft)')
ax.set_ylabel('Distance from Home (ft)')
# Add colorbar
plt.colorbar(im, ax=ax, label='BABIP')
plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")
# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
# Prepare features for shift decision model
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)
X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # ROC-AUC needs scores, not hard labels
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
model.coef_[0]):
print(f" {feat}: {coef:.3f}")
# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)
print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
print(f" {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")
print("\n2. WHEN TO SHIFT")
print(" - Strong pull tendency (>70% pull rate)")
print(" - Ground ball hitters (LA < 10°)")
print(" - Extreme pull hitters benefit most from aggressive shifts")
print("\n3. SHIFT VARIATIONS")
print(" - Full shift: 3 infielders on pull side")
print(" - Partial shift: 2.5 infielders pull side")
print(" - No shift: Traditional alignment")
print(" - Decision should consider:")
print(" * Batter's spray chart")
print(" * Game situation (runners, outs)")
print(" * Pitcher's ground ball rate")
print("\n4. LIMITATIONS & CONSIDERATIONS")
print(" - Shift beaten by opposite field hits")
print(" - Bunt defense vulnerabilities")
print(" - Runner advancement opportunities")
print(" - Pitcher-specific adjustments")
print("\n5. FUTURE ANALYSIS")
print(" - Pitcher-specific positioning")
print(" - Count-based positioning adjustments")
print(" - Outfield positioning optimization")
print(" - Real-time adjustment algorithms")
print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
```
**Portfolio Development Tips**:
- Use real Statcast spray chart data when possible (see the sketch after this list)
- Incorporate expected outcomes (xBA, xwOBA)
- Add video analysis component
- Compare to MLB team shift strategies
- Analyze shift effectiveness by ballpark
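The first tip is sketched below using pybaseball's Statcast functions. The coordinate conversion uses the constants the analytics community commonly cites for turning `hc_x`/`hc_y` into approximate feet from home plate; treat the exact values, and the example player, as assumptions.
```python
# Real spray chart sketch: swap the simulated generator above for
# Statcast batted balls from one hitter's season
from pybaseball import playerid_lookup, statcast_batter

pid = playerid_lookup('arraez', 'luis')['key_mlbam'].iloc[0]  # example hitter
raw = statcast_batter('2023-04-01', '2023-10-01', pid)
bip = raw.dropna(subset=['hc_x', 'hc_y']).copy()

# Commonly used approximate conversion to feet from home plate
bip['x'] = 2.5 * (bip['hc_x'] - 125.42)
bip['y'] = 2.5 * (198.27 - bip['hc_y'])
bip['is_hit'] = bip['events'].isin(['single', 'double', 'triple', 'home_run'])

# The plotting helper defined earlier reuses these columns directly
plot_field_with_hits(bip, 'Real Statcast Spray Chart')
```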
---