Chapter 24: Front Office & Analytics Career Guide

The path to a career in baseball analytics has evolved dramatically over the past two decades. What was once a field dominated by traditional scouts and baseball lifers has transformed into a multidisciplinary profession that values quantitative skills, programming ability, and innovative thinking alongside baseball knowledge. This chapter provides a comprehensive guide to building a career in baseball analytics, from developing your skillset to navigating the interview process.


24.1 Careers in Baseball Analytics

The Modern Baseball Analytics Landscape

Baseball front offices today employ diverse teams of analysts with varying specializations. Understanding the different roles and career paths available is essential for targeting your professional development.

Entry-Level Positions

  • Baseball Analytics Intern: Seasonal or year-round internships focusing on specific projects, data collection, or analytical support
  • Baseball Operations Assistant: Administrative support with analytical components, often involving data management and report generation
  • Quantitative Analyst: Entry-level position focusing on statistical modeling and data analysis
  • Video Coordinator: Collecting and organizing video footage, often with analytical tagging responsibilities

Mid-Level Positions

  • Baseball Analytics Analyst: Core analytical work including model development, research projects, and decision support
  • Quantitative Researcher: Advanced statistical modeling, machine learning, and predictive analytics
  • Pro Scouting Analyst: Combining traditional scouting with quantitative analysis of major league talent
  • Amateur Scouting Analyst: Supporting draft and international signing decisions with data-driven insights
  • Player Development Analyst: Working with minor league affiliates on player progression and development strategies
  • Research & Development Analyst: Exploring new analytical methods, technologies, and data sources

Senior-Level Positions

  • Senior Analyst/Lead Analyst: Managing analytical projects and mentoring junior staff
  • Director of Baseball Analytics: Overseeing the analytics department and setting analytical strategy
  • Director of Research & Development: Leading innovation in analytical methods and technology
  • Vice President of Baseball Analytics: Executive-level position with strategic influence on baseball operations
  • Assistant General Manager (Analytics): Bridge role between analytics and traditional baseball operations
  • General Manager: Ultimate decision-making authority, increasingly with analytical backgrounds

Related Career Paths

Baseball analytics skills translate to numerous adjacent roles:

  • Biomechanics Researcher: Analyzing motion capture data to optimize player performance and reduce injuries
  • Sports Technology Developer: Building tools and platforms for data collection and analysis
  • Independent Consultant: Providing analytical services to multiple organizations
  • Media Analyst: Bringing analytical insights to broadcast, print, or digital media
  • Betting Analyst: Applying baseball analysis to sports gambling markets
  • Academic Researcher: Studying baseball from statistical, economic, or social science perspectives

Organizational Structures

Different organizations structure their analytics departments in various ways:

Centralized Model: A single analytics department supports all baseball operations functions (player development, amateur scouting, pro scouting, major league operations).

Embedded Model: Analysts are embedded within specific departments (e.g., a pro scouting analyst who works primarily with the pro scouting director).

Hybrid Model: A core R&D team handles long-term research while embedded analysts support day-to-day operations.

Understanding an organization's structure helps you identify where you might fit and what collaboration patterns to expect.

Compensation and Market Dynamics

Baseball analytics salaries vary widely based on experience, role, and organization:

  • Entry-level positions: $40,000-$65,000
  • Mid-level analysts: $65,000-$100,000
  • Senior analysts/Directors: $100,000-$200,000
  • Executive positions: $200,000+

Important considerations:


  • Baseball salaries often lag behind tech industry compensation for similar skills

  • Smaller market teams may pay less than large market teams

  • Passion for baseball is expected but shouldn't justify exploitative compensation

  • Benefits packages vary significantly by organization

  • Many entry-level positions are seasonal or contract-based initially


24.2 Building Your Analytics Portfolio

Your portfolio is your most important asset in the job search process. It demonstrates your technical skills, baseball knowledge, and ability to communicate insights.

Portfolio Principles

Quality Over Quantity: Three excellent projects are better than ten mediocre ones. Each project should showcase different skills and demonstrate depth of analysis.

Tell a Story: Every project should have a clear narrative arc: motivation, methodology, findings, and implications. Write for an intelligent audience that may not have your technical background.

Show Your Work: Include code, methodologies, and limitations. Transparency about your approach demonstrates intellectual honesty and technical competence.

Make It Accessible: Host your portfolio on GitHub Pages, a personal website, or a platform like RPubs. Ensure projects are easy to navigate and professionally presented.

Project Categories

1. Predictive Modeling Projects

Build models that forecast player performance, injury risk, or game outcomes. These projects demonstrate statistical modeling skills and understanding of baseball dynamics.

Example topics:


  • Pitcher performance projection using pitch-level data

  • Rookie success prediction using minor league statistics

  • Win probability models incorporating game situations

  • Injury risk models using workload and biomechanical data

2. Player Evaluation Projects

Develop new metrics or frameworks for evaluating player value. These projects show creativity and deep baseball knowledge.

Example topics:


  • Defensive value metrics using batted ball data

  • Catcher framing adjusted for umpire tendencies

  • Base running value beyond traditional stolen bases

  • Clutch performance analysis controlling for context

3. Strategic Analysis Projects

Analyze team strategies and their effectiveness. These projects demonstrate tactical understanding and causal inference skills.

Example topics:


  • Optimal batting order construction

  • Bullpen management and pitcher usage patterns

  • Defensive positioning strategies

  • Draft strategy and value optimization

4. Data Visualization Projects

Create compelling visualizations that communicate complex insights. These projects showcase data communication skills.

Example topics:


  • Interactive pitch movement charts

  • Player aging curves across different skills

  • Trade deadline activity networks

  • Spray chart analysis with context

Technical Implementation

Your portfolio should demonstrate proficiency in industry-standard tools. Below are templates for common portfolio projects.

Project Structure Template

project-name/
├── README.md
├── data/
│   ├── raw/
│   └── processed/
├── code/
│   ├── 01-data-collection.R
│   ├── 02-data-cleaning.R
│   ├── 03-exploratory-analysis.R
│   ├── 04-modeling.R
│   └── 05-visualization.R
├── output/
│   ├── figures/
│   └── tables/
└── report.Rmd or report.ipynb

Writing About Your Work

Strong portfolio projects include written analysis that demonstrates:

  • Clear Problem Statement: What question are you answering and why does it matter?
  • Methodology Explanation: How did you approach the problem? What analytical choices did you make?
  • Results Interpretation: What did you find? What are the practical implications?
  • Limitations Discussion: What are the constraints and caveats of your analysis?
  • Future Directions: What would you do with more time, data, or resources?

Use section headers, bullet points, and visualizations to make your writing scannable. Avoid jargon unless necessary, and explain technical concepts clearly.


24.3 Essential Technical Skills

Success in baseball analytics requires a diverse technical skillset. While no one is an expert in everything, you should develop proficiency across several key areas.

Programming Languages

R

R remains the dominant language in baseball analytics, thanks to its statistical capabilities and baseball-specific packages; a short example follows the skills list below.

Essential R skills:


  • Data manipulation with dplyr and tidyr

  • Visualization with ggplot2

  • Statistical modeling with base R, caret, and tidymodels

  • Report generation with R Markdown

  • Working with baseball data using baseballr and other packages
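
To make this concrete, here is a minimal sketch of the dplyr workflow: computing qualified on-base percentage leaders for a season. It assumes the Lahman package is installed; any season-level batting table with the same columns would work.

# Minimal sketch: qualified OBP leaders with dplyr (assumes the Lahman package)
library(dplyr)
library(Lahman)

Batting %>%
  filter(yearID == 2019) %>%
  group_by(playerID) %>%
  summarize(
    PA  = sum(AB + BB + HBP + SF, na.rm = TRUE),
    OBP = sum(H + BB + HBP, na.rm = TRUE) / PA,
    .groups = "drop"
  ) %>%
  filter(PA >= 400) %>%   # rough qualification threshold
  arrange(desc(OBP)) %>%
  head(10)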

Python

Python is increasingly popular, especially for machine learning and data engineering tasks.

Essential Python skills:


  • Data manipulation with pandas and numpy

  • Visualization with matplotlib and seaborn

  • Machine learning with scikit-learn

  • Web scraping with BeautifulSoup and scrapy

  • Notebook-based analysis with Jupyter

SQL

Database querying is essential for working with large datasets; a window-function sketch follows the list below.

Essential SQL skills:


  • SELECT, JOIN, GROUP BY operations

  • Window functions for time-series analysis

  • CTEs (Common Table Expressions) for complex queries

  • Query optimization for performance

  • Database design fundamentals
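
Window functions come up constantly in interviews. Below is a minimal sketch that runs one from R against an in-memory SQLite database (recent RSQLite versions bundle a SQLite new enough for window functions); the pitches table and its columns are made up for illustration.

# Minimal sketch: rolling 3-game velocity average via a SQL window function
# (illustrative schema; in-memory SQLite database)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "pitches", data.frame(
  pitcher_id = 1L,
  game_no    = 1:10,
  avg_velo   = c(94.1, 94.3, 93.8, 94.0, 93.5, 93.9, 93.2, 93.6, 93.0, 93.4)
))

print(dbGetQuery(con, "
  SELECT game_no,
         avg_velo,
         AVG(avg_velo) OVER (
           PARTITION BY pitcher_id
           ORDER BY game_no
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
         ) AS rolling_3g_velo
  FROM pitches
"))

dbDisconnect(con)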

Statistical Methods

Foundational Statistics


  • Probability distributions and hypothesis testing

  • Regression analysis (linear, logistic, Poisson)

  • Time series analysis

  • Bayesian inference

  • Experimental design and A/B testing

Advanced Methods


  • Machine learning (random forests, gradient boosting, neural networks)

  • Survival analysis for injury and career longevity

  • Hierarchical/mixed-effects models for player evaluation (sketched after this list)

  • Causal inference methods

  • Optimization algorithms
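
As one example from this list, here is a hedged sketch of a mixed-effects model: a random intercept per player separates individual talent from noise, shrinking small-sample estimates toward the league mean. The data are simulated and lme4 is an assumption; any multilevel modeling package would do.

# Minimal sketch: random-intercept model for player wOBA (simulated data; assumes lme4)
library(lme4)

set.seed(1)
n_players <- 50
seasons <- 4
df <- data.frame(
  player = factor(rep(seq_len(n_players), each = seasons)),
  age = rep(sample(23:33, n_players, replace = TRUE), each = seasons) + 0:(seasons - 1)
)
true_talent <- rnorm(n_players, 0.320, 0.025)  # latent player quality
df$woba <- true_talent[df$player] - 0.002 * (df$age - 28) +
  rnorm(nrow(df), 0, 0.015)

# Partial pooling: each player's intercept is shrunk toward the overall mean
fit <- lmer(woba ~ age + (1 | player), data = df)
summary(fit)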

Domain Knowledge

Technical skills must be paired with baseball expertise:

Baseball Operations Understanding


  • Roster construction, payroll budgeting, and luxury-tax (competitive balance tax) management

  • Draft and international signing processes

  • Player development systems

  • Arbitration and free agency mechanics

  • Collective Bargaining Agreement provisions

On-Field Baseball Knowledge


  • Game situations and strategic decisions

  • Player roles and positional value

  • Pitch types and sequencing

  • Defensive alignments and shifts

  • Base running and situational hitting

Historical Context


  • Evolution of the game and rule changes

  • Historical trends in player performance

  • Park effects and era adjustments

  • Notable players and teams for benchmarking

Data Sources and Tools

Public Data Sources


  • Baseball Reference and FanGraphs for player statistics

  • MLB.com Statcast for pitch and batted ball data

  • Baseball Savant for advanced Statcast metrics

  • Retrosheet for historical play-by-play data

  • Brooks Baseball for pitch-level data

Development Tools


  • Version control with Git and GitHub

  • Integrated Development Environments (RStudio, PyCharm, VS Code)

  • Jupyter notebooks for interactive analysis

  • Docker for reproducible environments

  • Cloud platforms (AWS, Google Cloud) for large-scale computation


24.4 The Job Application Process

Breaking into baseball analytics is highly competitive. A strategic approach to the application process is essential.

Finding Opportunities

Official Channels


  • MLB's official job board (mlb.com/jobs)

  • Individual team websites (check careers or front office sections)

  • LinkedIn job postings

  • Baseball Prospectus job board

  • Baseball America postings

Networking


  • SABR Analytics Conference

  • MIT Sloan Sports Analytics Conference

  • Twitter/X connections with analytics professionals

  • Alumni networks from your university

  • Local baseball analytics meetups

Proactive Outreach


  • Cold emails to teams (respectful, concise, value-focused)

  • Informational interviews

  • Contributing to public baseball research

  • Building relationships over time

Crafting Your Application

Resume Best Practices

Focus on relevant experience and quantifiable achievements:

BASEBALL ANALYTICS EXPERIENCE

Research Analyst | University Baseball Analytics Lab | 2023-2024
- Developed pitch selection model using Statcast data, identifying 12% improvement
  opportunity in changeup usage for fastball-heavy pitchers
- Created automated scouting reports combining video and statistical analysis,
  reducing report generation time by 60%
- Presented findings at regional SABR conference

TECHNICAL PROJECTS

Pitcher Performance Projection System
- Built ensemble model combining stuff metrics and command indicators to project
  ERA with 15% lower RMSE than baseline FIP projections
- Technologies: R, Python (scikit-learn), SQL, Git
- Results published on personal blog with 5,000+ views

TECHNICAL SKILLS

Programming: R (advanced), Python (intermediate), SQL (intermediate)
Statistical Methods: Regression, mixed-effects models, machine learning, time series
Tools: Git, RStudio, Jupyter, Tableau, Docker
Baseball Data: Statcast, PITCHf/x, Retrosheet, FanGraphs

Cover Letter Strategy

Your cover letter should:


  1. Show you understand the organization's analytical philosophy

  2. Connect your experience to their specific needs

  3. Demonstrate passion backed by substantive knowledge

  4. Be concise (one page maximum)

Template structure:


  • Opening: Why this team/role specifically

  • Body: 2-3 relevant experiences or projects with outcomes

  • Connection: How your skills address their needs

  • Closing: Clear call to action

Portfolio Presentation

Include a portfolio link prominently in your resume and cover letter. Ensure the landing page is polished and your best work is immediately visible.

Consider creating a one-page portfolio summary:

ANALYTICS PORTFOLIO
github.com/username | website.com | email@example.com

Featured Projects:

1. Minor League Pitcher Development Analysis
   Identified mechanical adjustments that correlate with MLB success
   Tools: R, Statcast, video analysis | Link: [URL]

2. Defensive Shift Optimization Framework
   Game theory approach to positioning strategy
   Tools: Python, optimization algorithms | Link: [URL]

3. Draft Value Model
   Historical analysis of draft pick value by round and position
   Tools: R, Bayesian modeling | Link: [URL]

Application Timing

Internship Cycles


  • Baseball Operations internships: Applications typically open September-November for following season

  • Analytics-specific internships: Some teams hire year-round

  • Summer internships: Apply in fall/winter

Full-Time Positions


  • Limited predictability; positions open when needs arise

  • More opportunities after season ends (October-November)

  • Some hiring around winter meetings (December)

Strategy


  • Apply early in application windows

  • Follow up appropriately (one polite email after 2-3 weeks)

  • Continue building your portfolio while waiting

  • Apply to multiple teams; this is a numbers game


24.5 Interview Preparation & Case Studies

Baseball analytics interviews typically involve multiple stages and diverse evaluation methods.

Interview Stages

1. Phone/Video Screening (30-45 minutes)


  • Background and interest in baseball analytics

  • Technical skills overview

  • Behavioral questions

  • Basic baseball knowledge assessment

2. Technical Assessment (1-4 hours)


  • Take-home project analyzing baseball data

  • SQL or programming challenges

  • Statistical reasoning problems

  • Time-constrained analysis task

3. On-Site/Virtual Interview (3-6 hours)


  • Multiple one-on-one interviews

  • Presentation of take-home project

  • Case study analysis

  • Conversations with various department members

4. Final Interview (1-2 hours)


  • Meet senior leadership

  • Culture fit assessment

  • Negotiation discussion

Technical Interview Topics

Data Analysis Questions

Example: "Given a dataset of pitch-by-pitch data, how would you evaluate whether a pitcher's performance has improved this season?"

Approach:


  1. Clarify the question (What metrics define improvement? Over what timeframe?)

  2. Discuss confounding factors (opponent quality, injury, role changes)

  3. Propose multiple analytical approaches (one is sketched after this list)

  4. Consider limitations and alternative explanations

  5. Discuss actionable implications
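
As an illustration of step 3, here is one minimal sketch on simulated data: compare per-batted-ball xwOBA allowed across season halves. A real analysis would adjust for opponent quality, parks, and pitch mix before testing anything.

# Minimal sketch: season-half comparison of xwOBA allowed (simulated values)
set.seed(7)
first_half  <- rnorm(300, mean = 0.330, sd = 0.120)  # xwOBA per batted ball, Apr-Jun
second_half <- rnorm(280, mean = 0.305, sd = 0.120)  # Jul-Sep

t.test(second_half, first_half)  # difference in means with a confidence interval
# A significant drop is suggestive, not conclusive: check whether stuff metrics,
# pitch mix, or opponent quality moved at the same time.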

Statistical Reasoning

Example: "A rookie has a .350 batting average through his first 20 games. How much weight should we give this performance?"

Approach:


  • Discuss sample size and regression to the mean

  • Reference prior expectations (minor league performance, scouting reports)

  • Calculate confidence intervals

  • Compare to historical rookie performance

  • Apply a Bayesian updating framework (see the sketch after this list)
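
To make the last point concrete, here is a hedged sketch of beta-binomial updating: hits are treated as binomial, and the prior is a Beta distribution centered on a plausible pre-season expectation. The prior strength (roughly 300 at-bats of information) is illustrative, not calibrated.

# Minimal sketch: beta-binomial updating for a .350 average through 20 games
prior_alpha <- 78    # 0.260 expectation * 300 AB of prior weight (illustrative)
prior_beta  <- 222   # (1 - 0.260) * 300

hits <- 28; at_bats <- 80   # roughly .350 over 20 games (assumed totals)
post_alpha <- prior_alpha + hits
post_beta  <- prior_beta + (at_bats - hits)

post_mean <- post_alpha / (post_alpha + post_beta)
cred_int  <- qbeta(c(0.025, 0.975), post_alpha, post_beta)

cat(sprintf("Posterior mean AVG: %.3f\n", post_mean))
cat(sprintf("95%% credible interval: %.3f-%.3f\n", cred_int[1], cred_int[2]))
# The posterior sits much closer to the prior than to .350:
# 80 at-bats move the needle, but not far.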

Programming Challenges

Example: "Write a function that calculates wOBA from a dataset of plate appearances."

def calculate_woba(df):
    """
    Calculate wOBA from plate appearance data.

    Parameters:
    df (pd.DataFrame): DataFrame with columns for each outcome

    Returns:
    float: Weighted on-base average
    """
    # wOBA weights (2024 season approximate values)
    weights = {
        'uBB': 0.69,  # Unintentional walk
        'HBP': 0.72,  # Hit by pitch
        '1B': 0.88,   # Single
        '2B': 1.24,   # Double
        '3B': 1.56,   # Triple
        'HR': 1.95    # Home run
    }

    # Calculate weighted sum of positive outcomes
    weighted_sum = (
        df['uBB'] * weights['uBB'] +
        df['HBP'] * weights['HBP'] +
        df['1B'] * weights['1B'] +
        df['2B'] * weights['2B'] +
        df['3B'] * weights['3B'] +
        df['HR'] * weights['HR']
    )

    # wOBA denominator: AB + uBB + SF + HBP (IBB and sacrifice bunts excluded)
    pa = (df['AB'] + df['uBB'] + df['HBP'] + df['SF']).sum()

    return weighted_sum.sum() / pa if pa > 0 else 0.0

# Example usage
import pandas as pd

player_data = pd.DataFrame({
    'AB': [500],
    'uBB': [60],
    'HBP': [5],
    'SF': [3],
    '1B': [100],
    '2B': [30],
    '3B': [3],
    'HR': [25]
})

woba = calculate_woba(player_data)
print(f"Player wOBA: {woba:.3f}")

Case Study Examples

Case Study 1: Trade Evaluation

Scenario: "Your team is considering trading for a starting pitcher. You have access to their Statcast data, injury history, and contract details. Walk me through your evaluation process."

Framework:


  1. Performance Analysis
  • Recent performance trends (ERA, FIP, xFIP, xERA)
  • Pitch mix and effectiveness by pitch type
  • Command metrics (zone rate, chase rate)
  • Platoon splits and context-dependent performance
  2. Sustainability Assessment
  • Statcast quality metrics (velocity, spin, movement)
  • Batted ball profile (exit velocity, launch angle)
  • Sequencing and pitch usage patterns
  • Park effects and league adjustment
  3. Risk Factors
  • Injury history and current health status
  • Workload concerns (innings pitched, pitch count trends)
  • Age and aging curves for similar pitchers
  • Contract value and team control
  4. Contextual Fit
  • How the pitcher fits team needs
  • Home park effects
  • Defensive support considerations
  • Roster construction implications
  5. Cost-Benefit Analysis
  • Expected performance in team context
  • Comparison to alternatives (free agency, internal options)
  • Prospect cost evaluation
  • Financial implications

Case Study 2: Player Development Decision

Scenario: "A promising minor league hitter is struggling. His batting average has dropped, but his walk rate has increased. What additional information would you want, and how would you analyze whether this is concerning?"

Analysis approach:

# Simulated minor league hitter data analysis

library(tidyverse)
library(ggplot2)

# Create sample data
player_data <- tibble(
  month = rep(c("April", "May", "June", "July"), each = 25),
  game = rep(1:25, 4),
  AB = rpois(100, 4),
  H = rpois(100, 1.2),
  BB = rpois(100, 0.8),
  K = rpois(100, 1.3),
  HR = rpois(100, 0.3),
  exit_velo = rnorm(100, 89, 3),
  launch_angle = rnorm(100, 12, 8),
  hard_hit_rate = runif(100, 0.3, 0.5)
)

# Calculate key metrics
player_summary <- player_data %>%
  mutate(month = factor(month, levels = c("April", "May", "June", "July"))) %>%
  group_by(month) %>%
  summarize(
    AVG = sum(H) / sum(AB),
    OBP = sum(H + BB) / sum(AB + BB),        # simplified: HBP/SF not simulated
    BB_rate = sum(BB) / sum(AB + BB),
    K_rate = sum(K) / sum(AB + BB),          # per PA (approximated as AB + BB)
    ISO = sum(HR * 3 + H) / sum(AB) - AVG,   # treats all non-HR hits as singles
    avg_exit_velo = mean(exit_velo),
    avg_launch_angle = mean(launch_angle),
    hard_hit_rate = mean(hard_hit_rate)
  )

# Visualize trends
ggplot(player_summary, aes(x = month, group = 1)) +
  geom_line(aes(y = AVG, color = "AVG"), linewidth = 1.2) +
  geom_line(aes(y = OBP, color = "OBP"), linewidth = 1.2) +
  geom_point(aes(y = AVG, color = "AVG"), size = 3) +
  geom_point(aes(y = OBP, color = "OBP"), size = 3) +
  scale_color_manual(values = c("AVG" = "red", "OBP" = "blue")) +
  labs(
    title = "Batting Average vs. On-Base Percentage Trends",
    subtitle = "Declining AVG with stable OBP suggests improved plate discipline",
    x = "Month",
    y = "Rate",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "top")

# Quality of contact analysis
print("Quality of Contact Metrics:")
print(player_summary %>%
        select(month, avg_exit_velo, hard_hit_rate))

# Interpretation framework
cat("\nAnalysis Framework:\n")
cat("1. Batting average decline with walk rate increase may indicate:\n")
cat("   - Improved pitch recognition and discipline\n")
cat("   - Potential BABIP luck (check if batted ball quality maintained)\n")
cat("   - Adjustment to higher level of pitching\n\n")
cat("2. Key questions to investigate:\n")
cat("   - Is exit velocity and hard contact rate stable?\n")
cat("   - Has BABIP declined significantly?\n")
cat("   - What pitch types are causing issues?\n")
cat("   - Are strikeout rates acceptable?\n")
cat("   - How do metrics compare to league averages at this level?\n\n")
cat("3. Recommendation depends on:\n")
cat("   - If quality of contact maintained: positive sign, continue development\n")
cat("   - If quality declining: identify mechanical issues, consider adjustment period\n")
cat("   - Context of league difficulty and age relative to level\n")

Case Study 3: Strategic Decision

Scenario: "It's the 7th inning of a playoff game. Your team is down by 1 run with a runner on first and no outs. Your best hitter is at the plate. Should you bunt?"

Analysis framework:


  1. Calculate win expectancy with and without the bunt (a simplified run-expectancy sketch follows this list)

  2. Consider hitter's ability and pitcher's weakness

  3. Account for playoff leverage

  4. Evaluate defense and base runner speed

  5. Consider bullpen depth and future innings
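
Even before full win expectancy, a run-expectancy comparison shows the shape of the trade-off. The sketch below uses rough historical base-out values (illustrative, not from a specific season) and assumes the bunt always succeeds, which flatters the bunt.

# Minimal sketch: what a successful sacrifice does to run expectancy
# (approximate historical values; a real answer uses win expectancy and
# accounts for bunt failure, the hitter at the plate, and game state)
re <- c(first_0out = 0.87, second_1out = 0.66)  # expected runs, rest of inning
p1 <- c(first_0out = 0.42, second_1out = 0.40)  # P(scoring at least one run)

cat(sprintf("Run expectancy change: %+.2f runs\n",
            re[["second_1out"]] - re[["first_0out"]]))
cat(sprintf("P(>=1 run) change:     %+.2f\n",
            p1[["second_1out"]] - p1[["first_0out"]]))
# Both drop even when the bunt works, and the team's best hitter is up,
# so the default answer is no -- though defense, runner speed, and the
# on-deck hitter can shift the call.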

Behavioral Interview Questions

Common questions and approaches:

"Tell me about a time you made a mistake in your analysis."


  • Be honest about the error

  • Explain how you discovered it

  • Describe what you learned

  • Show how you prevent similar mistakes now

"How do you handle disagreement with colleagues?"


  • Emphasize collaboration and data-driven discussion

  • Show respect for different perspectives

  • Demonstrate ability to find common ground

  • Know when to advocate and when to defer

"Describe a complex analysis you made accessible to a non-technical audience."


  • Use specific example

  • Explain your communication strategy

  • Discuss challenges and how you addressed them

  • Show outcome and impact

Questions to Ask Interviewers

Asking thoughtful questions demonstrates genuine interest and helps you evaluate fit:

About the Role


  • "What does a typical project lifecycle look like from question to implementation?"

  • "How do analysts collaborate with coaches, scouts, and baseball operations staff?"

  • "What are the biggest analytical challenges the organization currently faces?"

About the Team


  • "How is the analytics department structured?"

  • "What backgrounds do team members come from?"

  • "How has the team's role evolved over the past few years?"

About Development


  • "What opportunities exist for professional development and learning?"

  • "How do you evaluate success in this role?"

  • "What does career progression look like?"

About Culture


  • "How does the organization balance analytics and traditional scouting?"

  • "Can you describe the decision-making process for major roster moves?"

  • "What's the communication style between analytics and leadership?"


24.6 Resources & Continued Learning

Baseball analytics is a continuously evolving field. Committing to ongoing learning is essential for career growth.

Books

Foundational Baseball Analytics


  • The Book: Playing the Percentages in Baseball by Tom Tango, Mitchel Lichtman, and Andrew Dolphin

  • Analyzing Baseball Data with R by Max Marchi and Jim Albert

  • The MVP Machine by Ben Lindbergh and Travis Sawchik

  • Smart Baseball by Keith Law

Statistical Methods


  • An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani

  • Statistical Rethinking by Richard McElreath

  • Regression and Other Stories by Gelman, Hill, and Vehtari

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

Programming


  • R for Data Science by Hadley Wickham and Garrett Grolemund

  • Python for Data Analysis by Wes McKinney

  • Advanced R by Hadley Wickham

Online Courses

Data Science Fundamentals


  • Coursera: Data Science Specialization (Johns Hopkins)

  • edX: MicroMasters in Statistics and Data Science (MIT)

  • DataCamp: Data Science Career Track

  • Udacity: Data Analyst Nanodegree

Baseball-Specific


  • Baseball Savant: Introduction to Statcast

  • FanGraphs: Library and research archives

  • Baseball Prospectus: Statistical primer

  • YouTube: Various channels covering baseball analytics

Communities and Conferences

Online Communities


  • Twitter/X: Follow baseball analysts, teams, and writers

  • Reddit: r/Sabermetrics community

  • Discord: Baseball analytics servers

  • SABR: Society for American Baseball Research membership

Annual Conferences


  • SABR Analytics Conference (March, Phoenix)

  • MIT Sloan Sports Analytics Conference (March, Boston)

  • Baseball Prospectus events

  • Regional SABR chapter meetings

Publications and Blogs

Regular Reading


  • FanGraphs.com

  • Baseball Prospectus

  • The Athletic (baseball coverage)

  • MLB.com analysis section

  • Baseball Savant blog

Individual Analysts


  • Follow prominent analysts on personal blogs and Twitter

  • Read historical sabermetric research

  • Study modern analytical innovations

Building Your Network

Strategies


  • Engage thoughtfully on social media

  • Attend conferences and events

  • Contribute to public baseball research

  • Collaborate on open-source projects

  • Write and share your own analysis

  • Participate in forecasting competitions

Professional Organizations


  • SABR membership

  • American Statistical Association

  • Society for Sports Analytics Research

  • University alumni networks

Staying Current

Weekly Habits


  • Read 3-5 analytical articles

  • Follow game recaps with analytical focus

  • Practice coding challenges

  • Review new research and methodologies

Monthly Habits


  • Complete a small analytical project

  • Read academic papers on sports analytics

  • Contribute to open-source projects

  • Network with one new person

Annual Habits


  • Attend at least one conference

  • Complete a major portfolio project

  • Update resume and portfolio

  • Reflect on skill development and set new goals


24.7 Exercises

These exercises are designed to build portfolio-worthy projects that demonstrate key skills for baseball analytics roles.

Exercise 24.1: Pitcher Arsenal Analysis and Optimization

Objective: Analyze a pitcher's repertoire using Statcast data and provide recommendations for pitch usage optimization.

Skills Demonstrated: Data acquisition, exploratory analysis, visualization, strategic thinking

Project Steps:

  1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
  2. Analyze pitch characteristics (velocity, movement, spin)
  3. Evaluate pitch effectiveness by count and situation
  4. Identify optimization opportunities
  5. Create compelling visualizations
  6. Write executive summary with recommendations

R Implementation:

# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns

library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)

# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data

  set.seed(123)
  n_pitches <- 2500

  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                         rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}

# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")

# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    n = n(),
    pct = n() / nrow(pitcher_data),
    avg_velo = mean(release_speed, na.rm = TRUE),
    avg_spin = mean(release_spin_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(n))

print("Pitch Mix:")
print(pitch_mix)

# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    usage = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
                    na.rm = TRUE),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    # crude proxy: a true chase rate needs pitch-location (zone) data
    chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
  ) %>%
  arrange(desc(csw_rate))

print("\nPitch Effectiveness:")
print(pitch_effectiveness)

# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
  mutate(count = paste0(balls, "-", strikes)) %>%
  group_by(count, pitch_type) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(count) %>%
  mutate(usage_pct = n / sum(n)) %>%
  arrange(count, desc(usage_pct))

# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
  group_by(pitch_type, stand) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = stand,
    values_from = c(n, whiff_rate, avg_xwoba),
    names_sep = "_"
  )

print("\nPlatoon Splits:")
print(platoon_splits)

# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
  "FF" = "#d22d49", "SI" = "#FE9D00",
  "SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)

movement_plot <- ggplot(pitcher_data,
                        aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_ellipse(level = 0.75, linewidth = 1.2) +
  scale_color_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                              "SL" = "Slider", "CH" = "Changeup",
                              "CU" = "Curveball")) +
  labs(
    title = "Pitch Movement Profile",
    subtitle = "Catcher's perspective (RHP)",
    x = "Horizontal Break (inches)",
    y = "Induced Vertical Break (inches)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  coord_fixed()

# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
  ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                              "SL" = "Slider", "CH" = "Changeup",
                              "CU" = "Curveball")) +
  labs(
    title = "Velocity vs. Spin Rate",
    x = "Release Speed (mph)",
    y = "Spin Rate (rpm)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
  filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
  ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = pitch_colors,
                   labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                             "SL" = "Slider", "CH" = "Changeup",
                             "CU" = "Curveball")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Pitch Usage by Count",
    x = "Count",
    y = "Usage %",
    fill = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
  plot_annotation(
    title = "Comprehensive Pitcher Arsenal Analysis",
    subtitle = "Example Pitcher - 2024 Season",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined_plot)

# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")

  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)

  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")

  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)

  if(nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for(i in 1:nrow(underused)) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }

  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)

  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")

  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}

generate_recommendations(pitcher_data, pitch_effectiveness)

# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")

Portfolio Presentation Tips:


  • Include interactive visualizations (consider using plotly)

  • Compare pitcher to league averages

  • Add context about pitcher role and team strategy

  • Discuss limitations (sample size, park factors, etc.)

  • Provide actionable recommendations

Exercise 24.2: Player Aging Curves and Performance Projection

Objective: Build aging curves for different player skills and create a performance projection system.

Skills Demonstrated: Statistical modeling, time series analysis, predictive analytics, data visualization

Python Implementation:

# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(0,
                                      years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                           curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (oldest -> newest gets 1, 2, 3,
        # so the most recent season carries the most weight)
        n_recent = min(len(player_history), 3)
        weights = np.arange(1, n_recent + 1)
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(player_history[metric].iloc[-3:],
                                weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate projection accuracy on historical data.
    """
    results = []

    for player_id in data['player_id'].unique():
        player_data = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(player_data) < 4:
            continue

        # Use all but last season for projection
        train_data = player_data.iloc[:-1]
        actual_data = player_data.iloc[-1]

        if len(train_data) < 3:
            continue

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)

            for metric in ['wRC_plus', 'ISO', 'AVG']:
                if metric in projection:
                    results.append({
                        'player_id': player_id,
                        'metric': metric,
                        'actual': actual_data[metric],
                        'projected': projection[metric],
                        'error': projection[metric] - actual_data[metric]
                    })
        except Exception:
            continue

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                  alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
               'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                    fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")

Extension Ideas:

  • Incorporate minor league translation factors
  • Add injury risk modeling
  • Create playing time projections
  • Develop position-specific aging curves
  • Compare to established projection systems such as Steamer or ZiPS (see the sketch below)
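
A minimal sketch of that last idea, benchmarking against a public system; the file and column names are hypothetical stand-ins for your own exports (FanGraphs publishes Steamer and ZiPS projections as downloadable CSVs):

# Benchmark personal projections against a public system (hypothetical files)
import pandas as pd

mine = pd.read_csv("my_projections.csv")     # columns: player_id, wRC_plus
steamer = pd.read_csv("steamer_2024.csv")    # columns: player_id, wRC_plus

merged = mine.merge(steamer, on="player_id", suffixes=("_mine", "_steamer"))
diff = merged["wRC_plus_mine"] - merged["wRC_plus_steamer"]

print(f"Mean absolute difference: {diff.abs().mean():.1f}")
print(f"Correlation: {merged['wRC_plus_mine'].corr(merged['wRC_plus_steamer']):.3f}")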

Exercise 24.3: Draft Value Analysis and Strategy Optimization

Objective: Analyze historical draft performance to quantify pick value and optimize draft strategy.

Skills Demonstrated: Data analysis, value modeling, strategic thinking, data visualization

Key Analysis Components:

# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(survival)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

print(sprintf("Generated %d draft picks from %d drafts",
              nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

print("\nValue Curve Model:")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", linewidth = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

print("\nPosition-Specific Success Rates:")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

print("\nCollege vs High School Performance:")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

print("\nReturn on Investment by Pick Range:")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  # Simple optimization: maximize expected WAR given bonus pool constraints

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while(nrow(remaining_picks) > 0 & remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if(nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft but consider positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")

Portfolio Enhancement:

  • Add international signing analysis
  • Compare team draft performance (see the sketch after this list)
  • Analyze specific draft classes
  • Include financial constraints modeling
  • Compare to prospect ranking systems
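
A sketch of the team-comparison idea, assuming the draft table is extended with a team column (the simulated data above does not include one); pandas is used here for brevity:

# Team-level draft performance report (assumes a 'team' column exists)
import pandas as pd

def team_draft_report(draft_df: pd.DataFrame) -> pd.DataFrame:
    # Rank teams by total WAR drafted and by the rate of picks reaching MLB
    return (draft_df
            .groupby("team")
            .agg(picks=("reached_mlb", "size"),
                 mlb_rate=("reached_mlb", "mean"),
                 total_war=("war_if_mlb", "sum"))
            .sort_values("total_war", ascending=False))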

Exercise 24.4: Defensive Positioning and Shift Analysis

Objective: Analyze defensive positioning effectiveness using batted ball data.

Skills Demonstrated: Spatial analysis, causal inference, strategic analysis, data visualization

Implementation Framework:

# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches

# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)

# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
    """
    Simulate batted ball locations and outcomes.
    Coordinates in feet from home plate.
    """
    np.random.seed(42)

    data = []

    for _ in range(n_balls):
        # Batter handedness
        stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])

        # Shift decision (more common vs pull hitters)
        is_shifter = np.random.random() < 0.3
        shift_on = is_shifter and (np.random.random() < 0.7)

        # Hit location (pull tendency varies)
        if stand == 'R':
            # Righties pull left
            if is_shifter:
                angle = np.random.normal(-25, 35)  # Pull-heavy
            else:
                angle = np.random.normal(-10, 45)  # Balanced
        else:
            # Lefties pull right
            if is_shifter:
                angle = np.random.normal(25, 35)
            else:
                angle = np.random.normal(10, 45)

        # Distance based on exit velo and launch angle
        exit_velo = np.random.normal(88, 8)
        launch_angle = np.random.normal(12, 18)

        # Simplified distance calculation
        distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
        distance = max(50, min(400, distance + np.random.normal(0, 20)))

        # Convert to x, y coordinates
        angle_rad = np.radians(angle)
        x = distance * np.sin(angle_rad)
        y = distance * np.cos(angle_rad)

        # Hit outcome (shift effectiveness)
        if shift_on:
            # Shift reduces hits in pull direction
            if stand == 'R' and x < -50:
                prob_hit = 0.18  # Reduced by shift
            elif stand == 'L' and x > 50:
                prob_hit = 0.18
            else:
                prob_hit = 0.28  # Normal rate
        else:
            prob_hit = 0.25

        # Adjust for distance (harder to field)
        prob_hit = min(0.95, prob_hit * (distance / 250))

        is_hit = np.random.random() < prob_hit

        data.append({
            'x': x,
            'y': y,
            'distance': distance,
            'angle': angle,
            'exit_velo': exit_velo,
            'launch_angle': launch_angle,
            'stand': stand,
            'shift_on': shift_on,
            'is_hit': is_hit,
            'is_shifter': is_shifter
        })

    return pd.DataFrame(data)

# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)

print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")

# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
    'is_hit': ['mean', 'count'],
    'exit_velo': 'mean'
}).round(3)

print("\nShift Effectiveness:")
print(shift_analysis)

# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
    """
    Estimate runs saved by shifting.
    """
    results = []

    for stand in ['R', 'L']:
        for shifter in [True, False]:
            subset = data[(data['stand'] == stand) &
                         (data['is_shifter'] == shifter)]

            if len(subset) == 0:
                continue

            shifted = subset[subset['shift_on'] == True]
            no_shift = subset[subset['shift_on'] == False]

            if len(shifted) > 0 and len(no_shift) > 0:
                babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
                # Approximate run value per hit prevented: ~0.5 runs
                runs_saved_per_pa = babip_diff * 0.5

                results.append({
                    'stand': stand,
                    'is_shifter': shifter,
                    'shifted_babip': shifted['is_hit'].mean(),
                    'no_shift_babip': no_shift['is_hit'].mean(),
                    'babip_diff': babip_diff,
                    'runs_saved_per_100pa': runs_saved_per_pa * 100,
                    'n_shifted': len(shifted),
                    'n_no_shift': len(no_shift)
                })

    return pd.DataFrame(results)

shift_value = calculate_shift_value(bb_data)

print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))

# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
    """
    Plot baseball field with hit locations.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Draw field outline
    # Infield dirt
    infield = patches.Wedge((0, 0), 95, 45, 135,
                           facecolor='tan', alpha=0.3)
    ax.add_patch(infield)

    # Outfield grass
    outfield = patches.Wedge((0, 0), 400, 45, 135,
                            facecolor='green', alpha=0.1)
    ax.add_patch(outfield)

    # Foul lines
    ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
    ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)

    # Plot hits
    hits = data[data['is_hit'] == True]
    outs = data[data['is_hit'] == False]

    ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
              s=20, label='Out')
    ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
              s=30, label='Hit')

    ax.set_xlim(-320, 320)
    ax.set_ylim(0, 400)
    ax.set_aspect('equal')
    ax.set_xlabel('Distance from center (ft)', fontsize=11)
    ax.set_ylabel('Distance from home (ft)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.2)

    return ax

# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
                      (bb_data['is_shifter'] == True)]

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == False],
    'RHB Pull Hitter - No Shift',
    ax=ax1
)

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == True],
    'RHB Pull Hitter - Shift On',
    ax=ax2
)

plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")

# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
    """
    Create BABIP heat map for given conditions.
    """
    subset = data[(data['shift_on'] == shift_status) &
                  (data['stand'] == stand)]

    # Create grid
    x_bins = np.linspace(-250, 250, 25)
    y_bins = np.linspace(50, 350, 20)

    grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
    grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))

    for i in range(len(y_bins)-1):
        for j in range(len(x_bins)-1):
            mask = ((subset['x'] >= x_bins[j]) &
                   (subset['x'] < x_bins[j+1]) &
                   (subset['y'] >= y_bins[i]) &
                   (subset['y'] < y_bins[i+1]))

            cell_data = subset[mask]
            if len(cell_data) >= 5:  # Minimum sample
                grid_babip[i, j] = cell_data['is_hit'].mean()
                grid_count[i, j] = len(cell_data)
            else:
                grid_babip[i, j] = np.nan

    return grid_babip, x_bins, y_bins, grid_count

# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stand in enumerate(['R', 'L']):
    for j, shift_on in enumerate([False, True]):
        ax = axes[i, j]

        shifters = bb_data[bb_data['is_shifter'] == True]
        grid, x_bins, y_bins, counts = create_babip_heatmap(
            shifters, shift_on, stand
        )

        im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
                                     y_bins[0], y_bins[-1]],
                      origin='lower', cmap='RdYlGn_r',
                      vmin=0, vmax=0.5, aspect='auto')

        shift_text = "Shift On" if shift_on else "No Shift"
        hand_text = "RHB" if stand == 'R' else "LHB"
        ax.set_title(f'{hand_text} - {shift_text}',
                    fontsize=11, fontweight='bold')
        ax.set_xlabel('Horizontal Position (ft)')
        ax.set_ylabel('Distance from Home (ft)')

        # Add colorbar
        plt.colorbar(im, ax=ax, label='BABIP')

plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")

# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features for shift decision model
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
                      ((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)

X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate: ROC-AUC on predicted probabilities, classification report on labels
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
                      model.coef_[0]):
    print(f"  {feat}: {coef:.3f}")

# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)

print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
    print(f"   {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")

print("\n2. WHEN TO SHIFT")
print("   - Strong pull tendency (>70% pull rate)")
print("   - Ground ball hitters (LA < 10°)")
print("   - Extreme pull hitters benefit most from aggressive shifts")

print("\n3. SHIFT VARIATIONS")
print("   - Full shift: 3 infielders on pull side")
print("   - Partial shift: 2.5 infielders pull side")
print("   - No shift: Traditional alignment")
print("   - Decision should consider:")
print("     * Batter's spray chart")
print("     * Game situation (runners, outs)")
print("     * Pitcher's ground ball rate")

print("\n4. LIMITATIONS & CONSIDERATIONS")
print("   - Shift beaten by opposite field hits")
print("   - Bunt defense vulnerabilities")
print("   - Runner advancement opportunities")
print("   - Pitcher-specific adjustments")

print("\n5. FUTURE ANALYSIS")
print("   - Pitcher-specific positioning")
print("   - Count-based positioning adjustments")
print("   - Outfield positioning optimization")
print("   - Real-time adjustment algorithms")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)

Portfolio Development Tips:

  • Use real Statcast spray chart data when possible (see the sketch after this list)
  • Incorporate expected outcomes (xBA, xwOBA)
  • Add video analysis component
  • Compare to MLB team shift strategies
  • Analyze shift effectiveness by ballpark
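
For the first two tips, the open-source pybaseball package can pull real Statcast data, including spray-chart coordinates and expected-outcome columns. A minimal sketch, assuming pybaseball is installed; the MLBAM player ID is illustrative, and exact column availability depends on the Statcast export:

# Pull real spray-chart data with pybaseball (illustrative player ID)
from pybaseball import statcast_batter

df = statcast_batter("2023-04-01", "2023-10-01", player_id=514888)

# Keep balls in play that have hit coordinates and expected outcomes
bip = df.dropna(subset=["hc_x", "hc_y"])
cols = ["hc_x", "hc_y", "stand",
        "estimated_ba_using_speedangle",
        "estimated_woba_using_speedangle"]
print(bip[cols].head())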


Chapter Summary

Building a career in baseball analytics requires a combination of technical skills, baseball knowledge, and strategic career planning. Key takeaways:

Skills Development: Master programming (R, Python, SQL), statistical methods, and baseball domain knowledge. Build a strong portfolio demonstrating diverse analytical capabilities.

Career Planning: Understand the various roles and career paths in baseball analytics. Network actively, apply strategically, and prepare thoroughly for technical and behavioral interviews.

Continuous Learning: Stay current with new analytical methods, data sources, and baseball trends. Engage with the community through conferences, publications, and collaborations.

Portfolio Projects: Complete projects that showcase your ability to collect and clean data, apply appropriate statistical methods, create compelling visualizations, and communicate actionable insights.

The field of baseball analytics continues to evolve rapidly. Success requires not just technical proficiency, but also creativity, communication skills, and genuine passion for understanding the game through data. The exercises in this chapter provide starting points for building portfolio projects that demonstrate your readiness for a career in this exciting field.

R
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(survival)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

print(sprintf("Generated %d draft picks from %d drafts",
              nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

print("\nValue Curve Model:")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", size = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

print("\nPosition-Specific Success Rates:")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

print("\nCollege vs High School Performance:")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

print("\nReturn on Investment by Pick Range:")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  """
  Simple optimization: maximize expected WAR given bonus pool constraints
  """

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while(nrow(remaining_picks) > 0 & remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if(nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft but consider positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")
Python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(0,
                                      years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                           curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (3:2:1 for last 3 years)
        weights = np.array([3, 2, 1])[:len(player_history)]
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(player_history[metric].iloc[-3:],
                                weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector, test_seasons=[2023, 2024]):
    """
    Evaluate projection accuracy on historical data.
    """
    results = []

    for player_id in data['player_id'].unique():
        player_data = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(player_data) < 4:
            continue

        # Use all but last season for projection
        train_data = player_data.iloc[:-1]
        actual_data = player_data.iloc[-1]

        if len(train_data) < 3:
            continue

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)

            for metric in ['wRC_plus', 'ISO', 'AVG']:
                if metric in projection:
                    results.append({
                        'player_id': player_id,
                        'metric': metric,
                        'actual': actual_data[metric],
                        'projected': projection[metric],
                        'error': projection[metric] - actual_data[metric]
                    })
        except:
            continue

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                  alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
               'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                    fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")

Practice Exercises

Reinforce what you've learned with these hands-on exercises. Try to solve them on your own before viewing hints or solutions.

4 exercises
Tips for Success
  • Read the problem carefully before starting to code
  • Break down complex problems into smaller steps
  • Use the hints if you're stuck - they won't give away the answer
  • After solving, compare your approach with the solution
Exercise 24.1
Pitcher Arsenal Analysis and Optimization
Hard
**Objective**: Analyze a pitcher's repertoire using Statcast data and provide recommendations for pitch usage optimization.

**Skills Demonstrated**: Data acquisition, exploratory analysis, visualization, strategic thinking

**Project Steps**:

1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
2. Analyze pitch characteristics (velocity, movement, spin)
3. Evaluate pitch effectiveness by count and situation
4. Identify optimization opportunities
5. Create compelling visualizations
6. Write executive summary with recommendations

**R Implementation**:

```r
# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns

library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)

# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data

  set.seed(123)
  n_pitches <- 2500

  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}

# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")

# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    n = n(),
    pct = n() / nrow(pitcher_data),
    avg_velo = mean(release_speed, na.rm = TRUE),
    avg_spin = mean(release_spin_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(n))

cat("Pitch Mix:\n")
print(pitch_mix)

# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    usage = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
                    na.rm = TRUE),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    # Proxy only: whiffs in counts with at least one ball
    # (a true chase rate requires pitch location data)
    chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
  ) %>%
  arrange(desc(csw_rate))

cat("\nPitch Effectiveness:\n")
print(pitch_effectiveness)

# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
  mutate(count = paste0(balls, "-", strikes)) %>%
  group_by(count, pitch_type) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(count) %>%
  mutate(usage_pct = n / sum(n)) %>%
  arrange(count, desc(usage_pct))

# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
  group_by(pitch_type, stand) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = stand,
    values_from = c(n, whiff_rate, avg_xwoba),
    names_sep = "_"
  )

cat("\nPlatoon Splits:\n")
print(platoon_splits)

# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
  "FF" = "#d22d49", "SI" = "#FE9D00",
  "SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)

movement_plot <- ggplot(pitcher_data,
                        aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_ellipse(level = 0.75, linewidth = 1.2) +
  scale_color_manual(values = pitch_colors,
                     labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                                "SL" = "Slider", "CH" = "Changeup",
                                "CU" = "Curveball")) +
  labs(
    title = "Pitch Movement Profile",
    subtitle = "Catcher's perspective (RHP)",
    x = "Horizontal Break (inches)",
    y = "Induced Vertical Break (inches)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  coord_fixed()

# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
  ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = pitch_colors,
                     labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                                "SL" = "Slider", "CH" = "Changeup",
                                "CU" = "Curveball")) +
  labs(
    title = "Velocity vs. Spin Rate",
    x = "Release Speed (mph)",
    y = "Spin Rate (rpm)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
  filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
  ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                               "SL" = "Slider", "CH" = "Changeup",
                               "CU" = "Curveball")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Pitch Usage by Count",
    x = "Count",
    y = "Usage %",
    fill = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
  plot_annotation(
    title = "Comprehensive Pitcher Arsenal Analysis",
    subtitle = "Example Pitcher - 2024 Season",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined_plot)

# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")

  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)

  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")

  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)

  if (nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for (i in seq_len(nrow(underused))) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }

  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)

  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")

  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}

generate_recommendations(pitcher_data, pitch_effectiveness)

# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")
```

**Portfolio Presentation Tips**:
- Include interactive visualizations (consider using plotly; see the sketch after this list)
- Compare pitcher to league averages
- Add context about pitcher role and team strategy
- Discuss limitations (sample size, park factors, etc.)
- Provide actionable recommendations
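
If you want the movement chart to be interactive, as the plotly tip suggests, here is a minimal Python sketch. The data frame and `pfx_x`/`pfx_z`/`release_speed`/`pitch_type` columns mirror the simulated Statcast fields above, and the output file name is arbitrary:

```python
# Interactive pitch movement chart with plotly
# (column names mirror the simulated Statcast fields in the exercise above)
import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(123)
pitches = pd.DataFrame({
    'pitch_type': rng.choice(['FF', 'SL', 'CH'], size=300),
    'pfx_x': rng.normal(-3, 6, 300),        # horizontal break (inches)
    'pfx_z': rng.normal(8, 6, 300),         # induced vertical break (inches)
    'release_speed': rng.normal(90, 4, 300)
})

fig = px.scatter(
    pitches, x='pfx_x', y='pfx_z', color='pitch_type',
    hover_data=['release_speed'],
    labels={'pfx_x': 'Horizontal Break (in)',
            'pfx_z': 'Induced Vertical Break (in)'},
    title='Pitch Movement Profile (interactive)'
)
fig.write_html('pitch_movement.html')  # open in any browser
```

The same `px.scatter()` call works unchanged on a real Statcast pull, which makes this an easy upgrade for a portfolio piece.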
Exercise 24.2
Player Aging Curves and Performance Projection
Hard
**Objective**: Build aging curves for different player skills and create a performance projection system.

**Skills Demonstrated**: Statistical modeling, time series analysis, predictive analytics, data visualization

**Python Implementation**:

```python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(
                0, years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for qualifying seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age']).copy()

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                            curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Use up to the last 3 seasons, weighting the most recent highest
        # (e.g. 1:2:3, so the newest season carries weight 3)
        recent = player_history.iloc[-3:]
        weights = np.arange(1, len(recent) + 1)
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(recent[metric], weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward, compounding the aging adjustment each year
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
season_counts = player_data.groupby('player_id').size()
test_player_id = season_counts[season_counts >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate projection accuracy on historical data:
    project each player's final season from the seasons before it.
    """
    results = []

    for player_id in data['player_id'].unique():
        history = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(history) < 4:
            continue

        # Use all but the last season for projection
        train_data = history.iloc[:-1]
        actual_data = history.iloc[-1]

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)
        except Exception:
            continue

        for metric in ['wRC_plus', 'ISO', 'AVG']:
            if metric in projection:
                results.append({
                    'player_id': player_id,
                    'metric': metric,
                    'actual': actual_data[metric],
                    'projected': projection[metric],
                    'error': projection[metric] - actual_data[metric]
                })

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                   alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
                'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                     fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")
```

**Extension Ideas**:
- Incorporate minor league translation factors
- Add injury risk modeling
- Create playing time projections
- Develop position-specific aging curves
- Compare to established projection systems (Steamer, ZiPS)
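
One of the recommendations the script prints, regression to the mean, is straightforward to prototype. A minimal sketch of the usual shrinkage form, assuming a league-average wRC+ of 100; the 1,200 PA stabilization constant is a placeholder for illustration, not a fitted value:

```python
# Regress an observed wRC+ toward the league mean based on sample size.
# The stabilization constant is a hypothetical placeholder; in practice
# it would be estimated from year-to-year correlations.
LEAGUE_WRC_PLUS = 100
STABILIZATION_PA = 1200  # placeholder "ballast" in plate appearances

def regressed_wrc_plus(observed: float, pa: int) -> float:
    """Shrink an observed wRC+ toward league average.

    Weight on the observation grows with PA: w = PA / (PA + ballast).
    """
    w = pa / (pa + STABILIZATION_PA)
    return w * observed + (1 - w) * LEAGUE_WRC_PLUS

# A 160 wRC+ over 150 PA regresses much further than over 600 PA
print(regressed_wrc_plus(160, 150))  # ~106.7
print(regressed_wrc_plus(160, 600))  # ~120.0
```

Plugging a regressed baseline into `PlayerProjector` in place of the raw weighted average would temper projections for small-sample breakouts.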
Exercise 24.3
Draft Value Analysis and Strategy Optimization
Hard
**Objective**: Analyze historical draft performance to quantify pick value and optimize draft strategy.

**Skills Demonstrated**: Data analysis, value modeling, strategic thinking, data visualization

**Key Analysis Components**:

```r
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:(2008 + n_years - 1),
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

cat(sprintf("Generated %d draft picks from %d drafts\n",
            nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

cat("\nMLB Success Rate by Round:\n")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

cat("\nValue Curve Model:\n")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", linewidth = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war, na.rm = TRUE) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war, na.rm = TRUE) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

cat("\nPosition-Specific Success Rates:\n")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

cat("\nCollege vs High School Performance:\n")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

cat("\nReturn on Investment by Pick Range:\n")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  # Simple optimization: maximize expected WAR
  # given bonus pool constraints (greedy heuristic)

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while (nrow(remaining_picks) > 0 && remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if (nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

cat("\n=== DRAFT STRATEGY OPTIMIZATION ===\n")
cat(sprintf("\nAvailable Picks: %s\n", paste(example_picks, collapse = ", ")))
cat(sprintf("Bonus Pool: $%.1fM\n\n", example_budget / 1000000))
cat("Optimized Selection:\n")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft volume, but weigh positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")
```

**Portfolio Enhancement**:
- Add international signing analysis
- Compare team draft performance
- Analyze specific draft classes
- Include financial constraints modeling (see the sketch after this list)
- Compare to prospect ranking systems
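
For the financial-constraints bullet, one standard extension is surplus value: the market price of a pick's expected WAR minus its expected bonus cost. A minimal Python sketch, where the $8M-per-WAR rate and the value-curve parameters are illustrative assumptions rather than fitted values from the exercise:

```python
# Draft pick surplus value: market value of expected WAR minus bonus cost.
# The $/WAR rate and curve parameters below are illustrative assumptions.
import math

DOLLARS_PER_WAR = 8_000_000   # hypothetical free-agent market rate
A, B = 10.0, 0.01             # placeholders for expected_war ~ a * exp(-b * pick)

def expected_war(overall_pick: int) -> float:
    return A * math.exp(-B * overall_pick)

def slot_value(overall_pick: int) -> float:
    # Same simplified slot formula used in the R simulation above
    return max(200_000, 12_000_000 * math.exp(-overall_pick / 15))

def surplus_value(overall_pick: int) -> float:
    return expected_war(overall_pick) * DOLLARS_PER_WAR - slot_value(overall_pick)

for pick in (1, 10, 30, 100):
    print(f"Pick {pick:>3}: surplus ≈ ${surplus_value(pick)/1e6:,.1f}M")
```

Swapping in the coefficients actually fitted by `nls()` above, and a defensible $/WAR figure, would turn this into a portfolio-ready valuation table.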
Exercise 24.4
Defensive Positioning and Shift Analysis
Hard
**Objective**: Analyze defensive positioning effectiveness using batted ball data.

**Skills Demonstrated**: Spatial analysis, causal inference, strategic analysis, data visualization

**Implementation Framework**:

```python
# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches

# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)

# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
    """
    Simulate batted ball locations and outcomes.
    Coordinates in feet from home plate.
    """
    np.random.seed(42)

    data = []

    for _ in range(n_balls):
        # Batter handedness
        stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])

        # Shift decision (more common vs pull hitters)
        is_shifter = np.random.random() < 0.3
        shift_on = is_shifter and (np.random.random() < 0.7)

        # Hit location (pull tendency varies)
        if stand == 'R':
            # Righties pull left
            if is_shifter:
                angle = np.random.normal(-25, 35)  # Pull-heavy
            else:
                angle = np.random.normal(-10, 45)  # Balanced
        else:
            # Lefties pull right
            if is_shifter:
                angle = np.random.normal(25, 35)
            else:
                angle = np.random.normal(10, 45)

        # Distance based on exit velo and launch angle
        exit_velo = np.random.normal(88, 8)
        launch_angle = np.random.normal(12, 18)

        # Simplified distance calculation
        distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
        distance = max(50, min(400, distance + np.random.normal(0, 20)))

        # Convert to x, y coordinates
        angle_rad = np.radians(angle)
        x = distance * np.sin(angle_rad)
        y = distance * np.cos(angle_rad)

        # Hit outcome (shift effectiveness)
        if shift_on:
            # Shift reduces hits in pull direction
            if stand == 'R' and x < -50:
                prob_hit = 0.18  # Reduced by shift
            elif stand == 'L' and x > 50:
                prob_hit = 0.18
            else:
                prob_hit = 0.28  # Normal rate
        else:
            prob_hit = 0.25

        # Adjust for distance (harder to field)
        prob_hit = min(0.95, prob_hit * (distance / 250))

        is_hit = np.random.random() < prob_hit

        data.append({
            'x': x,
            'y': y,
            'distance': distance,
            'angle': angle,
            'exit_velo': exit_velo,
            'launch_angle': launch_angle,
            'stand': stand,
            'shift_on': shift_on,
            'is_hit': is_hit,
            'is_shifter': is_shifter
        })

    return pd.DataFrame(data)

# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)

print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")

# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
    'is_hit': ['mean', 'count'],
    'exit_velo': 'mean'
}).round(3)

print("\nShift Effectiveness:")
print(shift_analysis)

# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
    """
    Estimate runs saved by shifting.
    """
    results = []

    for stand in ['R', 'L']:
        for shifter in [True, False]:
            subset = data[(data['stand'] == stand) &
                          (data['is_shifter'] == shifter)]

            if len(subset) == 0:
                continue

            shifted = subset[subset['shift_on'] == True]
            no_shift = subset[subset['shift_on'] == False]

            if len(shifted) > 0 and len(no_shift) > 0:
                babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
                # Approximate run value per hit prevented: ~0.5 runs
                runs_saved_per_pa = babip_diff * 0.5

                results.append({
                    'stand': stand,
                    'is_shifter': shifter,
                    'shifted_babip': shifted['is_hit'].mean(),
                    'no_shift_babip': no_shift['is_hit'].mean(),
                    'babip_diff': babip_diff,
                    'runs_saved_per_100pa': runs_saved_per_pa * 100,
                    'n_shifted': len(shifted),
                    'n_no_shift': len(no_shift)
                })

    return pd.DataFrame(results)

shift_value = calculate_shift_value(bb_data)

print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))

# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
    """
    Plot baseball field with hit locations.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Draw field outline
    # Infield dirt
    infield = patches.Wedge((0, 0), 95, 45, 135,
                            facecolor='tan', alpha=0.3)
    ax.add_patch(infield)

    # Outfield grass
    outfield = patches.Wedge((0, 0), 400, 45, 135,
                             facecolor='green', alpha=0.1)
    ax.add_patch(outfield)

    # Foul lines
    ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
    ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)

    # Plot hits
    hits = data[data['is_hit'] == True]
    outs = data[data['is_hit'] == False]

    ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
               s=20, label='Out')
    ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
               s=30, label='Hit')

    ax.set_xlim(-320, 320)
    ax.set_ylim(0, 400)
    ax.set_aspect('equal')
    ax.set_xlabel('Distance from center (ft)', fontsize=11)
    ax.set_ylabel('Distance from home (ft)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.2)

    return ax

# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
                      (bb_data['is_shifter'] == True)]

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == False],
    'RHB Pull Hitter - No Shift',
    ax=ax1
)

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == True],
    'RHB Pull Hitter - Shift On',
    ax=ax2
)

plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")

# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
    """
    Create BABIP heat map for given conditions.
    """
    subset = data[(data['shift_on'] == shift_status) &
                  (data['stand'] == stand)]

    # Create grid
    x_bins = np.linspace(-250, 250, 25)
    y_bins = np.linspace(50, 350, 20)

    grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
    grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))

    for i in range(len(y_bins)-1):
        for j in range(len(x_bins)-1):
            mask = ((subset['x'] >= x_bins[j]) &
                    (subset['x'] < x_bins[j+1]) &
                    (subset['y'] >= y_bins[i]) &
                    (subset['y'] < y_bins[i+1]))

            cell_data = subset[mask]
            if len(cell_data) >= 5:  # Minimum sample
                grid_babip[i, j] = cell_data['is_hit'].mean()
                grid_count[i, j] = len(cell_data)
            else:
                grid_babip[i, j] = np.nan

    return grid_babip, x_bins, y_bins, grid_count

# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stand in enumerate(['R', 'L']):
    for j, shift_on in enumerate([False, True]):
        ax = axes[i, j]

        shifters = bb_data[bb_data['is_shifter'] == True]
        grid, x_bins, y_bins, counts = create_babip_heatmap(
            shifters, shift_on, stand
        )

        im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
                                     y_bins[0], y_bins[-1]],
                       origin='lower', cmap='RdYlGn_r',
                       vmin=0, vmax=0.5, aspect='auto')

        shift_text = "Shift On" if shift_on else "No Shift"
        hand_text = "RHB" if stand == 'R' else "LHB"
        ax.set_title(f'{hand_text} - {shift_text}',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Horizontal Position (ft)')
        ax.set_ylabel('Distance from Home (ft)')

        # Add colorbar
        plt.colorbar(im, ax=ax, label='BABIP')

plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")

# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features for shift decision model
# Note: 'is_pull' uses the current ball's spray angle, which is only known
# after contact; with real data, build this from prior spray tendencies.
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
                      ((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)

X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate (ROC-AUC uses predicted probabilities, not hard labels)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
                      model.coef_[0]):
    print(f"  {feat}: {coef:.3f}")

# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)

print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
    print(f"   {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")

print("\n2. WHEN TO SHIFT")
print("   - Strong pull tendency (>70% pull rate)")
print("   - Ground ball hitters (LA < 10°)")
print("   - Extreme pull hitters benefit most from aggressive shifts")

print("\n3. SHIFT VARIATIONS")
print("   - Full shift: 3 infielders on pull side")
print("   - Partial shift: 2.5 infielders pull side")
print("   - No shift: Traditional alignment")
print("   - Decision should consider:")
print("     * Batter's spray chart")
print("     * Game situation (runners, outs)")
print("     * Pitcher's ground ball rate")

print("\n4. LIMITATIONS & CONSIDERATIONS")
print("   - Shift beaten by opposite field hits")
print("   - Bunt defense vulnerabilities")
print("   - Runner advancement opportunities")
print("   - Pitcher-specific adjustments")

print("\n5. FUTURE ANALYSIS")
print("   - Pitcher-specific positioning")
print("   - Count-based positioning adjustments")
print("   - Outfield positioning optimization")
print("   - Real-time adjustment algorithms")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
```

**Portfolio Development Tips**:
- Use real Statcast spray chart data when possible (see the sketch after this list)
- Incorporate expected outcomes (xBA, xwOBA)
- Add video analysis component
- Compare to MLB team shift strategies
- Analyze shift effectiveness by ballpark
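
To swap the simulation for real spray-chart data, as the first tip suggests, the pybaseball package can pull Statcast batted balls for a given hitter. A minimal sketch; the player name is just an example, `hc_x`/`hc_y` are Savant's hit-coordinate columns, and the plate-centering offsets are widely used community approximations rather than official constants:

```python
# Pull real Statcast spray-chart data with pybaseball (installed separately).
# The 125.42 / 198.27 offsets are community approximations for centering
# Savant's hit coordinates on home plate, not official values.
from pybaseball import playerid_lookup, statcast_batter
import matplotlib.pyplot as plt

# Look up a player's MLBAM id (example name; any hitter works)
ids = playerid_lookup('freeman', 'freddie')
mlbam_id = int(ids['key_mlbam'].iloc[0])

bb = statcast_batter('2024-04-01', '2024-09-30', mlbam_id)
in_play = bb[bb['type'] == 'X'].dropna(subset=['hc_x', 'hc_y'])

# Center home plate at the origin; units are Savant plot units (~2.5 ft each)
x = in_play['hc_x'] - 125.42
y = 198.27 - in_play['hc_y']

plt.scatter(x, y, s=10, alpha=0.4)
plt.title('Spray chart (Statcast hit coordinates)')
plt.xlabel('hc_x (centered)')
plt.ylabel('hc_y (centered, flipped)')
plt.gca().set_aspect('equal')
plt.savefig('real_spray_chart.png', dpi=200, bbox_inches='tight')
```

From there, the same binning and BABIP-heat-map functions in the exercise apply directly, with the simulated `x`/`y` columns replaced by these centered coordinates.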

---

Chapter Summary

This chapter walked through building a career in baseball front offices and analytics. Key topics covered:

  • Careers in Baseball Analytics
  • Building Your Analytics Portfolio
  • Essential Technical Skills
  • The Job Application Process
  • Interview Preparation & Case Studies
  • Resources & Continued Learning