Chapter 24: Front Office & Analytics Career Guide

The path to a career in baseball analytics has evolved dramatically over the past two decades. What was once a field dominated by traditional scouts and baseball lifers has transformed into a multidisciplinary profession that values quantitative skills, programming ability, and innovative thinking alongside baseball knowledge. This chapter provides a comprehensive guide to building a career in baseball analytics, from developing your skillset to navigating the interview process.


24.1 Careers in Baseball Analytics

The Modern Baseball Analytics Landscape

Baseball front offices today employ diverse teams of analysts with varying specializations. Understanding the different roles and career paths available is essential for targeting your professional development.

Entry-Level Positions

  • Baseball Analytics Intern: Seasonal or year-round internships focusing on specific projects, data collection, or analytical support
  • Baseball Operations Assistant: Administrative support with analytical components, often involving data management and report generation
  • Quantitative Analyst: Entry-level position focusing on statistical modeling and data analysis
  • Video Coordinator: Collecting and organizing video footage, often with analytical tagging responsibilities

Mid-Level Positions

  • Baseball Analytics Analyst: Core analytical work including model development, research projects, and decision support
  • Quantitative Researcher: Advanced statistical modeling, machine learning, and predictive analytics
  • Pro Scouting Analyst: Combining traditional scouting with quantitative analysis of major league talent
  • Amateur Scouting Analyst: Supporting draft and international signing decisions with data-driven insights
  • Player Development Analyst: Working with minor league affiliates on player progression and development strategies
  • Research & Development Analyst: Exploring new analytical methods, technologies, and data sources

Senior-Level Positions

  • Senior Analyst/Lead Analyst: Managing analytical projects and mentoring junior staff
  • Director of Baseball Analytics: Overseeing the analytics department and setting analytical strategy
  • Director of Research & Development: Leading innovation in analytical methods and technology
  • Vice President of Baseball Analytics: Executive-level position with strategic influence on baseball operations
  • Assistant General Manager (Analytics): Bridge role between analytics and traditional baseball operations
  • General Manager: Ultimate decision-making authority, increasingly with analytical backgrounds

Related Career Paths

Baseball analytics skills translate to numerous adjacent roles:

  • Biomechanics Researcher: Analyzing motion capture data to optimize player performance and reduce injuries
  • Sports Technology Developer: Building tools and platforms for data collection and analysis
  • Independent Consultant: Providing analytical services to multiple organizations
  • Media Analyst: Bringing analytical insights to broadcast, print, or digital media
  • Betting Analyst: Applying baseball analysis to sports gambling markets
  • Academic Researcher: Studying baseball from statistical, economic, or social science perspectives

Organizational Structures

Different organizations structure their analytics departments in various ways:

Centralized Model: A single analytics department supports all baseball operations functions (player development, amateur scouting, pro scouting, major league operations).

Embedded Model: Analysts are embedded within specific departments (e.g., a pro scouting analyst who works primarily with the pro scouting director).

Hybrid Model: A core R&D team handles long-term research while embedded analysts support day-to-day operations.

Understanding an organization's structure helps you identify where you might fit and what collaboration patterns to expect.

Compensation and Market Dynamics

Baseball analytics salaries vary widely based on experience, role, and organization:

  • Entry-level positions: $40,000-$65,000
  • Mid-level analysts: $65,000-$100,000
  • Senior analysts/Directors: $100,000-$200,000
  • Executive positions: $200,000+

Important considerations:


  • Baseball salaries often lag behind tech industry compensation for similar skills

  • Smaller market teams may pay less than large market teams

  • Passion for baseball is expected but shouldn't justify exploitative compensation

  • Benefits packages vary significantly by organization

  • Many entry-level positions are seasonal or contract-based initially


24.2 Building Your Analytics Portfolio

Your portfolio is your most important asset in the job search process. It demonstrates your technical skills, baseball knowledge, and ability to communicate insights.

Portfolio Principles

Quality Over Quantity: Three excellent projects are better than ten mediocre ones. Each project should showcase different skills and demonstrate depth of analysis.

Tell a Story: Every project should have a clear narrative arc: motivation, methodology, findings, and implications. Write for an intelligent audience that may not have your technical background.

Show Your Work: Include code, methodologies, and limitations. Transparency about your approach demonstrates intellectual honesty and technical competence.

Make It Accessible: Host your portfolio on GitHub Pages, a personal website, or a platform like RPubs. Ensure projects are easy to navigate and professionally presented.

Project Categories

1. Predictive Modeling Projects

Build models that forecast player performance, injury risk, or game outcomes. These projects demonstrate statistical modeling skills and understanding of baseball dynamics.

Example topics:


  • Pitcher performance projection using pitch-level data

  • Rookie success prediction using minor league statistics

  • Win probability models incorporating game situations

  • Injury risk models using workload and biomechanical data

2. Player Evaluation Projects

Develop new metrics or frameworks for evaluating player value. These projects show creativity and deep baseball knowledge.

Example topics:


  • Defensive value metrics using batted ball data

  • Catcher framing adjusted for umpire tendencies

  • Base running value beyond traditional stolen bases

  • Clutch performance analysis controlling for context

3. Strategic Analysis Projects

Analyze team strategies and their effectiveness. These projects demonstrate tactical understanding and causal inference skills.

Example topics:


  • Optimal batting order construction

  • Bullpen management and pitcher usage patterns

  • Defensive positioning strategies

  • Draft strategy and value optimization

4. Data Visualization Projects

Create compelling visualizations that communicate complex insights. These projects showcase data communication skills.

Example topics:


  • Interactive pitch movement charts

  • Player aging curves across different skills

  • Trade deadline activity networks

  • Spray chart analysis with context

Technical Implementation

Your portfolio should demonstrate proficiency in industry-standard tools. Below are templates for common portfolio projects.

Project Structure Template

project-name/
├── README.md
├── data/
│   ├── raw/
│   └── processed/
├── code/
│   ├── 01-data-collection.R
│   ├── 02-data-cleaning.R
│   ├── 03-exploratory-analysis.R
│   ├── 04-modeling.R
│   └── 05-visualization.R
├── output/
│   ├── figures/
│   └── tables/
└── report.Rmd or report.ipynb

Writing About Your Work

Strong portfolio projects include written analysis that demonstrates:

  • Clear Problem Statement: What question are you answering and why does it matter?
  • Methodology Explanation: How did you approach the problem? What analytical choices did you make?
  • Results Interpretation: What did you find? What are the practical implications?
  • Limitations Discussion: What are the constraints and caveats of your analysis?
  • Future Directions: What would you do with more time, data, or resources?

Use section headers, bullet points, and visualizations to make your writing scannable. Avoid jargon unless necessary, and explain technical concepts clearly.


24.3 Essential Technical Skills

Success in baseball analytics requires a diverse technical skillset. While no one is an expert in everything, you should develop proficiency across several key areas.

Programming Languages

R

R remains the dominant language in baseball analytics, thanks to its statistical capabilities and baseball-specific packages; a short example follows the skills list below.

Essential R skills:


  • Data manipulation with dplyr and tidyr

  • Visualization with ggplot2

  • Statistical modeling with base R, caret, and tidymodels

  • Report generation with R Markdown

  • Working with baseball data using baseballr and other packages
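
To make this concrete, here is a minimal sketch of the dplyr workflow: computing qualified on-base percentage leaders for a season. It assumes the Lahman package is installed; any season-level batting table with the same columns would work.

# Minimal sketch: qualified OBP leaders with dplyr (assumes the Lahman package)
library(dplyr)
library(Lahman)

Batting %>%
  filter(yearID == 2019) %>%
  group_by(playerID) %>%
  summarize(
    PA  = sum(AB + BB + HBP + SF, na.rm = TRUE),
    OBP = sum(H + BB + HBP, na.rm = TRUE) / PA,
    .groups = "drop"
  ) %>%
  filter(PA >= 400) %>%   # rough qualification threshold
  arrange(desc(OBP)) %>%
  head(10)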

Python

Python is increasingly popular, especially for machine learning and data engineering tasks.

Essential Python skills:


  • Data manipulation with pandas and numpy

  • Visualization with matplotlib and seaborn

  • Machine learning with scikit-learn

  • Web scraping with BeautifulSoup and scrapy

  • Notebook-based analysis with Jupyter

SQL

Database querying is essential for working with large datasets; a window-function sketch follows the list below.

Essential SQL skills:


  • SELECT, JOIN, GROUP BY operations

  • Window functions for time-series analysis

  • CTEs (Common Table Expressions) for complex queries

  • Query optimization for performance

  • Database design fundamentals
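
Window functions come up constantly in interviews. Below is a minimal sketch that runs one from R against an in-memory SQLite database (recent RSQLite versions bundle a SQLite new enough for window functions); the pitches table and its columns are made up for illustration.

# Minimal sketch: rolling 3-game velocity average via a SQL window function
# (illustrative schema; in-memory SQLite database)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "pitches", data.frame(
  pitcher_id = 1L,
  game_no    = 1:10,
  avg_velo   = c(94.1, 94.3, 93.8, 94.0, 93.5, 93.9, 93.2, 93.6, 93.0, 93.4)
))

print(dbGetQuery(con, "
  SELECT game_no,
         avg_velo,
         AVG(avg_velo) OVER (
           PARTITION BY pitcher_id
           ORDER BY game_no
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
         ) AS rolling_3g_velo
  FROM pitches
"))

dbDisconnect(con)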

Statistical Methods

Foundational Statistics


  • Probability distributions and hypothesis testing

  • Regression analysis (linear, logistic, Poisson)

  • Time series analysis

  • Bayesian inference

  • Experimental design and A/B testing

Advanced Methods


  • Machine learning (random forests, gradient boosting, neural networks)

  • Survival analysis for injury and career longevity

  • Hierarchical/mixed-effects models for player evaluation (sketched after this list)

  • Causal inference methods

  • Optimization algorithms
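
As one example from this list, here is a hedged sketch of a mixed-effects model: a random intercept per player separates individual talent from noise, shrinking small-sample estimates toward the league mean. The data are simulated and lme4 is an assumption; any multilevel modeling package would do.

# Minimal sketch: random-intercept model for player wOBA (simulated data; assumes lme4)
library(lme4)

set.seed(1)
n_players <- 50
seasons <- 4
df <- data.frame(
  player = factor(rep(seq_len(n_players), each = seasons)),
  age = rep(sample(23:33, n_players, replace = TRUE), each = seasons) + 0:(seasons - 1)
)
true_talent <- rnorm(n_players, 0.320, 0.025)  # latent player quality
df$woba <- true_talent[df$player] - 0.002 * (df$age - 28) +
  rnorm(nrow(df), 0, 0.015)

# Partial pooling: each player's intercept is shrunk toward the overall mean
fit <- lmer(woba ~ age + (1 | player), data = df)
summary(fit)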

Domain Knowledge

Technical skills must be paired with baseball expertise:

Baseball Operations Understanding


  • Roster construction, payroll budgeting, and luxury-tax (competitive balance tax) management

  • Draft and international signing processes

  • Player development systems

  • Arbitration and free agency mechanics

  • Collective Bargaining Agreement provisions

On-Field Baseball Knowledge


  • Game situations and strategic decisions

  • Player roles and positional value

  • Pitch types and sequencing

  • Defensive alignments and shifts

  • Base running and situational hitting

Historical Context


  • Evolution of the game and rule changes

  • Historical trends in player performance

  • Park effects and era adjustments

  • Notable players and teams for benchmarking

Data Sources and Tools

Public Data Sources


  • Baseball Reference and FanGraphs for player statistics

  • MLB.com Statcast for pitch and batted ball data

  • Baseball Savant for advanced Statcast metrics

  • Retrosheet for historical play-by-play data

  • Brooks Baseball for pitch-level data

Development Tools


  • Version control with Git and GitHub

  • Integrated Development Environments (RStudio, PyCharm, VS Code)

  • Jupyter notebooks for interactive analysis

  • Docker for reproducible environments

  • Cloud platforms (AWS, Google Cloud) for large-scale computation


24.4 The Job Application Process

Breaking into baseball analytics is highly competitive. A strategic approach to the application process is essential.

Finding Opportunities

Official Channels


  • MLB's official job board (mlb.com/jobs)

  • Individual team websites (check careers or front office sections)

  • LinkedIn job postings

  • Baseball Prospectus job board

  • Baseball America postings

Networking


  • SABR Analytics Conference

  • MIT Sloan Sports Analytics Conference

  • Twitter/X connections with analytics professionals

  • Alumni networks from your university

  • Local baseball analytics meetups

Proactive Outreach


  • Cold emails to teams (respectful, concise, value-focused)

  • Informational interviews

  • Contributing to public baseball research

  • Building relationships over time

Crafting Your Application

Resume Best Practices

Focus on relevant experience and quantifiable achievements:

BASEBALL ANALYTICS EXPERIENCE

Research Analyst | University Baseball Analytics Lab | 2023-2024
- Developed pitch selection model using Statcast data, identifying 12% improvement
  opportunity in changeup usage for fastball-heavy pitchers
- Created automated scouting reports combining video and statistical analysis,
  reducing report generation time by 60%
- Presented findings at regional SABR conference

TECHNICAL PROJECTS

Pitcher Performance Projection System
- Built ensemble model combining stuff metrics and command indicators to project
  ERA with 15% lower RMSE than baseline FIP projections
- Technologies: R, Python (scikit-learn), SQL, Git
- Results published on personal blog with 5,000+ views

TECHNICAL SKILLS

Programming: R (advanced), Python (intermediate), SQL (intermediate)
Statistical Methods: Regression, mixed-effects models, machine learning, time series
Tools: Git, RStudio, Jupyter, Tableau, Docker
Baseball Data: Statcast, PITCHf/x, Retrosheet, FanGraphs

Cover Letter Strategy

Your cover letter should:


  1. Show you understand the organization's analytical philosophy

  2. Connect your experience to their specific needs

  3. Demonstrate passion backed by substantive knowledge

  4. Be concise (one page maximum)

Template structure:


  • Opening: Why this team/role specifically

  • Body: 2-3 relevant experiences or projects with outcomes

  • Connection: How your skills address their needs

  • Closing: Clear call to action

Portfolio Presentation

Include a portfolio link prominently in your resume and cover letter. Ensure the landing page is polished and your best work is immediately visible.

Consider creating a one-page portfolio summary:

ANALYTICS PORTFOLIO
github.com/username | website.com | email@example.com

Featured Projects:

1. Minor League Pitcher Development Analysis
   Identified mechanical adjustments that correlate with MLB success
   Tools: R, Statcast, video analysis | Link: [URL]

2. Defensive Shift Optimization Framework
   Game theory approach to positioning strategy
   Tools: Python, optimization algorithms | Link: [URL]

3. Draft Value Model
   Historical analysis of draft pick value by round and position
   Tools: R, Bayesian modeling | Link: [URL]

Application Timing

Internship Cycles


  • Baseball Operations internships: Applications typically open September-November for following season

  • Analytics-specific internships: Some teams hire year-round

  • Summer internships: Apply in fall/winter

Full-Time Positions


  • Limited predictability; positions open when needs arise

  • More opportunities after season ends (October-November)

  • Some hiring around winter meetings (December)

Strategy


  • Apply early in application windows

  • Follow up appropriately (one polite email after 2-3 weeks)

  • Continue building your portfolio while waiting

  • Apply to multiple teams; this is a numbers game


24.5 Interview Preparation & Case Studies

Baseball analytics interviews typically involve multiple stages and diverse evaluation methods.

Interview Stages

1. Phone/Video Screening (30-45 minutes)


  • Background and interest in baseball analytics

  • Technical skills overview

  • Behavioral questions

  • Basic baseball knowledge assessment

2. Technical Assessment (1-4 hours)


  • Take-home project analyzing baseball data

  • SQL or programming challenges

  • Statistical reasoning problems

  • Time-constrained analysis task

3. On-Site/Virtual Interview (3-6 hours)


  • Multiple one-on-one interviews

  • Presentation of take-home project

  • Case study analysis

  • Conversations with various department members

4. Final Interview (1-2 hours)


  • Meet senior leadership

  • Culture fit assessment

  • Negotiation discussion

Technical Interview Topics

Data Analysis Questions

Example: "Given a dataset of pitch-by-pitch data, how would you evaluate whether a pitcher's performance has improved this season?"

Approach:


  1. Clarify the question (What metrics define improvement? Over what timeframe?)

  2. Discuss confounding factors (opponent quality, injury, role changes)

  3. Propose multiple analytical approaches (one is sketched after this list)

  4. Consider limitations and alternative explanations

  5. Discuss actionable implications
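
As an illustration of step 3, here is one minimal sketch on simulated data: compare per-batted-ball xwOBA allowed across season halves. A real analysis would adjust for opponent quality, parks, and pitch mix before testing anything.

# Minimal sketch: season-half comparison of xwOBA allowed (simulated values)
set.seed(7)
first_half  <- rnorm(300, mean = 0.330, sd = 0.120)  # xwOBA per batted ball, Apr-Jun
second_half <- rnorm(280, mean = 0.305, sd = 0.120)  # Jul-Sep

t.test(second_half, first_half)  # difference in means with a confidence interval
# A significant drop is suggestive, not conclusive: check whether stuff metrics,
# pitch mix, or opponent quality moved at the same time.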

Statistical Reasoning

Example: "A rookie has a .350 batting average through his first 20 games. How much weight should we give this performance?"

Approach:


  • Discuss sample size and regression to the mean

  • Reference prior expectations (minor league performance, scouting reports)

  • Calculate confidence intervals

  • Compare to historical rookie performance

  • Apply a Bayesian updating framework (see the sketch after this list)
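
To make the last point concrete, here is a hedged sketch of beta-binomial updating: hits are treated as binomial, and the prior is a Beta distribution centered on a plausible pre-season expectation. The prior strength (roughly 300 at-bats of information) is illustrative, not calibrated.

# Minimal sketch: beta-binomial updating for a .350 average through 20 games
prior_alpha <- 78    # 0.260 expectation * 300 AB of prior weight (illustrative)
prior_beta  <- 222   # (1 - 0.260) * 300

hits <- 28; at_bats <- 80   # roughly .350 over 20 games (assumed totals)
post_alpha <- prior_alpha + hits
post_beta  <- prior_beta + (at_bats - hits)

post_mean <- post_alpha / (post_alpha + post_beta)
cred_int  <- qbeta(c(0.025, 0.975), post_alpha, post_beta)

cat(sprintf("Posterior mean AVG: %.3f\n", post_mean))
cat(sprintf("95%% credible interval: %.3f-%.3f\n", cred_int[1], cred_int[2]))
# The posterior sits much closer to the prior than to .350:
# 80 at-bats move the needle, but not far.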

Programming Challenges

Example: "Write a function that calculates wOBA from a dataset of plate appearances."

def calculate_woba(df):
    """
    Calculate wOBA from plate appearance data.

    Parameters:
    df (pd.DataFrame): DataFrame with columns for each outcome

    Returns:
    float: Weighted on-base average
    """
    # wOBA weights (2024 season approximate values)
    weights = {
        'uBB': 0.69,  # Unintentional walk
        'HBP': 0.72,  # Hit by pitch
        '1B': 0.88,   # Single
        '2B': 1.24,   # Double
        '3B': 1.56,   # Triple
        'HR': 1.95    # Home run
    }

    # Calculate weighted sum of positive outcomes
    weighted_sum = (
        df['uBB'] * weights['uBB'] +
        df['HBP'] * weights['HBP'] +
        df['1B'] * weights['1B'] +
        df['2B'] * weights['2B'] +
        df['3B'] * weights['3B'] +
        df['HR'] * weights['HR']
    )

    # wOBA denominator: AB + uBB + SF + HBP (IBB and sacrifice bunts excluded)
    pa = (df['AB'] + df['uBB'] + df['HBP'] + df['SF']).sum()

    return weighted_sum.sum() / pa if pa > 0 else 0.0

# Example usage
import pandas as pd

player_data = pd.DataFrame({
    'AB': [500],
    'uBB': [60],
    'HBP': [5],
    'SF': [3],
    '1B': [100],
    '2B': [30],
    '3B': [3],
    'HR': [25]
})

woba = calculate_woba(player_data)
print(f"Player wOBA: {woba:.3f}")

Case Study Examples

Case Study 1: Trade Evaluation

Scenario: "Your team is considering trading for a starting pitcher. You have access to their Statcast data, injury history, and contract details. Walk me through your evaluation process."

Framework:


  1. Performance Analysis
  • Recent performance trends (ERA, FIP, xFIP, xERA)
  • Pitch mix and effectiveness by pitch type
  • Command metrics (zone rate, chase rate)
  • Platoon splits and context-dependent performance
  2. Sustainability Assessment
  • Statcast quality metrics (velocity, spin, movement)
  • Batted ball profile (exit velocity, launch angle)
  • Sequencing and pitch usage patterns
  • Park effects and league adjustment
  3. Risk Factors
  • Injury history and current health status
  • Workload concerns (innings pitched, pitch count trends)
  • Age and aging curves for similar pitchers
  • Contract value and team control
  4. Contextual Fit
  • How the pitcher fits team needs
  • Home park effects
  • Defensive support considerations
  • Roster construction implications
  5. Cost-Benefit Analysis
  • Expected performance in team context
  • Comparison to alternatives (free agency, internal options)
  • Prospect cost evaluation
  • Financial implications

Case Study 2: Player Development Decision

Scenario: "A promising minor league hitter is struggling. His batting average has dropped, but his walk rate has increased. What additional information would you want, and how would you analyze whether this is concerning?"

Analysis approach:

# Simulated minor league hitter data analysis

library(tidyverse)
library(ggplot2)

# Create sample data
player_data <- tibble(
  month = rep(c("April", "May", "June", "July"), each = 25),
  game = rep(1:25, 4),
  AB = rpois(100, 4),
  H = rpois(100, 1.2),
  BB = rpois(100, 0.8),
  K = rpois(100, 1.3),
  HR = rpois(100, 0.3),
  exit_velo = rnorm(100, 89, 3),
  launch_angle = rnorm(100, 12, 8),
  hard_hit_rate = runif(100, 0.3, 0.5)
)

# Calculate key metrics
player_summary <- player_data %>%
  mutate(month = factor(month, levels = c("April", "May", "June", "July"))) %>%
  group_by(month) %>%
  summarize(
    AVG = sum(H) / sum(AB),
    OBP = sum(H + BB) / sum(AB + BB),        # simplified: HBP/SF not simulated
    BB_rate = sum(BB) / sum(AB + BB),
    K_rate = sum(K) / sum(AB + BB),          # per PA (approximated as AB + BB)
    ISO = sum(HR * 3 + H) / sum(AB) - AVG,   # treats all non-HR hits as singles
    avg_exit_velo = mean(exit_velo),
    avg_launch_angle = mean(launch_angle),
    hard_hit_rate = mean(hard_hit_rate)
  )

# Visualize trends
ggplot(player_summary, aes(x = month, group = 1)) +
  geom_line(aes(y = AVG, color = "AVG"), linewidth = 1.2) +
  geom_line(aes(y = OBP, color = "OBP"), linewidth = 1.2) +
  geom_point(aes(y = AVG, color = "AVG"), size = 3) +
  geom_point(aes(y = OBP, color = "OBP"), size = 3) +
  scale_color_manual(values = c("AVG" = "red", "OBP" = "blue")) +
  labs(
    title = "Batting Average vs. On-Base Percentage Trends",
    subtitle = "Declining AVG with stable OBP suggests improved plate discipline",
    x = "Month",
    y = "Rate",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "top")

# Quality of contact analysis
print("Quality of Contact Metrics:")
print(player_summary %>%
        select(month, avg_exit_velo, hard_hit_rate))

# Interpretation framework
cat("\nAnalysis Framework:\n")
cat("1. Batting average decline with walk rate increase may indicate:\n")
cat("   - Improved pitch recognition and discipline\n")
cat("   - Potential BABIP luck (check if batted ball quality maintained)\n")
cat("   - Adjustment to higher level of pitching\n\n")
cat("2. Key questions to investigate:\n")
cat("   - Is exit velocity and hard contact rate stable?\n")
cat("   - Has BABIP declined significantly?\n")
cat("   - What pitch types are causing issues?\n")
cat("   - Are strikeout rates acceptable?\n")
cat("   - How do metrics compare to league averages at this level?\n\n")
cat("3. Recommendation depends on:\n")
cat("   - If quality of contact maintained: positive sign, continue development\n")
cat("   - If quality declining: identify mechanical issues, consider adjustment period\n")
cat("   - Context of league difficulty and age relative to level\n")

Case Study 3: Strategic Decision

Scenario: "It's the 7th inning of a playoff game. Your team is down by 1 run with a runner on first and no outs. Your best hitter is at the plate. Should you bunt?"

Analysis framework:


  1. Calculate win expectancy with and without the bunt (a simplified run-expectancy sketch follows this list)

  2. Consider hitter's ability and pitcher's weakness

  3. Account for playoff leverage

  4. Evaluate defense and base runner speed

  5. Consider bullpen depth and future innings
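
Even before full win expectancy, a run-expectancy comparison shows the shape of the trade-off. The sketch below uses rough historical base-out values (illustrative, not from a specific season) and assumes the bunt always succeeds, which flatters the bunt.

# Minimal sketch: what a successful sacrifice does to run expectancy
# (approximate historical values; a real answer uses win expectancy and
# accounts for bunt failure, the hitter at the plate, and game state)
re <- c(first_0out = 0.87, second_1out = 0.66)  # expected runs, rest of inning
p1 <- c(first_0out = 0.42, second_1out = 0.40)  # P(scoring at least one run)

cat(sprintf("Run expectancy change: %+.2f runs\n",
            re[["second_1out"]] - re[["first_0out"]]))
cat(sprintf("P(>=1 run) change:     %+.2f\n",
            p1[["second_1out"]] - p1[["first_0out"]]))
# Both drop even when the bunt works, and the team's best hitter is up,
# so the default answer is no -- though defense, runner speed, and the
# on-deck hitter can shift the call.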

Behavioral Interview Questions

Common questions and approaches:

"Tell me about a time you made a mistake in your analysis."


  • Be honest about the error

  • Explain how you discovered it

  • Describe what you learned

  • Show how you prevent similar mistakes now

"How do you handle disagreement with colleagues?"


  • Emphasize collaboration and data-driven discussion

  • Show respect for different perspectives

  • Demonstrate ability to find common ground

  • Know when to advocate and when to defer

"Describe a complex analysis you made accessible to a non-technical audience."


  • Use specific example

  • Explain your communication strategy

  • Discuss challenges and how you addressed them

  • Show outcome and impact

Questions to Ask Interviewers

Asking thoughtful questions demonstrates genuine interest and helps you evaluate fit:

About the Role


  • "What does a typical project lifecycle look like from question to implementation?"

  • "How do analysts collaborate with coaches, scouts, and baseball operations staff?"

  • "What are the biggest analytical challenges the organization currently faces?"

About the Team


  • "How is the analytics department structured?"

  • "What backgrounds do team members come from?"

  • "How has the team's role evolved over the past few years?"

About Development


  • "What opportunities exist for professional development and learning?"

  • "How do you evaluate success in this role?"

  • "What does career progression look like?"

About Culture


  • "How does the organization balance analytics and traditional scouting?"

  • "Can you describe the decision-making process for major roster moves?"

  • "What's the communication style between analytics and leadership?"


24.6 Resources & Continued Learning

Baseball analytics is a continuously evolving field. Committing to ongoing learning is essential for career growth.

Books

Foundational Baseball Analytics


  • The Book: Playing the Percentages in Baseball by Tom Tango, Mitchel Lichtman, and Andrew Dolphin

  • Analyzing Baseball Data with R by Max Marchi and Jim Albert

  • The MVP Machine by Ben Lindbergh and Travis Sawchik

  • Smart Baseball by Keith Law

Statistical Methods


  • An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani

  • Statistical Rethinking by Richard McElreath

  • Regression and Other Stories by Gelman, Hill, and Vehtari

  • The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

Programming


  • R for Data Science by Hadley Wickham and Garrett Grolemund

  • Python for Data Analysis by Wes McKinney

  • Advanced R by Hadley Wickham

Online Courses

Data Science Fundamentals


  • Coursera: Data Science Specialization (Johns Hopkins)

  • edX: MicroMasters in Statistics and Data Science (MIT)

  • DataCamp: Data Science Career Track

  • Udacity: Data Analyst Nanodegree

Baseball-Specific


  • Baseball Savant: Introduction to Statcast

  • FanGraphs: Library and research archives

  • Baseball Prospectus: Statistical primer

  • YouTube: Various channels covering baseball analytics

Communities and Conferences

Online Communities


  • Twitter/X: Follow baseball analysts, teams, and writers

  • Reddit: r/Sabermetrics community

  • Discord: Baseball analytics servers

  • SABR: Society for American Baseball Research membership

Annual Conferences


  • SABR Analytics Conference (March, Phoenix)

  • MIT Sloan Sports Analytics Conference (March, Boston)

  • Baseball Prospectus events

  • Regional SABR chapter meetings

Publications and Blogs

Regular Reading


  • FanGraphs.com

  • Baseball Prospectus

  • The Athletic (baseball coverage)

  • MLB.com analysis section

  • Baseball Savant blog

Individual Analysts


  • Follow prominent analysts on personal blogs and Twitter

  • Read historical sabermetric research

  • Study modern analytical innovations

Building Your Network

Strategies


  • Engage thoughtfully on social media

  • Attend conferences and events

  • Contribute to public baseball research

  • Collaborate on open-source projects

  • Write and share your own analysis

  • Participate in forecasting competitions

Professional Organizations


  • SABR membership

  • American Statistical Association

  • Society for Sports Analytics Research

  • University alumni networks

Staying Current

Weekly Habits


  • Read 3-5 analytical articles

  • Follow game recaps with analytical focus

  • Practice coding challenges

  • Review new research and methodologies

Monthly Habits


  • Complete a small analytical project

  • Read academic papers on sports analytics

  • Contribute to open-source projects

  • Network with one new person

Annual Habits


  • Attend at least one conference

  • Complete a major portfolio project

  • Update resume and portfolio

  • Reflect on skill development and set new goals


24.7 Exercises

These exercises are designed to build portfolio-worthy projects that demonstrate key skills for baseball analytics roles.

Exercise 24.1: Pitcher Arsenal Analysis and Optimization

Objective: Analyze a pitcher's repertoire using Statcast data and provide recommendations for pitch usage optimization.

Skills Demonstrated: Data acquisition, exploratory analysis, visualization, strategic thinking

Project Steps:

  1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
  2. Analyze pitch characteristics (velocity, movement, spin)
  3. Evaluate pitch effectiveness by count and situation
  4. Identify optimization opportunities
  5. Create compelling visualizations
  6. Write executive summary with recommendations

R Implementation:

# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns

library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)

# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data

  set.seed(123)
  n_pitches <- 2500

  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                         rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}

# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")

# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    n = n(),
    pct = n() / nrow(pitcher_data),
    avg_velo = mean(release_speed, na.rm = TRUE),
    avg_spin = mean(release_spin_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(n))

print("Pitch Mix:")
print(pitch_mix)

# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    usage = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
                    na.rm = TRUE),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    # crude proxy: a true chase rate needs pitch-location (zone) data
    chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
  ) %>%
  arrange(desc(csw_rate))

print("\nPitch Effectiveness:")
print(pitch_effectiveness)

# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
  mutate(count = paste0(balls, "-", strikes)) %>%
  group_by(count, pitch_type) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(count) %>%
  mutate(usage_pct = n / sum(n)) %>%
  arrange(count, desc(usage_pct))

# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
  group_by(pitch_type, stand) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = stand,
    values_from = c(n, whiff_rate, avg_xwoba),
    names_sep = "_"
  )

print("\nPlatoon Splits:")
print(platoon_splits)

# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
  "FF" = "#d22d49", "SI" = "#FE9D00",
  "SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)

movement_plot <- ggplot(pitcher_data,
                        aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_ellipse(level = 0.75, linewidth = 1.2) +
  scale_color_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                              "SL" = "Slider", "CH" = "Changeup",
                              "CU" = "Curveball")) +
  labs(
    title = "Pitch Movement Profile",
    subtitle = "Catcher's perspective (RHP)",
    x = "Horizontal Break (inches)",
    y = "Induced Vertical Break (inches)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  coord_fixed()

# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
  ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                              "SL" = "Slider", "CH" = "Changeup",
                              "CU" = "Curveball")) +
  labs(
    title = "Velocity vs. Spin Rate",
    x = "Release Speed (mph)",
    y = "Spin Rate (rpm)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
  filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
  ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = pitch_colors,
                   labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                             "SL" = "Slider", "CH" = "Changeup",
                             "CU" = "Curveball")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Pitch Usage by Count",
    x = "Count",
    y = "Usage %",
    fill = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
  plot_annotation(
    title = "Comprehensive Pitcher Arsenal Analysis",
    subtitle = "Example Pitcher - 2024 Season",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined_plot)

# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")

  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)

  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")

  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)

  if(nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for(i in 1:nrow(underused)) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }

  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)

  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")

  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}

generate_recommendations(pitcher_data, pitch_effectiveness)

# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")

Portfolio Presentation Tips:


  • Include interactive visualizations (consider using plotly)

  • Compare pitcher to league averages

  • Add context about pitcher role and team strategy

  • Discuss limitations (sample size, park factors, etc.)

  • Provide actionable recommendations

Exercise 24.2: Player Aging Curves and Performance Projection

Objective: Build aging curves for different player skills and create a performance projection system.

Skills Demonstrated: Statistical modeling, time series analysis, predictive analytics, data visualization

Python Implementation:

# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(0,
                                      years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                           curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (oldest -> newest gets 1, 2, 3,
        # so the most recent season carries the most weight)
        n_recent = min(len(player_history), 3)
        weights = np.arange(1, n_recent + 1)
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(player_history[metric].iloc[-3:],
                                weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate projection accuracy on historical data.
    """
    results = []

    for player_id in data['player_id'].unique():
        player_data = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(player_data) < 4:
            continue

        # Use all but last season for projection
        train_data = player_data.iloc[:-1]
        actual_data = player_data.iloc[-1]

        if len(train_data) < 3:
            continue

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)

            for metric in ['wRC_plus', 'ISO', 'AVG']:
                if metric in projection:
                    results.append({
                        'player_id': player_id,
                        'metric': metric,
                        'actual': actual_data[metric],
                        'projected': projection[metric],
                        'error': projection[metric] - actual_data[metric]
                    })
        except Exception:
            continue

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                  alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
               'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                    fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")

Extension Ideas:

  • Incorporate minor league translation factors
  • Add injury risk modeling
  • Create playing time projections
  • Develop position-specific aging curves
  • Compare to established projection systems such as Steamer or ZiPS (see the sketch below)
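
A minimal sketch of that last idea, benchmarking against a public system; the file and column names are hypothetical stand-ins for your own exports (FanGraphs publishes Steamer and ZiPS projections as downloadable CSVs):

# Benchmark personal projections against a public system (hypothetical files)
import pandas as pd

mine = pd.read_csv("my_projections.csv")     # columns: player_id, wRC_plus
steamer = pd.read_csv("steamer_2024.csv")    # columns: player_id, wRC_plus

merged = mine.merge(steamer, on="player_id", suffixes=("_mine", "_steamer"))
diff = merged["wRC_plus_mine"] - merged["wRC_plus_steamer"]

print(f"Mean absolute difference: {diff.abs().mean():.1f}")
print(f"Correlation: {merged['wRC_plus_mine'].corr(merged['wRC_plus_steamer']):.3f}")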

Exercise 24.3: Draft Value Analysis and Strategy Optimization

Objective: Analyze historical draft performance to quantify pick value and optimize draft strategy.

Skills Demonstrated: Data analysis, value modeling, strategic thinking, data visualization

Key Analysis Components:

# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(survival)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

print(sprintf("Generated %d draft picks from %d drafts",
              nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

print("\nValue Curve Model:")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", linewidth = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

print("\nPosition-Specific Success Rates:")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

print("\nCollege vs High School Performance:")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

print("\nReturn on Investment by Pick Range:")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  # Simple optimization: maximize expected WAR given bonus pool constraints

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while(nrow(remaining_picks) > 0 & remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if(nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft but consider positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")

Portfolio Enhancement:

  • Add international signing analysis
  • Compare team draft performance (see the sketch after this list)
  • Analyze specific draft classes
  • Include financial constraints modeling
  • Compare to prospect ranking systems
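
A sketch of the team-comparison idea, assuming the draft table is extended with a team column (the simulated data above does not include one); pandas is used here for brevity:

# Team-level draft performance report (assumes a 'team' column exists)
import pandas as pd

def team_draft_report(draft_df: pd.DataFrame) -> pd.DataFrame:
    # Rank teams by total WAR drafted and by the rate of picks reaching MLB
    return (draft_df
            .groupby("team")
            .agg(picks=("reached_mlb", "size"),
                 mlb_rate=("reached_mlb", "mean"),
                 total_war=("war_if_mlb", "sum"))
            .sort_values("total_war", ascending=False))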

Exercise 24.4: Defensive Positioning and Shift Analysis

Objective: Analyze defensive positioning effectiveness using batted ball data.

Skills Demonstrated: Spatial analysis, causal inference, strategic analysis, data visualization

Implementation Framework:

# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches

# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)

# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
    """
    Simulate batted ball locations and outcomes.
    Coordinates in feet from home plate.
    """
    np.random.seed(42)

    data = []

    for _ in range(n_balls):
        # Batter handedness
        stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])

        # Shift decision (more common vs pull hitters)
        is_shifter = np.random.random() < 0.3
        shift_on = is_shifter and (np.random.random() < 0.7)

        # Hit location (pull tendency varies)
        if stand == 'R':
            # Righties pull left
            if is_shifter:
                angle = np.random.normal(-25, 35)  # Pull-heavy
            else:
                angle = np.random.normal(-10, 45)  # Balanced
        else:
            # Lefties pull right
            if is_shifter:
                angle = np.random.normal(25, 35)
            else:
                angle = np.random.normal(10, 45)

        # Distance based on exit velo and launch angle
        exit_velo = np.random.normal(88, 8)
        launch_angle = np.random.normal(12, 18)

        # Simplified distance calculation
        distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
        distance = max(50, min(400, distance + np.random.normal(0, 20)))

        # Convert to x, y coordinates
        angle_rad = np.radians(angle)
        x = distance * np.sin(angle_rad)
        y = distance * np.cos(angle_rad)

        # Hit outcome (shift effectiveness)
        if shift_on:
            # Shift reduces hits in pull direction
            if stand == 'R' and x < -50:
                prob_hit = 0.18  # Reduced by shift
            elif stand == 'L' and x > 50:
                prob_hit = 0.18
            else:
                prob_hit = 0.28  # Normal rate
        else:
            prob_hit = 0.25

        # Adjust for distance (harder to field)
        prob_hit = min(0.95, prob_hit * (distance / 250))

        is_hit = np.random.random() < prob_hit

        data.append({
            'x': x,
            'y': y,
            'distance': distance,
            'angle': angle,
            'exit_velo': exit_velo,
            'launch_angle': launch_angle,
            'stand': stand,
            'shift_on': shift_on,
            'is_hit': is_hit,
            'is_shifter': is_shifter
        })

    return pd.DataFrame(data)

# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)

print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")

# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
    'is_hit': ['mean', 'count'],
    'exit_velo': 'mean'
}).round(3)

print("\nShift Effectiveness:")
print(shift_analysis)

# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
    """
    Estimate runs saved by shifting.
    """
    results = []

    for stand in ['R', 'L']:
        for shifter in [True, False]:
            subset = data[(data['stand'] == stand) &
                         (data['is_shifter'] == shifter)]

            if len(subset) == 0:
                continue

            shifted = subset[subset['shift_on'] == True]
            no_shift = subset[subset['shift_on'] == False]

            if len(shifted) > 0 and len(no_shift) > 0:
                babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
                # Approximate run value per hit prevented: ~0.5 runs
                runs_saved_per_pa = babip_diff * 0.5

                results.append({
                    'stand': stand,
                    'is_shifter': shifter,
                    'shifted_babip': shifted['is_hit'].mean(),
                    'no_shift_babip': no_shift['is_hit'].mean(),
                    'babip_diff': babip_diff,
                    'runs_saved_per_100pa': runs_saved_per_pa * 100,
                    'n_shifted': len(shifted),
                    'n_no_shift': len(no_shift)
                })

    return pd.DataFrame(results)

shift_value = calculate_shift_value(bb_data)

print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))

# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
    """
    Plot baseball field with hit locations.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Draw field outline
    # Infield dirt
    infield = patches.Wedge((0, 0), 95, 45, 135,
                           facecolor='tan', alpha=0.3)
    ax.add_patch(infield)

    # Outfield grass
    outfield = patches.Wedge((0, 0), 400, 45, 135,
                            facecolor='green', alpha=0.1)
    ax.add_patch(outfield)

    # Foul lines
    ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
    ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)

    # Plot hits
    hits = data[data['is_hit'] == True]
    outs = data[data['is_hit'] == False]

    ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
              s=20, label='Out')
    ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
              s=30, label='Hit')

    ax.set_xlim(-320, 320)
    ax.set_ylim(0, 400)
    ax.set_aspect('equal')
    ax.set_xlabel('Distance from center (ft)', fontsize=11)
    ax.set_ylabel('Distance from home (ft)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.2)

    return ax

# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
                      (bb_data['is_shifter'] == True)]

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == False],
    'RHB Pull Hitter - No Shift',
    ax=ax1
)

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == True],
    'RHB Pull Hitter - Shift On',
    ax=ax2
)

plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")

# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
    """
    Create BABIP heat map for given conditions.
    """
    subset = data[(data['shift_on'] == shift_status) &
                  (data['stand'] == stand)]

    # Create grid
    x_bins = np.linspace(-250, 250, 25)
    y_bins = np.linspace(50, 350, 20)

    grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
    grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))

    for i in range(len(y_bins)-1):
        for j in range(len(x_bins)-1):
            mask = ((subset['x'] >= x_bins[j]) &
                   (subset['x'] < x_bins[j+1]) &
                   (subset['y'] >= y_bins[i]) &
                   (subset['y'] < y_bins[i+1]))

            cell_data = subset[mask]
            if len(cell_data) >= 5:  # Minimum sample
                grid_babip[i, j] = cell_data['is_hit'].mean()
                grid_count[i, j] = len(cell_data)
            else:
                grid_babip[i, j] = np.nan

    return grid_babip, x_bins, y_bins, grid_count

# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stand in enumerate(['R', 'L']):
    for j, shift_on in enumerate([False, True]):
        ax = axes[i, j]

        shifters = bb_data[bb_data['is_shifter'] == True]
        grid, x_bins, y_bins, counts = create_babip_heatmap(
            shifters, shift_on, stand
        )

        im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
                                     y_bins[0], y_bins[-1]],
                      origin='lower', cmap='RdYlGn_r',
                      vmin=0, vmax=0.5, aspect='auto')

        shift_text = "Shift On" if shift_on else "No Shift"
        hand_text = "RHB" if stand == 'R' else "LHB"
        ax.set_title(f'{hand_text} - {shift_text}',
                    fontsize=11, fontweight='bold')
        ax.set_xlabel('Horizontal Position (ft)')
        ax.set_ylabel('Distance from Home (ft)')

        # Add colorbar
        plt.colorbar(im, ax=ax, label='BABIP')

plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")

# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features for shift decision model
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
                      ((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)

X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate: ROC-AUC on predicted probabilities, classification report on labels
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
                      model.coef_[0]):
    print(f"  {feat}: {coef:.3f}")

# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)

print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
    print(f"   {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")

print("\n2. WHEN TO SHIFT")
print("   - Strong pull tendency (>70% pull rate)")
print("   - Ground ball hitters (LA < 10°)")
print("   - Extreme pull hitters benefit most from aggressive shifts")

print("\n3. SHIFT VARIATIONS")
print("   - Full shift: 3 infielders on pull side")
print("   - Partial shift: 2.5 infielders pull side")
print("   - No shift: Traditional alignment")
print("   - Decision should consider:")
print("     * Batter's spray chart")
print("     * Game situation (runners, outs)")
print("     * Pitcher's ground ball rate")

print("\n4. LIMITATIONS & CONSIDERATIONS")
print("   - Shift beaten by opposite field hits")
print("   - Bunt defense vulnerabilities")
print("   - Runner advancement opportunities")
print("   - Pitcher-specific adjustments")

print("\n5. FUTURE ANALYSIS")
print("   - Pitcher-specific positioning")
print("   - Count-based positioning adjustments")
print("   - Outfield positioning optimization")
print("   - Real-time adjustment algorithms")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)

Portfolio Development Tips:

  • Use real Statcast spray chart data when possible (see the sketch after this list)
  • Incorporate expected outcomes (xBA, xwOBA)
  • Add video analysis component
  • Compare to MLB team shift strategies
  • Analyze shift effectiveness by ballpark
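
For the first two tips, the open-source pybaseball package can pull real Statcast data, including spray-chart coordinates and expected-outcome columns. A minimal sketch, assuming pybaseball is installed; the MLBAM player ID is illustrative, and exact column availability depends on the Statcast export:

# Pull real spray-chart data with pybaseball (illustrative player ID)
from pybaseball import statcast_batter

df = statcast_batter("2023-04-01", "2023-10-01", player_id=514888)

# Keep balls in play that have hit coordinates and expected outcomes
bip = df.dropna(subset=["hc_x", "hc_y"])
cols = ["hc_x", "hc_y", "stand",
        "estimated_ba_using_speedangle",
        "estimated_woba_using_speedangle"]
print(bip[cols].head())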


Chapter Summary

Building a career in baseball analytics requires a combination of technical skills, baseball knowledge, and strategic career planning. Key takeaways:

Skills Development: Master programming (R, Python, SQL), statistical methods, and baseball domain knowledge. Build a strong portfolio demonstrating diverse analytical capabilities.

Career Planning: Understand the various roles and career paths in baseball analytics. Network actively, apply strategically, and prepare thoroughly for technical and behavioral interviews.

Continuous Learning: Stay current with new analytical methods, data sources, and baseball trends. Engage with the community through conferences, publications, and collaborations.

Portfolio Projects: Complete projects that showcase your ability to collect and clean data, apply appropriate statistical methods, create compelling visualizations, and communicate actionable insights.

The field of baseball analytics continues to evolve rapidly. Success requires not just technical proficiency, but also creativity, communication skills, and genuine passion for understanding the game through data. The exercises in this chapter provide starting points for building portfolio projects that demonstrate your readiness for a career in this exciting field.

R
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(survival)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:2022,
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

print(sprintf("Generated %d draft picks from %d drafts",
              nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

print("\nMLB Success Rate by Round:")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

print("\nValue Curve Model:")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", size = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

print("\nPosition-Specific Success Rates:")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

print("\nCollege vs High School Performance:")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

print("\nReturn on Investment by Pick Range:")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  """
  Simple optimization: maximize expected WAR given bonus pool constraints
  """

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while(nrow(remaining_picks) > 0 & remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if(nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

print("\n=== DRAFT STRATEGY OPTIMIZATION ===")
print(sprintf("\nAvailable Picks: %s", paste(example_picks, collapse = ", ")))
print(sprintf("Bonus Pool: $%.1fM\n", example_budget / 1000000))
print("Optimized Selection:")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft but consider positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")
Python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from scipy.optimize import curve_fit
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(0,
                                      years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for consecutive seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age'])

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                           curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Weight recent seasons more heavily (3:2:1 for last 3 years)
        weights = np.array([3, 2, 1])[:len(player_history)]
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(player_history[metric].iloc[-3:],
                                weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
test_player_id = player_data.groupby('player_id').size()
test_player_id = test_player_id[test_player_id >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector, test_seasons=[2023, 2024]):
    """
    Evaluate projection accuracy on historical data.
    """
    results = []

    for player_id in data['player_id'].unique():
        player_data = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(player_data) < 4:
            continue

        # Use all but last season for projection
        train_data = player_data.iloc[:-1]
        actual_data = player_data.iloc[-1]

        if len(train_data) < 3:
            continue

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)

            for metric in ['wRC_plus', 'ISO', 'AVG']:
                if metric in projection:
                    results.append({
                        'player_id': player_id,
                        'metric': metric,
                        'actual': actual_data[metric],
                        'projected': projection[metric],
                        'error': projection[metric] - actual_data[metric]
                    })
        except:
            continue

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                  alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
               'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                    fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")

Practice Exercises

Reinforce what you've learned with these hands-on exercises. Try to solve them on your own before viewing hints or solutions.

4 exercises
Tips for Success
  • Read the problem carefully before starting to code
  • Break down complex problems into smaller steps
  • Use the hints if you're stuck - they won't give away the answer
  • After solving, compare your approach with the solution
Exercise 24.1
Pitcher Arsenal Analysis and Optimization
Hard
**Objective**: Analyze a pitcher's repertoire using Statcast data and provide recommendations for pitch usage optimization.

**Skills Demonstrated**: Data acquisition, exploratory analysis, visualization, strategic thinking

**Project Steps**:

1. Acquire Statcast pitch-level data for a pitcher (use baseballr package or Baseball Savant)
2. Analyze pitch characteristics (velocity, movement, spin)
3. Evaluate pitch effectiveness by count and situation
4. Identify optimization opportunities
5. Create compelling visualizations
6. Write executive summary with recommendations

**R Implementation**:

```r
# Pitcher Arsenal Analysis
# This project analyzes pitcher stuff and usage patterns

library(tidyverse)
library(baseballr)
library(ggplot2)
library(patchwork)

# Function to get pitcher Statcast data
get_pitcher_data <- function(pitcher_name, start_date, end_date) {
  # In practice, use scrape_statcast_savant_pitcher()
  # For this example, we'll simulate data

  set.seed(123)
  n_pitches <- 2500

  tibble(
    pitch_type = sample(
      c("FF", "SI", "SL", "CH", "CU"),
      n_pitches,
      replace = TRUE,
      prob = c(0.40, 0.15, 0.25, 0.15, 0.05)
    ),
    release_speed = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 94.5, 1.2),
      pitch_type == "SI" ~ rnorm(n_pitches, 93.8, 1.1),
      pitch_type == "SL" ~ rnorm(n_pitches, 85.2, 1.5),
      pitch_type == "CH" ~ rnorm(n_pitches, 86.5, 1.3),
      pitch_type == "CU" ~ rnorm(n_pitches, 78.5, 1.8)
    ),
    pfx_x = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, -6.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, -12.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 3.5, 2.5),
      pitch_type == "CH" ~ rnorm(n_pitches, -8.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, 5.5, 3)
    ),
    pfx_z = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 14.5, 2),
      pitch_type == "SI" ~ rnorm(n_pitches, 11.5, 2),
      pitch_type == "SL" ~ rnorm(n_pitches, 2.5, 2),
      pitch_type == "CH" ~ rnorm(n_pitches, 6.5, 2),
      pitch_type == "CU" ~ rnorm(n_pitches, -5.5, 3)
    ),
    release_spin_rate = case_when(
      pitch_type == "FF" ~ rnorm(n_pitches, 2350, 100),
      pitch_type == "SI" ~ rnorm(n_pitches, 2150, 100),
      pitch_type == "SL" ~ rnorm(n_pitches, 2550, 150),
      pitch_type == "CH" ~ rnorm(n_pitches, 1750, 100),
      pitch_type == "CU" ~ rnorm(n_pitches, 2650, 150)
    ),
    balls = sample(0:3, n_pitches, replace = TRUE),
    strikes = sample(0:2, n_pitches, replace = TRUE),
    stand = sample(c("R", "L"), n_pitches, replace = TRUE, prob = c(0.6, 0.4)),
    description = sample(
      c("called_strike", "ball", "swinging_strike", "foul", "hit_into_play"),
      n_pitches,
      replace = TRUE,
      prob = c(0.15, 0.35, 0.12, 0.20, 0.18)
    ),
    launch_speed = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 87, 10), NA),
    launch_angle = ifelse(description == "hit_into_play",
                          rnorm(n_pitches, 12, 20), NA),
    estimated_woba_using_speedangle = ifelse(
      description == "hit_into_play",
      pmin(pmax(rnorm(n_pitches, 0.320, 0.150), 0), 2.000),
      NA
    )
  )
}

# Get data
pitcher_data <- get_pitcher_data("Example Pitcher", "2024-04-01", "2024-09-30")

# 1. Pitch Mix Analysis
pitch_mix <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    n = n(),
    pct = n() / nrow(pitcher_data),
    avg_velo = mean(release_speed, na.rm = TRUE),
    avg_spin = mean(release_spin_rate, na.rm = TRUE)
  ) %>%
  arrange(desc(n))

cat("Pitch Mix:\n")
print(pitch_mix)

# 2. Pitch Effectiveness by Type
pitch_effectiveness <- pitcher_data %>%
  group_by(pitch_type) %>%
  summarize(
    usage = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    csw_rate = mean(description %in% c("called_strike", "swinging_strike"),
                    na.rm = TRUE),
    avg_exit_velo = mean(launch_speed, na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    # Proxy only: whiffs in counts with at least one ball
    # (a true chase rate requires pitch location data)
    chase_rate = mean(description == "swinging_strike" & balls > 0, na.rm = TRUE)
  ) %>%
  arrange(desc(csw_rate))

cat("\nPitch Effectiveness:\n")
print(pitch_effectiveness)

# 3. Count-Based Analysis
count_analysis <- pitcher_data %>%
  mutate(count = paste0(balls, "-", strikes)) %>%
  group_by(count, pitch_type) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(count) %>%
  mutate(usage_pct = n / sum(n)) %>%
  arrange(count, desc(usage_pct))

# 4. Platoon Splits
platoon_splits <- pitcher_data %>%
  group_by(pitch_type, stand) %>%
  summarize(
    n = n(),
    whiff_rate = mean(description == "swinging_strike", na.rm = TRUE),
    avg_xwoba = mean(estimated_woba_using_speedangle, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(
    names_from = stand,
    values_from = c(n, whiff_rate, avg_xwoba),
    names_sep = "_"
  )

cat("\nPlatoon Splits:\n")
print(platoon_splits)

# 5. Visualization: Pitch Movement Chart
pitch_colors <- c(
  "FF" = "#d22d49", "SI" = "#FE9D00",
  "SL" = "#00D1ED", "CH" = "#1DBE3A", "CU" = "#AB87FF"
)

movement_plot <- ggplot(pitcher_data,
                        aes(x = pfx_x, y = pfx_z, color = pitch_type)) +
  geom_point(alpha = 0.3, size = 2) +
  stat_ellipse(level = 0.75, linewidth = 1.2) +
  scale_color_manual(values = pitch_colors,
                     labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                                "SL" = "Slider", "CH" = "Changeup",
                                "CU" = "Curveball")) +
  labs(
    title = "Pitch Movement Profile",
    subtitle = "Catcher's perspective (RHP)",
    x = "Horizontal Break (inches)",
    y = "Induced Vertical Break (inches)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "right"
  ) +
  coord_fixed()

# 6. Visualization: Velocity and Spin by Pitch
velo_spin_plot <- pitcher_data %>%
  ggplot(aes(x = release_speed, y = release_spin_rate, color = pitch_type)) +
  geom_point(alpha = 0.4, size = 2) +
  scale_color_manual(values = pitch_colors,
                     labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                                "SL" = "Slider", "CH" = "Changeup",
                                "CU" = "Curveball")) +
  labs(
    title = "Velocity vs. Spin Rate",
    x = "Release Speed (mph)",
    y = "Spin Rate (rpm)",
    color = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# 7. Visualization: Usage by Count
count_usage_plot <- count_analysis %>%
  filter(count %in% c("0-0", "1-0", "0-1", "2-0", "1-1", "0-2", "3-2")) %>%
  ggplot(aes(x = count, y = usage_pct, fill = pitch_type)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = pitch_colors,
                    labels = c("FF" = "Four-Seam", "SI" = "Sinker",
                               "SL" = "Slider", "CH" = "Changeup",
                               "CU" = "Curveball")) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Pitch Usage by Count",
    x = "Count",
    y = "Usage %",
    fill = "Pitch Type"
  ) +
  theme_minimal() +
  theme(legend.position = "right")

# Combine plots
combined_plot <- (movement_plot | velo_spin_plot) / count_usage_plot +
  plot_annotation(
    title = "Comprehensive Pitcher Arsenal Analysis",
    subtitle = "Example Pitcher - 2024 Season",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined_plot)

# 8. Recommendations Function
generate_recommendations <- function(data, effectiveness) {
  cat("\n=== PITCH USAGE RECOMMENDATIONS ===\n\n")

  # Best pitch
  best_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_max(csw_rate, n = 1)

  cat("1. PRIMARY WEAPON\n")
  cat(sprintf("   - %s showing elite CSW rate of %.1f%%\n",
              best_pitch$pitch_type, best_pitch$csw_rate * 100))
  cat("   - Maintain high usage in favorable counts\n\n")

  # Underused effective pitch
  underused <- effectiveness %>%
    filter(usage < quantile(effectiveness$usage, 0.33)) %>%
    filter(csw_rate > 0.30)

  if (nrow(underused) > 0) {
    cat("2. USAGE OPTIMIZATION\n")
    for (i in seq_len(nrow(underused))) {
      cat(sprintf("   - Consider increasing %s usage (current: %d pitches)\n",
                  underused$pitch_type[i], underused$usage[i]))
      cat(sprintf("     Shows strong CSW rate: %.1f%%\n",
                  underused$csw_rate[i] * 100))
    }
    cat("\n")
  }

  # Weak pitch
  weak_pitch <- effectiveness %>%
    filter(usage >= 50) %>%
    slice_min(csw_rate, n = 1)

  cat("3. PITCH DEVELOPMENT FOCUS\n")
  cat(sprintf("   - %s showing below-average performance\n",
              weak_pitch$pitch_type))
  cat(sprintf("   - CSW rate: %.1f%% vs. league average ~28%%\n",
              weak_pitch$csw_rate * 100))
  cat("   - Consider: velocity increase, movement adjustment, or reduced usage\n\n")

  cat("4. STRATEGIC ADJUSTMENTS\n")
  cat("   - Review count-specific usage patterns\n")
  cat("   - Analyze platoon splits for pitch selection\n")
  cat("   - Consider sequencing effects (not shown in basic analysis)\n")
  cat("   - Monitor fatigue impact on pitch quality\n")
}

generate_recommendations(pitcher_data, pitch_effectiveness)

# Save results
cat("\n\nSaving analysis results...\n")
# ggsave("pitcher_arsenal_analysis.png", combined_plot, width = 14, height = 10)
# write_csv(pitch_effectiveness, "pitch_effectiveness_summary.csv")
cat("Analysis complete!\n")
```

**Portfolio Presentation Tips**:
- Include interactive visualizations (consider using plotly; see the sketch after this list)
- Compare pitcher to league averages
- Add context about pitcher role and team strategy
- Discuss limitations (sample size, park factors, etc.)
- Provide actionable recommendations
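
If you want the movement chart to be interactive, as the plotly tip suggests, here is a minimal Python sketch. The data frame and `pfx_x`/`pfx_z`/`release_speed`/`pitch_type` columns mirror the simulated Statcast fields above, and the output file name is arbitrary:

```python
# Interactive pitch movement chart with plotly
# (column names mirror the simulated Statcast fields in the exercise above)
import numpy as np
import pandas as pd
import plotly.express as px

rng = np.random.default_rng(123)
pitches = pd.DataFrame({
    'pitch_type': rng.choice(['FF', 'SL', 'CH'], size=300),
    'pfx_x': rng.normal(-3, 6, 300),        # horizontal break (inches)
    'pfx_z': rng.normal(8, 6, 300),         # induced vertical break (inches)
    'release_speed': rng.normal(90, 4, 300)
})

fig = px.scatter(
    pitches, x='pfx_x', y='pfx_z', color='pitch_type',
    hover_data=['release_speed'],
    labels={'pfx_x': 'Horizontal Break (in)',
            'pfx_z': 'Induced Vertical Break (in)'},
    title='Pitch Movement Profile (interactive)'
)
fig.write_html('pitch_movement.html')  # open in any browser
```

The same `px.scatter()` call works unchanged on a real Statcast pull, which makes this an easy upgrade for a portfolio piece.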
Exercise 24.2
Player Aging Curves and Performance Projection
Hard
**Objective**: Build aging curves for different player skills and create a performance projection system.

**Skills Demonstrated**: Statistical modeling, time series analysis, predictive analytics, data visualization

**Python Implementation**:

```python
# Player Aging Curves and Projection System
# Analyzing how player skills change with age

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 10)

# Generate simulated player-season data
def generate_player_data(n_players=500, years_range=(2010, 2024)):
    """
    Generate simulated player career data.
    In practice, this would come from Baseball Reference or FanGraphs.
    """
    np.random.seed(42)

    players = []
    for player_id in range(n_players):
        # Random career start age (20-25)
        start_age = np.random.randint(20, 26)
        # Random career length (2-15 years)
        career_length = np.random.randint(2, 16)

        # Peak age varies (26-30)
        peak_age = np.random.randint(26, 31)
        # Peak performance level
        peak_wrc_plus = np.random.normal(110, 20)

        for year_in_career in range(career_length):
            age = start_age + year_in_career
            season = years_range[0] + np.random.randint(
                0, years_range[1] - years_range[0])

            # Age-based performance (simplified aging curve)
            age_factor = 1 - (abs(age - peak_age) / 15) ** 1.8
            base_wrc = peak_wrc_plus * age_factor

            # Add random variation
            wrc_plus = max(50, base_wrc + np.random.normal(0, 15))

            # Other stats correlated with wRC+
            pa = np.random.randint(300, 650)
            avg = 0.200 + (wrc_plus / 1000) + np.random.normal(0, 0.025)
            obp = avg + 0.060 + np.random.normal(0, 0.020)
            slg = avg + 0.150 + (wrc_plus / 800) + np.random.normal(0, 0.040)

            players.append({
                'player_id': player_id,
                'age': age,
                'season': season,
                'PA': pa,
                'AVG': np.clip(avg, 0.150, 0.400),
                'OBP': np.clip(obp, 0.250, 0.500),
                'SLG': np.clip(slg, 0.300, 0.700),
                'wRC_plus': wrc_plus,
                'ISO': np.clip(slg - avg, 0.050, 0.350)
            })

    return pd.DataFrame(players)

# Generate data
print("Generating player data...")
player_data = generate_player_data(n_players=800)

print(f"\nDataset: {len(player_data)} player-seasons")
print(f"Age range: {player_data['age'].min()} to {player_data['age'].max()}")
print(f"Players: {player_data['player_id'].nunique()}")

# 1. Calculate Aging Curves using Delta Method
def calculate_aging_curve_delta(df, metric, min_pa=300):
    """
    Calculate aging curve using year-to-year delta method.
    This controls for selection bias better than simple averaging.
    """
    # Filter for qualifying seasons
    df_sorted = df[df['PA'] >= min_pa].sort_values(['player_id', 'age']).copy()

    # Calculate year-to-year changes
    df_sorted['next_age'] = df_sorted.groupby('player_id')['age'].shift(-1)
    df_sorted['next_metric'] = df_sorted.groupby('player_id')[metric].shift(-1)
    df_sorted['metric_delta'] = df_sorted['next_metric'] - df_sorted[metric]

    # Keep only consecutive seasons
    df_deltas = df_sorted[df_sorted['next_age'] == df_sorted['age'] + 1].copy()

    # Group by age and calculate average change
    aging_curve = df_deltas.groupby('age').agg({
        'metric_delta': ['mean', 'std', 'count'],
        metric: 'mean'
    }).reset_index()

    aging_curve.columns = ['age', 'delta_mean', 'delta_std', 'n', 'avg_level']

    return aging_curve

# Calculate aging curves for multiple metrics
print("\nCalculating aging curves...")

metrics = ['wRC_plus', 'ISO', 'AVG', 'OBP']
aging_curves = {}

for metric in metrics:
    aging_curves[metric] = calculate_aging_curve_delta(player_data, metric)
    print(f"  {metric}: {len(aging_curves[metric])} age points")

# 2. Fit Polynomial Aging Curve
def fit_aging_curve(aging_data, age_col='age', delta_col='delta_mean'):
    """
    Fit a polynomial curve to aging data.
    """
    # Use weighted regression (weight by sample size)
    weights = np.sqrt(aging_data['n'])

    # Polynomial features (degree 2)
    X = aging_data[age_col].values.reshape(-1, 1)
    y = aging_data[delta_col].values

    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)

    model = Ridge(alpha=1.0)
    model.fit(X_poly, y, sample_weight=weights)

    return model, poly

# Fit curves
fitted_models = {}
for metric in metrics:
    fitted_models[metric] = fit_aging_curve(aging_curves[metric])
    print(f"Fitted aging curve for {metric}")

# 3. Visualize Aging Curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics):
    ax = axes[idx]
    curve_data = aging_curves[metric]
    model, poly = fitted_models[metric]

    # Plot raw deltas
    ax.scatter(curve_data['age'], curve_data['delta_mean'],
               s=curve_data['n']*2, alpha=0.6, label='Observed')

    # Plot fitted curve
    age_range = np.linspace(curve_data['age'].min(),
                            curve_data['age'].max(), 100)
    X_pred = poly.transform(age_range.reshape(-1, 1))
    y_pred = model.predict(X_pred)

    ax.plot(age_range, y_pred, 'r-', linewidth=2, label='Fitted Curve')
    ax.axhline(y=0, color='black', linestyle='--', alpha=0.3)

    ax.set_xlabel('Age', fontsize=11)
    ax.set_ylabel(f'{metric} Year-to-Year Change', fontsize=11)
    ax.set_title(f'{metric} Aging Curve', fontsize=12, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('aging_curves.png', dpi=300, bbox_inches='tight')
print("\nAging curves visualization saved")

# 4. Build Projection System
class PlayerProjector:
    """
    Project player performance based on recent history and aging curves.
    """

    def __init__(self, aging_models):
        self.aging_models = aging_models

    def project_player(self, player_history, years_forward=1):
        """
        Project player performance forward.

        Parameters:
        -----------
        player_history : DataFrame
            Recent seasons for player (last 3 years recommended)
        years_forward : int
            Number of years to project forward

        Returns:
        --------
        dict : Projected statistics
        """
        # Use up to the last 3 seasons, weighting the most recent highest
        # (e.g. 1:2:3, so the newest season carries weight 3)
        recent = player_history.iloc[-3:]
        weights = np.arange(1, len(recent) + 1)
        weights = weights / weights.sum()

        # Current age and baseline performance
        current_age = player_history['age'].iloc[-1]

        projections = {}

        for metric in self.aging_models.keys():
            if metric not in player_history.columns:
                continue

            # Weighted average of recent performance
            baseline = np.average(recent[metric], weights=weights)

            # Apply aging curve
            model, poly = self.aging_models[metric]

            # Project forward, compounding the aging adjustment each year
            projected_value = baseline
            for year in range(years_forward):
                age = current_age + year + 1
                X_age = poly.transform([[age]])
                age_adjustment = model.predict(X_age)[0]
                projected_value += age_adjustment

            projections[metric] = projected_value

        projections['age'] = current_age + years_forward
        projections['projection_years'] = years_forward

        return projections

# 5. Test Projection System
projector = PlayerProjector(fitted_models)

# Select a random player with at least 3 seasons
season_counts = player_data.groupby('player_id').size()
test_player_id = season_counts[season_counts >= 3].sample(1).index[0]

test_player_data = player_data[player_data['player_id'] == test_player_id].sort_values('age')

print(f"\n{'='*60}")
print(f"PROJECTION EXAMPLE - Player {test_player_id}")
print(f"{'='*60}")

print("\nRecent Performance:")
print(test_player_data[['age', 'PA', 'AVG', 'OBP', 'SLG', 'wRC_plus']].tail(3).to_string(index=False))

# Project next 3 years
print("\nProjections:")
print(f"{'Year':<6} {'Age':<5} {'wRC+':<8} {'ISO':<8} {'AVG':<8} {'OBP':<8}")
print("-" * 50)

for year in range(1, 4):
    projection = projector.project_player(test_player_data, years_forward=year)
    print(f"+{year:<5} {projection['age']:<5.0f} "
          f"{projection.get('wRC_plus', 0):<8.1f} "
          f"{projection.get('ISO', 0):<8.3f} "
          f"{projection.get('AVG', 0):<8.3f} "
          f"{projection.get('OBP', 0):<8.3f}")

# 6. Projection Accuracy Analysis
def evaluate_projections(data, projector):
    """
    Evaluate projection accuracy on historical data:
    project each player's final season from the seasons before it.
    """
    results = []

    for player_id in data['player_id'].unique():
        history = data[data['player_id'] == player_id].sort_values('age')

        # Need at least 4 seasons (3 to project, 1 to validate)
        if len(history) < 4:
            continue

        # Use all but the last season for projection
        train_data = history.iloc[:-1]
        actual_data = history.iloc[-1]

        # Make projection
        try:
            projection = projector.project_player(train_data, years_forward=1)
        except Exception:
            continue

        for metric in ['wRC_plus', 'ISO', 'AVG']:
            if metric in projection:
                results.append({
                    'player_id': player_id,
                    'metric': metric,
                    'actual': actual_data[metric],
                    'projected': projection[metric],
                    'error': projection[metric] - actual_data[metric]
                })

    return pd.DataFrame(results)

print("\n\nEvaluating projection accuracy...")
evaluation = evaluate_projections(player_data, projector)

print("\nProjection Accuracy by Metric:")
print(f"{'Metric':<12} {'MAE':<10} {'RMSE':<10} {'R²':<10}")
print("-" * 45)

for metric in ['wRC_plus', 'ISO', 'AVG']:
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        mae = np.abs(metric_eval['error']).mean()
        rmse = np.sqrt((metric_eval['error'] ** 2).mean())

        # Calculate R²
        actual = metric_eval['actual'].values
        predicted = metric_eval['projected'].values
        ss_res = np.sum((actual - predicted) ** 2)
        ss_tot = np.sum((actual - actual.mean()) ** 2)
        r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0

        print(f"{metric:<12} {mae:<10.3f} {rmse:<10.3f} {r2:<10.3f}")

# 7. Visualize Projection Accuracy
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, metric in enumerate(['wRC_plus', 'ISO', 'AVG']):
    ax = axes[idx]
    metric_eval = evaluation[evaluation['metric'] == metric]

    if len(metric_eval) > 0:
        ax.scatter(metric_eval['actual'], metric_eval['projected'],
                   alpha=0.4, s=30)

        # Add y=x line
        min_val = min(metric_eval['actual'].min(), metric_eval['projected'].min())
        max_val = max(metric_eval['actual'].max(), metric_eval['projected'].max())
        ax.plot([min_val, max_val], [min_val, max_val],
                'r--', linewidth=2, label='Perfect Projection')

        ax.set_xlabel(f'Actual {metric}', fontsize=11)
        ax.set_ylabel(f'Projected {metric}', fontsize=11)
        ax.set_title(f'{metric} Projection Accuracy',
                     fontsize=12, fontweight='bold')
        ax.legend()
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('projection_accuracy.png', dpi=300, bbox_inches='tight')
print("\nProjection accuracy visualization saved")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print("\nKey Findings:")
print("1. Peak performance typically occurs between ages 27-29")
print("2. Decline rates vary by skill type (power vs. contact)")
print("3. Projection systems should weight recent performance heavily")
print("4. Aging adjustments are critical for multi-year projections")
print("\nRecommendations:")
print("- Use 3-year weighted averages for baseline projection")
print("- Apply aging curves derived from delta method")
print("- Consider regression to mean for extreme performances")
print("- Incorporate playing time projections")
print("- Account for injury history in risk assessment")
```

**Extension Ideas**:
- Incorporate minor league translation factors
- Add injury risk modeling
- Create playing time projections
- Develop position-specific aging curves
- Compare to established projection systems (Steamer, ZiPS)
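
One of the recommendations the script prints, regression to the mean, is straightforward to prototype. A minimal sketch of the usual shrinkage form, assuming a league-average wRC+ of 100; the 1,200 PA stabilization constant is a placeholder for illustration, not a fitted value:

```python
# Regress an observed wRC+ toward the league mean based on sample size.
# The stabilization constant is a hypothetical placeholder; in practice
# it would be estimated from year-to-year correlations.
LEAGUE_WRC_PLUS = 100
STABILIZATION_PA = 1200  # placeholder "ballast" in plate appearances

def regressed_wrc_plus(observed: float, pa: int) -> float:
    """Shrink an observed wRC+ toward league average.

    Weight on the observation grows with PA: w = PA / (PA + ballast).
    """
    w = pa / (pa + STABILIZATION_PA)
    return w * observed + (1 - w) * LEAGUE_WRC_PLUS

# A 160 wRC+ over 150 PA regresses much further than over 600 PA
print(regressed_wrc_plus(160, 150))  # ~106.7
print(regressed_wrc_plus(160, 600))  # ~120.0
```

Plugging a regressed baseline into `PlayerProjector` in place of the raw weighted average would temper projections for small-sample breakouts.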
Exercise 24.3
Draft Value Analysis and Strategy Optimization
Hard
**Objective**: Analyze historical draft performance to quantify pick value and optimize draft strategy.

**Skills Demonstrated**: Data analysis, value modeling, strategic thinking, data visualization

**Key Analysis Components**:

```r
# MLB Draft Value Analysis
# Quantifying draft pick value and optimizing strategy

library(tidyverse)
library(ggplot2)
library(scales)

# Generate simulated draft data
generate_draft_data <- function(n_years = 15, rounds = 40) {
  set.seed(42)

  drafts <- expand.grid(
    year = 2008:(2008 + n_years - 1),
    round = 1:rounds,
    pick = 1:30
  ) %>%
    mutate(
      overall_pick = (round - 1) * 30 + pick,
      # Probability of reaching majors decreases with pick
      p_mlb = pmax(0.05, 0.85 * exp(-overall_pick / 100)),
      reached_mlb = rbinom(n(), 1, p_mlb),
      # Career WAR conditional on reaching MLB
      war_if_mlb = ifelse(
        reached_mlb == 1,
        pmax(0, rnorm(n(), 10 * exp(-overall_pick / 50), 8)),
        0
      ),
      # Years to debut
      years_to_debut = ifelse(
        reached_mlb == 1,
        pmax(1, round(rnorm(n(), 3 + round/20, 1.5))),
        NA
      ),
      # Position (simplified)
      position = sample(
        c("P", "C", "IF", "OF"),
        n(),
        replace = TRUE,
        prob = c(0.45, 0.10, 0.25, 0.20)
      ),
      # College vs HS
      player_type = sample(
        c("College", "HS", "International"),
        n(),
        replace = TRUE,
        prob = c(0.55, 0.35, 0.10)
      ),
      # Slot value (simplified formula)
      slot_value = pmax(
        200000,
        12000000 * exp(-overall_pick / 15)
      ),
      # Signing bonus (usually close to slot)
      signing_bonus = slot_value * runif(n(), 0.85, 1.15)
    )

  return(drafts)
}

# Generate data
draft_data <- generate_draft_data()

cat(sprintf("Generated %d draft picks from %d drafts\n",
            nrow(draft_data), n_distinct(draft_data$year)))

# 1. Success Rate by Round
success_by_round <- draft_data %>%
  group_by(round) %>%
  summarize(
    n_picks = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    total_war = sum(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  ) %>%
  filter(round <= 20)  # Focus on first 20 rounds

cat("\nMLB Success Rate by Round:\n")
print(success_by_round %>% head(10))

# 2. Value Curve Estimation
value_curve <- draft_data %>%
  group_by(overall_pick) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    expected_war = mlb_rate * mean(war_if_mlb[war_if_mlb > 0], na.rm = TRUE)
  ) %>%
  filter(overall_pick <= 300)

# Fit exponential decay model
value_model <- nls(
  expected_war ~ a * exp(-b * overall_pick),
  data = value_curve %>% filter(expected_war > 0),
  start = list(a = 10, b = 0.01)
)

# Add fitted values
value_curve$fitted_war <- predict(
  value_model,
  newdata = data.frame(overall_pick = value_curve$overall_pick)
)

cat("\nValue Curve Model:\n")
print(summary(value_model))

# 3. Visualization: Draft Value Curve
value_plot <- ggplot(value_curve, aes(x = overall_pick)) +
  geom_point(aes(y = expected_war), alpha = 0.5, size = 2) +
  geom_line(aes(y = fitted_war), color = "red", linewidth = 1.2) +
  geom_vline(xintercept = c(30, 60, 90),
             linetype = "dashed", alpha = 0.3) +
  annotate("text", x = 15, y = max(value_curve$expected_war, na.rm = TRUE) * 0.95,
           label = "Round 1", size = 3.5) +
  annotate("text", x = 45, y = max(value_curve$expected_war, na.rm = TRUE) * 0.95,
           label = "Round 2", size = 3.5) +
  labs(
    title = "MLB Draft Pick Value Curve",
    subtitle = "Expected career WAR by draft position",
    x = "Overall Pick",
    y = "Expected Career WAR",
    caption = "Exponential decay model fitted to historical data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11)
  )

print(value_plot)

# 4. Position-Specific Analysis
position_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(position, round) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    .groups = "drop"
  ) %>%
  group_by(position) %>%
  summarize(
    total_picks = sum(n),
    avg_mlb_rate = mean(mlb_rate),
    avg_war = mean(avg_war)
  ) %>%
  arrange(desc(avg_war))

cat("\nPosition-Specific Success Rates:\n")
print(position_analysis)

# 5. College vs High School Analysis
player_type_analysis <- draft_data %>%
  filter(round <= 10) %>%
  group_by(player_type) %>%
  summarize(
    n = n(),
    mlb_rate = mean(reached_mlb),
    avg_war = mean(war_if_mlb),
    avg_years_to_debut = mean(years_to_debut, na.rm = TRUE)
  )

cat("\nCollege vs High School Performance:\n")
print(player_type_analysis)

# 6. ROI Analysis (WAR per $ spent)
roi_analysis <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  mutate(
    war_per_million = war_if_mlb / (signing_bonus / 1000000),
    pick_group = case_when(
      overall_pick <= 30 ~ "Top 30",
      overall_pick <= 60 ~ "31-60",
      overall_pick <= 100 ~ "61-100",
      TRUE ~ "100+"
    )
  ) %>%
  group_by(pick_group) %>%
  summarize(
    n = n(),
    avg_bonus = mean(signing_bonus),
    avg_war = mean(war_if_mlb),
    war_per_million = mean(war_per_million)
  )

cat("\nReturn on Investment by Pick Range:\n")
print(roi_analysis)

# 7. Draft Strategy Optimizer
optimize_draft_strategy <- function(available_picks, budget) {
  # Simple optimization: maximize expected WAR
  # given bonus pool constraints (greedy heuristic)

  # Get expected value for each pick
  pick_values <- value_curve %>%
    filter(overall_pick %in% available_picks) %>%
    left_join(
      draft_data %>%
        group_by(overall_pick) %>%
        summarize(avg_slot = mean(slot_value)),
      by = "overall_pick"
    )

  # Greedy algorithm: pick highest value/cost ratio within budget
  selected <- tibble()
  remaining_budget <- budget
  remaining_picks <- pick_values

  while (nrow(remaining_picks) > 0 && remaining_budget > 0) {
    # Calculate value per dollar
    remaining_picks <- remaining_picks %>%
      mutate(value_per_dollar = expected_war / avg_slot)

    # Select best value pick we can afford
    best_pick <- remaining_picks %>%
      filter(avg_slot <= remaining_budget) %>%
      slice_max(value_per_dollar, n = 1)

    if (nrow(best_pick) == 0) break

    selected <- bind_rows(selected, best_pick)
    remaining_budget <- remaining_budget - best_pick$avg_slot
    remaining_picks <- remaining_picks %>%
      filter(overall_pick != best_pick$overall_pick)
  }

  return(selected)
}

# Example: Optimize top 5 picks with $15M budget
example_picks <- c(10, 15, 45, 78, 112)
example_budget <- 15000000

optimal_strategy <- optimize_draft_strategy(example_picks, example_budget)

cat("\n=== DRAFT STRATEGY OPTIMIZATION ===\n")
cat(sprintf("\nAvailable Picks: %s\n", paste(example_picks, collapse = ", ")))
cat(sprintf("Bonus Pool: $%.1fM\n\n", example_budget / 1000000))
cat("Optimized Selection:\n")
print(optimal_strategy %>%
        select(overall_pick, expected_war, avg_slot, value_per_dollar))

# 8. Comprehensive Dashboard Visualization
library(patchwork)

# Plot 1: Success rate by round
p1 <- success_by_round %>%
  filter(round <= 10) %>%
  ggplot(aes(x = round, y = mlb_rate)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_text(aes(label = percent(mlb_rate, accuracy = 1)),
            vjust = -0.5, size = 3) +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "MLB Success Rate by Round",
       x = "Draft Round", y = "% Reaching MLB") +
  theme_minimal()

# Plot 2: WAR distribution
p2 <- draft_data %>%
  filter(reached_mlb == 1, overall_pick <= 100) %>%
  ggplot(aes(x = war_if_mlb)) +
  geom_histogram(binwidth = 5, fill = "darkgreen", alpha = 0.7) +
  labs(title = "Career WAR Distribution (MLB Players)",
       x = "Career WAR", y = "Count") +
  theme_minimal()

# Plot 3: Position comparison
p3 <- draft_data %>%
  filter(reached_mlb == 1, round <= 5) %>%
  ggplot(aes(x = position, y = war_if_mlb, fill = position)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "WAR by Position (Rounds 1-5)",
       x = "Position", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Plot 4: College vs HS
p4 <- draft_data %>%
  filter(reached_mlb == 1, round <= 10) %>%
  ggplot(aes(x = player_type, y = war_if_mlb, fill = player_type)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, fill = "white", alpha = 0.5) +
  labs(title = "College vs HS Performance",
       x = "Player Type", y = "Career WAR") +
  theme_minimal() +
  theme(legend.position = "none")

# Combine plots
combined <- (p1 | p2) / (p3 | p4) +
  plot_annotation(
    title = "MLB Draft Analysis Dashboard",
    subtitle = "Historical performance metrics and value analysis",
    theme = theme(plot.title = element_text(size = 16, face = "bold"))
  )

print(combined)

# 9. Key Insights Summary
cat("\n=== KEY INSIGHTS ===\n\n")

cat("1. VALUE CONCENTRATION\n")
first_round_war <- sum(draft_data$war_if_mlb[draft_data$round == 1])
total_war <- sum(draft_data$war_if_mlb)
cat(sprintf("   - First round produces %.1f%% of total draft WAR\n",
            100 * first_round_war / total_war))

cat("\n2. SUCCESS RATES\n")
cat(sprintf("   - Round 1: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[1]))
cat(sprintf("   - Round 5: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[5]))
cat(sprintf("   - Round 10: %.1f%% reach MLB\n",
            100 * success_by_round$mlb_rate[10]))

cat("\n3. DEVELOPMENT TIME\n")
cat(sprintf("   - Average time to debut: %.1f years\n",
            mean(draft_data$years_to_debut, na.rm = TRUE)))

cat("\n4. STRATEGIC RECOMMENDATIONS\n")
cat("   - Prioritize early picks; value drops exponentially\n")
cat("   - Consider college players for faster development\n")
cat("   - High school players have higher variance in outcomes\n")
cat("   - Pitchers dominate draft volume, but weigh positional scarcity\n")
cat("   - Later rounds: focus on high-ceiling, high-risk players\n")

cat("\n=== ANALYSIS COMPLETE ===\n")
```

**Portfolio Enhancement**:
- Add international signing analysis
- Compare team draft performance
- Analyze specific draft classes
- Include financial constraints modeling (see the sketch after this list)
- Compare to prospect ranking systems
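
For the financial-constraints bullet, one standard extension is surplus value: the market price of a pick's expected WAR minus its expected bonus cost. A minimal Python sketch, where the $8M-per-WAR rate and the value-curve parameters are illustrative assumptions rather than fitted values from the exercise:

```python
# Draft pick surplus value: market value of expected WAR minus bonus cost.
# The $/WAR rate and curve parameters below are illustrative assumptions.
import math

DOLLARS_PER_WAR = 8_000_000   # hypothetical free-agent market rate
A, B = 10.0, 0.01             # placeholders for expected_war ~ a * exp(-b * pick)

def expected_war(overall_pick: int) -> float:
    return A * math.exp(-B * overall_pick)

def slot_value(overall_pick: int) -> float:
    # Same simplified slot formula used in the R simulation above
    return max(200_000, 12_000_000 * math.exp(-overall_pick / 15))

def surplus_value(overall_pick: int) -> float:
    return expected_war(overall_pick) * DOLLARS_PER_WAR - slot_value(overall_pick)

for pick in (1, 10, 30, 100):
    print(f"Pick {pick:>3}: surplus ≈ ${surplus_value(pick)/1e6:,.1f}M")
```

Swapping in the coefficients actually fitted by `nls()` above, and a defensible $/WAR figure, would turn this into a portfolio-ready valuation table.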
Exercise 24.4
Defensive Positioning and Shift Analysis
Hard
**Objective**: Analyze defensive positioning effectiveness using batted ball data.

**Skills Demonstrated**: Spatial analysis, causal inference, strategic analysis, data visualization

**Implementation Framework**:

```python
# Defensive Shift Analysis
# Evaluating positioning strategies using batted ball data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.patches as patches

# Set style
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 10)

# Generate simulated batted ball data
def generate_batted_ball_data(n_balls=5000):
    """
    Simulate batted ball locations and outcomes.
    Coordinates in feet from home plate.
    """
    np.random.seed(42)

    data = []

    for _ in range(n_balls):
        # Batter handedness
        stand = np.random.choice(['R', 'L'], p=[0.6, 0.4])

        # Shift decision (more common vs pull hitters)
        is_shifter = np.random.random() < 0.3
        shift_on = is_shifter and (np.random.random() < 0.7)

        # Hit location (pull tendency varies)
        if stand == 'R':
            # Righties pull left
            if is_shifter:
                angle = np.random.normal(-25, 35)  # Pull-heavy
            else:
                angle = np.random.normal(-10, 45)  # Balanced
        else:
            # Lefties pull right
            if is_shifter:
                angle = np.random.normal(25, 35)
            else:
                angle = np.random.normal(10, 45)

        # Distance based on exit velo and launch angle
        exit_velo = np.random.normal(88, 8)
        launch_angle = np.random.normal(12, 18)

        # Simplified distance calculation
        distance = exit_velo * 2.5 * np.cos(np.radians(launch_angle))
        distance = max(50, min(400, distance + np.random.normal(0, 20)))

        # Convert to x, y coordinates
        angle_rad = np.radians(angle)
        x = distance * np.sin(angle_rad)
        y = distance * np.cos(angle_rad)

        # Hit outcome (shift effectiveness)
        if shift_on:
            # Shift reduces hits in pull direction
            if stand == 'R' and x < -50:
                prob_hit = 0.18  # Reduced by shift
            elif stand == 'L' and x > 50:
                prob_hit = 0.18
            else:
                prob_hit = 0.28  # Normal rate
        else:
            prob_hit = 0.25

        # Adjust for distance (harder to field)
        prob_hit = min(0.95, prob_hit * (distance / 250))

        is_hit = np.random.random() < prob_hit

        data.append({
            'x': x,
            'y': y,
            'distance': distance,
            'angle': angle,
            'exit_velo': exit_velo,
            'launch_angle': launch_angle,
            'stand': stand,
            'shift_on': shift_on,
            'is_hit': is_hit,
            'is_shifter': is_shifter
        })

    return pd.DataFrame(data)

# Generate data
print("Generating batted ball data...")
bb_data = generate_batted_ball_data(n_balls=8000)

print(f"\nDataset: {len(bb_data)} batted balls")
print(f"Shifts: {bb_data['shift_on'].sum()} ({100*bb_data['shift_on'].mean():.1f}%)")
print(f"Overall BABIP: {bb_data['is_hit'].mean():.3f}")

# 1. Shift Effectiveness Analysis
shift_analysis = bb_data.groupby(['stand', 'is_shifter', 'shift_on']).agg({
    'is_hit': ['mean', 'count'],
    'exit_velo': 'mean'
}).round(3)

print("\nShift Effectiveness:")
print(shift_analysis)

# 2. Calculate Runs Saved by Shifting
def calculate_shift_value(data):
    """
    Estimate runs saved by shifting.
    """
    results = []

    for stand in ['R', 'L']:
        for shifter in [True, False]:
            subset = data[(data['stand'] == stand) &
                          (data['is_shifter'] == shifter)]

            if len(subset) == 0:
                continue

            shifted = subset[subset['shift_on'] == True]
            no_shift = subset[subset['shift_on'] == False]

            if len(shifted) > 0 and len(no_shift) > 0:
                babip_diff = no_shift['is_hit'].mean() - shifted['is_hit'].mean()
                # Approximate run value per hit prevented: ~0.5 runs
                runs_saved_per_pa = babip_diff * 0.5

                results.append({
                    'stand': stand,
                    'is_shifter': shifter,
                    'shifted_babip': shifted['is_hit'].mean(),
                    'no_shift_babip': no_shift['is_hit'].mean(),
                    'babip_diff': babip_diff,
                    'runs_saved_per_100pa': runs_saved_per_pa * 100,
                    'n_shifted': len(shifted),
                    'n_no_shift': len(no_shift)
                })

    return pd.DataFrame(results)

shift_value = calculate_shift_value(bb_data)

print("\nShift Value Analysis:")
print(shift_value.to_string(index=False))

# 3. Visualize Hit Distribution with and without Shift
def plot_field_with_hits(data, title, ax=None):
    """
    Plot baseball field with hit locations.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(10, 10))

    # Draw field outline
    # Infield dirt
    infield = patches.Wedge((0, 0), 95, 45, 135,
                            facecolor='tan', alpha=0.3)
    ax.add_patch(infield)

    # Outfield grass
    outfield = patches.Wedge((0, 0), 400, 45, 135,
                             facecolor='green', alpha=0.1)
    ax.add_patch(outfield)

    # Foul lines
    ax.plot([0, -300], [0, 300], 'k--', linewidth=1, alpha=0.3)
    ax.plot([0, 300], [0, 300], 'k--', linewidth=1, alpha=0.3)

    # Plot hits
    hits = data[data['is_hit'] == True]
    outs = data[data['is_hit'] == False]

    ax.scatter(outs['x'], outs['y'], c='blue', alpha=0.3,
               s=20, label='Out')
    ax.scatter(hits['x'], hits['y'], c='red', alpha=0.5,
               s=30, label='Hit')

    ax.set_xlim(-320, 320)
    ax.set_ylim(0, 400)
    ax.set_aspect('equal')
    ax.set_xlabel('Distance from center (ft)', fontsize=11)
    ax.set_ylabel('Distance from home (ft)', fontsize=11)
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.2)

    return ax

# Plot for RHB pull hitters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

rhb_shifter = bb_data[(bb_data['stand'] == 'R') &
                      (bb_data['is_shifter'] == True)]

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == False],
    'RHB Pull Hitter - No Shift',
    ax=ax1
)

plot_field_with_hits(
    rhb_shifter[rhb_shifter['shift_on'] == True],
    'RHB Pull Hitter - Shift On',
    ax=ax2
)

plt.tight_layout()
plt.savefig('shift_comparison.png', dpi=300, bbox_inches='tight')
print("\nShift comparison visualization saved")

# 4. Heat Map Analysis
def create_babip_heatmap(data, shift_status, stand):
    """
    Create BABIP heat map for given conditions.
    """
    subset = data[(data['shift_on'] == shift_status) &
                  (data['stand'] == stand)]

    # Create grid
    x_bins = np.linspace(-250, 250, 25)
    y_bins = np.linspace(50, 350, 20)

    grid_babip = np.zeros((len(y_bins)-1, len(x_bins)-1))
    grid_count = np.zeros((len(y_bins)-1, len(x_bins)-1))

    for i in range(len(y_bins)-1):
        for j in range(len(x_bins)-1):
            mask = ((subset['x'] >= x_bins[j]) &
                    (subset['x'] < x_bins[j+1]) &
                    (subset['y'] >= y_bins[i]) &
                    (subset['y'] < y_bins[i+1]))

            cell_data = subset[mask]
            if len(cell_data) >= 5:  # Minimum sample
                grid_babip[i, j] = cell_data['is_hit'].mean()
                grid_count[i, j] = len(cell_data)
            else:
                grid_babip[i, j] = np.nan

    return grid_babip, x_bins, y_bins, grid_count

# Create heat maps
fig, axes = plt.subplots(2, 2, figsize=(14, 12))

for i, stand in enumerate(['R', 'L']):
    for j, shift_on in enumerate([False, True]):
        ax = axes[i, j]

        shifters = bb_data[bb_data['is_shifter'] == True]
        grid, x_bins, y_bins, counts = create_babip_heatmap(
            shifters, shift_on, stand
        )

        im = ax.imshow(grid, extent=[x_bins[0], x_bins[-1],
                                     y_bins[0], y_bins[-1]],
                       origin='lower', cmap='RdYlGn_r',
                       vmin=0, vmax=0.5, aspect='auto')

        shift_text = "Shift On" if shift_on else "No Shift"
        hand_text = "RHB" if stand == 'R' else "LHB"
        ax.set_title(f'{hand_text} - {shift_text}',
                     fontsize=11, fontweight='bold')
        ax.set_xlabel('Horizontal Position (ft)')
        ax.set_ylabel('Distance from Home (ft)')

        # Add colorbar
        plt.colorbar(im, ax=ax, label='BABIP')

plt.tight_layout()
plt.savefig('babip_heatmaps.png', dpi=300, bbox_inches='tight')
print("BABIP heat maps saved")

# 5. Optimal Shift Decision Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Prepare features for shift decision model
# Note: 'is_pull' uses the current ball's spray angle, which is only known
# after contact; with real data, build this from prior spray tendencies.
features = bb_data[bb_data['is_shifter'] == True].copy()
features['is_pull'] = ((features['stand'] == 'R') & (features['angle'] < -15)) | \
                      ((features['stand'] == 'L') & (features['angle'] > 15))
features['stand_R'] = (features['stand'] == 'R').astype(int)

X = features[['stand_R', 'is_pull', 'exit_velo']]
y = features['shift_on']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate (ROC-AUC uses predicted probabilities, not hard labels)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print("\n=== Shift Decision Model ===")
print("\nModel Performance:")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Feature importance
print("\nFeature Coefficients:")
for feat, coef in zip(['RHB', 'Pull Hit', 'Exit Velocity'],
                      model.coef_[0]):
    print(f"  {feat}: {coef:.3f}")

# 6. Strategic Recommendations
print("\n" + "="*60)
print("DEFENSIVE POSITIONING RECOMMENDATIONS")
print("="*60)

print("\n1. SHIFT EFFECTIVENESS")
for _, row in shift_value[shift_value['is_shifter'] == True].iterrows():
    print(f"   {row['stand']}HB: Shifting saves {row['runs_saved_per_100pa']:.1f} runs per 100 PA")

print("\n2. WHEN TO SHIFT")
print("   - Strong pull tendency (>70% pull rate)")
print("   - Ground ball hitters (LA < 10°)")
print("   - Extreme pull hitters benefit most from aggressive shifts")

print("\n3. SHIFT VARIATIONS")
print("   - Full shift: 3 infielders on pull side")
print("   - Partial shift: 2.5 infielders pull side")
print("   - No shift: Traditional alignment")
print("   - Decision should consider:")
print("     * Batter's spray chart")
print("     * Game situation (runners, outs)")
print("     * Pitcher's ground ball rate")

print("\n4. LIMITATIONS & CONSIDERATIONS")
print("   - Shift beaten by opposite field hits")
print("   - Bunt defense vulnerabilities")
print("   - Runner advancement opportunities")
print("   - Pitcher-specific adjustments")

print("\n5. FUTURE ANALYSIS")
print("   - Pitcher-specific positioning")
print("   - Count-based positioning adjustments")
print("   - Outfield positioning optimization")
print("   - Real-time adjustment algorithms")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
```

**Portfolio Development Tips**:
- Use real Statcast spray chart data when possible (see the sketch after this list)
- Incorporate expected outcomes (xBA, xwOBA)
- Add video analysis component
- Compare to MLB team shift strategies
- Analyze shift effectiveness by ballpark
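
To swap the simulation for real spray-chart data, as the first tip suggests, the pybaseball package can pull Statcast batted balls for a given hitter. A minimal sketch; the player name is just an example, `hc_x`/`hc_y` are Savant's hit-coordinate columns, and the plate-centering offsets are widely used community approximations rather than official constants:

```python
# Pull real Statcast spray-chart data with pybaseball (installed separately).
# The 125.42 / 198.27 offsets are community approximations for centering
# Savant's hit coordinates on home plate, not official values.
from pybaseball import playerid_lookup, statcast_batter
import matplotlib.pyplot as plt

# Look up a player's MLBAM id (example name; any hitter works)
ids = playerid_lookup('freeman', 'freddie')
mlbam_id = int(ids['key_mlbam'].iloc[0])

bb = statcast_batter('2024-04-01', '2024-09-30', mlbam_id)
in_play = bb[bb['type'] == 'X'].dropna(subset=['hc_x', 'hc_y'])

# Center home plate at the origin; units are Savant plot units (~2.5 ft each)
x = in_play['hc_x'] - 125.42
y = 198.27 - in_play['hc_y']

plt.scatter(x, y, s=10, alpha=0.4)
plt.title('Spray chart (Statcast hit coordinates)')
plt.xlabel('hc_x (centered)')
plt.ylabel('hc_y (centered, flipped)')
plt.gca().set_aspect('equal')
plt.savefig('real_spray_chart.png', dpi=200, bbox_inches='tight')
```

From there, the same binning and BABIP-heat-map functions in the exercise apply directly, with the simulated `x`/`y` columns replaced by these centered coordinates.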

---

Chapter Summary

This chapter walked through building a career in baseball front offices and analytics. Key topics covered:

  • Careers in Baseball Analytics
  • Building Your Analytics Portfolio
  • Essential Technical Skills
  • The Job Application Process
  • Interview Preparation & Case Studies
  • Resources & Continued Learning