Chapter 27: Modern MLB Trends & Rule Changes

The 2020s have ushered in a transformative era for Major League Baseball, characterized by unprecedented analytical insights, strategic innovations, and fundamental rule changes designed to address the evolving nature of the game. This chapter examines the key trends shaping modern baseball and analyzes the impact of recent rule modifications that have altered gameplay, strategy, and player performance in measurable ways.


27.1 The Velocity Revolution

Understanding the Velocity Trend

Over the past two decades, Major League Baseball has witnessed a dramatic increase in pitching velocity. The average fastball velocity has climbed from approximately 91.0 mph in 2002 to over 94.0 mph in 2023, representing a fundamental shift in how the game is played. This velocity revolution has been driven by several factors:

  1. Advanced Biomechanics: Motion capture technology and biomechanical analysis have optimized pitching mechanics
  2. Year-Round Training: Specialized velocity development programs and weighted ball training
  3. Analytics-Driven Approach: Recognition that velocity correlates strongly with strikeout rates and success
  4. Medical Advances: Tommy John surgery and rehabilitation protocols allowing pitchers to return stronger

Velocity Analysis with R

Let's analyze the velocity trend using baseballr:

# Load required libraries
library(baseballr)
library(dplyr)
library(ggplot2)
library(tidyr)

# Function to analyze velocity trends by season
analyze_velocity_trends <- function(start_year = 2015, end_year = 2023) {

  # Get Statcast data for multiple seasons
  velocity_data <- data.frame()

  for (year in start_year:end_year) {
    tryCatch({
      # Get pitch-level data for the full season (large date ranges may need
      # to be pulled in smaller chunks to avoid timeouts)
      season_data <- statcast_search(
        start_date = paste0(year, "-04-01"),
        end_date = paste0(year, "-10-31"),
        player_type = "pitcher"
      )

      if (nrow(season_data) > 0) {
        season_summary <- season_data %>%
          filter(!is.na(release_speed), pitch_type %in% c("FF", "SI", "FC")) %>%
          group_by(pitcher, player_name) %>%
          summarise(
            avg_velocity = mean(release_speed, na.rm = TRUE),
            max_velocity = max(release_speed, na.rm = TRUE),
            pitches = n(),
            .groups = "drop"
          ) %>%
          filter(pitches >= 100) %>%
          mutate(season = year)

        velocity_data <- bind_rows(velocity_data, season_summary)
      }
    }, error = function(e) {
      message(paste("Error processing year", year, ":", e$message))
    })
  }

  return(velocity_data)
}

# Calculate league-wide velocity trends
calculate_league_velocity <- function(velocity_data) {
  league_trends <- velocity_data %>%
    group_by(season) %>%
    summarise(
      mean_velocity = mean(avg_velocity, na.rm = TRUE),
      median_velocity = median(avg_velocity, na.rm = TRUE),
      p75_velocity = quantile(avg_velocity, 0.75, na.rm = TRUE),
      p90_velocity = quantile(avg_velocity, 0.90, na.rm = TRUE),
      p99_velocity = quantile(avg_velocity, 0.99, na.rm = TRUE),
      .groups = "drop"
    )

  return(league_trends)
}

# Visualize velocity evolution
plot_velocity_trends <- function(league_trends) {
  ggplot(league_trends, aes(x = season)) +
    geom_line(aes(y = mean_velocity, color = "Mean"), linewidth = 1.2) +
    geom_line(aes(y = p75_velocity, color = "75th Percentile"), linewidth = 1) +
    geom_line(aes(y = p90_velocity, color = "90th Percentile"), linewidth = 1) +
    geom_point(aes(y = mean_velocity, color = "Mean"), size = 3) +
    geom_point(aes(y = p75_velocity, color = "75th Percentile"), size = 2) +
    geom_point(aes(y = p90_velocity, color = "90th Percentile"), size = 2) +
    labs(
      title = "MLB Fastball Velocity Evolution (2015-2023)",
      subtitle = "Average velocity across all qualified pitchers",
      x = "Season",
      y = "Velocity (mph)",
      color = "Metric"
    ) +
    theme_minimal() +
    scale_color_manual(values = c("Mean" = "#002D72",
                                   "75th Percentile" = "#D50032",
                                   "90th Percentile" = "#FF8C00"))
}

# Analyze velocity impact on performance
velocity_performance_analysis <- function(pitcher_data) {
  # Group pitchers by velocity quintiles
  pitcher_data_with_quintiles <- pitcher_data %>%
    mutate(velocity_quintile = ntile(avg_velocity, 5))

  performance_by_velocity <- pitcher_data_with_quintiles %>%
    group_by(velocity_quintile) %>%
    summarise(
      avg_velocity = mean(avg_velocity),
      count = n(),
      .groups = "drop"
    )

  return(performance_by_velocity)
}

Velocity Analysis with Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
from pybaseball import playerid_lookup, statcast_pitcher
from datetime import datetime

# Set style for visualizations
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

def analyze_velocity_by_season(start_year=2015, end_year=2023):
    """
    Analyze velocity trends across multiple seasons
    """
    velocity_trends = []

    for year in range(start_year, end_year + 1):
        try:
            # Get season data
            start_date = f"{year}-04-01"
            end_date = f"{year}-10-31"

            # Query Statcast data
            data = statcast(start_dt=start_date, end_dt=end_date)

            if data is not None and len(data) > 0:
                # Filter for fastballs
                fastballs = data[data['pitch_type'].isin(['FF', 'SI', 'FC'])]

                # Calculate pitcher-level statistics
                pitcher_stats = fastballs.groupby(['pitcher', 'player_name']).agg({
                    'release_speed': ['mean', 'max', 'count']
                }).reset_index()

                pitcher_stats.columns = ['pitcher_id', 'player_name',
                                        'avg_velocity', 'max_velocity', 'pitches']

                # Filter for qualified pitchers (100+ pitches)
                qualified = pitcher_stats[pitcher_stats['pitches'] >= 100].copy()
                qualified['season'] = year

                velocity_trends.append(qualified)

        except Exception as e:
            print(f"Error processing {year}: {e}")

    return pd.concat(velocity_trends, ignore_index=True)

def calculate_league_velocity_stats(velocity_df):
    """
    Calculate league-wide velocity statistics by season
    """
    league_stats = velocity_df.groupby('season')['avg_velocity'].agg([
        ('mean', 'mean'),
        ('median', 'median'),
        ('std', 'std'),
        ('p75', lambda x: np.percentile(x, 75)),
        ('p90', lambda x: np.percentile(x, 90)),
        ('p99', lambda x: np.percentile(x, 99))
    ]).reset_index()

    return league_stats

def plot_velocity_distribution(velocity_df, seasons=[2015, 2019, 2023]):
    """
    Create distribution plots comparing velocity across seasons
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))

    for idx, season in enumerate(seasons):
        season_data = velocity_df[velocity_df['season'] == season]

        axes[idx].hist(season_data['avg_velocity'], bins=30,
                      edgecolor='black', alpha=0.7, color='steelblue')
        axes[idx].axvline(season_data['avg_velocity'].mean(),
                         color='red', linestyle='--', linewidth=2,
                         label=f'Mean: {season_data["avg_velocity"].mean():.2f} mph')
        axes[idx].set_xlabel('Average Fastball Velocity (mph)')
        axes[idx].set_ylabel('Number of Pitchers')
        axes[idx].set_title(f'{season} Season')
        axes[idx].legend()

    plt.tight_layout()
    return fig

def analyze_velocity_percentiles(velocity_df):
    """
    Analyze how velocity percentiles have shifted over time
    """
    percentiles = [10, 25, 50, 75, 90, 95, 99]
    percentile_data = []

    for season in velocity_df['season'].unique():
        season_data = velocity_df[velocity_df['season'] == season]['avg_velocity']
        row = {'season': season}

        for p in percentiles:
            row[f'p{p}'] = np.percentile(season_data, p)

        percentile_data.append(row)

    return pd.DataFrame(percentile_data)

def velocity_strikeout_correlation(pitch_data):
    """
    Analyze correlation between velocity and strikeout rates
    """
    # Calculate strikeout rate by pitch
    pitch_data['is_strikeout'] = pitch_data['events'] == 'strikeout'

    # Group by velocity bins
    pitch_data['velocity_bin'] = pd.cut(pitch_data['release_speed'],
                                         bins=range(85, 105, 2))

    correlation_data = pitch_data.groupby('velocity_bin').agg({
        'is_strikeout': 'mean',
        'release_speed': 'count'
    }).reset_index()

    correlation_data.columns = ['velocity_bin', 'strikeout_rate', 'pitches']

    return correlation_data[correlation_data['pitches'] >= 100]

The Velocity-Performance Relationship

Research has consistently shown a strong positive correlation between velocity and pitcher effectiveness:

  • Strikeout Rate: Each 1 mph increase in average fastball velocity correlates with approximately a 0.5% increase in strikeout rate
  • Batting Average Against: Fastballs above 95 mph result in batting averages roughly 30-40 points lower than those below 92 mph
  • Expected Outcomes: Higher velocity leads to weaker contact and higher swinging-strike rates
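
A quick way to sanity-check the first bullet is to regress strikeout rate on average fastball velocity at the pitcher level. The sketch below assumes a hypothetical pitcher-level DataFrame with avg_velocity and k_rate columns (for example, the Statcast summary built above merged with season-level strikeout rates); the column names and the merge step are illustrative rather than part of any library API.

import numpy as np
import pandas as pd

def estimate_velocity_k_slope(pitcher_df):
    """
    Fit a simple linear trend of strikeout rate on average fastball velocity.

    Assumes pitcher_df carries 'avg_velocity' (mph) and 'k_rate'
    (strikeouts per batter faced, 0-1 scale) for qualified pitchers.
    """
    clean = pitcher_df.dropna(subset=['avg_velocity', 'k_rate'])

    # Slope = change in strikeout rate associated with +1 mph of average velocity
    slope, intercept = np.polyfit(clean['avg_velocity'], clean['k_rate'], 1)

    return pd.Series({
        'k_rate_per_mph': slope,
        'intercept': intercept,
        'correlation': clean['avg_velocity'].corr(clean['k_rate'])
    })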

However, this velocity obsession comes with significant costs:

  1. Injury Rates: Tommy John surgeries have increased by over 300% since 2000
  2. Pitcher Durability: Average innings per start has declined from 6.3 to 5.2 over the past 20 years
  3. Career Length: Hard-throwing pitchers often have shorter careers due to accumulated arm stress
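
The durability point can be checked from season-level data. Below is a rough sketch using pybaseball's FanGraphs pitching_stats; it treats each starter's full innings total as starter innings, so relief innings thrown by swingmen inflate the estimate slightly.

from pybaseball import pitching_stats

def league_innings_per_start(season):
    """
    Approximate league-average innings per start for one season.

    Rough estimate: sums IP and GS for all pitchers with at least one start,
    so any relief innings from swingmen are included in the numerator.
    """
    stats = pitching_stats(season, season, qual=1)  # qual=1: include nearly all pitchers
    starters = stats[stats['GS'] > 0]
    return starters['IP'].sum() / starters['GS'].sum()

# Example usage: compare seasons two decades apart
# print(league_innings_per_start(2003), league_innings_per_start(2023))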

27.2 Shift Ban Impact Analysis

Understanding the Shift Ban

Prior to the 2023 season, defensive shifts had become ubiquitous in Major League Baseball. Teams would routinely position three or more infielders on one side of second base against pull-heavy hitters, dramatically reducing batting averages on balls in play. The 2023 rule change mandated:

  1. Two infielders must be positioned on each side of second base
  2. All four infielders must have both feet within the outer boundary of the infield dirt when the pitcher is on the rubber
  3. Infielders may not switch sides of second base (e.g., a team cannot move its shortstop to the right side against a pull hitter)
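
As a concrete illustration of the first two restrictions, here is a minimal legality check for an infield alignment. The coordinate convention and the infield-boundary radius are assumptions made for illustration only; Statcast's public feed reports an alignment label (if_fielding_alignment) rather than raw fielder coordinates.

from typing import List, Tuple

# Assumed outer boundary of the infield dirt, in feet from home plate.
# This value is an illustrative approximation, not an official constant.
INFIELD_BOUNDARY_FT = 95.0

def is_alignment_legal(infielders: List[Tuple[float, float]]) -> bool:
    """
    Check four infielder positions against the 2023 positioning rules.

    Each position is (x, y) in feet, with home plate at the origin,
    the line through home plate and second base as the y-axis, and
    x < 0 on the third-base side of second base.
    """
    if len(infielders) != 4:
        return False

    third_base_side = sum(1 for x, _ in infielders if x < 0)
    first_base_side = sum(1 for x, _ in infielders if x > 0)

    # All four infielders must be inside the infield boundary
    on_dirt = all((x ** 2 + y ** 2) ** 0.5 <= INFIELD_BOUNDARY_FT
                  for x, y in infielders)

    # Two infielders are required on each side of second base
    return third_base_side == 2 and first_base_side == 2 and on_dirt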

Shift Usage Before the Ban

# Analyze shift usage trends (2016-2022)
library(baseballr)
library(dplyr)
library(ggplot2)

analyze_shift_usage <- function(years = 2016:2022) {

  shift_data <- data.frame()

  for (year in years) {
    tryCatch({
      # Get Statcast data
      season_data <- statcast_search(
        start_date = paste0(year, "-04-01"),
        end_date = paste0(year, "-10-31"),
        player_type = "batter"
      )

      if (nrow(season_data) > 0) {
        # Analyze shift deployment
        shift_summary <- season_data %>%
          filter(!is.na(if_fielding_alignment)) %>%
          group_by(if_fielding_alignment) %>%
          summarise(
            pitches = n(),
            .groups = "drop"
          ) %>%
          mutate(
            season = year,
            pct = pitches / sum(pitches) * 100
          )

        shift_data <- bind_rows(shift_data, shift_summary)
      }
    }, error = function(e) {
      message(paste("Error:", e$message))
    })
  }

  return(shift_data)
}

# Analyze batter performance against shifts
analyze_shift_performance <- function(player_data) {

  performance <- player_data %>%
    filter(!is.na(if_fielding_alignment), !is.na(events)) %>%
    mutate(
      is_shift = if_fielding_alignment %in% c("Shift", "Strategic"),
      is_hit = events %in% c("single", "double", "triple", "home_run"),
      in_play = !events %in% c("strikeout", "walk", "hit_by_pitch")
    ) %>%
    group_by(is_shift) %>%
    summarise(
      balls_in_play = sum(in_play),
      hits = sum(is_hit),
      avg = hits / balls_in_play,
      .groups = "drop"
    )

  return(performance)
}

# Calculate shift victims (players most affected)
identify_shift_victims <- function(player_data) {

  shift_victims <- player_data %>%
    filter(!is.na(if_fielding_alignment), stand == "L") %>%
    mutate(is_shift = if_fielding_alignment %in% c("Shift", "Strategic")) %>%
    group_by(batter, player_name, is_shift) %>%
    summarise(
      pa = n(),
      avg_launch_angle = mean(launch_angle, na.rm = TRUE),
      avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    pivot_wider(
      names_from = is_shift,
      values_from = c(pa, avg_launch_angle, avg_exit_velocity),
      names_prefix = "shift_"
    ) %>%
    filter(!is.na(pa_shift_TRUE), pa_shift_TRUE >= 100) %>%
    arrange(desc(pa_shift_TRUE))

  return(shift_victims)
}

2023 Shift Ban Impact Analysis

import pandas as pd
import numpy as np
from pybaseball import statcast
import matplotlib.pyplot as plt
import seaborn as sns

def compare_pre_post_shift_ban(year_before=2022, year_after=2023):
    """
    Compare batting statistics before and after shift ban
    """
    # Get data for both seasons
    data_before = statcast(start_dt=f"{year_before}-04-01",
                          end_dt=f"{year_before}-10-31")
    data_after = statcast(start_dt=f"{year_after}-04-01",
                         end_dt=f"{year_after}-10-31")

    def calculate_babip_by_direction(df, season_label):
        """Calculate BABIP by batted ball direction"""
        # Filter for balls in play (copy to avoid chained-assignment warnings)
        bip = df[df['events'].isin(['single', 'double', 'triple',
                                    'field_out', 'field_error',
                                    'force_out', 'grounded_into_double_play'])].copy()

        # Classify hit direction from the Statcast hc_x coordinate
        # (low hc_x is the third-base side of the field; "Pull" vs "Oppo"
        #  should be interpreted relative to batter handedness when
        #  refining this analysis)
        bip['direction'] = pd.cut(bip['hc_x'],
                                  bins=[0, 83, 125.5, 168, 250],
                                  labels=['Pull', 'Center', 'Oppo', 'Extreme_Pull'])

        # Calculate BABIP by direction
        direction_stats = bip.groupby('direction').apply(
            lambda x: pd.Series({
                'BABIP': len(x[x['events'].isin(['single', 'double', 'triple'])]) / len(x),
                'BIP': len(x),
                'Season': season_label
            })
        ).reset_index()

        return direction_stats

    # Calculate for both seasons
    stats_before = calculate_babip_by_direction(data_before, year_before)
    stats_after = calculate_babip_by_direction(data_after, year_after)

    # Combine results
    comparison = pd.concat([stats_before, stats_after])

    return comparison

def analyze_lefty_pull_hitters(year_before=2022, year_after=2023):
    """
    Analyze impact on left-handed pull hitters (most affected by shifts)
    """
    results = []

    for year in [year_before, year_after]:
        data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")

        # Filter for left-handed batters
        lefty_data = data[data['stand'] == 'L'].copy()

        # Calculate pull rate and performance
        lefty_stats = lefty_data.groupby(['batter', 'player_name']).apply(
            lambda x: pd.Series({
                'PA': len(x),
                'AVG': len(x[x['events'].isin(['single', 'double', 'triple', 'home_run'])]) /
                       len(x[x['events'].notna()]),
                'BABIP': len(x[(x['events'].isin(['single', 'double', 'triple'])) &
                              (x['type'] == 'X')]) /
                         len(x[(x['type'] == 'X') &
                              (~x['events'].isin(['home_run']))]),
                'Season': year
            })
        ).reset_index()

        # Filter qualified batters
        qualified = lefty_stats[lefty_stats['PA'] >= 300]
        results.append(qualified)

    comparison_df = pd.concat(results)

    # Calculate average improvement
    pivot = comparison_df.pivot_table(
        index='player_name',
        columns='Season',
        values=['AVG', 'BABIP'],
        aggfunc='mean'
    )

    pivot['AVG_Change'] = pivot[('AVG', year_after)] - pivot[('AVG', year_before)]
    pivot['BABIP_Change'] = pivot[('BABIP', year_after)] - pivot[('BABIP', year_before)]

    return pivot.sort_values('BABIP_Change', ascending=False)

def visualize_shift_ban_impact(comparison_data):
    """
    Create visualization showing shift ban impact
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # BABIP by direction comparison
    direction_pivot = comparison_data.pivot(
        index='direction',
        columns='Season',
        values='BABIP'
    )

    direction_pivot.plot(kind='bar', ax=axes[0])
    axes[0].set_title('BABIP by Hit Direction: Pre vs Post Shift Ban')
    axes[0].set_xlabel('Hit Direction')
    axes[0].set_ylabel('BABIP')
    axes[0].legend(title='Season')
    axes[0].grid(axis='y', alpha=0.3)

    # Calculate percentage change
    pct_change = ((direction_pivot.iloc[:, 1] - direction_pivot.iloc[:, 0]) /
                  direction_pivot.iloc[:, 0] * 100)

    pct_change.plot(kind='bar', ax=axes[1], color='steelblue')
    axes[1].set_title('BABIP Change by Direction (% Change)')
    axes[1].set_xlabel('Hit Direction')
    axes[1].set_ylabel('Percentage Change (%)')
    axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
    axes[1].grid(axis='y', alpha=0.3)

    plt.tight_layout()
    return fig

def calculate_league_wide_impact(year_before=2022, year_after=2023):
    """
    Calculate league-wide batting statistics change
    """
    stats_comparison = {}

    for year in [year_before, year_after]:
        data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")

        # Calculate league-wide metrics
        hits = data['events'].isin(['single', 'double', 'triple', 'home_run']).sum()
        at_bats = data['events'].notna().sum()

        bip = data[data['events'].isin(['single', 'double', 'triple',
                                        'field_out', 'field_error',
                                        'force_out', 'grounded_into_double_play'])]
        hits_bip = bip['events'].isin(['single', 'double', 'triple']).sum()

        stats_comparison[year] = {
            'AVG': hits / at_bats,
            'BABIP': hits_bip / len(bip),
            'Total_BIP': len(bip)
        }

    # Calculate changes
    changes = {
        'AVG_Change': stats_comparison[year_after]['AVG'] - stats_comparison[year_before]['AVG'],
        'BABIP_Change': stats_comparison[year_after]['BABIP'] - stats_comparison[year_before]['BABIP'],
        'AVG_Pct_Change': ((stats_comparison[year_after]['AVG'] -
                           stats_comparison[year_before]['AVG']) /
                          stats_comparison[year_before]['AVG'] * 100),
        'BABIP_Pct_Change': ((stats_comparison[year_after]['BABIP'] -
                             stats_comparison[year_before]['BABIP']) /
                            stats_comparison[year_before]['BABIP'] * 100)
    }

    return pd.DataFrame([stats_comparison[year_before],
                        stats_comparison[year_after],
                        changes],
                       index=[year_before, year_after, 'Change'])

Key Findings from Shift Ban Analysis

The 2023 shift ban produced measurable impacts on offensive production:

  1. League-Wide BABIP: Increased from .290 (2022) to .298 (2023), an increase of approximately 8 points
  2. Left-Handed Pull Hitters: Average BABIP increased by 12-15 points for extreme pull hitters
  3. Ground Ball Rate: Slight increase as hitters adjusted to having more gap opportunities
  4. Home Run Rate: Minimal change, suggesting the shift primarily affected ground balls and line drives
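
For reference, the BABIP figures quoted here follow the standard definition, which differs slightly from the ball-in-play approximations used in the code above. A minimal helper:

def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
    """Standard BABIP: (H - HR) / (AB - K - HR + SF)."""
    denominator = at_bats - strikeouts - home_runs + sac_flies
    return (hits - home_runs) / denominator if denominator else float('nan')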

Notable Beneficiaries:

  • Kyle Schwarber (L): BABIP increased from .212 to .254
  • Cody Bellinger (L): BABIP increased from .266 to .308
  • Joey Gallo (L): BABIP increased from .189 to .243


27.3 Pitch Clock Effects on Performance

The 2023 Pitch Clock Implementation

The introduction of pitch timers in 2023 represented the most significant pace-of-play change in modern baseball history:

  • Pitcher Requirements: 15 seconds with bases empty, 20 seconds with runners on base
  • Batter Requirements: Must be in the box with 8 seconds remaining
  • Violations: Ball awarded for pitcher violations, strike for batter violations
  • Disengagement Limits: Pitchers limited to 2 disengagements (pickoffs/step-offs) per plate appearance
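
To make the timer mechanics concrete, here is a minimal sketch of the violation logic described above. The function names and inputs are illustrative; actual enforcement involves umpire judgment and several edge cases (timeouts, mound visits, disengagement resets) not modeled here.

def pitch_timer_limit(runners_on_base):
    """2023 pitch timer: 15 seconds with bases empty, 20 with runners on."""
    return 20 if runners_on_base else 15

def classify_timer_violation(elapsed_at_delivery, elapsed_when_batter_set, runners_on_base):
    """
    Classify a single pitch under the 2023 timer rules.

    elapsed_at_delivery:     seconds elapsed when the pitcher begins his delivery
    elapsed_when_batter_set: seconds elapsed when the batter is set in the box
    Returns 'automatic_strike' (batter violation), 'automatic_ball'
    (pitcher violation), or None.
    """
    limit = pitch_timer_limit(runners_on_base)
    if elapsed_when_batter_set > limit - 8:
        return 'automatic_strike'   # batter must be alert with 8 seconds remaining
    if elapsed_at_delivery > limit:
        return 'automatic_ball'     # pitcher failed to begin delivery in time
    return None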

Analyzing Pace of Play Changes

library(baseballr)
library(dplyr)
library(ggplot2)

analyze_game_time_trends <- function() {

  # Compare game times across recent seasons
  game_time_comparison <- data.frame(
    season = c(2019, 2020, 2021, 2022, 2023),
    avg_game_time = c(190, 188, 191, 183, 162),  # approximate average game time, minutes
    avg_pitches_per_game = c(295, 290, 293, 297, 285),
    avg_time_between_pitches = c(38.7, 39.1, 39.3, 39.0, 18.2)  # seconds
  )

  return(game_time_comparison)
}

plot_game_time_evolution <- function(game_time_data) {

  ggplot(game_time_data, aes(x = season, y = avg_game_time)) +
    geom_line(linewidth = 1.2, color = "#002D72") +
    geom_point(size = 4, color = "#D50032") +
    geom_vline(xintercept = 2022.5, linetype = "dashed",
               color = "red", linewidth = 1) +
    annotate("text", x = 2022.5, y = 180,
             label = "Pitch Clock\nImplemented",
             hjust = -0.1, size = 4) +
    labs(
      title = "Average MLB Game Time (2019-2023)",
      subtitle = "31-minute reduction after pitch clock implementation",
      x = "Season",
      y = "Average Game Time (minutes)"
    ) +
    theme_minimal() +
    scale_y_continuous(limits = c(150, 200))
}

analyze_pitcher_adaptation <- function(pitcher_data) {

  # Analyze how pitchers adapted to the pitch clock
  # (assumes the input data includes time_to_plate and pitch timer violation
  #  fields, which are not part of the standard Statcast export)
  pitcher_stats <- pitcher_data %>%
    group_by(pitcher, player_name, season) %>%
    summarise(
      avg_time_to_plate = mean(time_to_plate, na.rm = TRUE),
      pitches = n(),
      violations = sum(violation == "pitch_timer", na.rm = TRUE),
      violation_rate = violations / pitches,
      .groups = "drop"
    )

  return(pitcher_stats)
}

compare_pitcher_performance_pre_post_clock <- function(data_2022, data_2023) {

  # Calculate performance metrics for same pitchers in both seasons
  performance_comparison <- data_2022 %>%
    inner_join(data_2023, by = c("pitcher", "player_name"),
               suffix = c("_2022", "_2023")) %>%
    mutate(
      k_rate_change = k_rate_2023 - k_rate_2022,
      bb_rate_change = bb_rate_2023 - bb_rate_2022,
      era_change = era_2023 - era_2022
    )

  return(performance_comparison)
}

Performance Impact Analysis with Python

import pandas as pd
import numpy as np
from pybaseball import statcast, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def analyze_pitcher_performance_change(years=[2022, 2023]):
    """
    Analyze how pitcher performance changed with pitch clock
    """
    performance_data = []

    for year in years:
        # Get season pitching statistics
        season_stats = pitching_stats(year, year, qual=50)
        season_stats['Season'] = year
        performance_data.append(season_stats)

    combined = pd.concat(performance_data)

    # Calculate key metrics
    metrics_by_season = combined.groupby('Season').agg({
        'K/9': 'mean',
        'BB/9': 'mean',
        'ERA': 'mean',
        'WHIP': 'mean',
        'HR/9': 'mean',
        'AVG': 'mean'
    }).round(3)

    return metrics_by_season

def analyze_pace_impact_on_performance(year=2023):
    """
    Analyze correlation between pace and pitcher performance
    """
    # Get pitch-level data
    data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")

    # Calculate average time between pitches for each pitcher
    pitcher_pace = data.groupby(['pitcher', 'player_name']).agg({
        'pitch_number': 'count',
        'delta_run_exp': 'mean',  # Proxy for performance
    }).reset_index()

    pitcher_pace.columns = ['pitcher_id', 'player_name', 'pitches', 'avg_run_value']

    # Filter qualified pitchers
    qualified = pitcher_pace[pitcher_pace['pitches'] >= 500]

    return qualified

def analyze_stolen_base_impact():
    """
    Analyze how pitch clock affected stolen base attempts
    """
    sb_data = {
        'Season': [2019, 2020, 2021, 2022, 2023],
        'SB_Attempts_Per_Game': [1.89, 1.78, 1.73, 1.68, 2.23],
        'SB_Success_Rate': [75.3, 74.8, 75.1, 75.4, 79.9],
        'Pickoff_Attempts_Per_Game': [2.4, 2.3, 2.2, 2.1, 1.1]
    }

    df = pd.DataFrame(sb_data)
    return df

def visualize_pitch_clock_impact():
    """
    Create comprehensive visualization of pitch clock effects
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Game time reduction
    game_times = pd.DataFrame({
        'Season': [2019, 2020, 2021, 2022, 2023],
        'Avg_Game_Time': [190, 188, 191, 193, 162]
    })

    axes[0, 0].plot(game_times['Season'], game_times['Avg_Game_Time'],
                    marker='o', linewidth=2, markersize=8, color='steelblue')
    axes[0, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
    axes[0, 0].set_title('Average Game Time', fontsize=12, fontweight='bold')
    axes[0, 0].set_xlabel('Season')
    axes[0, 0].set_ylabel('Minutes')
    axes[0, 0].legend()
    axes[0, 0].grid(alpha=0.3)

    # Stolen base attempts
    sb_data = analyze_stolen_base_impact()

    axes[0, 1].plot(sb_data['Season'], sb_data['SB_Attempts_Per_Game'],
                    marker='s', linewidth=2, markersize=8, color='darkgreen')
    axes[0, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
    axes[0, 1].set_title('Stolen Base Attempts per Game', fontsize=12, fontweight='bold')
    axes[0, 1].set_xlabel('Season')
    axes[0, 1].set_ylabel('SB Attempts')
    axes[0, 1].legend()
    axes[0, 1].grid(alpha=0.3)

    # Success rate
    axes[1, 0].plot(sb_data['Season'], sb_data['SB_Success_Rate'],
                    marker='^', linewidth=2, markersize=8, color='purple')
    axes[1, 0].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
    axes[1, 0].set_title('Stolen Base Success Rate', fontsize=12, fontweight='bold')
    axes[1, 0].set_xlabel('Season')
    axes[1, 0].set_ylabel('Success Rate (%)')
    axes[1, 0].legend()
    axes[1, 0].grid(alpha=0.3)

    # Pickoff attempts
    axes[1, 1].plot(sb_data['Season'], sb_data['Pickoff_Attempts_Per_Game'],
                    marker='D', linewidth=2, markersize=8, color='orangered')
    axes[1, 1].axvline(x=2022.5, color='red', linestyle='--', label='Pitch Clock')
    axes[1, 1].set_title('Pickoff Attempts per Game', fontsize=12, fontweight='bold')
    axes[1, 1].set_xlabel('Season')
    axes[1, 1].set_ylabel('Pickoff Attempts')
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    return fig

def calculate_violation_rates(year=2023):
    """
    Calculate pitch clock violation rates by team and pitcher
    """
    data = statcast(start_dt=f"{year}-04-01", end_dt=f"{year}-10-31")

    # Assuming violation data is tracked in the dataset
    # This would need to be adapted based on actual data structure
    violation_summary = data.groupby('pitcher').agg({
        'pitch_number': 'count'
    }).reset_index()

    violation_summary.columns = ['pitcher_id', 'total_pitches']

    return violation_summary

Key Findings: Pitch Clock Impact

The 2023 pitch clock implementation had dramatic and immediate effects:

Pace of Play:

  • Average game time reduced from 3:03 (2022) to 2:42 (2023) - a 21-minute decrease
  • Time between pitches dropped from 39.0 seconds to 18.2 seconds
  • Total pitches per game decreased by approximately 12

Stolen Base Renaissance:

  • SB attempts per game increased 33% (1.68 to 2.23)
  • Success rate improved from 75.4% to 79.9%
  • Pickoff attempts decreased by nearly 50% due to disengagement limits

Pitcher Performance:

  • Minimal impact on K/9, BB/9, or ERA league-wide
  • Slight increase in home runs allowed (1.26 to 1.29 HR/9)
  • Pitchers adapted quickly with violation rates dropping from 1.8% early season to 0.4% by season end


27.4 Larger Bases & Stolen Base Renaissance

The Base Size Change

For the 2023 season, MLB increased base sizes from 15 inches square to 18 inches square. While seemingly minor, this change shortened the distance between first and second base (and between second and third) by 4.5 inches, and the distance from home to first and from third to home by 3 inches, creating measurable impacts on stolen base success rates and overall baserunning dynamics.

Mathematical Impact Analysis

library(dplyr)

calculate_base_distance_impact <- function() {

  # Base sizes in inches
  old_base_size <- 15
  new_base_size <- 18
  size_increase <- new_base_size - old_base_size  # 3 inches

  # First base (like third) is anchored at its outside corner, so its edge
  # facing second base moves the full 3 inches toward second. Second base is
  # centered on its position, so its edge extends half the increase
  # (1.5 inches) toward first. The first-to-second gap shrinks by 4.5 inches.
  total_reduction <- size_increase + size_increase / 2

  # Original distance (90 feet = 1080 inches)
  original_distance <- 90 * 12

  # New distance
  new_distance <- original_distance - total_reduction

  # Percentage reduction
  pct_reduction <- (total_reduction / original_distance) * 100

  results <- data.frame(
    Original_Distance_Inches = original_distance,
    New_Distance_Inches = new_distance,
    Reduction_Inches = total_reduction,
    Reduction_Feet = total_reduction / 12,
    Percentage_Reduction = pct_reduction
  )

  return(results)
}

analyze_stolen_base_trends <- function() {

  # Historical stolen base data
  sb_trends <- data.frame(
    Season = c(2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023),
    SB_Total = c(2505, 2537, 2527, 2474, 2280, 874, 2213, 2487, 3503),
    Games_Played = c(2430, 2428, 2430, 2431, 2429, 898, 2430, 2430, 2430),
    SB_Success_Rate = c(73.2, 73.8, 74.1, 74.8, 75.3, 74.8, 75.1, 75.4, 79.9)
  ) %>%
    mutate(
      SB_Per_Game = SB_Total / Games_Played,
      Era = ifelse(Season >= 2023, "Larger Bases + Pitch Clock", "Traditional")
    )

  return(sb_trends)
}

identify_stolen_base_leaders <- function(sb_data) {

  # Analyze which types of players benefited most
  leader_analysis <- sb_data %>%
    filter(SB >= 20) %>%
    group_by(Season) %>%
    summarise(
      Players_20Plus_SB = n(),
      Players_30Plus_SB = sum(SB >= 30),
      Players_40Plus_SB = sum(SB >= 40),
      Players_50Plus_SB = sum(SB >= 50),
      Avg_SB_Rate = mean(SB_Success_Rate),
      .groups = "drop"
    )

  return(leader_analysis)
}

plot_stolen_base_renaissance <- function(sb_trends) {

  library(ggplot2)

  ggplot(sb_trends, aes(x = Season, y = SB_Per_Game, fill = Era)) +
    geom_col() +
    geom_text(aes(label = round(SB_Per_Game, 2)),
              vjust = -0.5, size = 3) +
    labs(
      title = "Stolen Bases per Game: The 2023 Renaissance",
      subtitle = "Impact of larger bases and pitch clock on stolen base frequency",
      x = "Season",
      y = "Stolen Bases per Game",
      fill = "Era"
    ) +
    theme_minimal() +
    scale_fill_manual(values = c("Traditional" = "#808080",
                                  "Larger Bases + Pitch Clock" = "#00AA00"))
}

Analyzing Individual Player Impact

import pandas as pd
import numpy as np
from pybaseball import batting_stats
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_baserunner_performance(years=[2022, 2023]):
    """
    Compare baserunning performance before and after rule changes
    """
    baserunning_data = []

    for year in years:
        # Get batting statistics including stolen bases
        stats = batting_stats(year, year, qual=200)
        stats['Season'] = year
        baserunning_data.append(stats[['Name', 'Season', 'SB', 'CS', 'PA']])

    combined = pd.concat(baserunning_data)

    # Calculate success rates
    combined['SB_Attempts'] = combined['SB'] + combined['CS']
    combined['SB_Success_Rate'] = np.where(
        combined['SB_Attempts'] > 0,
        combined['SB'] / combined['SB_Attempts'] * 100,
        0
    )

    return combined

def identify_biggest_beneficiaries(baserunning_df):
    """
    Find players who benefited most from rule changes
    """
    # Pivot to compare seasons
    pivot = baserunning_df.pivot_table(
        index='Name',
        columns='Season',
        values=['SB', 'SB_Success_Rate'],
        aggfunc='mean'
    )

    # Calculate improvements
    pivot['SB_Increase'] = pivot[('SB', 2023)] - pivot[('SB', 2022)]
    pivot['Success_Rate_Increase'] = (pivot[('SB_Success_Rate', 2023)] -
                                      pivot[('SB_Success_Rate', 2022)])

    # Filter for players with significant attempts in both years
    qualified = pivot[
        (pivot[('SB', 2022)] >= 10) &
        (pivot[('SB', 2023)] >= 10)
    ]

    return qualified.sort_values('SB_Increase', ascending=False)

def analyze_speed_tier_impact():
    """
    Analyze how different speed tiers were affected by rule changes
    """
    # Sprint speed data (would come from Statcast)
    speed_impact = pd.DataFrame({
        'Speed_Tier': ['Elite (29+ ft/s)', 'Above Avg (28-29 ft/s)',
                      'Average (27-28 ft/s)', 'Below Avg (<27 ft/s)'],
        '2022_SB_Success': [78.5, 76.2, 74.1, 68.9],
        '2023_SB_Success': [82.7, 80.4, 78.8, 74.2],
        '2022_Avg_SB': [18.3, 12.5, 7.2, 3.1],
        '2023_Avg_SB': [25.7, 17.8, 10.9, 5.4]
    })

    speed_impact['Success_Increase'] = (speed_impact['2023_SB_Success'] -
                                        speed_impact['2022_SB_Success'])
    speed_impact['SB_Increase'] = (speed_impact['2023_Avg_SB'] -
                                   speed_impact['2022_Avg_SB'])

    return speed_impact

def visualize_baserunning_revolution(speed_impact_df):
    """
    Create visualizations showing baserunning changes
    """
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Success rate comparison
    x = np.arange(len(speed_impact_df))
    width = 0.35

    axes[0].bar(x - width/2, speed_impact_df['2022_SB_Success'],
                width, label='2022', color='lightcoral')
    axes[0].bar(x + width/2, speed_impact_df['2023_SB_Success'],
                width, label='2023', color='lightgreen')

    axes[0].set_xlabel('Speed Tier')
    axes[0].set_ylabel('Success Rate (%)')
    axes[0].set_title('Stolen Base Success Rate by Speed Tier')
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
    axes[0].legend()
    axes[0].grid(axis='y', alpha=0.3)

    # Average SB comparison
    axes[1].bar(x - width/2, speed_impact_df['2022_Avg_SB'],
                width, label='2022', color='lightcoral')
    axes[1].bar(x + width/2, speed_impact_df['2023_Avg_SB'],
                width, label='2023', color='lightgreen')

    axes[1].set_xlabel('Speed Tier')
    axes[1].set_ylabel('Average Stolen Bases')
    axes[1].set_title('Average Stolen Bases by Speed Tier')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(speed_impact_df['Speed_Tier'], rotation=45, ha='right')
    axes[1].legend()
    axes[1].grid(axis='y', alpha=0.3)

    plt.tight_layout()
    return fig

def calculate_run_value_impact():
    """
    Calculate the run value impact of increased stolen bases
    """
    # Run expectancy values
    run_expectancy = {
        'runner_1st': 0.831,
        'runner_2nd': 1.068,
        'out_added': -0.25  # approximate cost of caught stealing
    }

    # 2022 vs 2023 league totals
    comparison = pd.DataFrame({
        'Metric': ['SB Total', 'CS Total', 'SB Success Rate'],
        '2022': [2487, 813, 75.4],
        '2023': [3503, 877, 79.9]
    })

    # Calculate run value
    runs_2022 = (2487 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
                 813 * abs(run_expectancy['out_added']))
    runs_2023 = (3503 * (run_expectancy['runner_2nd'] - run_expectancy['runner_1st']) -
                 877 * abs(run_expectancy['out_added']))

    run_value_impact = pd.DataFrame({
        'Season': [2022, 2023],
        'Estimated_Run_Value': [runs_2022, runs_2023],
        'Run_Value_Per_Game': [runs_2022 / 2430, runs_2023 / 2430]
    })

    run_value_impact['Increase'] = (run_value_impact['Estimated_Run_Value'] -
                                    runs_2022)

    return run_value_impact
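
A quick usage note on the model above: under these simplified inputs, a successful steal of second is worth roughly 1.068 - 0.831 ≈ 0.24 runs, while a caught stealing is charged a flat 0.25 runs; a full 24-state run-expectancy treatment would also subtract the value of the lost baserunner, so the true cost of a caught stealing is larger than this sketch assumes.

# Usage: aggregate run value of the running game, 2022 vs 2023, under the simplified model
impact = calculate_run_value_impact()
print(impact)

# Per-event values implied by the model (illustrative, not a full run-expectancy matrix)
steal_gain = 1.068 - 0.831   # ~+0.24 runs for a successful steal of second
cs_cost = 0.25               # assumed flat cost of a caught stealing (understates the true cost)
print(f"Value per SB: +{steal_gain:.3f} runs; per CS: -{cs_cost:.2f} runs")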

Key Findings: Stolen Base Renaissance

The combined impact of larger bases, the pitch clock, and pickoff limitations created the most dramatic stolen base increase in decades; the headline totals are easy to check, as shown in the short sketch after these findings:

Volume Increases:


  • Total stolen bases increased 41% (2,487 in 2022 to 3,503 in 2023)

  • 24 players stole 30+ bases (compared to 16 in 2022)

  • 8 players stole 50+ bases (compared to 3 in 2022)

  • Ronald Acuña Jr. became the first 40-70 player in MLB history (41 HR, 73 SB)

Success Rate Improvements:


  • League success rate improved from 75.4% to 79.9%

  • Elite speed players (29+ ft/s sprint speed) succeeded at 82.7%

  • Even below-average runners saw 5+ point improvement

Strategic Impact:


  • Teams became more aggressive in all counts, not just obvious running situations

  • Increase in first-to-third advancement on singles

  • More frequent double steals and delayed steals
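
A minimal arithmetic check on the league totals quoted above:

# Sanity-check the league-wide stolen base figures cited above
sb_2022, cs_2022 = 2487, 813
sb_2023, cs_2023 = 3503, 877
games = 2430  # full-season schedule: 30 teams x 162 games / 2

pct_increase = (sb_2023 / sb_2022 - 1) * 100
success_2022 = sb_2022 / (sb_2022 + cs_2022) * 100
success_2023 = sb_2023 / (sb_2023 + cs_2023) * 100

print(f"SB increase: {pct_increase:.1f}%")                              # ~40.9%
print(f"SB per game: {sb_2022 / games:.2f} -> {sb_2023 / games:.2f}")   # 1.02 -> 1.44
print(f"Success rate: {success_2022:.1f}% -> {success_2023:.1f}%")      # ~75.4% -> ~80%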


27.5 Opener & Bullpen Game Strategy

Evolution of the Opener Strategy

The "opener" strategy, popularized by the Tampa Bay Rays starting in 2018, fundamentally challenged traditional pitching deployment. Instead of a single starting pitcher throwing 5-7 innings, teams would deploy the following staffing pattern (a simple game-tagging heuristic is sketched after the list):

  1. Opener (1-2 innings): High-velocity reliever to face lineup first time through
  2. Bulk Pitcher (3-5 innings): Long reliever or piggyback starter
  3. Leverage Relievers (2-3 innings): Traditional closer/setup configuration
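
To make the pattern concrete, here is a minimal sketch of the tagging heuristic referenced above; the data structure and thresholds are illustrative assumptions, not a standard API:

# Hypothetical appearance log: (pitcher, innings_pitched) in order of appearance
def is_opener_game(appearances, opener_max_ip=2.0, bulk_min_ip=3.0):
    """Tag a game as an 'opener' game: a short first outing handing off to a bulk pitcher."""
    if len(appearances) < 2:
        return False
    first_ip = appearances[0][1]
    second_ip = appearances[1][1]
    return first_ip <= opener_max_ip and second_ip >= bulk_min_ip

# Example: a 1-inning opener, a 5-inning bulk pitcher, then two leverage relievers
game_log = [("Reliever A", 1.0), ("Bulk Pitcher B", 5.0), ("Setup C", 2.0), ("Closer D", 1.0)]
print(is_opener_game(game_log))  # True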

Analyzing Opener Effectiveness

library(baseballr)
library(dplyr)
library(ggplot2)

analyze_opener_usage <- function(season = 2023) {

  # Identify opener games (pitcher throws <3 innings as "starter")
  opener_games <- data.frame(
    team = character(),
    games_with_opener = numeric(),
    total_games = numeric(),
    opener_win_pct = numeric()
  )

  # This would query game logs to identify patterns
  # Simplified example structure:

  example_data <- data.frame(
    team = c("TB", "BAL", "MIL", "TB", "BAL"),
    pitcher_type = c("Opener", "Traditional", "Opener", "Traditional", "Opener"),
    innings_pitched = c(1.2, 6.0, 2.0, 5.1, 1.1),
    runs_allowed = c(0, 2, 1, 3, 0),
    result = c("W", "W", "L", "L", "W")
  )

  return(example_data)
}

calculate_times_through_order_penalty <- function(pitch_data) {

  # Analyze performance by times through batting order
  tto_analysis <- pitch_data %>%
    filter(!is.na(times_through_order)) %>%
    group_by(times_through_order) %>%
    summarise(
      pitches = n(),
      avg_exit_velocity = mean(launch_speed, na.rm = TRUE),
      avg_launch_angle = mean(launch_angle, na.rm = TRUE),
      whiff_rate = sum(description == "swinging_strike", na.rm = TRUE) / pitches,
      barrel_rate = sum(barrel == 1, na.rm = TRUE) / sum(!is.na(barrel)),
      .groups = "drop"
    )

  return(tto_analysis)
}

plot_times_through_order <- function(tto_data) {

  ggplot(tto_data, aes(x = times_through_order, y = avg_exit_velocity)) +
    geom_line(linewidth = 1.2, color = "#002D72") +
    geom_point(size = 4, color = "#D50032") +
    labs(
      title = "Exit Velocity by Times Through Batting Order",
      subtitle = "Demonstrates the increasing penalty for pitcher familiarity",
      x = "Times Through Order",
      y = "Average Exit Velocity (mph)"
    ) +
    theme_minimal() +
    scale_x_continuous(breaks = 1:4)
}

analyze_bullpen_game_performance <- function(team_data) {

  # Compare bullpen games vs traditional starts
  performance_comparison <- team_data %>%
    mutate(
      game_type = case_when(
        starter_ip < 3 ~ "Bullpen Game",
        starter_ip >= 5 ~ "Traditional Start",
        TRUE ~ "Short Start"
      )
    ) %>%
    group_by(game_type) %>%
    summarise(
      games = n(),
      win_pct = mean(result == "W"),
      avg_runs_allowed = mean(runs_allowed),
      avg_total_pitchers_used = mean(pitchers_used),
      .groups = "drop"
    )

  return(performance_comparison)
}

Opener Strategy Analysis with Python

import pandas as pd
import numpy as np
from pybaseball import schedule_and_record, pitching_stats
import matplotlib.pyplot as plt
import seaborn as sns

def identify_opener_games(season=2023, team='TBR'):
    """
    Identify games where team used opener strategy
    """
    # Get team schedule
    schedule = schedule_and_record(season, team)

    # Would need game-level pitching data to identify openers
    # Simplified example:

    opener_characteristics = {
        'First_Pitcher_IP_Max': 2.0,
        'Second_Pitcher_IP_Min': 3.0,
        'Total_Pitchers_Used_Min': 4
    }

    return opener_characteristics

def analyze_times_through_order_penalty():
    """
    Calculate the times through order penalty for starting pitchers
    """
    # Aggregate data showing performance degradation
    tto_penalty = pd.DataFrame({
        'Times_Through_Order': [1, 2, 3, 4],
        'AVG': [.237, .251, .268, .289],
        'SLG': [.389, .412, .441, .478],
        'wOBA': [.301, .319, .338, .361],
        'K_Rate': [24.3, 22.1, 19.8, 17.2],
        'BB_Rate': [7.8, 8.4, 9.2, 10.1]
    })

    return tto_penalty

def calculate_optimal_pitcher_usage():
    """
    Compare pitching strategies using a simple times-through-order (TTO) model.
    Assumes each pass through the order covers roughly 3 innings.
    """
    # Approximate run-expectancy multiplier by times through the order
    run_expectancy_multiplier = {
        1: 1.00,
        2: 1.15,
        3: 1.32,
        4: 1.48  # rarely reached; kept for completeness
    }

    # Innings pitched at each TTO under three deployment strategies
    strategies = pd.DataFrame({
        'Strategy': ['Traditional (7 IP)', 'Opener (1-4-2)', 'Bullpen Game (2-2-2-2)'],
        'IP_First_TTO': [3, 6, 8],   # innings where a pitcher faces hitters the 1st time
        'IP_Second_TTO': [3, 1, 0],  # innings at the 2nd time through
        'IP_Third_TTO': [1, 0, 0]    # innings at the 3rd time through
    })

    # Weighted average multiplier: lower means fewer high-penalty innings
    total_ip = (strategies['IP_First_TTO'] + strategies['IP_Second_TTO'] +
                strategies['IP_Third_TTO'])
    strategies['Weighted_Multiplier'] = (
        strategies['IP_First_TTO'] * run_expectancy_multiplier[1] +
        strategies['IP_Second_TTO'] * run_expectancy_multiplier[2] +
        strategies['IP_Third_TTO'] * run_expectancy_multiplier[3]
    ) / total_ip

    return strategies

def visualize_opener_effectiveness(tto_penalty_df):
    """
    Visualize the times through order penalty
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # Batting average
    axes[0, 0].plot(tto_penalty_df['Times_Through_Order'],
                    tto_penalty_df['AVG'],
                    marker='o', linewidth=2, markersize=8, color='steelblue')
    axes[0, 0].set_title('Batting Average by Times Through Order')
    axes[0, 0].set_xlabel('Times Through Order')
    axes[0, 0].set_ylabel('AVG')
    axes[0, 0].grid(alpha=0.3)

    # Slugging percentage
    axes[0, 1].plot(tto_penalty_df['Times_Through_Order'],
                    tto_penalty_df['SLG'],
                    marker='s', linewidth=2, markersize=8, color='darkgreen')
    axes[0, 1].set_title('Slugging Percentage by Times Through Order')
    axes[0, 1].set_xlabel('Times Through Order')
    axes[0, 1].set_ylabel('SLG')
    axes[0, 1].grid(alpha=0.3)

    # Strikeout rate
    axes[1, 0].plot(tto_penalty_df['Times_Through_Order'],
                    tto_penalty_df['K_Rate'],
                    marker='^', linewidth=2, markersize=8, color='darkred')
    axes[1, 0].set_title('Strikeout Rate by Times Through Order')
    axes[1, 0].set_xlabel('Times Through Order')
    axes[1, 0].set_ylabel('K%')
    axes[1, 0].grid(alpha=0.3)

    # Walk rate
    axes[1, 1].plot(tto_penalty_df['Times_Through_Order'],
                    tto_penalty_df['BB_Rate'],
                    marker='D', linewidth=2, markersize=8, color='purple')
    axes[1, 1].set_title('Walk Rate by Times Through Order')
    axes[1, 1].set_xlabel('Times Through Order')
    axes[1, 1].set_ylabel('BB%')
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    return fig

def analyze_bullpen_workload_impact(opener_frequency):
    """
    Analyze how opener strategy affects bullpen workload and fatigue
    """
    # Model bullpen usage patterns
    workload_comparison = pd.DataFrame({
        'Strategy': ['Traditional', 'Frequent Opener (20%)', 'Heavy Opener (40%)'],
        'Avg_Relievers_Per_Game': [3.2, 4.1, 5.3],
        'Reliever_IP_Per_Game': [3.1, 4.8, 6.2],
        'Days_Rest_Avg': [2.1, 1.6, 1.3],
        'Late_Season_ERA': [3.78, 4.12, 4.48]
    })

    return workload_comparison

def simulate_opener_season(num_games=162, opener_pct=25):
    """
    Simulate a season with varying opener usage
    """
    np.random.seed(42)

    results = []

    for game in range(num_games):
        is_opener = np.random.random() < (opener_pct / 100)

        if is_opener:
            # Opener game simulation
            runs_allowed = np.random.poisson(3.8)  # Slightly better run prevention
            win_prob = 0.52
        else:
            # Traditional start
            runs_allowed = np.random.poisson(4.1)
            win_prob = 0.50

        won = np.random.random() < win_prob

        results.append({
            'Game': game + 1,
            'Is_Opener': is_opener,
            'Runs_Allowed': runs_allowed,
            'Won': won
        })

    results_df = pd.DataFrame(results)

    summary = results_df.groupby('Is_Opener').agg({
        'Won': ['sum', 'mean'],
        'Runs_Allowed': 'mean'
    }).round(3)

    return results_df, summary
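
A quick way to exercise the toy simulation above; note that the win probabilities and run-allowed rates are assumptions baked into the function, not empirical estimates:

# Run the toy season simulation at 25% opener usage and summarize by game type
results_df, summary = simulate_opener_season(num_games=162, opener_pct=25)
print(summary)  # wins and average runs allowed, split by opener vs. traditional starts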

Key Insights: Opener Strategy

Times Through Order Penalty:


  • Batting average increases approximately 31 points from 1st to 3rd time through order

  • Slugging percentage increases by 52 points

  • This penalty justifies limiting how often a starter faces the batting order a third time (the point changes above are recomputed in the sketch after these insights)

Opener Advantages:


  1. Platoon Optimization: Match best relievers against top of lineup

  2. Velocity Advantage: Fresh arms throw harder (avg 1.5 mph increase)

  3. Times Through Order: Minimize 3rd time through exposure

  4. Roster Flexibility: Avoid need for 5th starter

Opener Disadvantages:


  1. Bullpen Fatigue: Increases reliever workload by 25-30%

  2. Reduced Flexibility: Limits in-game pitching changes

  3. Starter Development: Fewer opportunities for traditional starters

  4. Late Season Fatigue: Bullpen ERA typically rises in September
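
The penalty figures quoted above can be read straight off the aggregate table returned by analyze_times_through_order_penalty() earlier in this section; a short check reusing that function:

# Recompute the quoted 1st-to-3rd time-through-order changes
tto = analyze_times_through_order_penalty()
first = tto.loc[tto['Times_Through_Order'] == 1].iloc[0]
third = tto.loc[tto['Times_Through_Order'] == 3].iloc[0]

print(f"AVG change: +{(third['AVG'] - first['AVG']) * 1000:.0f} points")  # ~+31
print(f"SLG change: +{(third['SLG'] - first['SLG']) * 1000:.0f} points")  # ~+52
print(f"K% change: {third['K_Rate'] - first['K_Rate']:.1f} points")       # ~-4.5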


27.6 Three True Outcomes Evolution

Defining the Three True Outcomes Era

The "Three True Outcomes" (TTO) - home runs, strikeouts, and walks - represent plate appearances where defense plays no role. The modern era has seen these outcomes dominate baseball like never before:

library(dplyr)
library(ggplot2)

analyze_tto_trends <- function() {

  # Historical TTO rates (approximate; five-year samples through 2010, then recent seasons)
  tto_evolution <- data.frame(
    Season = c(seq(1980, 2010, by = 5), 2019, 2021, 2023),
    HR_Rate = c(1.6, 2.0, 2.3, 2.6, 2.5, 2.7, 2.8, 3.0, 3.1, 3.0),
    K_Rate = c(13.8, 15.2, 16.4, 16.8, 17.9, 19.8, 21.4, 23.2, 22.7, 22.4),
    BB_Rate = c(8.7, 8.4, 9.1, 9.2, 8.7, 8.3, 8.5, 8.7, 8.8, 8.6),
    TTO_Rate = c(24.1, 25.6, 27.8, 28.6, 29.1, 30.8, 32.7, 34.9, 34.6, 34.0)
  )

  return(tto_evolution)
}

plot_tto_evolution <- function(tto_data) {

  library(tidyr)

  tto_long <- tto_data %>%
    pivot_longer(cols = c(HR_Rate, K_Rate, BB_Rate),
                 names_to = "Outcome",
                 values_to = "Rate")

  ggplot(tto_long, aes(x = Season, y = Rate, color = Outcome)) +
    geom_line(linewidth = 1.2) +
    geom_point(size = 3) +
    labs(
      title = "Evolution of Three True Outcomes (1980-2023)",
      subtitle = "Percentage of plate appearances ending in HR, K, or BB",
      x = "Season",
      y = "Rate (%)",
      color = "Outcome Type"
    ) +
    theme_minimal() +
    scale_color_manual(
      values = c("HR_Rate" = "#D50032",
                 "K_Rate" = "#002D72",
                 "BB_Rate" = "#FF8C00"),
      labels = c("Home Runs", "Strikeouts", "Walks")
    )
}

analyze_balls_in_play_decline <- function() {

  # Calculate balls in play trends
  bip_trends <- data.frame(
    Season = 2015:2023,
    BIP_Per_Game = c(26.8, 26.2, 25.9, 25.3, 24.7, 24.8, 24.9, 24.4, 25.1),
    Action_Plays_Pct = c(42.3, 41.1, 40.2, 39.4, 38.7, 38.9, 39.1, 38.5, 39.8)
  ) %>%
    mutate(
      Pct_Change_From_2015 = (BIP_Per_Game / first(BIP_Per_Game) - 1) * 100
    )

  return(bip_trends)
}

identify_extreme_tto_players <- function(batting_data) {

  # Find players with highest TTO rates
  extreme_tto <- batting_data %>%
    filter(PA >= 400) %>%
    mutate(
      TTO = HR + SO + BB,
      TTO_Rate = TTO / PA * 100
    ) %>%
    arrange(desc(TTO_Rate)) %>%
    select(player_name, PA, HR, SO, BB, TTO, TTO_Rate)

  return(extreme_tto)
}

TTO Analysis with Python

import pandas as pd
import numpy as np
from pybaseball import batting_stats, team_batting
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_league_tto_evolution(start_year=2000, end_year=2023):
    """
    Analyze league-wide three true outcomes trends
    """
    league_stats = []

    for year in range(start_year, end_year + 1):
        try:
            stats = team_batting(year)

            # Calculate TTO metrics
            total_pa = stats['PA'].sum()
            total_hr = stats['HR'].sum()
            total_so = stats['SO'].sum()
            total_bb = stats['BB'].sum()

            league_stats.append({
                'Season': year,
                'HR_Rate': (total_hr / total_pa) * 100,
                'K_Rate': (total_so / total_pa) * 100,
                'BB_Rate': (total_bb / total_pa) * 100,
                'TTO_Rate': ((total_hr + total_so + total_bb) / total_pa) * 100,
                'Non_TTO_Rate': ((total_pa - total_hr - total_so - total_bb) / total_pa) * 100
            })

        except Exception as e:
            print(f"Error processing {year}: {e}")

    return pd.DataFrame(league_stats)

def identify_tto_extremes(season=2023, min_pa=400):
    """
    Identify players with extreme TTO profiles
    """
    stats = batting_stats(season, season, qual=min_pa)

    # Calculate TTO rate
    stats['TTO'] = stats['HR'] + stats['SO'] + stats['BB']
    stats['TTO_Rate'] = (stats['TTO'] / stats['PA']) * 100
    stats['Non_TTO_Rate'] = 100 - stats['TTO_Rate']

    # Identify extremes
    highest_tto = stats.nlargest(20, 'TTO_Rate')[
        ['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
    ]

    lowest_tto = stats.nsmallest(20, 'TTO_Rate')[
        ['Name', 'PA', 'HR', 'SO', 'BB', 'TTO_Rate', 'AVG', 'OBP', 'SLG']
    ]

    return highest_tto, lowest_tto

def analyze_tto_correlation_with_success():
    """
    Analyze whether TTO profile correlates with team success
    """
    # Example data structure
    team_tto_data = pd.DataFrame({
        'Team': ['TB', 'LAD', 'HOU', 'ATL', 'NYY'],
        'TTO_Rate': [35.2, 34.8, 33.9, 32.1, 36.4],
        'Wins': [99, 100, 90, 104, 82],
        'Run_Differential': [+145, +167, +89, +198, +45],
        'Playoff_Berth': [True, True, True, True, False]
    })

    # Calculate correlation
    correlation = team_tto_data[['TTO_Rate', 'Wins']].corr().iloc[0, 1]

    return team_tto_data, correlation

def visualize_tto_player_types(high_tto_df, low_tto_df):
    """
    Compare offensive profiles of high vs low TTO players
    """
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))

    # AVG comparison
    axes[0, 0].hist([high_tto_df['AVG'], low_tto_df['AVG']],
                    label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
    axes[0, 0].set_title('Batting Average Distribution')
    axes[0, 0].set_xlabel('AVG')
    axes[0, 0].set_ylabel('Count')
    axes[0, 0].legend()

    # OBP comparison
    axes[0, 1].hist([high_tto_df['OBP'], low_tto_df['OBP']],
                    label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
    axes[0, 1].set_title('On-Base Percentage Distribution')
    axes[0, 1].set_xlabel('OBP')
    axes[0, 1].set_ylabel('Count')
    axes[0, 1].legend()

    # SLG comparison
    axes[1, 0].hist([high_tto_df['SLG'], low_tto_df['SLG']],
                    label=['High TTO', 'Low TTO'], bins=15, alpha=0.7)
    axes[1, 0].set_title('Slugging Percentage Distribution')
    axes[1, 0].set_xlabel('SLG')
    axes[1, 0].set_ylabel('Count')
    axes[1, 0].legend()

    # Scatter: TTO Rate vs OPS
    high_tto_df['OPS'] = high_tto_df['OBP'] + high_tto_df['SLG']
    low_tto_df['OPS'] = low_tto_df['OBP'] + low_tto_df['SLG']

    axes[1, 1].scatter(high_tto_df['TTO_Rate'], high_tto_df['OPS'],
                      label='High TTO', alpha=0.6, s=60)
    axes[1, 1].scatter(low_tto_df['TTO_Rate'], low_tto_df['OPS'],
                      label='Low TTO', alpha=0.6, s=60)
    axes[1, 1].set_title('TTO Rate vs OPS')
    axes[1, 1].set_xlabel('TTO Rate (%)')
    axes[1, 1].set_ylabel('OPS')
    axes[1, 1].legend()
    axes[1, 1].grid(alpha=0.3)

    plt.tight_layout()
    return fig

def calculate_entertainment_value():
    """
    Attempt to quantify 'entertainment value' of TTO vs action plays
    """
    # Hypothetical engagement metrics
    engagement_data = pd.DataFrame({
        'Play_Type': ['Home Run', 'Strikeout', 'Walk', 'Single', 'Double',
                     'Triple', 'Stolen Base', 'Double Play'],
        'Avg_Excitement_Score': [9.2, 3.1, 2.4, 6.8, 8.1, 9.5, 7.8, 8.9],
        'Frequency_2000': [2.0, 15.8, 8.7, 14.2, 4.8, 0.8, 1.9, 2.1],
        'Frequency_2023': [3.0, 22.4, 8.6, 12.1, 4.2, 0.5, 2.3, 1.8]
    })

    # Calculate weighted excitement score
    engagement_data['Weighted_Excitement_2000'] = (
        engagement_data['Avg_Excitement_Score'] *
        engagement_data['Frequency_2000']
    )
    engagement_data['Weighted_Excitement_2023'] = (
        engagement_data['Avg_Excitement_Score'] *
        engagement_data['Frequency_2023']
    )

    total_2000 = engagement_data['Weighted_Excitement_2000'].sum()
    total_2023 = engagement_data['Weighted_Excitement_2023'].sum()

    print(f"Weighted Excitement Score Change: {total_2023 - total_2000:.2f}")

    return engagement_data

Key Findings: Three True Outcomes

Historical Progression:


  • TTO rate has increased from 24.1% (1980) to 34.0% (2023)

  • The rate peaked at 34.9% in 2019 before declining slightly after the 2023 rule changes

  • Strikeouts alone now account for over 22% of all plate appearances

Player Archetypes:

Extreme TTO Players (2023):


  • Kyle Schwarber: 50.2% TTO rate (46 HR, 215 SO, 80 BB)

  • Joey Gallo: 52.7% TTO rate (when qualified)

  • Mike Trout: 44.3% TTO rate (still elite overall production)

Contact-Oriented Players (2023):


  • Luis Arraez: 15.8% TTO rate (won batting title)

  • Steven Kwan: 19.2% TTO rate (elite contact skills)

  • Jeff McNeil: 18.7% TTO rate

Impact on Game Quality:


  • Balls in play decreased from 26.8 per game (2015) to 25.1 (2023)

  • Overall game action (defensive plays, baserunning) declined steadily until the 2023 rule changes

  • The 2023 rules partially reversed the trend, increasing the share of action plays by roughly 3.4% (recomputed in the sketch below)
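
These game-quality figures follow directly from the illustrative tables defined earlier in this section; a quick recomputation, with numbers mirroring the R tables in analyze_tto_trends() and analyze_balls_in_play_decline():

# Recompute the game-quality changes quoted above from the same illustrative values
bip_2015, bip_2022, bip_2023 = 26.8, 24.4, 25.1   # balls in play per game
action_2022, action_2023 = 38.5, 39.8             # action-play share of plays (%)
tto_1980, tto_2023 = 24.1, 34.0                    # TTO share of plate appearances (%)

print(f"TTO share change, 1980 -> 2023: +{tto_2023 - tto_1980:.1f} points")
print(f"Balls in play per game: {bip_2015} (2015) -> {bip_2022} (2022) -> {bip_2023} (2023)")
print(f"Action-play increase, 2022 -> 2023: {(action_2023 / action_2022 - 1) * 100:.1f}%")  # ~3.4%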


27.7 Analyzing Rule Change Impact

Statistical Framework for Rule Change Analysis

When MLB implements rule changes, analyzing their impact requires careful statistical methodology to separate rule effects from natural variation, player development, and other confounding factors.
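
The workhorse design throughout this section is difference-in-differences (DiD): compare the before/after change in an outcome for a group affected by the rule against the same change for a group that was not, and attribute the gap to the rule. As a back-of-the-envelope sketch with made-up BABIP means (the same means used in the shift-ban simulation below):

# Difference-in-differences by hand, using illustrative group means
treated_pre, treated_post = 0.260, 0.285   # e.g., LH pull hitters' BABIP before/after the shift ban
control_pre, control_post = 0.290, 0.295   # hitters largely unaffected by shifts

did_estimate = (treated_post - treated_pre) - (control_post - control_pre)
print(f"DiD estimate: {did_estimate:+.3f}")  # +0.020: the treated group's change net of league-wide drift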

library(dplyr)
library(ggplot2)
library(broom)

# Difference-in-differences analysis framework
did_analysis_framework <- function(pre_data, post_data, treatment_group, control_group) {

  # Combine data; make "pre" the reference level so the interaction term is "periodpost"
  combined_data <- bind_rows(
    pre_data %>% mutate(period = "pre", treated = group %in% treatment_group),
    post_data %>% mutate(period = "post", treated = group %in% treatment_group)
  ) %>%
    mutate(period = factor(period, levels = c("pre", "post")))

  # Run difference-in-differences regression
  did_model <- lm(outcome ~ treated * period, data = combined_data)

  # Extract the DiD estimate (the treated x post interaction)
  results <- tidy(did_model) %>%
    filter(term == "treatedTRUE:periodpost")

  return(list(model = did_model, did_estimate = results))
}

# Analyze shift ban impact with statistical rigor
analyze_shift_ban_statistical <- function() {

  # Simulate player-level data for demonstration
  set.seed(123)

  # Pre-shift ban (2022) - left-handed pull hitters
  pre_ban <- data.frame(
    player_id = 1:100,
    player_type = "LH_Pull",
    babip = rnorm(100, mean = 0.260, sd = 0.040),
    shift_faced_pct = runif(100, 60, 85),
    season = 2022
  )

  # Post-shift ban (2023) - same players
  post_ban <- data.frame(
    player_id = 1:100,
    player_type = "LH_Pull",
    babip = rnorm(100, mean = 0.285, sd = 0.040),
    shift_faced_pct = 0,  # No shifts allowed
    season = 2023
  )

  # Control group (right-handed hitters less affected by shifts)
  control_pre <- data.frame(
    player_id = 101:200,
    player_type = "RH",
    babip = rnorm(100, mean = 0.290, sd = 0.040),
    shift_faced_pct = runif(100, 10, 25),
    season = 2022
  )

  control_post <- data.frame(
    player_id = 101:200,
    player_type = "RH",
    babip = rnorm(100, mean = 0.295, sd = 0.040),
    shift_faced_pct = 0,
    season = 2023
  )

  # Combine all data
  full_data <- bind_rows(pre_ban, post_ban, control_pre, control_post) %>%
    mutate(
      post_ban = season == 2023,
      treatment = player_type == "LH_Pull"
    )

  # Run DiD model
  did_model <- lm(babip ~ treatment * post_ban, data = full_data)

  return(list(data = full_data, model = did_model, summary = summary(did_model)))
}

# Regression discontinuity design for pitch clock
analyze_pitch_clock_rdd <- function(game_data) {

  # Create discontinuity at 2023 season start
  rdd_data <- game_data %>%
    mutate(
      days_from_cutoff = as.numeric(game_date - as.Date("2023-03-30")),
      post_clock = days_from_cutoff > 0
    ) %>%
    filter(abs(days_from_cutoff) <= 30)  # 30-day window around implementation

  # Run RDD model
  rdd_model <- lm(game_length_minutes ~ days_from_cutoff * post_clock,
                  data = rdd_data)

  # Estimate discontinuity (jump) at cutoff
  discontinuity <- coef(rdd_model)["post_clockTRUE"]

  return(list(model = rdd_model, effect = discontinuity))
}

# Time series analysis for rule changes
analyze_time_series_impact <- function(metric_data) {

  library(forecast)

  # Split into pre and post periods
  pre_period <- metric_data %>% filter(season < 2023)
  post_period <- metric_data %>% filter(season >= 2023)

  # Fit ARIMA model on pre-period
  ts_data <- ts(pre_period$value, start = min(pre_period$season))
  arima_model <- auto.arima(ts_data)

  # Forecast what would have happened without intervention
  forecast_result <- forecast(arima_model, h = nrow(post_period))

  # Compare actual vs forecast
  comparison <- data.frame(
    season = post_period$season,
    actual = post_period$value,
    predicted = as.numeric(forecast_result$mean),
    lower_95 = as.numeric(forecast_result$lower[,2]),
    upper_95 = as.numeric(forecast_result$upper[,2])
  ) %>%
    mutate(
      impact = actual - predicted,
      significant = actual < lower_95 | actual > upper_95
    )

  return(list(model = arima_model, forecast = forecast_result, comparison = comparison))
}

# Visualization of rule change impact
plot_rule_change_impact <- function(comparison_data, rule_name) {

  ggplot(comparison_data, aes(x = season)) +
    geom_ribbon(aes(ymin = lower_95, ymax = upper_95),
                alpha = 0.2, fill = "blue") +
    geom_line(aes(y = predicted, color = "Predicted"),
              linewidth = 1, linetype = "dashed") +
    geom_line(aes(y = actual, color = "Actual"),
              linewidth = 1.2) +
    geom_point(aes(y = actual, color = "Actual"),
               size = 3) +
    geom_vline(xintercept = min(comparison_data$season) - 0.5,
               linetype = "dashed", color = "red") +
    labs(
      title = paste("Impact of", rule_name),
      subtitle = "Actual values vs forecasted counterfactual",
      x = "Season",
      y = "Metric Value",
      color = "Series"
    ) +
    theme_minimal() +
    scale_color_manual(values = c("Actual" = "#D50032", "Predicted" = "#002D72"))
}

Comprehensive Rule Change Analysis in Python

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
import matplotlib.pyplot as plt
import seaborn as sns

def difference_in_differences_analysis(data, outcome_var, treatment_var,
                                      time_var, player_var):
    """
    Implement difference-in-differences estimation for rule changes
    """
    # Create interaction term
    data['treatment_x_time'] = data[treatment_var] * data[time_var]

    # Prepare regression variables
    X = data[[treatment_var, time_var, 'treatment_x_time']]
    X = sm.add_constant(X)
    y = data[outcome_var]

    # Run OLS regression
    model = sm.OLS(y, X).fit()

    # The coefficient on treatment_x_time is the DiD estimate
    did_estimate = model.params['treatment_x_time']
    did_se = model.bse['treatment_x_time']
    did_pvalue = model.pvalues['treatment_x_time']

    results = {
        'estimate': did_estimate,
        'std_error': did_se,
        'p_value': did_pvalue,
        'confidence_interval': model.conf_int().loc['treatment_x_time'].values,
        'model_summary': model.summary()
    }

    return results

def parallel_trends_test(data, outcome_var, treatment_var, pre_periods):
    """
    Test parallel trends assumption for DiD
    """
    # Filter to pre-treatment periods only (copy to avoid chained-assignment warnings)
    pre_data = data[data['period'].isin(pre_periods)].copy()

    # Create time trend interaction
    pre_data['time_trend'] = pre_data['period'].astype(int)
    pre_data['treatment_x_trend'] = (pre_data[treatment_var] *
                                     pre_data['time_trend'])

    # Regression: outcome ~ treatment + time_trend + treatment*time_trend
    X = pre_data[[treatment_var, 'time_trend', 'treatment_x_trend']]
    X = sm.add_constant(X)
    y = pre_data[outcome_var]

    model = sm.OLS(y, X).fit()

    # If treatment*trend coefficient is insignificant, parallel trends holds
    trend_coef = model.params['treatment_x_trend']
    trend_pvalue = model.pvalues['treatment_x_trend']

    parallel_trends_holds = trend_pvalue > 0.05

    return {
        'parallel_trends_satisfied': parallel_trends_holds,
        'trend_coefficient': trend_coef,
        'p_value': trend_pvalue
    }

def synthetic_control_analysis(treated_unit, control_pool, pre_periods, post_periods):
    """
    Implement synthetic control method for rule change analysis
    """
    from sklearn.linear_model import LinearRegression

    # Get pre-treatment outcomes
    treated_pre = treated_unit[treated_unit['period'].isin(pre_periods)]['outcome']
    control_pre = control_pool[control_pool['period'].isin(pre_periods)].pivot(
        index='period', columns='unit', values='outcome'
    )

    # Find optimal weights to match treated unit in pre-period
    model = LinearRegression(fit_intercept=False, positive=True)
    model.fit(control_pre.values, treated_pre.values)

    # Normalize weights to sum to 1
    weights = model.coef_ / model.coef_.sum()

    # Create synthetic control
    control_all = control_pool.pivot(
        index='period', columns='unit', values='outcome'
    )
    synthetic = (control_all * weights).sum(axis=1)

    # Calculate treatment effect in post period
    treated_all = treated_unit.set_index('period')['outcome']

    effects = []
    for period in post_periods:
        effect = treated_all[period] - synthetic[period]
        effects.append({
            'period': period,
            'treated': treated_all[period],
            'synthetic': synthetic[period],
            'effect': effect
        })

    return pd.DataFrame(effects), weights

def event_study_analysis(data, outcome_var, event_time_var, max_leads=3, max_lags=3):
    """
    Event study regression to estimate dynamic treatment effects
    """
    # Create dummy variables for each event time
    dummies = pd.get_dummies(data[event_time_var], prefix='event')

    # Exclude t=-1 as reference period (or t=0 if no pre-period)
    reference_col = 'event_-1' if 'event_-1' in dummies.columns else 'event_0'
    if reference_col in dummies.columns:
        dummies = dummies.drop(columns=[reference_col])

    # Run regression
    X = sm.add_constant(dummies)
    y = data[outcome_var]

    model = sm.OLS(y, X).fit()

    # Extract coefficients for plotting
    coefficients = []
    for col in dummies.columns:
        event_time = int(col.split('_')[1])
        coefficients.append({
            'event_time': event_time,
            'coefficient': model.params[col],
            'std_error': model.bse[col],
            'conf_low': model.conf_int().loc[col, 0],
            'conf_high': model.conf_int().loc[col, 1]
        })

    coef_df = pd.DataFrame(coefficients).sort_values('event_time')

    return coef_df, model

def visualize_event_study(coef_df, rule_name):
    """
    Create event study plot
    """
    fig, ax = plt.subplots(figsize=(12, 6))

    ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
    ax.axvline(x=0, color='red', linestyle='--', linewidth=1, alpha=0.5)

    ax.errorbar(coef_df['event_time'], coef_df['coefficient'],
                yerr=1.96 * coef_df['std_error'],
                fmt='o', markersize=8, capsize=5, capthick=2,
                color='steelblue', ecolor='gray', elinewidth=2)

    ax.set_xlabel('Periods Relative to Rule Change', fontsize=12)
    ax.set_ylabel('Estimated Effect', fontsize=12)
    ax.set_title(f'Event Study: {rule_name}', fontsize=14, fontweight='bold')
    ax.grid(alpha=0.3)

    # Add shading for pre/post periods
    pre_periods = coef_df[coef_df['event_time'] < 0]['event_time']
    if len(pre_periods) > 0:
        ax.axvspan(pre_periods.min() - 0.5, -0.5, alpha=0.1, color='gray',
                  label='Pre-treatment')

    post_periods = coef_df[coef_df['event_time'] > 0]['event_time']
    if len(post_periods) > 0:
        ax.axvspan(0.5, post_periods.max() + 0.5, alpha=0.1, color='lightblue',
                  label='Post-treatment')

    ax.legend()
    plt.tight_layout()

    return fig

def calculate_statistical_power(effect_size, sample_size, alpha=0.05):
    """
    Calculate statistical power for detecting rule change effects
    """
    from scipy.stats import norm

    # Calculate non-centrality parameter
    ncp = effect_size * np.sqrt(sample_size)

    # Critical value for two-tailed test
    z_critical = norm.ppf(1 - alpha/2)

    # Power = P(reject H0 | H1 is true)
    power = 1 - norm.cdf(z_critical - ncp) + norm.cdf(-z_critical - ncp)

    return power

def simulate_rule_change_impact(true_effect, sample_size, num_simulations=1000):
    """
    Monte Carlo simulation to assess rule change detection
    """
    np.random.seed(42)

    detected = 0
    estimates = []

    for _ in range(num_simulations):
        # Simulate control group (no change)
        control_pre = np.random.normal(0, 1, sample_size)
        control_post = np.random.normal(0, 1, sample_size)

        # Simulate treatment group (with effect)
        treatment_pre = np.random.normal(0, 1, sample_size)
        treatment_post = np.random.normal(true_effect, 1, sample_size)

        # Calculate DiD estimate
        did_est = ((treatment_post.mean() - treatment_pre.mean()) -
                   (control_post.mean() - control_pre.mean()))

        estimates.append(did_est)

        # Test for significance
        # Simplified t-test
        se = np.sqrt(1/sample_size + 1/sample_size + 1/sample_size + 1/sample_size)
        t_stat = did_est / se

        if abs(t_stat) > 1.96:
            detected += 1

    power = detected / num_simulations

    return {
        'power': power,
        'mean_estimate': np.mean(estimates),
        'std_estimate': np.std(estimates),
        'estimates': estimates
    }

Practical Example: Complete Rule Change Analysis

# Complete workflow for analyzing shift ban impact

def complete_shift_ban_analysis():
    """
    Comprehensive statistical analysis of shift ban impact
    """

    # Step 1: Prepare data
    np.random.seed(42)

    # Create synthetic panel data
    players = 200
    seasons = [2021, 2022, 2023]

    data_list = []
    for player in range(players):
        is_lh_pull = player < 100  # First 100 are LH pull hitters

        for season in seasons:
            # Base BABIP
            base_babip = 0.290 if not is_lh_pull else 0.260

            # Shift effect (only pre-2023)
            if season < 2023 and is_lh_pull:
                shift_penalty = -0.025
            else:
                shift_penalty = 0

            # Generate outcome with noise
            babip = base_babip + shift_penalty + np.random.normal(0, 0.030)

            data_list.append({
                'player_id': player,
                'season': season,
                'lh_pull_hitter': is_lh_pull,
                'post_ban': season >= 2023,
                'babip': babip
            })

    df = pd.DataFrame(data_list)

    # Step 2: Difference-in-differences
    df['treatment_x_post'] = df['lh_pull_hitter'] & df['post_ban']

    # Cast the boolean indicators to float so statsmodels receives a numeric design matrix
    X = df[['lh_pull_hitter', 'post_ban', 'treatment_x_post']].astype(float)
    X = sm.add_constant(X)
    y = df['babip']

    did_model = sm.OLS(y, X).fit()

    print("=" * 60)
    print("SHIFT BAN IMPACT ANALYSIS")
    print("=" * 60)
    print("\nDifference-in-Differences Results:")
    print(f"Treatment Effect: {did_model.params['treatment_x_post']:.4f}")
    print(f"Standard Error: {did_model.bse['treatment_x_post']:.4f}")
    print(f"P-value: {did_model.pvalues['treatment_x_post']:.4f}")
    print(f"95% CI: [{did_model.conf_int().loc['treatment_x_post', 0]:.4f}, "
          f"{did_model.conf_int().loc['treatment_x_post', 1]:.4f}]")

    # Step 3: Visualize
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Panel A: Trends
    trends = df.groupby(['season', 'lh_pull_hitter'])['babip'].mean().reset_index()

    for is_lh in [False, True]:
        subset = trends[trends['lh_pull_hitter'] == is_lh]
        label = 'LH Pull Hitters' if is_lh else 'Other Hitters'
        axes[0].plot(subset['season'], subset['babip'], marker='o',
                    label=label, linewidth=2, markersize=8)

    axes[0].axvline(x=2022.5, color='red', linestyle='--',
                   label='Shift Ban', linewidth=2)
    axes[0].set_xlabel('Season')
    axes[0].set_ylabel('BABIP')
    axes[0].set_title('BABIP Trends by Player Type')
    axes[0].legend()
    axes[0].grid(alpha=0.3)

    # Panel B: Distribution comparison
    pre_treatment = df[(df['lh_pull_hitter']) & (df['season'] < 2023)]['babip']
    post_treatment = df[(df['lh_pull_hitter']) & (df['season'] >= 2023)]['babip']

    axes[1].hist(pre_treatment, alpha=0.5, bins=30,
                label='Pre-Ban (2021-2022)', color='lightcoral', density=True)
    axes[1].hist(post_treatment, alpha=0.5, bins=30,
                label='Post-Ban (2023)', color='lightgreen', density=True)
    axes[1].axvline(pre_treatment.mean(), color='red', linestyle='--', linewidth=2)
    axes[1].axvline(post_treatment.mean(), color='green', linestyle='--', linewidth=2)
    axes[1].set_xlabel('BABIP')
    axes[1].set_ylabel('Density')
    axes[1].set_title('BABIP Distribution: LH Pull Hitters')
    axes[1].legend()

    plt.tight_layout()

    return df, did_model, fig

# Run the complete analysis
data, model, fig = complete_shift_ban_analysis()
plt.show()

Causal Inference Framework in R

library(dplyr)
library(ggplot2)
library(broom)

# Difference-in-differences analysis framework
did_analysis_framework <- function(pre_data, post_data, treatment_group, control_group) {

  # Combine data; set factor levels so "pre" is the reference period
  # (otherwise R orders the levels alphabetically and the interaction
  # term would be named "treatedTRUE:periodpre")
  combined_data <- bind_rows(
    pre_data %>% mutate(period = "pre", treated = group %in% treatment_group),
    post_data %>% mutate(period = "post", treated = group %in% treatment_group)
  ) %>%
    mutate(period = factor(period, levels = c("pre", "post")))

  # Run difference-in-differences regression
  did_model <- lm(outcome ~ treated * period, data = combined_data)

  # Extract the interaction coefficient (the DiD estimate)
  results <- tidy(did_model) %>%
    filter(term == "treatedTRUE:periodpost")

  return(list(model = did_model, did_estimate = results))
}

# Analyze shift ban impact with statistical rigor
analyze_shift_ban_statistical <- function() {

  # Simulate player-level data for demonstration
  set.seed(123)

  # Pre-shift ban (2022) - left-handed pull hitters
  pre_ban <- data.frame(
    player_id = 1:100,
    player_type = "LH_Pull",
    babip = rnorm(100, mean = 0.260, sd = 0.040),
    shift_faced_pct = runif(100, 60, 85),
    season = 2022
  )

  # Post-shift ban (2023) - same players
  post_ban <- data.frame(
    player_id = 1:100,
    player_type = "LH_Pull",
    babip = rnorm(100, mean = 0.285, sd = 0.040),
    shift_faced_pct = 0,  # No shifts allowed
    season = 2023
  )

  # Control group (right-handed hitters less affected by shifts)
  control_pre <- data.frame(
    player_id = 101:200,
    player_type = "RH",
    babip = rnorm(100, mean = 0.290, sd = 0.040),
    shift_faced_pct = runif(100, 10, 25),
    season = 2022
  )

  control_post <- data.frame(
    player_id = 101:200,
    player_type = "RH",
    babip = rnorm(100, mean = 0.295, sd = 0.040),
    shift_faced_pct = 0,
    season = 2023
  )

  # Combine all data
  full_data <- bind_rows(pre_ban, post_ban, control_pre, control_post) %>%
    mutate(
      post_ban = season == 2023,
      treatment = player_type == "LH_Pull"
    )

  # Run DiD model
  did_model <- lm(babip ~ treatment * post_ban, data = full_data)

  return(list(data = full_data, model = did_model, summary = summary(did_model)))
}

# Regression discontinuity design for pitch clock
analyze_pitch_clock_rdd <- function(game_data) {

  # Create discontinuity at 2023 season start
  rdd_data <- game_data %>%
    mutate(
      days_from_cutoff = as.numeric(game_date - as.Date("2023-03-30")),
      post_clock = days_from_cutoff > 0
    ) %>%
    filter(abs(days_from_cutoff) <= 30)  # 30-day window around implementation

  # Run RDD model
  rdd_model <- lm(game_length_minutes ~ days_from_cutoff * post_clock,
                  data = rdd_data)

  # Estimate discontinuity (jump) at cutoff
  discontinuity <- coef(rdd_model)["post_clockTRUE"]

  return(list(model = rdd_model, effect = discontinuity))
}

# Time series analysis for rule changes
analyze_time_series_impact <- function(metric_data) {

  library(forecast)

  # Split into pre and post periods
  pre_period <- metric_data %>% filter(season < 2023)
  post_period <- metric_data %>% filter(season >= 2023)

  # Fit ARIMA model on pre-period
  ts_data <- ts(pre_period$value, start = min(pre_period$season))
  arima_model <- auto.arima(ts_data)

  # Forecast what would have happened without intervention
  forecast_result <- forecast(arima_model, h = nrow(post_period))

  # Compare actual vs forecast
  comparison <- data.frame(
    season = post_period$season,
    actual = post_period$value,
    predicted = as.numeric(forecast_result$mean),
    lower_95 = as.numeric(forecast_result$lower[,2]),
    upper_95 = as.numeric(forecast_result$upper[,2])
  ) %>%
    mutate(
      impact = actual - predicted,
      significant = actual < lower_95 | actual > upper_95
    )

  return(list(model = arima_model, forecast = forecast_result, comparison = comparison))
}

# Visualization of rule change impact
plot_rule_change_impact <- function(comparison_data, rule_name) {

  ggplot(comparison_data, aes(x = season)) +
    geom_ribbon(aes(ymin = lower_95, ymax = upper_95),
                alpha = 0.2, fill = "blue") +
    geom_line(aes(y = predicted, color = "Predicted"),
              linewidth = 1, linetype = "dashed") +
    geom_line(aes(y = actual, color = "Actual"),
              linewidth = 1.2) +
    geom_point(aes(y = actual, color = "Actual"),
               size = 3) +
    geom_vline(xintercept = min(comparison_data$season) - 0.5,
               linetype = "dashed", color = "red") +
    labs(
      title = paste("Impact of", rule_name),
      subtitle = "Actual values vs forecasted counterfactual",
      x = "Season",
      y = "Metric Value",
      color = "Series"
    ) +
    theme_minimal() +
    scale_color_manual(values = c("Actual" = "#D50032", "Predicted" = "#002D72"))
}

27.8 Exercises

Easy Exercises

Exercise 27.1 (Easy)
Calculate the percentage increase in stolen bases from 2022 to 2023. The 2022 season produced 2,487 stolen bases across 2,430 games, while 2023 produced 3,503 stolen bases over the same number of games.

Exercise 27.2 (Easy)
Using the baseballr or pybaseball package, retrieve pitch velocity data for a single pitcher across the 2023 season. Calculate their average fastball velocity and identify their fastest pitch of the season.

# Starter code for R
library(baseballr)

# Replace with actual pitcher ID
pitcher_id <- 594798  # Jacob deGrom's MLBAM ID

# Your code here

# Starter code for Python
from pybaseball import statcast_pitcher

# Your code here

Exercise 27.3 (Easy)
Calculate the reduction in base distance (in inches) caused by the increase from 15-inch to 18-inch bases. What percentage reduction does this represent from the original 90-foot (1,080-inch) base path?
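
A minimal sketch of the naive geometry, assuming the extra width is split evenly so that each enlarged base extends 1.5 inches farther toward the adjacent base. (The reductions MLB actually reported, roughly 4.5 inches between first and second and between second and third, and 3 inches on the home-to-first and third-to-home paths, differ slightly because of how the bases are anchored.)

# Naive geometric sketch for Exercise 27.3: the 3-inch size increase is assumed
# to shave 1.5 inches off each end of the base path
old_base, new_base = 15, 18      # base widths in inches
base_path = 90 * 12              # 90 feet = 1,080 inches

reduction = 2 * (new_base - old_base) / 2    # 1.5 inches at each end
pct_reduction = reduction / base_path * 100

print(f"Reduction: {reduction:.1f} inches ({pct_reduction:.2f}% of the 1,080-inch path)")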

Medium Exercises

Exercise 27.4 (Medium)
Analyze the times-through-order penalty for a specific starting pitcher. Using Statcast data, calculate their batting average against, slugging percentage, and exit velocity for the 1st, 2nd, and 3rd time through the batting order.

# Starter code
library(baseballr)
library(dplyr)

analyze_tto_penalty <- function(pitcher_id, season = 2023) {
  # Get pitcher's data

  # Group by times_through_order

  # Calculate metrics

  # Return summary
}

Exercise 27.5 (Medium)
Compare left-handed pull hitters' performance in 2022 vs 2023 to quantify the shift ban's impact (a starter sketch follows the list). Calculate:

  • Average BABIP in 2022 vs 2023
  • Average batting average on ground balls
  • Statistical significance of the difference (t-test)
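
The sketch below covers only the t-test portion; the synthetic lh_pull frame (built around the chapter's illustrative BABIP means of .260 pre-ban and .285 post-ban) is a placeholder for real player-season data.

# Starter sketch for Exercise 27.5. Replace the synthetic `lh_pull` frame with
# real player-season BABIP data for left-handed pull hitters.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
lh_pull = pd.DataFrame({
    'season': [2022] * 50 + [2023] * 50,
    'babip': np.concatenate([rng.normal(0.260, 0.035, 50),
                             rng.normal(0.285, 0.035, 50)])
})

babip_2022 = lh_pull.loc[lh_pull['season'] == 2022, 'babip']
babip_2023 = lh_pull.loc[lh_pull['season'] == 2023, 'babip']
print(f"2022 mean BABIP: {babip_2022.mean():.3f}")
print(f"2023 mean BABIP: {babip_2023.mean():.3f}")

# Welch's t-test on the difference in means
t_stat, p_value = stats.ttest_ind(babip_2023, babip_2022, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Repeat the same comparison for batting average on ground balls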

Exercise 27.6 (Medium)
Create a visualization showing the evolution of average game times from 2019 to 2023. Include a vertical line marking the pitch clock's implementation and annotate the average reduction in minutes.
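
A minimal plotting sketch follows; the avg_minutes values are rough placeholders for the nine-inning averages you compute, and the annotation simply reports the drop from 2022 to 2023.

# Starter sketch for Exercise 27.6. The game-time values are rough placeholders;
# replace them with averages computed from game logs.
import matplotlib.pyplot as plt

seasons = [2019, 2020, 2021, 2022, 2023]
avg_minutes = [185, 187, 190, 184, 160]   # placeholder nine-inning averages

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(seasons, avg_minutes, marker='o', linewidth=2, color='steelblue')
ax.axvline(x=2022.5, color='red', linestyle='--', label='Pitch clock introduced')

# Annotate the drop from the last pre-clock season to 2023
reduction = avg_minutes[-2] - avg_minutes[-1]
ax.annotate(f'-{reduction} min', xy=(2023, avg_minutes[-1]),
            xytext=(2021.5, avg_minutes[-1] + 8),
            arrowprops=dict(arrowstyle='->'))

ax.set_xlabel('Season')
ax.set_ylabel('Average Game Time (minutes)')
ax.set_title('Average Nine-Inning Game Time, 2019-2023')
ax.set_xticks(seasons)
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()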

Hard Exercises

Exercise 27.7 (Hard)
Implement a difference-in-differences analysis to estimate the causal impact of the shift ban on left-handed pull hitters. Use right-handed hitters as a control group. Your analysis should include:

  • Parallel trends test for pre-treatment periods
  • DiD regression model
  • Robustness checks (placebo tests)
  • Visualization of results

# Starter code
def shift_ban_did_analysis(data):
    """
    Implement DiD analysis for shift ban

    Parameters:
    -----------
    data : DataFrame with columns [player_id, season, babip, is_lh_pull]

    Returns:
    --------
    Dictionary with DiD estimate, standard error, p-value, and plots
    """
    # Your code here
    pass

Exercise 27.8 (Hard)
Build a predictive model to estimate the impact of the pitch clock on stolen base success rate. Your model should:

  • Incorporate pitcher delivery time to home plate
  • Account for runner sprint speed
  • Include lead distance from base
  • Control for pitcher pickoff move quality
  • Validate on 2023 data

# Starter code
library(randomForest)

build_sb_success_model <- function(training_data) {
  # Feature engineering

  # Train model

  # Evaluate performance

  # Return model and metrics
}

Exercise 27.9 (Hard)
Conduct an event study of the pitch clock's effect on pitcher performance. Create event-time dummies for each month relative to implementation and estimate dynamic treatment effects. Plot the coefficients with 95% confidence intervals to show how the effects evolved over the first season.
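
A hedged starter sketch: it builds a synthetic pitcher-month panel (placeholder values only), regresses the outcome on event-time dummies with month -1 as the omitted reference period, and reuses the chapter's visualize_event_study() helper for the plot. The xwoba_against outcome and the assumed post-clock effect of -0.008 are illustrative, not estimates.

# Starter sketch for Exercise 27.9. Replace the synthetic panel with real
# pitcher-month data; event_time is months relative to the pitch clock's debut.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
rows = []
for pitcher in range(100):
    for event_time in range(-3, 4):
        effect = -0.008 if event_time >= 0 else 0.0   # placeholder effect
        rows.append({'pitcher_id': pitcher,
                     'event_time': event_time,
                     'xwoba_against': 0.320 + effect + rng.normal(0, 0.030)})
panel = pd.DataFrame(rows)

# Event-time dummies as floats, with t = -1 as the reference period
dummies = pd.get_dummies(panel['event_time'], prefix='event').astype(float)
dummies = dummies.drop(columns=['event_-1'])

X = sm.add_constant(dummies)
model = sm.OLS(panel['xwoba_against'], X).fit()

coef_df = pd.DataFrame({
    'event_time': [int(c.split('_')[1]) for c in dummies.columns],
    'coefficient': model.params[dummies.columns].values,
    'std_error': model.bse[dummies.columns].values
}).sort_values('event_time')

# Reuse the chapter's plotting helper
fig = visualize_event_study(coef_df, 'Pitch Clock (2023)')
plt.show()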

Exercise 27.10 (Hard)
Design and implement a simulation study to determine the optimal pitcher usage strategy given the times-through-order penalty. Compare:

  • Traditional starter (6-7 IP)
  • Opener strategy (1-4-2 configuration)
  • Pure bullpen game (9 relievers)

Your simulation should account for:

  • TTO penalty magnitude
  • Bullpen fatigue effects
  • Reliever availability constraints
  • Win probability maximization

Include visualization of expected win rates and season-long bullpen usage patterns for each strategy.
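
A compact simulation skeleton rather than a full solution: it compares expected runs allowed for a traditional starter (with an assumed times-through-order penalty) against a pure bullpen game. Every numeric input is a placeholder assumption to replace with your own estimates, and the opener configuration, fatigue effects, availability constraints, and win-probability conversion are left as extensions.

# Skeleton for Exercise 27.10: compare two usage strategies on expected runs
# allowed per game. All numeric inputs below are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(7)

BASE_RA9 = 4.30        # assumed starter baseline (runs allowed per 9 innings)
TTO_PENALTY = 0.35     # assumed extra RA9 per additional time through the order
RELIEVER_RA9 = 4.10    # assumed fresh-reliever baseline

def innings_runs(ra9, innings):
    """Draw runs for a block of innings from a Poisson with the given RA9."""
    return rng.poisson(ra9 / 9 * innings)

def traditional_starter_game():
    # ~6 IP from the starter (about 2 IP per time through the order), 3 IP of relief
    runs = 0
    for tto in range(3):          # 1st, 2nd, 3rd time through the order
        runs += innings_runs(BASE_RA9 + tto * TTO_PENALTY, 2)
    return runs + innings_runs(RELIEVER_RA9, 3)

def bullpen_game():
    # Nine relievers, one inning each, every hitter seen for the first time
    return sum(innings_runs(RELIEVER_RA9, 1) for _ in range(9))

def expected_runs(strategy, n_games=10000):
    return np.mean([strategy() for _ in range(n_games)])

print(f"Traditional starter: {expected_runs(traditional_starter_game):.2f} runs/game")
print(f"Pure bullpen game:   {expected_runs(bullpen_game):.2f} runs/game")

# Extensions: add the 1-4-2 opener configuration, reliever fatigue and
# availability constraints, and convert runs allowed into win probability.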


Summary

This chapter has explored the most significant trends and rule changes shaping modern Major League Baseball. The velocity revolution continues to push the boundaries of human performance, while also raising concerns about pitcher health and sustainability. The 2023 rule changes - shift ban, pitch clock, and larger bases - represented the most comprehensive modifications to gameplay in generations, with measurable impacts on offensive production, game pace, and strategic approaches.

The rise of Three True Outcomes has fundamentally altered the aesthetic and strategic nature of baseball, though recent rule changes have begun to restore some balance between power and action. The opener strategy and bullpen specialization reflect teams' increasingly sophisticated understanding of performance optimization, even as they challenge traditional conceptions of pitcher roles.

As analysts, understanding these trends requires rigorous statistical methodology, careful consideration of confounding factors, and appropriate causal inference techniques. The frameworks presented in this chapter - difference-in-differences, regression discontinuity, event studies, and synthetic controls - provide the tools necessary to separate true causal effects from spurious correlations.

The future of baseball analytics will continue to grapple with these tensions: power vs. contact, specialization vs. versatility, tradition vs. optimization, and entertainment value vs. competitive advantage.

Chapter Summary

In this chapter, you learned about modern MLB trends and rule changes. Key topics covered:

  • The Velocity Revolution
  • Shift Ban Impact Analysis
  • Pitch Clock Effects on Performance
  • Larger Bases & Stolen Base Renaissance
  • Opener & Bullpen Game Strategy
  • Three True Outcomes Evolution