8.1.1 Why Defense is Hard to Quantify {#defense-challenge}
Defense presents unique measurement challenges that make it far more difficult to evaluate than offense or pitching:
Limited sample sizes: Even full-time players get 200-400 defensive chances per season, compared to 600-700 plate appearances. A shortstop might get three difficult ground balls in one game, none in the next. This variance makes defensive metrics noisier than offensive metrics—even a full season might not provide enough data for confident conclusions about true talent.
Context dependency: Defensive opportunities depend heavily on factors beyond the fielder's control. A pitcher who induces many ground balls gives his infielders more chances, and with them more room to accumulate defensive value. A fly-ball pitcher does the same for his outfielders. Park dimensions matter enormously: Fenway Park's Green Monster creates unique defensive challenges that don't exist in symmetrical ballparks.
Team effects: Unlike hitting, where each plate appearance is largely independent, defense involves coordination. A first baseman with limited reach might make the second baseman's job harder by demanding more accurate throws. A catcher's pitch framing affects how umpires call pitches for all the team's pitchers. Disentangling individual contributions from team effects requires sophisticated modeling.
Opportunity variation: Not all defensive chances are created equal. A ground ball hit 70 mph directly at a fielder is easy; one hit 100 mph in the 4-hole requires elite reaction time and range. Traditional statistics like fielding percentage treated all chances equally, so a routine play and a spectacular diving catch both counted as a single out.
The counterfactual problem: To evaluate defense, we need to know what would have happened with an average defender at that position. If a shortstop makes a diving stop on a ball up the middle, should that be worth +0.75 outs because an average shortstop would have let it through 75% of the time? We can't actually observe the counterfactual, though; we only see what this specific defender did.
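To put the sample-size point in numbers, here is a minimal sketch that compares the sampling noise of an observed rate over roughly 300 defensive chances with the same rate over roughly 650 plate appearances. The shared rate of 0.30 and the helper name `rate_standard_error` are assumptions chosen only to isolate the effect of sample size; real defensive and offensive rates differ.

```python
import math


def rate_standard_error(p: float, n: int) -> float:
    """Standard error of an observed success rate over n independent chances."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical: the same underlying rate observed over different sample sizes
p = 0.30
se_defense = rate_standard_error(p, 300)  # about one season of defensive chances
se_offense = rate_standard_error(p, 650)  # about one season of plate appearances

print(f"Standard error over 300 chances: {se_defense:.4f}")
print(f"Standard error over 650 PAs:     {se_offense:.4f}")
print(f"Noise ratio (defense / offense): {se_defense / se_offense:.2f}")
```

Under this binomial assumption the noise scales with one over the square root of the sample size, so halving the number of chances inflates the uncertainty by roughly 40%.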
8.1.2 Evolution of Defensive Metrics {#defense-evolution}
Baseball's attempts to quantify defense have evolved through several generations:
Traditional Era (1900s-1990s): Defense was measured almost exclusively by errors and fielding percentage (chances handled cleanly / total chances). These metrics were deeply flawed. They penalized defenders with great range (who got to balls others couldn't reach, creating more error opportunities) and credited defenders with poor range (who never touched balls they couldn't reach). A shortstop who made 10 errors but reached 50 balls others couldn't was rated worse than one who made 5 errors but only reached routine plays.
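The shortstop comparison above can be worked through directly. This sketch assumes both players see 400 routine chances (a hypothetical figure, not from any real season) and that the rangier player reaches 50 additional balls; `fielding_percentage` is an illustrative helper.

```python
def fielding_percentage(chances: int, errors: int) -> float:
    """Chances handled cleanly divided by total chances."""
    return (chances - errors) / chances


ROUTINE_CHANCES = 400  # hypothetical workload shared by both shortstops

# Rangy shortstop: reaches 50 extra balls, commits 10 errors
rangy_chances, rangy_errors = ROUTINE_CHANCES + 50, 10
# Limited shortstop: reaches only routine balls, commits 5 errors
limited_chances, limited_errors = ROUTINE_CHANCES, 5

rangy_fpct = fielding_percentage(rangy_chances, rangy_errors)
limited_fpct = fielding_percentage(limited_chances, limited_errors)
rangy_plays = rangy_chances - rangy_errors
limited_plays = limited_chances - limited_errors

print(f"Rangy:   FPCT {rangy_fpct:.3f}, plays made {rangy_plays}")
print(f"Limited: FPCT {limited_fpct:.3f}, plays made {limited_plays}")
```

The rangy shortstop ends up with the lower fielding percentage despite converting 45 more balls into outs, which is exactly the distortion described above.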
Zone Rating Era (1990s-2000s): STATS Inc. introduced zone rating, which divided the field into zones and measured how often fielders made plays on balls hit into their zone. This was one of the first widely used metrics to account for range: players were credited for making plays, not just penalized for errors. However, zone definitions were somewhat arbitrary, and the metric didn't account for batted ball velocity or exact positioning.
UZR/DRS Era (2000s-2010s): Ultimate Zone Rating (UZR) and Defensive Runs Saved (DRS) represented major advances. Both metrics:
- Divided the field into much finer zones
- Accounted for batted ball type (ground ball, line drive, fly ball)
- Considered game situation (runners on base, shift positioning)
- Converted defensive plays to runs saved/cost relative to average
These metrics became the gold standard for defensive evaluation and remain widely used. We'll examine them in detail later in this chapter.
Statcast Era (2015-present): MLB's Statcast system uses high-speed cameras and radar to track every player's position and movement at 30 frames per second. This enables unprecedented precision in defensive measurement:
- Exact positioning when the ball is hit
- Route efficiency to the ball
- Sprint speed and burst
- Jump/reaction time
- Catch probability based on distance, hang time, and direction
- Arm strength and exchange time for throws
Statcast's Outs Above Average (OAA) has become the premier defensive metric because it's based on objective tracking data rather than subjective zone assignments.
8.1.3 What Good Defensive Metrics Should Measure {#good-metrics}
An ideal defensive metric should:
- Account for opportunity: Credit fielders for attempting difficult plays, not just making routine ones
- Adjust for context: Consider ballpark, pitcher tendencies, and positioning
- Be repeatable: Measure true talent rather than random variance, as shown by a strong year-to-year correlation
- Translate to runs: Express defensive value in runs saved/cost for easy comparison to offense
- Be granular: Break down into components (range, arm, hands) so we understand what drives value
- Update with information: Incorporate new tracking data as it becomes available
Statcast-era metrics meet these criteria better than any previous system, though imperfections remain. Sample size issues persist, and certain aspects of defense (positioning, communication, relay throws) remain difficult to fully capture.
Outs Above Average (OAA) is Statcast's flagship defensive metric. It estimates how many outs a player made above or below what an average fielder at their position would have made on the same set of batted balls.
8.2.1 Understanding OAA {#understanding-oaa}
OAA works by calculating the probability that an average fielder would successfully convert each batted ball into an out, then comparing actual results to these probabilities:
OAA = Σ (Actual Outcome - Expected Outcome)
For each batted ball:
- If a fielder makes a play with 30% catch probability: +0.7 outs (1.0 actual - 0.3 expected)
- If a fielder misses a play with 80% catch probability: -0.8 outs (0.0 actual - 0.8 expected)
- If a fielder makes a routine play with 99% catch probability: +0.01 outs
Over a full season, these credits and debits accumulate. A player with +10 OAA made 10 more outs than an average fielder would have made on the same batted balls. A player with -5 OAA made 5 fewer outs.
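The accumulation can be sketched with the three plays listed above. The function name `outs_above_average` and the tuple layout are illustrative choices, not part of any Statcast interface.

```python
def outs_above_average(plays):
    """Sum (actual outcome - expected outcome) over (made, catch_probability) pairs."""
    return sum(made - catch_prob for made, catch_prob in plays)


# (made, catch_probability): 1 = out recorded, 0 = ball not converted
plays = [
    (1, 0.30),  # difficult play made:  +0.70
    (0, 0.80),  # likely out missed:    -0.80
    (1, 0.99),  # routine play made:    +0.01
]

total = outs_above_average(plays)
print(f"OAA over {len(plays)} plays: {total:+.2f}")
```

One missed high-probability play wipes out the credit from a spectacular catch, which is why the metric needs many opportunities before it stabilizes.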
Why OAA is valuable:
- Based on objective tracking data, not subjective scorer decisions
- Accounts for difficulty of each play via catch probability
- Updates continuously as more data accumulates
- Breaks down by component (range, arm, blocking for catchers)
- Publicly available for all players at Baseball Savant
Limitations of OAA:
- Small sample sizes create noise (even full-season metrics have uncertainty)
- Doesn't account for all defensive contributions (positioning, communication)
- Catch probability models can struggle with extreme plays (diving catches, wall plays)
- Positioning data isn't always perfect, affecting opportunity calculations
8.2.2 OAA Components {#oaa-components}
OAA breaks down into several components depending on position:
For outfielders:
- Outs Above Average: Total outfield OAA (range + jumps + routes)
- Arm component: Baserunner advancement prevented (not fully integrated into total OAA)
For infielders:
- Outs Above Average: Total infield OAA (range + reaction)
- Arm component: Evaluated separately for some positions
For catchers:
- Framing runs: Extra strikes gained through receiving
- Throwing runs: Preventing stolen bases
- Blocking runs: Preventing wild pitches/passed balls
Let's examine how to calculate OAA-like metrics using Statcast data:
R Implementation:
library(tidyverse)
library(baseballr)
# Get Statcast data for outfield plays (example: July 2024)
outfield_plays <- statcast_search(
  start_date = "2024-07-01",
  end_date = "2024-07-31",
  player_type = "batter"
) %>%
  filter(
    !is.na(hit_distance_sc),
    !is.na(hit_location),
    !is.na(launch_speed),
    !is.na(hc_x),
    !is.na(hc_y)
  )
# Calculate catch probability based on hang time and distance
# This is simplified - actual Statcast model is more complex.
# The coefficients are illustrative, not fitted: with raw feet as input the
# logit saturates, so most plays land on the 0.01 floor.
calc_catch_probability <- function(hang_time, distance, direction) {
  # Baseline probability from a logistic curve
  base_prob <- plogis(3 - 0.15 * distance - 0.5 * abs(direction) + 2 * hang_time)
  # Keep probabilities inside [0.01, 0.99]
  pmax(0.01, pmin(0.99, base_prob))
}
# Process outfield catches
outfield_defense <- outfield_plays %>%
  filter(hit_location %in% c(7, 8, 9)) %>% # Outfield positions
  mutate(
    was_caught = ifelse(events %in% c("field_out", "double_play",
                                      "triple_play", "sac_fly"), 1, 0),
    hang_time = launch_speed / 100, # Crude stand-in, not a physical hang time
    direction = hc_x - 125.42, # Horizontal offset from the center line
    catch_prob = calc_catch_probability(hang_time, hit_distance_sc, direction),
    outs_above_avg = was_caught - catch_prob,
    # Credit the outfielder who fielded the ball (fielder_2 is the catcher)
    fielder_id = case_when(
      hit_location == 7 ~ fielder_7,
      hit_location == 8 ~ fielder_8,
      hit_location == 9 ~ fielder_9
    )
  )
# Aggregate by fielder
player_oaa <- outfield_defense %>%
  filter(!is.na(fielder_id)) %>%
  group_by(fielder_id) %>%
  summarize(
    opportunities = n(),
    catches = sum(was_caught),
    expected_catches = sum(catch_prob),
    oaa = sum(outs_above_avg),
    .groups = "drop"
  ) %>%
  filter(opportunities >= 20) %>% # Minimum sample size
  arrange(desc(oaa))
# Display top defenders
print(player_oaa %>% head(10))
# Visualize OAA distribution
ggplot(player_oaa, aes(x = oaa)) +
  geom_histogram(binwidth = 0.5, fill = "#2b8cbe", color = "white", alpha = 0.8) +
  # linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4)
  geom_vline(xintercept = 0, linetype = "dashed", color = "red", linewidth = 1) +
  labs(
    title = "Distribution of Outfielder OAA - July 2024",
    subtitle = "One month of data shows wide variance in defensive value",
    x = "Outs Above Average (OAA)",
    y = "Number of Players",
    caption = "Data: MLB Statcast via baseballr\nNote: Simplified catch probability model"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
from scipy.special import expit  # Logistic function
# Set style
sns.set_style("whitegrid")
# Get Statcast data for July 2024
outfield_plays = statcast(start_dt='2024-07-01', end_dt='2024-07-31')
# Filter for valid outfield plays
outfield_plays = outfield_plays[
    outfield_plays['hit_distance_sc'].notna() &
    outfield_plays['launch_speed'].notna() &
    outfield_plays['hc_x'].notna() &
    outfield_plays['hc_y'].notna() &
    outfield_plays['hit_location'].isin([7, 8, 9])  # Outfield
].copy()
# Calculate catch probability (simplified model)
# The coefficients are illustrative, not fitted: with raw feet as input the
# logit saturates, so most plays land on the 0.01 floor.
def calc_catch_probability(row):
    hang_time = row['launch_speed'] / 100  # Crude stand-in, not a physical hang time
    distance = row['hit_distance_sc']
    direction = abs(row['hc_x'] - 125.42)  # Horizontal offset from the center line
    # Logistic regression model (simplified)
    logit = 3 - 0.15 * distance - 0.5 * direction + 2 * hang_time
    prob = expit(logit)  # Sigmoid function
    return np.clip(prob, 0.01, 0.99)
# Apply catch probability calculation
outfield_plays['catch_prob'] = outfield_plays.apply(calc_catch_probability, axis=1)
# Determine if ball was caught
caught_events = ['field_out', 'double_play', 'triple_play', 'sac_fly']
outfield_plays['was_caught'] = outfield_plays['events'].isin(caught_events).astype(int)
# Calculate outs above average
outfield_plays['outs_above_avg'] = outfield_plays['was_caught'] - outfield_plays['catch_prob']
# Credit the outfielder who fielded the ball (fielder_2 is the catcher)
outfield_plays['fielder_id'] = outfield_plays.apply(
    lambda row: row[f"fielder_{int(row['hit_location'])}"], axis=1
)
# Aggregate by fielder
player_oaa = outfield_plays.groupby('fielder_id').agg(
    opportunities=('outs_above_avg', 'count'),
    catches=('was_caught', 'sum'),
    expected_catches=('catch_prob', 'sum'),
    oaa=('outs_above_avg', 'sum')
).reset_index()
# Filter minimum sample size
player_oaa = player_oaa[player_oaa['opportunities'] >= 20].sort_values('oaa', ascending=False)
print("\nTop 10 Defenders by OAA (July 2024):")
print(player_oaa.head(10))
# Visualize OAA distribution
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(player_oaa['oaa'], bins=20, color='#2b8cbe', alpha=0.8, edgecolor='white')
ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Average (0 OAA)')
ax.set_xlabel('Outs Above Average (OAA)', fontsize=12)
ax.set_ylabel('Number of Players', fontsize=12)
ax.set_title('Distribution of Outfielder OAA - July 2024\nOne month of data shows wide variance in defensive value',
             fontsize=14, fontweight='bold', pad=20)
ax.legend()
plt.figtext(0.99, 0.01, 'Data: MLB Statcast via pybaseball\nNote: Simplified catch probability model',
            ha='right', fontsize=9, style='italic')
plt.tight_layout()
plt.show()
# Calculate summary statistics
print("\nOAA Summary Statistics:")
print(f"Mean: {player_oaa['oaa'].mean():.2f}")
print(f"Median: {player_oaa['oaa'].median():.2f}")
print(f"Std Dev: {player_oaa['oaa'].std():.2f}")
print(f"Range: {player_oaa['oaa'].min():.2f} to {player_oaa['oaa'].max():.2f}")
8.2.3 Catch Probability Explained {#catch-probability}
Catch probability is the foundation of OAA. For each batted ball, Statcast estimates the likelihood an average defender would convert it to an out based on:
Distance to ball: How far the fielder must travel from their starting position
Hang time: How long the ball is in the air (for fly balls)
Direction: Whether the ball is directly in front, to the side, or behind
Initial velocity: How hard the ball was hit
Spin and trajectory: Ball flight characteristics
The actual Statcast model uses machine learning trained on hundreds of thousands of batted balls. For each play, it identifies similar plays from the historical database and calculates what percentage resulted in outs.
Real example - Kevin Kiermaier diving catch (2019):
- Distance to ball: 92 feet
- Hang time: 4.2 seconds
- Direction: Directly to right
- Catch probability: 28%
- Credit: +0.72 OAA
Real example - Byron Buxton routine fly (2023):
- Distance to ball: 35 feet
- Hang time: 5.1 seconds
- Direction: Slightly to left
- Catch probability: 97%
- Credit: +0.03 OAA
The beauty of this system is that spectacular plays on low-probability balls are properly valued, while routine plays on high-probability balls receive minimal credit. This solves the long-standing problem of defenders getting equal credit for vastly different difficulty levels.
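The "similar plays" idea can be illustrated with a small sketch. It is not the Statcast model: the `Play` record, the `empirical_catch_probability` helper, the tolerance windows, and the seven-play history are all hypothetical, chosen only to show how a catch rate among comparable historical plays becomes a probability and then a credit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Play:
    distance_ft: float   # distance the fielder needed to cover
    hang_time_s: float   # time the ball was in the air
    caught: bool         # whether the play was made


def empirical_catch_probability(
    history: List[Play],
    distance_ft: float,
    hang_time_s: float,
    dist_tol: float = 5.0,
    time_tol: float = 0.3,
) -> Optional[float]:
    """Share of historical plays within the tolerance window that were caught."""
    similar = [
        p for p in history
        if abs(p.distance_ft - distance_ft) <= dist_tol
        and abs(p.hang_time_s - hang_time_s) <= time_tol
    ]
    if not similar:
        return None  # no comparable plays, so no estimate
    return sum(p.caught for p in similar) / len(similar)


# Made-up history: four long runs (one caught) and three easy flies (all caught)
history = [
    Play(90, 4.1, False), Play(92, 4.2, True), Play(94, 4.3, False),
    Play(91, 4.0, False), Play(35, 5.0, True), Play(36, 5.2, True),
    Play(33, 5.1, True),
]

hard = empirical_catch_probability(history, distance_ft=92, hang_time_s=4.2)
easy = empirical_catch_probability(history, distance_ft=35, hang_time_s=5.1)
print(f"Hard play: probability {hard:.2f}, credit if caught {1 - hard:+.2f}")
print(f"Easy play: probability {easy:.2f}, credit if caught {1 - easy:+.2f}")
```

A production model would smooth across neighboring buckets and draw on far more plays, but the credit arithmetic is the same: one minus the estimated probability for a play made, minus the probability for a play missed.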
Beyond just making plays, how efficiently fielders reach batted balls matters enormously. Two outfielders might both catch the same fly ball, but one might take an optimal route while the other wastes precious seconds on a roundabout path.
8.3.1 Outfielder Routes and Metrics {#outfield-routes}
Route efficiency measures how direct a fielder's path was to the ball, expressed as a percentage:
Route Efficiency = (Direct Distance to Ball) / (Actual Distance Covered) × 100%
A perfectly direct route has 100% efficiency. A route that covers extra distance (due to poor reads, hesitation, or avoiding obstacles) scores lower. Elite outfielders consistently achieve 95%+ efficiency on routine flies.
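As a rough sketch of the formula above, route efficiency can be computed from a sequence of tracked positions. The `(x, y)` coordinates in feet and both helper names are assumptions for illustration; real tracking data is sampled many times per second.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(points: List[Point]) -> float:
    """Total distance covered along consecutive tracked positions."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def route_efficiency(points: List[Point]) -> float:
    """Direct start-to-end distance divided by distance covered, as a percentage."""
    actual = path_length(points)
    if actual == 0:
        raise ValueError("fielder did not move; efficiency is undefined")
    direct = math.dist(points[0], points[-1])
    return 100.0 * direct / actual


# Hypothetical routes to the same spot, 50 feet from the starting position
straight = [(0.0, 0.0), (30.0, 40.0)]
banana = [(0.0, 0.0), (30.0, 0.0), (30.0, 40.0)]  # breaks sideways first, then back

print(f"Straight route: {route_efficiency(straight):.1f}%")
print(f"Banana route:   {route_efficiency(banana):.1f}%")
```

The second route covers 70 feet to reach a ball 50 feet away, so the extra 20 feet show up directly as lost efficiency.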
Jump measures how quickly a fielder reacts to the batted ball, calculated as:
Jump = Distance Covered in First 3 Seconds After Contact
A good jump requires:
- Quick reaction to ball off bat
- Correct read of ball trajectory
- Explosive first step
- Proper positioning for the play
Real-world examples:
Mookie Betts (2023 average):
- Route efficiency: 96.8% (elite)
- Average jump: 11.2 feet
- Sprint speed: 28.9 ft/s
- OAA: +12 (excellent)
Kevin Kiermaier (career average):
- Route efficiency: 97.1% (historically elite)
- Average jump: 12.1 feet (exceptional)
- Sprint speed: 28.2 ft/s
- OAA: +67 from 2015-2022 (Gold Glove caliber)
Let's analyze route efficiency with code:
R Implementation:
library(tidyverse)
library(baseballr)
# Get player tracking data (conceptual - actual API varies)
# In practice, you'd use Baseball Savant's sprint speed and route data
# Simulated route data for illustration
set.seed(42)
route_data <- data.frame(
  player = rep(c("Mookie Betts", "Harrison Bader", "Randy Arozarena",
                 "Kyle Tucker", "Byron Buxton"), each = 50),
  play_id = 1:250
) %>%
  mutate(
    direct_distance = runif(250, 20, 100),
    # case_when() requires each branch to have length 1 or n(),
    # so every branch draws n() values and only the matching rows are kept
    route_efficiency = case_when(
      player == "Mookie Betts" ~ rnorm(n(), 0.968, 0.02),
      player == "Harrison Bader" ~ rnorm(n(), 0.965, 0.025),
      player == "Byron Buxton" ~ rnorm(n(), 0.972, 0.018),
      player == "Randy Arozarena" ~ rnorm(n(), 0.955, 0.03),
      player == "Kyle Tucker" ~ rnorm(n(), 0.958, 0.028)
    ),
    # Clip to a realistic range
    route_efficiency = pmin(1.0, pmax(0.85, route_efficiency)),
    actual_distance = direct_distance / route_efficiency,
    extra_distance = actual_distance - direct_distance,
    jump_distance = rnorm(250, 11, 1.5)
  )
# Summarize by player
player_routes <- route_data %>%
  group_by(player) %>%
  summarize(
    plays = n(),
    avg_route_efficiency = mean(route_efficiency),
    avg_jump = mean(jump_distance),
    avg_direct_distance = mean(direct_distance),
    avg_extra_distance = mean(extra_distance),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_route_efficiency))
print(player_routes)
# Visualize route efficiency vs jump
ggplot(player_routes, aes(x = avg_route_efficiency, y = avg_jump)) +
  geom_point(size = 4, color = "#2b8cbe", alpha = 0.7) +
  geom_text(aes(label = player), vjust = -0.8, size = 3.5) +
  geom_vline(xintercept = mean(player_routes$avg_route_efficiency),
             linetype = "dashed", color = "gray50", alpha = 0.7) +
  geom_hline(yintercept = mean(player_routes$avg_jump),
             linetype = "dashed", color = "gray50", alpha = 0.7) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  labs(
    title = "Outfielder Route Efficiency vs Jump",
    subtitle = "Top right quadrant shows elite combination of reads and reactions",
    x = "Average Route Efficiency (%)",
    y = "Average Jump (feet in first 3 seconds)",
    caption = "Data: Simulated for illustration"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    panel.grid.minor = element_blank()
  )
# Show distribution of route efficiency
ggplot(route_data, aes(x = route_efficiency, fill = player)) +
  geom_density(alpha = 0.5) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title = "Route Efficiency Distribution by Player",
    subtitle = "Tighter distributions indicate more consistent routes",
    x = "Route Efficiency",
    y = "Density",
    fill = "Player"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    legend.position = "bottom"
  )
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed for reproducibility
np.random.seed(42)
sns.set_style("whitegrid")
# Simulate route data for top outfielders
players = ['Mookie Betts', 'Harrison Bader', 'Randy Arozarena', 'Kyle Tucker', 'Byron Buxton']
n_plays = 50
route_data = []
for player in players:
    # Generate player-specific route efficiency
    if player == 'Mookie Betts':
        efficiency = np.random.normal(0.968, 0.02, n_plays)
    elif player == 'Harrison Bader':
        efficiency = np.random.normal(0.965, 0.025, n_plays)
    elif player == 'Byron Buxton':
        efficiency = np.random.normal(0.972, 0.018, n_plays)
    elif player == 'Randy Arozarena':
        efficiency = np.random.normal(0.955, 0.03, n_plays)
    else:  # Kyle Tucker
        efficiency = np.random.normal(0.958, 0.028, n_plays)
    # Clip to realistic range
    efficiency = np.clip(efficiency, 0.85, 1.0)
    # Generate other metrics
    direct_distance = np.random.uniform(20, 100, n_plays)
    actual_distance = direct_distance / efficiency
    jump_distance = np.random.normal(11, 1.5, n_plays)
    for i in range(n_plays):
        route_data.append({
            'player': player,
            'play_id': len(route_data) + 1,
            'direct_distance': direct_distance[i],
            'route_efficiency': efficiency[i],
            'actual_distance': actual_distance[i],
            'extra_distance': actual_distance[i] - direct_distance[i],
            'jump_distance': jump_distance[i]
        })
route_df = pd.DataFrame(route_data)
# Summarize by player
player_routes = route_df.groupby('player').agg({
    'play_id': 'count',
    'route_efficiency': 'mean',
    'jump_distance': 'mean',
    'direct_distance': 'mean',
    'extra_distance': 'mean'
}).round(3)
player_routes.columns = ['plays', 'avg_route_efficiency', 'avg_jump',
                         'avg_direct_distance', 'avg_extra_distance']
player_routes = player_routes.sort_values('avg_route_efficiency', ascending=False)
print("\nPlayer Route Metrics:")
print(player_routes)
# Visualize route efficiency vs jump
fig, ax = plt.subplots(figsize=(12, 8))
for player in players:
    player_data = player_routes.loc[player]
    ax.scatter(player_data['avg_route_efficiency'], player_data['avg_jump'],
               s=150, alpha=0.7, label=player)
    ax.text(player_data['avg_route_efficiency'], player_data['avg_jump'] + 0.15,
            player, fontsize=9, ha='center')
# Add average lines
ax.axvline(player_routes['avg_route_efficiency'].mean(),
           color='gray', linestyle='--', alpha=0.5, label='Avg Efficiency')
ax.axhline(player_routes['avg_jump'].mean(),
           color='gray', linestyle='--', alpha=0.5, label='Avg Jump')
ax.set_xlabel('Average Route Efficiency', fontsize=12)
ax.set_ylabel('Average Jump (feet in first 3 seconds)', fontsize=12)
ax.set_title('Outfielder Route Efficiency vs Jump\nTop right quadrant shows elite combination of reads and reactions',
             fontsize=14, fontweight='bold', pad=20)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x:.1%}'))
plt.figtext(0.99, 0.01, 'Data: Simulated for illustration',
            ha='right', fontsize=9, style='italic')
plt.tight_layout()
plt.show()
# Show distribution of route efficiency by player
fig, ax = plt.subplots(figsize=(12, 6))
for player in players:
    player_data = route_df[route_df['player'] == player]
    ax.hist(player_data['route_efficiency'], bins=15, alpha=0.4, label=player)
ax.set_xlabel('Route Efficiency', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Route Efficiency Distribution by Player\nTighter distributions indicate more consistent routes',
             fontsize=14, fontweight='bold')
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x:.1%}'))
ax.legend()
plt.tight_layout()
plt.show()
8.3.2 Infielder Metrics {#infield-metrics}
Infield defense requires different skills than outfield defense. Rather than long routes to fly balls, infielders need explosive reactions to hard-hit grounders and line drives, plus quick exchanges and accurate throws.
Key infield metrics:
Reaction time: Time from ball contact to first movement (typical: 0.15-0.25 seconds)
Exchange time: Time from ball hitting glove to release of throw (elite: <0.6 seconds, average: 0.7-0.8 seconds)
Arm strength: Throwing velocity across the infield (varies by position)
- Third basemen: 80-90 mph
- Shortstops: 80-88 mph
- Second basemen: 75-85 mph
- First basemen: 70-80 mph (less relevant)
Range: Lateral movement and area covered (measured via OAA)
Hands/Fielding: Converting batted balls into controlled outs (measured via error rate on contacted balls)
Real-world examples:
Dansby Swanson (2024):
- Reaction time: 0.18 seconds (elite)
- Median exchange: 0.65 seconds
- Arm strength: 86.2 mph (strong for SS)
- OAA: +11 (excellent)
Matt Chapman (2024):
- Reaction time: 0.19 seconds
- Median exchange: 0.63 seconds
- Arm strength: 89.1 mph (elite for 3B)
- OAA: +14 (Gold Glove caliber)
Infielders also benefit from good positioning and pitch anticipation. Knowing a pitcher tends to induce ground balls to the pull side allows infielders to shade accordingly, improving their effective range.
library(tidyverse)
library(baseballr)
# Get player tracking data (conceptual - actual API varies)
# In practice, you'd use Baseball Savant's sprint speed and route data
# Simulated route data for illustration
set.seed(42)
route_data <- data.frame(
player = rep(c("Mookie Betts", "Harrison Bader", "Randy Arozarena",
"Kyle Tucker", "Byron Buxton"), each = 50),
play_id = 1:250
) %>%
mutate(
direct_distance = runif(250, 20, 100),
route_efficiency = case_when(
player == "Mookie Betts" ~ rnorm(50, 0.968, 0.02),
player == "Harrison Bader" ~ rnorm(50, 0.965, 0.025),
player == "Byron Buxton" ~ rnorm(50, 0.972, 0.018),
player == "Randy Arozarena" ~ rnorm(50, 0.955, 0.03),
player == "Kyle Tucker" ~ rnorm(50, 0.958, 0.028)
),
route_efficiency = pmin(1.0, pmax(0.85, route_efficiency)),
actual_distance = direct_distance / route_efficiency,
extra_distance = actual_distance - direct_distance,
jump_distance = rnorm(250, 11, 1.5)
)
# Summarize by player
player_routes <- route_data %>%
group_by(player) %>%
summarize(
plays = n(),
avg_route_efficiency = mean(route_efficiency),
avg_jump = mean(jump_distance),
avg_direct_distance = mean(direct_distance),
avg_extra_distance = mean(extra_distance),
.groups = "drop"
) %>%
arrange(desc(avg_route_efficiency))
print(player_routes)
# Visualize route efficiency vs jump
ggplot(player_routes, aes(x = avg_route_efficiency, y = avg_jump)) +
geom_point(size = 4, color = "#2b8cbe", alpha = 0.7) +
geom_text(aes(label = player), vjust = -0.8, size = 3.5) +
geom_vline(xintercept = mean(player_routes$avg_route_efficiency),
linetype = "dashed", color = "gray50", alpha = 0.7) +
geom_hline(yintercept = mean(player_routes$avg_jump),
linetype = "dashed", color = "gray50", alpha = 0.7) +
scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
labs(
title = "Outfielder Route Efficiency vs Jump",
subtitle = "Top right quadrant shows elite combination of reads and reactions",
x = "Average Route Efficiency (%)",
y = "Average Jump (feet in first 3 seconds)",
caption = "Data: Simulated for illustration"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
panel.grid.minor = element_blank()
)
# Show distribution of route efficiency
ggplot(route_data, aes(x = route_efficiency, fill = player)) +
geom_density(alpha = 0.5) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Route Efficiency Distribution by Player",
subtitle = "Tighter distributions indicate more consistent routes",
x = "Route Efficiency",
y = "Density",
fill = "Player"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
legend.position = "bottom"
)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set random seed for reproducibility
np.random.seed(42)
sns.set_style("whitegrid")
# Simulate route data for top outfielders
players = ['Mookie Betts', 'Harrison Bader', 'Randy Arozarena', 'Kyle Tucker', 'Byron Buxton']
n_plays = 50
route_data = []
for player in players:
# Generate player-specific route efficiency
if player == 'Mookie Betts':
efficiency = np.random.normal(0.968, 0.02, n_plays)
elif player == 'Harrison Bader':
efficiency = np.random.normal(0.965, 0.025, n_plays)
elif player == 'Byron Buxton':
efficiency = np.random.normal(0.972, 0.018, n_plays)
elif player == 'Randy Arozarena':
efficiency = np.random.normal(0.955, 0.03, n_plays)
else: # Kyle Tucker
efficiency = np.random.normal(0.958, 0.028, n_plays)
# Clip to realistic range
efficiency = np.clip(efficiency, 0.85, 1.0)
# Generate other metrics
direct_distance = np.random.uniform(20, 100, n_plays)
actual_distance = direct_distance / efficiency
jump_distance = np.random.normal(11, 1.5, n_plays)
for i in range(n_plays):
route_data.append({
'player': player,
'play_id': len(route_data) + 1,
'direct_distance': direct_distance[i],
'route_efficiency': efficiency[i],
'actual_distance': actual_distance[i],
'extra_distance': actual_distance[i] - direct_distance[i],
'jump_distance': jump_distance[i]
})
route_df = pd.DataFrame(route_data)
# Summarize by player
player_routes = route_df.groupby('player').agg({
'play_id': 'count',
'route_efficiency': 'mean',
'jump_distance': 'mean',
'direct_distance': 'mean',
'extra_distance': 'mean'
}).round(3)
player_routes.columns = ['plays', 'avg_route_efficiency', 'avg_jump',
'avg_direct_distance', 'avg_extra_distance']
player_routes = player_routes.sort_values('avg_route_efficiency', ascending=False)
print("\nPlayer Route Metrics:")
print(player_routes)
# Visualize route efficiency vs jump
fig, ax = plt.subplots(figsize=(12, 8))
for player in players:
    player_data = player_routes.loc[player]
    ax.scatter(player_data['avg_route_efficiency'], player_data['avg_jump'],
               s=150, alpha=0.7, label=player)
    ax.text(player_data['avg_route_efficiency'], player_data['avg_jump'] + 0.15,
            player, fontsize=9, ha='center')
# Add average lines
ax.axvline(player_routes['avg_route_efficiency'].mean(),
color='gray', linestyle='--', alpha=0.5, label='Avg Efficiency')
ax.axhline(player_routes['avg_jump'].mean(),
color='gray', linestyle='--', alpha=0.5, label='Avg Jump')
ax.set_xlabel('Average Route Efficiency', fontsize=12)
ax.set_ylabel('Average Jump (feet in first 3 seconds)', fontsize=12)
ax.set_title('Outfielder Route Efficiency vs Jump\nTop right quadrant shows elite combination of reads and reactions',
fontsize=14, fontweight='bold', pad=20)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x:.1%}'))
plt.figtext(0.99, 0.01, 'Data: Simulated for illustration',
ha='right', fontsize=9, style='italic')
plt.tight_layout()
plt.show()
# Show distribution of route efficiency by player
fig, ax = plt.subplots(figsize=(12, 6))
for player in players:
    player_data = route_df[route_df['player'] == player]
    ax.hist(player_data['route_efficiency'], bins=15, alpha=0.4, label=player)
ax.set_xlabel('Route Efficiency', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12)
ax.set_title('Route Efficiency Distribution by Player\nTighter distributions indicate more consistent routes',
fontsize=14, fontweight='bold')
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{x:.1%}'))
ax.legend()
plt.tight_layout()
plt.show()
8.4.1 Outfield Arm Metrics {#outfield-arm}
An outfielder's arm affects the game both by throwing out baserunners and by deterring them from attempting extra bases. A strong-armed outfielder might only record 8 assists per season, but prevent 20+ runners from trying to advance.
Outfield arm metrics:
Arm strength: Maximum throwing velocity (measured on hardest throw)
- Elite: 95+ mph
- Above average: 90-95 mph
- Average: 85-90 mph
- Below average: <85 mph
Assists: Direct outs via throws (traditional stat, limited sample)
Baserunner kills: Runners thrown out attempting to advance
Extra bases prevented: Runners who held up due to arm reputation (harder to measure directly)
Real examples (2024 leaders):
Jorge Soler (OF, San Francisco/Atlanta):
- Max arm strength: 97.4 mph
- MLB rank: Top 5%
- Assists: 7
Randy Arozarena (OF, Tampa Bay/Seattle):
- Max arm strength: 95.8 mph
- MLB rank: Top 10%
- Assists: 9
Aaron Judge (OF, New York Yankees):
- Max arm strength: 94.1 mph
- MLB rank: Top 15%
- Assists: 6
Let's analyze outfield arms:
R Implementation:
library(tidyverse)
# Simulated outfield arm data
set.seed(123)
n_players <- 30
arm_data <- data.frame(
player = paste("Player", 1:n_players),
max_arm_strength = rnorm(n_players, 88, 4),
assists = rpois(n_players, 5),
opportunities = rpois(n_players, 25)
) %>%
mutate(
max_arm_strength = pmax(78, pmin(98, max_arm_strength)),
assist_rate = assists / opportunities,
arm_tier = case_when(
max_arm_strength >= 95 ~ "Elite (95+)",
max_arm_strength >= 90 ~ "Above Avg (90-95)",
max_arm_strength >= 85 ~ "Average (85-90)",
TRUE ~ "Below Avg (<85)"
),
arm_tier = factor(arm_tier, levels = c("Elite (95+)", "Above Avg (90-95)",
"Average (85-90)", "Below Avg (<85)"))
)
# Analyze arm strength vs assists
arm_summary <- arm_data %>%
group_by(arm_tier) %>%
summarize(
players = n(),
avg_arm_strength = mean(max_arm_strength),
avg_assists = mean(assists),
avg_assist_rate = mean(assist_rate),
.groups = "drop"
)
print(arm_summary)
# Visualize relationship between arm strength and assists
ggplot(arm_data, aes(x = max_arm_strength, y = assists)) +
geom_point(aes(color = arm_tier), size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "#2b8cbe", linetype = "dashed") +
scale_color_manual(
values = c("Elite (95+)" = "#d7191c",
"Above Avg (90-95)" = "#fdae61",
"Average (85-90)" = "#abd9e9",
"Below Avg (<85)" = "#2c7bb6")
) +
labs(
title = "Outfield Arm Strength vs Assists",
subtitle = "Stronger arms correlate with more assists, though opportunities matter",
x = "Max Arm Strength (mph)",
y = "Assists",
color = "Arm Tier",
caption = "Data: Simulated for illustration"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
# Box plot of arm strength by tier
ggplot(arm_data, aes(x = arm_tier, y = max_arm_strength, fill = arm_tier)) +
geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
scale_fill_manual(
values = c("Elite (95+)" = "#d7191c",
"Above Avg (90-95)" = "#fdae61",
"Average (85-90)" = "#abd9e9",
"Below Avg (<85)" = "#2c7bb6")
) +
labs(
title = "Distribution of Arm Strength by Tier",
x = NULL,
y = "Max Arm Strength (mph)",
caption = "Data: Simulated for illustration"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
legend.position = "none",
axis.text.x = element_text(angle = 15, hjust = 1)
)
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Set random seed
np.random.seed(123)
sns.set_style("whitegrid")
# Simulate outfield arm data
n_players = 30
arm_data = pd.DataFrame({
'player': [f'Player {i}' for i in range(1, n_players + 1)],
'max_arm_strength': np.random.normal(88, 4, n_players),
'assists': np.random.poisson(5, n_players),
'opportunities': np.random.poisson(25, n_players)
})
# Clip arm strength to realistic range
arm_data['max_arm_strength'] = arm_data['max_arm_strength'].clip(78, 98)
arm_data['assist_rate'] = arm_data['assists'] / arm_data['opportunities']
# Create arm tier categories
def categorize_arm(strength):
    if strength >= 95:
        return 'Elite (95+)'
    elif strength >= 90:
        return 'Above Avg (90-95)'
    elif strength >= 85:
        return 'Average (85-90)'
    else:
        return 'Below Avg (<85)'
arm_data['arm_tier'] = arm_data['max_arm_strength'].apply(categorize_arm)
# Order categories
tier_order = ['Elite (95+)', 'Above Avg (90-95)', 'Average (85-90)', 'Below Avg (<85)']
arm_data['arm_tier'] = pd.Categorical(arm_data['arm_tier'], categories=tier_order, ordered=True)
# Summary statistics
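# observed=False keeps empty tiers and avoids the pandas FutureWarning on categorical groupers
arm_summary = arm_data.groupby('arm_tier', observed=False).agg({
    'player': 'count',
    'max_arm_strength': 'mean',
    'assists': 'mean',
    'assist_rate': 'mean'
}).round(2)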
arm_summary.columns = ['players', 'avg_arm_strength', 'avg_assists', 'avg_assist_rate']
print("\nArm Strength Summary by Tier:")
print(arm_summary)
# Visualize arm strength vs assists
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Scatter plot with regression
colors = {'Elite (95+)': '#d7191c', 'Above Avg (90-95)': '#fdae61',
'Average (85-90)': '#abd9e9', 'Below Avg (<85)': '#2c7bb6'}
for tier in tier_order:
    tier_data = arm_data[arm_data['arm_tier'] == tier]
    ax1.scatter(tier_data['max_arm_strength'], tier_data['assists'],
                label=tier, color=colors[tier], s=80, alpha=0.7)
# Add regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(
arm_data['max_arm_strength'], arm_data['assists'])
x_line = np.array([arm_data['max_arm_strength'].min(), arm_data['max_arm_strength'].max()])
y_line = slope * x_line + intercept
ax1.plot(x_line, y_line, 'b--', alpha=0.5, linewidth=2,
label=f'Regression (R² = {r_value**2:.3f})')
ax1.set_xlabel('Max Arm Strength (mph)', fontsize=12)
ax1.set_ylabel('Assists', fontsize=12)
ax1.set_title('Outfield Arm Strength vs Assists\nStronger arms correlate with more assists',
fontsize=13, fontweight='bold')
ax1.legend(loc='upper left', fontsize=9)
ax1.grid(True, alpha=0.3)
# Box plot of arm strength by tier
positions = range(len(tier_order))
bp = ax2.boxplot([arm_data[arm_data['arm_tier'] == tier]['max_arm_strength'].values
for tier in tier_order],
positions=positions,
labels=tier_order,
patch_artist=True,
widths=0.6)
# Color the boxes
for patch, tier in zip(bp['boxes'], tier_order):
    patch.set_facecolor(colors[tier])
    patch.set_alpha(0.7)
ax2.set_ylabel('Max Arm Strength (mph)', fontsize=12)
ax2.set_title('Distribution of Arm Strength by Tier',
fontsize=13, fontweight='bold')
ax2.tick_params(axis='x', rotation=15)
ax2.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated for illustration',
ha='right', fontsize=9, style='italic')
plt.show()
8.4.2 Infield Arm Considerations {#infield-arm}
Infield arms work differently than outfield arms. The premium is on quick release and accuracy rather than pure velocity, though elite infielders combine all three.
Position-specific arm requirements:
Third base: Longest throw across the diamond (127 feet to first). Needs elite arm strength and quick release. Third basemen routinely make 85-90 mph throws.
Shortstop: Slightly shorter throw than 3B (105-120 feet depending on positioning). Must combine arm strength with the ability to throw from multiple arm angles and off-balance positions.
Second base: Shortest throws (90-100 feet). Quick release matters more than raw strength. However, second basemen turning double plays need enough arm to complete the relay throw.
First base: Rarely makes long throws. Arm strength is least important at this position.
The exchange time (from catch to release) is often more important than velocity for infielders. A shortstop with a 0.6-second exchange and an 85 mph throw will get more outs than one with a 0.8-second exchange and an 88 mph throw: over a throw of roughly 120 feet, the extra 3 mph saves only a few hundredths of a second in flight, far less than the 0.2 seconds lost on the exchange.
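The arithmetic behind that comparison can be checked directly. The sketch below assumes a 120-foot throw (the upper end of the shortstop range given above) and treats the ball as holding its release velocity all the way to first base, which ignores drag and arc. The function name `time_to_first` is illustrative, not part of any tracking system.

```python
MPH_TO_FPS = 5280 / 3600  # feet per second in one mile per hour


def time_to_first(exchange_s, throw_mph, distance_ft=120.0):
    """Seconds from catch to ball arrival, assuming constant ball speed."""
    flight_s = distance_ft / (throw_mph * MPH_TO_FPS)
    return exchange_s + flight_s


quick_release = time_to_first(exchange_s=0.6, throw_mph=85)
strong_arm = time_to_first(exchange_s=0.8, throw_mph=88)

print(f"Quick release (0.6 s, 85 mph): {quick_release:.2f} s")
print(f"Strong arm    (0.8 s, 88 mph): {strong_arm:.2f} s")
print(f"Quick release arrives {strong_arm - quick_release:.2f} s sooner")
```

Under these assumptions the extra 3 mph buys back only about 0.03 seconds of the 0.2-second exchange deficit, so the quicker release still wins by roughly 0.17 seconds.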
Catchers are baseball's most complex defensive position. They handle every pitch, frame borderline calls, block balls in the dirt, and control the running game. Modern analytics has revealed that elite catchers provide enormous value—sometimes 20-30 runs per season—that was completely invisible in traditional statistics.
8.5.1 Framing {#framing}
Pitch framing is the art of receiving pitches to maximize the likelihood that borderline pitches are called strikes. Elite framers can "steal" two to three extra strikes per game, which adds up to roughly 10-15 runs over a full season.
How framing works:
- Subtle glove movement to bring pitches toward the strike zone
- "Sticking" pitches rather than stabbing at them
- Quiet receiving without excessive movement
- Understanding each umpire's zone tendencies
Framing value calculation:
- Determine the expected called strike probability for each pitch based on location
- Compare actual calls to expected calls
- Convert extra strikes to runs using run expectancy framework
- Sum over all pitches caught
Top framers (2024):
Tyler Stephenson (Cincinnati):
- Framing runs: +11
- Extra strikes per game: ~2.5
- Strike rate on edge pitches: 55% (league avg: 50%)
Salvador Perez (Kansas City):
- Framing runs: +9
- Extra strikes per game: ~2.2
- Career framing: +45 runs (2015-2024)
How much is framing worth? A catcher who provides +15 framing runs adds about 1.5 wins above average through receiving alone (at roughly 10 runs per win), roughly equivalent to a hitter providing a .280/.340/.450 slash line. This skill went completely unmeasured before pitch-tracking technology.
R Implementation:
library(tidyverse)
# Simulated framing data
set.seed(456)
# Generate pitch locations and outcomes
framing_data <- data.frame(
catcher = rep(c("Elite Framer", "Average Framer", "Poor Framer"), each = 1000),
pitch_id = 1:3000
) %>%
mutate(
# Distance from zone edge (0 = edge, negative = inside zone, positive = outside)
edge_distance = rnorm(3000, 0, 0.3),
# Expected strike probability: 50% at the edge, rising inside the zone
expected_strike_prob = plogis(-6 * edge_distance),
# Actual call depends on catcher skill
framing_adjustment = case_when(
catcher == "Elite Framer" ~ 0.15,
catcher == "Average Framer" ~ 0.0,
catcher == "Poor Framer" ~ -0.12
),
actual_strike_prob = pmin(0.95, pmax(0.05,
expected_strike_prob + framing_adjustment)),
called_strike = rbinom(3000, 1, actual_strike_prob),
# Each strike worth ~0.13 runs
runs_value = (called_strike - expected_strike_prob) * 0.13
)
# Summarize by catcher
catcher_framing <- framing_data %>%
group_by(catcher) %>%
summarize(
pitches = n(),
strikes = sum(called_strike),
expected_strikes = sum(expected_strike_prob),
extra_strikes = strikes - expected_strikes,
framing_runs = sum(runs_value),
strikes_per_game = (extra_strikes / pitches) * 140, # ~140 pitches/game
.groups = "drop"
) %>%
arrange(desc(framing_runs))
print(catcher_framing)
# Visualize framing runs
ggplot(catcher_framing, aes(x = reorder(catcher, framing_runs), y = framing_runs,
fill = framing_runs)) +
geom_col(width = 0.7) +
geom_text(aes(label = sprintf("%+.1f", framing_runs),
hjust = ifelse(framing_runs > 0, -0.2, 1.2)),
size = 5, fontface = "bold") +
scale_fill_gradient2(low = "#d7191c", mid = "gray90", high = "#2c7bb6",
midpoint = 0) +
coord_flip() +
labs(
title = "Catcher Framing Runs Above Average",
subtitle = "Elite framers add 10+ runs per season through receiving",
x = NULL,
y = "Framing Runs Above Average",
caption = "Data: Simulated | Each strike worth ~0.13 runs"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "none",
panel.grid.major.y = element_blank()
)
# Show strike rate by pitch location
ggplot(framing_data, aes(x = edge_distance, y = called_strike, color = catcher)) +
geom_smooth(method = "loess", se = FALSE, linewidth = 1.3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
annotate("text", x = -0.2, y = 0.9, label = "Inside Zone", size = 3) +
annotate("text", x = 0.2, y = 0.9, label = "Outside Zone", size = 3) +
scale_color_manual(values = c("Elite Framer" = "#2c7bb6",
"Average Framer" = "gray50",
"Poor Framer" = "#d7191c")) +
labs(
title = "Called Strike Rate by Pitch Location and Framer Quality",
subtitle = "Elite framers get more strikes on borderline pitches",
x = "Distance from Zone Edge (feet, negative = inside)",
y = "Called Strike Rate",
color = "Catcher Type"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit
# Set random seed
np.random.seed(456)
sns.set_style("whitegrid")
# Simulate framing data
catchers = ['Elite Framer', 'Average Framer', 'Poor Framer']
n_pitches = 1000
framing_data = []
for catcher in catchers:
    # Generate pitch locations relative to zone edge
    # (0 = edge, negative = inside zone, positive = outside)
    edge_distance = np.random.normal(0, 0.3, n_pitches)
    # Expected strike probability: 50% at the edge, rising inside the zone
    expected_strike_prob = expit(-6 * edge_distance)
    # Framing adjustment
    if catcher == 'Elite Framer':
        adjustment = 0.15
    elif catcher == 'Average Framer':
        adjustment = 0.0
    else:
        adjustment = -0.12
    # Actual strike probability with framing
    actual_strike_prob = np.clip(expected_strike_prob + adjustment, 0.05, 0.95)
    # Generate actual calls
    called_strike = np.random.binomial(1, actual_strike_prob)
    # Calculate runs value (each strike worth ~0.13 runs)
    runs_value = (called_strike - expected_strike_prob) * 0.13
    for i in range(n_pitches):
        framing_data.append({
            'catcher': catcher,
            'edge_distance': edge_distance[i],
            'expected_strike_prob': expected_strike_prob[i],
            'actual_strike_prob': actual_strike_prob[i],
            'called_strike': called_strike[i],
            'runs_value': runs_value[i]
        })
framing_df = pd.DataFrame(framing_data)
# Summarize by catcher
catcher_framing = framing_df.groupby('catcher').agg({
'edge_distance': 'count',
'called_strike': 'sum',
'expected_strike_prob': 'sum',
'runs_value': 'sum'
}).round(2)
catcher_framing.columns = ['pitches', 'strikes', 'expected_strikes', 'framing_runs']
catcher_framing['extra_strikes'] = catcher_framing['strikes'] - catcher_framing['expected_strikes']
catcher_framing['strikes_per_game'] = (catcher_framing['extra_strikes'] /
catcher_framing['pitches']) * 140
catcher_framing = catcher_framing.sort_values('framing_runs', ascending=False)
print("\nCatcher Framing Summary:")
print(catcher_framing)
# Visualize framing runs
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Bar chart of framing runs
colors_map = {'Elite Framer': '#2c7bb6', 'Average Framer': 'gray', 'Poor Framer': '#d7191c'}
colors = [colors_map[c] for c in catcher_framing.index]
bars = ax1.barh(range(len(catcher_framing)), catcher_framing['framing_runs'], color=colors, alpha=0.8)
ax1.set_yticks(range(len(catcher_framing)))
ax1.set_yticklabels(catcher_framing.index)
ax1.set_xlabel('Framing Runs Above Average', fontsize=12)
ax1.set_title('Catcher Framing Runs Above Average\nElite framers add 10+ runs per season through receiving',
fontsize=13, fontweight='bold')
ax1.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax1.grid(True, alpha=0.3, axis='x')
# Add value labels
for i, (bar, value) in enumerate(zip(bars, catcher_framing['framing_runs'])):
    x_pos = value + (0.5 if value > 0 else -0.5)
    ax1.text(x_pos, bar.get_y() + bar.get_height()/2, f'{value:+.1f}',
             va='center', ha='left' if value > 0 else 'right', fontweight='bold')
# Strike rate by location
for catcher in catchers:
    catcher_data = framing_df[framing_df['catcher'] == catcher].sort_values('edge_distance')
    # Smooth with rolling average
    window = 100
    x_smooth = catcher_data['edge_distance'].rolling(window, center=True).mean()
    y_smooth = catcher_data['called_strike'].rolling(window, center=True).mean()
    ax2.plot(x_smooth, y_smooth, linewidth=2.5, label=catcher, color=colors_map[catcher])
ax2.axvline(x=0, color='black', linestyle='--', linewidth=1.5, alpha=0.7)
ax2.text(-0.2, 0.9, 'Inside Zone', fontsize=10, ha='center')
ax2.text(0.2, 0.9, 'Outside Zone', fontsize=10, ha='center')
ax2.set_xlabel('Distance from Zone Edge (feet, negative = inside)', fontsize=12)
ax2.set_ylabel('Called Strike Rate', fontsize=12)
ax2.set_title('Called Strike Rate by Pitch Location\nElite framers get more strikes on borderline pitches',
fontsize=13, fontweight='bold')
ax2.legend(loc='lower right')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 1)
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated | Each strike worth ~0.13 runs',
ha='right', fontsize=9, style='italic')
plt.show()
8.5.2 Pop Time {#pop-time}
Pop time measures how quickly a catcher receives a pitch, transfers it, and throws to second base. The clock runs from the moment the pitch hits the catcher's glove to the moment the throw reaches the fielder's glove at second base.
Pop time benchmarks:
- Elite: <1.90 seconds
- Above average: 1.90-1.95 seconds
- Average: 1.95-2.00 seconds
- Below average: 2.00-2.05 seconds
- Poor: >2.05 seconds
Components of pop time:
- Transfer time: From glove to throwing hand (elite: 0.6-0.7 seconds)
- Arm strength: Throwing velocity (elite: 85+ mph to second base)
- Accuracy: On-target throws allow quicker tags
Pop time leaders (2024):
J.T. Realmuto (Philadelphia):
- Average pop time: 1.87 seconds
- Exchange time: 0.64 seconds
- Arm strength: 86.1 mph
- CS%: 38% (elite)
Adley Rutschman (Baltimore):
- Average pop time: 1.94 seconds
- Exchange time: 0.68 seconds
- Arm strength: 83.7 mph
- CS%: 32% (above average)
Salvador Perez (Kansas City):
- Average pop time: 1.98 seconds
- Exchange time: 0.71 seconds
- Arm strength: 82.4 mph
- CS%: 26% (average)
Why pop time matters: A runner stealing second base covers 90 feet. At typical sprint speed (28 ft/s), this takes about 3.2 seconds from first movement. The pitcher's delivery adds ~1.3-1.5 seconds (from leg lift to ball reaching catcher). This leaves roughly 1.7-1.9 seconds for the catcher to throw out the runner. A catcher with 1.85-second pop time has a realistic chance; one with 2.10-second pop time has almost none.
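The timing budget above is easy to reproduce. The following sketch uses the same simplified model as the paragraph: the runner covers the full 90 feet at a constant 28 ft/s, with no allowance for a lead, acceleration, or tag time. The function name `catcher_window` is a hypothetical helper introduced only for illustration.

```python
def catcher_window(delivery_s, distance_ft=90.0, sprint_fps=28.0):
    """Seconds available to the catcher once the pitch arrives."""
    runner_time_s = distance_ft / sprint_fps
    return runner_time_s - delivery_s


# Compare a quick and a slow delivery against two pop times from the text
for delivery_s in (1.3, 1.5):
    window_s = catcher_window(delivery_s)
    for pop_time_s in (1.85, 2.10):
        margin_s = window_s - pop_time_s
        outcome = "throw beats runner" if margin_s > 0 else "runner beats throw"
        print(f"delivery {delivery_s:.1f} s, pop {pop_time_s:.2f} s: "
              f"margin {margin_s:+.2f} s ({outcome})")
```

Under these assumptions the 1.85-second catcher beats the runner only when the pitcher is quick to the plate, and the 2.10-second catcher loses in both cases.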
8.5.3 Blocking {#blocking}
Blocking measures a catcher's ability to prevent wild pitches and passed balls on pitches in the dirt. While less valuable than framing (fewer opportunities, smaller run value per event), good blocking still saves 3-5 runs per season.
Blocking metrics:
- Block rate: Percentage of pitches in the dirt that are blocked (average: ~85%)
- Blocking runs: Runs saved by preventing wild pitches/passed balls
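As a rough illustration of how these two metrics relate, the sketch below compares each catcher's blocks with what a league-average blocker (85%, as quoted above) would record on the same chances. The season lines are hypothetical, and the run cost of 0.27 per unblocked pitch is an assumed, illustrative value rather than a figure from this chapter.

```python
LEAGUE_BLOCK_RATE = 0.85       # average block rate quoted in the text
RUNS_PER_MISSED_BLOCK = 0.27   # assumed run cost of a wild pitch or passed ball


def blocking_runs(blocked, chances,
                  league_rate=LEAGUE_BLOCK_RATE,
                  run_value=RUNS_PER_MISSED_BLOCK):
    """Runs saved relative to an average blocker on the same chances."""
    expected_blocks = chances * league_rate
    return (blocked - expected_blocks) * run_value


# Hypothetical season lines: (pitches in the dirt, pitches blocked)
season_lines = {
    "Catcher A": (400, 352),
    "Catcher B": (400, 340),
    "Catcher C": (400, 322),
}

for name, (chances, blocked) in season_lines.items():
    rate = blocked / chances
    runs = blocking_runs(blocked, chances)
    print(f"{name}: block rate {rate:.1%}, blocking runs {runs:+.1f}")
```

With these made-up inputs, a three-point edge in block rate is worth about three runs over 400 chances, which is in line with the small seasonal totals typical of blocking.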
What makes good blocking:
- Quick drop to knees
- Wide base to cover maximum area
- Proper glove position (down, creating a backstop)
- Anticipation of breaking balls in the dirt
Top blockers (2024 estimated):
- Willson Contreras: +3 blocking runs
- Tyler Stephenson: +2 blocking runs
- Sean Murphy: +2 blocking runs
Poor blockers can cost their team 5+ runs through passed balls that advance runners or allow scoring.
library(tidyverse)
# Simulated framing data
set.seed(456)
# Generate pitch locations and outcomes
framing_data <- data.frame(
catcher = rep(c("Elite Framer", "Average Framer", "Poor Framer"), each = 1000),
pitch_id = 1:3000
) %>%
mutate(
# Distance from zone edge (0 = edge, negative = inside zone, positive = outside)
edge_distance = rnorm(3000, 0, 0.3),
# Expected strike probability based on location
expected_strike_prob = plogis(1.5 - 6 * abs(edge_distance)),
# Actual call depends on catcher skill
framing_adjustment = case_when(
catcher == "Elite Framer" ~ 0.15,
catcher == "Average Framer" ~ 0.0,
catcher == "Poor Framer" ~ -0.12
),
actual_strike_prob = pmin(0.95, pmax(0.05,
expected_strike_prob + framing_adjustment)),
called_strike = rbinom(3000, 1, actual_strike_prob),
# Each strike worth ~0.13 runs
runs_value = (called_strike - expected_strike_prob) * 0.13
)
# Summarize by catcher
catcher_framing <- framing_data %>%
group_by(catcher) %>%
summarize(
pitches = n(),
strikes = sum(called_strike),
expected_strikes = sum(expected_strike_prob),
extra_strikes = strikes - expected_strikes,
framing_runs = sum(runs_value),
strikes_per_game = (extra_strikes / pitches) * 140, # ~140 pitches/game
.groups = "drop"
) %>%
arrange(desc(framing_runs))
print(catcher_framing)
# Visualize framing runs
ggplot(catcher_framing, aes(x = reorder(catcher, framing_runs), y = framing_runs,
fill = framing_runs)) +
geom_col(width = 0.7) +
geom_text(aes(label = sprintf("+%.1f", framing_runs)),
hjust = ifelse(catcher_framing$framing_runs > 0, -0.2, 1.2),
size = 5, fontface = "bold") +
scale_fill_gradient2(low = "#d7191c", mid = "gray90", high = "#2c7bb6",
midpoint = 0) +
coord_flip() +
labs(
title = "Catcher Framing Runs Above Average",
subtitle = "Elite framers add 10+ runs per season through receiving",
x = NULL,
y = "Framing Runs Above Average",
caption = "Data: Simulated | Each strike worth ~0.13 runs"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "none",
panel.grid.major.y = element_blank()
)
# Show strike rate by pitch location
ggplot(framing_data, aes(x = edge_distance, y = called_strike, color = catcher)) +
geom_smooth(method = "loess", se = FALSE, size = 1.3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
annotate("text", x = -0.2, y = 0.9, label = "Inside Zone", size = 3) +
annotate("text", x = 0.2, y = 0.9, label = "Outside Zone", size = 3) +
scale_color_manual(values = c("Elite Framer" = "#2c7bb6",
"Average Framer" = "gray50",
"Poor Framer" = "#d7191c")) +
labs(
title = "Called Strike Rate by Pitch Location and Framer Quality",
subtitle = "Elite framers get more strikes on borderline pitches",
x = "Distance from Zone Edge (feet, negative = inside)",
y = "Called Strike Rate",
color = "Catcher Type"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit
# Set random seed
np.random.seed(456)
sns.set_style("whitegrid")
# Simulate framing data
catchers = ['Elite Framer', 'Average Framer', 'Poor Framer']
n_pitches = 1000
framing_data = []
for catcher in catchers:
# Generate pitch locations relative to zone edge
edge_distance = np.random.normal(0, 0.3, n_pitches)
# Expected strike probability based on location
expected_strike_prob = expit(1.5 - 6 * np.abs(edge_distance))
# Framing adjustment
if catcher == 'Elite Framer':
adjustment = 0.15
elif catcher == 'Average Framer':
adjustment = 0.0
else:
adjustment = -0.12
# Actual strike probability with framing
actual_strike_prob = np.clip(expected_strike_prob + adjustment, 0.05, 0.95)
# Generate actual calls
called_strike = np.random.binomial(1, actual_strike_prob)
# Calculate runs value (each strike worth ~0.13 runs)
runs_value = (called_strike - expected_strike_prob) * 0.13
for i in range(n_pitches):
framing_data.append({
'catcher': catcher,
'edge_distance': edge_distance[i],
'expected_strike_prob': expected_strike_prob[i],
'actual_strike_prob': actual_strike_prob[i],
'called_strike': called_strike[i],
'runs_value': runs_value[i]
})
framing_df = pd.DataFrame(framing_data)
# Summarize by catcher
catcher_framing = framing_df.groupby('catcher').agg({
'edge_distance': 'count',
'called_strike': 'sum',
'expected_strike_prob': 'sum',
'runs_value': 'sum'
}).round(2)
catcher_framing.columns = ['pitches', 'strikes', 'expected_strikes', 'framing_runs']
catcher_framing['extra_strikes'] = catcher_framing['strikes'] - catcher_framing['expected_strikes']
catcher_framing['strikes_per_game'] = (catcher_framing['extra_strikes'] /
catcher_framing['pitches']) * 140
catcher_framing = catcher_framing.sort_values('framing_runs', ascending=False)
print("\nCatcher Framing Summary:")
print(catcher_framing)
# Visualize framing runs
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Bar chart of framing runs
colors_map = {'Elite Framer': '#2c7bb6', 'Average Framer': 'gray', 'Poor Framer': '#d7191c'}
colors = [colors_map[c] for c in catcher_framing.index]
bars = ax1.barh(range(len(catcher_framing)), catcher_framing['framing_runs'], color=colors, alpha=0.8)
ax1.set_yticks(range(len(catcher_framing)))
ax1.set_yticklabels(catcher_framing.index)
ax1.set_xlabel('Framing Runs Above Average', fontsize=12)
ax1.set_title('Catcher Framing Runs Above Average\nElite framers add 10+ runs per season through receiving',
fontsize=13, fontweight='bold')
ax1.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax1.grid(True, alpha=0.3, axis='x')
# Add value labels
for i, (bar, value) in enumerate(zip(bars, catcher_framing['framing_runs'])):
x_pos = value + (0.5 if value > 0 else -0.5)
ax1.text(x_pos, bar.get_y() + bar.get_height()/2, f'{value:+.1f}',
va='center', ha='left' if value > 0 else 'right', fontweight='bold')
# Strike rate by location
for catcher in catchers:
    catcher_data = framing_df[framing_df['catcher'] == catcher].sort_values('edge_distance')
    # Smooth with rolling average
    window = 100
    x_smooth = catcher_data['edge_distance'].rolling(window, center=True).mean()
    y_smooth = catcher_data['called_strike'].rolling(window, center=True).mean()
    ax2.plot(x_smooth, y_smooth, linewidth=2.5, label=catcher, color=colors_map[catcher])
ax2.axvline(x=0, color='black', linestyle='--', linewidth=1.5, alpha=0.7)
ax2.text(-0.2, 0.9, 'Inside Zone', fontsize=10, ha='center')
ax2.text(0.2, 0.9, 'Outside Zone', fontsize=10, ha='center')
ax2.set_xlabel('Distance from Zone Edge (feet, negative = inside)', fontsize=12)
ax2.set_ylabel('Called Strike Rate', fontsize=12)
ax2.set_title('Called Strike Rate by Pitch Location\nElite framers get more strikes on borderline pitches',
fontsize=13, fontweight='bold')
ax2.legend(loc='lower right')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 1)
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated | Each strike worth ~0.13 runs',
ha='right', fontsize=9, style='italic')
plt.show()
Before Statcast, two metrics dominated defensive evaluation: Ultimate Zone Rating (UZR) and Defensive Runs Saved (DRS). Both remain widely used and valuable, especially for historical comparisons.
8.6.1 Ultimate Zone Rating (UZR) {#uzr}
UZR, developed by Mitchel Lichtman and published by FanGraphs, divides the field into zones and credits fielders for plays made within those zones relative to league average.
UZR components:
- Range runs: Value from getting to balls (most important component)
- Error runs: Cost of errors made
- Double play runs: Value from turning double plays (infielders only)
- Arm runs: Value from assists and preventing advancement (outfielders)
Calculation approach:
- Classify each batted ball by type (GB, FB, LD), location, and velocity
- Determine league-average out rate for similar balls
- Credit fielder for outs above/below expectation
- Convert to runs using run expectancy
- Sum over full season
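The calculation steps above can be sketched in a few lines of Python. This is a simplified illustration, not the actual UZR implementation: the bucket labels, the league out rates in `LEAGUE_OUT_RATE`, and the flat `RUNS_PER_PLAY` conversion are hypothetical placeholders chosen only to show the bookkeeping.

```python
# Hypothetical league-average out rates per (ball type, zone, speed) bucket.
LEAGUE_OUT_RATE = {
    ("GB", "SS-hole", "hard"): 0.35,
    ("GB", "SS-routine", "medium"): 0.92,
    ("GB", "up-the-middle", "medium"): 0.55,
}

# Assumed average run value of turning a would-be hit into an out.
RUNS_PER_PLAY = 0.75


def range_runs(chances):
    """Sum (actual out - expected out) over chances, converted to runs.

    `chances` is an iterable of (bucket, made_play) pairs, where `bucket`
    is a key of LEAGUE_OUT_RATE and `made_play` is True or False.
    """
    plays_above_avg = 0.0
    for bucket, made_play in chances:
        expected = LEAGUE_OUT_RATE[bucket]
        plays_above_avg += (1.0 if made_play else 0.0) - expected
    return plays_above_avg * RUNS_PER_PLAY


# Three hypothetical chances for one shortstop.
season = [
    (("GB", "SS-hole", "hard"), True),
    (("GB", "SS-routine", "medium"), True),
    (("GB", "up-the-middle", "medium"), False),
]
print(f"Range runs: {range_runs(season):+.2f}")
```

A difficult play made earns far more credit (1 - 0.35) than a routine one (1 - 0.92), which is how zone-based metrics separate range from sure-handedness. A real system would use situation-specific run values rather than a single constant.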
UZR interpretation:
- +15: Elite (Gold Glove caliber)
- +5 to +15: Above average
- -5 to +5: Average
- -5 to -15: Below average
- <-15: Poor
UZR strengths:
- Long track record (data back to 2002)
- Publicly available at FanGraphs
- Breaks down into understandable components
- Generally stable year-to-year for true talent
UZR limitations:
- Relies on STATS Inc. zone data (less precise than Statcast)
- Doesn't account for exact positioning
- Sample size issues persist (single-season values are noisy; roughly three seasons of data are usually recommended before drawing firm conclusions)
- Can be affected by team defensive shifts
8.6.2 Defensive Runs Saved (DRS) {#drs}
DRS, developed by John Dewan and published by Sports Info Solutions, uses a similar zone-based approach but with some methodological differences.
DRS components:
- Plus/Minus runs: Range and fielding (similar to UZR's range runs)
- Outfield Arm runs: Assists and baserunner kills
- Double Play runs: Value from turning DPs
- Bunt defense runs: Infielder ability on bunts
- Good Play/Misplay runs: Subjective evaluation of technique
DRS vs UZR differences:
- DRS includes some subjective "good play/misplay" evaluations
- DRS and UZR often agree directionally but differ in magnitude
- DRS has separate metrics for specific skills (bunt defense)
- Both use proprietary data, limiting replication
DRS interpretation (same scale as UZR):
- +15: Elite
- +5 to +15: Above average
- -5 to +5: Average
- -5 to -15: Below average
- <-15: Poor
8.6.3 Comparison Table of Metrics {#metrics-comparison}
| Metric | Data Source | Availability | Strengths | Weaknesses | Best Use |
|---|---|---|---|---|---|
| OAA | Statcast tracking | 2016-present | Most precise, objective | Limited history, small samples | Modern player evaluation |
| UZR | STATS zones | 2002-present | Long history, public | Zone-based approximation | Historical comparisons |
| DRS | SIS zones | 2003-present | Detailed components | Some subjectivity | Comprehensive evaluation |
| Fielding % | Official stats | 1876-present | Long history, simple | Ignores range completely | Only for historical context |
Correlation between metrics (typical season):
- OAA vs UZR: r ≈ 0.70
- OAA vs DRS: r ≈ 0.72
- UZR vs DRS: r ≈ 0.85
The metrics generally agree on who are elite/poor defenders but differ on exact magnitudes. For modern players (2016+), prefer OAA. For historical analysis or when OAA isn't available, use UZR or DRS.
Where fielders stand before the pitch dramatically affects defensive outcomes. The shift era revolutionized this understanding, and rule changes in 2023 created a natural experiment in positioning's impact.
8.7.1 The Shift Era (2010-2022) {#shift-era}
What was "the shift"? In traditional defensive alignment, two infielders play on each side of second base. The shift involved placing three infielders on one side—typically against left-handed pull hitters who hit ground balls to the right side.
Evolution of shift usage:
- 2010: ~2% of plate appearances
- 2015: ~13% of plate appearances
- 2020: ~30% of plate appearances
- 2022: ~35% of plate appearances
Why did shifting increase?
- Data availability: Spray charts showed extreme pull tendencies
- Analytical acceptance: Teams trusted data over tradition
- Proven effectiveness: Shifts reduced BABIP by 20-30 points for shifted batters
- Competitive pressure: If opponents shifted, you had to also
Real examples of shift effectiveness:
Kyle Seager (2022, LHH):
- PA faced with shift: 482 (79% of PA)
- BABIP with shift: .254
- BABIP without shift: .312
- Estimated hits prevented: ~18
Brian Anderson (2022, RHH):
- PA faced with shift: 301 (56% of PA)
- BABIP with shift: .241
- BABIP without shift: .289
- Estimated hits prevented: ~10
Teams collectively prevented an estimated 2,000-3,000 hits per season through shifting in the peak years (2020-2022).
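The per-batter "hits prevented" estimates above can be approximated as the BABIP gap multiplied by the number of balls in play against the shift. The sketch below assumes that about 65% of plate appearances end in a ball in play; that `bip_share` value is an assumption for illustration, not a figure from the data above.

```python
def hits_prevented(pa_shifted, babip_no_shift, babip_shift, bip_share=0.65):
    """Estimate hits removed by the shift.

    bip_share is an ASSUMED fraction of plate appearances that end in a
    ball in play; adjust it for the batter in question.
    """
    balls_in_play = pa_shifted * bip_share
    return (babip_no_shift - babip_shift) * balls_in_play


seager = hits_prevented(482, 0.312, 0.254)
anderson = hits_prevented(301, 0.289, 0.241)
print(f"Seager:   ~{seager:.1f} hits")
print(f"Anderson: ~{anderson:.1f} hits")
```

Under that assumption both results land close to the estimates quoted above. The answer is sensitive to `bip_share`, so a high-strikeout hitter would need a lower value.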
8.7.2 Post-Shift Rules (2023+) {#post-shift-rules}
Starting in 2023, MLB banned most shifts by requiring:
- Two infielders on each side of second base when the pitch is delivered
- Four infielders on the infield dirt (no more infielders stationed in the shallow outfield grass)
- Feet must be touching dirt when pitch is released
Impact of shift ban (2023 vs 2022):
League-wide batting average:
- 2022: .243
- 2023: .248 (+5 points)
Ground ball hit rate (left-handed batters):
- 2022: .235
- 2023: .245 (+10 points, +4.3%)
Biggest beneficiaries (2023 improvements):
- Pull-heavy left-handed hitters: +.015 AVG
- Extreme ground ball hitters: +.020 AVG
- Power hitters who grounded out frequently: +.012 AVG
Strategic adjustments:
Teams now optimize positioning within legal constraints:
- Deeper/shallower positioning by batter tendency
- Shade toward pull side while keeping two on each side
- More aggressive outfield positioning (can still shift OF)
Let's analyze the shift ban's impact using simulated data:
R Implementation:
library(tidyverse)
# Simulate 2022 vs 2023 ground ball outcomes for left-handed batters
set.seed(789)
shift_comparison <- data.frame(
year = rep(c(2022, 2023), each = 5000),
batter_id = rep(1:5000, 2)
) %>%
mutate(
# 2022: Heavy shift usage
shift_rate = ifelse(year == 2022, 0.75, 0.0),
was_shifted = rbinom(n(), 1, shift_rate),
# Ground ball hit rate depends on shift
base_hit_rate = rnorm(n(), 0.240, 0.05),
shift_penalty = ifelse(was_shifted == 1, -0.03, 0),
year_boost = ifelse(year == 2023, 0.01, 0), # Overall rule changes
hit_rate = pmax(0.1, pmin(0.4, base_hit_rate + shift_penalty + year_boost)),
was_hit = rbinom(n(), 1, hit_rate)
)
# Compare years
year_summary <- shift_comparison %>%
group_by(year) %>%
summarize(
ground_balls = n(),
hits = sum(was_hit),
hit_rate = hits / ground_balls,
avg_shift_rate = mean(shift_rate),
.groups = "drop"
)
print(year_summary)
# Statistical test
gb_2022 <- shift_comparison %>% filter(year == 2022)
gb_2023 <- shift_comparison %>% filter(year == 2023)
test_result <- prop.test(
x = c(sum(gb_2022$was_hit), sum(gb_2023$was_hit)),
n = c(nrow(gb_2022), nrow(gb_2023))
)
cat("\nProportion test for hit rate difference:\n")
print(test_result)
# Visualize shift impact
ggplot(year_summary, aes(x = factor(year), y = hit_rate, fill = factor(year))) +
geom_col(width = 0.6) +
  # Format as ".218" (drop the leading zero) instead of ".0.218"
  geom_text(aes(label = sub("^0", "", sprintf("%.3f", hit_rate))),
            vjust = -0.5, size = 6, fontface = "bold") +
scale_y_continuous(labels = scales::number_format(accuracy = 0.001),
limits = c(0, 0.30)) +
scale_fill_manual(values = c("2022" = "#E03A3E", "2023" = "#4A90E2")) +
labs(
title = "Ground Ball Hit Rate: Left-Handed Batters",
subtitle = "Comparing 2022 (Shifts Allowed) vs 2023 (Shifts Banned)",
x = "Season",
y = "Ground Ball Hit Rate",
caption = "Data: Simulated based on actual MLB trends\nP-value < 0.001"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
legend.position = "none",
panel.grid.major.x = element_blank()
)
# Show shift usage decline
shift_usage <- data.frame(
year = 2015:2023,
shift_pct = c(0.13, 0.17, 0.21, 0.26, 0.28, 0.31, 0.33, 0.35, 0.01)
)
ggplot(shift_usage, aes(x = year, y = shift_pct)) +
geom_line(size = 1.5, color = "#2b8cbe") +
geom_point(size = 4, color = "#08519c") +
geom_vline(xintercept = 2022.5, linetype = "dashed", color = "red", size = 1) +
annotate("text", x = 2020, y = 0.32, label = "Shift Era", size = 5, fontface = "bold") +
annotate("text", x = 2023, y = 0.15, label = "Ban Implemented",
size = 4, color = "red") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "The Rise and Fall of Defensive Shifting",
subtitle = "Percentage of plate appearances with defensive shift",
x = "Year",
y = "Shift Usage Rate",
caption = "Data: MLB.com"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
panel.grid.minor = element_blank()
)
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# proportions_ztest lives in statsmodels, not scipy.stats
from statsmodels.stats.proportion import proportions_ztest
# Set random seed
np.random.seed(789)
sns.set_style("whitegrid")
# Simulate 2022 vs 2023 ground ball outcomes
years = [2022, 2023]
n_batters = 5000
shift_data = []
for year in years:
    shift_rate = 0.75 if year == 2022 else 0.0
    year_boost = 0.01 if year == 2023 else 0.0
    for batter in range(n_batters):
        was_shifted = np.random.binomial(1, shift_rate)
        base_hit_rate = np.random.normal(0.240, 0.05)
        shift_penalty = -0.03 if was_shifted else 0
        hit_rate = np.clip(base_hit_rate + shift_penalty + year_boost, 0.1, 0.4)
        was_hit = np.random.binomial(1, hit_rate)
        shift_data.append({
            'year': year,
            'was_shifted': was_shifted,
            'hit_rate': hit_rate,
            'was_hit': was_hit
        })
shift_df = pd.DataFrame(shift_data)
# Compare years
year_summary = shift_df.groupby('year').agg({
'was_hit': ['count', 'sum', 'mean'],
'was_shifted': 'mean'
}).round(4)
year_summary.columns = ['ground_balls', 'hits', 'hit_rate', 'avg_shift_rate']
print("\nYearly Comparison:")
print(year_summary)
# Statistical test
gb_2022 = shift_df[shift_df['year'] == 2022]['was_hit']
gb_2023 = shift_df[shift_df['year'] == 2023]['was_hit']
count = np.array([gb_2022.sum(), gb_2023.sum()])
nobs = np.array([len(gb_2022), len(gb_2023)])
stat, pval = proportions_ztest(count, nobs)
print(f"\nProportions Z-test:")
print(f"Z-statistic: {stat:.4f}")
print(f"P-value: {pval:.4e}")
# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Bar chart comparing hit rates
years_plot = year_summary.index.astype(str)
colors = ['#E03A3E', '#4A90E2']
bars = ax1.bar(years_plot, year_summary['hit_rate'], color=colors, width=0.6, alpha=0.8)
for bar, rate in zip(bars, year_summary['hit_rate']):
    height = bar.get_height()
    # Format as ".218" (drop the leading zero) instead of ".0.218"
    ax1.text(bar.get_x() + bar.get_width()/2., height,
             f'{rate:.3f}'.lstrip('0'),
             ha='center', va='bottom', fontsize=14, fontweight='bold')
ax1.set_ylabel('Ground Ball Hit Rate', fontsize=12)
ax1.set_xlabel('Season', fontsize=12)
ax1.set_title('Ground Ball Hit Rate: Left-Handed Batters\nComparing 2022 (Shifts Allowed) vs 2023 (Shifts Banned)',
fontsize=13, fontweight='bold', pad=15)
ax1.set_ylim(0, 0.30)
ax1.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.3f}'))
# Line chart showing shift usage over time
shift_usage = pd.DataFrame({
'year': range(2015, 2024),
'shift_pct': [0.13, 0.17, 0.21, 0.26, 0.28, 0.31, 0.33, 0.35, 0.01]
})
ax2.plot(shift_usage['year'], shift_usage['shift_pct'],
linewidth=2.5, color='#2b8cbe', marker='o', markersize=8, markerfacecolor='#08519c')
ax2.axvline(x=2022.5, color='red', linestyle='--', linewidth=2, alpha=0.7)
ax2.text(2020, 0.32, 'Shift Era', fontsize=12, fontweight='bold', ha='center')
ax2.text(2023, 0.15, 'Ban Implemented', fontsize=10, color='red', ha='center')
ax2.set_xlabel('Year', fontsize=12)
ax2.set_ylabel('Shift Usage Rate', fontsize=12)
ax2.set_title('The Rise and Fall of Defensive Shifting\nPercentage of plate appearances with defensive shift',
fontsize=13, fontweight='bold', pad=15)
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated based on actual MLB trends | P-value < 0.001',
ha='right', fontsize=9, style='italic')
plt.show()
8.7.3 Optimal Positioning Analysis {#optimal-positioning}
Even without extreme shifts, positioning matters. Studies show that optimal positioning (within the new rules) can save 10-15 runs per team per season compared to traditional positioning.
Positioning optimization factors:
- Batter spray tendencies: Pull%, opposite field%
- Pitch type: Fastballs pulled more than off-speed
- Count: Behind in count = more pull
- Ballpark: Dimensions affect optimal positioning
- Score/situation: Late innings, close games affect approach
Modern teams use algorithmic positioning that updates based on pitch type and count, maximizing defensive coverage within legal constraints.
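A minimal sketch of constrained positioning, assuming a simulated spray distribution and a fixed lateral reach per fielder. Every number here (the spray parameters, `REACH_DEG`, the traditional spot at -20 degrees) is hypothetical, and real systems model depth, pitch type, and count as well. The idea is simply to search the legal region for the spot that covers the most ground balls.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground-ball spray angles in degrees for a hypothetical
# pull-heavy left-handed batter: 0 = second base, +45 = first-base line,
# -45 = third-base line. The distribution parameters are made up.
spray = rng.normal(loc=18.0, scale=14.0, size=5000)
spray = spray[(spray > -45) & (spray < 45)]

REACH_DEG = 7.0  # assumed lateral coverage on either side of a fielder


def coverage(position_deg, angles, reach=REACH_DEG):
    """Share of ground balls within `reach` degrees of the fielder."""
    return np.mean(np.abs(angles - position_deg) <= reach)


def best_legal_position(angles, lo, hi, step=0.5):
    """Grid-search the angle in [lo, hi] that maximizes coverage."""
    grid = np.arange(lo, hi + step, step)
    scores = np.array([coverage(g, angles) for g in grid])
    best = int(np.argmax(scores))
    return grid[best], scores[best]


# The shortstop must stay on the third-base side of second (angle <= 0).
traditional = coverage(-20.0, spray)
pos, shaded = best_legal_position(spray, lo=-40.0, hi=0.0)
print(f"Traditional SS at -20 deg covers {traditional:.1%}")
print(f"Best legal SS spot {pos:+.1f} deg covers {shaded:.1%}")
```

Against a pull-heavy batter the search pushes the shortstop up against the legal boundary at second base, which mirrors the "shade toward the pull side" adjustment described above.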
Baserunning is the forgotten skill of baseball analytics. Unlike hitting or fielding, baserunning provides smaller but consistent value—elite baserunners are worth 5-10 runs per season, which equals 0.5-1.0 WAR.
8.8.1 Sprint Speed {#sprint-speed}
Sprint speed measures a player's maximum running velocity, expressed in feet per second (ft/s). Statcast calculates it using the player's fastest one-second window during competitive plays.
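The "fastest one-second window" idea can be illustrated with a short sketch. It assumes tracking data arrives as cumulative distance sampled at a fixed rate; the 10 Hz rate and the synthetic run below are assumptions, and the sketch handles a single run only, not how runs are selected or aggregated into a season figure.

```python
import numpy as np


def fastest_one_second_window(cum_distance_ft, sample_hz):
    """Max distance (ft) covered in any 1-second window, i.e. speed in ft/s.

    cum_distance_ft: cumulative distance run at each tracking sample.
    sample_hz: samples per second (an assumed, fixed tracking rate).
    """
    d = np.asarray(cum_distance_ft, dtype=float)
    if len(d) <= sample_hz:
        raise ValueError("need more than one second of samples")
    # Distance covered between each sample and the one a second later.
    window_dist = d[sample_hz:] - d[:-sample_hz]
    return float(window_dist.max())


# Hypothetical run sampled at 10 Hz: accelerate for 2 s, then hold 29 ft/s.
hz = 10
t = np.arange(0, 4.0 + 1e-9, 1 / hz)
speed = np.minimum(29.0, 14.5 * t)  # ft/s at each sample
cum_dist = np.concatenate([[0.0], np.cumsum(speed[:-1] / hz)])
print(f"Sprint speed: {fastest_one_second_window(cum_dist, hz):.1f} ft/s")
```

Because the window slides over the whole run, the acceleration phase does not drag the number down; only the best second counts.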
Sprint speed benchmarks (2024):
- Elite (Top 10%): 30+ ft/s
- Above average: 28-30 ft/s
- Average: 27-28 ft/s
- Below average: 26-27 ft/s
- Poor (Bottom 10%): <26 ft/s
Sprint speed leaders (2024):
Elly De La Cruz (Cincinnati):
- Sprint speed: 30.8 ft/s (fastest in MLB)
- Age: 22
- Position: Shortstop
Bobby Witt Jr. (Kansas City):
- Sprint speed: 30.5 ft/s
- Age: 24
- Position: Shortstop
Corbin Carroll (Arizona):
- Sprint speed: 30.3 ft/s
- Age: 24
- Position: Outfield
Why sprint speed matters:
- Stolen base success: Faster runners steal more successfully
- Infield hits: Speed creates hits on ground balls
- Extra bases: Fast runners stretch singles to doubles, doubles to triples
- Defensive value: Speed improves range in outfield
R Implementation:
library(tidyverse)
library(baseballr)
# Simulated sprint speed data
set.seed(321)
sprint_data <- data.frame(
player = paste("Player", 1:150),
position = sample(c("IF", "OF", "C"), 150, replace = TRUE, prob = c(0.45, 0.45, 0.1))
) %>%
mutate(
# Sprint speed varies by position
sprint_speed = case_when(
position == "IF" ~ rnorm(n(), 28.0, 1.8),
position == "OF" ~ rnorm(n(), 28.5, 1.7),
position == "C" ~ rnorm(n(), 26.2, 1.4)
),
sprint_speed = pmax(24, pmin(31, sprint_speed)),
# Speed affects baserunning outcomes
sb_attempts = rpois(n(), 8),
sb_success_rate = plogis(-1 + 0.15 * sprint_speed + rnorm(n(), 0, 0.3)),
stolen_bases = rbinom(n(), sb_attempts, sb_success_rate),
caught_stealing = sb_attempts - stolen_bases,
# Extra bases taken
opportunities_1b_to_3b = rpois(n(), 12),
extra_base_rate = plogis(-2 + 0.12 * sprint_speed),
extra_bases_taken = rbinom(n(), opportunities_1b_to_3b, extra_base_rate)
)
# Sprint speed by position
position_speed <- sprint_data %>%
group_by(position) %>%
summarize(
players = n(),
avg_speed = mean(sprint_speed),
sd_speed = sd(sprint_speed),
.groups = "drop"
) %>%
arrange(desc(avg_speed))
print(position_speed)
# Visualize sprint speed distribution by position
ggplot(sprint_data, aes(x = sprint_speed, fill = position)) +
geom_density(alpha = 0.6) +
geom_vline(xintercept = 27, linetype = "dashed", color = "black", size = 1) +
annotate("text", x = 27.5, y = 0.25, label = "MLB Average (27 ft/s)", size = 3.5) +
  scale_fill_manual(
    values = c("IF" = "#2b8cbe", "OF" = "#2ca25f", "C" = "#de2d26"),
    # Pin the break order so each label is attached to the right position
    breaks = c("IF", "OF", "C"),
    labels = c("Infielders", "Outfielders", "Catchers")
  ) +
labs(
title = "Sprint Speed Distribution by Position",
subtitle = "Outfielders and infielders are faster than catchers",
x = "Sprint Speed (ft/s)",
y = "Density",
fill = "Position",
caption = "Data: Simulated | MLB average ≈ 27 ft/s"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
# Sprint speed vs stolen base success
ggplot(sprint_data %>% filter(sb_attempts >= 5),
aes(x = sprint_speed, y = sb_success_rate)) +
geom_point(aes(color = position, size = sb_attempts), alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
  scale_color_manual(
    values = c("IF" = "#2b8cbe", "OF" = "#2ca25f", "C" = "#de2d26"),
    # Pin the break order so each label is attached to the right position
    breaks = c("IF", "OF", "C"),
    labels = c("Infielders", "Outfielders", "Catchers")
  ) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Sprint Speed vs Stolen Base Success Rate",
subtitle = "Faster players steal bases more successfully",
x = "Sprint Speed (ft/s)",
y = "Stolen Base Success Rate",
color = "Position",
size = "SB Attempts",
caption = "Data: Simulated | Minimum 5 SB attempts"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
Python Implementation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit
# Set random seed
np.random.seed(321)
sns.set_style("whitegrid")
# Simulate sprint speed data
n_players = 150
positions = np.random.choice(['IF', 'OF', 'C'], n_players, p=[0.45, 0.45, 0.1])
sprint_data = []
for i, pos in enumerate(positions):
    # Sprint speed varies by position
    if pos == 'IF':
        speed = np.random.normal(28.0, 1.8)
    elif pos == 'OF':
        speed = np.random.normal(28.5, 1.7)
    else:  # C
        speed = np.random.normal(26.2, 1.4)
    speed = np.clip(speed, 24, 31)
    # Speed affects baserunning outcomes
    sb_attempts = np.random.poisson(8)
    sb_success_rate = expit(-1 + 0.15 * speed + np.random.normal(0, 0.3))
    stolen_bases = np.random.binomial(sb_attempts, sb_success_rate)
    opportunities_1b_to_3b = np.random.poisson(12)
    extra_base_rate = expit(-2 + 0.12 * speed)
    extra_bases_taken = np.random.binomial(opportunities_1b_to_3b, extra_base_rate)
    sprint_data.append({
        'player': f'Player {i+1}',
        'position': pos,
        'sprint_speed': speed,
        'sb_attempts': sb_attempts,
        'sb_success_rate': sb_success_rate,
        'stolen_bases': stolen_bases,
        'caught_stealing': sb_attempts - stolen_bases,
        'opportunities_1b_to_3b': opportunities_1b_to_3b,
        'extra_bases_taken': extra_bases_taken
    })
sprint_df = pd.DataFrame(sprint_data)
# Sprint speed by position
position_speed = sprint_df.groupby('position')['sprint_speed'].agg(['count', 'mean', 'std']).round(2)
position_speed.columns = ['players', 'avg_speed', 'sd_speed']
position_speed = position_speed.sort_values('avg_speed', ascending=False)
print("\nSprint Speed by Position:")
print(position_speed)
# Visualize sprint speed distribution by position
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Density plot
colors = {'IF': '#2b8cbe', 'OF': '#2ca25f', 'C': '#de2d26'}
# Label each curve as it is drawn so legend entries cannot get out of order
pos_labels = {'IF': 'Infielders', 'OF': 'Outfielders', 'C': 'Catchers'}
for pos in ['IF', 'OF', 'C']:
    pos_data = sprint_df[sprint_df['position'] == pos]['sprint_speed']
    pos_data.plot(kind='density', ax=ax1, label=pos_labels[pos],
                  color=colors[pos], alpha=0.6, linewidth=2.5)
ax1.axvline(x=27, color='black', linestyle='--', linewidth=2, alpha=0.7)
ax1.text(27.3, 0.25, 'MLB Average\n(27 ft/s)', fontsize=10, ha='left')
ax1.set_xlabel('Sprint Speed (ft/s)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Sprint Speed Distribution by Position\nOutfielders and infielders are faster than catchers',
              fontsize=13, fontweight='bold')
ax1.legend(title='Position')
ax1.grid(True, alpha=0.3)
# Sprint speed vs SB success
sprint_sb = sprint_df[sprint_df['sb_attempts'] >= 5].copy()
# Label each group as it is drawn so legend entries cannot get out of order
pos_labels = {'IF': 'Infielders', 'OF': 'Outfielders', 'C': 'Catchers'}
for pos in ['IF', 'OF', 'C']:
    pos_data = sprint_sb[sprint_sb['position'] == pos]
    ax2.scatter(pos_data['sprint_speed'], pos_data['sb_success_rate'],
                s=pos_data['sb_attempts'] * 10, alpha=0.6, color=colors[pos],
                label=pos_labels[pos])
# Add regression line
from scipy.stats import linregress
slope, intercept, r_value, p_value, std_err = linregress(
    sprint_sb['sprint_speed'], sprint_sb['sb_success_rate'])
x_line = np.array([sprint_sb['sprint_speed'].min(), sprint_sb['sprint_speed'].max()])
y_line = slope * x_line + intercept
ax2.plot(x_line, y_line, 'k--', alpha=0.7, linewidth=2, label=f'Trend (R² = {r_value**2:.3f})')
ax2.set_xlabel('Sprint Speed (ft/s)', fontsize=12)
ax2.set_ylabel('Stolen Base Success Rate', fontsize=12)
ax2.set_title('Sprint Speed vs Stolen Base Success Rate\nFaster players steal bases more successfully',
              fontsize=13, fontweight='bold')
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax2.legend(title='Position')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated | MLB average ≈ 27 ft/s | Minimum 5 SB attempts for scatter plot',
ha='right', fontsize=9, style='italic')
plt.show()
# Print top speedsters
print("\nTop 10 Fastest Players:")
print(sprint_df.nlargest(10, 'sprint_speed')[['player', 'position', 'sprint_speed', 'stolen_bases']])
8.8.2 Baserunning Value {#baserunning-value}
Beyond sprint speed, actual baserunning decisions matter enormously. A fast player who takes bad risks provides little value; a smart player with average speed can be highly valuable.
Components of baserunning value:
Stolen bases (SB) and caught stealing (CS):
- Success threshold: ~75% as a rule of thumb (3 successes per failure); with the run values below, break-even is 0.45 / (0.20 + 0.45) ≈ 69%
- Value of SB: +0.2 runs
- Cost of CS: -0.45 runs
- Net value: (SB × 0.2) - (CS × 0.45)
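The stolen-base arithmetic above is easy to check in code. The sketch below uses the two run values from the list and derives the break-even success rate from them; the 31 SB / 5 CS line is a hypothetical season used only as an example.

```python
SB_RUNS = 0.20  # run value of a stolen base (from the list above)
CS_RUNS = 0.45  # run cost of a caught stealing (from the list above)


def stolen_base_runs(sb, cs):
    """Net run value of a player's steal attempts."""
    return sb * SB_RUNS - cs * CS_RUNS


def break_even_rate():
    """Success rate p at which p * SB_RUNS - (1 - p) * CS_RUNS = 0."""
    return CS_RUNS / (SB_RUNS + CS_RUNS)


print(f"Break-even success rate: {break_even_rate():.1%}")
print(f"31 SB, 5 CS: {stolen_base_runs(31, 5):+.2f} runs")
```

With these particular run values the break-even rate works out to about 69%, somewhat below the ~75% rule of thumb; the exact threshold shifts with the run environment and game situation.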
Extra bases taken:
- First to third on single: +0.27 runs
- Second to home on single: +0.53 runs
- First to home on double: +0.40 runs
Outs on bases:
- Getting thrown out on basepaths: -0.50 runs (varies by situation)
Going first to third (example):
- Opportunities: 35 per season (varies by lineup position)
- League average rate: 28%
- Elite baserunner rate: 40-45%
- Runs above average: (Your rate - 0.28) × 35 × 0.27 ≈ 1-2 runs
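The first-to-third formula above translates directly into a small function. The defaults are the figures from the list (35 opportunities, 28% league rate, 0.27 runs per extra base); the function name is just an illustrative choice.

```python
def first_to_third_runs(rate, opportunities=35, league_rate=0.28, run_value=0.27):
    """Runs above average from going first to third on singles."""
    return (rate - league_rate) * opportunities * run_value


for rate in (0.28, 0.40, 0.45):
    print(f"rate {rate:.0%}: {first_to_third_runs(rate):+.2f} runs")
```

Even an elite 40-45% rate adds only a run or two on this one play type, which is why baserunning value accumulates across many small edges rather than from any single skill.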
Real examples (2024 estimated):
Bobby Witt Jr. (Kansas City):
- Sprint speed: 30.5 ft/s
- Stolen bases: 31 (86% success rate)
- Extra bases taken: Above average
- Baserunning value: +8 runs
Ronald Acuña Jr. (Atlanta, when healthy):
- Sprint speed: 30.1 ft/s
- Stolen bases: Elite success rate (typically 85%+)
- Extra bases taken: Elite
- Baserunning value: +10-12 runs (full season)
Salvador Perez (Kansas City):
- Sprint speed: 25.1 ft/s
- Stolen bases: 1 (minimal attempts)
- Extra bases taken: Below average
- Baserunning value: -3 runs
8.8.3 BsR (Baserunning Runs) {#bsr}
BsR (Baserunning Runs Above Average) is FanGraphs' comprehensive baserunning metric. It combines all baserunning contributions into a single runs number:
BsR = wSB + UBR + wGDP
Where:
- wSB: Weighted stolen base runs (SB value - CS cost)
- UBR: Ultimate Base Running (extra bases, outs on bases)
- wGDP: Grounded into double play runs (avoiding GIDP is valuable)
BsR interpretation:
- +5: Elite baserunner (provides real value)
- +2 to +5: Above average
- -2 to +2: Average
- -2 to -5: Below average
- <-5: Poor baserunner (costs team runs)
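As a quick illustration, the BsR sum and the tiers above can be expressed as two small functions. The component values in the example are hypothetical, and the handling of the exact boundary values (for instance, whether +2.0 counts as "Average" or "Above average") is an assumption, since the tiers above share their endpoints.

```python
def bsr(wsb, ubr, wgdp):
    """BsR = wSB + UBR + wGDP, all in runs above average."""
    return wsb + ubr + wgdp


def bsr_tier(value):
    """Map a BsR value to the tiers listed above (boundaries assumed)."""
    if value >= 5:
        return "Elite"
    if value >= 2:
        return "Above average"
    if value > -2:
        return "Average"
    if value >= -5:
        return "Below average"
    return "Poor"


# Hypothetical component values for one player-season.
total = bsr(wsb=3.1, ubr=2.4, wgdp=0.8)
print(f"BsR = {total:+.1f} ({bsr_tier(total)})")
```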
BsR leaders often combine:
- Speed to take extra bases and steal successfully
- Aggressiveness to attempt steals and advances
- Instincts to avoid outs on bases
- Speed to avoid double plays
BsR shows that baserunning, while less valuable than hitting or pitching, still matters. The gap between elite and poor baserunners is 10-15 runs per season—equivalent to 30-40 points of wOBA or 1.0-1.5 WAR.
library(tidyverse)
library(baseballr)
# Simulated sprint speed data
set.seed(321)
sprint_data <- data.frame(
player = paste("Player", 1:150),
position = sample(c("IF", "OF", "C"), 150, replace = TRUE, prob = c(0.45, 0.45, 0.1))
) %>%
mutate(
# Sprint speed varies by position
sprint_speed = case_when(
position == "IF" ~ rnorm(n(), 28.0, 1.8),
position == "OF" ~ rnorm(n(), 28.5, 1.7),
position == "C" ~ rnorm(n(), 26.2, 1.4)
),
sprint_speed = pmax(24, pmin(31, sprint_speed)),
# Speed affects baserunning outcomes
sb_attempts = rpois(n(), 8),
sb_success_rate = plogis(-1 + 0.15 * sprint_speed + rnorm(n(), 0, 0.3)),
stolen_bases = rbinom(n(), sb_attempts, sb_success_rate),
caught_stealing = sb_attempts - stolen_bases,
# Extra bases taken
opportunities_1b_to_3b = rpois(n(), 12),
extra_base_rate = plogis(-2 + 0.12 * sprint_speed),
extra_bases_taken = rbinom(n(), opportunities_1b_to_3b, extra_base_rate)
)
# Sprint speed by position
position_speed <- sprint_data %>%
group_by(position) %>%
summarize(
players = n(),
avg_speed = mean(sprint_speed),
sd_speed = sd(sprint_speed),
.groups = "drop"
) %>%
arrange(desc(avg_speed))
print(position_speed)
# Visualize sprint speed distribution by position
ggplot(sprint_data, aes(x = sprint_speed, fill = position)) +
geom_density(alpha = 0.6) +
geom_vline(xintercept = 27, linetype = "dashed", color = "black", size = 1) +
annotate("text", x = 27.5, y = 0.25, label = "MLB Average (27 ft/s)", size = 3.5) +
scale_fill_manual(
values = c("IF" = "#2b8cbe", "OF" = "#2ca25f", "C" = "#de2d26"),
labels = c("Infielders", "Outfielders", "Catchers")
) +
labs(
title = "Sprint Speed Distribution by Position",
subtitle = "Outfielders and infielders are faster than catchers",
x = "Sprint Speed (ft/s)",
y = "Density",
fill = "Position",
caption = "Data: Simulated | MLB average ≈ 27 ft/s"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
# Sprint speed vs stolen base success
ggplot(sprint_data %>% filter(sb_attempts >= 5),
aes(x = sprint_speed, y = sb_success_rate)) +
geom_point(aes(color = position, size = sb_attempts), alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black", linetype = "dashed") +
scale_color_manual(
# Explicit breaks keep each label paired with the right position code
breaks = c("IF", "OF", "C"),
values = c("IF" = "#2b8cbe", "OF" = "#2ca25f", "C" = "#de2d26"),
labels = c("Infielders", "Outfielders", "Catchers")
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
title = "Sprint Speed vs Stolen Base Success Rate",
subtitle = "Faster players steal bases more successfully",
x = "Sprint Speed (ft/s)",
y = "Stolen Base Success Rate",
color = "Position",
size = "SB Attempts",
caption = "Data: Simulated | Minimum 5 SB attempts"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import expit
from scipy.stats import linregress
# Set random seed
np.random.seed(321)
sns.set_style("whitegrid")
# Simulate sprint speed data
n_players = 150
positions = np.random.choice(['IF', 'OF', 'C'], n_players, p=[0.45, 0.45, 0.1])
sprint_data = []
for i, pos in enumerate(positions):
    # Sprint speed varies by position
    if pos == 'IF':
        speed = np.random.normal(28.0, 1.8)
    elif pos == 'OF':
        speed = np.random.normal(28.5, 1.7)
    else:  # C
        speed = np.random.normal(26.2, 1.4)
    speed = np.clip(speed, 24, 31)
    # Speed affects baserunning outcomes
    sb_attempts = np.random.poisson(8)
    sb_success_rate = expit(-1 + 0.15 * speed + np.random.normal(0, 0.3))
    stolen_bases = np.random.binomial(sb_attempts, sb_success_rate)
    opportunities_1b_to_3b = np.random.poisson(12)
    extra_base_rate = expit(-2 + 0.12 * speed)
    extra_bases_taken = np.random.binomial(opportunities_1b_to_3b, extra_base_rate)
    sprint_data.append({
        'player': f'Player {i+1}',
        'position': pos,
        'sprint_speed': speed,
        'sb_attempts': sb_attempts,
        'sb_success_rate': sb_success_rate,
        'stolen_bases': stolen_bases,
        'caught_stealing': sb_attempts - stolen_bases,
        'opportunities_1b_to_3b': opportunities_1b_to_3b,
        'extra_bases_taken': extra_bases_taken
    })
sprint_df = pd.DataFrame(sprint_data)
# Sprint speed by position
position_speed = sprint_df.groupby('position')['sprint_speed'].agg(['count', 'mean', 'std']).round(2)
position_speed.columns = ['players', 'avg_speed', 'sd_speed']
position_speed = position_speed.sort_values('avg_speed', ascending=False)
print("\nSprint Speed by Position:")
print(position_speed)
# Visualize sprint speed distribution by position
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Density plot
colors = {'IF': '#2b8cbe', 'OF': '#2ca25f', 'C': '#de2d26'}
# Legend names are attached when each series is drawn, so they cannot be
# paired with the wrong position
pos_labels = {'IF': 'Infielders', 'OF': 'Outfielders', 'C': 'Catchers'}
for pos in ['IF', 'OF', 'C']:
    pos_data = sprint_df[sprint_df['position'] == pos]['sprint_speed']
    pos_data.plot(kind='density', ax=ax1, label=pos_labels[pos], color=colors[pos],
                  alpha=0.6, linewidth=2.5)
ax1.axvline(x=27, color='black', linestyle='--', linewidth=2, alpha=0.7)
ax1.text(27.3, 0.25, 'MLB Average\n(27 ft/s)', fontsize=10, ha='left')
ax1.set_xlabel('Sprint Speed (ft/s)', fontsize=12)
ax1.set_ylabel('Density', fontsize=12)
ax1.set_title('Sprint Speed Distribution by Position\nOutfielders and infielders are faster than catchers',
              fontsize=13, fontweight='bold')
ax1.legend(title='Position')
ax1.grid(True, alpha=0.3)
# Sprint speed vs SB success
sprint_sb = sprint_df[sprint_df['sb_attempts'] >= 5].copy()
for pos in ['IF', 'OF', 'C']:
    pos_data = sprint_sb[sprint_sb['position'] == pos]
    ax2.scatter(pos_data['sprint_speed'], pos_data['sb_success_rate'],
                s=pos_data['sb_attempts'] * 10, alpha=0.6, color=colors[pos],
                label=pos_labels[pos])
# Add regression line
slope, intercept, r_value, p_value, std_err = linregress(
    sprint_sb['sprint_speed'], sprint_sb['sb_success_rate'])
x_line = np.array([sprint_sb['sprint_speed'].min(), sprint_sb['sprint_speed'].max()])
y_line = slope * x_line + intercept
ax2.plot(x_line, y_line, 'k--', alpha=0.7, linewidth=2, label=f'Trend (R² = {r_value**2:.3f})')
ax2.set_xlabel('Sprint Speed (ft/s)', fontsize=12)
ax2.set_ylabel('Stolen Base Success Rate', fontsize=12)
ax2.set_title('Sprint Speed vs Stolen Base Success Rate\nFaster players steal bases more successfully',
              fontsize=13, fontweight='bold')
ax2.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
ax2.legend(title='Position')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.figtext(0.99, 0.01, 'Data: Simulated | MLB average ≈ 27 ft/s | Minimum 5 SB attempts for scatter plot',
            ha='right', fontsize=9, style='italic')
plt.show()
# Print top speedsters
print("\nTop 10 Fastest Players:")
print(sprint_df.nlargest(10, 'sprint_speed')[['player', 'position', 'sprint_speed', 'stolen_bases']])
Interactive visualizations turn fielding analysis from static reporting into exploration, letting coaches, analysts, and fans investigate defensive patterns that aggregated metrics alone cannot reveal. Traditional defensive statistics reduce complex spatial and temporal data to single numbers; interactive tools keep the underlying detail available. This section introduces three interactive visualization techniques built with Plotly, which provides the zoom, filtering, hover, and animation capabilities that this kind of defensive analysis relies on.
Interactive fielding visualizations serve several audiences. Player development staff identify positioning adjustments and skill gaps that call for targeted training. Front office personnel evaluate trade and free agent targets by exploring defensive performance in different contexts. Opposing teams scout defensive tendencies to optimize baserunning and hit placement strategies. In each case, moving from static to interactive charts changes how defensive data informs decisions.
8.8.1 Interactive Catch Probability Heat Map
Catch probability heat maps visualize where fielders make plays and which plays they convert at rates above or below expectation. Making these maps interactive enables filtering by game situation (close vs. blowout), batted ball type (fly ball vs. line drive), or time period (early season vs. late season). Users can hover over specific zones to see conversion rates, expected rates, and outs above average for that region. This granularity reveals positioning inefficiencies and range limitations that summary statistics obscure.
R Implementation:
library(tidyverse)
library(plotly)
library(baseballr)
create_catch_probability_heatmap <- function(fielding_data, player_name = "Fielder",
position = "OF") {
# Filter and prepare data
plays <- fielding_data %>%
filter(!is.na(hc_x), !is.na(hc_y), !is.na(hit_distance_sc)) %>%
mutate(
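# Adjust coordinates for plotting (home plate at origin)
# hc_x / hc_y are Statcast chart units, not feet; roughly 2.5 ft per unit
# is a commonly used approximation
x_coord = 2.5 * (hc_x - 125.42), # Center on home plate, convert to feet
y_coord = 2.5 * (208.48 - hc_y), # Flip y-axis, convert to feet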
# Simplified catch probability based on distance
distance = hit_distance_sc,
expected_catch_prob = case_when(
distance < 50 ~ 0.98,
distance < 100 ~ 0.85,
distance < 150 ~ 0.65,
distance < 200 ~ 0.40,
distance < 250 ~ 0.20,
TRUE ~ 0.05
),
was_caught = ifelse(events %in% c("field_out", "double_play",
"sac_fly"), 1, 0),
outs_above_avg = was_caught - expected_catch_prob
)
# Create square grid bins for aggregation (20x20 foot cells)
plays <- plays %>%
mutate(
x_bin = round(x_coord / 20) * 20,
y_bin = round(y_coord / 20) * 20,
hover_text = paste0(
"Location: (", round(x_coord, 0), ", ", round(y_coord, 0), ")<br>",
"Distance: ", round(distance, 0), " ft<br>",
"Expected: ", round(expected_catch_prob * 100, 0), "%<br>",
"Result: ", ifelse(was_caught == 1, "Out", "Hit")
)
)
# Aggregate by bins
binned_data <- plays %>%
group_by(x_bin, y_bin) %>%
summarize(
plays = n(),
catches = sum(was_caught),
expected_catches = sum(expected_catch_prob),
catch_rate = catches / plays,
oaa_zone = sum(outs_above_avg),
avg_distance = mean(distance),
.groups = "drop"
) %>%
filter(plays >= 3) %>% # Minimum plays per zone
mutate(
hover_info = paste0(
"<b>Zone Performance</b><br>",
"Plays: ", plays, "<br>",
"Catch Rate: ", round(catch_rate * 100, 0), "%<br>",
"Expected: ", round((expected_catches/plays) * 100, 0), "%<br>",
"OAA (zone): ", sprintf("%+.1f", oaa_zone), "<br>",
"Avg Distance: ", round(avg_distance, 0), " ft"
)
)
# Create interactive heatmap
p <- plot_ly(
data = binned_data,
x = ~x_bin,
y = ~y_bin,
z = ~oaa_zone,
type = "contour",
colorscale = list(
c(0, "rgb(215, 48, 39)"), # Red for negative OAA
c(0.5, "rgb(255, 255, 191)"), # Yellow for average
c(1, "rgb(44, 123, 182)") # Blue for positive OAA
),
colorbar = list(
title = "<b>Outs Above<br>Average</b>",
tickformat = "+.1f"
),
text = ~hover_info,
hoverinfo = "text",
contours = list(
showlabels = TRUE,
labelfont = list(size = 10, color = 'white')
)
) %>%
add_trace(
data = plays,
inherit = FALSE, # do not inherit z/text formulas that reference binned_data columns
x = ~x_coord,
y = ~y_coord,
type = "scatter",
mode = "markers",
marker = list(
size = 4,
color = ~was_caught,
colorscale = list(c(0, "red"), c(1, "green")),
opacity = 0.3,
line = list(width = 0.5, color = 'black')
),
text = ~hover_text,
hoverinfo = "text",
showlegend = FALSE
) %>%
layout(
title = list(
text = paste0("<b>", player_name, " Catch Probability Heat Map</b><br>",
"<sub>", position, " - Red zones = below average, Blue = above average</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Horizontal Distance from Home Plate (ft)</b><br>← LF | CF | RF →",
range = c(-250, 250),
zeroline = TRUE,
zerolinewidth = 2,
zerolinecolor = 'black',
gridcolor = 'lightgray'
),
yaxis = list(
title = "<b>Distance from Home Plate (ft)</b>",
range = c(0, 450),
scaleanchor = "x",
scaleratio = 1,
gridcolor = 'lightgray'
),
hovermode = 'closest',
showlegend = FALSE,
margin = list(l = 80, r = 120, t = 100, b = 80)
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage
# fielding_data <- statcast_search(
# start_date = "2024-04-01",
# end_date = "2024-10-01",
# player_type = "batter"
# ) %>%
# filter(fielder_9 == player_id_of_interest) # fielder_9 is the right fielder; fielder_2 is the catcher
#
# heatmap <- create_catch_probability_heatmap(
# fielding_data,
# "Mookie Betts",
# "RF"
# )
# heatmap
Python Implementation:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import statcast
def create_catch_probability_heatmap(fielding_data, player_name="Fielder",
                                     position="OF"):
    """
    Create interactive catch probability heat map.
    Parameters:
        fielding_data: DataFrame with Statcast batted ball data
        player_name: Name for chart title
        position: Fielder position
    Returns:
        Plotly figure object
    """
    # Filter and prepare data
    plays = fielding_data[
        fielding_data['hc_x'].notna() &
        fielding_data['hc_y'].notna() &
        fielding_data['hit_distance_sc'].notna()
    ].copy()
    # Adjust coordinates (home plate at origin)
    # hc_x / hc_y are Statcast chart units, not feet; roughly 2.5 ft per unit
    # is a commonly used approximation
    plays['x_coord'] = 2.5 * (plays['hc_x'] - 125.42)
    plays['y_coord'] = 2.5 * (208.48 - plays['hc_y'])
    plays['distance'] = plays['hit_distance_sc']
    # Simplified catch probability model
    def calc_expected_catch(distance):
        if distance < 50:
            return 0.98
        elif distance < 100:
            return 0.85
        elif distance < 150:
            return 0.65
        elif distance < 200:
            return 0.40
        elif distance < 250:
            return 0.20
        else:
            return 0.05
    plays['expected_catch_prob'] = plays['distance'].apply(calc_expected_catch)
    # Determine outcomes
    caught_events = ['field_out', 'double_play', 'sac_fly']
    plays['was_caught'] = plays['events'].isin(caught_events).astype(int)
    plays['outs_above_avg'] = plays['was_caught'] - plays['expected_catch_prob']
    # Create hover text for individual plays
    plays['hover_text'] = plays.apply(
        lambda row: f"Location: ({row['x_coord']:.0f}, {row['y_coord']:.0f})<br>" +
                    f"Distance: {row['distance']:.0f} ft<br>" +
                    f"Expected: {row['expected_catch_prob']*100:.0f}%<br>" +
                    f"Result: {'Out' if row['was_caught'] else 'Hit'}",
        axis=1
    )
    # Create grid bins (20x20 feet)
    plays['x_bin'] = (plays['x_coord'] / 20).round() * 20
    plays['y_bin'] = (plays['y_coord'] / 20).round() * 20
    # Aggregate by bins
    binned_data = plays.groupby(['x_bin', 'y_bin']).agg({
        'was_caught': ['count', 'sum'],
        'expected_catch_prob': 'sum',
        'outs_above_avg': 'sum',
        'distance': 'mean'
    }).reset_index()
    binned_data.columns = ['x_bin', 'y_bin', 'plays', 'catches',
                           'expected_catches', 'oaa_zone', 'avg_distance']
    # Filter minimum plays
    binned_data = binned_data[binned_data['plays'] >= 3].copy()
    binned_data['catch_rate'] = binned_data['catches'] / binned_data['plays']
    binned_data['expected_rate'] = binned_data['expected_catches'] / binned_data['plays']
    # Create hover info (row values arrive as floats, so cast the play count)
    binned_data['hover_info'] = binned_data.apply(
        lambda row: f"<b>Zone Performance</b><br>" +
                    f"Plays: {int(row['plays'])}<br>" +
                    f"Catch Rate: {row['catch_rate']*100:.0f}%<br>" +
                    f"Expected: {row['expected_rate']*100:.0f}%<br>" +
                    f"OAA (zone): {row['oaa_zone']:+.1f}<br>" +
                    f"Avg Distance: {row['avg_distance']:.0f} ft",
        axis=1
    )
    # Create figure
    fig = go.Figure()
    # Add contour heat map
    # Create grid for contour
    x_range = np.arange(binned_data['x_bin'].min(), binned_data['x_bin'].max() + 20, 20)
    y_range = np.arange(binned_data['y_bin'].min(), binned_data['y_bin'].max() + 20, 20)
    # Create z-values matrix (cells without enough plays stay NaN)
    z_matrix = np.full((len(y_range), len(x_range)), np.nan)
    for _, row in binned_data.iterrows():
        x_idx = np.where(x_range == row['x_bin'])[0]
        y_idx = np.where(y_range == row['y_bin'])[0]
        if len(x_idx) > 0 and len(y_idx) > 0:
            z_matrix[y_idx[0], x_idx[0]] = row['oaa_zone']
    fig.add_trace(go.Contour(
        x=x_range,
        y=y_range,
        z=z_matrix,
        colorscale=[
            [0, 'rgb(215, 48, 39)'],      # Red for negative
            [0.5, 'rgb(255, 255, 191)'],  # Yellow for average
            [1, 'rgb(44, 123, 182)']      # Blue for positive
        ],
        colorbar=dict(
            title="<b>Outs Above<br>Average</b>",
            tickformat="+.1f"
        ),
        hoverinfo='skip',
        contours=dict(
            showlabels=True,
            labelfont=dict(size=10, color='white')
        )
    ))
    # Add scatter points for individual plays
    fig.add_trace(go.Scatter(
        x=plays['x_coord'],
        y=plays['y_coord'],
        mode='markers',
        marker=dict(
            size=4,
            color=plays['was_caught'],
            colorscale=[[0, 'red'], [1, 'green']],
            opacity=0.3,
            line=dict(width=0.5, color='black'),
            showscale=False
        ),
        text=plays['hover_text'],
        hoverinfo='text',
        showlegend=False
    ))
    # Update layout
    fig.update_layout(
        title=dict(
            text=f"<b>{player_name} Catch Probability Heat Map</b><br>" +
                 f"<sub>{position} - Red zones = below average, Blue = above average</sub>",
            x=0.5,
            xanchor='center',
            font=dict(size=16)
        ),
        xaxis=dict(
            title="<b>Horizontal Distance from Home Plate (ft)</b><br>← LF | CF | RF →",
            range=[-250, 250],
            zeroline=True,
            zerolinewidth=2,
            zerolinecolor='black',
            gridcolor='lightgray',
            showgrid=True
        ),
        yaxis=dict(
            title="<b>Distance from Home Plate (ft)</b>",
            range=[0, 450],
            scaleanchor="x",
            scaleratio=1,
            gridcolor='lightgray',
            showgrid=True
        ),
        hovermode='closest',
        showlegend=False,
        width=1000,
        height=1000,
        margin=dict(l=80, r=150, t=100, b=80),
        template='plotly_white'
    )
    return fig
# Example usage
# fielding_data = statcast(start_dt='2024-04-01', end_dt='2024-10-01')
# fielding_data = fielding_data[fielding_data['fielder_9'] == player_id]  # fielder_9 = right fielder (fielder_2 is the catcher)
# fig = create_catch_probability_heatmap(fielding_data, "Mookie Betts", "RF")
# fig.show()
The interactive catch probability heat map reveals positioning and range patterns that single-number metrics cannot capture. A right fielder might have excellent OAA in straightaway right field but struggle in the gap, suggesting either positioning adjustments or specific training on balls hit to that zone. The function as written offers hover and zoom; subsetting the input data by batted ball characteristics or game situation before plotting is what makes highly targeted developmental interventions possible.
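One way to get the situational views described above is to subset the Statcast rows before they reach the heat map function. The sketch below is a minimal illustration, not part of the original implementation: `filter_plays` and its thresholds are hypothetical, and it assumes the input carries the Statcast columns `bb_type`, `bat_score`, and `fld_score`.

```python
import pandas as pd


def filter_plays(fielding_data, bb_types=None, max_score_diff=None):
    """Return the plays matching a batted-ball type and/or score margin.

    bb_types: iterable of Statcast bb_type values, e.g. ["fly_ball"].
    max_score_diff: keep plays where |bat_score - fld_score| <= this value.
    """
    plays = fielding_data
    if bb_types is not None:
        plays = plays[plays["bb_type"].isin(bb_types)]
    if max_score_diff is not None:
        score_diff = (plays["bat_score"] - plays["fld_score"]).abs()
        plays = plays[score_diff <= max_score_diff]
    return plays.copy()


# Toy rows standing in for Statcast data (values are made up).
toy = pd.DataFrame({
    "bb_type": ["fly_ball", "line_drive", "fly_ball", "ground_ball"],
    "bat_score": [2, 0, 9, 3],
    "fld_score": [3, 0, 1, 3],
})

# Fly balls in games within two runs.
close_fly_balls = filter_plays(toy, bb_types=["fly_ball"], max_score_diff=2)
print(close_fly_balls)
```

The filtered frame can then be passed to `create_catch_probability_heatmap` in place of the full data set, producing one heat map per context.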
8.8.2 Interactive Sprint Speed Comparison Chart
Sprint speed affects both offensive (baserunning) and defensive (range) performance, making it a crucial athletic tool for position players. An interactive sprint speed comparison chart allows filtering by position, age, or team while displaying relationships between speed and performance outcomes. Users can identify speed-dependent skills, track aging curves, and benchmark players against positional peers. Hover details reveal sprint speed rankings, percentiles, and associated performance metrics.
R Implementation:
library(plotly)
library(dplyr)
create_sprint_speed_comparison <- function(player_data) {
# Prepare data
sprint_data <- player_data %>%
filter(!is.na(sprint_speed)) %>%
mutate(
# Create position groups
position_group = case_when(
position %in% c("LF", "CF", "RF") ~ "Outfield",
position %in% c("SS", "2B", "3B") ~ "Infield",
position == "C" ~ "Catcher",
position == "1B" ~ "First Base",
TRUE ~ "Other"
),
# Calculate percentile
speed_percentile = percent_rank(sprint_speed) * 100,
hover_text = paste0(
"<b>", player_name, "</b><br>",
"Position: ", position, "<br>",
"Sprint Speed: ", round(sprint_speed, 2), " ft/s<br>",
"Percentile: ", round(speed_percentile, 0), "%<br>",
"Age: ", age, "<br>",
"SB: ", stolen_bases, " (", round(sb_success_rate * 100, 0), "% success)"
)
)
# Create interactive scatter plot
p <- plot_ly(
data = sprint_data,
x = ~age,
y = ~sprint_speed,
color = ~position_group,
colors = "Set2",
size = ~stolen_bases,
sizes = c(10, 100),
text = ~hover_text,
hoverinfo = "text",
type = "scatter",
mode = "markers",
marker = list(
opacity = 0.7,
line = list(width = 1, color = 'black')
)
) %>%
layout(
title = list(
# R does not concatenate strings with +, so build the title with paste0()
text = paste0("<b>Sprint Speed by Age and Position</b><br>",
"<sub>Size = Stolen Bases | Hover for Details</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Age (years)</b>",
range = c(20, 42),
gridcolor = 'lightgray'
),
yaxis = list(
title = "<b>Sprint Speed (ft/s)</b>",
range = c(24, 32),
gridcolor = 'lightgray'
),
hovermode = 'closest',
showlegend = TRUE,
legend = list(
title = list(text = '<b>Position Group</b>'),
orientation = 'v',
x = 1.02,
y = 1
),
margin = list(l = 80, r = 150, t = 100, b = 80)
) %>%
add_lines(
inherit = FALSE, # the summarized age data lacks position_group, stolen_bases, and hover_text
data = sprint_data %>%
group_by(age) %>%
summarize(avg_speed = mean(sprint_speed, na.rm = TRUE)),
x = ~age,
y = ~avg_speed,
line = list(color = 'black', width = 2, dash = 'dash'),
name = 'Age Curve',
hoverinfo = 'skip',
showlegend = TRUE
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage with simulated data
# set.seed(123)
# player_data <- data.frame(
# player_name = paste("Player", 1:200),
# position = sample(c("CF", "RF", "LF", "SS", "2B", "3B", "1B", "C"),
# 200, replace = TRUE),
# age = sample(22:40, 200, replace = TRUE),
# sprint_speed = rnorm(200, 27.5, 2),
# stolen_bases = rpois(200, 10),
# sb_success_rate = rbeta(200, 7, 3)
# )
#
# sprint_chart <- create_sprint_speed_comparison(player_data)
# sprint_chart
Python Implementation:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
def create_sprint_speed_comparison(player_data):
    """
    Create interactive sprint speed comparison chart.
    Parameters:
        player_data: DataFrame with player sprint speed and performance data
    Returns:
        Plotly figure object
    """
    # Prepare data
    sprint_data = player_data[player_data['sprint_speed'].notna()].copy()
    # Create position groups
    def position_group(pos):
        if pos in ['LF', 'CF', 'RF']:
            return 'Outfield'
        elif pos in ['SS', '2B', '3B']:
            return 'Infield'
        elif pos == 'C':
            return 'Catcher'
        elif pos == '1B':
            return 'First Base'
        else:
            return 'Other'
    sprint_data['position_group'] = sprint_data['position'].apply(position_group)
    # Calculate percentiles
    sprint_data['speed_percentile'] = sprint_data['sprint_speed'].rank(pct=True) * 100
    # Create hover text
    sprint_data['hover_text'] = sprint_data.apply(
        lambda row: f"<b>{row['player_name']}</b><br>" +
                    f"Position: {row['position']}<br>" +
                    f"Sprint Speed: {row['sprint_speed']:.2f} ft/s<br>" +
                    f"Percentile: {row['speed_percentile']:.0f}%<br>" +
                    f"Age: {row['age']}<br>" +
                    f"SB: {row['stolen_bases']} " +
                    f"({row['sb_success_rate']*100:.0f}% success)",
        axis=1
    )
    # Create figure
    fig = go.Figure()
    # Color mapping for position groups
    color_map = {
        'Outfield': '#1f77b4',
        'Infield': '#ff7f0e',
        'Catcher': '#2ca02c',
        'First Base': '#d62728',
        'Other': '#9467bd'
    }
    # Add scatter for each position group
    for pos_group in sprint_data['position_group'].unique():
        group_data = sprint_data[sprint_data['position_group'] == pos_group]
        fig.add_trace(go.Scatter(
            x=group_data['age'],
            y=group_data['sprint_speed'],
            mode='markers',
            name=pos_group,
            text=group_data['hover_text'],
            hoverinfo='text',
            marker=dict(
                size=group_data['stolen_bases'].clip(upper=30),  # Cap size
                sizemode='diameter',
                sizeref=0.5,
                color=color_map.get(pos_group, '#7f7f7f'),
                opacity=0.7,
                line=dict(width=1, color='black')
            )
        ))
    # Add age curve trend line
    age_curve = sprint_data.groupby('age')['sprint_speed'].mean().reset_index()
    fig.add_trace(go.Scatter(
        x=age_curve['age'],
        y=age_curve['sprint_speed'],
        mode='lines',
        name='Age Curve',
        line=dict(color='black', width=2, dash='dash'),
        hoverinfo='skip'
    ))
    # Update layout
    fig.update_layout(
        title=dict(
            text="<b>Sprint Speed by Age and Position</b><br>" +
                 "<sub>Size = Stolen Bases | Hover for Details</sub>",
            x=0.5,
            xanchor='center',
            font=dict(size=16)
        ),
        xaxis=dict(
            title="<b>Age (years)</b>",
            range=[20, 42],
            gridcolor='lightgray',
            showgrid=True
        ),
        yaxis=dict(
            title="<b>Sprint Speed (ft/s)</b>",
            range=[24, 32],
            gridcolor='lightgray',
            showgrid=True
        ),
        hovermode='closest',
        showlegend=True,
        legend=dict(
            title=dict(text='<b>Position Group</b>'),
            orientation='v',
            x=1.02,
            y=1
        ),
        width=1100,
        height=700,
        margin=dict(l=80, r=150, t=100, b=80),
        template='plotly_white'
    )
    return fig
# Example usage with simulated data
# np.random.seed(123)
# player_data = pd.DataFrame({
# 'player_name': [f'Player {i}' for i in range(1, 201)],
# 'position': np.random.choice(['CF', 'RF', 'LF', 'SS', '2B', '3B', '1B', 'C'],
# 200),
# 'age': np.random.randint(22, 41, 200),
# 'sprint_speed': np.random.normal(27.5, 2, 200),
# 'stolen_bases': np.random.poisson(10, 200),
# 'sb_success_rate': np.random.beta(7, 3, 200)
# })
#
# fig = create_sprint_speed_comparison(player_data)
# fig.show()
The interactive sprint speed comparison chart enables multi-dimensional analysis of the speed-performance relationship. Analysts can quickly identify players maintaining elite speed into their 30s (potential trade targets), young players with poor speed who might not age well defensively, or position players whose speed doesn't translate to stolen base success (indicating poor baserunning instincts despite physical tools).
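The screening questions in the paragraph above can also be asked of the data directly. The following is a minimal sketch under stated assumptions: the function name `flag_speed_profiles` and the thresholds (top quartile of speed, age 30, 65% stolen base success) are illustrative choices rather than established cutoffs, and the column names follow the simulated `player_data` used in this section.

```python
import pandas as pd


def flag_speed_profiles(player_data, elite_pct=75, veteran_age=30, low_sb_rate=0.65):
    """Flag two speed profiles: elite-speed veterans and fast but
    inefficient base stealers. All thresholds are illustrative."""
    df = player_data.copy()
    df["speed_percentile"] = df["sprint_speed"].rank(pct=True) * 100
    elite = df["speed_percentile"] >= elite_pct
    df["elite_speed_veteran"] = elite & (df["age"] >= veteran_age)
    df["fast_but_inefficient"] = elite & (df["sb_success_rate"] < low_sb_rate)
    return df


# Toy roster (values are made up for illustration).
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E"],
    "age": [33, 24, 31, 27, 29],
    "sprint_speed": [30.1, 29.8, 26.0, 27.2, 25.5],
    "sb_success_rate": [0.82, 0.55, 0.70, 0.75, 0.60],
})

flags = flag_speed_profiles(toy)
print(flags[["player_name", "speed_percentile",
             "elite_speed_veteran", "fast_but_inefficient"]])
```

In the toy roster, player A is the fast veteran and player B is the young runner whose speed has not translated into stolen base success; both would be natural points to inspect in the interactive chart.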
8.8.3 Animated Play Trajectory Visualization
Defensive plays unfold over time and space, making them ideal candidates for animated visualization. An animated play trajectory shows fielder movement from initial positioning through ball contact to catch or miss, overlaid with optimal routes and timing benchmarks. This reveals jump efficiency, route selection, and closing speed in ways that static metrics cannot. Analysts can compare actual routes to optimal paths, identify hesitation or misdirection, and evaluate recovery ability after poor initial reads.
R Implementation:
library(plotly)
library(dplyr)
create_animated_play_trajectory <- function(play_tracking_data, play_description = "Defensive Play") {
# Prepare tracking data
# Expected format: frame-by-frame position data
tracking <- play_tracking_data %>%
arrange(frame_time) %>%
mutate(
time_elapsed = frame_time - min(frame_time),
distance_from_ball = sqrt((x_position - ball_x)^2 + (y_position - ball_y)^2),
speed = sqrt(velocity_x^2 + velocity_y^2),
frame_label = paste0(
"Time: ", round(time_elapsed, 2), " sec<br>",
"Fielder Position: (", round(x_position, 1), ", ", round(y_position, 1), ")<br>",
"Distance to Ball: ", round(distance_from_ball, 1), " ft<br>",
"Speed: ", round(speed, 2), " ft/s"
)
)
# Calculate optimal route (straight line from start to catch point)
start_x <- tracking$x_position[1]
start_y <- tracking$y_position[1]
end_x <- tracking$ball_x[nrow(tracking)]
end_y <- tracking$ball_y[nrow(tracking)]
optimal_route <- data.frame(
x = seq(start_x, end_x, length.out = 50),
y = seq(start_y, end_y, length.out = 50)
)
# Create animated plot
p <- plot_ly() %>%
# Add field outline (simplified)
add_trace(
type = "scatter",
mode = "lines",
x = c(-200, 200, 200, -200, -200),
y = c(0, 0, 400, 400, 0),
line = list(color = "green", width = 2),
showlegend = FALSE,
hoverinfo = "skip"
) %>%
# Add optimal route
add_trace(
data = optimal_route,
x = ~x,
y = ~y,
type = "scatter",
mode = "lines",
line = list(color = "blue", width = 2, dash = "dash"),
name = "Optimal Route",
hoverinfo = "skip"
) %>%
# Add fielder trajectory (animated)
add_trace(
data = tracking,
x = ~x_position,
y = ~y_position,
frame = ~frame_time,
type = "scatter",
mode = "markers+lines",
marker = list(
color = "red",
size = 12,
symbol = "circle"
),
line = list(color = "red", width = 2),
text = ~frame_label,
hoverinfo = "text",
name = "Fielder Path"
) %>%
# Add ball position (animated)
add_trace(
data = tracking,
x = ~ball_x,
y = ~ball_y,
frame = ~frame_time,
type = "scatter",
mode = "markers",
marker = list(
color = "orange",
size = 10,
symbol = "star"
),
name = "Ball",
hoverinfo = "skip"
) %>%
layout(
title = list(
text = paste0("<b>", play_description, "</b><br>",
"<sub>Red = Fielder | Orange = Ball | Blue Dash = Optimal Route</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Horizontal Distance from Home (ft)</b>",
range = c(-250, 250),
zeroline = TRUE,
zerolinecolor = 'black'
),
yaxis = list(
title = "<b>Distance from Home (ft)</b>",
range = c(0, 450),
scaleanchor = "x",
scaleratio = 1
),
showlegend = TRUE,
legend = list(x = 1.02, y = 1)
) %>%
animation_opts(
frame = 100, # 100ms per frame
transition = 50,
redraw = FALSE
) %>%
animation_slider(
currentvalue = list(
prefix = "Time: ",
suffix = " sec",
font = list(size = 14)
)
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage with simulated tracking data
# The fielder covers about 58 ft in 3 s (roughly 19 ft/s), so the velocity
# columns are consistent with the simulated positions
# n_frames <- 30
# play_tracking <- data.frame(
# frame_time = seq(0, 3, length.out = n_frames),
# x_position = seq(150, 180, length.out = n_frames) + rnorm(n_frames, 0, 1),
# y_position = seq(280, 330, length.out = n_frames) + rnorm(n_frames, 0, 1),
# ball_x = seq(0, 180, length.out = n_frames),
# ball_y = seq(0, 330, length.out = n_frames),
# velocity_x = rep(10, n_frames),
# velocity_y = rep(50 / 3, n_frames)
# )
#
# play_viz <- create_animated_play_trajectory(play_tracking, "Byron Buxton Diving Catch")
# play_viz
Python Implementation:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
def create_animated_play_trajectory(play_tracking_data, play_description="Defensive Play"):
    """
    Create animated defensive play trajectory visualization.
    Parameters:
        play_tracking_data: DataFrame with frame-by-frame tracking data
            Required columns: frame_time, x_position, y_position, ball_x, ball_y,
            velocity_x, velocity_y
        play_description: Description for chart title
    Returns:
        Plotly figure object
    """
    # Prepare data
    tracking = play_tracking_data.sort_values('frame_time').copy()
    tracking['time_elapsed'] = tracking['frame_time'] - tracking['frame_time'].min()
    tracking['distance_from_ball'] = np.sqrt(
        (tracking['x_position'] - tracking['ball_x'])**2 +
        (tracking['y_position'] - tracking['ball_y'])**2
    )
    tracking['speed'] = np.sqrt(
        tracking['velocity_x']**2 + tracking['velocity_y']**2
    )
    # Create hover labels
    tracking['frame_label'] = tracking.apply(
        lambda row: f"Time: {row['time_elapsed']:.2f} sec<br>" +
                    f"Fielder Position: ({row['x_position']:.1f}, {row['y_position']:.1f})<br>" +
                    f"Distance to Ball: {row['distance_from_ball']:.1f} ft<br>" +
                    f"Speed: {row['speed']:.2f} ft/s",
        axis=1
    )
    # Calculate optimal route (straight line from the fielder's start to the
    # ball's final tracked position)
    start_x = tracking['x_position'].iloc[0]
    start_y = tracking['y_position'].iloc[0]
    end_x = tracking['ball_x'].iloc[-1]
    end_y = tracking['ball_y'].iloc[-1]
    optimal_x = np.linspace(start_x, end_x, 50)
    optimal_y = np.linspace(start_y, end_y, 50)
    # Create figure
    fig = go.Figure()
    # Add field outline
    fig.add_trace(go.Scatter(
        x=[-200, 200, 200, -200, -200],
        y=[0, 0, 400, 400, 0],
        mode='lines',
        line=dict(color='green', width=2),
        showlegend=False,
        hoverinfo='skip'
    ))
    # Add optimal route
    fig.add_trace(go.Scatter(
        x=optimal_x,
        y=optimal_y,
        mode='lines',
        line=dict(color='blue', width=2, dash='dash'),
        name='Optimal Route',
        hoverinfo='skip'
    ))
    # Create frames for animation; each frame lists its traces in the same
    # order as the figure's base traces (outline, route, fielder, ball)
    frames = []
    for i, frame_time in enumerate(tracking['frame_time'].unique()):
        frame_data = tracking[tracking['frame_time'] <= frame_time]
        frame_traces = [
            # Field outline (static)
            go.Scatter(
                x=[-200, 200, 200, -200, -200],
                y=[0, 0, 400, 400, 0],
                mode='lines',
                line=dict(color='green', width=2),
                showlegend=False,
                hoverinfo='skip'
            ),
            # Optimal route (static)
            go.Scatter(
                x=optimal_x,
                y=optimal_y,
                mode='lines',
                line=dict(color='blue', width=2, dash='dash'),
                name='Optimal Route',
                hoverinfo='skip',
                showlegend=(i == 0)
            ),
            # Fielder path
            go.Scatter(
                x=frame_data['x_position'],
                y=frame_data['y_position'],
                mode='markers+lines',
                marker=dict(color='red', size=12),
                line=dict(color='red', width=2),
                text=frame_data['frame_label'],
                hoverinfo='text',
                name='Fielder Path',
                showlegend=(i == 0)
            ),
            # Current ball position
            go.Scatter(
                x=[frame_data['ball_x'].iloc[-1]],
                y=[frame_data['ball_y'].iloc[-1]],
                mode='markers',
                marker=dict(color='orange', size=15, symbol='star'),
                name='Ball',
                hoverinfo='skip',
                showlegend=(i == 0)
            )
        ]
        frames.append(go.Frame(data=frame_traces, name=str(frame_time)))
    # Add initial frame data
    initial_data = tracking.iloc[:1]
    fig.add_trace(go.Scatter(
        x=initial_data['x_position'],
        y=initial_data['y_position'],
        mode='markers+lines',
        marker=dict(color='red', size=12),
        line=dict(color='red', width=2),
        text=initial_data['frame_label'],
        hoverinfo='text',
        name='Fielder Path'
    ))
    fig.add_trace(go.Scatter(
        x=initial_data['ball_x'],
        y=initial_data['ball_y'],
        mode='markers',
        marker=dict(color='orange', size=15, symbol='star'),
        name='Ball',
        hoverinfo='skip'
    ))
    fig.frames = frames
    # Update layout
    fig.update_layout(
        title=dict(
            text=f"<b>{play_description}</b><br>" +
                 "<sub>Red = Fielder | Orange = Ball | Blue Dash = Optimal Route</sub>",
            x=0.5,
            xanchor='center',
            font=dict(size=16)
        ),
        xaxis=dict(
            title="<b>Horizontal Distance from Home (ft)</b>",
            range=[-250, 250],
            zeroline=True,
            zerolinecolor='black',
            gridcolor='lightgray',
            showgrid=True
        ),
        yaxis=dict(
            title="<b>Distance from Home (ft)</b>",
            range=[0, 450],
            scaleanchor="x",
            scaleratio=1,
            gridcolor='lightgray',
            showgrid=True
        ),
        showlegend=True,
        legend=dict(x=1.02, y=1),
        updatemenus=[{
            'type': 'buttons',
            'showactive': False,
            'buttons': [
                {
                    'label': 'Play',
                    'method': 'animate',
                    'args': [None, {
                        'frame': {'duration': 100, 'redraw': True},
                        'fromcurrent': True,
                        'transition': {'duration': 50}
                    }]
                },
                {
                    'label': 'Pause',
                    'method': 'animate',
                    'args': [[None], {
                        'frame': {'duration': 0, 'redraw': False},
                        'mode': 'immediate',
                        'transition': {'duration': 0}
                    }]
                }
            ],
            'x': 0.1,
            'y': 0
        }],
        sliders=[{
            'active': 0,
            'steps': [
                {
                    'args': [[f.name], {
                        'frame': {'duration': 0, 'redraw': True},
                        'mode': 'immediate',
                        'transition': {'duration': 0}
                    }],
                    'label': f'{float(f.name):.2f}',
                    'method': 'animate'
                }
                for f in frames
            ],
            'currentvalue': {
                'prefix': 'Time: ',
                'suffix': ' sec',
                'font': {'size': 14}
            },
            'x': 0.1,
            'len': 0.9,
            'xanchor': 'left',
            'y': 0,
            'yanchor': 'top'
        }],
        width=1000,
        height=1000,
        template='plotly_white'
    )
    return fig
# Example usage with simulated tracking data
# The fielder covers about 58 ft in 3 s (roughly 19 ft/s), so the velocity
# columns are consistent with the simulated positions
# n_frames = 30
# play_tracking = pd.DataFrame({
#     'frame_time': np.linspace(0, 3, n_frames),
#     'x_position': np.linspace(150, 180, n_frames) + np.random.normal(0, 1, n_frames),
#     'y_position': np.linspace(280, 330, n_frames) + np.random.normal(0, 1, n_frames),
#     'ball_x': np.linspace(0, 180, n_frames),
#     'ball_y': np.linspace(0, 330, n_frames),
#     'velocity_x': np.repeat(10.0, n_frames),
#     'velocity_y': np.repeat(50 / 3, n_frames)
# })
#
# fig = create_animated_play_trajectory(play_tracking, "Byron Buxton Diving Catch")
# fig.show()
The animated play trajectory visualization shows what highlight-reel videos and OAA totals alone cannot. Coaches can identify exactly when a fielder hesitated, took a suboptimal angle, or compensated brilliantly for an initial misstep. Comparing the actual path with the optimal route (drawn here as a straight line from the fielder's starting position to the catch point) expresses route efficiency in intuitive visual terms, making technical feedback accessible to players regardless of their analytical background.
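The visual comparison of actual and optimal routes can also be reduced to a single number. The following is a minimal sketch, assuming the same `frame_time`, `x_position`, and `y_position` columns used by the trajectory function; the helper name `route_efficiency` and the definition (straight-line distance divided by distance actually covered, measured between the fielder's own first and last positions) are illustrative assumptions, not Statcast's official calculation.

```python
import numpy as np
import pandas as pd


def route_efficiency(tracking: pd.DataFrame) -> float:
    """Straight-line distance divided by distance covered (1.0 = perfectly direct)."""
    tracking = tracking.sort_values("frame_time")
    x = tracking["x_position"].to_numpy(dtype=float)
    y = tracking["y_position"].to_numpy(dtype=float)
    # Path length: sum of the distances between consecutive frames
    path_length = np.hypot(np.diff(x), np.diff(y)).sum()
    # Direct distance from the first to the last fielder position
    direct = np.hypot(x[-1] - x[0], y[-1] - y[0])
    if path_length == 0:
        return float("nan")  # Fielder never moved; efficiency is undefined
    return float(direct / path_length)


# Hypothetical three-frame route: 30 ft to the right, then 40 ft back
demo = pd.DataFrame({
    "frame_time": [0.0, 1.0, 2.0],
    "x_position": [0.0, 30.0, 30.0],
    "y_position": [0.0, 0.0, 40.0],
})
efficiency = route_efficiency(demo)
print(f"Route efficiency: {efficiency:.1%}")  # prints "Route efficiency: 71.4%"
```

In this toy route the fielder covers 70 ft to end up 50 ft from the starting point, so the efficiency is 50/70. With noisy tracking data, frame-to-frame jitter inflates the path length, so smoothing the positions first is probably advisable.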
These three interactive fielding visualizations—catch probability heat maps, sprint speed comparisons, and animated play trajectories—represent the frontier of defensive analysis. They preserve spatial and temporal richness while enabling exploration and filtering impossible with static charts. The combination of interactive tools and traditional defensive metrics creates a comprehensive analytical framework for understanding, evaluating, and improving defensive performance in modern baseball.
library(tidyverse)
library(plotly)
library(baseballr)
create_catch_probability_heatmap <- function(fielding_data, player_name = "Fielder",
position = "OF") {
# Filter and prepare data
plays <- fielding_data %>%
filter(!is.na(hc_x), !is.na(hc_y), !is.na(hit_distance_sc)) %>%
mutate(
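# Adjust coordinates for plotting (home plate at origin).
# hc_x/hc_y are in Statcast hit-coordinate units, so scale by ~2.5 to approximate feet
x_coord = 2.5 * (hc_x - 125.42), # Center on home plate
y_coord = 2.5 * (208.48 - hc_y), # Flip y-axis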
# Simplified catch probability based on distance
distance = hit_distance_sc,
expected_catch_prob = case_when(
distance < 50 ~ 0.98,
distance < 100 ~ 0.85,
distance < 150 ~ 0.65,
distance < 200 ~ 0.40,
distance < 250 ~ 0.20,
TRUE ~ 0.05
),
was_caught = ifelse(events %in% c("field_out", "double_play",
"sac_fly"), 1, 0),
outs_above_avg = was_caught - expected_catch_prob
)
# Bin plays into square 20 x 20 ft grid cells for aggregation
plays <- plays %>%
mutate(
x_bin = round(x_coord / 20) * 20,
y_bin = round(y_coord / 20) * 20,
hover_text = paste0(
"Location: (", round(x_coord, 0), ", ", round(y_coord, 0), ")<br>",
"Distance: ", round(distance, 0), " ft<br>",
"Expected: ", round(expected_catch_prob * 100, 0), "%<br>",
"Result: ", ifelse(was_caught == 1, "Out", "Hit")
)
)
# Aggregate by bins
binned_data <- plays %>%
group_by(x_bin, y_bin) %>%
summarize(
plays = n(),
catches = sum(was_caught),
expected_catches = sum(expected_catch_prob),
catch_rate = catches / plays,
oaa_zone = sum(outs_above_avg),
avg_distance = mean(distance),
.groups = "drop"
) %>%
filter(plays >= 3) %>% # Minimum plays per zone
mutate(
hover_info = paste0(
"<b>Zone Performance</b><br>",
"Plays: ", plays, "<br>",
"Catch Rate: ", round(catch_rate * 100, 0), "%<br>",
"Expected: ", round((expected_catches/plays) * 100, 0), "%<br>",
"OAA (zone): ", sprintf("%+.1f", oaa_zone), "<br>",
"Avg Distance: ", round(avg_distance, 0), " ft"
)
)
# Create interactive heatmap
p <- plot_ly(
data = binned_data,
x = ~x_bin,
y = ~y_bin,
z = ~oaa_zone,
type = "contour",
colorscale = list(
c(0, "rgb(215, 48, 39)"), # Red for negative OAA
c(0.5, "rgb(255, 255, 191)"), # Yellow for average
c(1, "rgb(44, 123, 182)") # Blue for positive OAA
),
colorbar = list(
title = "<b>Outs Above<br>Average</b>",
tickformat = "+.1f"
),
text = ~hover_info,
hoverinfo = "text",
contours = list(
showlabels = TRUE,
labelfont = list(size = 10, color = 'white')
)
) %>%
add_trace(
data = plays,
x = ~x_coord,
y = ~y_coord,
type = "scatter",
mode = "markers",
inherit = FALSE, # Do not inherit z, colorscale, or contours from plot_ly()
marker = list(
size = 4,
color = ~was_caught,
colorscale = list(c(0, "red"), c(1, "green")),
opacity = 0.3,
line = list(width = 0.5, color = 'black')
),
text = ~hover_text,
hoverinfo = "text",
showlegend = FALSE
) %>%
layout(
title = list(
text = paste0("<b>", player_name, " Catch Probability Heat Map</b><br>",
"<sub>", position, " - Red zones = below average, Blue = above average</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Horizontal Distance from Home Plate (ft)</b><br>← LF | CF | RF →",
range = c(-250, 250),
zeroline = TRUE,
zerolinewidth = 2,
zerolinecolor = 'black',
gridcolor = 'lightgray'
),
yaxis = list(
title = "<b>Distance from Home Plate (ft)</b>",
range = c(0, 450),
scaleanchor = "x",
scaleratio = 1,
gridcolor = 'lightgray'
),
hovermode = 'closest',
showlegend = FALSE,
margin = list(l = 80, r = 120, t = 100, b = 80)
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage
# fielding_data <- statcast_search(
# start_date = "2024-04-01",
# end_date = "2024-10-01",
# player_type = "batter"
# ) %>%
# filter(fielder_9 == player_id_of_interest) # fielder_9 = right fielder (fielder_2 is the catcher)
#
# heatmap <- create_catch_probability_heatmap(
# fielding_data,
# "Mookie Betts",
# "RF"
# )
# heatmap
library(plotly)
library(dplyr)
create_sprint_speed_comparison <- function(player_data) {
# Prepare data
sprint_data <- player_data %>%
filter(!is.na(sprint_speed)) %>%
mutate(
# Create position groups
position_group = case_when(
position %in% c("LF", "CF", "RF") ~ "Outfield",
position %in% c("SS", "2B", "3B") ~ "Infield",
position == "C" ~ "Catcher",
position == "1B" ~ "First Base",
TRUE ~ "Other"
),
# Calculate percentile
speed_percentile = percent_rank(sprint_speed) * 100,
hover_text = paste0(
"<b>", player_name, "</b><br>",
"Position: ", position, "<br>",
"Sprint Speed: ", round(sprint_speed, 2), " ft/s<br>",
"Percentile: ", round(speed_percentile, 0), "%<br>",
"Age: ", age, "<br>",
"SB: ", stolen_bases, " (", round(sb_success_rate * 100, 0), "% success)"
)
)
# Create interactive scatter plot
p <- plot_ly(
data = sprint_data,
x = ~age,
y = ~sprint_speed,
color = ~position_group,
colors = "Set2",
size = ~stolen_bases,
sizes = c(10, 100),
text = ~hover_text,
hoverinfo = "text",
type = "scatter",
mode = "markers",
marker = list(
opacity = 0.7,
line = list(width = 1, color = 'black')
)
) %>%
layout(
title = list(
text = paste0("<b>Sprint Speed by Age and Position</b><br>",
"<sub>Size = Stolen Bases | Hover for Details</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Age (years)</b>",
range = c(20, 42),
gridcolor = 'lightgray'
),
yaxis = list(
title = "<b>Sprint Speed (ft/s)</b>",
range = c(24, 32),
gridcolor = 'lightgray'
),
hovermode = 'closest',
showlegend = TRUE,
legend = list(
title = list(text = '<b>Position Group</b>'),
orientation = 'v',
x = 1.02,
y = 1
),
margin = list(l = 80, r = 150, t = 100, b = 80)
) %>%
add_lines(
data = sprint_data %>%
group_by(age) %>%
summarize(avg_speed = mean(sprint_speed, na.rm = TRUE)),
x = ~age,
y = ~avg_speed,
inherit = FALSE, # Summary data lacks position_group, stolen_bases, and hover_text
line = list(color = 'black', width = 2, dash = 'dash'),
name = 'Age Curve',
hoverinfo = 'skip',
showlegend = TRUE
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage with simulated data
# set.seed(123)
# player_data <- data.frame(
# player_name = paste("Player", 1:200),
# position = sample(c("CF", "RF", "LF", "SS", "2B", "3B", "1B", "C"),
# 200, replace = TRUE),
# age = sample(22:40, 200, replace = TRUE),
# sprint_speed = rnorm(200, 27.5, 2),
# stolen_bases = rpois(200, 10),
# sb_success_rate = rbeta(200, 7, 3)
# )
#
# sprint_chart <- create_sprint_speed_comparison(player_data)
# sprint_chart
library(plotly)
library(dplyr)
create_animated_play_trajectory <- function(play_tracking_data, play_description = "Defensive Play") {
# Prepare tracking data
# Expected format: frame-by-frame position data
tracking <- play_tracking_data %>%
arrange(frame_time) %>%
mutate(
time_elapsed = frame_time - min(frame_time),
distance_from_ball = sqrt((x_position - ball_x)^2 + (y_position - ball_y)^2),
speed = sqrt(velocity_x^2 + velocity_y^2),
frame_label = paste0(
"Time: ", round(time_elapsed, 2), " sec<br>",
"Fielder Position: (", round(x_position, 1), ", ", round(y_position, 1), ")<br>",
"Distance to Ball: ", round(distance_from_ball, 1), " ft<br>",
"Speed: ", round(speed, 2), " ft/s"
)
)
# Calculate optimal route (straight line from start to catch point)
start_x <- tracking$x_position[1]
start_y <- tracking$y_position[1]
end_x <- tracking$ball_x[nrow(tracking)]
end_y <- tracking$ball_y[nrow(tracking)]
optimal_route <- data.frame(
x = seq(start_x, end_x, length.out = 50),
y = seq(start_y, end_y, length.out = 50)
)
# Create animated plot
p <- plot_ly() %>%
# Add field outline (simplified)
add_trace(
type = "scatter",
mode = "lines",
x = c(-200, 200, 200, -200, -200),
y = c(0, 0, 400, 400, 0),
line = list(color = "green", width = 2),
showlegend = FALSE,
hoverinfo = "skip"
) %>%
# Add optimal route
add_trace(
data = optimal_route,
x = ~x,
y = ~y,
type = "scatter",
mode = "lines",
line = list(color = "blue", width = 2, dash = "dash"),
name = "Optimal Route",
hoverinfo = "skip"
) %>%
# Add fielder trajectory (animated)
add_trace(
data = tracking,
x = ~x_position,
y = ~y_position,
frame = ~frame_time,
type = "scatter",
mode = "markers+lines",
marker = list(
color = "red",
size = 12,
symbol = "circle"
),
line = list(color = "red", width = 2),
text = ~frame_label,
hoverinfo = "text",
name = "Fielder Path"
) %>%
# Add ball position (animated)
add_trace(
data = tracking,
x = ~ball_x,
y = ~ball_y,
frame = ~frame_time,
type = "scatter",
mode = "markers",
marker = list(
color = "orange",
size = 10,
symbol = "star"
),
name = "Ball",
hoverinfo = "skip"
) %>%
layout(
title = list(
text = paste0("<b>", play_description, "</b><br>",
"<sub>Red = Fielder | Orange = Ball | Blue Dash = Optimal Route</sub>"),
font = list(size = 16)
),
xaxis = list(
title = "<b>Distance from Home (ft)</b>",
range = c(-250, 250),
zeroline = TRUE,
zerolinecolor = 'black'
),
yaxis = list(
title = "<b>Distance from Home (ft)</b>",
range = c(0, 450),
scaleanchor = "x",
scaleratio = 1
),
showlegend = TRUE,
legend = list(x = 1.02, y = 1)
) %>%
animation_opts(
frame = 100, # 100ms per frame
transition = 50,
redraw = FALSE
) %>%
animation_slider(
currentvalue = list(
prefix = "Time: ",
suffix = " sec",
font = list(size = 14)
)
) %>%
config(displayModeBar = TRUE, displaylogo = FALSE)
return(p)
}
# Example usage with simulated tracking data
# n_frames <- 30
# play_tracking <- data.frame(
# frame_time = seq(0, 3, length.out = n_frames),
# x_position = seq(50, 180, length.out = n_frames) + rnorm(n_frames, 0, 5),
# y_position = seq(100, 250, length.out = n_frames) + rnorm(n_frames, 0, 5),
# ball_x = rep(180, n_frames),
# ball_y = seq(0, 250, length.out = n_frames),
# velocity_x = c(rep(4, n_frames)),
# velocity_y = c(rep(5, n_frames))
# )
#
# play_viz <- create_animated_play_trajectory(play_tracking, "Byron Buxton Diving Catch")
# play_viz
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pybaseball import statcast
def create_catch_probability_heatmap(fielding_data, player_name="Fielder",
position="OF"):
"""
Create interactive catch probability heat map.
Parameters:
fielding_data: DataFrame with Statcast batted ball data
player_name: Name for chart title
position: Fielder position
Returns:
Plotly figure object
"""
# Filter and prepare data
plays = fielding_data[
fielding_data['hc_x'].notna() &
fielding_data['hc_y'].notna() &
fielding_data['hit_distance_sc'].notna()
].copy()
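# Adjust coordinates (home plate at origin).
# hc_x/hc_y are in Statcast hit-coordinate units, so scale by ~2.5 to approximate feet
plays['x_coord'] = 2.5 * (plays['hc_x'] - 125.42)
plays['y_coord'] = 2.5 * (208.48 - plays['hc_y'])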
plays['distance'] = plays['hit_distance_sc']
# Simplified catch probability model
def calc_expected_catch(distance):
if distance < 50:
return 0.98
elif distance < 100:
return 0.85
elif distance < 150:
return 0.65
elif distance < 200:
return 0.40
elif distance < 250:
return 0.20
else:
return 0.05
plays['expected_catch_prob'] = plays['distance'].apply(calc_expected_catch)
# Determine outcomes
caught_events = ['field_out', 'double_play', 'sac_fly']
plays['was_caught'] = plays['events'].isin(caught_events).astype(int)
plays['outs_above_avg'] = plays['was_caught'] - plays['expected_catch_prob']
# Create hover text for individual plays
plays['hover_text'] = plays.apply(
lambda row: f"Location: ({row['x_coord']:.0f}, {row['y_coord']:.0f})<br>" +
f"Distance: {row['distance']:.0f} ft<br>" +
f"Expected: {row['expected_catch_prob']*100:.0f}%<br>" +
f"Result: {'Out' if row['was_caught'] else 'Hit'}",
axis=1
)
# Create grid bins (20x20 feet)
plays['x_bin'] = (plays['x_coord'] / 20).round() * 20
plays['y_bin'] = (plays['y_coord'] / 20).round() * 20
# Aggregate by bins
binned_data = plays.groupby(['x_bin', 'y_bin']).agg({
'was_caught': ['count', 'sum'],
'expected_catch_prob': 'sum',
'outs_above_avg': 'sum',
'distance': 'mean'
}).reset_index()
binned_data.columns = ['x_bin', 'y_bin', 'plays', 'catches',
'expected_catches', 'oaa_zone', 'avg_distance']
# Filter minimum plays
binned_data = binned_data[binned_data['plays'] >= 3]
binned_data['catch_rate'] = binned_data['catches'] / binned_data['plays']
binned_data['expected_rate'] = binned_data['expected_catches'] / binned_data['plays']
# Create hover info
binned_data['hover_info'] = binned_data.apply(
lambda row: f"<b>Zone Performance</b><br>" +
f"Plays: {row['plays']:.0f}<br>" +
f"Catch Rate: {row['catch_rate']*100:.0f}%<br>" +
f"Expected: {row['expected_rate']*100:.0f}%<br>" +
f"OAA (zone): {row['oaa_zone']:+.1f}<br>" +
f"Avg Distance: {row['avg_distance']:.0f} ft",
axis=1
)
# Create figure
fig = go.Figure()
# Add contour heat map
# Create grid for contour
x_range = np.arange(binned_data['x_bin'].min(), binned_data['x_bin'].max() + 20, 20)
y_range = np.arange(binned_data['y_bin'].min(), binned_data['y_bin'].max() + 20, 20)
# Create z-values matrix
z_matrix = np.full((len(y_range), len(x_range)), np.nan)
for _, row in binned_data.iterrows():
x_idx = np.where(x_range == row['x_bin'])[0]
y_idx = np.where(y_range == row['y_bin'])[0]
if len(x_idx) > 0 and len(y_idx) > 0:
z_matrix[y_idx[0], x_idx[0]] = row['oaa_zone']
fig.add_trace(go.Contour(
x=x_range,
y=y_range,
z=z_matrix,
colorscale=[
[0, 'rgb(215, 48, 39)'], # Red for negative
[0.5, 'rgb(255, 255, 191)'], # Yellow for average
[1, 'rgb(44, 123, 182)'] # Blue for positive
],
colorbar=dict(
title="<b>Outs Above<br>Average</b>",
tickformat="+.1f"
),
hoverinfo='skip',
contours=dict(
showlabels=True,
labelfont=dict(size=10, color='white')
)
))
# Add scatter points for individual plays
fig.add_trace(go.Scatter(
x=plays['x_coord'],
y=plays['y_coord'],
mode='markers',
marker=dict(
size=4,
color=plays['was_caught'],
colorscale=[[0, 'red'], [1, 'green']],
opacity=0.3,
line=dict(width=0.5, color='black'),
showscale=False
),
text=plays['hover_text'],
hoverinfo='text',
showlegend=False
))
# Update layout
fig.update_layout(
title=dict(
text=f"<b>{player_name} Catch Probability Heat Map</b><br>" +
f"<sub>{position} - Red zones = below average, Blue = above average</sub>",
x=0.5,
xanchor='center',
font=dict(size=16)
),
xaxis=dict(
title="<b>Horizontal Distance from Home Plate (ft)</b><br>← LF | CF | RF →",
range=[-250, 250],
zeroline=True,
zerolinewidth=2,
zerolinecolor='black',
gridcolor='lightgray',
showgrid=True
),
yaxis=dict(
title="<b>Distance from Home Plate (ft)</b>",
range=[0, 450],
scaleanchor="x",
scaleratio=1,
gridcolor='lightgray',
showgrid=True
),
hovermode='closest',
showlegend=False,
width=1000,
height=1000,
margin=dict(l=80, r=150, t=100, b=80),
template='plotly_white'
)
return fig
# Example usage
# fielding_data = statcast(start_dt='2024-04-01', end_dt='2024-10-01')
# fielding_data = fielding_data[fielding_data['fielder_9'] == player_id]  # fielder_9 = right fielder (fielder_2 is the catcher)
# fig = create_catch_probability_heatmap(fielding_data, "Mookie Betts", "RF")
# fig.show()
import plotly.graph_objects as go
import pandas as pd
import numpy as np
def create_sprint_speed_comparison(player_data):
"""
Create interactive sprint speed comparison chart.
Parameters:
player_data: DataFrame with player sprint speed and performance data
Returns:
Plotly figure object
"""
# Prepare data
sprint_data = player_data[player_data['sprint_speed'].notna()].copy()
# Create position groups
def position_group(pos):
if pos in ['LF', 'CF', 'RF']:
return 'Outfield'
elif pos in ['SS', '2B', '3B']:
return 'Infield'
elif pos == 'C':
return 'Catcher'
elif pos == '1B':
return 'First Base'
else:
return 'Other'
sprint_data['position_group'] = sprint_data['position'].apply(position_group)
# Calculate percentiles
sprint_data['speed_percentile'] = sprint_data['sprint_speed'].rank(pct=True) * 100
# Create hover text
sprint_data['hover_text'] = sprint_data.apply(
lambda row: f"<b>{row['player_name']}</b><br>" +
f"Position: {row['position']}<br>" +
f"Sprint Speed: {row['sprint_speed']:.2f} ft/s<br>" +
f"Percentile: {row['speed_percentile']:.0f}%<br>" +
f"Age: {row['age']}<br>" +
f"SB: {row['stolen_bases']} " +
f"({row['sb_success_rate']*100:.0f}% success)",
axis=1
)
# Create figure
fig = go.Figure()
# Color mapping for position groups
color_map = {
'Outfield': '#1f77b4',
'Infield': '#ff7f0e',
'Catcher': '#2ca02c',
'First Base': '#d62728',
'Other': '#9467bd'
}
# Add scatter for each position group
for pos_group in sprint_data['position_group'].unique():
group_data = sprint_data[sprint_data['position_group'] == pos_group]
fig.add_trace(go.Scatter(
x=group_data['age'],
y=group_data['sprint_speed'],
mode='markers',
name=pos_group,
text=group_data['hover_text'],
hoverinfo='text',
marker=dict(
size=group_data['stolen_bases'].clip(lower=2, upper=30), # Floor keeps 0-SB players visible; cap limits marker size
sizemode='diameter',
sizeref=0.5,
color=color_map.get(pos_group, '#7f7f7f'),
opacity=0.7,
line=dict(width=1, color='black')
)
))
# Add age curve trend line
age_curve = sprint_data.groupby('age')['sprint_speed'].mean().reset_index()
fig.add_trace(go.Scatter(
x=age_curve['age'],
y=age_curve['sprint_speed'],
mode='lines',
name='Age Curve',
line=dict(color='black', width=2, dash='dash'),
hoverinfo='skip'
))
# Update layout
fig.update_layout(
title=dict(
text="<b>Sprint Speed by Age and Position</b><br>" +
"<sub>Size = Stolen Bases | Hover for Details</sub>",
x=0.5,
xanchor='center',
font=dict(size=16)
),
xaxis=dict(
title="<b>Age (years)</b>",
range=[20, 42],
gridcolor='lightgray',
showgrid=True
),
yaxis=dict(
title="<b>Sprint Speed (ft/s)</b>",
range=[24, 32],
gridcolor='lightgray',
showgrid=True
),
hovermode='closest',
showlegend=True,
legend=dict(
title=dict(text='<b>Position Group</b>'),
orientation='v',
x=1.02,
y=1
),
width=1100,
height=700,
margin=dict(l=80, r=150, t=100, b=80),
template='plotly_white'
)
return fig
# Example usage with simulated data
# np.random.seed(123)
# player_data = pd.DataFrame({
# 'player_name': [f'Player {i}' for i in range(1, 201)],
# 'position': np.random.choice(['CF', 'RF', 'LF', 'SS', '2B', '3B', '1B', 'C'],
# 200),
# 'age': np.random.randint(22, 41, 200),
# 'sprint_speed': np.random.normal(27.5, 2, 200),
# 'stolen_bases': np.random.poisson(10, 200),
# 'sb_success_rate': np.random.beta(7, 3, 200)
# })
#
# fig = create_sprint_speed_comparison(player_data)
# fig.show()
import plotly.graph_objects as go
import pandas as pd
import numpy as np
def create_animated_play_trajectory(play_tracking_data, play_description="Defensive Play"):
"""
Create animated defensive play trajectory visualization.
Parameters:
play_tracking_data: DataFrame with frame-by-frame tracking data
Required columns: frame_time, x_position, y_position, ball_x, ball_y,
velocity_x, velocity_y
play_description: Description for chart title
Returns:
Plotly figure object
"""
# Prepare data
tracking = play_tracking_data.sort_values('frame_time').copy()
tracking['time_elapsed'] = tracking['frame_time'] - tracking['frame_time'].min()
tracking['distance_from_ball'] = np.sqrt(
(tracking['x_position'] - tracking['ball_x'])**2 +
(tracking['y_position'] - tracking['ball_y'])**2
)
tracking['speed'] = np.sqrt(
tracking['velocity_x']**2 + tracking['velocity_y']**2
)
# Create hover labels
tracking['frame_label'] = tracking.apply(
lambda row: f"Time: {row['time_elapsed']:.2f} sec<br>" +
f"Fielder Position: ({row['x_position']:.1f}, {row['y_position']:.1f})<br>" +
f"Distance to Ball: {row['distance_from_ball']:.1f} ft<br>" +
f"Speed: {row['speed']:.2f} ft/s",
axis=1
)
# Calculate optimal route
start_x = tracking['x_position'].iloc[0]
start_y = tracking['y_position'].iloc[0]
end_x = tracking['ball_x'].iloc[-1]
end_y = tracking['ball_y'].iloc[-1]
optimal_x = np.linspace(start_x, end_x, 50)
optimal_y = np.linspace(start_y, end_y, 50)
# Create figure
fig = go.Figure()
# Add field outline
fig.add_trace(go.Scatter(
x=[-200, 200, 200, -200, -200],
y=[0, 0, 400, 400, 0],
mode='lines',
line=dict(color='green', width=2),
showlegend=False,
hoverinfo='skip'
))
# Add optimal route
fig.add_trace(go.Scatter(
x=optimal_x,
y=optimal_y,
mode='lines',
line=dict(color='blue', width=2, dash='dash'),
name='Optimal Route',
hoverinfo='skip'
))
# Create frames for animation
frames = []
for i, frame_time in enumerate(tracking['frame_time'].unique()):
frame_data = tracking[tracking['frame_time'] <= frame_time]
frame_traces = [
# Field outline (static)
go.Scatter(
x=[-200, 200, 200, -200, -200],
y=[0, 0, 400, 400, 0],
mode='lines',
line=dict(color='green', width=2),
showlegend=False,
hoverinfo='skip'
),
# Optimal route (static)
go.Scatter(
x=optimal_x,
y=optimal_y,
mode='lines',
line=dict(color='blue', width=2, dash='dash'),
name='Optimal Route',
hoverinfo='skip',
showlegend=(i == 0)
),
# Fielder path
go.Scatter(
x=frame_data['x_position'],
y=frame_data['y_position'],
mode='markers+lines',
marker=dict(color='red', size=12),
line=dict(color='red', width=2),
text=frame_data['frame_label'],
hoverinfo='text',
name='Fielder Path',
showlegend=(i == 0)
),
# Current ball position
go.Scatter(
x=[frame_data['ball_x'].iloc[-1]],
y=[frame_data['ball_y'].iloc[-1]],
mode='markers',
marker=dict(color='orange', size=15, symbol='star'),
name='Ball',
hoverinfo='skip',
showlegend=(i == 0)
)
]
frames.append(go.Frame(data=frame_traces, name=str(frame_time)))
# Add initial frame data
initial_data = tracking.iloc[:1]
fig.add_trace(go.Scatter(
x=initial_data['x_position'],
y=initial_data['y_position'],
mode='markers+lines',
marker=dict(color='red', size=12),
line=dict(color='red', width=2),
text=initial_data['frame_label'],
hoverinfo='text',
name='Fielder Path'
))
fig.add_trace(go.Scatter(
x=initial_data['ball_x'],
y=initial_data['ball_y'],
mode='markers',
marker=dict(color='orange', size=15, symbol='star'),
name='Ball',
hoverinfo='skip'
))
fig.frames = frames
# Update layout
fig.update_layout(
title=dict(
text=f"<b>{play_description}</b><br>" +
"<sub>Red = Fielder | Orange = Ball | Blue Dash = Optimal Route</sub>",
x=0.5,
xanchor='center',
font=dict(size=16)
),
xaxis=dict(
title="<b>Distance from Home (ft)</b>",
range=[-250, 250],
zeroline=True,
zerolinecolor='black',
gridcolor='lightgray',
showgrid=True
),
yaxis=dict(
title="<b>Distance from Home (ft)</b>",
range=[0, 450],
scaleanchor="x",
scaleratio=1,
gridcolor='lightgray',
showgrid=True
),
showlegend=True,
legend=dict(x=1.02, y=1),
updatemenus=[{
'type': 'buttons',
'showactive': False,
'buttons': [
{
'label': 'Play',
'method': 'animate',
'args': [None, {
'frame': {'duration': 100, 'redraw': True},
'fromcurrent': True,
'transition': {'duration': 50}
}]
},
{
'label': 'Pause',
'method': 'animate',
'args': [[None], {
'frame': {'duration': 0, 'redraw': False},
'mode': 'immediate',
'transition': {'duration': 0}
}]
}
],
'x': 0.1,
'y': 0
}],
sliders=[{
'active': 0,
'steps': [
{
'args': [[f.name], {
'frame': {'duration': 0, 'redraw': True},
'mode': 'immediate',
'transition': {'duration': 0}
}],
'label': f'{float(f.name):.2f}',
'method': 'animate'
}
for f in frames
],
'currentvalue': {
'prefix': 'Time: ',
'suffix': ' sec',
'font': {'size': 14}
},
'x': 0.1,
'len': 0.9,
'xanchor': 'left',
'y': 0,
'yanchor': 'top'
}],
width=1000,
height=1000,
template='plotly_white'
)
return fig
# Example usage with simulated tracking data
# n_frames = 30
# play_tracking = pd.DataFrame({
# 'frame_time': np.linspace(0, 3, n_frames),
# 'x_position': np.linspace(50, 180, n_frames) + np.random.normal(0, 5, n_frames),
# 'y_position': np.linspace(100, 250, n_frames) + np.random.normal(0, 5, n_frames),
# 'ball_x': np.repeat(180, n_frames),
# 'ball_y': np.linspace(0, 250, n_frames),
# 'velocity_x': np.repeat(4, n_frames),
# 'velocity_y': np.repeat(5, n_frames)
# })
#
# fig = create_animated_play_trajectory(play_tracking, "Byron Buxton Diving Catch")
# fig.show()
Exercise 8.1: Calculating Simple OAA
Using Statcast data for a single month, calculate a simplified version of OAA for outfielders:
- Get batted ball data for one month (July 2024 recommended)
- Filter for outfield fly balls and line drives
- Calculate catch probability based on distance and hang time (use the simplified model from Section 8.2.2)
- Determine actual outcomes (caught or not)
- Calculate OAA for each outfielder as sum of (actual - expected)
- Identify the top 5 and bottom 5 defenders
Bonus: Compare your simplified OAA to Baseball Savant's official OAA for the same players and time period. How close are they?
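As a starting point, the aggregation step can be sketched on a handful of made-up plays. This assumes each play already carries a catch probability and an outcome, reusing the `expected_catch_prob`, `was_caught`, and `outs_above_avg` column names from the heat map code; `fielder_id` and every value below are hypothetical, not real Statcast data.

```python
import pandas as pd

# Hypothetical fly balls: fielder ID, modeled catch probability, and outcome
plays = pd.DataFrame({
    "fielder_id": [101, 101, 101, 202, 202, 202],
    "expected_catch_prob": [0.90, 0.40, 0.10, 0.90, 0.40, 0.10],
    "was_caught": [1, 1, 0, 1, 0, 0],
})

# Each play contributes (actual - expected); OAA is the per-fielder sum
plays["outs_above_avg"] = plays["was_caught"] - plays["expected_catch_prob"]
oaa = (
    plays.groupby("fielder_id")["outs_above_avg"]
    .sum()
    .sort_values(ascending=False)
)
print(oaa.round(2))
```

Both fielders see identical chances here, so the whole gap (+0.6 versus -0.4) comes from the one 40% play that fielder 101 caught and fielder 202 did not. With real data, `head(5)` and `tail(5)` on the sorted result give the top and bottom defenders.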
Exercise 8.2: Shift Impact Analysis
Replicate the shift ban analysis from section 8.7.2 using real data:
- Get Statcast data for ground balls hit by left-handed batters for:
  - May 2022 (shifts allowed)
  - May 2023 (shifts banned)
- Calculate ground ball hit rates for each month
- Perform a statistical test for the difference
- Create a visualization comparing the two periods
- Calculate how many extra hits occurred in 2023 vs expected based on 2022 rates
Challenge: Identify which individual players benefited most from the shift ban by comparing their 2022 vs 2023 ground ball BABIP.
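One reasonable choice for the statistical test is a two-proportion z-test, sketched below with the standard library only. The function name `two_proportion_z_test` is illustrative, and the counts are placeholders chosen for easy arithmetic, not real 2022 or 2023 totals.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Return (z, two-sided p-value) for H0: the two hit rates are equal."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Placeholder counts (hits, ground balls), NOT real Statcast totals
hits_2022, gb_2022 = 2300, 10000
hits_2023, gb_2023 = 2450, 10000

z, p = two_proportion_z_test(hits_2022, gb_2022, hits_2023, gb_2023)

# Extra hits: observed 2023 hits minus what the 2022 rate would predict
extra_hits = hits_2023 - gb_2023 * (hits_2022 / gb_2022)
print(f"z = {z:.2f}, p = {p:.4f}, extra hits vs. 2022 rate = {extra_hits:.0f}")
```

The z-test treats every ground ball as independent, which is only approximately true when the same batters appear many times; a chi-square test or a bootstrap over batters would be alternative choices.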
Exercise 8.3: Sprint Speed and Stolen Base Efficiency
Analyze the relationship between sprint speed and stolen base success:
- Get sprint speed data for all qualified players (2024)
- Get stolen base attempts and success rates
- Calculate stolen base success rate for players with 10+ attempts
- Create a scatter plot of sprint speed vs SB success rate
- Fit a regression model and interpret the relationship
- Identify players who over/under-perform their expected SB rate based on speed
Question: What sprint speed corresponds to 75% stolen base success (break-even point)?
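The regression and break-even steps might look like the sketch below, which fits a straight line with `numpy.polyfit` and inverts it to find the speed at a 75% success rate. The five data points are invented and perfectly linear for clarity; real data will be noisy, and because success rate is bounded between 0 and 1, a logistic model may be the better fit.

```python
import numpy as np

# Hypothetical players: sprint speed (ft/s) and stolen base success rate
sprint_speed = np.array([26.0, 27.0, 28.0, 29.0, 30.0])
sb_success = np.array([0.62, 0.68, 0.74, 0.80, 0.86])

# Ordinary least squares line: success = intercept + slope * speed
slope, intercept = np.polyfit(sprint_speed, sb_success, deg=1)

# Invert the fitted line to find the speed at a 75% success rate
target = 0.75
speed_at_target = (target - intercept) / slope

# Residuals flag runners who beat (positive) or trail (negative) their speed
residuals = sb_success - (intercept + slope * sprint_speed)
print(f"slope = {slope:.3f} per ft/s, speed at 75% = {speed_at_target:.2f} ft/s")
```

Sorting the residuals on real data answers the over/under-performance step directly: the largest positive residuals belong to runners who steal more successfully than their speed alone predicts.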
Exercise 8.4: Defensive Value Comparison
Compare defensive metrics across different systems:
- Select 20 players across multiple positions (2024 season)
- Collect their OAA, UZR, and DRS values
- Standardize all metrics to the same scale (z-scores)
- Calculate correlation between metrics
- Identify players where metrics disagree significantly
- Create a visualization showing agreement/disagreement
Question: For which positions do the metrics agree most? Where do they diverge most? Why might this be?
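The standardization and comparison steps can be sketched with pandas as follows. The player labels and metric values are hypothetical placeholders rather than real season totals, and the "range of z-scores" disagreement measure is just one simple option.

```python
import pandas as pd

# Hypothetical values for five players, NOT real season totals
metrics = pd.DataFrame(
    {
        "OAA": [12, 5, 0, -4, -9],
        "UZR": [8.5, 6.0, -1.0, 1.5, -7.0],
        "DRS": [15, 2, 1, -6, -10],
    },
    index=["Player A", "Player B", "Player C", "Player D", "Player E"],
)

# Z-score each metric so all three share a common scale
z_scores = (metrics - metrics.mean()) / metrics.std(ddof=0)

# Pairwise Pearson correlations between the three systems
correlations = z_scores.corr()

# Range of z-scores per player: a large range marks disagreement
disagreement = z_scores.max(axis=1) - z_scores.min(axis=1)
print(correlations.round(2))
print(disagreement.sort_values(ascending=False).round(2))
```

With 20 real players, adding a position column and grouping the disagreement measure by it is a direct way to approach the question about where the systems diverge.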
You've now completed your introduction to fielding and baserunning analytics. You understand why defense is challenging to measure, how modern metrics like OAA work, the value of positioning and shifts, and how to evaluate baserunning contribution. Defense and baserunning combined can account for 2-3 WAR per season for elite players—real, measurable value that traditional statistics completely missed.
The Statcast revolution has transformed defensive evaluation from subjective ("he looks good") to objective ("he converted 73% of chances that an average fielder converts 65% of the time"). We can now properly credit players like Kevin Kiermaier, Yadier Molina, and Matt Chapman for defensive excellence that conventional statistics failed to capture.
In Chapter 9, we'll turn to win probability and leveraged situations, understanding how context affects player and managerial decisions. The technical skills you've developed throughout this book will combine to help you evaluate complete players—offense, defense, baserunning, and situational performance—using the full arsenal of modern analytics.