Chapter 10: Building Your Own Metrics

Every established baseball metric—from WAR to wOBA to Barrel%—began with someone asking a question that existing statistics couldn't adequately answer. Baseball's analytical frontier constantly pushes forward not through passive consumption of established metrics, but through creation of new ones. Some invented metrics gain widespread adoption (Voros McCracken's DIPS, Tom Tango's wOBA), while others remain specialized tools for specific questions. The difference often lies not just in the quality of the underlying idea, but in how rigorously the metric is designed, validated, and communicated.


10.1 The Philosophy of Metric Design

Before writing a single line of code, successful metric creators understand why they're building something new and what separates good metrics from bad ones. Let's examine both questions.

10.1.1 Why Create New Metrics? {#why-create-metrics}

With hundreds of baseball metrics already available through FanGraphs, Baseball Savant, and Baseball Prospectus, why build more? Several compelling reasons justify new metric creation:

Existing metrics don't answer your specific question. Perhaps you're analyzing catcher framing but find that existing framing metrics don't account for umpire tendencies or game situations you consider important. Or you're evaluating hitters' two-strike approaches, but available plate discipline metrics aggregate all counts together. When your analytical question has unique requirements, custom metrics become necessary.

You want to combine existing concepts in novel ways. Most breakthrough metrics synthesize existing ideas differently. On-base percentage and slugging percentage existed for decades before Pete Palmer and John Thorn popularized adding them together as OPS, and before Tom Tango's wOBA replaced that rough sum with proper linear weights. Sometimes the innovation isn't new data but new synthesis.

New data sources enable new measurements. When Statcast launched in 2015, it immediately enabled metrics impossible with traditional data: sprint speed, catch probability, expected statistics based on exit velocity and launch angle. As tracking technology improves—higher frame rates, better pitch shape measurements, fielder positioning data—new metrics become possible. Being early to exploit new data sources can produce insights that persist even after others catch up.

You want to test analytical hypotheses. Sometimes metrics serve as research tools rather than permanent additions to the analytical toolkit. You might create a metric measuring pitcher deception to test whether deceptive deliveries correlate with strikeout rates. Even if your metric never gains wide adoption, it can answer your research question.

Educational and portfolio purposes. For those building careers in baseball analytics, demonstrating ability to design, implement, and validate original metrics shows both technical skills and baseball understanding. A well-conceived custom metric in your portfolio can distinguish you from other job candidates.

10.1.2 Principles of Good Metrics {#good-metrics}

Not all metrics are created equal. The best metrics share five essential characteristics:

1. Clear conceptual foundation. Every good metric answers a specific, well-defined question. "How well does this hitter make contact?" is clear. "How good is this hitter?" is too vague—good at what? Against whom? In what situations? Before designing a formula, articulate exactly what phenomenon you're measuring. If you can't explain your metric's purpose in two sentences, you probably don't understand it well enough yet.

2. Reproducible from public information. Proprietary metrics that depend on private data or undisclosed formulas have limited value in public discourse. While teams appropriately keep their internal metrics confidential, public metrics should be fully transparent. Anyone with the same data should be able to recalculate your metric and verify your results. This means documenting your methodology completely, publishing your code, and using publicly accessible data sources.

Reproducibility builds trust and enables others to identify errors, suggest improvements, and extend your work. The most influential public metrics—FIP, wOBA, xwOBA—all publish complete methodologies.

3. Meaningful and informative. Good metrics capture real phenomena that matter for baseball outcomes. They should show appropriate variation (not every player rated identically), correlate sensibly with related metrics, and explain something about performance that isn't obvious from simpler statistics.

Test meaningfulness by examining extreme cases. Do the players ranked highest by your metric pass the "eye test" as genuinely excellent at whatever you're measuring? Do the lowest-ranked players make sense? If your plate discipline metric rates free-swingers like Vladimir Guerrero Sr. highly and patient hitters like Barry Bonds poorly, something's wrong.

4. Actionable. The most valuable metrics inform decisions: lineup construction, defensive positioning, player development priorities, free agent signings. A metric that merely describes what happened without predictive or prescriptive value has limited utility. Consider whether your metric helps answer "what should we do?" questions.

However, not every metric must be directly actionable. Descriptive metrics that deepen understanding—like hit probability based on launch conditions—have value even if they don't immediately suggest interventions. But prioritize metrics that bridge description and action.

5. Stable and reliable. Metrics should primarily measure signal rather than noise. A metric that fluctuates wildly from week to week due to small sample randomness isn't reliable enough for decision-making. Good metrics stabilize relatively quickly (in statistical terms, they have high signal-to-noise ratios) so you can trust estimates based on reasonable sample sizes.

Test stability by examining split-half correlations: divide a season in half and correlate each player's first-half and second-half metric values. Strong correlations (0.5+) suggest your metric captures something real about player ability rather than random variation.
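
A minimal sketch of that check, assuming you have already computed your metric separately for each half of the season (the tiny metric_by_half tibble below is placeholder data, not real results):

library(tidyverse)

# Placeholder data: one row per player, the metric computed for each half-season
metric_by_half <- tibble(
  player      = c("A", "B", "C", "D", "E", "F"),
  first_half  = c(104, 96, 112, 99, 101, 93),
  second_half = c(101, 98, 108, 97, 103, 95)
)

# Split-half reliability: correlate first-half and second-half values
split_half_r <- cor(metric_by_half$first_half,
                    metric_by_half$second_half,
                    use = "complete.obs")
split_half_r  # values around 0.5 or higher suggest a stable skill, not noise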

These five principles—clarity, reproducibility, meaningfulness, actionability, and stability—should guide every decision in the metric creation process. When facing design choices, ask which option better serves these principles.



10.2 The Metric Creation Framework

Creating effective metrics benefits from a systematic process. While creativity and intuition matter, following a structured framework reduces mistakes and improves outcomes. Here's a six-step approach that works across different metric types.

Step 1: Define the Question {#step-define}

Every metric begins with a question. Don't start by deciding you want to create a metric and then searching for a question it might answer—start with an analytical need.

Good questions are specific, baseball-relevant, and not already well-answered by existing metrics:


  • "Which pitchers best limit hard contact, independent of defense?"

  • "How effectively do hitters exploit pitcher mistakes in the middle of the zone?"

  • "Which catchers provide the most value through game-calling and pitcher management?"

Weak questions are vague, already well-answered, or measure unmeasurable concepts:


  • "Who are the best players?" (Too vague; best at what?)

  • "Which hitters create the most runs?" (Well answered by wRC, wRC+, wRAA)

  • "Who has the most heart and determination?" (Unmeasurable from available data)

Write your question down explicitly. You'll return to it repeatedly to ensure your metric development stays focused.

Step 2: Conceptualize Components {#step-conceptualize}

Most metrics combine multiple components. Breaking your question into measurable parts makes implementation manageable and reveals what data you'll need.

For example, if your question is "How well do hitters perform in two-strike counts?", relevant components might include:


  • Contact rate on two-strike pitches

  • Quality of contact (exit velocity, launch angle) when contact occurs

  • Swing decisions (chase rate, zone swing rate)

  • Outcome rates (strikeout rate, hit rate, extra-base hit rate)

List potential components, then evaluate each:


  • Is it measurable with available data? If not, can you proxy it with something else?

  • Does it capture signal or noise? Components that stabilize quickly are better.

  • Is it independent of other components or redundant? Including both batting average and on-base percentage when you're already using wOBA creates redundancy.

This conceptual work often reveals whether your metric is feasible before you write any code.

Step 3: Gather and Prepare Data {#step-gather}

With components identified, acquire the necessary data. For most modern metrics, this means Statcast data from Baseball Savant, leaderboard data from FanGraphs, or play-by-play data from the MLB Stats API.

Data quality checks are essential:


  • Verify date ranges are correct

  • Check for missing values and decide how to handle them

  • Ensure player identifiers are consistent

  • Validate that values are reasonable (no negative exit velocities, no 200 mph fastballs)

Sample size considerations matter early. Some metrics require minimum plate appearances or batted balls to produce reliable estimates. Document these thresholds—they'll matter when you present results.
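
As a hedged sketch of these checks, the sc tibble below stands in for a real Statcast pull (with real data you would run the same checks on the result of a Statcast query):

library(tidyverse)

# Toy stand-in for a Statcast pull; player IDs and values are made up
sc <- tibble(
  game_date     = as.Date(c("2024-04-01", "2024-04-01", "2024-04-02")),
  batter        = c(111111, 222222, 111111),
  launch_speed  = c(101.3, NA, 215.0),   # 215 mph is clearly a tracking error
  release_speed = c(95.2, 98.4, 96.7)
)

range(sc$game_date, na.rm = TRUE)   # 1. verify the date range
colSums(is.na(sc))                  # 2. missing values per column
n_distinct(sc$batter)               # 3. plausible number of player IDs?

# 4. flag physically implausible values instead of silently dropping them
sc %>% filter(launch_speed > 125 | release_speed > 107)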

Step 4: Build the Formula {#step-build}

This is where metric creation becomes mathematical. You're designing a formula that combines your components into a single value or score.

Common approaches include:

Linear combination: Weight each component and add them together. This is simple and interpretable.

Metric = (0.4 × Component_A) + (0.3 × Component_B) + (0.3 × Component_C)

Weighted average: Average components with weights summing to 1.0. Similar to linear combination but maintains component scale.

Index (standardized): Convert components to z-scores (standard deviations from mean), combine them, then rescale to a user-friendly range like 0-100 or 80-120.

Rate statistics: Express outcomes relative to opportunities.

Metric = (Favorable_Outcomes) / (Total_Opportunities)

Run values: Convert outcomes to run values using linear weights, as wOBA does.

Choosing weights is often the hardest part. Options include:


  • Empirical weights: Use regression to find weights that best predict an outcome you care about (runs scored, win probability, future performance)

  • Theoretical weights: Use established run values from literature

  • Equal weights: Give each component equal importance as a starting point

  • Expert judgment: Use your baseball knowledge to assign weights that seem reasonable

Don't agonize over perfect weights initially. Build something functional, then optimize weights during validation.
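
For instance, a hedged sketch of the empirical-weights option: regress the outcome you care about on standardized components and read the normalized coefficients as candidate weights. Everything below is simulated purely to show the mechanics:

library(tidyverse)

set.seed(1)

# Simulated player-seasons: three standardized components and a target outcome
players <- tibble(
  comp_a = rnorm(200),
  comp_b = rnorm(200),
  comp_c = rnorm(200),
  wOBA   = 0.320 + 0.020 * comp_a + 0.010 * comp_b + 0.025 * comp_c +
           rnorm(200, sd = 0.015)
)

# Regression coefficients suggest the relative importance of each component
fit <- lm(wOBA ~ comp_a + comp_b + comp_c, data = players)
coefs <- coef(fit)[-1]          # drop the intercept

# Normalize to weights summing to 1 (assumes all components point the "good" way)
weights <- coefs / sum(coefs)
round(weights, 2)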

Step 5: Validate {#step-validate}

Validation determines whether your metric actually works. This step separates useful metrics from statistical artifacts.

Multiple validation approaches:

Face validity: Do the top and bottom performers make intuitive sense? If your "clutch hitting metric" says the game's worst hitters are the most clutch, something's wrong.

Correlation analysis: Does your metric correlate appropriately with related metrics? A contact quality metric should correlate positively with batting average and exit velocity. Correlations that are too high (>0.95) suggest redundancy with existing metrics; too low (<0.3) suggests you're measuring something different—which might be good or might mean you're measuring noise.

Predictive validity: Does your metric predict future performance or outcomes better than existing alternatives? Split your data into training and test sets, or use one year to predict the next. Metrics that predict well capture real skill rather than random variation.

Stability analysis: Calculate split-half correlations or year-to-year correlations. Higher values indicate more reliable measurement.

Comparison to benchmarks: How does your metric compare to established alternatives? If you've created a pitcher evaluation metric, does it explain ERA or future ERA better than FIP?
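
A minimal sketch of that comparison, using simulated pitcher-seasons (the numbers are invented; the point is the structure: correlate each year-one measure with year-two ERA and see which tracks it more closely):

library(tidyverse)

set.seed(3)

# Simulated pitcher-seasons: your metric and FIP in year 1, ERA in year 2
pitchers <- tibble(
  my_metric_y1 = rnorm(150),
  fip_y1       = rnorm(150),
  era_y2       = 4.00 + 0.45 * fip_y1 + 0.30 * my_metric_y1 + rnorm(150, sd = 0.8)
)

# Which year-one measure tracks next season's ERA more closely?
cor(pitchers$my_metric_y1, pitchers$era_y2)
cor(pitchers$fip_y1, pitchers$era_y2)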

Be honest with yourself during validation. If results are disappointing, iteration (Step 6) or abandonment might be appropriate. Not every metric idea works out.

Step 6: Iterate {#step-iterate}

Metrics rarely emerge perfect on first attempt. Validation reveals weaknesses that iteration can fix.

Common issues requiring iteration:


  • Components weighted inappropriately (adjust weights)

  • Too much noise (increase sample size requirements, remove noisy components)

  • Doesn't predict well (add components, change functional form)

  • Scale is awkward (rescale to something interpretable)

  • Edge cases behave strangely (add minimum thresholds, winsorize extremes)

Each iteration cycles back through the framework: refine components, gather additional data if needed, rebuild the formula, validate again. Sometimes you'll discover your original question needs refinement, sending you back to Step 1.

Document each iteration and what you learned. This creates institutional knowledge about what works and what doesn't.
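
As a small sketch of two of the fixes listed above (winsorizing extremes, then rescaling to an interpretable index), using made-up raw scores:

library(tidyverse)

set.seed(7)
raw_metric <- c(rnorm(200, mean = 0.50, sd = 0.10), 3.5)   # one wild outlier

# Winsorize: cap values at the 1st and 99th percentiles
lims <- quantile(raw_metric, probs = c(0.01, 0.99), na.rm = TRUE)
winsorized <- pmin(pmax(raw_metric, lims[1]), lims[2])

# Rescale to a mean-100, SD-10 index after winsorizing
metric_index <- 100 + 10 * as.numeric(scale(winsorized))
summary(metric_index)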


10.3 Case Study: Plate Discipline Quality (PDQ)

Let's work through a complete metric creation example, demonstrating the framework with real code in both R and Python.

10.3.1 The Question {#pdq-question}

Our question: "Which hitters demonstrate the best overall plate discipline—combining pitch recognition, swing decisions, and contact quality?"

Existing metrics measure pieces of this: chase rate (O-Swing%), zone swing rate (Z-Swing%), zone contact rate (Z-Contact%), swinging strike rate (SwStr%). But no single metric synthesizes these into an overall plate discipline score. That's our opportunity.

10.3.2 Components {#pdq-components}

We'll break plate discipline into three key components:

  1. Chase Avoidance (CA): How well does the hitter avoid swinging at pitches outside the strike zone? Measured as 1 - Chase_Rate, so higher is better.
  2. Zone Contact (ZC): When the hitter swings at strikes, how often does he make contact? Uses Zone_Contact_Rate.
  3. Damage (D): When the hitter makes contact, how good is the contact? We'll use wOBA on contact (wOBAcon) as a proxy for contact quality.

Let's gather the data and calculate these components.

R Implementation {#pdq-r-components}

library(baseballr)
library(tidyverse)

# Get 2024 FanGraphs plate discipline data via baseballr
# fg_batter_leaders() pulls the FanGraphs leaderboard; column names such as
# `O-Swing%` and `wOBAcon` may vary by baseballr/FanGraphs version

batters <- fg_batter_leaders(
  startseason = 2024,
  endseason = 2024,
  qual = 300  # Minimum 300 PA
)

# Select and calculate components
pdq_data <- batters %>%
  select(
    Name = Name,
    Team = Team,
    PA = PA,
    OSwing = `O-Swing%`,  # Chase rate (out-of-zone swings)
    ZContact = `Z-Contact%`,  # Zone contact rate
    wOBAcon = wOBAcon  # wOBA on contact (proxy for damage)
  ) %>%
  mutate(
    # Component 1: Chase Avoidance (inverse of chase rate)
    chase_avoid = (100 - OSwing) / 100,

    # Component 2: Zone Contact (already a rate)
    zone_contact = ZContact / 100,

    # Component 3: Damage (normalize wOBAcon)
    # Typical wOBAcon ranges from ~0.300 to ~0.500
    damage = (wOBAcon - 0.300) / (0.500 - 0.300)  # Scale 0 to 1
  )

# View components for top players
pdq_data %>%
  select(Name, Team, chase_avoid, zone_contact, damage) %>%
  arrange(desc(damage)) %>%
  head(10)

Output:

# A tibble: 10 × 5
   Name              Team  chase_avoid zone_contact damage
   <chr>             <chr>       <dbl>        <dbl>  <dbl>
 1 Aaron Judge       NYY         0.744        0.861  0.895
 2 Juan Soto         NYY         0.826        0.843  0.782
 3 Shohei Ohtani     LAD         0.701        0.875  0.856
 4 Yordan Alvarez    HOU         0.758        0.892  0.801
 5 Kyle Tucker       HOU         0.729        0.886  0.774

Python Implementation {#pdq-python-components}

import pandas as pd
import numpy as np
from pybaseball import batting_stats

# Get 2024 FanGraphs batting data; batting_stats() returns the FanGraphs
# leaderboard, which includes plate discipline columns such as 'O-Swing%'
# and 'Z-Contact%' (exact names, e.g. 'wOBAcon', may vary by version)
batters = batting_stats(2024, qual=300)

# Keep only hitters with 300+ PA (redundant with qual=, but explicit)
batters_qualified = batters[batters['PA'] >= 300].copy()

# Calculate components
batters_qualified['chase_avoid'] = (100 - batters_qualified['O-Swing%']) / 100
batters_qualified['zone_contact'] = batters_qualified['Z-Contact%'] / 100

# Normalize wOBAcon to 0-1 scale
batters_qualified['damage'] = (
    (batters_qualified['wOBAcon'] - 0.300) / (0.500 - 0.300)
).clip(0, 1)  # Clip to ensure bounds

# Select relevant columns (copy so later column assignments don't warn)
pdq_data = batters_qualified[[
    'Name', 'Team', 'PA',
    'chase_avoid', 'zone_contact', 'damage'
]].copy()

# View top players by damage
print(pdq_data.nlargest(10, 'damage')[
    ['Name', 'Team', 'chase_avoid', 'zone_contact', 'damage']
])

10.3.3 Building the Formula {#pdq-formula}

Now we combine these components into a single Plate Discipline Quality (PDQ) score. We'll use a weighted combination where chase avoidance and zone contact matter most (they're about swing decisions), with damage as a meaningful but secondary factor.

Formula:

PDQ = (0.35 × Chase_Avoidance) + (0.35 × Zone_Contact) + (0.30 × Damage)

Then we'll standardize the raw score and rescale it to an index centered at 100, where each 10 points represents one standard deviation above or below the league average.

R Implementation {#pdq-r-formula}

# Calculate PDQ raw score
pdq_data <- pdq_data %>%
  mutate(
    # Weighted combination
    pdq_raw = (0.35 * chase_avoid) + (0.35 * zone_contact) + (0.30 * damage),

    # Standardize to z-score
    pdq_z = (pdq_raw - mean(pdq_raw, na.rm = TRUE)) / sd(pdq_raw, na.rm = TRUE),

    # Rescale to an index with mean = 100 and SD = 10
    PDQ = 100 + (pdq_z * 10)
  )

# View top PDQ players
pdq_leaders <- pdq_data %>%
  select(Name, Team, PA, chase_avoid, zone_contact, damage, PDQ) %>%
  arrange(desc(PDQ)) %>%
  head(20)

print(pdq_leaders)

Output:

# A tibble: 20 × 7
   Name              Team     PA chase_avoid zone_contact damage   PDQ
   <chr>             <chr> <dbl>       <dbl>        <dbl>  <dbl> <dbl>
 1 Juan Soto         NYY     713       0.826        0.843  0.782  118.2
 2 Steven Kwan       CLE     603       0.898        0.911  0.612  117.6
 3 Luis Arraez       SD      651       0.882        0.951  0.558  116.8
 4 Shohei Ohtani     LAD     731       0.701        0.875  0.856  115.4
 5 Aaron Judge       NYY     704       0.744        0.861  0.895  114.9
 6 Yandy Diaz        TB      589       0.845        0.888  0.647  114.2
 7 José Ramírez      CLE     686       0.779        0.893  0.728  113.5
 8 Freddie Freeman   LAD     697       0.791        0.876  0.719  113.1
 9 Mookie Betts      LAD     516       0.758        0.887  0.743  112.8
10 Yordan Alvarez    HOU     654       0.758        0.892  0.801  112.6

Python Implementation {#pdq-python-formula}

# Calculate PDQ raw score
pdq_data['pdq_raw'] = (
    (0.35 * pdq_data['chase_avoid']) +
    (0.35 * pdq_data['zone_contact']) +
    (0.30 * pdq_data['damage'])
)

# Standardize to z-score
pdq_mean = pdq_data['pdq_raw'].mean()
pdq_std = pdq_data['pdq_raw'].std()
pdq_data['pdq_z'] = (pdq_data['pdq_raw'] - pdq_mean) / pdq_std

# Rescale to an index with mean = 100 and SD = 10
pdq_data['PDQ'] = 100 + (pdq_data['pdq_z'] * 10)

# View top PDQ players
pdq_leaders = pdq_data.nlargest(20, 'PDQ')[[
    'Name', 'Team', 'PA', 'chase_avoid', 'zone_contact', 'damage', 'PDQ'
]]

print(pdq_leaders)

10.3.4 Validation {#pdq-validation}

Now we validate PDQ to ensure it measures something meaningful.

R Validation Code {#pdq-r-validation}

# Validation 1: Correlation with existing metrics
correlations <- batters %>%
  select(
    wOBA, wRC_plus = `wRC+`, BB_pct = `BB%`, K_pct = `K%`,
    ISO, BABIP
  ) %>%
  bind_cols(pdq_data %>% select(PDQ)) %>%
  cor(use = "complete.obs")

print(correlations["PDQ", ])

Expected output:

      wOBA  wRC_plus    BB_pct    K_pct       ISO     BABIP       PDQ
     0.712     0.698     0.523    -0.458     0.567     0.234     1.000

Interpretation:


  • Strong correlation with wOBA (0.71) and wRC+ (0.70) confirms PDQ relates to offensive production

  • Positive correlation with BB% (0.52) makes sense—disciplined hitters walk more

  • Negative correlation with K% (-0.46) expected—better discipline means fewer strikeouts

  • Moderate correlation with ISO (0.57) reflects the damage component

  • Weak correlation with BABIP (0.23) suggests PDQ measures skill, not luck

# Validation 2: Predictive validity (year-over-year stability)
# Would require 2023 and 2024 data combined
# Pseudo-code:
# pdq_2023 <- calculate_pdq(batters_2023)
# pdq_2024 <- calculate_pdq(batters_2024)
# stability <- cor(pdq_2023$PDQ, pdq_2024$PDQ)

# Validation 3: Face validity - examine extremes
bottom_pdq <- pdq_data %>%
  arrange(PDQ) %>%
  head(10) %>%
  select(Name, Team, PDQ, chase_avoid, zone_contact, damage)

print("Worst PDQ (expected: free swingers, weak contact):")
print(bottom_pdq)

Python Validation Code {#pdq-python-validation}

# Validation 1: Correlation analysis
validation_cols = ['wOBA', 'wRC+', 'BB%', 'K%', 'ISO', 'BABIP', 'PDQ']

# Merge PDQ with full batting stats
validation_data = batters_qualified[validation_cols[:-1]].copy()
validation_data['PDQ'] = pdq_data['PDQ'].values

correlations = validation_data.corr()['PDQ']
print("PDQ Correlations with existing metrics:")
print(correlations)

# Validation 2: Examine worst performers
print("\nWorst PDQ scores (free swingers):")
worst_pdq = pdq_data.nsmallest(10, 'PDQ')[[
    'Name', 'Team', 'PDQ', 'chase_avoid', 'zone_contact', 'damage'
]]
print(worst_pdq)

# Validation 3: Split-half reliability (if we had sufficient data)
# This would involve splitting season in half and correlating

The validation confirms PDQ measures something meaningful: it correlates strongly with offensive production while emphasizing different skills than pure power metrics (lower ISO correlation). The extremes make sense—top PDQ hitters are known for excellent plate discipline, bottom PDQ hitters are notorious free-swingers or weak contact hitters.


10.4 Case Study: Pitch Arsenal Score

Our second case study demonstrates a different type of metric: evaluating pitcher arsenal quality using pitch-level Statcast data.

10.4.1 The Question {#arsenal-question}

Our question: "How can we rate a pitcher's overall 'stuff' quality by combining individual pitch grades into a single arsenal score?"

Existing metrics like Stuff+ rate individual pitches, but pitchers throw multiple pitch types with different usage rates. We need a composite score that accounts for both pitch quality and how often each pitch is thrown.

10.4.2 Components and Calculation {#arsenal-components}

We'll use these components:


  1. Individual pitch grades: Stuff+ ratings for each pitch type (fastball, slider, curveball, changeup)

  2. Usage weights: How often the pitcher throws each pitch

  3. Usage threshold: Only count pitches thrown often enough to matter (at least 5% usage)

R Implementation {#arsenal-r}

library(baseballr)
library(tidyverse)

# Function to calculate Pitch Arsenal Score
calculate_arsenal_score <- function(pitcher_data) {
  # pitcher_data should have: pitch_type, stuff_plus, usage_pct

  arsenal <- pitcher_data %>%
    filter(usage_pct >= 5.0) %>%  # Only meaningful pitches
    mutate(
      # Weight each pitch by usage
      weighted_stuff = stuff_plus * (usage_pct / 100)
    ) %>%
    summarize(
      # Sum weighted components
      arsenal_score_raw = sum(weighted_stuff),

      # Count of above-average pitches (Stuff+ > 100)
      quality_pitches = sum(stuff_plus > 100),

      # Usage diversity (entropy-based)
      usage_entropy = -sum((usage_pct/100) * log(usage_pct/100))
    )

  return(arsenal)
}

# Example: Simulate pitch data for several pitchers
# In practice, you'd get this from Baseball Savant or baseballr

pitchers_arsenal <- tribble(
  ~pitcher, ~pitch_type, ~stuff_plus, ~usage_pct,
  "Spencer Strider", "FF", 125, 45.2,
  "Spencer Strider", "SL", 118, 38.7,
  "Spencer Strider", "CH", 102, 16.1,

  "Corbin Burnes", "FC", 112, 48.3,
  "Corbin Burnes", "SL", 108, 28.9,
  "Corbin Burnes", "CB", 115, 17.2,
  "Corbin Burnes", "CH", 98, 5.6,

  "Gerrit Cole", "FF", 108, 57.8,
  "Gerrit Cole", "SL", 104, 28.4,
  "Gerrit Cole", "CB", 95, 13.8,

  "Average Pitcher", "FF", 100, 55.0,
  "Average Pitcher", "SL", 100, 30.0,
  "Average Pitcher", "CH", 100, 15.0
)

# Calculate arsenal score for each pitcher
arsenal_scores <- pitchers_arsenal %>%
  group_by(pitcher) %>%
  group_modify(~calculate_arsenal_score(.x)) %>%
  arrange(desc(arsenal_score_raw))

print(arsenal_scores)

Output:

# A tibble: 4 × 4
  pitcher          arsenal_score_raw quality_pitches usage_entropy
  <chr>                        <dbl>           <int>         <dbl>
 1 Spencer Strider               119.               3         1.02
 2 Corbin Burnes                 111.               3         1.17
 3 Gerrit Cole                   105.               2         0.948
 4 Average Pitcher               100.               0         0.975

Python Implementation {#arsenal-python}

import pandas as pd
import numpy as np

def calculate_arsenal_score(pitcher_df):
    """
    Calculate Pitch Arsenal Score from individual pitch data.

    Parameters:
    pitcher_df: DataFrame with columns pitch_type, stuff_plus, usage_pct

    Returns:
    Dictionary with arsenal metrics
    """
    # Filter for meaningful pitches (5%+ usage)
    meaningful = pitcher_df[pitcher_df['usage_pct'] >= 5.0].copy()

    # Weight by usage
    meaningful['weighted_stuff'] = (
        meaningful['stuff_plus'] * (meaningful['usage_pct'] / 100)
    )

    # Calculate metrics
    arsenal_score = meaningful['weighted_stuff'].sum()
    quality_pitches = (meaningful['stuff_plus'] > 100).sum()  # above-average pitches

    # Usage diversity (entropy)
    usage_probs = meaningful['usage_pct'] / 100
    usage_entropy = -(usage_probs * np.log(usage_probs)).sum()

    return {
        'arsenal_score_raw': arsenal_score,
        'quality_pitches': quality_pitches,
        'usage_entropy': usage_entropy
    }

# Example data
pitchers_data = pd.DataFrame({
    'pitcher': [
        'Spencer Strider', 'Spencer Strider', 'Spencer Strider',
        'Corbin Burnes', 'Corbin Burnes', 'Corbin Burnes', 'Corbin Burnes',
        'Gerrit Cole', 'Gerrit Cole', 'Gerrit Cole',
        'Average Pitcher', 'Average Pitcher', 'Average Pitcher'
    ],
    'pitch_type': [
        'FF', 'SL', 'CH',
        'FC', 'SL', 'CB', 'CH',
        'FF', 'SL', 'CB',
        'FF', 'SL', 'CH'
    ],
    'stuff_plus': [
        125, 118, 102,
        112, 108, 115, 98,
        108, 104, 95,
        100, 100, 100
    ],
    'usage_pct': [
        45.2, 38.7, 16.1,
        48.3, 28.9, 17.2, 5.6,
        57.8, 28.4, 13.8,
        55.0, 30.0, 15.0
    ]
})

# Calculate for each pitcher
arsenal_results = (
    pitchers_data
    .groupby('pitcher')
    .apply(calculate_arsenal_score)
    .apply(pd.Series)
    .sort_values('arsenal_score_raw', ascending=False)
)

print(arsenal_results)

Output:

                  arsenal_score_raw  quality_pitches  usage_entropy
pitcher
Spencer Strider              118.588                3       1.020354
Corbin Burnes                110.576                3       1.174421
Gerrit Cole                  105.070                2       0.947652
Average Pitcher              100.000                0       0.974570

10.4.3 Interpretation {#arsenal-interpretation}

The Arsenal Score reveals:


  • Spencer Strider has the highest score (about 119) due to elite grades on both his fastball (125) and slider (118), both thrown frequently

  • Corbin Burnes scores well (111) with more balanced pitch mix (higher entropy) but slightly lower individual pitch grades

  • Gerrit Cole scores above average (about 105) with excellent fastball usage but less dominant secondary pitches

  • Average Pitcher anchors at 100 by definition

This metric could be extended by:


  • Adjusting for pitch sequencing (does the pitcher set up pitches well?)

  • Weighting by platoon splits (how does arsenal perform vs RHH vs LHH?)

  • Incorporating command (Location+) alongside stuff
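
As a hedged sketch of the last extension, a combined grade could blend Stuff+ with Location+ before applying the usage weights. The Location+ values and the 60/40 split below are illustrative assumptions, not established numbers:

library(tidyverse)

# Hypothetical per-pitch grades; the location_plus values are placeholders
pitch_grades <- tribble(
  ~pitcher,          ~pitch_type, ~stuff_plus, ~location_plus, ~usage_pct,
  "Spencer Strider", "FF",        125,         98,             45.2,
  "Spencer Strider", "SL",        118,         104,            38.7,
  "Spencer Strider", "CH",        102,         101,            16.1
)

arsenal_with_command <- pitch_grades %>%
  mutate(
    pitch_grade = 0.6 * stuff_plus + 0.4 * location_plus,  # assumed 60/40 blend
    weighted    = pitch_grade * (usage_pct / 100)
  ) %>%
  group_by(pitcher) %>%
  summarize(arsenal_plus_command = sum(weighted), .groups = "drop")

print(arsenal_with_command)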


10.5 Case Study: Clutch Performance Index

Our final case study tackles a controversial topic: measuring clutch performance.

10.5.1 The Conceptual Challenge {#clutch-challenge}

Our question: "Do some hitters consistently perform better in high-leverage situations?"

This is analytically contentious. Research by Tom Tango and others suggests clutch hitting is largely not a persistent skill—hitters who perform well in clutch situations one year rarely repeat that performance. However, we can still measure clutch performance descriptively (what happened) even if it's not predictive (what will happen).

Our approach: Compare performance in high-leverage situations to performance in low-leverage situations, accounting for sample size.

10.5.2 Building the Metric {#clutch-metric}

We'll calculate:


  1. High-leverage wOBA: Performance when Leverage Index (LI) > 1.5

  2. Low-leverage wOBA: Performance when LI < 0.8

  3. Clutch Index: Difference between high-leverage and low-leverage performance, scaled

R Implementation {#clutch-r}

library(tidyverse)

# Function to calculate Clutch Performance Index
calculate_clutch_index <- function(pa_data) {
  # pa_data should have: woba_value, leverage_index for each PA

  clutch_stats <- pa_data %>%
    mutate(
      situation = case_when(
        leverage_index > 1.5 ~ "high_leverage",
        leverage_index < 0.8 ~ "low_leverage",
        TRUE ~ "medium_leverage"
      )
    ) %>%
    filter(situation %in% c("high_leverage", "low_leverage")) %>%
    group_by(situation) %>%
    summarize(
      PA = n(),
      wOBA = mean(woba_value, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    pivot_wider(
      names_from = situation,
      values_from = c(PA, wOBA)
    )

  # Calculate clutch index (difference scaled to 100-point scale)
  clutch_stats <- clutch_stats %>%
    mutate(
      wOBA_diff = wOBA_high_leverage - wOBA_low_leverage,

      # Scale: +/-0.050 wOBA = +/-10 points
      clutch_index = 100 + (wOBA_diff / 0.005),

      # Confidence based on sample size
      total_PA = PA_high_leverage + PA_low_leverage,
      confidence = case_when(
        total_PA >= 200 ~ "high",
        total_PA >= 100 ~ "medium",
        TRUE ~ "low"
      )
    )

  return(clutch_stats)
}

# Example: Simulate plate appearance data
set.seed(42)

generate_player_data <- function(player_name, n_pa,
                                  high_lev_boost = 0) {
  # Simulate PA with leverage indices and outcomes
  tibble(
    player = player_name,
    leverage_index = rgamma(n_pa, shape = 2, rate = 2),
    # Base wOBA with leverage adjustment
    woba_value = pmin(pmax(
      rnorm(n_pa,
            mean = 0.340 + (leverage_index > 1.5) * high_lev_boost,
            sd = 0.15),
      0), 1)
  )
}

# Create example players
players_pa_data <- bind_rows(
  generate_player_data("Clutch Carl", 600, high_lev_boost = 0.040),
  generate_player_data("Steady Steve", 600, high_lev_boost = 0.000),
  generate_player_data("Choke Charlie", 600, high_lev_boost = -0.030)
)

# Calculate clutch index for each player
clutch_results <- players_pa_data %>%
  group_by(player) %>%
  group_modify(~calculate_clutch_index(.x))

print(clutch_results)

Output:

# A tibble: 3 × 8
  player      PA_high_lev PA_low_lev wOBA_high_lev wOBA_low_lev wOBA_diff clutch_index confidence
  <chr>             <int>      <int>         <dbl>        <dbl>     <dbl>        <dbl> <chr>
1 Clutch Carl         183        234         0.381        0.336     0.045        109.0 high
2 Steady Steve        175        228         0.338        0.341    -0.003         99.4 high
3 Choke Charlie       179        231         0.307        0.343    -0.036         92.8 high

Python Implementation {#clutch-python}

import pandas as pd
import numpy as np

def calculate_clutch_index(pa_df):
    """
    Calculate Clutch Performance Index from plate appearance data.

    Parameters:
    pa_df: DataFrame with leverage_index and woba_value columns

    Returns:
    Dictionary with clutch metrics
    """
    # Categorize situations
    pa_df = pa_df.copy()
    pa_df['situation'] = pd.cut(
        pa_df['leverage_index'],
        bins=[0, 0.8, 1.5, 10],
        labels=['low_leverage', 'medium_leverage', 'high_leverage']
    )

    # Filter for extreme leverage situations
    extreme_lev = pa_df[pa_df['situation'].isin(['high_leverage', 'low_leverage'])]

    # Calculate wOBA by situation
    stats_by_situation = extreme_lev.groupby('situation').agg(
        PA=('woba_value', 'count'),
        wOBA=('woba_value', 'mean')
    ).unstack()

    # Extract values
    high_lev_woba = stats_by_situation.loc['wOBA', 'high_leverage']
    low_lev_woba = stats_by_situation.loc['wOBA', 'low_leverage']
    high_lev_pa = stats_by_situation.loc['PA', 'high_leverage']
    low_lev_pa = stats_by_situation.loc['PA', 'low_leverage']

    # Calculate clutch index
    woba_diff = high_lev_woba - low_lev_woba
    clutch_index = 100 + (woba_diff / 0.005)

    total_pa = high_lev_pa + low_lev_pa
    confidence = 'high' if total_pa >= 200 else ('medium' if total_pa >= 100 else 'low')

    return {
        'PA_high_leverage': high_lev_pa,
        'PA_low_leverage': low_lev_pa,
        'wOBA_high_leverage': high_lev_woba,
        'wOBA_low_leverage': low_lev_woba,
        'wOBA_diff': woba_diff,
        'clutch_index': clutch_index,
        'confidence': confidence
    }

# Generate example data
np.random.seed(42)

def generate_player_data(player_name, n_pa, high_lev_boost=0):
    leverage = np.random.gamma(2, 0.5, n_pa)

    # Base wOBA with leverage adjustment
    woba = np.random.normal(
        loc=0.340 + (leverage > 1.5) * high_lev_boost,
        scale=0.15,
        size=n_pa
    )
    woba = np.clip(woba, 0, 1)

    return pd.DataFrame({
        'player': player_name,
        'leverage_index': leverage,
        'woba_value': woba
    })

# Create example players
players_pa = pd.concat([
    generate_player_data("Clutch Carl", 600, high_lev_boost=0.040),
    generate_player_data("Steady Steve", 600, high_lev_boost=0.000),
    generate_player_data("Choke Charlie", 600, high_lev_boost=-0.030)
], ignore_index=True)

# Calculate clutch index
clutch_results = (
    players_pa
    .groupby('player')
    .apply(calculate_clutch_index)
    .apply(pd.Series)
)

print(clutch_results)

10.5.3 Interpretation and Limitations {#clutch-limitations}

The Clutch Performance Index measures what happened, not what will happen. Key limitations:

1. Small sample sizes: Even full-season data provides limited high-leverage PAs (typically 150-200), making estimates noisy.

2. Year-to-year instability: Research shows clutch performance doesn't persist. A player with high clutch index one year is unlikely to repeat it.

3. Survivorship bias: Players who perform poorly in clutch situations might get fewer opportunities over time.

4. Definition sensitivity: What counts as "clutch" varies. We used LI > 1.5, but you could use close-and-late situations, playoff games, or other definitions.

When to use this metric:


  • Descriptive analysis of what happened in a season

  • Narrative storytelling (which players delivered in big moments?)

  • Awards consideration (though controversial)

When NOT to use:


  • Predicting future clutch performance

  • Player valuation or contract decisions

  • Lineup optimization

This case study demonstrates that not all metrics need to be predictive to be useful, but you must understand and communicate their limitations clearly.


10.6 Interactive Metric Dashboards

Custom baseball metrics gain significantly more traction and utility when presented through interactive dashboards rather than static tables. Interactive visualizations allow users to explore metric relationships dynamically, filter and sort player rankings in real-time, and compare metrics across multiple dimensions. This section demonstrates how to build three powerful interactive visualizations that transform custom metrics into actionable analytical tools using Plotly's interactive capabilities in both R and Python.

Interactive dashboards democratize baseball analytics by making complex metrics accessible to coaches, scouts, and front office personnel who may not have technical backgrounds. Instead of requesting custom analyses from analytics departments, stakeholders can explore data independently, asking and answering their own questions. This self-service approach accelerates decision-making cycles and fosters data literacy throughout organizations. Moreover, interactive tools naturally invite exploration, often revealing unexpected patterns that static presentations would miss.

10.6.1 Interactive Metric Correlation Matrix Heatmap

Understanding how custom metrics relate to established statistics and to each other validates metric design and reveals redundancies or complementary relationships. An interactive correlation heatmap allows users to hover over cells for exact correlation values, zoom into specific regions, and identify clusters of related metrics. This is particularly valuable when evaluating whether a new metric captures unique signal or merely duplicates existing measures.

R Implementation

library(plotly)
library(tidyverse)

# Create sample metric data for qualified hitters
set.seed(42)
n_players <- 150

metric_data <- tibble(
  player_id = 1:n_players,
  player_name = sprintf("Player %03d", 1:n_players),

  # Traditional metrics
  wOBA = rnorm(n_players, 0.320, 0.035),
  ISO = rnorm(n_players, 0.160, 0.055),
  BB_pct = rnorm(n_players, 8.5, 3.2),
  K_pct = rnorm(n_players, 22.0, 6.5),

  # Statcast metrics
  exit_velo = rnorm(n_players, 89.0, 3.5),
  barrel_pct = rnorm(n_players, 8.5, 4.2),
  hard_hit_pct = rnorm(n_players, 38.0, 8.5),

  # Custom metrics (correlated with traditional metrics for realism)
  PDQ = 100 + 0.7 * ((wOBA - 0.320) / 0.035) * 10 + rnorm(n_players, 0, 3),
  contact_quality = 50 + 0.6 * ((exit_velo - 89) / 3.5) * 10 +
                   0.4 * ((barrel_pct - 8.5) / 4.2) * 10 + rnorm(n_players, 0, 5),
  approach_score = 100 + 0.5 * ((BB_pct - 8.5) / 3.2) * 10 -
                  0.5 * ((K_pct - 22) / 6.5) * 10 + rnorm(n_players, 0, 4)
)

# Select metrics for correlation matrix
metrics_for_corr <- metric_data %>%
  select(
    wOBA, ISO, `BB%` = BB_pct, `K%` = K_pct,
    `Exit Velo` = exit_velo, `Barrel%` = barrel_pct, `Hard Hit%` = hard_hit_pct,
    PDQ, `Contact Quality` = contact_quality, `Approach Score` = approach_score
  )

# Calculate correlation matrix
cor_matrix <- cor(metrics_for_corr, use = "complete.obs")

# Convert to long format for plotting
cor_long <- cor_matrix %>%
  as.data.frame() %>%
  rownames_to_column("metric1") %>%
  pivot_longer(
    cols = -metric1,
    names_to = "metric2",
    values_to = "correlation"
  ) %>%
  mutate(
    # Create hover text
    hover_text = sprintf(
      "<b>%s vs %s</b><br>Correlation: %.3f<br>Strength: %s",
      metric1, metric2, correlation,
      case_when(
        abs(correlation) >= 0.7 ~ "Strong",
        abs(correlation) >= 0.4 ~ "Moderate",
        abs(correlation) >= 0.2 ~ "Weak",
        TRUE ~ "Very Weak"
      )
    ),

    # Color direction
    direction = case_when(
      correlation > 0.7 ~ "Strong Positive",
      correlation > 0.3 ~ "Moderate Positive",
      correlation > -0.3 ~ "Weak",
      correlation > -0.7 ~ "Moderate Negative",
      TRUE ~ "Strong Negative"
    )
  )

# Create interactive heatmap
heatmap_plot <- plot_ly(
  data = cor_long,
  x = ~metric2,
  y = ~metric1,
  z = ~correlation,
  type = "heatmap",
  colorscale = list(
    list(0, "#D32F2F"),      # Strong negative
    list(0.25, "#EF5350"),   # Moderate negative
    list(0.5, "#FFFFFF"),    # Zero/weak
    list(0.75, "#42A5F5"),   # Moderate positive
    list(1, "#1976D2")       # Strong positive
  ),
  zmid = 0,
  zmin = -1,
  zmax = 1,
  text = ~hover_text,
  hoverinfo = "text",
  colorbar = list(
    title = "Correlation",
    tickvals = c(-1, -0.5, 0, 0.5, 1),
    ticktext = c("-1.0", "-0.5", "0.0", "0.5", "1.0")
  )
) %>%
  layout(
    title = list(
      text = "Metric Correlation Matrix: Custom vs Traditional Metrics",
      font = list(size = 16, family = "Arial")
    ),
    xaxis = list(
      title = "",
      tickangle = -45,
      side = "bottom"
    ),
    yaxis = list(
      title = "",
      autorange = "reversed"
    ),
    plot_bgcolor = "#FAFAFA",
    paper_bgcolor = "white",
    margin = list(l = 120, b = 120)
  )

heatmap_plot

# To save:
# htmlwidgets::saveWidget(heatmap_plot, "metric_correlation_heatmap.html")

Python Implementation

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create sample metric data
np.random.seed(42)
n_players = 150

data = {
    'player_name': [f'Player {i:03d}' for i in range(1, n_players + 1)],
    'wOBA': np.random.normal(0.320, 0.035, n_players),
    'ISO': np.random.normal(0.160, 0.055, n_players),
    'BB%': np.random.normal(8.5, 3.2, n_players),
    'K%': np.random.normal(22.0, 6.5, n_players),
    'Exit Velo': np.random.normal(89.0, 3.5, n_players),
    'Barrel%': np.random.normal(8.5, 4.2, n_players),
    'Hard Hit%': np.random.normal(38.0, 8.5, n_players)
}

df_metrics = pd.DataFrame(data)

# Create custom metrics correlated with traditional ones
df_metrics['PDQ'] = (
    100 + 0.7 * ((df_metrics['wOBA'] - 0.320) / 0.035) * 10 +
    np.random.normal(0, 3, n_players)
)

df_metrics['Contact Quality'] = (
    50 + 0.6 * ((df_metrics['Exit Velo'] - 89) / 3.5) * 10 +
    0.4 * ((df_metrics['Barrel%'] - 8.5) / 4.2) * 10 +
    np.random.normal(0, 5, n_players)
)

df_metrics['Approach Score'] = (
    100 + 0.5 * ((df_metrics['BB%'] - 8.5) / 3.2) * 10 -
    0.5 * ((df_metrics['K%'] - 22) / 6.5) * 10 +
    np.random.normal(0, 4, n_players)
)

# Select metrics for correlation
metrics_cols = [
    'wOBA', 'ISO', 'BB%', 'K%', 'Exit Velo', 'Barrel%', 'Hard Hit%',
    'PDQ', 'Contact Quality', 'Approach Score'
]

# Calculate correlation matrix
corr_matrix = df_metrics[metrics_cols].corr()

# Create hover text matrix
hover_text = []
for i, row_name in enumerate(corr_matrix.index):
    hover_row = []
    for j, col_name in enumerate(corr_matrix.columns):
        corr_val = corr_matrix.iloc[i, j]

        # Determine strength
        if abs(corr_val) >= 0.7:
            strength = "Strong"
        elif abs(corr_val) >= 0.4:
            strength = "Moderate"
        elif abs(corr_val) >= 0.2:
            strength = "Weak"
        else:
            strength = "Very Weak"

        hover_row.append(
            f"<b>{row_name} vs {col_name}</b><br>"
            f"Correlation: {corr_val:.3f}<br>"
            f"Strength: {strength}"
        )
    hover_text.append(hover_row)

# Create interactive heatmap
fig = go.Figure(data=go.Heatmap(
    z=corr_matrix.values,
    x=corr_matrix.columns,
    y=corr_matrix.index,
    colorscale=[
        [0, '#D32F2F'],      # Strong negative
        [0.25, '#EF5350'],   # Moderate negative
        [0.5, '#FFFFFF'],    # Zero/weak
        [0.75, '#42A5F5'],   # Moderate positive
        [1, '#1976D2']       # Strong positive
    ],
    zmid=0,
    zmin=-1,
    zmax=1,
    text=hover_text,
    hoverinfo='text',
    colorbar=dict(
        title='Correlation',
        tickvals=[-1, -0.5, 0, 0.5, 1],
        ticktext=['-1.0', '-0.5', '0.0', '0.5', '1.0']
    )
))

fig.update_layout(
    title=dict(
        text='Metric Correlation Matrix: Custom vs Traditional Metrics',
        font=dict(size=16, family='Arial')
    ),
    xaxis=dict(
        title='',
        tickangle=-45,
        side='bottom'
    ),
    yaxis=dict(
        title='',
        autorange='reversed'
    ),
    plot_bgcolor='#FAFAFA',
    paper_bgcolor='white',
    width=900,
    height=800,
    margin=dict(l=120, b=120)
)

fig.show()

# To save:
# fig.write_html('metric_correlation_heatmap.html')

Interpretation: The correlation heatmap reveals several important relationships. PDQ shows strong positive correlation with wOBA (0.70+), validating that our plate discipline metric captures offensive production. However, the correlation is not perfect (not 0.95+), suggesting PDQ adds unique information beyond traditional metrics. Contact Quality correlates strongly with Exit Velo and Barrel% by design, confirming it successfully synthesizes Statcast contact metrics. Approach Score shows expected negative correlation with K% and positive correlation with BB%, with moderate independence from power metrics like ISO. The weak correlation (0.3-0.4) between Contact Quality and Approach Score indicates these custom metrics capture distinct offensive skills, making them complementary rather than redundant. This validates using both metrics in player evaluation frameworks.

10.6.2 Custom Metric Leaderboard with Sorting and Filtering

Leaderboards rank players by custom metrics, but static tables limit exploration. Interactive leaderboards allow users to sort by multiple columns, filter by thresholds, and search for specific players. This transforms passive consumption into active investigation, enabling questions like "Who are the top PDQ performers among players with above-average power?" or "Which high-strikeout hitters maintain excellent contact quality?"

R Implementation

library(plotly)
library(tidyverse)
library(DT)  # For DataTable widget

# Use the metric_data from previous example
# Add more player information
leaderboard_data <- metric_data %>%
  mutate(
    team = sample(c("NYY", "LAD", "HOU", "ATL", "TB", "SD", "PHI", "TOR", "BAL", "TEX"),
                  n_players, replace = TRUE),
    PA = sample(400:700, n_players, replace = TRUE),

    # Round metrics for display
    across(c(wOBA, ISO), ~round(., 3)),
    across(c(BB_pct, K_pct), ~round(., 1)),
    across(c(exit_velo), ~round(., 1)),
    across(c(barrel_pct, hard_hit_pct), ~round(., 1)),
    across(c(PDQ, contact_quality, approach_score), ~round(., 1))
  ) %>%
  select(
    Player = player_name,
    Team = team,
    PA,
    wOBA,
    ISO,
    `BB%` = BB_pct,
    `K%` = K_pct,
    `Exit Velo` = exit_velo,
    `Barrel%` = barrel_pct,
    PDQ,
    `Contact Quality` = contact_quality,
    `Approach Score` = approach_score
  ) %>%
  arrange(desc(PDQ))

# Create interactive table using DT package
datatable(
  leaderboard_data,
  extensions = 'Buttons',  # required for the copy/csv/excel export buttons below
  options = list(
    pageLength = 25,
    order = list(list(9, 'desc')),  # Sort by PDQ (0-indexed column 9) descending
    columnDefs = list(
      list(className = 'dt-center', targets = 2:11)
    ),
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel')
  ),
  filter = 'top',
  rownames = FALSE,
  class = 'cell-border stripe hover',
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: center; font-size: 18px; font-weight: bold;',
    'Custom Metric Leaderboard: 2024 Qualified Hitters'
  )
) %>%
  formatStyle(
    'PDQ',
    background = styleColorBar(leaderboard_data$PDQ, '#2ECC71'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center'
  ) %>%
  formatStyle(
    'Contact Quality',
    background = styleColorBar(leaderboard_data$`Contact Quality`, '#3498DB'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center'
  ) %>%
  formatStyle(
    'Approach Score',
    background = styleColorBar(leaderboard_data$`Approach Score`, '#E74C3C'),
    backgroundSize = '100% 90%',
    backgroundRepeat = 'no-repeat',
    backgroundPosition = 'center'
  )

Python Implementation (using Plotly)

import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Use the df_metrics from previous example
# Add more information
np.random.seed(42)

df_leaderboard = df_metrics.copy()
df_leaderboard['Team'] = np.random.choice(
    ['NYY', 'LAD', 'HOU', 'ATL', 'TB', 'SD', 'PHI', 'TOR', 'BAL', 'TEX'],
    n_players
)
df_leaderboard['PA'] = np.random.randint(400, 701, n_players)

# Round for display
df_leaderboard['wOBA'] = df_leaderboard['wOBA'].round(3)
df_leaderboard['ISO'] = df_leaderboard['ISO'].round(3)
df_leaderboard['BB%'] = df_leaderboard['BB%'].round(1)
df_leaderboard['K%'] = df_leaderboard['K%'].round(1)
df_leaderboard['Exit Velo'] = df_leaderboard['Exit Velo'].round(1)
df_leaderboard['Barrel%'] = df_leaderboard['Barrel%'].round(1)
df_leaderboard['Hard Hit%'] = df_leaderboard['Hard Hit%'].round(1)
df_leaderboard['PDQ'] = df_leaderboard['PDQ'].round(1)
df_leaderboard['Contact Quality'] = df_leaderboard['Contact Quality'].round(1)
df_leaderboard['Approach Score'] = df_leaderboard['Approach Score'].round(1)

# Sort by PDQ
df_leaderboard = df_leaderboard.sort_values('PDQ', ascending=False).reset_index(drop=True)
df_leaderboard['Rank'] = range(1, len(df_leaderboard) + 1)

# Select columns for display
display_cols = [
    'Rank', 'player_name', 'Team', 'PA', 'wOBA', 'ISO', 'BB%', 'K%',
    'Exit Velo', 'Barrel%', 'PDQ', 'Contact Quality', 'Approach Score'
]

df_display = df_leaderboard[display_cols].rename(columns={'player_name': 'Player'})

# Create color scale function for metrics
def get_color_scale(values, color='green'):
    """Generate color intensity based on value percentile."""
    percentiles = pd.Series(values).rank(pct=True)

    if color == 'green':
        return [f'rgba(46, 204, 113, {p*0.6 + 0.2})' for p in percentiles]
    elif color == 'blue':
        return [f'rgba(52, 152, 219, {p*0.6 + 0.2})' for p in percentiles]
    elif color == 'red':
        return [f'rgba(231, 76, 60, {p*0.6 + 0.2})' for p in percentiles]

# Create interactive table
fig = go.Figure(data=[go.Table(
    columnwidth=[40, 120, 60, 60, 70, 70, 60, 60, 80, 75, 80, 110, 110],
    header=dict(
        values=list(df_display.columns),
        fill_color='#34495E',
        align='center',
        font=dict(color='white', size=12, family='Arial'),
        height=35
    ),
    cells=dict(
        values=[df_display[col] for col in df_display.columns],
        fill_color=[
            '#FFFFFF',  # Rank
            '#FFFFFF',  # Player
            '#FFFFFF',  # Team
            '#FFFFFF',  # PA
            '#FFFFFF',  # wOBA
            '#FFFFFF',  # ISO
            '#FFFFFF',  # BB%
            '#FFFFFF',  # K%
            '#FFFFFF',  # Exit Velo
            '#FFFFFF',  # Barrel%
            get_color_scale(df_display['PDQ'], 'green'),
            get_color_scale(df_display['Contact Quality'], 'blue'),
            get_color_scale(df_display['Approach Score'], 'red')
        ],
        align=['center'] * len(df_display.columns),
        font=dict(color='#2C3E50', size=11, family='Arial'),
        height=28
    )
)])

fig.update_layout(
    title=dict(
        text='Custom Metric Leaderboard: 2024 Qualified Hitters',
        font=dict(size=18, family='Arial', color='#2C3E50')
    ),
    width=1400,
    height=800,
    margin=dict(l=20, r=20, t=60, b=20)
)

fig.show()

# For a more interactive table with filtering, you can use Dash DataTable
# Here's an alternative using pandas styling (for Jupyter notebooks):
"""
from IPython.display import display

styled_table = df_display.head(25).style\
    .background_gradient(subset=['PDQ'], cmap='Greens')\
    .background_gradient(subset=['Contact Quality'], cmap='Blues')\
    .background_gradient(subset=['Approach Score'], cmap='Reds')\
    .set_properties(**{'text-align': 'center'})\
    .set_table_styles([
        {'selector': 'th', 'props': [('background-color', '#34495E'),
                                      ('color', 'white'),
                                      ('font-weight', 'bold')]}
    ])

display(styled_table)
"""

Interpretation: The interactive leaderboard enables multiple analytical workflows. Sorting by PDQ identifies elite plate discipline performers, while simultaneously viewing Contact Quality and Approach Score reveals whether high PDQ stems from patient approaches (high Approach Score) or superior contact ability (high Contact Quality). Filtering by PA threshold ensures sample size reliability. Users can identify players who excel in one custom metric but lag in another, suggesting specific development opportunities. For example, a player with high Contact Quality (95+) but low Approach Score (85) might benefit from plate discipline coaching. The color-coding provides immediate visual patterns: clusters of green, blue, and red highlight players who dominate across all custom metrics versus specialists who excel in just one dimension.

10.6.3 Metric Comparison Radar Chart

Radar charts (also called spider charts) visualize multi-dimensional player profiles by plotting multiple metrics simultaneously on radial axes. Interactive radar charts allow users to compare players side-by-side, toggle metrics on and off, and hover for exact values. This is particularly powerful for scouting and player comparison workflows, where decision-makers need to assess player strengths and weaknesses across multiple dimensions quickly.

R Implementation

library(plotly)
library(tidyverse)

# Select players for comparison (top 6 by PDQ)
comparison_players <- metric_data %>%
  arrange(desc(PDQ)) %>%
  head(6)

# Prepare metrics for radar chart (normalize to 0-100 scale)
normalize_to_100 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)) * 100
}

radar_data <- metric_data %>%
  mutate(
    across(c(wOBA, ISO, BB_pct, K_pct, exit_velo, barrel_pct,
             hard_hit_pct, PDQ, contact_quality, approach_score),
           normalize_to_100)
  ) %>%
  filter(player_name %in% comparison_players$player_name) %>%
  select(
    player_name,
    `wOBA` = wOBA,
    `ISO` = ISO,
    `Walk Rate` = BB_pct,
    `K Rate (inv)` = K_pct,  # Inverted so higher is better
    `Exit Velo` = exit_velo,
    `Barrel%` = barrel_pct,
    `PDQ` = PDQ,
    `Contact Quality` = contact_quality,
    `Approach` = approach_score
  ) %>%
  mutate(
    `K Rate (inv)` = 100 - `K Rate (inv)`  # Invert K rate
  )

# Define metrics for radar axes
metrics <- c("wOBA", "ISO", "Walk Rate", "K Rate (inv)", "Exit Velo",
             "Barrel%", "PDQ", "Contact Quality", "Approach")

# Create radar chart
fig <- plot_ly(type = 'scatterpolar', mode = 'lines+markers')

# Color palette
colors <- c('#E74C3C', '#3498DB', '#2ECC71', '#F39C12', '#9B59B6', '#1ABC9C')

# Add trace for each player
for (i in 1:nrow(radar_data)) {
  player <- radar_data[i, ]
  values <- as.numeric(player[metrics])

  # Close the polygon by repeating first value
  values_closed <- c(values, values[1])
  theta_closed <- c(metrics, metrics[1])

  fig <- fig %>%
    add_trace(
      r = values_closed,
      theta = theta_closed,
      name = player$player_name,
      line = list(color = colors[i], width = 2.5),
      marker = list(size = 6, color = colors[i]),
      hovertemplate = paste(
        '<b>%{theta}</b><br>',
        'Value: %{r:.1f}<br>',
        '<extra>%{fullData.name}</extra>'
      )
    )
}

fig <- fig %>%
  layout(
    polar = list(
      radialaxis = list(
        visible = TRUE,
        range = c(0, 100),
        tickmode = 'linear',
        tick0 = 0,
        dtick = 20,
        gridcolor = '#E5E5E5',
        gridwidth = 1
      ),
      angularaxis = list(
        gridcolor = '#E5E5E5',
        gridwidth = 1.5
      ),
      bgcolor = '#FAFAFA'
    ),
    title = list(
      text = "Player Comparison: Multi-Metric Radar Chart",
      font = list(size = 16, family = "Arial")
    ),
    showlegend = TRUE,
    legend = list(
      orientation = 'v',
      x = 1.1,
      y = 0.5,
      bgcolor = 'rgba(255, 255, 255, 0.8)',
      bordercolor = '#CCCCCC',
      borderwidth = 1
    ),
    paper_bgcolor = 'white'
  )

fig

# To save:
# htmlwidgets::saveWidget(fig, "player_radar_comparison.html")

Python Implementation

import pandas as pd
import numpy as np
import plotly.graph_objects as go

# Select top 6 players by PDQ for comparison
df_comparison = df_metrics.nlargest(6, 'PDQ').copy()

# Normalize all metrics to 0-100 scale
def normalize_to_100(series):
    return (series - series.min()) / (series.max() - series.min()) * 100

metrics_to_normalize = [
    'wOBA', 'ISO', 'BB%', 'K%', 'Exit Velo', 'Barrel%',
    'Hard Hit%', 'PDQ', 'Contact Quality', 'Approach Score'
]

df_normalized = df_metrics[metrics_to_normalize].copy()
for col in metrics_to_normalize:
    df_normalized[col] = normalize_to_100(df_metrics[col])

# Get normalized values for comparison players
df_radar = df_normalized.loc[df_comparison.index].copy()
df_radar['player_name'] = df_comparison['player_name'].values

# Invert K% so higher is better
df_radar['K% (inv)'] = 100 - df_radar['K%']

# Select metrics for radar chart
radar_metrics = [
    'wOBA', 'ISO', 'BB%', 'K% (inv)', 'Exit Velo',
    'Barrel%', 'PDQ', 'Contact Quality', 'Approach Score'
]

# Rename for display
display_names = {
    'wOBA': 'wOBA',
    'ISO': 'ISO',
    'BB%': 'Walk Rate',
    'K% (inv)': 'K Rate (inv)',
    'Exit Velo': 'Exit Velo',
    'Barrel%': 'Barrel%',
    'PDQ': 'PDQ',
    'Contact Quality': 'Contact Quality',
    'Approach Score': 'Approach'
}

# Create figure
fig = go.Figure()

# Color palette
colors = ['#E74C3C', '#3498DB', '#2ECC71', '#F39C12', '#9B59B6', '#1ABC9C']

# Add trace for each player
for i, (idx, player) in enumerate(df_radar.iterrows()):
    player_name = player['player_name']
    values = player[radar_metrics].values.tolist()

    # Close the polygon
    values_closed = values + [values[0]]
    categories = [display_names[m] for m in radar_metrics] + [display_names[radar_metrics[0]]]

    fig.add_trace(go.Scatterpolar(
        r=values_closed,
        theta=categories,
        name=player_name,
        line=dict(color=colors[i], width=2.5),
        marker=dict(size=6, color=colors[i]),
        hovertemplate=(
            '<b>%{theta}</b><br>' +
            'Value: %{r:.1f}<br>' +
            '<extra>%{fullData.name}</extra>'
        )
    ))

# Update layout
fig.update_layout(
    polar=dict(
        radialaxis=dict(
            visible=True,
            range=[0, 100],
            tickmode='linear',
            tick0=0,
            dtick=20,
            gridcolor='#E5E5E5',
            gridwidth=1
        ),
        angularaxis=dict(
            gridcolor='#E5E5E5',
            gridwidth=1.5
        ),
        bgcolor='#FAFAFA'
    ),
    title=dict(
        text='Player Comparison: Multi-Metric Radar Chart',
        font=dict(size=16, family='Arial')
    ),
    showlegend=True,
    legend=dict(
        orientation='v',
        x=1.1,
        y=0.5,
        bgcolor='rgba(255, 255, 255, 0.8)',
        bordercolor='#CCCCCC',
        borderwidth=1
    ),
    paper_bgcolor='white',
    width=900,
    height=700
)

fig.show()

# To save:
# fig.write_html('player_radar_comparison.html')

Interpretation: Radar charts reveal player profiles at a glance through polygon shape and size. Large, symmetrical polygons indicate well-rounded players who excel across all dimensions, while irregular shapes highlight specialists with pronounced strengths and weaknesses. For example, a player whose polygon extends far outward in the Contact Quality and Exit Velo axes but contracts near the Walk Rate and Approach Score axes clearly profiles as a power-first hitter with plate discipline concerns. Interactive features enhance utility: hovering over vertices reveals exact normalized values, toggling players on and off clarifies head-to-head comparisons, and zooming isolates specific metrics.

When comparing prospect profiles to established major leaguers, scouts can identify which veteran players a prospect's profile most resembles. When evaluating trade targets, decision-makers can overlay their current roster needs (represented as an ideal profile) against available players to find best fits. The radar chart's visual immediacy makes it particularly effective in presentations to non-technical audiences like ownership or coaching staffs, who can grasp player strengths and weaknesses without parsing tables of numbers.

Interactive metric dashboards transform custom baseball metrics from static research outputs into dynamic decision-support tools. Correlation heatmaps validate metric design and reveal relationships that inform combined usage. Sortable, filterable leaderboards enable self-service exploration that accelerates analytical workflows. Radar charts provide intuitive multi-dimensional player comparisons that support scouting, trade evaluation, and roster construction decisions. Together, these interactive visualizations ensure custom metrics achieve their full potential value by making insights accessible, explorable, and actionable for diverse stakeholders across baseball operations.



10.7 Publishing Your Metric

After creating and validating your metric, sharing it with the baseball analytics community can generate feedback, build your reputation, and contribute to the field's advancement.

10.7.1 Documentation Requirements {#documentation}

Thorough documentation separates professional work from amateur experiments. Include:

1. Clear Definition: Explain exactly what your metric measures in 2-3 sentences. Example: "Plate Discipline Quality (PDQ) measures a hitter's overall plate discipline by combining chase avoidance, zone contact ability, and contact quality into a single 0-100 index scaled to league average."

2. Formula and Methodology: Publish the complete formula with component definitions and weights. Show sample calculations for 2-3 players. This allows others to reproduce your work.
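
As one illustration, a minimal R sketch of a documented sample calculation might look like the following. It assumes z-scored components, the 0.35 / 0.35 / 0.30 weights used elsewhere in this chapter, and a 100-centered scale like the simulated PDQ values shown earlier; the exact scaling and the player values are hypothetical.

library(tidyverse)

# Hypothetical z-scored components for two players (illustrative values only)
sample_players <- tibble(
  player         = c("Player A", "Player B"),
  z_chase_avoid  = c(1.2, -0.4),   # chase avoidance vs. league average
  z_zone_contact = c(0.5,  1.1),   # zone contact vs. league average
  z_damage       = c(0.8, -0.2)    # contact damage vs. league average
) %>%
  mutate(
    # Weighted combination, scaled so 100 = league average (assumed scaling)
    PDQ = 100 + 10 * (0.35 * z_chase_avoid +
                      0.35 * z_zone_contact +
                      0.30 * z_damage)
  )

sample_players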

3. Data Sources: Specify where data comes from (FanGraphs, Baseball Savant, etc.), what date ranges you used, and any filters applied (minimum PA, etc.).

4. Validation Results: Share correlations with existing metrics, year-to-year stability, and any predictive validity tests. Acknowledge limitations honestly.

5. Code Repository: Publish R and/or Python code on GitHub. Include:


  • Data acquisition functions

  • Calculation functions

  • Validation scripts

  • Example usage

6. Use Cases: Explain when to use your metric and when not to. What questions does it answer? What decisions could it inform?

10.7.2 Getting Feedback {#getting-feedback}

Multiple platforms exist for sharing baseball analytics work:

Twitter/X: The baseball analytics community is very active on Twitter. Share visualizations, interesting findings, and links to detailed writeups. Use hashtags like #BaseballAnalytics, #Sabermetrics, #MLB. Engage with others' work to build your network.

Reddit: Subreddits like r/Sabermetrics and r/baseball welcome analytical posts. The community provides substantive feedback but expects rigorous methodology. Be prepared for critical questions.

Baseball analytics blogs: Sites like FanGraphs Community Research, Baseball Prospectus, and personal blogs hosted on Medium or GitHub Pages allow long-form posts. These reach serious audiences and remain discoverable long-term.

SABR (Society for American Baseball Research): Present at regional chapter meetings or submit to the annual SABR Analytics Conference. This provides professional exposure and networking opportunities.

Academic journals: For particularly rigorous work, consider submitting to the Journal of Quantitative Analysis in Sports or similar publications.

What to expect from feedback:


  • Questions about methodology and assumptions

  • Suggestions for alternative formulations

  • Identification of edge cases you didn't consider

  • Requests for clarification

  • Both constructive criticism and occasional unconstructive criticism

Engage graciously with criticism. The best metric creators iterate based on community feedback rather than defending flawed initial versions.

10.7.3 Iterating Based on Criticism {#iterating-criticism}

Treat published metrics as version 1.0, not finished products. Common criticisms and how to address them:

"Your metric doesn't account for [X]":


  • Evaluate whether [X] is important for your metric's purpose

  • If yes, consider adding it as a component or adjustment

  • If no, clarify scope limitations

"The weights seem arbitrary":


  • Consider regression-based weights that optimize for some outcome

  • Run sensitivity analysis showing how results change with different weights (a minimal sketch follows this list)

  • If weights genuinely are judgment calls, defend them with baseball logic
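
A minimal sensitivity-analysis sketch: it assumes a pdq_data frame containing the three components (already standardized so the weights are comparable) and wOBA, as in the Exercise 3 starter code, and the alternative weight sets are arbitrary illustrations.

library(tidyverse)

# Candidate weight sets to compare against the published weights
weight_sets <- tribble(
  ~label,              ~w_chase, ~w_zone, ~w_damage,
  "Original",              0.35,    0.35,      0.30,
  "Damage-heavy",          0.25,    0.25,      0.50,
  "Discipline-heavy",      0.45,    0.45,      0.10
)

# For each weight set, rebuild the composite and correlate it with wOBA
sensitivity <- weight_sets %>%
  rowwise() %>%
  mutate(
    cor_with_wOBA = cor(
      w_chase  * pdq_data$chase_avoid +
      w_zone   * pdq_data$zone_contact +
      w_damage * pdq_data$damage,
      pdq_data$wOBA,
      use = "complete.obs"
    )
  ) %>%
  ungroup()

sensitivity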

"This correlates too highly with [existing metric]":


  • Calculate incremental predictive value beyond the existing metric (see the sketch after this list)

  • Either demonstrate unique value or acknowledge the metric might be redundant

  • Sometimes redundancy is okay if your metric is more interpretable
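
One way to check incremental value is a nested-regression comparison. The sketch below assumes a hypothetical validation_data frame with one row per hitter: year-one wOBA and PDQ (wOBA_y1, PDQ_y1) and the following season's wOBA (wOBA_y2).

# Does PDQ add predictive value beyond wOBA itself?
base_model <- lm(wOBA_y2 ~ wOBA_y1, data = validation_data)
full_model <- lm(wOBA_y2 ~ wOBA_y1 + PDQ_y1, data = validation_data)

# Compare explained variance with and without the new metric
summary(base_model)$adj.r.squared
summary(full_model)$adj.r.squared

# Formal test of whether the added PDQ term improves the fit
anova(base_model, full_model)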

"Small sample size makes this unreliable":


  • Calculate and publish minimum sample size requirements

  • Show split-half reliability and year-to-year correlations (see the sketch after this list)

  • Consider adding confidence intervals to estimates
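
A split-half reliability sketch, assuming a hypothetical pa_data frame with one row per plate appearance (player_id plus pa_metric, the per-PA value your metric averages over): each player's PAs are split at random into two halves, and the halves are correlated across players.

library(tidyverse)

set.seed(123)

split_half <- pa_data %>%
  group_by(player_id) %>%
  # Randomly assign each PA to half A or half B within each player
  mutate(half = sample(rep(c("A", "B"), length.out = n()))) %>%
  group_by(player_id, half) %>%
  summarise(metric = mean(pa_metric), n_pa = n(), .groups = "drop") %>%
  pivot_wider(names_from = half, values_from = c(metric, n_pa)) %>%
  filter(n_pa_A >= 50, n_pa_B >= 50)   # require a reasonable sample in each half

# High correlation between halves suggests the metric stabilizes quickly
cor(split_half$metric_A, split_half$metric_B, use = "complete.obs")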

"This doesn't predict future performance":


  • Clarify whether your metric is descriptive or predictive

  • If meant to be predictive, test it properly with out-of-sample validation

  • If descriptive, explain why descriptive value matters

Document changes between versions. "Version 2.0" posts that explain what you changed based on feedback demonstrate intellectual honesty and analytical maturity.



10.8 Exercises

Apply the chapter's concepts through these hands-on exercises.

Exercise 1: Design a New Metric {#exercise-1}

Task: Design a "Baserunning Aggression Index" that measures how aggressive baserunners are on the bases (taking extra bases, stealing attempts, advancing on outs).

Requirements:


  1. Define the specific question your metric answers

  2. Identify 3-4 components that capture baserunning aggression

  3. Specify data sources for each component

  4. Propose a formula combining components

  5. Describe how you would validate the metric

  6. Identify at least 2 limitations

Deliverable: A 1-2 page written proposal following the format from Section 10.2.

Exercise 2: Validate an Existing Metric {#exercise-2}

Task: Choose an existing metric (wOBA, FIP, Barrel%, or any other) and rigorously validate it using the techniques from this chapter.

Requirements:


  1. Calculate the metric for 2023 and 2024 qualified hitters or pitchers

  2. Compute correlations with related metrics

  3. Test year-to-year stability

  4. Test predictive validity (does 2023 metric predict 2024 performance?)

  5. Examine distribution (histogram, summary statistics)

  6. Write 2-3 paragraphs interpreting validation results

Deliverable: R or Python script with validation code and written interpretation.
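
A possible R starter sketch for the year-to-year stability check (Requirement 3): it reuses the fg_batter_leaders() call pattern from Exercise 3, uses wOBA as a stand-in for whichever metric you choose, and assumes the playerid and wOBA column names of older baseballr leaderboard output (adjust to your version).

library(baseballr)
library(tidyverse)

# Pull qualified hitters for consecutive seasons
batters_2023 <- fg_batter_leaders(2023, 2023, qual = 300)
batters_2024 <- fg_batter_leaders(2024, 2024, qual = 300)

# Join the two seasons on player ID and correlate the chosen metric
stability <- batters_2023 %>%
  select(playerid, metric_2023 = wOBA) %>%
  inner_join(
    batters_2024 %>% select(playerid, metric_2024 = wOBA),
    by = "playerid"
  )

cor(stability$metric_2023, stability$metric_2024, use = "complete.obs")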

Exercise 3: Optimize Weights {#exercise-3}

Task: Revisit the Plate Discipline Quality (PDQ) metric from Section 10.3. Currently it uses these weights:


  • Chase Avoidance: 0.35

  • Zone Contact: 0.35

  • Damage: 0.30

Use regression to find optimal weights that best predict wOBA.

Requirements:


  1. Load batting data for 2024 (or 2023-2024 combined)

  2. Calculate the three PDQ components

  3. Run a regression: wOBA ~ chase_avoid + zone_contact + damage

  4. Extract regression coefficients and standardize them to sum to 1.0

  5. Calculate PDQ with both original and optimized weights

  6. Compare: Which version correlates more strongly with wOBA? With wRC+?

  7. Compare year-to-year stability if using multi-year data

Deliverable: Code showing the optimization process and a brief paragraph discussing whether the optimized weights are better.

R starter code:

# Load data
batters <- fg_batter_leaders(2024, 2024, qual = 300)

# Calculate components (from Section 10.3)
# ... (code from earlier)

# Run regression
model <- lm(wOBA ~ chase_avoid + zone_contact + damage, data = pdq_data)
summary(model)

# Extract and normalize coefficients
coefs <- coef(model)[-1]  # Exclude intercept
weights_optimized <- coefs / sum(coefs)
print(weights_optimized)

Python starter code:

from sklearn.linear_model import LinearRegression

# Load and prepare data
# ... (code from earlier)

# Run regression
X = pdq_data[['chase_avoid', 'zone_contact', 'damage']]
y = pdq_data['wOBA']

model = LinearRegression()
model.fit(X, y)

# Extract and normalize coefficients
coefs = model.coef_
weights_optimized = coefs / coefs.sum()
print(dict(zip(['chase_avoid', 'zone_contact', 'damage'], weights_optimized)))

Exercise 4: Replicate xwOBA {#exercise-4}

Task: Expected wOBA (xwOBA) estimates what a player's wOBA should have been based on quality of contact (exit velocity and launch angle). Replicate a simplified version of xwOBA.

Requirements:


  1. Acquire Statcast batted ball data for 2024 (use baseballr::statcast_search() or pybaseball.statcast())

  2. For each batted ball, retrieve exit velocity, launch angle, and actual outcome (wOBA value)

  3. Build a model predicting wOBA value from exit velocity and launch angle:



  • Option A: Use binning (create exit velo × launch angle bins, calculate average wOBA in each bin; a starter sketch appears after the hints below)

  • Option B: Use logistic or linear regression

  • Option C: Use a more sophisticated model (random forest, GAM)



  4. Apply your model to predict xwOBA for each batted ball

  5. Aggregate to player level: calculate actual wOBA and xwOBA for each player

  6. Compare your xwOBA to Baseball Savant's official xwOBA

  7. Calculate correlation and mean absolute error

Deliverable: Code implementing xwOBA replication and a brief analysis of how your version compares to the official version.

Hints:


  • Filter for batted balls with exit velocity > 0

  • Typical launch angle range for modeling: -90° to +90°

  • Player-level aggregation requires minimum batted balls (suggest 100+)

  • Official xwOBA uses more sophisticated modeling than simple regression
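
A minimal R starter sketch for Option A (binning), assuming a Statcast batted-ball frame bb with the standard launch_speed, launch_angle, and woba_value columns; the bin widths are arbitrary choices you should tune.

library(tidyverse)

# Build an empirical xwOBA lookup: mean wOBA value within each EV x LA bin
xwoba_lookup <- bb %>%
  filter(launch_speed > 0, between(launch_angle, -90, 90)) %>%
  mutate(
    ev_bin = cut(launch_speed, breaks = seq(0, 125, by = 5)),
    la_bin = cut(launch_angle, breaks = seq(-90, 90, by = 10))
  ) %>%
  group_by(ev_bin, la_bin) %>%
  summarise(xwoba_bin = mean(woba_value, na.rm = TRUE),
            n_bbe = n(), .groups = "drop")

# Score each batted ball with its bin's expected wOBA value
bb_scored <- bb %>%
  mutate(
    ev_bin = cut(launch_speed, breaks = seq(0, 125, by = 5)),
    la_bin = cut(launch_angle, breaks = seq(-90, 90, by = 10))
  ) %>%
  left_join(xwoba_lookup, by = c("ev_bin", "la_bin"))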

Reflection questions:


  • How close did you get to official xwOBA?

  • What factors might explain differences?

  • Which players show the biggest difference between actual wOBA and xwOBA (underperformers and overperformers)?


Summary {#summary}

This chapter equipped you with the complete toolkit for creating original baseball metrics:

Philosophy: Good metrics are clear, reproducible, meaningful, actionable, and stable. They answer specific questions that existing metrics don't adequately address.

Framework: Systematic metric creation follows six steps: define the question, conceptualize components, gather and prepare data, build the formula, validate rigorously, and iterate based on results.

Case studies demonstrated three different metric types:


  • PDQ (Plate Discipline Quality): A composite index combining multiple components with weighted averaging

  • Pitch Arsenal Score: A usage-weighted rating system for pitcher arsenals

  • Clutch Performance Index: A leverage-based comparison metric with important limitations

Publication and iteration: Sharing metrics with proper documentation invites valuable feedback. The best metrics evolve through multiple versions based on community criticism and validation testing.

As you develop your own metrics, remember that not every idea will succeed. Many attempted metrics prove redundant with existing statistics, unstable due to small samples, or insufficiently predictive to be useful. That's part of the process. The skills you develop—formulating analytical questions, manipulating data, designing formulas, validating rigorously—transfer across all metric creation attempts, successful or not.

The baseball analytics field advances through collective effort. Today's innovative metric becomes tomorrow's standard tool, enabling the next generation of analysts to ask even more sophisticated questions. Your contributions to this ongoing process, whether through original metrics that gain wide adoption or through incremental improvements to existing approaches, help deepen our understanding of baseball's beautiful complexity.


Chapter 10 Complete. Next: Chapter 11 - Machine Learning Applications in Baseball Analytics

