Pitch Data Variables
| Variable | Description | Units | Typical Range |
|---|---|---|---|
| release_speed | Velocity of pitch at release point | mph | 70-105 |
| release_pos_x | Horizontal release point (catcher's perspective) | feet | -3.5 to 3.5 |
| release_pos_y | Distance from home plate at release | feet | 54-56 |
| release_pos_z | Height at release point | feet | 4.5-7.0 |
| release_spin_rate | Spin rate of pitch | rpm | 1500-3200 |
| release_extension | Distance pitcher releases ball in front of rubber | feet | 5.0-7.5 |
| pfx_x | Horizontal movement (catcher's perspective) | feet (multiply by 12 for inches) | -2.0 to 2.0 |
| pfx_z | Vertical movement (vs. gravity) | feet (multiply by 12 for inches) | -2.0 to 2.0 |
| plate_x | Horizontal location at home plate | feet | -2.5 to 2.5 |
| plate_z | Vertical location at home plate | feet | 0.5 to 5.0 |
| vx0, vy0, vz0 | Velocity components at y = 50 feet from home plate | ft/s | varies |
| ax, ay, az | Acceleration components | ft/s² | varies |
| zone | Strike zone location (1-9 in zone, 11-14 out) | categorical | 1-14 |
| spin_axis | Direction of spin (clock face from catcher view) | degrees | 0-360 |
| effective_speed | Perceived velocity accounting for extension | mph | 70-105 |
Batted Ball Variables
| Variable | Description | Units | Typical Range |
|---|---|---|---|
| launch_speed | Exit velocity off bat | mph | 50-120 |
| launch_angle | Vertical angle off bat | degrees | -90 to 90 |
| hit_distance_sc | Projected hit distance | feet | 0-500 |
| hc_x | Hit coordinate X (horizontal) | chart coordinate units (not feet) | 0-250 |
| hc_y | Hit coordinate Y (depth) | chart coordinate units (not feet) | 0-250 |
| barrel | Whether batted ball is a barrel (1=yes, 0=no) | binary | 0-1 |
| estimated_woba_using_speedangle | Expected wOBA based on exit velo/launch angle | wOBA scale | 0.000-2.000 |
| estimated_ba_using_speedangle | Expected batting average | BA scale | 0.000-1.000 |
| hit_location | Fielding position number | categorical | 1-9 |
| bb_type | Batted ball type (ground_ball, line_drive, fly_ball, popup) | categorical | - |
Sprint Speed Variables
| Variable | Description | Units | Typical Range |
|---|---|---|---|
| sprint_speed | Feet per second on competitive plays | ft/s | 24-31 |
| hp_to_1b | Home to first time on ground balls | seconds | 4.0-5.0 |
Fielding Variables
| Variable | Description | Units | Typical Range |
|---|---|---|---|
| outs_above_average | Outs above average (OAA): plays made relative to an average fielder | outs | -15 to 15 |
| catch_probability | Likelihood of making the catch | percentage | 0-100 |
| arm_strength | Average throw velocity | mph | 75-95 |
| exchange_time | Time to release ball after catch | seconds | 0.5-1.5 |
Batting Metrics
| Metric | Full Name | Formula/Description | League Average |
|---|---|---|---|
| wOBA | Weighted On-Base Average | (0.69×BB + 0.72×HBP + 0.88×1B + 1.24×2B + 1.56×3B + 1.95×HR) / (AB + BB - IBB + SF + HBP) | ~0.320 |
| wRC+ | Weighted Runs Created Plus | ((wRAA/PA + lgR/PA) + (lgR/PA - park factor×lgR/PA)) / (lgwRC/PA) × 100 | 100 |
| ISO | Isolated Power | SLG - AVG or (2B + 2×3B + 3×HR) / AB | ~0.145 |
| BABIP | Batting Average on Balls in Play | (H - HR) / (AB - K - HR + SF) | ~0.300 |
| K% | Strikeout Rate | K / PA × 100 | ~22% |
| BB% | Walk Rate | BB / PA × 100 | ~8.5% |
| AVG | Batting Average | H / AB | ~0.245 |
| OBP | On-Base Percentage | (H + BB + HBP) / (AB + BB + HBP + SF) | ~0.320 |
| SLG | Slugging Percentage | (1B + 2×2B + 3×3B + 4×HR) / AB | ~0.400 |
| OPS | On-Base Plus Slugging | OBP + SLG | ~0.720 |
| wRAA | Weighted Runs Above Average | ((wOBA - lgwOBA) / wOBA scale) × PA | 0 |
| Off | Offensive Runs Above Average | Batting Runs + Base Running Runs | 0 |
| Def | Defensive Runs Above Average | From UZR or OAA | 0 |
| WAR | Wins Above Replacement | (Batting + BaseRunning + Fielding + Positional + Replacement + League) / runsPerWin | ~2.0 |
| SB% | Stolen Base Percentage | SB / (SB + CS) | ~75% |
| Spd | Speed Score | Composite of SB, 3B, runs scored | 4.5 |
| O-Swing% | Outside Zone Swing Rate | Swings at pitches outside zone / Pitches outside zone | ~30% |
| Z-Swing% | Inside Zone Swing Rate | Swings at pitches in zone / Pitches in zone | ~68% |
| SwStr% | Swinging Strike Rate | Swinging strikes / Pitches | ~11% |
| Contact% | Contact Rate | (Swings - Whiffs) / Swings | ~77% |
| O-Contact% | Out-of-Zone Contact Rate | Contact on pitches outside zone / Swings outside zone | ~60% |
| Z-Contact% | In-Zone Contact Rate | Contact on pitches in zone / Swings in zone | ~85% |
Pitching Metrics
| Metric | Full Name | Formula/Description | League Average |
|---|---|---|---|
| ERA | Earned Run Average | (ER × 9) / IP | ~4.00 |
| FIP | Fielding Independent Pitching | ((13×HR + 3×BB - 2×K) / IP) + FIP constant | ~4.00 |
| xFIP | Expected FIP | FIP using league average HR/FB rate | ~4.00 |
| SIERA | Skill-Interactive ERA | Complex formula accounting for GB, K, BB | ~4.00 |
| WHIP | Walks + Hits per Inning | (BB + H) / IP | ~1.30 |
| K/9 | Strikeouts per 9 Innings | (K × 9) / IP | ~8.5 |
| BB/9 | Walks per 9 Innings | (BB × 9) / IP | ~3.0 |
| HR/9 | Home Runs per 9 Innings | (HR × 9) / IP | ~1.2 |
| K/BB | Strikeout-to-Walk Ratio | K / BB | ~2.8 |
| K% | Strikeout Rate | K / BF × 100 | ~22% |
| BB% | Walk Rate | BB / BF × 100 | ~8% |
| HR/FB | Home Run per Fly Ball Rate | HR / FB | ~13% |
| GB% | Ground Ball Rate | GB / Balls in Play | ~45% |
| FB% | Fly Ball Rate | FB / Balls in Play | ~35% |
| LD% | Line Drive Rate | LD / Balls in Play | ~20% |
| IFFB% | Infield Fly Ball Rate | IFFB / FB | ~10% |
| LOB% | Left on Base Percentage | (H + BB + HBP - R) / (H + BB + HBP - 1.4×HR) | ~72% |
| BABIP | Batting Average on Balls in Play | (H - HR) / (AB - K - HR + SF), using opponents' at bats | ~0.300 |
| ERA- | ERA Minus | (ERA / lgERA) × 100 | 100 |
| FIP- | FIP Minus | (FIP / lgFIP) × 100 | 100 |
| xFIP- | xFIP Minus | (xFIP / lgxFIP) × 100 | 100 |
| Soft% | Soft Contact Rate | Soft contact / Balls in play | ~18% |
| Med% | Medium Contact Rate | Medium contact / Balls in play | ~50% |
| Hard% | Hard Contact Rate | Hard contact / Balls in play | ~32% |
| kwERA | K, BB, ERA estimator | ERA estimator using K and BB rates | ~4.00 |
WAR Components
| Component | Description | Scale |
|---|---|---|
| Batting | Runs from hitting performance | Runs above average |
| Baserunning | Runs from baserunning (UBR, wSB) | Runs above average |
| Fielding | Defensive runs saved (UZR/OAA) | Runs above average |
| Positional | Value adjustment by position | Runs per 162 games |
| Replacement | Runs above replacement level | ~20 runs per 600 PA |
| RAA | Runs Above Average | Sum of above components |
| RAR | Runs Above Replacement | RAA + Replacement |
| WAR | Wins Above Replacement | RAR / Runs per Win (~10) |
Data Acquisition
| Task | R (baseballr) | Python (pybaseball) |
|---|---|---|
| Statcast pitch data | statcast_search(start_date, end_date) | statcast(start_dt, end_dt) |
| Statcast batter data | statcast_search_batters(start_date, end_date, batterid) | statcast_batter(start_dt, end_dt, player_id) |
| Statcast pitcher data | statcast_search_pitchers(start_date, end_date, pitcherid) | statcast_pitcher(start_dt, end_dt, player_id) |
| FanGraphs leaderboard | fg_batter_leaders(x, y) | batting_stats(start_season, end_season) |
| Player lookup | playerid_lookup(last_name, first_name) | playerid_lookup(last, first) |
| Lahman database | Lahman::Batting | lahman.batting() |
| Team data | bref_team_results(team, year) | schedule_and_record(year, team) |
Data Manipulation
| Task | R (tidyverse) | Python (pandas) |
|---|---|---|
| Filter rows | filter(df, condition) | df[df['col'] > value] or df.query('col > value') |
| Select columns | select(df, col1, col2) | df[['col1', 'col2']] |
| Create new column | mutate(df, new_col = expression) | df['new_col'] = expression |
| Group and summarize | group_by(df, col) %>% summarize(mean = mean(val)) | df.groupby('col')['val'].mean() |
| Arrange/Sort | arrange(df, col) | df.sort_values('col') |
| Join tables | left_join(df1, df2, by = "key") | df1.merge(df2, on='key', how='left') |
| Rename columns | rename(df, new_name = old_name) | df.rename(columns={'old': 'new'}) |
| Drop NA values | drop_na(df) or na.omit(df) | df.dropna() |
| Fill NA values | replace_na(df, list(col = value)) | df.fillna(value) |
| Pipe operations | df %>% operation1() %>% operation2() | df.pipe(operation1).pipe(operation2) |
Statistical Analysis
| Task | R | Python |
|---|---|---|
| Linear regression | lm(y ~ x, data = df) | from sklearn.linear_model import LinearRegression |
| Summary statistics | summary(df) | df.describe() |
| Correlation | cor(df$x, df$y) | df[['x', 'y']].corr() |
| T-test | t.test(x, y) | from scipy.stats import ttest_ind |
| Random forest | randomForest::randomForest(y ~ ., data) | from sklearn.ensemble import RandomForestRegressor |
| Cross-validation | caret::train(..., method = "cv") | from sklearn.model_selection import cross_val_score |
Visualization
| Task | R (ggplot2) | Python (matplotlib/seaborn) |
|---|---|---|
| Scatter plot | ggplot(df, aes(x, y)) + geom_point() | plt.scatter(df['x'], df['y']) or sns.scatterplot(data=df, x='x', y='y') |
| Line plot | ggplot(df, aes(x, y)) + geom_line() | plt.plot(df['x'], df['y']) or sns.lineplot(data=df, x='x', y='y') |
| Histogram | ggplot(df, aes(x)) + geom_histogram() | plt.hist(df['x']) or sns.histplot(df['x']) |
| Box plot | ggplot(df, aes(x, y)) + geom_boxplot() | sns.boxplot(data=df, x='x', y='y') |
| Density plot | ggplot(df, aes(x)) + geom_density() | sns.kdeplot(df['x']) |
| Faceting | ... + facet_wrap(~variable) | sns.FacetGrid(df, col='variable') |
| Color by group | ... + aes(color = group) | ... hue='group' |
| Add trend line | ... + geom_smooth(method = "lm") | sns.regplot(...) |
| Team Name | Common Abbr | Statcast | FanGraphs | Baseball Reference | Retrosheet |
|---|---|---|---|---|---|
| Arizona Diamondbacks | ARI | AZ | ARI | ARI | ARI |
| Atlanta Braves | ATL | ATL | ATL | ATL | ATL |
| Baltimore Orioles | BAL | BAL | BAL | BAL | BAL |
| Boston Red Sox | BOS | BOS | BOS | BOS | BOS |
| Chicago Cubs | CHC | CHC | CHC | CHC | CHN |
| Chicago White Sox | CWS | CWS | CHW | CHW | CHA |
| Cincinnati Reds | CIN | CIN | CIN | CIN | CIN |
| Cleveland Guardians | CLE | CLE | CLE | CLE | CLE |
| Colorado Rockies | COL | COL | COL | COL | COL |
| Detroit Tigers | DET | DET | DET | DET | DET |
| Houston Astros | HOU | HOU | HOU | HOU | HOU |
| Kansas City Royals | KC | KC | KCR | KCR | KCA |
| Los Angeles Angels | LAA | LAA | LAA | LAA | ANA |
| Los Angeles Dodgers | LAD | LAD | LAD | LAD | LAN |
| Miami Marlins | MIA | MIA | MIA | MIA | MIA |
| Milwaukee Brewers | MIL | MIL | MIL | MIL | MIL |
| Minnesota Twins | MIN | MIN | MIN | MIN | MIN |
| New York Mets | NYM | NYM | NYM | NYM | NYN |
| New York Yankees | NYY | NYY | NYY | NYY | NYA |
| Oakland Athletics | OAK | OAK | OAK | OAK | OAK |
| Philadelphia Phillies | PHI | PHI | PHI | PHI | PHI |
| Pittsburgh Pirates | PIT | PIT | PIT | PIT | PIT |
| San Diego Padres | SD | SD | SDP | SDP | SDN |
| San Francisco Giants | SF | SF | SFG | SFG | SFN |
| Seattle Mariners | SEA | SEA | SEA | SEA | SEA |
| St. Louis Cardinals | STL | STL | STL | STL | SLN |
| Tampa Bay Rays | TB | TB | TBR | TBR | TBA |
| Texas Rangers | TEX | TEX | TEX | TEX | TEX |
| Toronto Blue Jays | TOR | TOR | TOR | TOR | TOR |
| Washington Nationals | WSH | WSH | WSN | WSN | WAS |
Historical Teams:
- Montreal Expos (1969-2004): MON
- Cleveland Indians (pre-2022): Same as Guardians
- Anaheim Angels/California Angels: Various ANA/CAL codes
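Because the sources disagree on a handful of codes, it can help to normalize abbreviations before joining data. The sketch below maps the Retrosheet column of the table above onto the common abbreviation; the dictionary name and helper function are illustrative assumptions, and only the codes that differ are listed.

```python
# Retrosheet codes that differ from the common abbreviation (from the table above).
RETRO_TO_COMMON = {
    "CHN": "CHC", "CHA": "CWS", "KCA": "KC", "ANA": "LAA",
    "LAN": "LAD", "NYN": "NYM", "NYA": "NYY", "SDN": "SD",
    "SFN": "SF", "SLN": "STL", "TBA": "TB", "WAS": "WSH",
}


def to_common(retro_code: str) -> str:
    """Return the common abbreviation; codes not in the map are already identical."""
    return RETRO_TO_COMMON.get(retro_code.upper(), retro_code.upper())


print(to_common("NYA"))
print(to_common("BOS"))
```

The same pattern extends to the FanGraphs and Baseball Reference columns by adding one dictionary per source.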
ID Systems Overview
| System | Source | Format | Example | Usage |
|---|---|---|---|---|
| MLBAM ID | MLB Advanced Media | Numeric (6 digits) | 545361 | Statcast, MLB API, Baseball Savant |
| FanGraphs ID | FanGraphs | Numeric | 10155 | FanGraphs leaderboards and data |
| Baseball Reference ID | Baseball-Reference.com | Alphanumeric | troutmi01 | BBRef pages and Lahman database |
| Retrosheet ID | Retrosheet | 8-char code | troum001 | Historical play-by-play data |
| NFBC ID | National Fantasy Baseball Championship | Numeric | 2056 | DFS and fantasy platforms |
| CBS ID | CBS Sports | Numeric | 1892094 | Fantasy baseball |
Finding Player IDs
Method 1: Using playerid_lookup()
R (baseballr):
playerid_lookup("Trout", "Mike")
Python (pybaseball):
from pybaseball import playerid_lookup
playerid_lookup('trout', 'mike')
Returns all ID types for the player.
Method 2: Chadwick Bureau Database
Download the Chadwick Bureau register (maintained by Baseball Databank):
- URL: https://github.com/chadwickbureau/register
- Contains comprehensive ID mappings
R:
library(readr)
ids <- read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")
Python:
import pandas as pd
ids = pd.read_csv("https://raw.githubusercontent.com/chadwickbureau/register/master/data/people.csv")
Method 3: Baseball Savant Lookup
Navigate to Baseball Savant player page:
- URL format: https://baseballsavant.mlb.com/savant-player/player-name-MLBAMID
- MLBAM ID is in the URL
Method 4: FanGraphs Lookup
Player page URL contains FanGraphs ID:
- Format: https://www.fangraphs.com/players/player-name/FGID
ID Conversion Table Structure
| namelast | namefirst | mlbamid | fgid | bbrefid | retroid |
|---|---|---|---|---|---|
| Trout | Mike | 545361 | 10155 | troutmi01 | troum001 |
| Judge | Aaron | 592450 | 15640 | judgeaa01 | judga001 |
| Ohtani | Shohei | 660271 | 19755 | ohtansh01 | ohtas001 |
| Betts | Mookie | 605141 | 13611 | bettsmo01 | bettm001 |
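A table with this structure works as a crosswalk when joining data keyed on different ID systems. The following is a minimal sketch using pandas: the ID rows are copied from the table above, while the `statcast_summary` frame and its values are made up purely for illustration.

```python
import pandas as pd

# ID rows copied from the conversion table above (subset of columns).
id_map = pd.DataFrame(
    {
        "namelast": ["Trout", "Judge", "Ohtani", "Betts"],
        "namefirst": ["Mike", "Aaron", "Shohei", "Mookie"],
        "mlbamid": [545361, 592450, 660271, 605141],
        "fgid": [10155, 15640, 19755, 13611],
    }
)

# Hypothetical Statcast-style summary keyed by MLBAM ID (values are invented).
statcast_summary = pd.DataFrame(
    {"batter": [545361, 660271], "avg_launch_speed": [91.0, 94.0]}
)

# Attach FanGraphs IDs; validate guards against duplicate IDs in the crosswalk.
merged = statcast_summary.merge(
    id_map, left_on="batter", right_on="mlbamid", how="left", validate="many_to_one"
)
print(merged[["namelast", "batter", "fgid", "avg_launch_speed"]])
```

Using `how="left"` keeps every Statcast row even when a player is missing from the crosswalk, which makes unmatched IDs easy to spot as missing values.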
Common Issues and Solutions
Problem: Player has same name as another player
- Solution: Use birth year or debut year to distinguish
- Example: call playerid_lookup("Cruz", "Nelson"), then filter the returned rows on the debut-year column (for instance, mlb_played_first equal to 2005)
Problem: Player name has special characters
- Solution: Use standard English alphabet approximation
- José becomes Jose, Édgar becomes Edgar
Problem: Need to update IDs for current season
- Solution: Re-download Chadwick register at start of each season
- New players typically added within 1-2 weeks of MLB debut
Problem: Historical player not in Statcast (pre-2015)
- Solution: Use Retrosheet ID or BBRef ID for historical analysis
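Two of the fixes above are easy to script. The sketch below strips accents with the standard library and picks one player out of a multi-row lookup result by debut year. The `candidates` frame is a hypothetical stand-in for a lookup result: the column names follow pybaseball's output, but the ID values are placeholders, not real IDs.

```python
import unicodedata

import pandas as pd


def strip_accents(name: str) -> str:
    """Approximate a name with unaccented letters (José -> Jose)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical lookup result with two same-named players (placeholder IDs).
candidates = pd.DataFrame(
    {
        "name_last": ["cruz", "cruz"],
        "name_first": ["nelson", "nelson"],
        "key_mlbam": [111111, 222222],
        "mlb_played_first": [1997.0, 2005.0],
    }
)

# Keep only the player whose first MLB season matches the one you want.
match = candidates[candidates["mlb_played_first"] == 2005]

print(strip_accents("José"), strip_accents("Édgar"))
print(match["key_mlbam"].tolist())
```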
Quick Reference Code
Get all IDs for a player (Python):
from pybaseball import playerid_lookup
player = playerid_lookup('last', 'first')
mlbam_id = player['key_mlbam'].values[0]
fg_id = player['key_fangraphs'].values[0]
bbref_id = player['key_bbref'].values[0]
Get all IDs for a player (R):
player <- playerid_lookup("Last", "First")
mlbam_id <- player$mlbam_id
fg_id <- player$fangraphs_id
bbref_id <- player$bbref_id
Appendix B: Data Sources Directory
Baseball Savant (Statcast)
- URL: https://baseballsavant.mlb.com
- Data Available: Pitch-level Statcast data (2015-present), exit velocity, launch angle, sprint speed, catch probability, Outs Above Average
- Access Methods:
- Web interface with search tools
- CSV download
- API via R/Python packages
- Best For: Advanced metrics, pitch tracking, batted ball data, player movement
- Update Frequency: Real-time during games, verified data next day
- Historical Coverage: 2015-present (Statcast era)
FanGraphs
- URL: https://www.fangraphs.com
- Data Available: Advanced statistics, WAR, FIP, wOBA, leaderboards, projections (Steamer, ZiPS, THE BAT), batted ball profiles, plate discipline metrics
- Access Methods:
- Web interface with leaderboards
- CSV export
- R: baseballr::fg_batter_leaders()
- Python: pybaseball.batting_stats()
- Best For: Season statistics, advanced metrics, splits, player pages
- Update Frequency: Daily
- Historical Coverage: 1871-present (varies by metric)
Baseball Reference
- URL: https://www.baseball-reference.com
- Data Available: Comprehensive statistics, play index, game logs, splits, traditional and advanced stats, biographical data
- Access Methods:
- Web interface
- Play Index tool
- Stathead subscription for advanced queries
- HTML scraping (respect ToS)
- Best For: Historical research, game logs, comprehensive player pages, team statistics
- Update Frequency: Daily
- Historical Coverage: 1871-present (complete)
Lahman Database
- URL: http://www.seanlahman.com/baseball-archive/statistics/
- GitHub: https://github.com/chadwickbureau/baseballdatabank
- Data Available: Core baseball statistics in relational database format, batting, pitching, fielding, awards, salaries, teams
- Access Methods:
- Direct download (CSV, SQL)
- R: library(Lahman)
- Python: from pybaseball import lahman
- Best For: Statistical analysis, database practice, historical research, teaching
- Update Frequency: Annually
- Historical Coverage: 1871-2023 (2024 coming soon)
Retrosheet
- URL: https://www.retrosheet.org
- Data Available: Play-by-play data, game logs, box scores, historical events
- Access Methods:
- Direct download (event files)
- R: retrosheet package
- Python: Parsers available on GitHub
- Best For: Play-by-play analysis, historical game reconstruction, situational stats
- Update Frequency: Seasonally
- Historical Coverage: 1871-present (play-by-play from 1918+)
Baseball Prospectus
- URL: https://www.baseballprospectus.com
- Data Available: PECOTA projections, DRA (Deserved Run Average), WARP, articles, prospect rankings
- Access Methods:
- Web interface (subscription required for full access)
- Limited free leaderboards
- Best For: PECOTA projections, DRA pitching metric, prospect analysis
- Update Frequency: Daily
- Historical Coverage: Varies by metric
Brooks Baseball
- URL: http://www.brooksbaseball.net
- Data Available: PITCHf/x data, pitch classifications, movement charts, player cards
- Access Methods:
- Web interface with visualization tools
- Data underlying site is from MLB's PITCHf/x system
- Best For: Pitch movement visualization, arsenal analysis (2007-2019 PITCHf/x era)
- Update Frequency: Daily (archived; Statcast is current)
- Historical Coverage: 2007-2019 (PITCHf/x era)
Chadwick Bureau
- URL: https://github.com/chadwickbureau
- Data Available: Player ID register, historical databases, data standards
- Access Methods:
- GitHub repository downloads
- CSV files
- Best For: Player ID mapping, data integration
- Update Frequency: Regular updates
- Historical Coverage: Comprehensive historical player registry
pybaseball / baseballr
- Documentation:
- pybaseball: https://github.com/jldbc/pybaseball
- baseballr: https://billpetti.github.io/baseballr/
- Data Available: Wrapper functions for most major data sources
- Access Methods:
- Python: pip install pybaseball
- R: install.packages("baseballr")
- Best For: Programmatic data access, reproducible research
- Update Frequency: Packages updated regularly
- Historical Coverage: Depends on underlying data source
MLB Stats API
- URL: https://statsapi.mlb.com/api/v1/
- Documentation: https://github.com/toddrob99/MLB-StatsAPI
- Data Available: Real-time game data, standings, schedules, player stats, team information
- Access Methods:
- Direct API calls
- Python: MLB-StatsAPI package
- R: Functions in baseballr
- Best For: Real-time data, game feeds, current season information
- Update Frequency: Real-time
- Historical Coverage: Several years (varies by endpoint)
Savant CSV Downloads
- URL: https://baseballsavant.mlb.com/statcast_search
- Data Available: Customizable Statcast data exports
- Access Methods:
- Web interface with filters
- Direct CSV download (max ~50,000 rows)
- Best For: Quick data pulls, specific date ranges or players
- Update Frequency: Daily
- Historical Coverage: 2015-present
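Because a single export is capped at roughly 50,000 rows, a long date range usually has to be split into smaller windows and downloaded piece by piece. The helper below is a minimal sketch of that splitting step only; the function name and the 7-day default window are assumptions chosen for illustration, not a documented Savant limit.

```python
from datetime import date, timedelta


def date_chunks(start: date, end: date, days: int = 7):
    """Yield (chunk_start, chunk_end) pairs that cover start..end inclusive."""
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)


for chunk_start, chunk_end in date_chunks(date(2024, 4, 1), date(2024, 4, 16)):
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Each pair can then be used as the start and end date of one search or one wrapper call, with the resulting pieces concatenated afterwards.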
Synergy Sports
- URL: https://www.synergysports.com
- Data: Advanced video analysis, shot charts, defensive positioning
- Access: Requires institutional or professional subscription
- Cost: Enterprise pricing (contact for quote)
- Best For: Video-based analytics, professional scouting
STATS LLC / SportRadar
- URL: https://www.sportradar.com
- Data: Real-time feeds, official MLB data, advanced metrics
- Access: Commercial licensing
- Cost: Enterprise pricing
- Best For: Commercial applications, media companies, betting platforms
Baseball Prospectus Premium
- URL: https://www.baseballprospectus.com
- Data: Full PECOTA, DRA, cFIP, articles, tools
- Access: Annual subscription
- Cost: $40-90/year (varies by tier)
- Best For: PECOTA projections, premium articles, prospect analysis
FanGraphs Auction Calculator / Projections
- URL: https://www.fangraphs.com
- Data: RotoGraphs content, advanced tools
- Access: Free with premium features available
- Cost: Free (donations encouraged)
- Best For: Fantasy baseball analysis
Sports Info Solutions (SIS)
- URL: https://www.sportsinfosolutions.com
- Data: Proprietary defensive metrics, video tracking, shift data
- Access: Requires MLB team affiliation or media partnership
- Cost: Enterprise pricing
- Best For: Professional teams, advanced defensive analytics
TrackMan Baseball
- URL: https://trackmansports.com
- Data: Ball flight tracking, spin efficiency, approach angle
- Access: Requires purchase of TrackMan system or team affiliation
- Cost: $30,000+ for hardware
- Best For: Player development, pitch design, amateur/college ball
Rapsodo Baseball
- URL: https://rapsodo.com/baseball
- Data: Pitch tracking, hitting metrics (more affordable than TrackMan)
- Access: Purchase required
- Cost: $3,000-5,000 for unit
- Best For: Amateur teams, individual player development
Academic Databases
- Source: Various university libraries
- Access: Requires academic affiliation or library access
- Data: Historical datasets, research data repositories
- Best For: Academic research, thesis work
MLB Stats API
- Base URL: https://statsapi.mlb.com/api/v1/
- Documentation: https://github.com/toddrob99/MLB-StatsAPI/wiki/Endpoints
- Python Wrapper: pip install MLB-StatsAPI
- Key Endpoints:
- /people/{id} - Player information
- /game/{gamePk}/feed/live - Live game data
- /schedule - Game schedules
- /standings - Current standings
- /teams/{id}/roster - Team rosters
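The endpoint paths listed above combine with the base URL by simple string substitution. The sketch below only builds URLs and makes no network request; the dictionary keys and the `build_url` helper are illustrative names, and any query parameters a real request might need are not shown.

```python
BASE_URL = "https://statsapi.mlb.com/api/v1"

# Path templates copied from the endpoint list above; the keys are made-up labels.
ENDPOINTS = {
    "person": "/people/{id}",
    "live_feed": "/game/{gamePk}/feed/live",
    "schedule": "/schedule",
    "standings": "/standings",
    "roster": "/teams/{id}/roster",
}


def build_url(name: str, **params) -> str:
    """Fill in the path template for the named endpoint."""
    return BASE_URL + ENDPOINTS[name].format(**params)


print(build_url("person", id=545361))
print(build_url("schedule"))
```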
Baseball Savant (Statcast)
- No official public API - Use pybaseball/baseballr wrappers
- Search Tool: https://baseballsavant.mlb.com/statcast_search
- R: baseballr::statcast_search()
- Python: pybaseball.statcast()
FanGraphs
- No official public API - Use web scraping or package wrappers
- R: baseballr::fg_batter_leaders(), fg_pitcher_leaders()
- Python: pybaseball.batting_stats(), pitching_stats()
- Leaderboards: https://www.fangraphs.com/leaders.aspx
pybaseball Documentation
- GitHub: https://github.com/jldbc/pybaseball
- Docs: https://github.com/jldbc/pybaseball/tree/master/docs
- Key Functions:
- statcast() - Statcast data
- batting_stats() - FanGraphs batting
- pitching_stats() - FanGraphs pitching
- lahman.batting() - Lahman database
- playerid_lookup() - Player ID search
baseballr Documentation
- CRAN: https://cran.r-project.org/package=baseballr
- Site: https://billpetti.github.io/baseballr/
- GitHub: https://github.com/BillPetti/baseballr
- Key Functions:
- statcast_search() - Statcast data
- fg_batter_leaders() - FanGraphs batting
- playerid_lookup() - Player ID search
- bref_daily_batter() - Baseball Reference data
Retrosheet Event Files
- Format Documentation: https://www.retrosheet.org/eventfile.htm
- Download: https://www.retrosheet.org/game.htm
- Parser: Chadwick tools (https://github.com/chadwickbureau/chadwick)
- R Package: retrosheet
Lahman Database
- R Package: install.packages("Lahman")
- Documentation: https://cran.r-project.org/web/packages/Lahman/Lahman.pdf
- GitHub: https://github.com/chadwickbureau/baseballdatabank
- Tables: Batting, Pitching, Fielding, Teams, People, Salaries, Awards
Appendix C: Statistical Formulas
Batting Statistics
Batting Average (AVG)
AVG = H / AB
Where:
- H = Hits
- AB = At Bats
On-Base Percentage (OBP)
OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
Where:
- BB = Walks (Base on Balls)
- HBP = Hit by Pitch
- SF = Sacrifice Flies
Slugging Percentage (SLG)
SLG = (1B + 2×2B + 3×3B + 4×HR) / AB
Or:
SLG = TB / AB
Where:
- TB = Total Bases = 1B + 2×2B + 3×3B + 4×HR
On-Base Plus Slugging (OPS)
OPS = OBP + SLG
Isolated Power (ISO)
ISO = SLG - AVG
Or:
ISO = (2B + 2×3B + 3×HR) / AB
Batting Average on Balls in Play (BABIP)
BABIP = (H - HR) / (AB - K - HR + SF)
Where:
- K = Strikeouts
Strikeout Rate (K%)
K% = K / PA × 100
Where:
- PA = Plate Appearances = AB + BB + HBP + SF + SH
Walk Rate (BB%)
BB% = BB / PA × 100
Stolen Base Percentage (SB%)
SB% = SB / (SB + CS) × 100
Where:
- SB = Stolen Bases
- CS = Caught Stealing
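The batting formulas above translate directly into code. The sketch below computes the rate statistics from one season line; the `BattingLine` class, its field names, and the sample totals are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    ab: int
    h: int
    doubles: int
    triples: int
    hr: int
    bb: int
    hbp: int
    sf: int
    k: int


def batting_rates(b: BattingLine) -> dict:
    """Apply the AVG, OBP, SLG, OPS, ISO and BABIP formulas listed above."""
    singles = b.h - b.doubles - b.triples - b.hr
    total_bases = singles + 2 * b.doubles + 3 * b.triples + 4 * b.hr
    avg = b.h / b.ab
    obp = (b.h + b.bb + b.hbp) / (b.ab + b.bb + b.hbp + b.sf)
    slg = total_bases / b.ab
    babip = (b.h - b.hr) / (b.ab - b.k - b.hr + b.sf)
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg,
            "ISO": slg - avg, "BABIP": babip}


# Invented season totals for illustration.
line = BattingLine(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60, hbp=5, sf=5, k=100)
for stat, value in batting_rates(line).items():
    print(f"{stat}: {value:.3f}")
```

For this line, ISO computed as SLG - AVG agrees with the alternative form (2B + 2×3B + 3×HR) / AB, which is a quick sanity check on the inputs.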
Pitching Statistics
Earned Run Average (ERA)
ERA = (ER × 9) / IP
Where:
- ER = Earned Runs
- IP = Innings Pitched
Walks and Hits per Inning Pitched (WHIP)
WHIP = (BB + H) / IP
Strikeouts per 9 Innings (K/9)
K/9 = (K × 9) / IP
Walks per 9 Innings (BB/9)
BB/9 = (BB × 9) / IP
Home Runs per 9 Innings (HR/9)
HR/9 = (HR × 9) / IP
Strikeout-to-Walk Ratio (K/BB)
K/BB = K / BB
Win-Loss Percentage (W-L%)
W-L% = W / (W + L)
Opponent Batting Average (OBA)
OBA = H / AB
(From pitcher's perspective)
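The pitching formulas above can be sketched the same way. One practical detail: innings are conventionally written with .1 and .2 meaning one and two outs, so they need converting to true thirds before dividing. The function names and sample totals below are hypothetical.

```python
def ip_to_decimal(ip_text: str) -> float:
    """Convert box-score innings such as '6.2' (6 innings, 2 outs) to 6.667."""
    whole, _, outs = ip_text.partition(".")
    outs_recorded = int(outs) if outs else 0
    if outs_recorded not in (0, 1, 2):
        raise ValueError("the digit after the point must be 0, 1 or 2 outs")
    return int(whole) + outs_recorded / 3


def pitching_rates(ip: float, er: int, h: int, bb: int, k: int, hr: int) -> dict:
    """Apply the ERA, WHIP, K/9, BB/9, HR/9 and K/BB formulas listed above."""
    return {
        "ERA": er * 9 / ip,
        "WHIP": (bb + h) / ip,
        "K/9": k * 9 / ip,
        "BB/9": bb * 9 / ip,
        "HR/9": hr * 9 / ip,
        "K/BB": k / bb,
    }


# Invented season totals for illustration.
rates = pitching_rates(ip=ip_to_decimal("180.0"), er=70, h=160, bb=50, k=200, hr=20)
for stat, value in rates.items():
    print(f"{stat}: {value:.2f}")
```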
Weighted On-Base Average (wOBA)
Formula (2024 weights):
wOBA = (0.69×uBB + 0.72×HBP + 0.88×1B + 1.24×2B + 1.56×3B + 1.95×HR) /
(AB + BB - IBB + SF + HBP)
Where:
- uBB = Unintentional walks = BB - IBB
- BB = Walks (all walks, including intentional)
- HBP = Hit by Pitch
- 1B = Singles = H - 2B - 3B - HR
- IBB = Intentional Walks
Note: Linear weights change annually based on run environment. Historical weights available at: https://www.fangraphs.com/guts.aspx
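wOBA Scale Conversion:
wOBA is scaled so that the league-average wOBA matches the league-average OBP. The wOBA scale (wOBA_scale) is the multiplier applied to the raw linear weights to achieve this, so dividing a wOBA difference by it converts that difference back into runs per plate appearance.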
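As a worked example of the wOBA formula, the sketch below uses the 2024 weights quoted above and counts only unintentional walks in the numerator. The function name and the sample season totals are hypothetical, and weights for other seasons would need to be substituted.

```python
def woba(ab, h, doubles, triples, hr, bb, ibb, hbp, sf):
    """wOBA with the 2024 weights quoted above; walks exclude intentional ones."""
    singles = h - doubles - triples - hr
    unintentional_bb = bb - ibb
    numerator = (
        0.69 * unintentional_bb
        + 0.72 * hbp
        + 0.88 * singles
        + 1.24 * doubles
        + 1.56 * triples
        + 1.95 * hr
    )
    denominator = ab + bb - ibb + sf + hbp
    return numerator / denominator


# Invented season totals for illustration.
value = woba(ab=500, h=150, doubles=30, triples=5, hr=25, bb=60, ibb=5, hbp=5, sf=5)
print(f"wOBA: {value:.3f}")
```

Here the numerator sums to 214.5 and the denominator to 565, giving a wOBA of about .380.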
Weighted Runs Above Average (wRAA)
wRAA = ((wOBA - lgwOBA) / wOBA_scale) × PA
Where:
- lgwOBA = League average wOBA
- wOBA_scale = Typically ~1.15-1.25 (varies by year)
- PA = Plate Appearances
Weighted Runs Created (wRC)
wRC = ((wOBA - lgwOBA) / wOBA_scale + lgR/PA) × PA
Where:
- lgR/PA = League runs per plate appearance
Weighted Runs Created Plus (wRC+)
wRC+ = ((wRAA/PA + lgR/PA) + (lgR/PA - park_factor × lgR/PA)) / (lgwRC/PA) × 100
Simplified (ignoring the park adjustment):
wRC+ = ((wRC / PA) / (lgwRC / PA)) × 100
Where:
- lgwRC = League weighted runs created
- 100 = League average
- >100 = Above average
- <100 = Below average
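The wRAA, wRC, and wRC+ formulas chain together, which a short sketch makes concrete. All input values below are invented for illustration. Two assumptions are worth flagging: the park factor is expressed as a ratio (1.00 = neutral), and the league wRC per PA is set equal to league runs per PA.

```python
def wraa(woba, lg_woba, woba_scale, pa):
    """Runs above average implied by a wOBA difference."""
    return (woba - lg_woba) / woba_scale * pa


def wrc(woba, lg_woba, woba_scale, lg_r_per_pa, pa):
    """Total runs created: wRAA plus the league-average runs for those PA."""
    return ((woba - lg_woba) / woba_scale + lg_r_per_pa) * pa


def wrc_plus(woba, lg_woba, woba_scale, lg_r_per_pa, lg_wrc_per_pa, pa, park_factor=1.0):
    """wRC+ following the full formula above; park_factor is a ratio (assumption)."""
    wraa_per_pa = wraa(woba, lg_woba, woba_scale, pa) / pa
    numerator = (wraa_per_pa + lg_r_per_pa) + (lg_r_per_pa - park_factor * lg_r_per_pa)
    return numerator / lg_wrc_per_pa * 100


# Invented inputs: a .380 wOBA hitter in a .320 league over 600 PA.
args = dict(woba=0.380, lg_woba=0.320, woba_scale=1.20, lg_r_per_pa=0.115, pa=600)
print(round(wraa(0.380, 0.320, 1.20, 600), 1))
print(round(wrc(**args), 1))
print(round(wrc_plus(lg_wrc_per_pa=0.115, **args), 1))
print(round(wrc_plus(lg_wrc_per_pa=0.115, park_factor=1.05, **args), 1))
```

With a neutral park the full formula reduces to the player's wRC per PA divided by the league's wRC per PA, times 100; a hitter-friendly park (factor above 1.00) lowers the result.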
Linear Weights by Event (Run Values)
Approximate run values for common events (2024):
| Event | Run Value |
|---|---|
| Home Run | +1.40 |
| Triple | +1.02 |
| Double | +0.75 |
| Single | +0.46 |
| Walk (BB) | +0.31 |
| HBP | +0.33 |
| Stolen Base | +0.18 |
| Caught Stealing | -0.43 |
| Out | -0.27 |
| Strikeout | -0.30 |
| GIDP | -0.43 |
Run Expectancy Matrix is used to derive these values by comparing the run expectancy before and after each event type.
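The before-and-after comparison can be sketched for a single play. The run expectancy values below are illustrative placeholders, not figures from an actual season matrix, and the state encoding (base string plus outs) is an assumption made for this example.

```python
# Illustrative run expectancy by (bases, outs); real values come from season data.
RUN_EXPECTANCY = {
    ("___", 0): 0.48,  # bases empty, no outs
    ("1__", 0): 0.86,  # runner on first, no outs
    ("___", 1): 0.26,  # bases empty, one out
}


def event_run_value(start_state, end_state, runs_scored=0):
    """Run value of one play: change in run expectancy plus runs that scored."""
    return RUN_EXPECTANCY[end_state] - RUN_EXPECTANCY[start_state] + runs_scored


single = event_run_value(("___", 0), ("1__", 0))
out = event_run_value(("___", 0), ("___", 1))
solo_home_run = event_run_value(("___", 0), ("___", 0), runs_scored=1)
print(round(single, 2), round(out, 2), round(solo_home_run, 2))
```

A linear weight for an event type is then the average of these per-play values over every occurrence of that event, across all base-out states.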
Basic Park Factor Formula
Single Season Park Factor:
PF = 100 × ((Home_RS + Home_RA) / Home_G) /
((Away_RS + Away_RA) / Away_G)
Where:
- RS = Runs Scored
- RA = Runs Allowed
- G = Games
Multi-Year Park Factor:
PF = 100 × (Σ(Home_RS + Home_RA) / Σ(Home_G)) /
(Σ(Away_RS + Away_RA) / Σ(Away_G))
Typically calculated over 3-5 years for stability.
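A minimal sketch of both park factor calculations follows. The result is multiplied by 100 so that a neutral park comes out at 100, matching the interpretation scale used in this appendix; the function names and the run totals are invented for illustration.

```python
def park_factor(home_rs, home_ra, home_g, away_rs, away_ra, away_g):
    """Single-season park factor on a scale where 100 is neutral."""
    home_rate = (home_rs + home_ra) / home_g
    away_rate = (away_rs + away_ra) / away_g
    return 100 * home_rate / away_rate


def multi_year_park_factor(seasons):
    """Pool several seasons; each item is (home_rs, home_ra, home_g, away_rs, away_ra, away_g)."""
    totals = [sum(values) for values in zip(*seasons)]
    return park_factor(*totals)


# Invented totals: 780 combined runs in 81 home games vs 690 in 81 road games.
one_year = park_factor(400, 380, 81, 350, 340, 81)
three_year = multi_year_park_factor([
    (400, 380, 81, 350, 340, 81),
    (390, 370, 81, 360, 350, 81),
    (380, 390, 81, 355, 345, 81),
])
print(round(one_year, 1), round(three_year, 1))
```

Pooling the totals before dividing, rather than averaging yearly factors, weights each season by the number of games played.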
Park Factor Interpretation
- PF = 100: Neutral park
- PF > 100: Hitter-friendly (e.g., PF = 105 means 5% more runs than average)
- PF < 100: Pitcher-friendly (e.g., PF = 95 means 5% fewer runs than average)
Adjusted Statistics Using Park Factors
Adjusted ERA (ERA+):
ERA+ = (lgERA / ERA) × 100
Published ERA+ values use a league ERA that has been adjusted for the pitcher's home park.
Park-Adjusted OPS (OPS+):
OPS+ = 100 × (OBP/lgOBP + SLG/lgSLG - 1)
Approximate park adjustment (dividing, so that a hitter-friendly park lowers the adjusted value):
OPS+_park = OPS+ / (PF / 100)
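The two index formulas are simple enough to check by hand. This sketch applies them without any park adjustment; the function names and input values are made up for illustration.

```python
def era_plus(era, lg_era):
    """ERA+ before park adjustment: above 100 means better than league."""
    return lg_era / era * 100


def ops_plus(obp, slg, lg_obp, lg_slg):
    """OPS+ before park adjustment, from the OBP and SLG ratios to league."""
    return 100 * (obp / lg_obp + slg / lg_slg - 1)


print(round(era_plus(era=3.20, lg_era=4.00), 1))
print(round(ops_plus(obp=0.360, slg=0.480, lg_obp=0.320, lg_slg=0.400), 1))
```

Note the inversion in ERA+: because a lower ERA is better, the league value goes in the numerator so that higher is still better on both indexes.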
FanGraphs Park Factors
FanGraphs uses separate park factors for:
- Runs (overall)
- Home Runs
- Hits
- Walks
- Strikeouts
Available at: https://www.fangraphs.com/guts.aspx?type=pf
Notable Park Factors (2024 approximations)
| Park | Factor | Type |
|---|---|---|
| Coors Field (COL) | 115 | Extreme hitter |
| Great American Ball Park (CIN) | 105 | Hitter-friendly |
| Fenway Park (BOS) | 102 | Slight hitter |
| Oracle Park (SF) | 95 | Pitcher-friendly |
| T-Mobile Park (SEA) | 95 | Pitcher-friendly |
Position Player WAR (FanGraphs fWAR)
Formula:
WAR = (Batting + Baserunning + Fielding + Positional + Replacement + League) / Runs_per_Win
Typically:
Runs_per_Win ≈ 10
WAR Components (Position Players)
1. Batting Runs
Batting = wRAA + League_Adjustment + Park_Adjustment
Where wRAA is calculated from wOBA (see C.2).
2. Baserunning Runs
Baserunning = UBR + wSB
Where:
- UBR = Ultimate Base Running (runs from taking extra bases, avoiding outs)
- wSB = Weighted Stolen Bases = (SB × 0.2) - (CS × 0.43)
3. Fielding Runs
Fielding = UZR or OAA
Where:
- UZR = Ultimate Zone Rating
- OAA = Outs Above Average (Statcast metric)
4. Positional Adjustment
Runs added based on position difficulty (per 162 games):
| Position | Adjustment |
|---|---|
| C (Catcher) | +12.5 |
| SS (Shortstop) | +7.5 |
| 2B (Second Base) | +2.5 |
| CF (Center Field) | +2.5 |
| 3B (Third Base) | +2.5 |
| LF (Left Field) | -7.5 |
| RF (Right Field) | -7.5 |
| 1B (First Base) | -12.5 |
| DH (Designated Hitter) | -17.5 |
5. Replacement Level
Replacement ≈ PA × (20 / 600)
Replacement level is approximately 20 runs below average per 600 PA, so a full-time player is credited with about +20 runs.
6. League Adjustment
Adjusts for differences between AL and NL (primarily DH rule).
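The six components combine by simple addition before the conversion to wins. The sketch below assembles them for a hypothetical shortstop; the component values are invented, only a subset of the positional adjustments from the table above is included, and the replacement credit uses the rough 20 runs per 600 PA figure.

```python
# Subset of the per-162-game positional adjustments from the table above.
POSITIONAL_ADJ_PER_162 = {"C": 12.5, "SS": 7.5, "CF": 2.5, "1B": -12.5, "DH": -17.5}


def position_player_war(batting, baserunning, fielding, position, games, pa,
                        league_adj=0.0, runs_per_win=10.0):
    """Sum the run components and convert to wins (simplified sketch)."""
    positional = POSITIONAL_ADJ_PER_162[position] * games / 162
    replacement = 20 * pa / 600  # rough replacement-level credit
    runs_above_replacement = (
        batting + baserunning + fielding + positional + replacement + league_adj
    )
    return runs_above_replacement / runs_per_win


# Invented component values for a shortstop with 150 games and 600 PA.
war = position_player_war(batting=25.0, baserunning=2.0, fielding=5.0,
                          position="SS", games=150, pa=600)
print(round(war, 2))
```

Prorating the positional adjustment by games played keeps part-time players from receiving a full season's adjustment.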
Pitcher WAR (FanGraphs fWAR)
Formula (simplified):
WAR = (Pitching_RAA + Replacement) / Runs_per_Win
With a league adjustment:
WAR = (Pitching_RAA + Replacement + League_Adjustment) / Runs_per_Win
Pitching Runs Above Average:
Pitching_RAA = IP × (lgFIP - FIP) / 9
FIP (Fielding Independent Pitching):
FIP = ((13×HR + 3×BB - 2×K) / IP) + FIP_constant
Where:
- FIP_constant ≈ 3.10 (varies by year to set league average to ERA)
Replacement Level for Pitchers:
Replacement = IP / 9 × 2.0
Approximately 2 runs per 9 IP below average.
Expected WAR Components
xFIP (Expected FIP):
xFIP = ((13×(FB × lgHR/FB_rate) + 3×BB - 2×K) / IP) + FIP_constant
Where:
- FB = Fly Balls
- lgHR/FB_rate = League average HR/FB rate (~13%)
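FIP, xFIP, and pitching runs above average follow directly from the formulas above. In this sketch the FIP constant (3.10), the league HR/FB rate (13%), and the league FIP (4.00) are the approximate values quoted in this appendix, and the pitcher's totals are invented.

```python
def fip(hr, bb, k, ip, fip_constant=3.10):
    """Fielding Independent Pitching from home runs, walks and strikeouts."""
    return (13 * hr + 3 * bb - 2 * k) / ip + fip_constant


def xfip(fb, bb, k, ip, lg_hr_per_fb=0.13, fip_constant=3.10):
    """FIP with actual home runs replaced by fly balls times the league HR/FB rate."""
    expected_hr = fb * lg_hr_per_fb
    return (13 * expected_hr + 3 * bb - 2 * k) / ip + fip_constant


def pitching_raa(ip, pitcher_fip, lg_fip):
    """Runs above average implied by the gap between league FIP and pitcher FIP."""
    return ip * (lg_fip - pitcher_fip) / 9


# Invented season: 180 IP, 20 HR, 50 BB, 200 K, 180 fly balls.
season_fip = fip(hr=20, bb=50, k=200, ip=180)
season_xfip = xfip(fb=180, bb=50, k=200, ip=180)
print(round(season_fip, 2), round(season_xfip, 2))
print(round(pitching_raa(180, season_fip, lg_fip=4.00), 1))
```

For this pitcher xFIP comes out higher than FIP because 20 home runs on 180 fly balls is below the assumed league rate, so xFIP expects some regression.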
SIERA (Skill-Interactive ERA):
Complex formula accounting for:
- Strikeouts
- Walks
- Ground balls
- Interaction terms
Full formula available at: https://www.fangraphs.com/library/pitching/siera/
Baseball Reference WAR (bWAR/rWAR)
Different from FanGraphs (uses different components):
Position Players:
- Uses Defensive Runs Saved (DRS) instead of UZR
- Uses Runs Above Average (RAA) from Batting Runs instead of wRAA
- Different positional adjustments
Pitchers:
- Uses Runs Allowed (RA9) based on actual runs instead of FIP
- Includes adjustment for defense quality
WAR Interpretation
| WAR | Player Value |
|---|---|
| 8+ | MVP Candidate |
| 5-8 | All-Star |
| 2-5 | Starter |
| 0-2 | Reserve |
| <0 | Below Replacement |
League Average:
- Position Players: ~2.0 WAR per 600 PA
- Pitchers: ~2.0 WAR per 200 IP
Appendix D: Glossary of Terms
A
At Bat (AB): A plate appearance that results in a hit, error, fielder's choice, or out, excluding walks, hit-by-pitches, sacrifice bunts, sacrifice flies, and catcher interference.
Arm Strength: Statcast metric measuring the average velocity of throws by a fielder, measured in miles per hour (mph).
Average Exit Velocity: The mean speed at which a batter's batted balls leave the bat, as measured by Statcast across all tracked batted ball events.
B
BABIP (Batting Average on Balls in Play): The batting average on balls put into play, calculated as (H - HR) / (AB - K - HR + SF). League average is typically around .300.
Barrel: A Statcast classification for batted balls with the ideal combination of exit velocity and launch angle. A barrel requires an exit velocity of at least 98 mph; at 98 mph the qualifying launch angles are 26-30 degrees, and that range widens as exit velocity increases.
Base Running Runs (BsR): Runs added or lost due to baserunning ability, including stolen bases, caught stealing, and taking extra bases.
Baseball Savant: MLB's official Statcast data website, providing access to pitch-level tracking data from 2015-present.
BB% (Walk Rate): Walks per plate appearance, calculated as BB / PA × 100. League average is approximately 8-9%.
BB/9 (Walks per 9 Innings): Number of walks a pitcher allows per nine innings, calculated as (BB × 9) / IP.
C
Catch Probability: Statcast metric estimating the likelihood of a fielder making a catch based on distance needed to travel, time available, and direction.
Chase Rate (O-Swing%): Percentage of pitches outside the strike zone at which a batter swings, calculated as swings outside zone / pitches outside zone.
Contact Rate (Contact%): Percentage of swings on which a batter makes contact, calculated as (Swings - Whiffs) / Swings.
Cutter (FC): A fastball variant that breaks slightly toward the pitcher's glove side, combining characteristics of a fastball and slider.
D
Defensive Runs Saved (DRS): A defensive metric quantifying runs prevented or allowed compared to an average fielder at the same position, used in Baseball Reference's WAR calculation.
DRA (Deserved Run Average): Baseball Prospectus's proprietary pitching metric that estimates a pitcher's true run prevention ability based on batted ball and pitch characteristics.
E
ERA (Earned Run Average): The mean of earned runs allowed per nine innings, calculated as (ER × 9) / IP. League average varies but typically around 4.00.
ERA- (ERA Minus): ERA scaled so that 100 is league average, with lower being better. Calculated as (ERA / lgERA) × 100.
Exchange Time: Statcast metric measuring the time between when a fielder catches the ball and releases it on a throw, measured in seconds.
Exit Velocity (EV): The speed at which the ball leaves the bat after contact, measured in miles per hour by Statcast. League average is approximately 88-89 mph.
Expected Batting Average (xBA): Statcast metric estimating batting average based on exit velocity and launch angle of batted balls, removing the influence of luck and defense.
Expected Weighted On-Base Average (xwOBA): Statcast metric estimating wOBA based on quality of contact (exit velocity and launch angle) rather than actual outcomes.
F
FB% (Fly Ball Rate): Percentage of batted balls classified as fly balls, calculated as FB / Balls in Play.
FIP (Fielding Independent Pitching): A pitching metric that estimates ERA based only on events the pitcher can control: strikeouts, walks, hit-by-pitches, and home runs. Calculated as ((13×HR + 3×(BB + HBP) - 2×K) / IP) + FIP constant.
FIP- (FIP Minus): FIP scaled so that 100 is league average, with lower being better. Calculated as (FIP / lgFIP) × 100.
Four-Seam Fastball (FF): The standard fastball grip with all four seams rotating, typically the pitcher's highest velocity pitch with backspin-induced rise.
G
GB% (Ground Ball Rate): Percentage of batted balls classified as ground balls, calculated as GB / Balls in Play. League average is approximately 45%.
GIDP (Grounded Into Double Play): A double play in which the batter hits a ground ball resulting in two outs.
H
Hard Hit Rate (Hard%): Percentage of batted balls classified as hard contact based on exit velocity, typically 95+ mph. League average is approximately 35-40%.
Home Run per Fly Ball Rate (HR/FB): Percentage of fly balls that become home runs. League average is typically 12-14%.
HR/9 (Home Runs per 9 Innings): Number of home runs a pitcher allows per nine innings, calculated as (HR × 9) / IP.
I
IFFB% (Infield Fly Ball Rate): Percentage of fly balls that are infield pop-ups, calculated as IFFB / FB.
Innings Pitched (IP): The number of outs recorded by a pitcher divided by three, with partial innings represented as .1 (1 out) or .2 (2 outs).
ISO (Isolated Power): A measure of raw power calculated as SLG - AVG, or (2B + 2×3B + 3×HR) / AB. League average is approximately .145.
K
K% (Strikeout Rate): Strikeouts per plate appearance, calculated as K / PA × 100. League average for batters is approximately 22-23%, for pitchers approximately 22%.
K/9 (Strikeouts per 9 Innings): Number of strikeouts a pitcher records per nine innings, calculated as (K × 9) / IP.
K/BB (Strikeout-to-Walk Ratio): Ratio of strikeouts to walks, calculated as K / BB. League average for pitchers is approximately 2.5-3.0.
kwERA (Strikeout-Walk ERA): ERA estimator based solely on strikeout and walk rates, useful as a quick-and-dirty ERA predictor.
L
Launch Angle: The vertical angle at which the ball leaves the bat, measured in degrees. Optimal angles for home runs are typically 25-35 degrees.
Launch Speed: See Exit Velocity.
LD% (Line Drive Rate): Percentage of batted balls classified as line drives, calculated as LD / Balls in Play. League average is approximately 20%.
Linear Weights: A system for assigning run values to different offensive events (singles, doubles, walks, etc.) based on their impact on run scoring.
LOB% (Left on Base Percentage): Percentage of baserunners who do not score, calculated as (H + BB + HBP - R) / (H + BB + HBP - 1.4×HR). League average for pitchers is approximately 72%.
M
MLBAM ID: The unique identifier assigned to players by MLB Advanced Media, used in Statcast and most modern baseball databases.
Multi-Year Park Factor: Park factor calculated over multiple seasons (typically 3-5) to increase stability and reduce single-season variance.
O
OAA (Outs Above Average): Statcast's primary defensive metric, quantifying the outs a fielder records above what an average fielder would record, based on the catch probability of each play. It can be converted to runs prevented.
OBP (On-Base Percentage): Frequency with which a batter reaches base, calculated as (H + BB + HBP) / (AB + BB + HBP + SF). League average is approximately .320.
Off (Offense): Total offensive value combining batting runs and baserunning runs in WAR calculations.
OPS (On-Base Plus Slugging): Sum of on-base percentage and slugging percentage. League average is approximately .720.
OPS+ (OPS Plus): OPS scaled and adjusted for park factors so that 100 is league average, with higher being better.
O-Swing% (Outside Zone Swing Rate): Percentage of pitches outside the strike zone at which a batter swings. League average is approximately 30%.
P
Park Factor (PF): A metric quantifying how much a ballpark influences run scoring, with 100 being neutral, >100 favoring hitters, and <100 favoring pitchers.
Pitch Framing: The skill of a catcher in receiving pitches to influence umpire strike calls, typically measured in runs added or strikes gained.
Plate Appearances (PA): Total number of completed batting appearances, calculated as AB + BB + HBP + SF + SH.
Pull Rate (Pull%): Percentage of batted balls hit to the pull field (left field for right-handed batters, right field for left-handed batters).
R
R² (R-Squared): Statistical measure indicating the proportion of variance in the dependent variable explained by the independent variable(s), ranging from 0 to 1.
RAA (Runs Above Average): The number of runs a player contributes above a league-average player, before adjustments for position and replacement level.
RAR (Runs Above Replacement): The number of runs a player contributes above a replacement-level player, the foundation of WAR calculations.
Regression: Statistical technique for modeling the relationship between variables, or the tendency for extreme performances to move toward the mean over time.
Replacement Level: The expected performance level of a player who could be readily acquired from the minor leagues or free agency, approximately 20 runs below average per 600 PA.
Run Expectancy: The average number of runs expected to score in an inning given a specific base-out state.
Run Value: The change in run expectancy caused by a specific event, used in linear weights calculations.
S
SIERA (Skill-Interactive ERA): An ERA estimator that accounts for strikeouts, walks, ground balls, and their interactions, designed to predict future ERA better than FIP.
Sinker (SI): A fastball variant with arm-side movement and downward break, designed to induce ground balls.
Slider (SL): A breaking pitch with tight spin and late break, typically breaking glove-side and down from the pitcher's perspective.
SLG (Slugging Percentage): Total bases per at-bat, calculated as (1B + 2×2B + 3×3B + 4×HR) / AB or TB / AB. League average is approximately .400.
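The basic rate stats defined in this glossary (AVG, OBP, SLG, OPS, ISO) all derive from the same counting stats, so they can be computed together. The sketch below is a minimal illustration; the function name and the sample stat line are hypothetical.

```python
def slash_line(singles, doubles, triples, hr, ab, bb, hbp, sf):
    """Return AVG, OBP, SLG, OPS and ISO from counting stats."""
    hits = singles + doubles + triples + hr
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    avg = hits / ab
    obp = (hits + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {
        "AVG": avg,
        "OBP": obp,
        "SLG": slg,
        "OPS": obp + slg,
        "ISO": slg - avg,  # equivalent to (2B + 2x3B + 3xHR) / AB
    }


# Hypothetical batter season
line = slash_line(singles=100, doubles=30, triples=5, hr=25,
                  ab=550, bb=60, hbp=5, sf=5)
print({name: round(value, 3) for name, value in line.items()})
```

Note that OBP uses a different denominator than AVG and SLG, which is why OPS is a convenience sum rather than a true rate.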
Spin Axis: The direction of a pitch's spin represented as a clock face from the catcher's perspective, measured in degrees from 0-360.
Spin Rate: The rate at which a baseball rotates while in flight, measured in revolutions per minute (rpm). Higher spin rates typically create more movement.
Sprint Speed: Statcast metric measuring a player's feet-per-second rate on competitive plays, such as home-to-first on ground balls. League average is approximately 27 ft/s.
Statcast: MLB's advanced tracking technology that measures player movements, pitch characteristics, and batted ball data using high-resolution cameras and radar.
SwStr% (Swinging Strike Rate): Percentage of all pitches that result in swinging strikes, calculated as swinging strikes / pitches. League average is approximately 11%.
T
Total Bases (TB): Sum of bases from all hits, calculated as 1B + 2×2B + 3×3B + 4×HR.
Two-Seam Fastball (FT): Fastball grip showing two seams in rotation, typically with more arm-side movement and slightly lower velocity than a four-seamer.
U
UBR (Ultimate Base Running): Metric quantifying baserunning value from all non-stolen base situations (taking extra bases, tagging up, avoiding outs), measured in runs.
UZR (Ultimate Zone Rating): Defensive metric dividing the field into zones and measuring a fielder's effectiveness in each zone compared to the league average, measured in runs.
V
Vertical Approach Angle (VAA): The angle at which a pitch crosses the plate relative to the ground, measured in degrees. Flatter angles (closer to 0) are generally harder to hit.
W
WAR (Wins Above Replacement): A comprehensive metric estimating the total number of wins a player contributes above a replacement-level player, combining batting, baserunning, fielding, and positional value.
WARP (Wins Above Replacement Player): Baseball Prospectus's version of WAR, using different component metrics and methodologies.
WHIP (Walks and Hits per Inning Pitched): Number of baserunners allowed per inning, calculated as (BB + H) / IP. League average is approximately 1.30.
Win Probability Added (WPA): The change in win probability caused by a specific event, summed to show a player's contribution to team wins.
wOBA (Weighted On-Base Average): An advanced batting metric that weights different offensive events (singles, walks, home runs, etc.) by their actual run value, scaled to OBP. League average is approximately .320.
wOBA Scale: The constant used to convert wOBA to runs, typically between 1.15-1.25, varying by season based on run environment.
wRAA (Weighted Runs Above Average): Runs created above average based on wOBA, calculated as ((wOBA - lgwOBA) / wOBA_scale) × PA.
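A quick worked example of the wRAA formula, as a sketch. The league wOBA of .320 and wOBA scale of 1.20 are illustrative values inside the ranges quoted in this glossary, not any specific season's constants.

```python
def wraa(woba, lg_woba, woba_scale, pa):
    """Weighted Runs Above Average: ((wOBA - lgwOBA) / wOBA_scale) x PA."""
    return (woba - lg_woba) / woba_scale * pa


# Hypothetical hitter: .370 wOBA over 600 PA in a .320 league
runs = wraa(woba=0.370, lg_woba=0.320, woba_scale=1.20, pa=600)
print(f"wRAA = {runs:.1f} runs")
```

A league-average hitter comes out at exactly zero by construction, and the result scales linearly with playing time.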
wRC (Weighted Runs Created): Total runs created by a batter, calculated using wOBA and scaled to the run environment.
wRC+ (Weighted Runs Created Plus): wRC scaled so that 100 is league average and adjusted for park factors, with higher being better. Accounts for both league and park context.
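wSB (Weighted Stolen Bases): Runs added from stolen bases. The core term is approximately (SB × 0.2) - (CS × 0.43), using linear weights for stolen-base success and failure; FanGraphs' full version also subtracts a league-average baseline scaled by the player's opportunities on first base.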
X
xBA (Expected Batting Average): Statcast metric estimating what a player's batting average should be based on exit velocity and launch angle, removing luck and defense.
xERA (Expected ERA): ERA estimator based on expected outcomes from quality of contact metrics like xwOBA, xBA, and barrel rate.
xFIP (Expected Fielding Independent Pitching): FIP calculated using league average HR/FB rate instead of actual HR/FB, designed to remove luck from home run outcomes.
xSLG (Expected Slugging Percentage): Statcast metric estimating slugging percentage based on batted ball quality (exit velocity and launch angle).
xwOBA (Expected Weighted On-Base Average): Statcast metric estimating wOBA based on exit velocity and launch angle rather than actual outcomes, designed to measure quality of contact independent of luck.
xwOBAcon (Expected wOBA on Contact): xwOBA calculated only on contacted balls, excluding strikeouts and walks, isolating pure batted ball quality.
Z
Zone% (Zone Rate): Percentage of pitches thrown inside the strike zone, calculated as pitches in zone / total pitches. League average is approximately 45%.
Z-Contact% (Zone Contact Rate): Percentage of swings on pitches in the zone that result in contact. League average is approximately 85%.
Z-Swing% (Zone Swing Rate): Percentage of pitches in the strike zone at which a batter swings. League average is approximately 68%.
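The plate discipline rates in this glossary (Zone%, Z-Swing%, O-Swing%, Z-Contact%) can all be derived from pitch-level records. The sketch below assumes each pitch is a dict carrying the `zone` code from the pitch data table (1-9 inside the zone, 11-14 outside) plus two hypothetical boolean fields, `swing` and `contact`; deriving those flags from raw pitch descriptions is left out.

```python
def plate_discipline(pitches):
    """Compute plate discipline percentages from a list of pitch dicts."""
    in_zone = [p for p in pitches if 1 <= p["zone"] <= 9]
    out_zone = [p for p in pitches if p["zone"] >= 11]
    z_swings = [p for p in in_zone if p["swing"]]
    o_swings = [p for p in out_zone if p["swing"]]

    def pct(numerator, denominator):
        # Guard against empty denominators in small samples
        return 100.0 * numerator / denominator if denominator else float("nan")

    return {
        "Zone%": pct(len(in_zone), len(pitches)),
        "Z-Swing%": pct(len(z_swings), len(in_zone)),
        "O-Swing%": pct(len(o_swings), len(out_zone)),
        "Z-Contact%": pct(sum(1 for p in z_swings if p["contact"]), len(z_swings)),
    }


# Hypothetical eight-pitch sample
sample = [
    {"zone": 5, "swing": True, "contact": True},
    {"zone": 2, "swing": True, "contact": False},
    {"zone": 8, "swing": False, "contact": False},
    {"zone": 13, "swing": True, "contact": False},
    {"zone": 11, "swing": False, "contact": False},
    {"zone": 14, "swing": False, "contact": False},
    {"zone": 4, "swing": True, "contact": True},
    {"zone": 12, "swing": False, "contact": False},
]
print(plate_discipline(sample))
```

Each rate uses a different denominator (all pitches, pitches in zone, pitches out of zone, swings in zone), which is the most common source of mismatches when comparing numbers across data providers.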
Additional Advanced Terms
Statcast & Tracking
Bat Speed: The speed of the bat at the point of contact or maximum speed during the swing, measured in miles per hour (mph). Elite bat speeds exceed 75 mph.
Swing Length: The distance the bat travels during a swing from start to contact, measured in feet. Shorter swings typically correlate with better contact rates.
Squared-Up Rate: Percentage of batted balls where the ball is hit near the sweet spot of the bat, resulting in optimal energy transfer.
Fast Swing Rate: Percentage of swings with bat speed above a threshold (typically 70+ mph), indicating power potential.
Blasts: Batted balls with both high exit velocity (95+ mph) and optimal launch angle (10-30 degrees), indicating well-struck line drives.
Competitive Swing: A swing on a pitch that could reasonably be put in play, used to measure swing decisions.
Pop Time: Time from when the pitch hits the catcher's mitt to when the throw reaches second base on a steal attempt, measured in seconds.
Lead Distance: The distance a baserunner takes from the base before a pitch is delivered, measured in feet.
Secondary Lead: Additional distance gained by a baserunner after the pitch is delivered but before a play is made.
Jump (Fielding): The initial burst of speed and direction by an outfielder when a ball is hit, measured in feet per second or reaction time.
Route Efficiency: Percentage comparing the actual distance traveled by an outfielder to the optimal straight-line route to catch location.
Hang Time: The time a fly ball spends in the air from contact to landing or catch, measured in seconds.
Perceived Velocity: The effective velocity of a pitch after accounting for release extension. A pitch released closer to the plate gives the batter less time to react and so appears faster than its measured release speed.
Spin Efficiency: Percentage of a pitch's total spin that contributes to movement, excluding gyroscopic spin.
Active Spin: The portion of spin that directly affects pitch movement, as opposed to gyroscopic (non-useful) spin.
Gyro Spin: Spin that rotates around the direction of travel (like a football spiral), providing little movement but affecting perception.
Seam-Shifted Wake (SSW): Movement generated by the orientation of seams disrupting airflow asymmetrically, independent of Magnus force.
Induced Vertical Break (IVB): Vertical movement of a pitch relative to a theoretical spinless pitch, with the effect of gravity removed, measured in inches.
Horizontal Break (HB): Horizontal movement of a pitch from the pitcher's perspective, measured in inches.
Approach Angle: The angle at which a pitch crosses the plate vertically, affecting how hitters perceive pitch location.
Tunnel Point: The location where two different pitch types are visually indistinguishable to the batter, typically 20-25 feet from home plate.
Plate Discipline Score: Composite metric combining chase rate, zone swing rate, and contact rates to measure a hitter's approach quality.
Advanced Hitting Metrics
Barrel Rate (Brl%): Percentage of batted ball events classified as barrels, representing the best possible contact outcomes.
Sweet Spot Rate: Percentage of batted balls with launch angles between 8-32 degrees, optimal for base hits.
Hard Hit Rate (HardHit%): Percentage of batted balls with exit velocity of 95+ mph, indicating quality of contact.
Average Launch Angle: Mean vertical angle of all batted balls for a hitter, indicating approach (fly ball vs ground ball tendency).
Pull Rate (Pull%): Percentage of batted balls hit to the pull side (left field for righties, right field for lefties).
Oppo Rate (Oppo%): Percentage of batted balls hit to the opposite field.
Center Rate (Cent%): Percentage of batted balls hit to center field.
Spray Angle: The horizontal angle of a batted ball relative to center field, measured in degrees.
wOBA-xwOBA: The difference between actual wOBA and expected wOBA, indicating luck or over/underperformance.
BA-xBA: The difference between actual batting average and expected batting average.
xISO (Expected Isolated Power): Expected ISO based on batted ball quality, excluding luck and defense.
Clutch: A FanGraphs metric measuring how much better or worse a player performed in high-leverage situations.
WPA/LI (Context-Neutral Wins): Win Probability Added divided by average leverage, showing performance independent of situation.
RE24: Runs above average based on the 24 base-out states, measuring situational production.
REW (Run Expectancy Wins): RE24 converted to wins using runs per win.
Batting Runs: Component of WAR measuring runs created or lost from hitting, relative to average.
Advanced Pitching Metrics
CSW% (Called Strike + Whiff %): Percentage of pitches resulting in either a called strike or swinging strike, measuring pure stuff effectiveness.
Whiff Rate (Whiff%): Percentage of swings that result in misses, calculated as whiffs / swings.
Put Away Rate: Percentage of two-strike pitches that result in a strikeout.
First Pitch Strike Rate (F-Strike%): Percentage of plate appearances starting with a strike.
Stuff+ (Stuff Plus): Model-based metric grading the expected effectiveness of a pitch based purely on its physical characteristics (velocity, movement, release point, etc.), scaled so that 100 is average.
Location+ (Location Plus): Model-based metric measuring the quality of pitch location relative to optimal targets.
Pitching+ (Pitching Plus): Combined metric incorporating Stuff+, Location+, and pitch sequencing.
xERA (Expected ERA): ERA estimator based on expected outcomes from batted ball quality and strikeouts/walks.
pCRA (Predictive Component ERA): Baseball Prospectus metric estimating ERA based on underlying pitch characteristics.
CERA (Component ERA): ERA estimator using individual component rates (K, BB, HR, etc.) rather than actual runs.
tERA (True ERA): ERA estimator from StatCorner accounting for batted ball types and sequencing.
Pitching Runs: Component of WAR measuring runs prevented or allowed from pitching, relative to average.
Pitch Value (PV): Run value of each pitch type thrown by a pitcher, based on outcomes and context.
wFB (Fastball Runs): Total run value above average for all fastballs thrown. The rate version, scaled per 100 pitches, is reported as wFB/C.
wSL (Slider Runs): Run value above average for all sliders thrown.
wCH (Changeup Runs): Run value above average for all changeups thrown.
wCB (Curveball Runs): Run value above average for all curveballs thrown.
Usage Rate: Percentage of total pitches that are a specific pitch type.
Times Through Order Penalty (TTOP): The degradation in pitcher performance each time through the batting order, typically 0.3-0.4 runs per time through.
Pitcher Wins (pWins): Wins component attributed to pitching in win probability models.
Fielding & Baserunning
Range Runs: Runs saved or cost by a fielder's range, measuring ability to get to balls.
Error Runs: Runs saved or cost by avoiding or committing errors.
Double Play Runs: Runs saved or cost by turning (or not turning) double plays.
Outfield Arm Runs: Runs saved by an outfielder's arm preventing runners from advancing.
Catcher Framing Runs: Runs saved by a catcher through pitch framing, turning balls into strikes.
Catcher Blocking Runs: Runs saved by a catcher blocking pitches in the dirt.
Pop Time: Time from pitch hitting the catcher's mitt to the ball reaching the fielder's glove at second base.
Stolen Base Runs (SBR): Run value of stolen base attempts, calculated as (SB × runSB) + (CS × runCS).
Ultimate Base Running (UBR): Baserunning value from non-stolen base events: advancing on hits, sac flies, wild pitches, etc.
Base Running Runs (BsR): Total baserunning contribution combining SBR and UBR.
wGDP (Weighted GIDP): Run cost of grounding into double plays relative to opportunity.
Positional Adjustment: Run value adjustment in WAR based on defensive difficulty of position played.
Value & Projection Metrics
JAWS (Jaffe WAR Score): Hall of Fame metric averaging career WAR with 7-year peak WAR, compared to position average.
Peak WAR: The highest WAR total achieved over a player's best consecutive seasons (typically 7 years).
Career WAR: Total WAR accumulated over an entire career.
WAR7: Sum of a player's seven highest single-season WAR totals.
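JAWS follows directly from the Career WAR and WAR7 definitions above. The sketch below uses the WAR7 definition (the seven highest single-season totals, not necessarily consecutive); the function names and the season-by-season WAR values are hypothetical.

```python
def war7(season_wars):
    """Sum of the seven highest single-season WAR totals."""
    return sum(sorted(season_wars, reverse=True)[:7])


def jaws(season_wars):
    """JAWS: average of career WAR and seven-year peak WAR (WAR7)."""
    career = sum(season_wars)
    return (career + war7(season_wars)) / 2


# Hypothetical ten-season career
seasons = [2.0, 5.0, 8.0, 7.0, 6.0, 6.5, 4.0, 3.0, 1.0, 0.5]
print(f"Career WAR = {sum(seasons)}, WAR7 = {war7(seasons)}, JAWS = {jaws(seasons)}")
```

For players with fewer than seven seasons, WAR7 equals career WAR in this sketch, so JAWS collapses to career WAR.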
Projected WAR: Forecasted WAR for upcoming season based on projection systems.
PECOTA: Baseball Prospectus's Player Empirical Comparison and Optimization Test Algorithm, a projection system.
ZiPS: Dan Szymborski's projection system using player comparisons and aging curves.
Steamer: Projection system emphasizing recent performance and regression.
ATC: Average of available projection systems, often more accurate than individual projections.
Surplus Value: The difference between a player's projected production value and their salary cost.
Dollars per WAR ($/WAR): Current market value of one win above replacement, typically $8-10 million.
Contract Value: Total expected value of a player contract based on projected performance.
Arbitration Projection: Estimated salary for arbitration-eligible players based on service time and performance.
Game Theory & Strategy
Run Expectancy (RE): Expected runs to score in remainder of inning given current base-out state.
Run Expectancy Matrix (RE24): Table showing expected runs for all 24 base-out states.
Win Expectancy (WE): Probability of winning given current game state (inning, score, base-out state).
Win Probability Added (WPA): Change in win probability from a single play or event.
Leverage Index (LI): Measure of game situation importance, with 1.0 being average. High-leverage situations have LI > 1.5.
gmLI (Game Leverage Index): Average leverage index when a player entered games.
pLI (Pitcher Leverage Index): Average leverage index for all situations faced by a pitcher.
Shutdown (SD): Relief appearance that increases the team's win probability by at least 6 percentage points (WPA of +0.06 or more).
Meltdown (MD): Relief appearance that decreases the team's win probability by at least 6 percentage points (WPA of -0.06 or less).
Clutch (FanGraphs): Performance in high-leverage situations vs overall performance.
Break-Even Stolen Base Rate: The success rate at which a stolen base attempt has neutral expected value, typically around 70-75%.
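The break-even rate can be derived by setting the expected run value of an attempt to zero: p × runSB - (1 - p) × runCS = 0, which gives p = runCS / (runSB + runCS). The sketch below uses the illustrative linear weights from the wSB entry (0.2 runs per steal, 0.43 runs lost per caught stealing) as defaults; the function names are hypothetical, and actual weights shift with run environment and base-out state.

```python
def break_even_rate(run_sb=0.2, run_cs=0.43):
    """Success rate at which a steal attempt has zero expected run value."""
    return run_cs / (run_sb + run_cs)


def steal_run_value(sb, cs, run_sb=0.2, run_cs=0.43):
    """Net runs from a runner's stolen base attempts."""
    return sb * run_sb - cs * run_cs


rate = break_even_rate()
print(f"Break-even success rate: {rate:.1%}")

# Hypothetical runner: 30 steals, 10 times caught (75% success)
print(f"Net run value: {steal_run_value(30, 10):.1f}")
```

With these particular weights the break-even point comes out slightly below the 70-75% range quoted above; a larger caught-stealing penalty pushes it higher.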
Optimal Lineup Construction: Arrangement of batting order to maximize expected runs, often placing highest OBP at top.
Platoon Split: Performance difference when facing same-handed vs opposite-handed pitchers/batters.
Sports Betting & Fantasy
Implied Probability: The probability implied by betting odds, calculated as 1 / decimal odds.
Expected Value (EV): The long-term average value of a bet, calculated as (Prob × Win) - ((1-Prob) × Loss).
Closing Line Value (CLV): The difference between the odds when a bet was placed and the closing line, indicating sharp betting.
Kelly Criterion: Optimal bet sizing formula maximizing long-term growth: f = (bp - q) / b, where f = fraction of bankroll to stake, b = net odds received (decimal odds - 1), p = probability of winning, and q = 1 - p.
Half Kelly: Betting half the Kelly-optimal amount for risk reduction.
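A small sketch of Kelly sizing, assuming decimal odds as input so that the net odds b equal decimal odds minus 1. The function name and the `multiplier` parameter (0.5 for Half Kelly) are illustrative choices, and the win probability is whatever the bettor's own model supplies.

```python
def kelly_fraction(prob, decimal_odds, multiplier=1.0):
    """Fraction of bankroll to stake; returns 0 when the bet has no edge."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    b = decimal_odds - 1.0   # net odds received per unit staked
    q = 1.0 - prob           # probability of losing
    fraction = (b * prob - q) / b
    return max(0.0, fraction) * multiplier


# Hypothetical bet: model says 55% at even money (decimal odds 2.00)
print(f"Full Kelly: {kelly_fraction(0.55, 2.00):.3f}")
print(f"Half Kelly: {kelly_fraction(0.55, 2.00, multiplier=0.5):.3f}")
```

Clamping at zero reflects that a negative Kelly fraction means the bet should be skipped, not that a negative amount should be staked.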
Return on Investment (ROI): Net profit divided by total amount wagered, expressed as percentage.
Vig (Vigorish): The bookmaker's commission built into betting odds, typically 4-5% for MLB.
No-Vig Line: True implied probability after removing bookmaker's margin.
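To illustrate implied probability, vig, and the no-vig line together, the sketch below converts a two-way market from decimal odds and removes the margin by proportional normalization. That is one common method among several, chosen here for simplicity; the function names and the sample odds are hypothetical.

```python
def implied_probability(decimal_odds):
    """Implied probability = 1 / decimal odds."""
    return 1.0 / decimal_odds


def no_vig_line(odds_a, odds_b):
    """Return (fair_prob_a, fair_prob_b, overround) for a two-way market.

    Assumption: the margin is removed proportionally from both sides.
    """
    prob_a = implied_probability(odds_a)
    prob_b = implied_probability(odds_b)
    total = prob_a + prob_b          # exceeds 1.0 by the bookmaker's margin
    return prob_a / total, prob_b / total, total - 1.0


# Hypothetical moneyline: favorite at 1.80, underdog at 2.10
fair_a, fair_b, overround = no_vig_line(1.80, 2.10)
print(f"Fair probabilities: {fair_a:.3f} / {fair_b:.3f}, overround: {overround:.2%}")
```

The fair probabilities sum to 1, and comparing them with a model's estimate is the usual starting point for an expected value calculation.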
Sharp Money: Bets from professional or highly-informed bettors that move lines.
Public Money: Bets from recreational bettors, often favoring favorites and overs.
Auction Value: Dollar value assigned to players in fantasy baseball auction drafts.
Standings Gain Points (SGP): Fantasy metric measuring value of statistics toward category standings.
Replacement Level (Fantasy): The production available on the waiver wire, baseline for player value.
VORP (Value Over Replacement Player): A player's production above replacement level in fantasy context.
Statistical Concepts
Sample Size: Number of observations needed for statistical stability, varying by metric.
Stabilization Point: The sample size at which a metric is approximately 50% skill and 50% noise.
Regression to the Mean: The tendency for extreme performances to move toward average over time.
Year-to-Year Correlation: How well a metric predicts itself from one season to the next.
Confidence Interval: Range of values within which the true parameter likely falls, typically 95%.
Standard Error: Measure of the precision of an estimate, decreasing with sample size.
Bayesian Inference: Statistical approach updating beliefs based on prior information and new data.
Prior Distribution: Initial belief about a parameter before seeing data in Bayesian analysis.
Posterior Distribution: Updated belief about a parameter after incorporating data.
Shrinkage Estimator: Estimate that pulls extreme observations toward the population mean.
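One simple shrinkage estimator ties several of these concepts together: blend the observed rate with the population mean, weighting the mean by the metric's stabilization point. This is a basic regression-to-the-mean sketch, not a full Bayesian model; the function name and the 300 PA stabilization point are hypothetical values chosen for illustration.

```python
def shrink(observed_rate, n, prior_mean, stabilization_n):
    """Pull an observed rate toward the population mean.

    The prior acts like stabilization_n extra observations at the mean,
    so at n == stabilization_n the estimate is half observation, half prior.
    """
    return (observed_rate * n + prior_mean * stabilization_n) / (n + stabilization_n)


# Hypothetical hitter: .400 OBP through 100 PA, league average .320
estimate = shrink(observed_rate=0.400, n=100, prior_mean=0.320, stabilization_n=300)
print(f"Regressed OBP estimate: {estimate:.3f}")
```

As the sample grows the estimate moves toward the observed rate, which matches the intuition that large samples need less regression.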
Effect Size: The magnitude of a difference or relationship, independent of sample size.
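Statistical Significance: A result is statistically significant when its p-value (the probability of observing data at least this extreme if there were no true effect) falls below a chosen threshold, typically 0.05.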
Practical Significance: Whether an effect is large enough to matter in real-world terms.
Correlation Coefficient (r): Measure of linear relationship strength between two variables, ranging from -1 to 1.
Coefficient of Determination (R²): Proportion of variance explained by a model, ranging from 0 to 1.
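Root Mean Square Error (RMSE): Square root of the average squared prediction error, measuring model accuracy in the units of the target. It equals the standard deviation of the errors only when predictions are unbiased.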
Mean Absolute Error (MAE): Average absolute difference between predictions and actual values.
Log Loss: Measure of prediction accuracy for probability estimates, penalizing confident wrong predictions heavily.
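The three error metrics above are short enough to write out directly. The sketch below uses only the standard library; the function names and sample values are illustrative, and the `eps` clipping in the log loss is a common safeguard against taking the log of zero, not part of the definition.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def log_loss(outcomes, probabilities, eps=1e-15):
    """Binary log loss; outcomes are 0/1, probabilities are P(outcome = 1)."""
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(outcomes)


# Hypothetical ERA projections vs. actual results
actual_era = [3.5, 4.2, 2.8]
projected_era = [3.0, 4.0, 3.3]
print(f"RMSE = {rmse(actual_era, projected_era):.3f}")
print(f"MAE  = {mae(actual_era, projected_era):.3f}")

# Hypothetical win probability forecasts for two games
print(f"Log loss = {log_loss([1, 0], [0.9, 0.2]):.3f}")
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few predictions miss badly.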
AUC-ROC: Area Under the Receiver Operating Characteristic Curve, measuring classification model performance.
Overfitting: When a model learns noise in training data, performing poorly on new data.
Cross-Validation: Technique for assessing model performance by training and testing on different data subsets.
Feature Importance: Measure of how much each input variable contributes to model predictions.
Ensemble Methods: Combining multiple models to improve prediction accuracy (random forest, boosting, etc.).
Notes on Usage
These appendices are designed as a quick reference for practitioners of baseball analytics. For deeper explanations of concepts and their applications, refer to the relevant chapters in the main text.
Data sources are current as of 2024-2025. URLs and API endpoints may change; check official documentation for updates.
Formulas use standard abbreviations. Ensure data consistency when calculating metrics across different sources.
Statistical constants vary by year. Always use the appropriate year's constants (wOBA weights, FIP constants, etc.) available from FanGraphs Guts page: https://www.fangraphs.com/guts.aspx
WAR implementations differ. FanGraphs (fWAR), Baseball Reference (bWAR/rWAR), and Baseball Prospectus (WARP) use different methodologies. Values are not directly comparable across systems.
Park factors require multi-year data for stability. Single-season park factors can be unreliable due to small sample sizes and weather variation.
End of Appendices