Chapter 1: Introduction to MLB Analytics

1.1 What is Baseball Analytics?

1.1.1 Defining the Field {#defining-field}

At its core, baseball analytics is the application of statistical and computational methods to understand, evaluate, and optimize baseball performance. But this simple definition obscures important distinctions that help clarify what analysts actually do.

Statistics vs. Analytics: These terms are often used interchangeably but represent different approaches. Statistics describe what happened: Mike Trout hit .263 with 18 home runs in an injury-shortened 2023. Analytics explain why it happened and predict what will happen: if Trout's barrel rate and expected batting average outpaced his results, he was probably unlucky, and we would project improvement in 2024. Statistics are descriptive; analytics are explanatory and predictive.

This distinction matters for how we approach problems. A statistical approach to evaluating a pitcher might compare his ERA to league average. An analytical approach might examine his expected ERA based on quality of contact allowed, his performance in different counts, and how his pitch mix and sequencing have evolved. The analyst doesn't just report numbers; she investigates relationships, tests hypotheses, and builds models.

Three Types of Analytics: The analytics field broadly divides into three categories, each progressively more complex and valuable:

Descriptive analytics answers "What happened?" This includes traditional statistics but also more sophisticated summaries. A descriptive analysis might examine how frequently teams shift against left-handed pull hitters, or how often pitchers throw fastballs in different counts. Descriptive analytics organizes information, identifies patterns, and provides context. It's foundational, since you must understand what happened before explaining why, but it is limited in directly driving decisions.

Predictive analytics answers "What will happen?" This is where much modern baseball analysis focuses. Projection systems like ZiPS, Steamer, and THE BAT forecast future player performance. Win probability models estimate each team's chance of winning based on game state. Predictive analytics uses historical patterns, statistical models, and machine learning to estimate what is likely to happen next. These predictions are never perfect, because baseball involves too much randomness, but well-built systems tend to outperform unaided intuition.

Prescriptive analytics answers "What should we do?" This is the ultimate goal of analytical work: informing decisions. Should we sign this free agent? Should we shift against this hitter? Should we promote this prospect? Prescriptive analytics combines prediction with decision theory, often incorporating uncertainty, resource constraints, and multiple competing objectives. A team might predict that a free agent will produce 3 WAR next season, but deciding whether to sign him requires considering cost, alternative options, roster construction, and organizational goals.

1.1.2 The Modern Baseball Analytics Landscape {#analytics-landscape}

Baseball analytics exists across multiple interconnected sectors, each with different objectives, constraints, and audiences.

Front Office Analytics is where analytics most directly affects the game. All 30 MLB teams employ quantitative analysts—some teams have departments exceeding 20 people. These analysts support decision-making across baseball operations: amateur and professional scouting, player development, major league strategy, medical and performance science, and front office leadership.

Front office analysts work with proprietary data unavailable to the public: detailed medical records, personality assessments, internal scouting reports, biomechanical measurements, and granular versions of publicly available data. They often can't share their work publicly due to competitive concerns. The role demands not just analytical skill but also the ability to communicate with baseball decision-makers from different backgrounds, understand organizational context, and work within institutional constraints.

Media and Public Analytics reaches the broadest audience. Websites like FanGraphs, Baseball Prospectus, and The Athletic employ analysts who write for public consumption. Their work educates fans, drives discourse, and sometimes influences team decision-making (front office analysts read these sites too). Media analysts must balance analytical rigor with accessibility, making complex concepts understandable to diverse audiences.

This sector has democratized baseball knowledge. Insights once confined to team front offices now appear in daily articles and podcasts. When the Houston Astros pioneered aggressive pitching development strategies in the mid-2010s, public analysts documented and explained these approaches, allowing other teams and fans to understand the revolution as it happened.

Fantasy and Betting Analytics focuses on prediction for personal or financial gain. The fantasy baseball industry, worth hundreds of millions annually, employs analysts who project player performance for fantasy purposes. Sports betting's legalization has similarly created demand for predictive modeling. These sectors often pioneer methodology that later spreads to other areas—fantasy analysts were early adopters of projection systems and component skills analysis.

Academic and Research Analytics treats baseball as a laboratory for developing and testing analytical methods. Baseball's extensive historical data, clear outcomes, and relatively discrete events make it ideal for academic research. Studies published in journals like the Journal of Quantitative Analysis in Sports often use baseball to demonstrate new statistical techniques, which then diffuse into applied work.

1.1.3 Career Paths in Baseball Analytics {#career-paths}

For those interested in baseball analytics as a career, several paths exist, each requiring different skills and preparation.

The most direct path to front office work increasingly runs through formal education. Many analysts have graduate degrees in statistics, operations research, computer science, or related fields. Some teams recruit from top programs, valuing rigorous training in quantitative methods. However, education alone isn't sufficient—demonstrating passion for baseball and ability to communicate with diverse stakeholders matters enormously.

Building a public portfolio has become essential. Blogging, contributing to community sites, participating in Twitter/X discussions, and presenting at conferences such as the SABR Analytics Conference all showcase your skills. Many current front office analysts began as hobbyists who built public reputations before being hired. The book you're reading now is designed to help you develop projects worth including in such a portfolio.

Internships provide crucial entry points. Many teams offer analytics internships, often for graduate students. These positions pay poorly or not at all but provide invaluable experience and networking. Summer internships sometimes convert to full-time positions. Even when they don't, they demonstrate you can work in a baseball environment and provide references within the industry.

Alternative entry points exist through adjacent roles. Some analysts start in baseball operations departments handling administrative work before transitioning into analytical roles. Others begin in player development or scouting and gradually incorporate more quantitative methods. Connections and demonstrating value within an organization can matter as much as traditional qualifications.

The competition is intense. Teams receive hundreds of applications for each analytics position. Salaries, while increasing, remain below what similarly qualified people earn in technology or finance. Working conditions include long hours during the season and frequent job insecurity—front office turnover is high, and new leadership often brings new analysts. Yet for those passionate about baseball, the opportunity to affect the game at the highest level makes the challenges worthwhile.

1.2 The Questions Analytics Can Answer

Understanding what questions analytics can meaningfully address—and which questions remain difficult or impossible—helps focus analytical efforts productively.

1.2.1 Player Evaluation Questions {#player-evaluation}

Player evaluation is analytics' most mature application. Decades of work have produced increasingly sophisticated methods for quantifying player value.

"How good is this player?" seems simple but requires specifying what "good" means. Good compared to what baseline? Good at what specific skills? Good in what context? Modern analytics breaks this question into components:

True talent assessment: Separating signal from noise in performance data. A player hitting .340 through 100 at-bats might have true talent of .280—his early performance reflects both skill and luck. Methods like Bayesian updating and regression to the mean help identify underlying ability.

Component skills: Rather than evaluating players by results alone, modern analysis examines the skills that produce results. A hitter's exit velocity, launch angle distribution, swing decisions, and plate discipline often predict future performance better than current batting average or home run totals.

Context adjustment: A player hitting .280 with 25 home runs in Colorado's Coors Field (which inflates offense) might be less valuable than a .260, 20-home run hitter in pitcher-friendly Oracle Park. Analytics adjust for park effects, era effects, league quality, and other contextual factors.

Positional value: A shortstop hitting .250 might be more valuable than a first baseman hitting .280 because shortstop defense is scarcer and more valuable. Positional adjustments quantify these differences.

Comprehensive value metrics: Statistics like WAR (Wins Above Replacement) attempt to summarize a player's total value—offense, defense, baserunning, position—into a single number. While imperfect, these metrics provide useful overall assessments.
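The true-talent idea above (a .340 hitter through 100 at-bats who is really closer to .280) can be sketched as a simple regression-to-the-mean calculation. This is a minimal illustration, not a production projection method: the prior mean of .260 and the prior weight of 300 at-bats are assumptions chosen for the example, not fitted values, and the function name is hypothetical.

```python
def regress_to_mean(hits, at_bats, prior_mean=0.260, prior_weight=300):
    """Shrink an observed batting average toward a prior mean.

    This is the posterior mean of a beta-binomial model whose prior carries
    `prior_weight` at-bats of information centered on `prior_mean`.
    Both defaults are illustrative assumptions, not estimated from data.
    """
    if hits < 0 or at_bats < 0 or hits > at_bats:
        raise ValueError("need 0 <= hits <= at_bats")
    prior_hits = prior_mean * prior_weight
    return (hits + prior_hits) / (at_bats + prior_weight)


observed = 34 / 100                    # .340 through 100 at-bats
estimate = regress_to_mean(34, 100)    # pulled toward the prior mean
print(f"observed {observed:.3f}, estimated true talent {estimate:.3f}")
```

With these assumed settings the estimate lands at .280. The same .340 average over 1,000 at-bats would be shrunk much less, because the observed sample then outweighs the prior.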

1.2.2 In-Game Strategy Questions {#strategy}

Baseball presents numerous strategic decisions each game. Analytics increasingly informs these choices, sometimes overturning conventional wisdom.

"When should we intentionally walk a hitter?" lends itself to empirical investigation. By calculating win probability in different scenarios (walk the hitter to load the bases, or pitch to him), we can identify situations where walking improves win expectancy despite the free baserunner. The answer depends on game state, hitter and pitcher abilities, and who's on deck.

"Should we shift the defense?" became one of baseball's central analytical questions in the 2010s. By analyzing spray chart data showing where hitters typically hit balls, teams determined they could prevent more hits by positioning defenders in unusual spots. The shift's proliferation from fewer than 2,000 shifts in 2011 to over 50,000 in 2022 reflected analytics quantifying its effectiveness. MLB's 2023 rule banning most shifts reversed this trend, creating natural experiments for analysis (which we'll examine later this chapter).

"When should we remove the starting pitcher?" represents perhaps the most visible analytical influence on modern baseball. Traditional baseball valued pitchers working deep into games, but analytics revealed that pitchers become dramatically less effective the third time through the lineup, even when not showing obvious signs of fatigue. Teams increasingly remove starters earlier and use multiple relievers, prioritizing matchup advantages and maintaining peak effectiveness.

1.2.3 Team Building Questions {#team-building}

Baseball's economic structure creates complex team-building decisions. Analytics helps optimize these choices within budget constraints.

"How should we allocate our payroll?" involves balancing competing objectives. Should we spend $300 million on superstars or build depth with multiple solid players? Analytics can estimate the win contribution from different roster construction strategies. Generally, teams should concentrate spending on star players because wins become progressively more valuable (playoff probability increases non-linearly with wins), and top talent is increasingly scarce. However, constraints like luxury tax thresholds complicate this calculus.

"When should we extend or trade players?" requires forecasting future value and considering alternative uses of resources. Extending a 26-year-old star through his early 30s might seem obvious, but analytics considers aging curves (performance typically peaks around 27 and declines thereafter), injury risk, alternative free agents, and prospect development timelines. Sometimes the optimal decision is trading a popular player because future performance doesn't justify the cost.

"How should we balance major league winning and prospect development?" presents an eternal tension. Winning now requires trading prospects for established talent; building for the future requires accepting near-term losses. Analytics can't eliminate this trade-off but can quantify it, estimating how many future wins you're sacrificing for current wins and whether that exchange aligns with organizational objectives.

1.2.4 Player Development Questions {#player-development}

Player development—teaching minor leaguers and young major leaguers to improve—represents analytics' newest frontier. Data and technology enable unprecedented insight into the physical and mechanical factors underlying performance.

"How should this pitcher develop his arsenal?" increasingly lends itself to data-driven answers. Using pitch design principles informed by Statcast data, coaches can identify which pitch shapes generate the most swings and misses or weak contact. If a pitcher's curveball has below-average spin and vertical break, perhaps he should develop a slider instead. Some pitchers have added 2-3 mph to their fastball through mechanical adjustments identified via biomechanical analysis.

"What swing changes would improve this hitter?" similarly benefits from data. If a hitter produces many ground balls but has above-average exit velocity, adjusting his launch angle might unlock power. Detailed swing tracking reveals mechanical inefficiencies—whether a hitter's bat path creates excessive drag or his swing timing leaves him vulnerable to certain pitch locations.

"Which prospects will succeed?" remains difficult despite extensive data. Prospects face uncertain futures due to development variability, injury risk, and the difficulty of projecting performance across competition levels. Analytics improves prospect evaluation—teams now heavily weigh underlying skills over surface statistics—but projection remains imperfect. The best algorithms still miss future stars and overrate players who plateau.

1.3 Setting Up Your Analytics Environment

Before analyzing baseball, you need tools. This section walks through installing and configuring R and Python, the two dominant languages in baseball analytics, along with essential packages.

1.3.1 Choosing Your Language: R vs. Python {#r-vs-python}

Both R and Python are excellent for baseball analytics, and both are used professionally. Your choice depends on background, preferences, and goals.

R was designed specifically for statistical computing. It excels at data manipulation, statistical modeling, and visualization. R's tidyverse collection of packages provides an elegant, consistent approach to data analysis. Baseball-specific packages like baseballr offer easy access to numerous data sources. R is dominant in academic statistics and widely used in baseball front offices.

R's syntax may seem unusual if you come from traditional programming. It uses many specialized operators and emphasizes functional programming. The learning curve can be steep initially, but the payoff is concise, expressive code for statistical tasks.

Python is a general-purpose programming language that has become central to data science. It handles data analysis, machine learning, web scraping, automation, and application development. For baseball analytics, Python's pandas library provides powerful data manipulation, while pybaseball offers baseball data access similar to R's baseballr.

Python's broader applicability makes it valuable beyond baseball. If you might work in other data science domains or want to build applications around your analyses, Python is the more versatile choice. Python also has more extensive machine learning libraries than R.

Which should you choose? If you're completely new to programming, Python is slightly more beginner-friendly. If you're primarily interested in statistical analysis and visualization, R is slightly more specialized for these tasks. If you already know one language, stick with it—the concepts transfer even if syntax differs. If you're still uncertain, choose Python for its versatility.

The good news: this book provides all examples in both languages, so you can learn either or both.

1.3.2 Installing R and RStudio {#install-r}

R is the programming language; RStudio is an integrated development environment (IDE) that makes working with R much easier. You need both.

Step 1: Install R

Visit the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/

For Windows: Click "Download R for Windows" → "base" → "Download R 4.x.x for Windows" (use the latest version)
For macOS: Click "Download R for macOS" → Download the .pkg file appropriate for your macOS version (Apple silicon or Intel)
For Linux: Follow distribution-specific instructions on the CRAN page

Run the downloaded installer and follow the prompts. Default settings work fine for most users.

Step 2: Install RStudio

Visit: https://posit.co/download/rstudio-desktop/

Download RStudio Desktop (the free version) for your operating system and install it.

Step 3: Verify Installation

Open RStudio. You should see a window with several panes. In the "Console" pane (usually bottom-left), you'll see something like:

R version 4.3.2 (2023-10-31) -- "Eye Holes"

Type the following and press Enter:

print("Hello, Baseball!")

If you see [1] "Hello, Baseball!" you're ready to go.

1.3.3 Installing Python and Jupyter {#install-python}

Python can be installed in many ways. We recommend using Anaconda, a distribution that includes Python, Jupyter notebooks, and many useful packages.

Step 1: Install Anaconda

Visit: https://www.anaconda.com/download

Download the installer for your operating system (choose the latest Python 3.x version, not Python 2.x).

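Run the installer. On Windows, you may see an option to add Anaconda to your PATH environment variable. This is not required, and the installer itself recommends leaving it unchecked. If you skip it, use the Anaconda Prompt (installed alongside Anaconda) whenever this book asks you to run Python or conda from the command line.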

Step 2: Launch Jupyter Notebook

After installation, open Anaconda Navigator (a graphical application installed with Anaconda). Click "Launch" under Jupyter Notebook. This will open a browser window showing your file system.

Navigate to a folder where you want to store your work (or create a new folder for baseball analytics projects). Click "New" → "Python 3" to create a new notebook.

Step 3: Verify Installation

In the first cell of your new notebook, type:

print("Hello, Baseball!")

Press Shift+Enter to run the cell. You should see Hello, Baseball! appear below the cell.

Alternative: Using Python with VS Code

If you prefer a more traditional IDE to Jupyter notebooks, Visual Studio Code (VS Code) is excellent for Python development. Install VS Code from https://code.visualstudio.com/, then install the Python extension. You can run Python scripts directly or use VS Code's built-in notebook support.

1.3.4 Essential Packages Overview {#essential-packages}

Both R and Python ecosystems include thousands of packages. Here are the essential ones for baseball analytics:

R Packages:

tidyverse: Collection of packages for data manipulation and visualization (includes ggplot2, dplyr, tidyr)
baseballr: Access to baseball data from multiple sources (MLB, FanGraphs, Baseball Savant)
lubridate: Date and time handling
glue: String interpolation
scales: Formatting scales for visualization

Python Packages:

pandas: Data manipulation and analysis
numpy: Numerical computing
matplotlib: Plotting library
seaborn: Statistical visualization (built on matplotlib)
pybaseball: Baseball data access (similar to baseballr)
scipy: Scientific computing and statistics

Installing Packages:

In R, run these commands in the RStudio console:

# Install tidyverse (includes many useful packages)
install.packages("tidyverse")

# Install baseballr for baseball data
install.packages("baseballr")

# Install additional useful packages
install.packages(c("lubridate", "glue", "scales"))

For Python, open a terminal or Anaconda Prompt and run:

# Most packages come with Anaconda, but ensure they're updated
conda update pandas numpy matplotlib seaborn scipy

# Install pybaseball (not included in Anaconda)
pip install pybaseball

Alternatively, in a Jupyter notebook, you can install packages by running:

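%pip install pybaseball

The %pip magic runs pip inside the environment of the running notebook kernel, so the package lands where your notebook can import it. You may also see !pip install pybaseball, where the ! tells Jupyter to run the command in the system shell rather than as Python code. That form works in many setups but can install into a different Python than the one the notebook is using.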
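After installing, it can help to confirm which packages the current Python environment actually sees. The following is a small sketch using only the standard library (importlib.metadata, available in Python 3.8+); the helper name installed_versions is hypothetical and the package list simply mirrors the one in this section.

```python
from importlib import metadata

# Distribution names from the package list in this section
PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "pybaseball"]


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<12} {version or 'NOT INSTALLED'}")
```

Run it in the same notebook or terminal you plan to work in. A package reported as NOT INSTALLED there usually means it was installed into a different environment.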

1.3.5 Your First Baseball Analysis {#first-analysis}

Let's verify everything works by performing a simple analysis: comparing Aaron Judge's and Shohei Ohtani's 2023 offensive statistics.

R Version:

# Load required libraries
library(tidyverse)
library(baseballr)

# MLBAM player IDs, for reference (used when querying Statcast data later)
# Aaron Judge: 592450
# Shohei Ohtani: 660271

# Pull 2023 batting lines from Baseball Reference's daily leaderboards
# via baseballr (requires an internet connection). Fetch once, then filter.
batters_2023 <- bref_daily_batter("2023-03-30", "2023-10-01")

judge_stats <- batters_2023 %>%
  filter(Name == "Aaron Judge")

ohtani_stats <- batters_2023 %>%
  filter(Name == "Shohei Ohtani")

# Create a comparison data frame from their final 2023 lines,
# entered by hand to keep this first example simple
comparison <- data.frame(
  Player = c("Aaron Judge", "Shohei Ohtani"),
  Games = c(106, 135),
  HR = c(37, 44),
  RBI = c(75, 95),
  AVG = c(.267, .304),
  OBP = c(.406, .412),
  SLG = c(.613, .654)
)

# Display the comparison
print(comparison)

# Reshape to long format: one row per player-stat pair
comparison_long <- comparison %>%
  select(Player, HR, RBI) %>%
  pivot_longer(cols = c(HR, RBI),
               names_to = "Stat",
               values_to = "Value")

# Grouped bar chart; fill colors are matched to players alphabetically
ggplot(comparison_long, aes(x = Stat, y = Value, fill = Player)) +
  geom_col(position = "dodge") +
  labs(title = "Judge vs Ohtani: 2023 Counting Stats",
       y = "Count",
       x = "") +
  theme_minimal() +
  scale_fill_manual(values = c("#003087", "#BA0021"))

Python Version:

# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Imported only to confirm pybaseball is installed; used in later chapters
from pybaseball import playerid_lookup, statcast_batter

# Set plotting style
sns.set_style("whitegrid")

# For this simple example, we'll enter the 2023 season lines by hand
# In later chapters, we'll learn to query APIs dynamically

# Create comparison data
data = {
    'Player': ['Aaron Judge', 'Shohei Ohtani'],
    'Games': [106, 135],
    'HR': [37, 44],
    'RBI': [75, 95],
    'AVG': [.267, .304],
    'OBP': [.406, .412],
    'SLG': [.613, .654]
}

comparison = pd.DataFrame(data)
print(comparison)

# Create visualization
fig, ax = plt.subplots(figsize=(10, 6))

x = range(len(comparison['Player']))
width = 0.35

bars1 = ax.bar([i - width/2 for i in x], comparison['HR'],
               width, label='Home Runs', color='#003087')
bars2 = ax.bar([i + width/2 for i in x], comparison['RBI'],
               width, label='RBI', color='#BA0021')

ax.set_xlabel('Player', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('Judge vs Ohtani: 2023 Counting Stats', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(comparison['Player'])
ax.legend()

plt.tight_layout()
plt.show()

Both versions create a simple comparison and visualization. Don't worry if you don't understand every line yet—subsequent chapters will explain these concepts in detail. The important thing is verifying that your packages install correctly and you can run baseball-related code.

If the code runs without errors and produces output, congratulations! You've completed your first baseball analysis. If you encounter errors, check that:

All packages installed successfully
You're using recent versions of R (4.x+) or Python (3.8+)
For Python, you're running code in the correct environment

Package installation issues are the most common problems beginners encounter. If a package won't install, search for the error message—you'll usually find solutions quickly.

1.4 Understanding Baseball's Structure

Before diving into analytics, you need to understand how baseball organizes data and how the game's structure affects analysis.

1.4.1 The Hierarchy of Baseball Data {#data-hierarchy}

Baseball data exists in a nested hierarchy, with each level containing the one below it:

League → Season → Team → Game → Plate Appearance → Pitch

Understanding this hierarchy is crucial for proper analysis. When you filter data, join tables, or aggregate statistics, you're moving up or down this hierarchy.

League level: Major League Baseball comprises two leagues (American and National), though interleague play has made this distinction less meaningful. Historical data sometimes requires league-level awareness since rules (designated hitter) and competition level varied between leagues.

Season level: The regular season runs from late March or early April through late September or early October, with the playoffs extending into late October or early November. The season is the primary unit for most analyses. When comparing players, you usually compare seasons: "In 2023, Judge hit .267 with 37 home runs."

Team level: 30 teams compete in MLB. Team-level analysis examines roster construction, payroll allocation, and organizational performance. Players can change teams mid-season via trades, complicating player-level analysis.

Game level: Each team plays 162 regular season games. Game-level data includes outcomes (final score) and game context (weather, umpire, starting lineups). Some analyses require game-level aggregation: "How do teams perform in extra innings?"

Plate appearance level: A plate appearance (PA) is one batter's turn at bat. This is often the primary unit of baseball analysis. Plate appearances produce discrete outcomes (single, home run, strikeout, walk, etc.). Most offensive statistics aggregate plate appearance outcomes.

Pitch level: Each plate appearance involves multiple pitches (a little under four on average in recent MLB seasons). Pitch-level data from Statcast includes velocity, movement, location, and outcome (ball, called strike, swing and miss, batted ball). This granular data enables the deepest analyses.
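Moving up the hierarchy is an aggregation: pitches roll up into plate appearances, and plate appearances roll up into games. The sketch below shows that roll-up in plain Python so it runs without extra packages; a pandas groupby performs the same operation on real data. The records, field names, and helper functions are all hypothetical and only illustrate the shape of the computation.

```python
from collections import defaultdict

# Hypothetical pitch-level records: one dict per pitch
pitches = [
    {"game": "G1", "pa": 1, "pitch": 1, "result": "ball"},
    {"game": "G1", "pa": 1, "pitch": 2, "result": "called_strike"},
    {"game": "G1", "pa": 1, "pitch": 3, "result": "in_play"},
    {"game": "G1", "pa": 2, "pitch": 1, "result": "swinging_strike"},
    {"game": "G1", "pa": 2, "pitch": 2, "result": "in_play"},
    {"game": "G2", "pa": 1, "pitch": 1, "result": "ball"},
    {"game": "G2", "pa": 1, "pitch": 2, "result": "ball"},
    {"game": "G2", "pa": 1, "pitch": 3, "result": "ball"},
    {"game": "G2", "pa": 1, "pitch": 4, "result": "ball"},
]


def pitches_per_pa(pitch_rows):
    """Pitch level -> plate appearance level: count pitches in each PA."""
    counts = defaultdict(int)
    for row in pitch_rows:
        counts[(row["game"], row["pa"])] += 1
    return dict(counts)


def pa_per_game(pa_counts):
    """Plate appearance level -> game level: count PAs in each game."""
    counts = defaultdict(int)
    for game, _pa in pa_counts:
        counts[game] += 1
    return dict(counts)


pa_counts = pitches_per_pa(pitches)
game_counts = pa_per_game(pa_counts)
print(pa_counts)
print(game_counts)
```

Note that the plate appearance key includes the game identifier. Plate appearance numbers restart in every game, so grouping on the number alone would silently merge unrelated plate appearances.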

1.4.2 Key Baseball Events {#key-events}

Several baseball concepts are frequently confused. Precise definitions matter for analysis:

Plate Appearance (PA) vs. At-Bat (AB): A plate appearance is any completed turn batting. An at-bat is a plate appearance excluding walks, hit-by-pitches, sacrifice bunts, sacrifice flies, and catcher's interference. Most plate appearances are at-bats, but the distinction matters for calculation:

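Batting Average = Hits / At-Bats (not plate appearances)
On-Base Percentage = (Hits + Walks + HBP) / (At-Bats + Walks + HBP + Sacrifice Flies)

A player who walks doesn't get an at-bat, so his batting average is unchanged, but the walk counts in both the numerator and the denominator of on-base percentage. Note that the on-base percentage denominator is close to, but not exactly, plate appearances: sacrifice bunts and catcher's interference are excluded. This is why batting average and on-base percentage use different denominators.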
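The difference between the two denominators is easy to see in code. This sketch uses the official on-base percentage denominator (at-bats + walks + hit-by-pitches + sacrifice flies); the function names and the stat line are hypothetical, made up for illustration.

```python
def batting_average(hits, at_bats):
    """Hits per at-bat; NaN when there are no at-bats."""
    return hits / at_bats if at_bats else float("nan")


def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
    """Times on base divided by AB + BB + HBP + SF."""
    denominator = at_bats + walks + hbp + sac_flies
    if not denominator:
        return float("nan")
    return (hits + walks + hbp) / denominator


# Hypothetical season line (not a real player)
line = {"at_bats": 500, "hits": 150, "walks": 60, "hbp": 5, "sac_flies": 5}

avg = batting_average(line["hits"], line["at_bats"])
obp = on_base_percentage(line["hits"], line["walks"], line["hbp"],
                         line["at_bats"], line["sac_flies"])
print(f"AVG {avg:.3f}  OBP {obp:.3f}")
```

Adding ten walks to this line leaves AVG untouched but raises OBP, which is the practical consequence of the two different denominators.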

Hit Types: Not all hits are equal. Baseball distinguishes:

Single (1B): Batter reaches first base safely
Double (2B): Batter reaches second base safely
Triple (3B): Batter reaches third base safely
Home Run (HR): Batter circles all bases and scores

Total Bases (TB) weights hits by bases: TB = 1B + (2×2B) + (3×3B) + (4×HR)
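The total bases formula translates directly into a function, and dividing by at-bats gives slugging percentage (SLG), the statistic that appears in the comparison tables earlier in this chapter. The hit counts below are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
def total_bases(singles, doubles, triples, home_runs):
    """TB = 1B + 2*2B + 3*3B + 4*HR."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging_percentage(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at-bat; NaN when there are no at-bats."""
    if not at_bats:
        return float("nan")
    return total_bases(singles, doubles, triples, home_runs) / at_bats


# Hypothetical season: 150 hits in 500 at-bats
tb = total_bases(singles=95, doubles=30, triples=5, home_runs=20)
slg = slugging_percentage(95, 30, 5, 20, at_bats=500)
print(f"TB {tb}  SLG {slg:.3f}")
```

Two hitters with the same 150 hits can have very different slugging percentages, because the weighting rewards extra-base hits.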

Walks (BB) and Strikeouts (K): Walks (bases on balls) occur when the pitcher throws four balls, meaning pitches outside the strike zone that the batter does not swing at. Strikeouts occur when a batter accumulates three strikes. Together with home runs, walks and strikeouts make up the "three true outcomes," which don't involve fielders and thus isolate the pitcher-batter interaction.

Outs: Plate appearances that don't result in the batter reaching base safely. Types include:

Strikeouts (K)
Ground outs and fly outs (batted balls fielded for outs)
Double plays (one plate appearance produces two outs)
Sacrifice flies (batter makes an out but a runner scores)

Earned Runs vs. Unearned Runs: Earned runs are those scored without the benefit of fielding errors or passed balls. Pitcher evaluation traditionally uses only earned runs (hence Earned Run Average), but modern analytics often considers total runs since "unearned" runs can result from poor pitcher defense (e.g., not covering first base).
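Earned Run Average and its all-runs counterpart (often written RA9) share one formula, runs allowed per nine innings, and differ only in which runs are counted. The sketch below uses a hypothetical pitcher season; the function name is an illustration rather than any library's API.

```python
def runs_per_nine(runs, innings):
    """Runs allowed per nine innings.

    `innings` is a real number: 180 and one third innings is 180 + 1/3,
    not the box-score shorthand 180.1.
    """
    if innings <= 0:
        raise ValueError("innings must be positive")
    return 9 * runs / innings


# Hypothetical season: 180 innings, 70 earned runs, 78 total runs
innings, earned_runs, total_runs = 180, 70, 78

era = runs_per_nine(earned_runs, innings)   # earned runs only
ra9 = runs_per_nine(total_runs, innings)    # earned + unearned runs
print(f"ERA {era:.2f}  RA9 {ra9:.2f}")
```

The gap between the two numbers is the share of a pitcher's runs that official scoring attributed to defensive miscues.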

1.4.3 Baseball's Unique Characteristics for Analysis {#unique-characteristics}

Baseball has several characteristics that make it simultaneously easier and harder to analyze than other sports:

Discrete Events: Unlike basketball or soccer where play flows continuously, baseball consists of discrete events (plate appearances, pitches) with clear outcomes. This makes data collection and event-level analysis straightforward. Each pitch is independent in some sense—unlike a basketball possession where earlier actions affect later ones through court positioning.

High Sample Sizes: With 162 games per season, roughly 700 plate appearances per full-time player, and 2,400+ pitches per starting pitcher, baseball generates enormous sample sizes. This enables more confident statistical inference than sports with shorter seasons. However, even these large samples sometimes aren't enough: a single season's defensive metrics can be noisy, and relievers face far fewer batters than starters.

Randomness and Luck: Baseball involves tremendous short-term randomness. A batter hitting a ball 105 mph might make an out (fielder catches it) while a 75 mph bloop might fall for a hit. Over 500 plate appearances, this randomness largely evens out, but small samples are dominated by luck. This creates challenges: separating skill from randomness, avoiding overreaction to small samples, and recognizing when a performance change is real versus noise.
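A rough way to see how sample size controls the noise is the binomial standard error. The sketch below assumes each at-bat is an independent trial with a fixed hit probability, which is a simplification, and uses a hypothetical true-talent average of .270.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed rate over n independent trials."""
    return math.sqrt(p * (1 - p) / n)


true_avg = 0.270  # hypothetical true-talent batting average

for at_bats in (50, 150, 500):
    se = binomial_se(true_avg, at_bats)
    low, high = true_avg - 2 * se, true_avg + 2 * se
    print(f"{at_bats:>3} AB: SE = {se:.3f}, "
          f"roughly 95% of observed averages fall in {low:.3f} to {high:.3f}")
```

Under these assumptions the plausible range over 50 at-bats spans well over 200 points of batting average, while over 500 at-bats it narrows to under 100 points. That is the quantitative version of "small samples are dominated by luck."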

No Clock: Baseball has no game clock. Games end after nine innings (or more if tied), not after elapsed time. This affects strategy, since teams can't "run out the clock," and means game length varies substantially. Plate appearances per game vary depending on scoring and pitching efficiency. This complicates some analyses compared to sports where possessions per game are more consistent.

Context Dependence: A home run's value depends dramatically on context. A solo home run adds one run; a grand slam adds four. A home run when trailing by seven matters much less than one in a tied game. Advanced metrics like Win Probability Added (WPA) and Leverage Index account for this context, but it adds complexity—no single number fully captures an event's value.

1.4.4 The Baseball Calendar {#baseball-calendar}

Understanding baseball's temporal structure helps with data collection and analysis:

Spring Training (February-March): Teams prepare for the regular season. Spring training statistics are public but generally considered unreliable for analysis due to limited playing time, younger players competing against major leaguers, and lack of competitive intensity.

Regular Season (late March/early April - late September): 162 games per team over approximately 185 days. This is the primary data source for most analyses. The season includes the All-Star break in mid-July, dividing it into first and second halves.

Playoffs (October - early November): Twelve teams (three division winners and three wild cards per league) compete in the postseason. Playoff performance is prestigious but represents small samples: under the format adopted in 2022, even a World Series winner plays at most 22 playoff games. Analyzing playoff performance requires careful treatment of small sample issues.

Offseason (November - February): Teams acquire players through free agency and trades. (The amateur draft, by contrast, takes place during the season; it has been held in July since 2021.) The offseason is when most roster construction happens and when analysts produce projections for the upcoming season.

Key Dates for Analysts:

Opening Day (late March/early April): Season begins
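Trade Deadline (around July 31; the exact date varies by season): Last day for trades between teams. Since 2019 there has been a single deadline, with no August waiver-trade period
September Call-ups (September 1): Rosters expand (from 26 to 28 players since 2020); prospects often debut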
Postseason Rosters (October): Teams must set playoff rosters
Free Agency Begins (five days after World Series)
Salary Arbitration Deadline (mid-January)

These dates matter for analysis. Player performance before and after the trade deadline might differ due to team context changes. September statistics can be misleading when prospects face limited competition. When building models, incorporating calendar effects sometimes improves accuracy.
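One simple way to bring calendar effects into a dataset is to tag each game date with a season phase and then group or model by that tag. The sketch below uses 2023 dates (Opening Day on March 30, the trade deadline on August 1, the regular season ending October 1). The phase names and the choice to count deadline-day games as pre-deadline are assumptions made for illustration.

```python
from datetime import date

# 2023 boundaries; other seasons need their own dates
OPENING_DAY = date(2023, 3, 30)
TRADE_DEADLINE = date(2023, 8, 1)
ROSTER_EXPANSION = date(2023, 9, 1)
REGULAR_SEASON_END = date(2023, 10, 1)


def season_phase(game_date: date) -> str:
    """Map a 2023 game date to a coarse, hypothetical season-phase label."""
    if game_date < OPENING_DAY:
        return "preseason"
    if game_date <= TRADE_DEADLINE:
        return "pre_deadline"
    if game_date < ROSTER_EXPANSION:
        return "post_deadline"
    if game_date <= REGULAR_SEASON_END:
        return "expanded_rosters"
    return "postseason"


for d in (date(2023, 3, 15), date(2023, 6, 1), date(2023, 8, 15),
          date(2023, 9, 20), date(2023, 10, 10)):
    print(d.isoformat(), season_phase(d))
```

With a label like this attached to each plate appearance, a before/after trade deadline comparison becomes a simple group-by rather than a set of hand-written date filters.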

1.5 Case Study: The Shift Ban's Impact

Let's apply what we've learned to a real analytical question: How did MLB's 2023 defensive shift restrictions affect the game?

Background

For years, teams increasingly employed defensive shifts—unusual defensive alignments designed to counter specific hitters' tendencies. Against extreme pull hitters (especially left-handed batters who hit most balls to the right side), teams placed three infielders on one side of second base. The shift's prevalence grew from around 2% of plate appearances in 2011 to over 35% in 2022.

Starting in 2023, MLB banned most shifts, requiring two infielders on each side of second base when the pitch is delivered. Proponents argued shifts made the game less exciting by reducing offense. Opponents claimed they represented smart baseball strategy that rewarded analytical teams.

We can analyze the shift ban's impact by comparing ground ball outcomes between 2022 (shifts allowed) and 2023 (shifts banned).

R Implementation

# Load libraries
library(tidyverse)
library(baseballr)

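# Note: This example uses conceptual code. The baseballr package
# periodically updates its functions. Check current documentation.
# Baseball Savant also caps the rows returned per query, so a
# full-season request may come back truncated; splitting the date
# range into shorter windows and stacking the results avoids this.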

# Get Statcast data for ground balls in 2022 and 2023
# We'll focus on left-handed batters (most shifted against)

# 2022 data (shifts allowed)
gb_2022 <- statcast_search(start_date = "2022-04-07",
                            end_date = "2022-10-05",
                            player_type = "batter") %>%
  filter(bb_type == "ground_ball",
         stand == "L") %>%  # Left-handed batters
  mutate(year = 2022,
         was_hit = if_else(events %in% c("single", "double", "triple"), 1, 0))

# 2023 data (shifts banned)
gb_2023 <- statcast_search(start_date = "2023-03-30",
                            end_date = "2023-10-01",
                            player_type = "batter") %>%
  filter(bb_type == "ground_ball",
         stand == "L") %>%
  mutate(year = 2023,
         was_hit = if_else(events %in% c("single", "double", "triple"), 1, 0))

# Combine datasets
gb_combined <- bind_rows(gb_2022, gb_2023)

# Calculate ground ball hit rates by year
hit_rates <- gb_combined %>%
  group_by(year) %>%
  summarize(ground_balls = n(),
            hits = sum(was_hit, na.rm = TRUE),
            hit_rate = hits / ground_balls,
            .groups = "drop")

print(hit_rates)

# Statistical test: Did the rate significantly change?
# Chi-square test for difference in proportions
contingency_table <- gb_combined %>%
  count(year, was_hit) %>%
  pivot_wider(names_from = was_hit, values_from = n) %>%
  select(-year) %>%
  as.matrix()

chi_result <- chisq.test(contingency_table)
print(chi_result)

# Visualization
ggplot(hit_rates, aes(x = factor(year), y = hit_rate, fill = factor(year))) +
  geom_bar(stat = "identity", width = 0.6) +
  geom_text(aes(label = sprintf("%.3f", hit_rate)),
            vjust = -0.5, size = 5) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.1),
                     limits = c(0, 0.30)) +
  scale_fill_manual(values = c("2022" = "#E03A3E", "2023" = "#4A90E2")) +
  labs(title = "Ground Ball Hit Rate: Left-Handed Batters",
       subtitle = "Comparing 2022 (Shifts Allowed) vs 2023 (Shifts Banned)",
       x = "Season",
       y = "Ground Ball Hit Rate",
       caption = "Data: MLB Statcast via baseballr") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold", size = 16),
        plot.subtitle = element_text(size = 12))

Python Implementation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pybaseball import statcast
from scipy.stats import chi2_contingency

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

# Get Statcast data for 2022 ground balls (shifts allowed)
# Note: Large queries may take time. For practice, you might limit date ranges.
gb_2022 = statcast(start_dt='2022-04-07', end_dt='2022-10-05')
gb_2022 = gb_2022[
    (gb_2022['bb_type'] == 'ground_ball') &
    (gb_2022['stand'] == 'L')  # Left-handed batters
].copy()
gb_2022['year'] = 2022
gb_2022['was_hit'] = gb_2022['events'].isin(['single', 'double', 'triple']).astype(int)

# Get Statcast data for 2023 ground balls (shifts banned)
gb_2023 = statcast(start_dt='2023-03-30', end_dt='2023-10-01')
gb_2023 = gb_2023[
    (gb_2023['bb_type'] == 'ground_ball') &
    (gb_2023['stand'] == 'L')
].copy()
gb_2023['year'] = 2023
gb_2023['was_hit'] = gb_2023['events'].isin(['single', 'double', 'triple']).astype(int)

# Combine datasets
gb_combined = pd.concat([gb_2022, gb_2023], ignore_index=True)

# Calculate hit rates by year
hit_rates = gb_combined.groupby('year').agg(
    ground_balls=('was_hit', 'count'),
    hits=('was_hit', 'sum')
).reset_index()
hit_rates['hit_rate'] = hit_rates['hits'] / hit_rates['ground_balls']

print("\nGround Ball Hit Rates:")
print(hit_rates)

# Statistical test: Chi-square test for independence
contingency_table = pd.crosstab(gb_combined['year'], gb_combined['was_hit'])
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square test results:")
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

bars = ax.bar(hit_rates['year'].astype(str), hit_rates['hit_rate'],
              color=['#E03A3E', '#4A90E2'], width=0.6)

# Add value labels on bars
for i, (bar, rate) in enumerate(zip(bars, hit_rates['hit_rate'])):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{rate:.3f}',
            ha='center', va='bottom', fontsize=14, fontweight='bold')

ax.set_ylabel('Ground Ball Hit Rate', fontsize=12)
ax.set_xlabel('Season', fontsize=12)
ax.set_title('Ground Ball Hit Rate: Left-Handed Batters\nComparing 2022 (Shifts Allowed) vs 2023 (Shifts Banned)',
             fontsize=14, fontweight='bold', pad=20)
ax.set_ylim(0, 0.30)

# Format y-axis as percentage
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.1%}'))

# Add caption
fig.text(0.99, 0.01, 'Data: MLB Statcast via pybaseball',
         ha='right', fontsize=9, style='italic')

plt.tight_layout()
plt.show()

Analysis and Interpretation

This analysis would typically reveal that left-handed batters' ground ball hit rates increased from approximately .235 in 2022 to around .245 in 2023, a roughly 10-point increase (about 4% in relative terms). These figures are illustrative; run the code to get the actual values. Given the large sample sizes, the chi-square test is likely to flag a gap of this size as statistically significant, but read the p-value from the output rather than assuming it.
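How strongly a 10-point gap registers depends heavily on how many ground balls sit behind each rate. The sketch below runs a two-proportion z-test on the illustrative .235 and .245 rates at two hypothetical sample sizes. The counts are invented, and the z-test is a stand-in for the chi-square test in the code above: the two agree closely but not exactly, since `chisq.test` and `chi2_contingency` apply a continuity correction to 2x2 tables by default.

```python
import math


def two_proportion_z_test(hits_a: int, n_a: int, hits_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: same .235 vs .245 rates at two sample sizes
for n in (2_000, 20_000):
    hits_2022 = round(0.235 * n)
    hits_2023 = round(0.245 * n)
    z, p = two_proportion_z_test(hits_2022, n, hits_2023, n)
    print(f"n = {n:>6} per season: z = {z:.2f}, p = {p:.3f}")
```

The same 10-point gap is indistinguishable from noise at 2,000 ground balls per season and much harder to dismiss at 20,000, so the actual counts in `hit_rates` matter as much as the rates themselves.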

What this tells us:

The shift appears to have been effective: An increase in hit rates after shifts were banned is consistent with shifts having prevented hits. Teams weren't shifting for nothing; the alignment likely improved their defense.

The magnitude matters: A 10-point increase in ground ball hit rate (BABIP, or Batting Average on Balls In Play, restricted to ground balls) is meaningful. For a player with 200 ground balls per season, that's approximately 2 extra hits, enough to raise batting average by 3 to 4 points over 550 at-bats.
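The arithmetic behind this kind of estimate is easy to check for different workloads. The sketch below converts a 10-point change in ground ball hit rate into extra hits and batting average points. The ground ball totals and the 550 at-bat season are hypothetical inputs, not measured values.

```python
def extra_hits(ground_balls: int, rate_change: float) -> float:
    """Additional hits implied by a change in ground ball hit rate."""
    return ground_balls * rate_change


def avg_points_gained(hits_added: float, at_bats: int) -> float:
    """Batting average gain expressed in points (.001 = 1 point)."""
    return 1000 * hits_added / at_bats


RATE_CHANGE = 0.010  # the illustrative 10-point increase
AT_BATS = 550        # hypothetical full-season at-bat total

for ground_balls in (200, 300, 400):
    added = extra_hits(ground_balls, RATE_CHANGE)
    points = avg_points_gained(added, AT_BATS)
    print(f"{ground_balls} ground balls: {added:.1f} extra hits, "
          f"+{points:.1f} points of batting average")
```

The effect scales linearly with how many ground balls a hitter puts in play, so heavy ground ball hitters gain the most from the same change in hit rate.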

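Left-handed batters likely benefited most: Left-handed pull hitters were shifted most aggressively, so the rule change should have helped them more than right-handed batters or contact-oriented hitters who weren't heavily shifted. Confirming this requires running the same comparison for right-handed batters, which this analysis does not do.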

Broader implications: This increase in offense was part of MLB's motivation for the rule change. Combined with other 2023 rule changes (pitch clock, larger bases, restrictions on pickoff attempts), the shift ban contributed to a more action-oriented game.

Limitations of this analysis:

We only examined ground balls. Line drives and fly balls would tell a more complete story.
We didn't account for potentially changing offensive approaches (hitters might have changed their approach knowing shifts were banned).
We treated all ground balls equally; examining pull percentage or specific batted ball locations would add nuance.
The mix of batters differs between seasons (injuries, platoon usage, roster turnover), so the 2022 and 2023 samples don't cover the same population of hitters.

Future chapters will teach more sophisticated techniques to address these limitations. But this analysis demonstrates how even relatively simple methods can meaningfully investigate real baseball questions.

1.6 Exercises

These exercises reinforce the chapter's concepts. Solutions are available in the book's GitHub repository, but attempt them seriously before checking solutions.

Exercise 1.1: Environment Setup Verification

Write code in your chosen language (R or Python) that:

Loads the necessary baseball analytics packages
Prints your R or Python version
Queries basic information about a player of your choice using the baseball data package
Creates a simple plot (any plot) to verify visualization works

This confirms your environment is properly configured.

Exercise 1.2: Data Hierarchy Exploration

Using baseball data from 2023:

Get game-level data for your favorite team
Calculate the team's wins and losses
Aggregate to find total runs scored and allowed across the season
Calculate the team's Pythagorean winning percentage using the formula from the Preface
Compare actual winning percentage to Pythagorean expectation

R Version Hint:

# Use baseballr to get team game logs
library(baseballr)
library(tidyverse)

# Example for Yankees (team_id = 147)
yankees_games <- mlb_team_schedule(season = 2023,
                                    team_id = 147)
# Then aggregate and calculate...

Python Version Hint:

# Use pybaseball to get team game logs
from pybaseball import schedule_and_record

# Example for Yankees
yankees_games = schedule_and_record(2023, 'NYY')
# Then aggregate and calculate...

Exercise 1.3: Hit Type Analysis

Analyze the distribution of hit types (singles, doubles, triples, home runs) for two power hitters and two contact hitters in 2023.

Choose four players representing different offensive profiles
Get their 2023 statistics including hit breakdowns
Calculate what percentage of their hits were singles, doubles, triples, and home runs
Create a visualization comparing these distributions
Write a brief interpretation: How do power hitters' hit distributions differ from contact hitters?

Exercise 1.4: Monthly Performance Trends

Investigate whether players perform differently early vs. late in the season.

Get plate appearance data for a player across the 2023 season
Group plate appearances by month (April, May, June, July, August, September)
Calculate monthly batting average, OBP, and slugging percentage
Create a line plot showing how these metrics changed throughout the season
Discuss: Do you see evidence of the player getting hot or cold? How confident are you given sample sizes?

Challenge extension: Calculate monthly sample sizes and add confidence intervals to your plot to visualize uncertainty.

You've now completed your introduction to baseball analytics. You know what baseball analytics is and what questions it addresses, you have a working analytical environment, you understand baseball's data structure, and you have completed a real analysis examining the shift ban's impact. The foundation is laid.

Subsequent chapters build systematically on these foundations. Chapter 2 covers data acquisition—how to get data from various sources. Chapter 3 teaches data manipulation and transformation. Chapters 4-8 introduce core baseball metrics, showing both what they mean conceptually and how to calculate them from raw data. Later chapters cover visualization, modeling, prediction, and specialized topics.

Baseball analytics is ultimately about answering questions that matter. As you work through this book, always return to the questions: What am I trying to understand? What evidence would help answer that question? How confident should I be in my conclusions? The technical skills you'll develop are tools in service of clear thinking about meaningful questions.

Let's continue to Chapter 2, where we'll learn to acquire data from multiple sources, building the datasets that fuel every analysis.
