P.1.1 The Pre-Sabermetric Era (1871-1970s) {#pre-sabermetric}
The story of baseball analytics begins not with computers or complex algorithms, but with a British-born sportswriter and his simple innovation. In the 1850s, Henry Chadwick, often called "the father of baseball," adapted the cricket scorecard to create the baseball box score. This humble invention—a tabular summary of a game's events—established the foundation upon which all baseball analysis would be built.
Chadwick's innovation went beyond merely recording events. He created statistics to quantify player performance: batting average, earned run average, and the concept of an error. These metrics, however, reflected the values and assumptions of their time. Batting average, for instance, treats all hits equally—a single is worth the same as a home run. Errors, determined by subjective scorer judgment, became a primary defensive metric despite their obvious limitations.
For nearly a century, these traditional statistics dominated baseball discourse. Batting average, home runs, and runs batted in became the trinity of offensive evaluation. Wins and ERA defined pitcher value. Teams made million-dollar decisions based on these metrics, often missing crucial aspects of player value. A high batting average might mask poor plate discipline. A high RBI total might simply reflect batting behind good hitters. A pitcher's win total depended heavily on run support and bullpen quality.
Yet even in this era, some questioned the conventional wisdom. Branch Rickey, the executive who broke baseball's color barrier by signing Jackie Robinson to the Brooklyn Dodgers, developed one of the first sophisticated statistical systems in the late 1940s and 1950s. Rickey's approach, developed with statistician Allan Roth, emphasized on-base percentage and slugging percentage, presaging concepts that wouldn't enter mainstream consciousness for decades. Roth calculated that each additional base on balls (walk) was worth approximately 0.7 of a hit, and each additional home run was worth roughly 3.2 singles. These insights, revolutionary for their time, remained largely confined to Rickey's organization.
P.1.2 Bill James and the Birth of Sabermetrics {#bill-james}
In 1977, a Kansas night security guard changed baseball forever. Bill James, working the graveyard shift at a pork and beans cannery, began writing self-published "Baseball Abstracts" that challenged nearly every piece of conventional wisdom about how baseball should be understood and evaluated.
James coined the term "sabermetrics"—a nod to the Society for American Baseball Research (SABR)—to describe "the search for objective knowledge about baseball." His annual Abstracts, initially photocopied and mailed to a handful of subscribers, grew into must-read publications that introduced revolutionary concepts:
Runs Created was perhaps James's most influential early metric. The formula estimated how many runs a player generated through his offensive contributions. The basic version was elegant in its simplicity:
Runs Created = (Hits + Walks) × Total Bases / (At Bats + Walks)
This single formula captured something batting average never could: a player's total offensive value. A player who hit .250 but walked frequently and hit for power could be more valuable than a .300 hitter who never walked and rarely hit extra-base hits.
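To make that comparison concrete, here is a minimal Python sketch of the basic formula. The function name `runs_created` and both season lines are hypothetical, invented for illustration rather than taken from real players.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        return 0.0  # no plate appearances counted, so nothing created
    return (hits + walks) * total_bases / opportunities


# Hypothetical hitter A: .250 average (125 / 500), many walks, real power
patient_slugger = runs_created(hits=125, walks=90, total_bases=240, at_bats=500)

# Hypothetical hitter B: .300 average (150 / 500), few walks, mostly singles
contact_hitter = runs_created(hits=150, walks=20, total_bases=180, at_bats=500)

print(f"Patient slugger: {patient_slugger:.1f} runs created")
print(f"Contact hitter:  {contact_hitter:.1f} runs created")
```

With these made-up lines, the .250 hitter comes out well ahead of the .300 hitter, which is the pattern the formula was designed to expose.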
Pythagorean Win Expectation applied a formula similar to the Pythagorean theorem to predict a team's won-loss record based on runs scored and runs allowed:
Expected Winning Percentage = Runs Scored² / (Runs Scored² + Runs Allowed²)
Teams that significantly outperformed or underperformed their Pythagorean expectation were likely to regress toward the mean—a powerful tool for identifying overvalued and undervalued teams.
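A short sketch of the calculation follows. The function name, the `exponent` parameter, and the team totals are assumptions made for illustration; the formula in the text corresponds to an exponent of 2.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    if runs_scored <= 0 and runs_allowed <= 0:
        raise ValueError("need at least some runs scored or allowed")
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Hypothetical team: 800 runs scored, 700 allowed, 97 actual wins in 162 games
expected_pct = pythagorean_win_pct(800, 700)
expected_wins = expected_pct * 162
actual_wins = 97

# A positive gap means the team won more games than its run totals suggest
gap = actual_wins - expected_wins

print(f"Expected winning percentage: {expected_pct:.3f}")
print(f"Expected wins: {expected_wins:.1f}, actual wins: {actual_wins}, gap: {gap:+.1f}")
```

A team with a large positive gap, like this hypothetical one, is the kind of club the regression argument above would flag as a candidate to fall back the following year.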
James also pioneered defensive metrics that went beyond errors and fielding percentage. His Range Factor—putouts plus assists per game—was a crude but meaningful improvement over traditional defensive stats. He studied platoon splits, bullpen usage, aging curves, and clutch performance, often debunking widely held beliefs with rigorous analysis.
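The contrast between Range Factor and fielding percentage can be sketched in a few lines. The function names and the two shortstop stat lines below are hypothetical; the point is only to show how a fielder who makes more errors can still convert more plays per game.

```python
def fielding_percentage(putouts, assists, errors):
    """Share of chances handled cleanly: (PO + A) / (PO + A + E)."""
    chances = putouts + assists + errors
    return (putouts + assists) / chances if chances else 0.0


def range_factor_per_game(putouts, assists, games):
    """Range Factor as described above: (PO + A) per game played."""
    return (putouts + assists) / games if games else 0.0


# Hypothetical shortstops, 150 games each
rangy = {"putouts": 250, "assists": 450, "errors": 10, "games": 150}
sure_handed = {"putouts": 200, "assists": 380, "errors": 5, "games": 150}

for name, s in (("rangy", rangy), ("sure-handed", sure_handed)):
    fpct = fielding_percentage(s["putouts"], s["assists"], s["errors"])
    rf = range_factor_per_game(s["putouts"], s["assists"], s["games"])
    print(f"{name}: fielding pct {fpct:.3f}, range factor {rf:.2f}")
```

In this made-up case the sure-handed fielder has the better fielding percentage, while the rangy fielder records far more outs per game, which is the blind spot of error-based metrics.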
What made James's work revolutionary wasn't just the metrics themselves, but his methodology: question everything, let the data guide you, and never assume the conventional wisdom is correct. His writing style—irreverent, funny, combative—made complex analysis accessible and entertaining. Baseball, James showed, could be both deeply analytical and deeply fun.
P.1.3 The STATS Inc. Era and Early Adopters {#stats-inc}
While James developed his theories with The Baseball Encyclopedia and newspaper box scores, the 1980s brought technological change that would accelerate the analytical revolution. In 1984, a group of sabermetricians including Bill James launched Project Scoresheet, an ambitious effort to collect detailed play-by-play data for every major league game. Before Project Scoresheet, this granular data was proprietary to teams or simply didn't exist in machine-readable form.
Project Scoresheet's approach, and several of its organizers, carried over to STATS Inc., a company that collected and sold detailed baseball data. For the first time, analysts outside of team front offices could access pitch-by-pitch information, situational stats, and defensive data that went beyond traditional box scores. This data enabled new research avenues and more sophisticated metrics.
In 1996, a group of analysts including Gary Huckabay, Clay Davenport, and Christina Kahrl founded Baseball Prospectus, a website and annual publication that brought sabermetric analysis to a growing online audience. Baseball Prospectus introduced several innovations:
PECOTA (Player Empirical Comparison and Optimization Test Algorithm), developed by Nate Silver, used a complex algorithm to project future player performance based on comparable players. Unlike simple trend-based projections, PECOTA identified players with similar statistical profiles and aging patterns, providing probabilistic forecasts of future performance. When it projected that an aging slugger might decline or that a breakout candidate might emerge, teams began paying attention.
VORP (Value Over Replacement Player) quantified a player's total value compared to a freely available replacement-level player. This concept—that value should be measured not against average but against the easily acquired alternative—became fundamental to modern baseball economics.
Park factors were refined to account for how different ballparks affected statistics. Coors Field in Colorado didn't just inflate home runs; it increased all offensive production. Adjusting for park effects became essential for accurate player evaluation.
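One simple way to express a park effect, far cruder than the refined factors described above, is to compare run scoring in a team's home games with run scoring in its road games. The sketch below assumes that basic ratio; the function name and the run totals are hypothetical, and this is not presented as the Baseball Prospectus method.

```python
def simple_park_factor(home_rs, home_ra, home_games, road_rs, road_ra, road_games):
    """Runs per game (both teams) at home divided by the same rate on the road.

    A value above 1.0 suggests a hitter-friendly park; below 1.0, a pitcher-friendly one.
    """
    if home_games <= 0 or road_games <= 0:
        raise ValueError("game counts must be positive")
    home_rate = (home_rs + home_ra) / home_games
    road_rate = (road_rs + road_ra) / road_games
    return home_rate / road_rate


# Hypothetical team in a high-altitude park: 81 home games, 81 road games
factor = simple_park_factor(
    home_rs=450, home_ra=480, home_games=81,
    road_rs=330, road_ra=370, road_games=81,
)

# Scale a raw home run total down to a neutral-park equivalent (illustrative only)
neutral_home_runs_scored = 450 / factor

print(f"Park factor: {factor:.2f}")
print(f"Home runs scored, park-neutral: {neutral_home_runs_scored:.0f}")
```

A single-season ratio like this is noisy, which is one reason refined park factors typically pool several seasons and adjust for the quality of the teams involved.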
During this period, a few teams began hiring analysts. Sandy Alderson's Oakland Athletics employed Eric Walker, whose book "The Sinister First Baseman" argued that on-base percentage was undervalued. The Toronto Blue Jays hired analysts to inform decision-making. But these were outliers. Most teams remained skeptical that "computer nerds" could tell baseball lifers anything useful.
P.1.4 Moneyball and the Mainstream (2002-2011) {#moneyball}
Everything changed with a single book. Michael Lewis's "Moneyball," published in 2003, told the story of Billy Beane's Oakland Athletics and their use of sabermetric analysis to compete with wealthier teams. The book became a bestseller and was later adapted into an Oscar-nominated film, bringing sabermetrics into mainstream consciousness.
The Moneyball A's didn't invent baseball analytics—Lewis's book explicitly credited Bill James, Eric Walker, and others—but they demonstrated its practical application. Facing severe financial constraints, Beane and assistant general manager Paul DePodesta (a Harvard economics graduate) exploited market inefficiencies. They identified that on-base percentage was undervalued relative to batting average, that defensive position flexibility created value, that college players were safer than high school players, and that closers were overvalued compared to setup relievers.
The results spoke loudly. From 2000 to 2006, Oakland won more games than all but three teams despite having one of baseball's lowest payrolls. They made the playoffs five times, won the AL West four times, and won 20 consecutive games in 2002, at the time an American League record.
Moneyball triggered an analytical arms race. The Boston Red Sox, who hired Bill James as a consultant in 2003, won World Series championships in 2004, 2007, and 2013. The Tampa Bay Rays, with one of the smallest payrolls in baseball, made the playoffs regularly by exploiting inefficiencies like defensive shifts and bullpen optimization. Teams that once dismissed analytics rushed to hire quantitative analysts. By 2010, every major league team employed at least one analyst; many had entire departments.
Simultaneously, fan-facing analytics sites democratized advanced metrics. FanGraphs, founded in 2005, became the premier source for publicly available advanced statistics. The site made metrics like WAR (Wins Above Replacement), wRC+ (weighted runs created plus), and FIP (Fielding Independent Pitching) accessible to any fan with internet access. Brooks Baseball visualized pitch movement and location. These tools transformed how fans discussed and understood the game.
This era also saw statistical innovation accelerate. Tom Tango developed wOBA (weighted on-base average), which weights each offensive event by its estimated run value. FIP isolated pitcher performance from fielding by focusing on strikeouts, walks, and home runs, the outcomes pitchers most directly control. UZR (Ultimate Zone Rating) and DRS (Defensive Runs Saved) provided more sophisticated defensive metrics than Range Factor.
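As an illustration of how FIP combines those outcomes, the sketch below uses the commonly published form of the formula, which also counts hit batters alongside walks. The league constant changes every season; the default of 3.10 here is a placeholder assumption, as are the function name and the pitcher's stat line.

```python
def fip(home_runs, walks, hit_by_pitch, strikeouts, innings, constant=3.10):
    """Fielding Independent Pitching.

    FIP = (13*HR + 3*(BB + HBP) - 2*K) / IP + constant

    `innings` must be a true decimal (200 1/3 innings is 200.333..., not 200.1).
    `constant` is a season-specific value that puts FIP on the ERA scale;
    3.10 is only a placeholder.
    """
    if innings <= 0:
        raise ValueError("innings must be positive")
    core = (13 * home_runs + 3 * (walks + hit_by_pitch) - 2 * strikeouts) / innings
    return core + constant


# Hypothetical pitcher: 200 IP, 20 HR, 50 BB, 5 HBP, 200 K
example_fip = fip(home_runs=20, walks=50, hit_by_pitch=5, strikeouts=200, innings=200.0)
print(f"FIP: {example_fip:.3f}")
```

Because balls in play never enter the calculation, two pitchers with identical strikeout, walk, hit-batter, and home run totals get the same FIP regardless of the defense behind them.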
P.1.5 The Statcast Revolution (2015-Present) {#statcast}
If Moneyball represented analytics entering baseball's front offices, Statcast represented analytics transforming the game itself. Installed in all 30 major league ballparks before the 2015 season, Statcast originally combined TrackMan radar with high-definition cameras, and now relies on Hawk-Eye optical tracking, to capture unprecedented data about every play.
The scale of Statcast's data collection is staggering. The system tracks:
- Batted balls: exit velocity, launch angle, direction, projected distance, hang time
- Baserunning: sprint speed, route efficiency, jump, reaction time
- Pitching: velocity, spin rate, spin axis, release point, extension, break
- Fielding: first step speed, reaction time, route efficiency, catch probability
Every pitch generates dozens of data points. Over a full season, Statcast produces billions of measurements. This granular tracking data enabled entirely new categories of analysis.
Exit velocity and launch angle revolutionized hitting evaluation. Analysts identified the "barrel" zone: batted balls hit at 98 mph or harder within a launch-angle window that spans roughly 26-30 degrees at that speed and widens as exit velocity rises. Balls hit in this zone become extra-base hits at the highest rates. Players began optimizing their swings to produce more barrels, contributing to baseball's power surge in the late 2010s.
Spin rate became crucial for pitcher evaluation and development. High-spin fastballs drop less on their way to the plate than low-spin fastballs, so they appear to rise and generate more swings and misses up in the zone. Certain spin characteristics made breaking balls more effective. When some pitchers saw their spin rates drop significantly after MLB began enforcing its foreign-substance rules more vigorously in 2021, their performance declined correspondingly, which supported spin rate's importance.
Defensive metrics improved dramatically with Statcast. Catch probability accounted for distance traveled, route efficiency, and difficulty of the play. Outs Above Average (OAA) provided cleaner defensive measurement than earlier metrics. Teams could now quantify what they'd always known qualitatively: which outfielders took better routes, which infielders had better first-step quickness, which catchers best prevented stolen bases.
Statcast also democratized this data. While teams have access to more detailed proprietary data, Baseball Savant (MLB's public Statcast portal) makes much of the tracking data freely available. A fan can look up Aaron Judge's average exit velocity, Shohei Ohtani's fastball spin rate, or Mookie Betts's sprint speed with a few clicks.
The Statcast era has coincided with dramatic changes in how baseball is played. Infield shifts became ubiquitous until MLB restricted them in 2023. The four-seam fastball lost ground to sweepers and cutters optimized for horizontal movement. The launch angle revolution saw home runs surge and balls in play decline. Bullpen specialization intensified, with teams employing openers, platooning relievers aggressively, and using starting pitchers in shorter bursts.
We're now entering what might be called the biomechanics era. Teams use motion capture, force plates, and high-speed cameras to understand the physical mechanisms behind performance. Player development increasingly focuses on optimizing swing planes, arm paths, and delivery mechanics based on detailed biomechanical models informed by tracking data.
P.2.1 The Gap in Current Resources {#the-gap}
Despite the proliferation of baseball analytics content, a significant gap remains in educational resources. Academic statistics textbooks might use baseball examples but don't deeply engage with baseball-specific analytical questions. Baseball analytics blogs and websites produce excellent content but often assume substantial prior knowledge or focus on presenting results rather than teaching methods. Programming tutorials teach R or Python but rarely use realistic baseball applications.
This book aims to fill that gap by providing comprehensive, hands-on instruction in baseball analytics using modern tools and real data. You won't just learn what WAR is; you'll calculate it yourself. You won't just read about pitch tunneling; you'll visualize it with code. Each concept is accompanied by working code examples in both R and Python, using actual MLB data.
The book assumes you have basic programming knowledge but doesn't assume expertise in either statistics or baseball. Statistical concepts are introduced as needed, with clear explanations and baseball-specific context. Baseball terminology and rules are explained for readers new to the sport. By the end, you'll have both the technical skills to perform sophisticated analyses and the domain knowledge to ask meaningful questions.
P.2.2 The Open Source Philosophy {#open-source}
This book is built on open-source tools and data, reflecting a core belief: baseball analytics should be accessible to anyone with curiosity and determination. You won't need expensive software licenses or proprietary data subscriptions to work through this book. Everything uses freely available resources:
- R and Python: Free, open-source programming languages with thriving communities
- Baseball data APIs: The baseballr and pybaseball packages provide free access to enormous amounts of data
- Public Statcast data: Baseball Savant makes MLB's tracking data available to everyone
- Open-source libraries: All packages and tools used in this book are freely available
This approach has practical and philosophical benefits. Practically, it means you can start learning immediately without financial barriers. Philosophically, it aligns with how modern baseball analytics has developed—through communities of passionate people sharing ideas, code, and data.
The book's code examples are available in a public GitHub repository, and readers are encouraged to modify, extend, and share their own analyses. This isn't just a book you read; it's a resource you actively engage with, experiment with, and build upon.
P.2.3 Who Should Read This Book {#who-should-read}
This book is designed for several audiences:
Programmers interested in baseball will find practical applications for their skills, learning to work with sports data, time series, hierarchical structures, and specialized analytics challenges. The book uses baseball as a vehicle to teach data science concepts that transfer to other domains.
Baseball fans wanting to understand analytics will gain the tools to move beyond consuming others' analyses to producing their own. You'll understand not just what stats like wRC+ and FIP mean, but how they're calculated and when they're appropriate to use.
Students and educators will find a comprehensive resource for teaching data science through sports. Baseball provides intrinsically interesting problems, publicly available data, and questions with no single right answer—ideal for developing analytical thinking.
Aspiring baseball analysts will build a portfolio of skills and projects directly relevant to working in baseball. While there's no guaranteed path to a front office job, demonstrating proficiency in the tools and methods used by professional analysts is increasingly essential.
You should have basic programming experience in either R or Python—understanding variables, functions, loops, and basic data structures. If you're new to both languages, Python is recommended for its gentler learning curve and broader applicability beyond data analysis. The book provides all code in both languages, so you can choose based on preference or learn both.
Baseball knowledge is helpful but not required. Key concepts and terminology are explained as introduced. If you know what an at-bat, a walk, and an earned run are, you have sufficient foundation. Deep knowledge of baseball strategy and history will enhance your appreciation but isn't prerequisite.
P.3.1 For Beginners {#for-beginners}
If you're new to baseball analytics and have only basic programming experience, work through the book sequentially. Early chapters build foundations that later chapters assume. Chapter 1 sets up your environment and introduces basic concepts. Chapters 2-4 cover data acquisition and manipulation, essential skills for any analysis. Chapters 5-8 introduce core baseball metrics and how to calculate them. Later chapters tackle more advanced topics like modeling, prediction, and visualization.
Don't skip the code examples. Type them out rather than copying and pasting; this builds muscle memory and understanding. Experiment by modifying examples—change the player, the year, the statistic. Many "aha moments" come from seeing what happens when you tweak parameters.
Work through the exercises at each chapter's end. These aren't optional; they're where learning solidifies. Struggling with exercises is normal and valuable. The book's GitHub repository includes solutions, but try seriously before looking.
Start building a personal project early. Pick a question that genuinely interests you—maybe related to your favorite team—and return to it as you learn new skills. By the book's end, you'll have a substantial analysis demonstrating your capabilities.
P.3.2 For Experienced Programmers {#for-experienced}
If you're already comfortable with R or Python but new to baseball analytics, you might skim or skip the programming basics in early chapters. However, don't skip the baseball-specific content—even experienced programmers often misunderstand baseball's nuances, leading to analytical errors.
Focus on domain knowledge: what questions matter in baseball, what data is available, how baseball's unique characteristics affect analysis. Pay attention to baseball-specific packages and their capabilities. Much of your value as a baseball analyst comes from asking good questions, not just technical implementation.
Consider jumping to topics that interest you, though be aware that later chapters sometimes reference earlier content. The book is designed to support non-linear reading for those with strong technical foundations.
Use the book as a reference for baseball-specific tasks: "How do I calculate WAR?" "Where can I get Statcast data?" "What's the standard way to visualize spray charts?" The comprehensive index and detailed table of contents support this usage.
P.3.3 For Instructors {#for-instructors}
This book is designed to support a semester-long course in sports analytics or data science. Each chapter includes multiple exercises of varying difficulty, suitable for assignments or exam questions. The progression from data acquisition through exploratory analysis to modeling and prediction mirrors a typical data science workflow.
Several pedagogical features support classroom use:
- Dual-language code: Students can work in R or Python based on departmental standards or preference
- Real data: Using actual MLB data increases engagement compared to toy datasets
- Progressive complexity: Early chapters require only basic programming; later chapters challenge advanced students
- Open-ended projects: Many exercises have no single correct answer, encouraging creative problem-solving
- Reproducible examples: All code uses publicly available data and produces consistent results
A suggested semester structure might cover Chapters 1-4 in the first quarter (data acquisition and manipulation), Chapters 5-8 in the second quarter (metrics and evaluation), Chapters 9-12 in the third quarter (modeling and prediction), and dedicate the final quarter to student projects building on earlier material.
The book's GitHub repository includes additional resources for instructors: solution sets for exercises, supplementary datasets, and suggested project ideas. Instructors who adopt the book are encouraged to contribute their own materials back to the community.
Baseball analytics stands at an exciting juncture. The data has never been richer, the tools never more accessible, the community never more vibrant. Questions that seemed unanswerable a decade ago—How much does shifting improve defense? What makes a pitch effective? How do players age?—now have empirical answers. Yet each answer raises new questions, and the frontier of knowledge continuously expands.
This book invites you to join that frontier. Whether you're seeking a career in baseball, looking to understand your favorite team better, or simply exploring the intersection of sports and data science, you'll find the tools and knowledge to pursue your curiosity. The sabermetric revolution democratized baseball understanding; this book is your guide to participating in it.
Now let's begin. Fire up R or Python, and let's analyze some baseball.