There has never been a better time to be a baseball nerd with a laptop. Pitch-by-pitch tracking, century-old box scores, every batted ball’s exit velocity — almost all of it is public, free, and a single HTTP request away. The hard part is no longer access; it’s knowing which door to knock on for which question.
So we ran a live connectivity test before writing a word of this, hitting each source the way you actually would. Two of them needed nothing but a URL: the MLB Stats API handed back all 30 teams from one call, and Baseball Savant returned 251 qualified batters across 14 columns of expected statistics. No login, no token, no scraping tricks. What follows is a ranking of the six sources worth your time, in the order we’d reach for them.
How this ranking works
The order below reflects how often a curious analyst will actually want each source, weighted by how painless it is to query. Statcast and the MLB Stats API sit on top because they are both authoritative and genuinely open — you can hit them from a plain script with no credentials. The historical archives rank lower not because they’re worse but because you reach for them less often, and one or two have rough edges worth flagging. Every code sample here is real and runnable; the helper functions come from the companion script, which wraps nothing more exotic than requests and pandas.
#1 — Baseball Savant / Statcast
Best for: pitch-level and batted-ball-level data — exit velocity, launch angle, expected stats (xwOBA, xBA), spin rate, sprint speed. If a question involves how hard or where a ball was hit, this is the source. Needs a key? No.
Savant is run by MLB Advanced Media and exposes its leaderboards as CSV downloads. The trick almost nobody mentions: take a leaderboard URL, add csv=true to its query string, and the page becomes a clean, parseable file. That is exactly the call we verified live, and it returned 251 qualified hitters with their full expected-statistics line.
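import pandas as pd

# The leaderboard URL with csv=true in the query string returns a plain CSV file.
url = ("https://baseballsavant.mlb.com/leaderboard/expected_statistics"
       "?type=batter&year=2025&position=&team=&min=q&csv=true")
df = pd.read_csv(url)
print(df.shape)  # -> (251, 14) when we ran it; the row count moves during a season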
One caveat: Statcast measurements go back only to 2015, when the system was installed league-wide; pitch tracking from its predecessor reaches back to 2008. For anything earlier, you need one of the archives below.
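The csv=true trick is easier to reuse once the query string is built from a dictionary. As a minimal sketch, the helper below (savant_csv_url is our own hypothetical name, not part of any library) reproduces the URL from the example above. The parameter names are simply the ones visible in that URL; which other values Savant accepts is an assumption you should confirm in the browser first.

```python
from urllib.parse import urlencode

SAVANT_BASE = "https://baseballsavant.mlb.com/leaderboard/expected_statistics"


def savant_csv_url(year, player_type="batter", minimum="q"):
    """Build the expected-statistics CSV URL (hypothetical helper)."""
    params = {
        "type": player_type,
        "year": year,
        "position": "",   # left blank, as in the verified URL
        "team": "",       # left blank, as in the verified URL
        "min": minimum,   # "q" = qualified hitters only
        "csv": "true",    # ask for a CSV file instead of the HTML page
    }
    return f"{SAVANT_BASE}?{urlencode(params)}"


print(savant_csv_url(2025))
```

Pass the result straight to pd.read_csv, exactly as in the verified call. Changing the season then becomes a one-argument edit.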
#2 — MLB Stats API (statsapi.mlb.com)
Best for: standings, schedules, rosters, season and career stat lines reaching back to the 1900s, and live game feeds including pitch-by-pitch and win-probability data. This is the same JSON that powers MLB.com’s own scoreboards. Needs a key? No.
It is the most versatile source on this list. The endpoints return structured JSON, the historical depth is enormous, and because it’s the league’s production API, it is reliable and fast. Our test pulled the full team list for 2025 and counted all thirty franchises on the first try.
import requests

r = requests.get("https://statsapi.mlb.com/api/v1/teams",
                 params={"sportId": 1, "season": 2025}, timeout=60)
r.raise_for_status()  # fail loudly on a non-2xx response
teams = r.json()["teams"]
print(len(teams))  # -> 30
The API has no official public documentation, which is its one real downside: you learn the endpoints (standings, people, schedule, game/{pk}/feed/live) from community notes rather than a manual. But for the breadth of questions it answers without a single credential, nothing else comes close. One honest footnote: for historical seasons, sportId=1 also returns Negro League and Federal League rows, so league-wide totals sometimes need a filter to AL/NL only.
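The AL/NL filter can be a one-line list comprehension. The sketch below assumes each team record carries a nested league object with a numeric id, and that 103 and 104 are the American and National League ids. Both are assumptions drawn from community notes, so check them against a real response before trusting a total. The sample records are made up for illustration.

```python
# Assumed league ids: 103 = American League, 104 = National League.
AL_NL_LEAGUE_IDS = {103, 104}


def al_nl_only(teams):
    """Keep only team records whose league id is AL or NL."""
    return [t for t in teams if t.get("league", {}).get("id") in AL_NL_LEAGUE_IDS]


# Illustrative records shaped like the API's team objects (not real output).
sample = [
    {"name": "Team A", "league": {"id": 103}},
    {"name": "Team B", "league": {"id": 104}},
    {"name": "Team C", "league": {"id": 999}},  # hypothetical non-AL/NL league
    {"name": "Team D"},                          # record with no league field
]
print([t["name"] for t in al_nl_only(sample)])  # -> ['Team A', 'Team B']
```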
#3 — Baseball-Reference (via pybaseball)
Best for: deep history, the canonical versions of OPS+ and bWAR, and clean season/career tables for any player back to 1871. Needs a key? No — but you query it through the pybaseball package rather than scraping it yourself.
Baseball-Reference is the sport’s reference shelf, and its WAR implementation (bWAR) is one of the two you’ll see quoted everywhere. The pybaseball library wraps its pages into tidy DataFrames so you don’t have to parse HTML by hand.
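from pybaseball import batting_stats_bref

df = batting_stats_bref(2025)  # season batting table from Baseball-Reference
# OPS+ may be missing from this table, so select it only if it is present.
cols = [c for c in ("Name", "OPS", "OPS+") if c in df.columns]
print(df[cols].head())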
Be a polite citizen here: these calls hit a live website, so cache results and don’t hammer it in a loop. For the difference between bWAR and the FanGraphs flavor, see our piece on why WAR differs by site.
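Caching can be as simple as writing each result to disk and reusing it until it goes stale. Here is a minimal sketch using only the standard library; cached_fetch and its parameters are hypothetical names of our own, and fake_fetch stands in for a real pybaseball call so the example runs offline.

```python
import json
import tempfile
import time
from pathlib import Path


def cached_fetch(key, fetch_fn, cache_dir="cache", max_age_seconds=86400):
    """Return cached JSON for `key` if it is fresh; otherwise fetch and store it."""
    path = Path(cache_dir) / f"{key}.json"
    if path.exists() and time.time() - path.stat().st_mtime < max_age_seconds:
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch_fn()  # the only place a real request would happen
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data


calls = []


def fake_fetch():
    """Stand-in for a real call such as batting_stats_bref(2025)."""
    calls.append(1)
    return [{"Name": "Example Player", "OPS": 0.8}]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("bref_batting_2025", fake_fetch, cache_dir=tmp)
    second = cached_fetch("bref_batting_2025", fake_fetch, cache_dir=tmp)

print(len(calls))  # -> 1 (the second call was served from disk)
```

With a real DataFrame you would store something JSON-friendly, for example df.to_dict("records"). pybaseball also ships its own cache module, which may be all you need.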
#4 — FanGraphs
Best for: wRC+, fWAR, and a sprawling set of advanced and pitch-info metrics presented in the best leaderboards on the internet. Needs a key? No — but read the warning.
Here is the honest part the other guides skip: FanGraphs actively blocks automated pulls. Scripted requests to its leaderboards come back with an HTTP 403 Forbidden, and pybaseball’s FanGraphs functions frequently fail for the same reason. That isn’t a bug in your code; it’s the site defending its bandwidth.
The practical workaround is to use FanGraphs the way it’s meant to be used — through the website. Build your view in the leaderboard UI, then click “Export Data” to download a CSV by hand. You lose automation, but you gain the single best source for wRC+ and fWAR. If your project depends on scripted access to those exact numbers, plan around this constraint rather than fighting it.
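Once the CSV is on disk, the rest can still be scripted. The sketch below reads an exported file and ranks rows by a numeric column. The column names (Name, Team, wRC+, WAR) are assumptions about what a typical export contains, the sample rows are made up, and the utf-8-sig encoding is a precaution in case the file starts with a byte-order mark.

```python
import csv
import io


def load_fangraphs_export(source):
    """Read an exported CSV from a file path or an already-open text file."""
    if isinstance(source, str):
        # utf-8-sig strips a leading byte-order mark if the export has one
        with open(source, newline="", encoding="utf-8-sig") as fh:
            return list(csv.DictReader(fh))
    return list(csv.DictReader(source))


def top_by(rows, column, n=5):
    """Sort rows by a numeric column, highest first, skipping blank cells."""
    usable = [r for r in rows if (r.get(column) or "").strip()]
    return sorted(usable, key=lambda r: float(r[column]), reverse=True)[:n]


# Made-up rows standing in for a real export; column names are assumptions.
sample = io.StringIO(
    "Name,Team,wRC+,WAR\n"
    "Player A,AAA,150,6.1\n"
    "Player B,BBB,,2.0\n"
    "Player C,CCC,162,7.3\n"
)
rows = load_fangraphs_export(sample)
print([r["Name"] for r in top_by(rows, "wRC+")])  # -> ['Player C', 'Player A']
```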
#5 — Retrosheet
Best for: play-by-play history — the actual event log of (nearly) every game going back decades, the raw material behind most historical research. Needs a key? No.
Retrosheet is a volunteer labor of love: a community that has reconstructed the event-level record of baseball game by game. The data ships as downloadable event files rather than a live API, so the workflow is to grab a season’s archive and parse it. pybaseball offers convenience wrappers for some of it.
from pybaseball import retrosheet
games = retrosheet.season_game_logs(2024)
print(games.shape)
It is overkill for a casual question and indispensable for a serious one. If you want to know what happened in a specific at-bat in 1986, this is where the answer lives. Their data-use note asks that you credit Retrosheet in anything you publish — a fair trade for decades of free reconstruction.
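If you work from the raw event files rather than a wrapper, each plate appearance is a comma-separated "play" record: inning, a 0/1 flag for visiting or home team, batter id, count, pitch sequence, and an event code. The parser below is a minimal sketch of that layout. The sample lines and player id are invented for illustration, and decoding the event code itself is left out.

```python
from typing import NamedTuple, Optional


class Play(NamedTuple):
    inning: int
    home_batting: bool  # source field is 0 for the visiting team, 1 for the home team
    batter_id: str
    count: str          # balls then strikes, e.g. "12"
    pitches: str        # pitch-by-pitch sequence, may be empty
    event: str          # event code, left undecoded here


def parse_play(line: str) -> Optional[Play]:
    """Parse one 'play' record; return None for any other record type."""
    parts = line.strip().split(",", 6)
    if len(parts) != 7 or parts[0] != "play":
        return None
    _, inning, side, batter, count, pitches, event = parts
    return Play(int(inning), side == "1", batter, count, pitches, event)


# Invented sample lines in the event-file layout (not real game data).
lines = [
    "id,XXX198604080",
    "play,1,0,batt0001,12,CBX,S7",
    "play,1,1,batt0002,00,,S8.3-H;1-2",
]
plays = [p for p in map(parse_play, lines) if p is not None]
print(len(plays), plays[0].batter_id, plays[1].home_batting)  # -> 2 batt0001 True
```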
#6 — The Lahman Database
Best for: tidy season-and-career tables — a single, well-normalized relational dataset of batting, pitching, fielding, teams, and awards from 1871 to last season. Needs a key? No.
The Lahman database, maintained for years by Sean Lahman and now stewarded through the community, is the friendliest historical dataset for anyone who thinks in spreadsheets or SQL. It’s a set of flat CSV tables you load once and join freely — perfect for “career totals” questions that don’t need pitch-level detail.
One honest caveat worth knowing before you waste an afternoon: pybaseball’s bundled Lahman downloader is currently broken — the helper that’s supposed to fetch the archive fails because the upstream file location moved. Until that wrapper is patched, skip it and download the dataset straight from the source instead, then load the CSVs yourself.
# pybaseball's lahman.download() is currently broken.
# Download the CSV archive from the Lahman homepage, then:
import pandas as pd
batting = pd.read_csv("Batting.csv")
print(batting[batting["playerID"] == "ruthba01"][["yearID", "HR"]])
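For readers who think in SQL, the same tables drop straight into SQLite. This sketch joins a batting table to a people table and sums home runs per player. The column names match the Lahman files as we understand them (playerID, yearID, HR, nameFirst, nameLast), the People table is an assumption about the archive's layout, and the rows are toy data with made-up players rather than real records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE People  (playerID TEXT PRIMARY KEY, nameFirst TEXT, nameLast TEXT);
    CREATE TABLE Batting (playerID TEXT, yearID INTEGER, HR INTEGER);
""")

# Toy rows with made-up players; real rows would be loaded from the CSV files.
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?)",
    [("demoa01", "Demo", "Alpha"), ("demob01", "Demo", "Beta")],
)
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?, ?)",
    [("demoa01", 2001, 30), ("demoa01", 2002, 25), ("demob01", 2001, 12)],
)

# One row per player: season lines summed into a career total.
career_hr = conn.execute("""
    SELECT p.nameFirst || ' ' || p.nameLast AS name, SUM(b.HR) AS hr
    FROM Batting AS b
    JOIN People AS p ON p.playerID = b.playerID
    GROUP BY b.playerID
    ORDER BY hr DESC
""").fetchall()
print(career_hr)  # -> [('Demo Alpha', 55), ('Demo Beta', 12)]
```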
The bottom line
If you only bookmark two of these, make them Baseball Savant and the MLB Stats API: between them you can answer the overwhelming majority of modern questions, for free, with no credentials, from a script you can write in five minutes. We proved both live — 30 teams and 251 batters on the first request each. Reach down the list for history (Baseball-Reference, Retrosheet, Lahman) and over to FanGraphs’ website when you specifically need wRC+ or fWAR. The data is all there. The only thing standing between you and a real analysis is deciding which question to ask.
Sources & Further Reading
- Live connectivity test (30 teams, 251 qualified batters), re-runnable via scripts/data_sources_demo.py. Retrieved June 2026.
- Baseball Savant: Statcast leaderboards and CSV exports.
- Baseball-Reference and FanGraphs: the two standard homes for advanced metrics and WAR.
- Retrosheet: play-by-play game logs, used here with thanks per their data-use note.