Chapter 11: Fantasy Baseball & Sports Betting Analytics

Baseball analytics extends far beyond front offices and media analysis. Millions of people engage with baseball through fantasy leagues and sports betting, creating demand for predictive modeling, player valuation, and strategic optimization. This chapter explores the analytical frameworks underlying both domains, emphasizing the mathematical principles and statistical methods that inform decision-making.

What You'll Learn
  • Fantasy Baseball Analytics
  • Sports Betting Analytics
  • Ethical Considerations
  • Exercises

11.1 Fantasy Baseball Analytics

Fantasy baseball challenges participants to assemble rosters of real MLB players who accumulate statistics over a season. Players earn points based on their real-world performance in various statistical categories. Success requires accurate player projections, understanding positional scarcity, and optimizing roster construction within constraints like salary caps or draft positions.

11.1.1 Standard Scoring Categories {#scoring-categories}

Fantasy baseball uses two primary league formats, each with different scoring systems:

Rotisserie (Roto) Scoring ranks teams in multiple statistical categories, with team standings in each category determining points. A typical 12-team 5×5 league uses five hitting and five pitching categories:

Hitting categories:


  • Runs (R): Total runs scored

  • Home Runs (HR): Total home runs hit

  • Runs Batted In (RBI): Total runs driven in

  • Stolen Bases (SB): Total successful steals

  • Batting Average (AVG): Hits divided by at-bats

Alternative hitting categories include On-Base Percentage (OBP), Slugging Percentage (SLG), or On-Base Plus Slugging (OPS).

Pitching categories:


  • Wins (W): Pitching victories

  • Strikeouts (K): Total strikeouts recorded

  • Earned Run Average (ERA): Earned runs per nine innings

  • Walks plus Hits per Innings Pitched (WHIP): (BB + H) / IP

  • Saves (SV): Games finished as the winning closer

Alternative pitching categories include Quality Starts (QS), Holds (HLD), or Strikeout-to-Walk ratio (K/BB).

In a 12-team league, the team with the most home runs earns 12 points, second-most earns 11, continuing down to last place earning 1 point. Total points across all categories determine final standings.
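The ranking rule above can be sketched in a few lines of Python. The team names and home run totals are made up for illustration, and the helper `roto_points` is a hypothetical name. Ties are ignored for simplicity; leagues commonly split the points among tied teams.

```python
# Hypothetical home run totals for a 12-team league (illustrative values only)
hr_totals = {
    f"Team {i}": total
    for i, total in enumerate(
        [251, 243, 238, 230, 226, 221, 215, 210, 204, 198, 190, 181], start=1
    )
}

def roto_points(category_totals, higher_is_better=True):
    """Give N points to the category leader, N-1 to second, ..., 1 to last.

    Assumes no ties between teams.
    """
    ranked = sorted(category_totals, key=category_totals.get,
                    reverse=higher_is_better)
    n_teams = len(ranked)
    return {team: n_teams - position for position, team in enumerate(ranked)}

hr_points = roto_points(hr_totals)
print(hr_points["Team 1"], hr_points["Team 12"])
```

For categories where lower is better, such as ERA and WHIP, pass `higher_is_better=False`. A team's final score is the sum of its points across all ten categories.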

Points-Based Scoring assigns numerical values to each statistical event. A typical system might award:

Hitting:


  • Single: 1 point

  • Double: 2 points

  • Triple: 3 points

  • Home run: 4 points

  • RBI: 1 point

  • Run: 1 point

  • Stolen base: 2 points

  • Walk: 1 point

  • Strikeout: -1 point

Pitching:


  • Inning pitched: 3 points

  • Strikeout: 1 point

  • Win: 5 points

  • Save: 5 points

  • Hit allowed: -1 point

  • Walk allowed: -1 point

  • Earned run allowed: -2 points

Points leagues simplify valuation—player value equals projected total points. Rotisserie leagues require more sophisticated analysis because performance must be evaluated across multiple dimensions.
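As a minimal sketch of points-league valuation, the snippet below applies the example hitting weights listed above to a projected stat line. The stat line is invented for illustration and `fantasy_points` is a hypothetical helper name.

```python
# Hitting weights copied from the example scoring system above
HITTING_POINTS = {
    "1B": 1, "2B": 2, "3B": 3, "HR": 4,
    "RBI": 1, "R": 1, "SB": 2, "BB": 1, "K": -1,
}

def fantasy_points(stat_line, weights):
    """Total fantasy points: each event count times its point value."""
    return sum(weights[stat] * count for stat, count in stat_line.items())

# Hypothetical season projection for one hitter (illustrative values only)
hitter = {
    "1B": 95, "2B": 30, "3B": 3, "HR": 28,
    "RBI": 90, "R": 85, "SB": 12, "BB": 60, "K": 130,
}

print(fantasy_points(hitter, HITTING_POINTS))
```

Ranking every player by this single number, then subtracting the total of a replacement-level player at the same position, gives a complete valuation for a points league.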

Understanding Category Independence

A crucial insight for Roto leagues: categories aren't equally valuable, and their values depend on league context. In a league where most teams excel at stolen bases but struggle with ERA, additional stolen bases have minimal marginal value while ERA improvements are precious. This leads to context-dependent valuation where a player's worth varies by league composition.

Let's calculate the mean, standard deviation, and coefficient of variation (standard deviation divided by mean) for each category in a simulated league to gauge scarcity:

R Implementation:

library(tidyverse)

# Simulate a 12-team league's projected totals for hitting categories
set.seed(2024)

# Generate league projections (these would normally come from projection systems)
league_projections <- tibble(
  team = paste("Team", 1:12),
  runs = rnorm(12, mean = 850, sd = 60),
  home_runs = rnorm(12, mean = 220, sd = 25),
  rbi = rnorm(12, mean = 825, sd = 55),
  stolen_bases = rnorm(12, mean = 90, sd = 30),
  batting_avg = rnorm(12, mean = 0.265, sd = 0.010)
)

# Calculate coefficient of variation (CV) for each category
# Higher CV indicates greater scarcity and therefore higher value
category_scarcity <- league_projections %>%
  select(-team) %>%
  summarise(across(everything(),
                   list(mean = mean, sd = sd))) %>%
  pivot_longer(everything()) %>%
  separate(name, into = c("category", "stat"), sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = stat, values_from = value) %>%
  mutate(
    cv = sd / mean,
    scarcity_rank = rank(-cv)
  ) %>%
  arrange(scarcity_rank)

print(category_scarcity)

# Visualization of category scarcity
ggplot(category_scarcity, aes(x = reorder(category, -cv), y = cv)) +
  geom_col(fill = "#003087") +
  geom_text(aes(label = sprintf("%.3f", cv)), vjust = -0.5) +
  labs(
    title = "Fantasy Category Scarcity Analysis",
    subtitle = "Higher coefficient of variation indicates greater scarcity",
    x = "Category",
    y = "Coefficient of Variation"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Python Implementation:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(2024)

# Simulate 12-team league projections
league_projections = pd.DataFrame({
    'team': [f'Team {i}' for i in range(1, 13)],
    'runs': np.random.normal(850, 60, 12),
    'home_runs': np.random.normal(220, 25, 12),
    'rbi': np.random.normal(825, 55, 12),
    'stolen_bases': np.random.normal(90, 30, 12),
    'batting_avg': np.random.normal(0.265, 0.010, 12)
})

# Calculate coefficient of variation for each category
stats_cols = ['runs', 'home_runs', 'rbi', 'stolen_bases', 'batting_avg']
scarcity_analysis = pd.DataFrame({
    'category': stats_cols,
    'mean': [league_projections[col].mean() for col in stats_cols],
    'std': [league_projections[col].std() for col in stats_cols]
})

scarcity_analysis['cv'] = scarcity_analysis['std'] / scarcity_analysis['mean']
scarcity_analysis['scarcity_rank'] = scarcity_analysis['cv'].rank(ascending=False)
scarcity_analysis = scarcity_analysis.sort_values('scarcity_rank')

print("\nCategory Scarcity Analysis:")
print(scarcity_analysis)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(range(len(scarcity_analysis)), scarcity_analysis['cv'],
              color='#003087')
ax.set_xticks(range(len(scarcity_analysis)))
ax.set_xticklabels(scarcity_analysis['category'], rotation=45, ha='right')
ax.set_ylabel('Coefficient of Variation')
ax.set_title('Fantasy Category Scarcity Analysis\nHigher CV indicates greater scarcity',
             fontsize=14, fontweight='bold')

# Add value labels
for i, (bar, val) in enumerate(zip(bars, scarcity_analysis['cv'])):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
            f'{val:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

Stolen bases typically show the highest coefficient of variation, making them disproportionately valuable. Conversely, runs and RBI often have lower variance, reducing their marginal value. This quantitative assessment of scarcity drives optimal draft strategy.

11.1.2 Player Valuation: The SGP Method {#sgp-method}

The Standings Gain Points (SGP) method, developed by Art McGee in the 1990s, remains the gold standard for fantasy player valuation in Rotisserie leagues. SGP converts each statistical category into a common currency: the number of standings points (or fraction thereof) that a player contributes.

The SGP Calculation Process:

  1. Establish baseline performance: Determine the statistics of a replacement-level player, the worst player you'd consider starting. In a 12-team league this is roughly the best player still available at a position once every team has filled its starting slots there (about the 13th-ranked player at a position with one starting slot per team).
  2. Calculate category contributions above replacement: Subtract replacement-level stats from each player's projections in every category.
  3. Convert to standings points: Determine how many raw statistics equal one standings point. In a 12-team league, the gap between first and last place is 11 points. Divide the difference in category totals between first and twelfth by 11 to get the statistic-per-point conversion.
  4. Sum across categories: A player's total value is the sum of their SGP across all categories.

Example SGP Calculation:

Suppose in your 12-team league:


  • Top team projects 950 runs; bottom team projects 750 runs

  • Difference: 200 runs / 11 standings points = 18.2 runs per point

  • Replacement-level player projects 75 runs

  • Player A projects 100 runs

Player A's SGP for runs = (100 - 75) / 18.2 = 1.37 standings points

Repeat for all categories and sum.

R Implementation:

library(tidyverse)

# Sample player projections (normally from ZiPS, Steamer, or THE BAT)
player_projections <- tibble(
  player = c("Ronald Acuna Jr.", "Shohei Ohtani", "Mookie Betts",
             "Bobby Witt Jr.", "Replacement Player"),
  runs = c(115, 100, 110, 95, 70),
  hr = c(38, 42, 35, 28, 15),
  rbi = c(95, 100, 90, 85, 60),
  sb = c(65, 20, 15, 35, 5),
  avg = c(.285, .295, .290, .275, .240)
)

# League parameters: points spread (first to last)
league_spreads <- tibble(
  category = c("runs", "hr", "rbi", "sb", "avg"),
  first_place = c(950, 240, 920, 140, .275),
  last_place = c(750, 180, 770, 60, .255),
  standings_spread = 11  # 12-team league
)

# Calculate per-point denominators
league_spreads <- league_spreads %>%
  mutate(per_point = (first_place - last_place) / standings_spread)

# Get replacement level
replacement_stats <- player_projections %>%
  filter(player == "Replacement Player") %>%
  select(-player) %>%
  pivot_longer(everything(), names_to = "category", values_to = "replacement_value")

# Calculate SGP for each player
sgp_values <- player_projections %>%
  pivot_longer(-player, names_to = "category", values_to = "projected") %>%
  left_join(replacement_stats, by = "category") %>%
  left_join(league_spreads, by = "category") %>%
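  mutate(
    above_replacement = projected - replacement_value,
    # AVG is a rate stat: one hitter moves the team average by only about his
    # share of team at-bats. Assumption: 14 hitters with equal at-bats.
    above_replacement = if_else(category == "avg",
                                above_replacement / 14, above_replacement),
    sgp = above_replacement / per_point
  ) %>%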
  group_by(player) %>%
  summarise(total_sgp = sum(sgp), .groups = "drop") %>%
  arrange(desc(total_sgp))

print(sgp_values)

# Visualization
ggplot(sgp_values %>% filter(player != "Replacement Player"),
       aes(x = reorder(player, total_sgp), y = total_sgp)) +
  geom_col(fill = "#4A90E2") +
  geom_text(aes(label = sprintf("%.1f", total_sgp)), hjust = -0.2) +
  coord_flip() +
  labs(
    title = "Fantasy Player Value (SGP Method)",
    subtitle = "Standings Gain Points above replacement level",
    x = NULL,
    y = "Total SGP"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Python Implementation:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample player projections
player_projections = pd.DataFrame({
    'player': ['Ronald Acuna Jr.', 'Shohei Ohtani', 'Mookie Betts',
               'Bobby Witt Jr.', 'Replacement Player'],
    'runs': [115, 100, 110, 95, 70],
    'hr': [38, 42, 35, 28, 15],
    'rbi': [95, 100, 90, 85, 60],
    'sb': [65, 20, 15, 35, 5],
    'avg': [.285, .295, .290, .275, .240]
})

# League parameters
league_spreads = pd.DataFrame({
    'category': ['runs', 'hr', 'rbi', 'sb', 'avg'],
    'first_place': [950, 240, 920, 140, .275],
    'last_place': [750, 180, 770, 60, .255],
    'standings_spread': [11] * 5
})

# Calculate per-point denominators
league_spreads['per_point'] = (
    (league_spreads['first_place'] - league_spreads['last_place']) /
    league_spreads['standings_spread']
)

# Get replacement level
replacement_stats = player_projections[
    player_projections['player'] == 'Replacement Player'
].drop('player', axis=1).iloc[0]

# Calculate SGP for each player
sgp_results = []
for _, player_row in player_projections.iterrows():
    player_sgp = 0
    for category in ['runs', 'hr', 'rbi', 'sb', 'avg']:
        projected = player_row[category]
        replacement = replacement_stats[category]
        per_point = league_spreads[
            league_spreads['category'] == category
        ]['per_point'].values[0]

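        above_replacement = projected - replacement
        # AVG is a rate stat: one hitter moves the team average by only about
        # his share of team at-bats. Assumption: 14 hitters with equal at-bats.
        if category == 'avg':
            above_replacement /= 14
        sgp = above_replacement / per_point
        player_sgp += sgp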

    sgp_results.append({
        'player': player_row['player'],
        'total_sgp': player_sgp
    })

sgp_df = pd.DataFrame(sgp_results).sort_values('total_sgp', ascending=False)
print("\nPlayer Valuations (SGP):")
print(sgp_df)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
plot_data = sgp_df[sgp_df['player'] != 'Replacement Player']
ax.barh(range(len(plot_data)), plot_data['total_sgp'], color='#4A90E2')
ax.set_yticks(range(len(plot_data)))
ax.set_yticklabels(plot_data['player'])
ax.set_xlabel('Total SGP')
ax.set_title('Fantasy Player Value (SGP Method)\nStandings Gain Points above replacement',
             fontsize=14, fontweight='bold')

# Add value labels
for i, val in enumerate(plot_data['total_sgp']):
    ax.text(val + 0.2, i, f'{val:.1f}', va='center')

plt.tight_layout()
plt.show()

Converting SGP to Auction Values

In auction drafts, participants bid on players with a fixed budget (typically $260). To convert SGP to dollar values, use a simple proportion:

Per-team budget × Number of teams = Total dollars available

For 12 teams: $260 × 12 = $3,120 total

Sum all players' SGP above replacement to get the total SGP pool. Then:

Player value ($) = (Player SGP / Total SGP) × Total dollars available

Because every roster spot costs at least $1, it is common to set aside $1-2 per roster spot first, distribute only the remaining dollars in proportion to SGP, and then add the reserve back to each player's value.

R Implementation:

# Calculate auction values from SGP
total_budget <- 260 * 12  # 12 teams, $260 each
roster_spots <- 14  # typical roster size
reserve_per_spot <- 1

usable_budget <- total_budget - (roster_spots * 12 * reserve_per_spot)

# Sum SGP (excluding replacement player). The scalar gets its own name so it
# does not collide with the total_sgp column inside mutate().
total_sgp_pool <- sgp_values %>%
  filter(player != "Replacement Player") %>%
  summarise(sum = sum(total_sgp)) %>%
  pull(sum)

# Calculate auction values: each player's share of the SGP pool times the
# usable budget, plus the $1 reserve for the roster spot
auction_values <- sgp_values %>%
  filter(player != "Replacement Player") %>%
  mutate(
    auction_value = (total_sgp / total_sgp_pool) * usable_budget,
    auction_value = round(auction_value + reserve_per_spot, 0)
  ) %>%
  select(player, total_sgp, auction_value) %>%
  arrange(desc(auction_value))

print(auction_values)

Python Implementation:

# Calculate auction values from SGP
total_budget = 260 * 12  # $3,120 total
roster_spots = 14
reserve_per_spot = 1

usable_budget = total_budget - (roster_spots * 12 * reserve_per_spot)

# Sum total SGP (excluding replacement)
total_sgp_pool = sgp_df[sgp_df['player'] != 'Replacement Player']['total_sgp'].sum()

# Calculate auction values
auction_values = sgp_df[sgp_df['player'] != 'Replacement Player'].copy()
auction_values['auction_value'] = (
    (auction_values['total_sgp'] / total_sgp_pool) * usable_budget + reserve_per_spot
).round(0)

print("\nAuction Values:")
print(auction_values[['player', 'total_sgp', 'auction_value']])

11.1.3 Roster Optimization {#roster-optimization}

Draft optimization treats roster construction as a constrained optimization problem. In snake drafts, you don't control which players you acquire directly (opponents' picks intervene), but you can optimize pick selection given available players.

Draft Strategy Principles:

  1. Value-based drafting (VBD): Always select the player with the highest SGP above replacement, regardless of position, unless positional scarcity demands otherwise.
  2. Scarcity adjustment: Positions with steeper value dropoffs (SS, CF, closers) should be prioritized when similar-value players exist.
  3. Category balance: Avoid punting categories early; maintain flexibility to compete across categories.
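The first two principles can be combined into a simple pick rule, sketched below under stated assumptions: the player values are invented SGP-style numbers, `pick_with_scarcity` is a hypothetical helper, and the 0.5-point `tolerance` that defines "similar value" is an arbitrary choice. Among players within the tolerance of the best available value, the rule prefers the one whose position loses the most value if you wait.

```python
# Hypothetical values above replacement for the players still on the board
available = [
    {"player": "Shortstop A", "position": "SS", "value": 9.1},
    {"player": "Shortstop B", "position": "SS", "value": 6.0},
    {"player": "Outfielder A", "position": "OF", "value": 9.4},
    {"player": "Outfielder B", "position": "OF", "value": 9.0},
    {"player": "Outfielder C", "position": "OF", "value": 8.7},
]

def pick_with_scarcity(available, tolerance=0.5):
    """Pick among near-top players the one with the steepest positional dropoff."""
    best_value = max(p["value"] for p in available)
    candidates = [p for p in available if best_value - p["value"] <= tolerance]

    def dropoff(player):
        # Value lost by settling for the next-best player at the same position
        others = [
            q["value"] for q in available
            if q["position"] == player["position"]
            and q is not player
            and q["value"] <= player["value"]
        ]
        return player["value"] - max(others, default=0.0)

    return max(candidates, key=dropoff)

print(pick_with_scarcity(available)["player"])
```

Here Outfielder A has the highest raw value, but the next outfielder is almost as good, while the drop from Shortstop A to Shortstop B is large, so the rule takes the shortstop. Setting `tolerance=0.0` recovers pure value-based drafting.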

Linear Programming for Daily Fantasy

Daily fantasy sports (DFS) like DraftKings and FanDuel use salary caps, making roster construction a classic knapsack problem: maximize projected points subject to salary constraint and position requirements.

Formulation:

Maximize: Σ(projected_points_i × selected_i)

Subject to:


  • Σ(salary_i × selected_i) ≤ salary_cap

  • Exactly 1 C, 1 1B, 1 2B, 1 3B, 1 SS, 3 OF, 2 P, 1 UTIL

  • selected_i ∈ {0, 1}

R Implementation:

library(tidyverse)
library(lpSolve)

# Sample DFS player pool (in reality, hundreds of players)
dfs_players <- tibble(
  player = c("Shohei Ohtani", "Aaron Judge", "Mookie Betts", "Freddie Freeman",
             "Jose Altuve", "Rafael Devers", "Trea Turner", "Kyle Tucker",
             "Juan Soto", "Gerrit Cole", "Spencer Strider", "Dylan Cease"),
  position = c("UTIL", "OF", "OF", "1B", "2B", "3B", "SS", "OF",
               "OF", "P", "P", "P"),
  salary = c(6500, 6200, 5800, 5400, 5000, 5200, 5600, 5400,
             6000, 10500, 9800, 8500),
  projected_points = c(11.2, 10.8, 9.5, 8.7, 8.2, 8.9, 9.1, 9.3,
                       10.5, 22.5, 20.8, 18.2)
)

# This is a simplified example. Full implementation would use proper
# integer programming with position constraints
# For demonstration, we'll use a greedy value-per-dollar approach

dfs_players <- dfs_players %>%
  mutate(value_per_dollar = projected_points / salary) %>%
  arrange(desc(value_per_dollar))

# Simple greedy selection (not optimal but illustrative)
salary_cap <- 50000
selected_players <- tibble()
remaining_budget <- salary_cap
position_needs <- c("C" = 1, "1B" = 1, "2B" = 1, "3B" = 1, "SS" = 1,
                    "OF" = 3, "P" = 2, "UTIL" = 1)

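# Greedy pass in value-per-dollar order: take a player only if he fits the
# remaining budget and his listed position still has an open slot.
# Simplification: the UTIL slot is filled only by players listed as UTIL.
for (i in seq_len(nrow(dfs_players))) {
  player <- dfs_players[i, ]
  pos <- player$position
  if (player$salary <= remaining_budget && position_needs[[pos]] > 0) {
    selected_players <- bind_rows(selected_players, player)
    remaining_budget <- remaining_budget - player$salary
    position_needs[[pos]] <- position_needs[[pos]] - 1
  }
}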

print(selected_players %>% select(player, position, salary, projected_points))
print(paste("Total salary:", salary_cap - remaining_budget))
print(paste("Projected points:", sum(selected_players$projected_points)))

Python Implementation:

import pandas as pd
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

# Sample DFS player pool
dfs_players = pd.DataFrame({
    'player': ['Shohei Ohtani', 'Aaron Judge', 'Mookie Betts', 'Freddie Freeman',
               'Jose Altuve', 'Rafael Devers', 'Trea Turner', 'Kyle Tucker',
               'Juan Soto', 'Gerrit Cole', 'Spencer Strider', 'Dylan Cease'],
    'position': ['UTIL', 'OF', 'OF', '1B', '2B', '3B', 'SS', 'OF',
                 'OF', 'P', 'P', 'P'],
    'salary': [6500, 6200, 5800, 5400, 5000, 5200, 5600, 5400,
               6000, 10500, 9800, 8500],
    'projected_points': [11.2, 10.8, 9.5, 8.7, 8.2, 8.9, 9.1, 9.3,
                         10.5, 22.5, 20.8, 18.2]
})

# Create optimization problem
prob = LpProblem("DFS_Optimizer", LpMaximize)

# Decision variables
player_vars = [LpVariable(f"player_{i}", cat=LpBinary)
               for i in range(len(dfs_players))]

# Objective: maximize projected points
prob += lpSum([dfs_players.iloc[i]['projected_points'] * player_vars[i]
               for i in range(len(dfs_players))])

# Constraint: salary cap
salary_cap = 50000
prob += lpSum([dfs_players.iloc[i]['salary'] * player_vars[i]
               for i in range(len(dfs_players))]) <= salary_cap

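# Constraint: roster size (simplified). The 12-player sample pool cannot fill
# 10 slots under the $50,000 cap (the 10 cheapest players cost $59,600), so
# requiring exactly 10 players would make the problem infeasible. Allow up to
# 10 here; with a full player pool, use == 10.
prob += lpSum(player_vars) <= 10

# Solve
prob.solve()

# Extract solution (compare against 0.5 to tolerate floating-point solver output)
selected = [i for i in range(len(dfs_players)) if value(player_vars[i]) > 0.5]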
optimal_roster = dfs_players.iloc[selected]

print("\nOptimal DFS Roster:")
print(optimal_roster[['player', 'position', 'salary', 'projected_points']])
print(f"\nTotal salary: ${optimal_roster['salary'].sum():,}")
print(f"Projected points: {optimal_roster['projected_points'].sum():.1f}")

Note: Full DFS optimization requires proper position constraints (exactly 3 OF, etc.), which requires more sophisticated linear programming. Libraries like PuLP (Python) or lpSolve (R) handle these constraints elegantly.
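To make the position requirement concrete without a solver, here is a minimal sketch that swaps in exhaustive search for integer programming. It is only practical for tiny pools. The player pool, the shortened roster in `slots`, and the $25,000 cap are all assumptions chosen for illustration.

```python
from itertools import combinations, product

# Hypothetical mini player pool: (name, position, salary, projected points)
pool = [
    ("Pitcher A", "P", 10500, 22.5),
    ("Pitcher B", "P", 8500, 18.2),
    ("Shortstop A", "SS", 5600, 9.1),
    ("Shortstop B", "SS", 4000, 6.5),
    ("Outfielder A", "OF", 6200, 10.8),
    ("Outfielder B", "OF", 5400, 9.3),
    ("Outfielder C", "OF", 3500, 6.0),
]
slots = {"P": 1, "SS": 1, "OF": 2}  # assumed, shortened roster
salary_cap = 25000                  # assumed cap for the mini roster

def best_lineup(pool, slots, salary_cap):
    """Try every lineup that fills each position exactly; keep the best under the cap."""
    per_position = [
        list(combinations([p for p in pool if p[1] == pos], count))
        for pos, count in slots.items()
    ]
    best, best_points = None, float("-inf")
    for groups in product(*per_position):
        lineup = [player for group in groups for player in group]
        salary = sum(p[2] for p in lineup)
        points = sum(p[3] for p in lineup)
        if salary <= salary_cap and points > best_points:
            best, best_points = lineup, points
    return best, best_points

lineup, points = best_lineup(pool, slots, salary_cap)
for name, pos, salary, pts in lineup:
    print(f"{pos:>2}  {name:<13} ${salary:>6,}  {pts:.1f}")
print(f"Total salary: ${sum(p[2] for p in lineup):,}  Projected points: {points:.1f}")
```

In an integer program the same requirement becomes one equality constraint per position (the selection variables for all outfielders sum to the number of OF slots, and so on), which scales to pools of hundreds of players where enumeration does not.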

11.1.4 Daily Fantasy Considerations {#daily-fantasy}

Daily fantasy baseball differs from season-long formats in several crucial ways:

Salary Cap Optimization: Every player has a price; your budget is fixed. The optimization problem becomes:

max Σ(projection_i × x_i) subject to Σ(salary_i × x_i) ≤ cap

where x_i ∈ {0, 1} indicates selection.

Stack Strategies: Selecting multiple hitters from the same team ("stacking") exploits positive correlation. If Team A scores 8 runs, multiple batters benefit. The optimal stack size depends on correlation strength and variance.

Consider a simplified model:

  • Individual batter projections: 2.0 points each
  • Correlation between teammates: ρ = 0.3
  • Stack of 4 batters vs. 4 independent batters

Expected value is identical (8.0 points), but variance differs:

Var(independent) = 4 × Var(individual)
Var(stacked) = 4 × Var(individual) + 12 × Cov(i,j)

Higher variance increases upside in tournaments (top-heavy payout structures) but increases downside in cash games (50/50s, double-ups).
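The variance comparison can be checked numerically. The sketch below assumes every batter has the same variance and every pair of teammates shares the same correlation, so that Cov(i,j) = ρ × Var(individual). The per-batter variance of 1.0 is an arbitrary unit and `lineup_variance` is a hypothetical helper name.

```python
import math

def lineup_variance(n_batters, var_individual, rho):
    """Variance of the summed score of n batters with equal variance
    and a common pairwise correlation rho."""
    covariance = rho * var_individual
    # n variance terms plus n*(n-1) ordered covariance pairs
    return n_batters * var_individual + n_batters * (n_batters - 1) * covariance

var_individual = 1.0  # assumed per-batter variance, arbitrary units
independent = lineup_variance(4, var_individual, rho=0.0)
stacked = lineup_variance(4, var_individual, rho=0.3)

print(f"Independent variance: {independent:.2f}")
print(f"Stacked variance:     {stacked:.2f}")
print(f"Std dev ratio:        {math.sqrt(stacked / independent):.3f}")
```

With ρ = 0.3 the stacked lineup has 1.9 times the variance of four independent batters, which is roughly a 38% larger standard deviation around the same 8.0-point expectation.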

Pitcher-Batter Correlation: Selecting batters facing your own pitcher creates negative correlation. Generally avoid unless prices create extraordinary value.

Ownership Consideration: In large tournaments, contrarian plays (low-ownership players) offer higher expected value adjusted for field competition. If a player is 30% owned but has similar projection to a 5% owned player, the latter provides differentiation value.

11.1.5 Projection Systems {#projection-systems}

Projection systems forecast player performance using historical data, aging curves, and regression models. Major systems include:

Steamer: Freely available projections using weighted historical data and regression to the mean. Developed by Jared Cross and Dash Davidson.

ZiPS: Created by Dan Szymborski, incorporates aging curves and similarity scores to historical players.

THE BAT: Developed by Derek Carty, uses more sophisticated aging curves and platoon adjustments.

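ATC (Average Total Cost): Developed by Ariel Cohen, a weighted average of multiple projection systems in which the weights vary by statistic according to each system's historical accuracy. Aggregates of this kind often outperform any individual system.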

Combining Projections

No single projection system is definitively best. Averaging multiple systems reduces error, the same principle behind ensemble methods in machine learning. Bayesian model averaging formalizes the idea:

P(stat | data) = Σ P(stat | model_i) × P(model_i | data)

A simple practical approximation weights each system by the inverse of its historical error:

R Implementation:

library(tidyverse)

# Sample projections from three systems for one player
projections <- tibble(
  system = c("Steamer", "ZiPS", "THE BAT"),
  hr = c(32, 35, 34),
  avg = c(.275, .272, .280),
  historical_mae_hr = c(8.2, 8.5, 7.9),  # Mean Absolute Error
  historical_mae_avg = c(.018, .019, .017)
)

# Calculate weights inversely proportional to error
projection_weights <- projections %>%
  mutate(
    weight_hr = (1 / historical_mae_hr) / sum(1 / historical_mae_hr),
    weight_avg = (1 / historical_mae_avg) / sum(1 / historical_mae_avg)
  )

# Weighted average projection
consensus <- projection_weights %>%
  summarise(
    consensus_hr = sum(hr * weight_hr),
    consensus_avg = sum(avg * weight_avg)
  )

print(projection_weights)
print(consensus)

Python Implementation:

import pandas as pd
import numpy as np

# Sample projections
projections = pd.DataFrame({
    'system': ['Steamer', 'ZiPS', 'THE BAT'],
    'hr': [32, 35, 34],
    'avg': [.275, .272, .280],
    'historical_mae_hr': [8.2, 8.5, 7.9],
    'historical_mae_avg': [.018, .019, .017]
})

# Calculate inverse-error weights
projections['weight_hr'] = (1 / projections['historical_mae_hr']) / (1 / projections['historical_mae_hr']).sum()
projections['weight_avg'] = (1 / projections['historical_mae_avg']) / (1 / projections['historical_mae_avg']).sum()

# Weighted consensus
consensus_hr = (projections['hr'] * projections['weight_hr']).sum()
consensus_avg = (projections['avg'] * projections['weight_avg']).sum()

print("\nProjection Systems with Weights:")
print(projections)
print(f"\nConsensus Projection - HR: {consensus_hr:.1f}, AVG: {consensus_avg:.3f}")

Consensus projections typically outperform individual systems by 5-10% in out-of-sample testing.


R
library(tidyverse)

# Simulate a 12-team league's projected totals for hitting categories
set.seed(2024)

# Generate league projections (these would normally come from projection systems)
league_projections <- tibble(
  team = paste("Team", 1:12),
  runs = rnorm(12, mean = 850, sd = 60),
  home_runs = rnorm(12, mean = 220, sd = 25),
  rbi = rnorm(12, mean = 825, sd = 55),
  stolen_bases = rnorm(12, mean = 90, sd = 30),
  batting_avg = rnorm(12, mean = 0.265, sd = 0.010)
)

# Calculate coefficient of variation (CV) for each category
# Higher CV indicates greater scarcity and therefore higher value
category_scarcity <- league_projections %>%
  select(-team) %>%
  summarise(across(everything(),
                   list(mean = mean, sd = sd))) %>%
  pivot_longer(everything()) %>%
  separate(name, into = c("category", "stat"), sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = stat, values_from = value) %>%
  mutate(
    cv = sd / mean,
    scarcity_rank = rank(-cv)
  ) %>%
  arrange(scarcity_rank)

print(category_scarcity)

# Visualization of category scarcity
ggplot(category_scarcity, aes(x = reorder(category, -cv), y = cv)) +
  geom_col(fill = "#003087") +
  geom_text(aes(label = sprintf("%.3f", cv)), vjust = -0.5) +
  labs(
    title = "Fantasy Category Scarcity Analysis",
    subtitle = "Higher coefficient of variation indicates greater scarcity",
    x = "Category",
    y = "Coefficient of Variation"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
R
library(tidyverse)

# Sample player projections (normally from ZiPS, Steamer, or THE BAT)
player_projections <- tibble(
  player = c("Ronald Acuna Jr.", "Shohei Ohtani", "Mookie Betts",
             "Bobby Witt Jr.", "Replacement Player"),
  runs = c(115, 100, 110, 95, 70),
  hr = c(38, 42, 35, 28, 15),
  rbi = c(95, 100, 90, 85, 60),
  sb = c(65, 20, 15, 35, 5),
  avg = c(.285, .295, .290, .275, .240)
)

# League parameters: points spread (first to last)
league_spreads <- tibble(
  category = c("runs", "hr", "rbi", "sb", "avg"),
  first_place = c(950, 240, 920, 140, .275),
  last_place = c(750, 180, 770, 60, .255),
  standings_spread = 11  # 12-team league
)

# Calculate per-point denominators
league_spreads <- league_spreads %>%
  mutate(per_point = (first_place - last_place) / standings_spread)

# Get replacement level
replacement_stats <- player_projections %>%
  filter(player == "Replacement Player") %>%
  select(-player) %>%
  pivot_longer(everything(), names_to = "category", values_to = "replacement_value")

# Calculate SGP for each player
sgp_values <- player_projections %>%
  pivot_longer(-player, names_to = "category", values_to = "projected") %>%
  left_join(replacement_stats, by = "category") %>%
  left_join(league_spreads, by = "category") %>%
  mutate(
    above_replacement = projected - replacement_value,
    sgp = above_replacement / per_point
  ) %>%
  group_by(player) %>%
  summarise(total_sgp = sum(sgp), .groups = "drop") %>%
  arrange(desc(total_sgp))

print(sgp_values)

# Visualization
ggplot(sgp_values %>% filter(player != "Replacement Player"),
       aes(x = reorder(player, total_sgp), y = total_sgp)) +
  geom_col(fill = "#4A90E2") +
  geom_text(aes(label = sprintf("%.1f", total_sgp)), hjust = -0.2) +
  coord_flip() +
  labs(
    title = "Fantasy Player Value (SGP Method)",
    subtitle = "Standings Gain Points above replacement level",
    x = NULL,
    y = "Total SGP"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))
R
# Calculate auction values from SGP
total_budget <- 260 * 12  # 12 teams, $260 each
roster_spots <- 14  # typical roster size
reserve_per_spot <- 1

usable_budget <- total_budget - (roster_spots * 12 * reserve_per_spot)

# Sum SGP (excluding replacement player)
total_sgp <- sgp_values %>%
  filter(player != "Replacement Player") %>%
  summarise(sum = sum(total_sgp)) %>%
  pull(sum)

# Calculate auction values
auction_values <- sgp_values %>%
  filter(player != "Replacement Player") %>%
  mutate(
    auction_value = (total_sgp / total_sgp) * (usable_budget / total_sgp),
    auction_value = round(auction_value + reserve_per_spot, 0)
  ) %>%
  select(player, total_sgp, auction_value) %>%
  arrange(desc(auction_value))

print(auction_values)
R
library(tidyverse)
library(lpSolve)

# Sample DFS player pool (in reality, hundreds of players)
dfs_players <- tibble(
  player = c("Shohei Ohtani", "Aaron Judge", "Mookie Betts", "Freddie Freeman",
             "Jose Altuve", "Rafael Devers", "Trea Turner", "Kyle Tucker",
             "Juan Soto", "Gerrit Cole", "Spencer Strider", "Dylan Cease"),
  position = c("UTIL", "OF", "OF", "1B", "2B", "3B", "SS", "OF",
               "OF", "P", "P", "P"),
  salary = c(6500, 6200, 5800, 5400, 5000, 5200, 5600, 5400,
             6000, 10500, 9800, 8500),
  projected_points = c(11.2, 10.8, 9.5, 8.7, 8.2, 8.9, 9.1, 9.3,
                       10.5, 22.5, 20.8, 18.2)
)

# This is a simplified example. Full implementation would use proper
# integer programming with position constraints
# For demonstration, we'll use a greedy value-per-dollar approach

dfs_players <- dfs_players %>%
  mutate(value_per_dollar = projected_points / salary) %>%
  arrange(desc(value_per_dollar))

# Simple greedy selection (not optimal but illustrative)
salary_cap <- 50000
selected_players <- tibble()
remaining_budget <- salary_cap
position_needs <- c("C" = 1, "1B" = 1, "2B" = 1, "3B" = 1, "SS" = 1,
                    "OF" = 3, "P" = 2, "UTIL" = 1)

for (i in 1:nrow(dfs_players)) {
  player <- dfs_players[i, ]
  if (player$salary <= remaining_budget) {
    selected_players <- bind_rows(selected_players, player)
    remaining_budget <- remaining_budget - player$salary
    if (nrow(selected_players) >= 10) break
  }
}

print(selected_players %>% select(player, position, salary, projected_points))
print(paste("Total salary:", salary_cap - remaining_budget))
print(paste("Projected points:", sum(selected_players$projected_points)))
R
library(tidyverse)

# Sample projections from three systems for one player
projections <- tibble(
  system = c("Steamer", "ZiPS", "THE BAT"),
  hr = c(32, 35, 34),
  avg = c(.275, .272, .280),
  historical_mae_hr = c(8.2, 8.5, 7.9),  # Mean Absolute Error
  historical_mae_avg = c(.018, .019, .017)
)

# Calculate weights inversely proportional to error
projection_weights <- projections %>%
  mutate(
    weight_hr = (1 / historical_mae_hr) / sum(1 / historical_mae_hr),
    weight_avg = (1 / historical_mae_avg) / sum(1 / historical_mae_avg)
  )

# Weighted average projection
consensus <- projection_weights %>%
  summarise(
    consensus_hr = sum(hr * weight_hr),
    consensus_avg = sum(avg * weight_avg)
  )

print(projection_weights)
print(consensus)
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(2024)

# Simulate 12-team league projections
league_projections = pd.DataFrame({
    'team': [f'Team {i}' for i in range(1, 13)],
    'runs': np.random.normal(850, 60, 12),
    'home_runs': np.random.normal(220, 25, 12),
    'rbi': np.random.normal(825, 55, 12),
    'stolen_bases': np.random.normal(90, 30, 12),
    'batting_avg': np.random.normal(0.265, 0.010, 12)
})

# Calculate coefficient of variation for each category
stats_cols = ['runs', 'home_runs', 'rbi', 'stolen_bases', 'batting_avg']
scarcity_analysis = pd.DataFrame({
    'category': stats_cols,
    'mean': [league_projections[col].mean() for col in stats_cols],
    'std': [league_projections[col].std() for col in stats_cols]
})

scarcity_analysis['cv'] = scarcity_analysis['std'] / scarcity_analysis['mean']
scarcity_analysis['scarcity_rank'] = scarcity_analysis['cv'].rank(ascending=False)
scarcity_analysis = scarcity_analysis.sort_values('scarcity_rank')

print("\nCategory Scarcity Analysis:")
print(scarcity_analysis)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(range(len(scarcity_analysis)), scarcity_analysis['cv'],
              color='#003087')
ax.set_xticks(range(len(scarcity_analysis)))
ax.set_xticklabels(scarcity_analysis['category'], rotation=45, ha='right')
ax.set_ylabel('Coefficient of Variation')
ax.set_title('Fantasy Category Scarcity Analysis\nHigher CV indicates greater scarcity',
             fontsize=14, fontweight='bold')

# Add value labels
for i, (bar, val) in enumerate(zip(bars, scarcity_analysis['cv'])):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
            f'{val:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Sample player projections
player_projections = pd.DataFrame({
    'player': ['Ronald Acuna Jr.', 'Shohei Ohtani', 'Mookie Betts',
               'Bobby Witt Jr.', 'Replacement Player'],
    'runs': [115, 100, 110, 95, 70],
    'hr': [38, 42, 35, 28, 15],
    'rbi': [95, 100, 90, 85, 60],
    'sb': [65, 20, 15, 35, 5],
    'avg': [.285, .295, .290, .275, .240]
})

# League parameters
league_spreads = pd.DataFrame({
    'category': ['runs', 'hr', 'rbi', 'sb', 'avg'],
    'first_place': [950, 240, 920, 140, .275],
    'last_place': [750, 180, 770, 60, .255],
    'standings_spread': [11] * 5
})

# Calculate per-point denominators
league_spreads['per_point'] = (
    (league_spreads['first_place'] - league_spreads['last_place']) /
    league_spreads['standings_spread']
)

# Get replacement level
replacement_stats = player_projections[
    player_projections['player'] == 'Replacement Player'
].drop('player', axis=1).iloc[0]

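# Calculate SGP for each player.
# Note: AVG is handled like a counting stat here, as if one player's average
# moved the team average one-for-one. A single hitter supplies only a fraction
# of team at-bats, so the AVG component is overstated in this simplified sketch.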
sgp_results = []
for _, player_row in player_projections.iterrows():
    player_sgp = 0
    for category in ['runs', 'hr', 'rbi', 'sb', 'avg']:
        projected = player_row[category]
        replacement = replacement_stats[category]
        per_point = league_spreads[
            league_spreads['category'] == category
        ]['per_point'].values[0]

        above_replacement = projected - replacement
        sgp = above_replacement / per_point
        player_sgp += sgp

    sgp_results.append({
        'player': player_row['player'],
        'total_sgp': player_sgp
    })

sgp_df = pd.DataFrame(sgp_results).sort_values('total_sgp', ascending=False)
print("\nPlayer Valuations (SGP):")
print(sgp_df)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
plot_data = sgp_df[sgp_df['player'] != 'Replacement Player']
ax.barh(range(len(plot_data)), plot_data['total_sgp'], color='#4A90E2')
ax.set_yticks(range(len(plot_data)))
ax.set_yticklabels(plot_data['player'])
ax.set_xlabel('Total SGP')
ax.set_title('Fantasy Player Value (SGP Method)\nStandings Gain Points above replacement',
             fontsize=14, fontweight='bold')

# Add value labels
for i, val in enumerate(plot_data['total_sgp']):
    ax.text(val + 0.2, i, f'{val:.1f}', va='center')

plt.tight_layout()
plt.show()
Python
# Calculate auction values from SGP (reuses sgp_df from the SGP example)
total_budget = 260 * 12  # $3,120 total
roster_spots = 14
reserve_per_spot = 1

usable_budget = total_budget - (roster_spots * 12 * reserve_per_spot)

# Sum total SGP (excluding replacement)
total_sgp_pool = sgp_df[sgp_df['player'] != 'Replacement Player']['total_sgp'].sum()

# Calculate auction values
auction_values = sgp_df[sgp_df['player'] != 'Replacement Player'].copy()
auction_values['auction_value'] = (
    (auction_values['total_sgp'] / total_sgp_pool) * usable_budget + reserve_per_spot
).round(0)

print("\nAuction Values:")
print(auction_values[['player', 'total_sgp', 'auction_value']])
Python
import pandas as pd
from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

# Sample DFS player pool
dfs_players = pd.DataFrame({
    'player': ['Shohei Ohtani', 'Aaron Judge', 'Mookie Betts', 'Freddie Freeman',
               'Jose Altuve', 'Rafael Devers', 'Trea Turner', 'Kyle Tucker',
               'Juan Soto', 'Gerrit Cole', 'Spencer Strider', 'Dylan Cease'],
    'position': ['UTIL', 'OF', 'OF', '1B', '2B', '3B', 'SS', 'OF',
                 'OF', 'P', 'P', 'P'],
    'salary': [6500, 6200, 5800, 5400, 5000, 5200, 5600, 5400,
               6000, 10500, 9800, 8500],
    'projected_points': [11.2, 10.8, 9.5, 8.7, 8.2, 8.9, 9.1, 9.3,
                         10.5, 22.5, 20.8, 18.2]
})

# Create optimization problem
prob = LpProblem("DFS_Optimizer", LpMaximize)

# Decision variables
player_vars = [LpVariable(f"player_{i}", cat=LpBinary)
               for i in range(len(dfs_players))]

# Objective: maximize projected points
prob += lpSum([dfs_players.iloc[i]['projected_points'] * player_vars[i]
               for i in range(len(dfs_players))])

# Constraint: salary cap
salary_cap = 50000
prob += lpSum([dfs_players.iloc[i]['salary'] * player_vars[i]
               for i in range(len(dfs_players))]) <= salary_cap

# Constraint: roster size (simplified - 10 players)
prob += lpSum(player_vars) == 10

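# Solve
prob.solve()

# Check solver status before reading the solution (1 == pulp.LpStatusOptimal).
# In this 12-player sample pool the cheapest 10 players cost $59,600, so the
# $50,000 cap cannot be met; a realistic pool would include lower-salary players.
if prob.status != 1:
    print("No feasible roster: relax the cap or add cheaper players to the pool.")
else:
    # Extract solution (binary values can come back as 0.999..., so use > 0.5)
    selected = [i for i in range(len(dfs_players)) if value(player_vars[i]) > 0.5]
    optimal_roster = dfs_players.iloc[selected]

    print("\nOptimal DFS Roster:")
    print(optimal_roster[['player', 'position', 'salary', 'projected_points']])
    print(f"\nTotal salary: ${optimal_roster['salary'].sum():,}")
    print(f"Projected points: {optimal_roster['projected_points'].sum():.1f}")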
Python
import pandas as pd
import numpy as np

# Sample projections
projections = pd.DataFrame({
    'system': ['Steamer', 'ZiPS', 'THE BAT'],
    'hr': [32, 35, 34],
    'avg': [.275, .272, .280],
    'historical_mae_hr': [8.2, 8.5, 7.9],
    'historical_mae_avg': [.018, .019, .017]
})

# Calculate inverse-error weights
projections['weight_hr'] = (1 / projections['historical_mae_hr']) / (1 / projections['historical_mae_hr']).sum()
projections['weight_avg'] = (1 / projections['historical_mae_avg']) / (1 / projections['historical_mae_avg']).sum()

# Weighted consensus
consensus_hr = (projections['hr'] * projections['weight_hr']).sum()
consensus_avg = (projections['avg'] * projections['weight_avg']).sum()

print("\nProjection Systems with Weights:")
print(projections)
print(f"\nConsensus Projection - HR: {consensus_hr:.1f}, AVG: {consensus_avg:.3f}")

11.2 Sports Betting Analytics

Sports betting requires converting probabilistic beliefs about game outcomes into optimal wagering decisions. Unlike fantasy sports, where you compete against other participants, sports betting pits you against bookmakers who have sophisticated models and access to sharp betting markets.

11.2.1 Types of Baseball Bets {#bet-types}

Moneyline: Bet on which team wins outright. Odds reflect implied probability:

  • Yankees -150 (favorites): Bet $150 to win $100
  • Red Sox +130 (underdogs): Bet $100 to win $130

Run Line: Baseball's equivalent of a point spread, typically 1.5 runs:

  • Yankees -1.5 (+120): Yankees must win by 2+ runs
  • Red Sox +1.5 (-140): Red Sox must win outright or lose by exactly 1 run

Totals (Over/Under): Bet on combined runs scored:

  • Over 8.5 runs (-110)
  • Under 8.5 runs (-110)

Proposition Bets: Specific outcomes within games:

  • Will Aaron Judge hit a home run? Yes (+250) / No (-350)
  • Total hits by Shohei Ohtani: Over 1.5 (+120)
  • First team to score: Yankees (-115) / Red Sox (-105)

Futures: Long-term outcomes:

  • World Series winner
  • AL MVP
  • Team win total: Over/Under 87.5 wins

Parlays: Multiple bets combined; all must win. Payouts multiply but probability decreases dramatically:

  • Yankees ML + Over 8.5 + Dodgers ML
  • If all three win, the payout multiplier is the product of the individual decimal odds
  • If any loses, entire bet loses
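
The parlay arithmetic can be sketched in a few lines. This is a minimal illustration, not a sportsbook's actual pricing: the three prices are assumed values for the legs listed above, and the helper names `american_to_decimal` and `parlay_decimal_odds` are introduced here for the example.

```python
def american_to_decimal(odds):
    """Convert American odds to decimal odds (total return per $1 staked)."""
    if odds < 0:
        return 1 + 100 / abs(odds)
    return 1 + odds / 100


def parlay_decimal_odds(legs):
    """Multiply the decimal odds of every leg to get the parlay price."""
    total = 1.0
    for odds in legs:
        total *= american_to_decimal(odds)
    return total


# Assumed prices for the three legs (illustrative, not quoted lines)
legs = {"Yankees ML": -150, "Over 8.5": -110, "Dodgers ML": -120}

parlay_odds = parlay_decimal_odds(legs.values())
stake = 100
profit = stake * (parlay_odds - 1)
break_even = 1 / parlay_odds  # win rate needed for the parlay to break even

print(f"Parlay decimal odds: {parlay_odds:.3f}")
print(f"Profit on a ${stake} stake: ${profit:.2f}")
print(f"Break-even probability: {break_even:.1%}")
```

At these assumed prices the parlay pays about 5.83 in decimal odds and needs all three legs to hit roughly 17% of the time to break even. Multiplying odds treats the legs as independent, which is only an approximation when two legs come from the same game.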

11.2.2 Implied Probability {#implied-probability}

Betting odds encode the bookmaker's assessment of outcome probability plus a profit margin (vig/juice). Converting odds to implied probability reveals the break-even win rate needed for profit.

American Odds Conversion:

For favorites (negative odds):
Implied Probability = |odds| / (|odds| + 100)

Example: -150 odds → 150 / (150 + 100) = 60%

For underdogs (positive odds):
Implied Probability = 100 / (odds + 100)

Example: +130 odds → 100 / (130 + 100) = 43.48%

Decimal Odds Conversion:

Implied Probability = 1 / decimal_odds

Example: 1.75 odds → 1 / 1.75 = 57.14%

The Vig (Vigorish)

Summing implied probabilities for both sides exceeds 100%, revealing the bookmaker's edge:

Yankees -150 (60%) + Red Sox +130 (43.48%) = 103.48%

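The extra 3.48% is the overround, which represents the vig, the bookmaker's commission. Because of it, a bettor who picks sides at random loses money on average: to profit on either side, you must win more often than that side's implied break-even rate (60% for the Yankees, 43.48% for the Red Sox).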
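
A common follow-up is to strip the vig and recover "fair" probabilities that sum to 100%. The sketch below uses proportional scaling, which is one of several possible methods and an assumption of this example rather than something the bookmaker publishes; `remove_vig` is a hypothetical helper name.

```python
def american_to_prob(odds):
    """Convert American odds to implied probability."""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)


def remove_vig(odds_a, odds_b):
    """Scale both implied probabilities of a two-way market so they sum to 1."""
    p_a = american_to_prob(odds_a)
    p_b = american_to_prob(odds_b)
    total = p_a + p_b  # exceeds 1 by the vig
    return p_a / total, p_b / total, total - 1


# Moneyline from the example: Yankees -150, Red Sox +130
fair_yankees, fair_red_sox, vig = remove_vig(-150, 130)

print(f"Vig: {vig:.2%}")
print(f"No-vig Yankees probability: {fair_yankees:.2%}")
print(f"No-vig Red Sox probability: {fair_red_sox:.2%}")
```

Under this scaling the Yankees' fair probability comes out near 58% rather than 60%, which is the number a model's estimate should be compared against when judging whether the market disagrees with it.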

R Implementation:

library(tidyverse)

# Function to convert American odds to implied probability
american_to_prob <- function(odds) {
  if (odds < 0) {
    abs(odds) / (abs(odds) + 100)
  } else {
    100 / (odds + 100)
  }
}

# Function to convert decimal odds to implied probability
decimal_to_prob <- function(odds) {
  1 / odds
}

# Sample betting lines
betting_lines <- tibble(
  bet = c("Yankees ML", "Red Sox ML", "Over 8.5", "Under 8.5"),
  american_odds = c(-150, 130, -110, -110),
  decimal_odds = c(1.67, 2.30, 1.91, 1.91)
)

# Calculate implied probabilities
betting_lines <- betting_lines %>%
  mutate(
    implied_prob_american = sapply(american_odds, american_to_prob),
    implied_prob_decimal = sapply(decimal_odds, decimal_to_prob),
    implied_prob_pct = implied_prob_american * 100
  )

print(betting_lines)

# Calculate total vig on moneyline
moneyline_vig <- betting_lines %>%
  filter(bet %in% c("Yankees ML", "Red Sox ML")) %>%
  summarise(total_prob = sum(implied_prob_american),
            vig = (total_prob - 1) * 100)

print(paste("Moneyline vig:", round(moneyline_vig$vig, 2), "%"))

Python Implementation:

import pandas as pd
import numpy as np

def american_to_prob(odds):
    """Convert American odds to implied probability"""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    else:
        return 100 / (odds + 100)

def decimal_to_prob(odds):
    """Convert decimal odds to implied probability"""
    return 1 / odds

# Sample betting lines
betting_lines = pd.DataFrame({
    'bet': ['Yankees ML', 'Red Sox ML', 'Over 8.5', 'Under 8.5'],
    'american_odds': [-150, 130, -110, -110],
    'decimal_odds': [1.67, 2.30, 1.91, 1.91]
})

# Calculate implied probabilities
betting_lines['implied_prob_american'] = betting_lines['american_odds'].apply(american_to_prob)
betting_lines['implied_prob_decimal'] = betting_lines['decimal_odds'].apply(decimal_to_prob)
betting_lines['implied_prob_pct'] = betting_lines['implied_prob_american'] * 100

print("\nBetting Lines with Implied Probabilities:")
print(betting_lines)

# Calculate vig
moneyline_bets = betting_lines[betting_lines['bet'].str.contains('ML')]
total_prob = moneyline_bets['implied_prob_american'].sum()
vig = (total_prob - 1) * 100

print(f"\nMoneyline vig: {vig:.2f}%")

11.2.3 Building a Betting Model {#betting-model}

Profitable sports betting requires an edge: your probability estimates must be more accurate than the market's implied probabilities. Building a model involves:

  1. Data collection: Historical game results, team statistics, pitcher matchups, weather, umpires
  2. Feature engineering: Creating predictive variables (team wOBA last 30 days, pitcher ERA vs. opposing lineup)
  3. Model training: Logistic regression, random forests, or gradient boosting to predict win probability
  4. Calibration: Ensuring predicted probabilities match observed frequencies
  5. Bet selection: Only bet when your probability significantly exceeds implied probability
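
Step 4 (calibration) is the easiest to check by hand. The sketch below bins predicted win probabilities and compares each bin's average prediction with the observed win rate, and also reports a Brier score. The ten predictions and outcomes are made-up illustrative values, not real model output, and `calibration_table` and `brier_score` are names chosen for this example.

```python
def calibration_table(probs, outcomes, n_bins=5):
    """Compare mean predicted probability with observed win rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        rows.append({
            "bin_low": i / n_bins,
            "bin_high": (i + 1) / n_bins,
            "n": len(members),
            "mean_pred": sum(p for p, _ in members) / len(members),
            "win_rate": sum(y for _, y in members) / len(members),
        })
    return rows


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Illustrative (made-up) predictions and results: 1 = predicted team won
probs = [0.45, 0.48, 0.52, 0.55, 0.58, 0.62, 0.65, 0.68, 0.42, 0.61]
outcomes = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]

rows = calibration_table(probs, outcomes)
for row in rows:
    print(f"{row['bin_low']:.1f}-{row['bin_high']:.1f}: n={row['n']}, "
          f"predicted={row['mean_pred']:.3f}, observed={row['win_rate']:.3f}")
print(f"Brier score: {brier_score(probs, outcomes):.4f}")
```

A well-calibrated model shows predicted and observed values close to each other in every bin. With only ten games the comparison is noise; in practice each bin needs hundreds of games before the gap means anything.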

Expected Value Calculation:

EV = (Probability of Win × Profit if Win) - (Probability of Loss × Loss if Lose)

Example:


  • Your model: Yankees have 65% win probability

  • Bookmaker odds: Yankees -150 (60% implied)

  • Bet $100 on Yankees ML

EV = (0.65 × $66.67) - (0.35 × $100) = $43.33 - $35.00 = +$8.33

A positive EV means the bet is profitable on average over many repetitions, provided the model's 65% estimate is accurate.

R Implementation:

library(tidyverse)

# Function to calculate EV for American odds
calculate_ev <- function(win_prob, odds, stake = 100) {
  # Calculate potential profit
  if (odds < 0) {
    profit <- stake * (100 / abs(odds))
  } else {
    profit <- stake * (odds / 100)
  }

  # Calculate EV
  ev <- (win_prob * profit) - ((1 - win_prob) * stake)
  return(ev)
}

# Sample betting scenarios
bet_scenarios <- tibble(
  team = c("Yankees", "Red Sox", "Dodgers", "Padres"),
  model_win_prob = c(0.65, 0.48, 0.55, 0.42),
  odds = c(-150, 130, -120, 175),
  stake = 100
)

# Calculate EV and implied probability for each bet
bet_scenarios <- bet_scenarios %>%
  mutate(
    implied_prob = sapply(odds, american_to_prob),
    ev = mapply(calculate_ev, model_win_prob, odds, stake),
    edge = model_win_prob - implied_prob,
    bet_recommendation = ifelse(ev > 0, "BET", "PASS")
  )

print(bet_scenarios)

# Visualize edges
ggplot(bet_scenarios, aes(x = team, y = edge, fill = bet_recommendation)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  geom_text(aes(label = sprintf("%.1f%%", edge * 100)), vjust = -0.5) +
  scale_fill_manual(values = c("BET" = "#2ECC40", "PASS" = "#FF4136")) +
  labs(
    title = "Betting Edge Analysis",
    subtitle = "Model probability vs. implied probability",
    x = NULL,
    y = "Edge (Model - Implied)",
    fill = "Recommendation"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Python Implementation:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def calculate_ev(win_prob, odds, stake=100):
    """Calculate expected value of a bet"""
    # Calculate potential profit
    if odds < 0:
        profit = stake * (100 / abs(odds))
    else:
        profit = stake * (odds / 100)

    # Calculate EV
    ev = (win_prob * profit) - ((1 - win_prob) * stake)
    return ev

# Sample betting scenarios
bet_scenarios = pd.DataFrame({
    'team': ['Yankees', 'Red Sox', 'Dodgers', 'Padres'],
    'model_win_prob': [0.65, 0.48, 0.55, 0.42],
    'odds': [-150, 130, -120, 175],
    'stake': [100] * 4
})

# Calculate metrics
bet_scenarios['implied_prob'] = bet_scenarios['odds'].apply(american_to_prob)
bet_scenarios['ev'] = bet_scenarios.apply(
    lambda row: calculate_ev(row['model_win_prob'], row['odds'], row['stake']),
    axis=1
)
bet_scenarios['edge'] = bet_scenarios['model_win_prob'] - bet_scenarios['implied_prob']
bet_scenarios['bet_recommendation'] = bet_scenarios['ev'].apply(
    lambda x: 'BET' if x > 0 else 'PASS'
)

print("\nBetting Edge Analysis:")
print(bet_scenarios)

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#2ECC40' if rec == 'BET' else '#FF4136'
          for rec in bet_scenarios['bet_recommendation']]
bars = ax.bar(bet_scenarios['team'], bet_scenarios['edge'] * 100, color=colors)
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.7)
ax.set_ylabel('Edge (Model - Implied) %')
ax.set_title('Betting Edge Analysis\nModel probability vs. implied probability',
             fontsize=14, fontweight='bold')

# Add value labels
for bar, edge in zip(bars, bet_scenarios['edge']):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{edge * 100:.1f}%', ha='center',
            va='bottom' if height > 0 else 'top')

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#2ECC40', label='BET'),
                   Patch(facecolor='#FF4136', label='PASS')]
ax.legend(handles=legend_elements, title='Recommendation')

plt.tight_layout()
plt.show()
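
The recommendation rule above bets whenever EV is positive, while the workflow calls for betting only when the model's probability significantly exceeds the implied probability. One way to express that is a minimum-edge filter. The 3-percentage-point threshold below is an arbitrary assumption for illustration, and `select_bets` is a hypothetical helper; the four scenarios are the same ones used in this section.

```python
def american_to_prob(odds):
    """Convert American odds to implied probability."""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)


def select_bets(scenarios, min_edge=0.03):
    """Keep bets whose model probability beats the implied one by min_edge."""
    picks = []
    for team, model_prob, odds in scenarios:
        edge = model_prob - american_to_prob(odds)
        if edge >= min_edge:
            picks.append((team, round(edge, 4)))
    return picks


# (team, model win probability, American odds)
scenarios = [
    ("Yankees", 0.65, -150),
    ("Red Sox", 0.48, 130),
    ("Dodgers", 0.55, -120),
    ("Padres", 0.42, 175),
]

picks = select_bets(scenarios)
for team, edge in picks:
    print(f"{team}: edge {edge:.1%}")
```

With this threshold the Dodgers bet drops out: its EV is positive, but the edge is under half a percentage point, which is well inside the error of most models.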

Kelly Criterion for Bet Sizing:

The Kelly Criterion determines optimal bet size to maximize long-term growth:

f* = (bp - q) / b

where:


  • f* = fraction of bankroll to bet

  • b = decimal odds - 1 (profit per unit staked)

  • p = probability of winning

  • q = probability of losing (1 - p)

Example:


  • Your model: 55% win probability

  • Odds: +110 (2.10 decimal, b = 1.10)

  • Kelly = (1.10 × 0.55 - 0.45) / 1.10 = 0.155 / 1.10 ≈ 0.141 = 14.1% of bankroll

Many sharp bettors stake a fraction of full Kelly (1/4 Kelly or 1/2 Kelly) to reduce variance and to limit the damage when the model overestimates its edge:

R Implementation:

# Kelly Criterion calculator
kelly_criterion <- function(win_prob, odds, fraction = 1) {
  # Convert American odds to decimal
  if (odds < 0) {
    decimal_odds <- 1 + (100 / abs(odds))
  } else {
    decimal_odds <- 1 + (odds / 100)
  }

  b <- decimal_odds - 1
  p <- win_prob
  q <- 1 - win_prob

  kelly <- (b * p - q) / b
  kelly <- max(0, kelly)  # Never bet if Kelly is negative

  return(kelly * fraction)
}

# Calculate Kelly for our scenarios
bet_scenarios <- bet_scenarios %>%
  mutate(
    full_kelly = mapply(kelly_criterion, model_win_prob, odds, 1),
    half_kelly = mapply(kelly_criterion, model_win_prob, odds, 0.5),
    quarter_kelly = mapply(kelly_criterion, model_win_prob, odds, 0.25)
  )

print(bet_scenarios %>%
      select(team, model_win_prob, odds, full_kelly, half_kelly, quarter_kelly))

Python Implementation:

def kelly_criterion(win_prob, odds, fraction=1.0):
    """Calculate Kelly Criterion bet sizing"""
    # Convert American odds to decimal
    if odds < 0:
        decimal_odds = 1 + (100 / abs(odds))
    else:
        decimal_odds = 1 + (odds / 100)

    b = decimal_odds - 1
    p = win_prob
    q = 1 - win_prob

    kelly = (b * p - q) / b
    kelly = max(0, kelly)  # Never bet if Kelly is negative

    return kelly * fraction

# Calculate Kelly for scenarios
bet_scenarios['full_kelly'] = bet_scenarios.apply(
    lambda row: kelly_criterion(row['model_win_prob'], row['odds'], 1.0),
    axis=1
)
bet_scenarios['half_kelly'] = bet_scenarios.apply(
    lambda row: kelly_criterion(row['model_win_prob'], row['odds'], 0.5),
    axis=1
)
bet_scenarios['quarter_kelly'] = bet_scenarios.apply(
    lambda row: kelly_criterion(row['model_win_prob'], row['odds'], 0.25),
    axis=1
)

print("\nKelly Criterion Bet Sizing:")
print(bet_scenarios[['team', 'model_win_prob', 'odds',
                     'full_kelly', 'half_kelly', 'quarter_kelly']])
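
To see why Kelly is described as maximizing long-term growth, it helps to compute the expected log growth per bet, g(f) = p·ln(1 + b·f) + q·ln(1 - f), at a few stake fractions. This is a minimal sketch using the worked example's numbers (p = 0.55, b = 1.10); the function names are chosen for this illustration.

```python
import math


def kelly_fraction(p, b):
    """Full Kelly stake as a fraction of bankroll (never negative)."""
    return max(0.0, (b * p - (1 - p)) / b)


def expected_log_growth(f, p, b):
    """Expected log growth per bet when staking fraction f of the bankroll."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)


p, b = 0.55, 1.10  # 55% win probability at +110
f_star = kelly_fraction(p, b)

growth = {
    "half Kelly": expected_log_growth(f_star / 2, p, b),
    "full Kelly": expected_log_growth(f_star, p, b),
    "double Kelly": expected_log_growth(2 * f_star, p, b),
}

print(f"Full Kelly fraction: {f_star:.4f}")
for label, g in growth.items():
    print(f"{label}: expected log growth per bet = {g:.5f}")
```

Half Kelly keeps most of the growth of full Kelly with much smaller swings, while staking twice the Kelly fraction drives expected growth down to roughly zero. That asymmetry is the usual argument for erring on the small side.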

11.2.4 Finding Value {#finding-value}

Line Shopping: Different sportsbooks offer different odds. Shopping for the best line increases expected value significantly.

Example:


  • Sportsbook A: Yankees -150

  • Sportsbook B: Yankees -145

Betting $100 at -145 instead of -150 raises the profit on a win from $66.67 to $68.97, an extra $2.30 per winning bet, and lowers the break-even win rate from 60.0% to 59.2%. Over hundreds of bets, this adds up substantially.
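
Line shopping reduces to picking the quote with the largest profit per dollar staked. The sketch below uses the two quotes from the example; the dictionary layout and helper names are assumptions made for illustration.

```python
def american_to_prob(odds):
    """Convert American odds to implied (break-even) probability."""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)


def profit_if_win(odds, stake=100):
    """Profit on a winning bet at American odds."""
    if odds < 0:
        return stake * 100 / abs(odds)
    return stake * odds / 100


# Quotes for the same Yankees moneyline at two books
quotes = {"Sportsbook A": -150, "Sportsbook B": -145}

# Compare by profit rather than by the raw odds number, so the rule also
# works when quotes straddle the -100/+100 boundary
best_book = max(quotes, key=lambda book: profit_if_win(quotes[book]))

for book, odds in quotes.items():
    print(f"{book}: {odds}, profit ${profit_if_win(odds):.2f}, "
          f"break-even {american_to_prob(odds):.2%}")
print(f"Best line: {best_book}")
```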

Market Inefficiencies: Betting markets are generally efficient, especially for high-profile games. Opportunities arise in:

  • Lower-profile games: Midweek day games between weak teams receive less attention
  • Timing: Early lines (opening odds) before sharp money arrives
  • Props and derivatives: Markets with less liquidity
  • Public bias: Overvalued popular teams (Yankees, Dodgers, Red Sox)

Closing Line Value (CLV): The best measure of betting skill is beating the closing line—the odds immediately before game start, when most information is incorporated. Consistently beating the closing line indicates edge.
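
CLV can be measured in several ways. A simple one, used in this sketch, is the difference in implied probability between the closing price and the price you bet at. The definition, the function name `closing_line_value`, and the sample prices are assumptions for illustration; some bettors compute it on no-vig probabilities instead.

```python
def american_to_prob(odds):
    """Convert American odds to implied probability."""
    if odds < 0:
        return abs(odds) / (abs(odds) + 100)
    return 100 / (odds + 100)


def closing_line_value(bet_odds, closing_odds):
    """Closing implied probability minus bet implied probability.

    Positive means the market moved toward your side after you bet,
    so you got a better price than the close.
    """
    return american_to_prob(closing_odds) - american_to_prob(bet_odds)


# Hypothetical bets: (description, odds taken, closing odds)
bets = [
    ("Yankees ML", -145, -160),   # line moved toward the Yankees
    ("Red Sox ML", 130, 140),     # line moved away from the Red Sox
]

for name, bet_odds, closing_odds in bets:
    clv = closing_line_value(bet_odds, closing_odds)
    print(f"{name}: bet {bet_odds}, closed {closing_odds}, CLV {clv:+.2%}")
```

Averaging this number over a bet log gives a skill signal that converges much faster than win-loss results, since it does not depend on game outcomes.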

11.2.5 Bankroll Management {#bankroll-management}

Proper bankroll management separates disciplined bettors from gamblers:

Unit Sizing: Define a "unit" as 1-2% of total bankroll. Never bet more than 5% on any single game, even with a strong edge.
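
The unit rule and the 5% cap translate directly into a small helper. The function name, its parameters, and the $5,000 bankroll are assumptions for illustration.

```python
def stake_size(bankroll, units=1.0, unit_pct=0.01, max_pct=0.05):
    """Dollar stake for a bet of `units` units, capped at max_pct of bankroll."""
    unit = bankroll * unit_pct
    return min(units * unit, bankroll * max_pct)


bankroll = 5000  # hypothetical bankroll in dollars

one_unit = stake_size(bankroll, units=1)      # standard play
three_units = stake_size(bankroll, units=3)   # stronger edge
capped = stake_size(bankroll, units=8)        # would be 8%, capped at 5%

print(f"1 unit: ${one_unit:.2f}")
print(f"3 units: ${three_units:.2f}")
print(f"8 units requested: ${capped:.2f} (capped)")
```

Recomputing the unit from the current bankroll, as this helper does, shrinks stakes automatically during a losing run.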

Tracking: Record every bet with:


  • Date, game, bet type

  • Odds, stake, result

  • Running bankroll total

  • Closing line for comparison

R Implementation - Bet Tracking:

library(tidyverse)
library(lubridate)

# Sample bet log
bet_log <- tibble(
  date = as.Date(c("2024-04-01", "2024-04-01", "2024-04-02",
                   "2024-04-03", "2024-04-04")),
  game = c("NYY @ BOS", "LAD @ SD", "HOU @ SEA", "ATL @ NYM", "CHC @ STL"),
  bet_type = c("Moneyline", "Moneyline", "Over 8.5", "Moneyline", "Under 7.5"),
  team = c("Yankees", "Dodgers", "Over", "Braves", "Under"),
  odds = c(-150, 120, -110, -135, -105),
  stake = c(100, 100, 110, 100, 100),
  result = c("Win", "Loss", "Win", "Win", "Loss"),
  profit = c(66.67, -100, 100, 74.07, -100)
)

# Calculate cumulative metrics
bet_log <- bet_log %>%
  mutate(
    cumulative_profit = cumsum(profit),
    cumulative_roi = cumulative_profit / cumsum(stake),
    win = ifelse(result == "Win", 1, 0)
  ) %>%
  mutate(
    cumulative_wins = cumsum(win),
    bet_number = row_number(),
    win_rate = cumulative_wins / bet_number
  )

print(bet_log %>% select(date, game, bet_type, stake, result,
                          profit, cumulative_profit, win_rate))

# Visualization
ggplot(bet_log, aes(x = bet_number, y = cumulative_profit)) +
  geom_line(color = "#003087", linewidth = 1.2) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(aes(color = result), size = 3) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  scale_color_manual(values = c("Win" = "#2ECC40", "Loss" = "#FF4136")) +
  labs(
    title = "Betting Performance Tracker",
    subtitle = "Cumulative profit over time",
    x = "Bet Number",
    y = "Cumulative Profit ($)",
    color = "Result"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"))

Python Implementation - Bet Tracking:

import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

# Sample bet log
bet_log = pd.DataFrame({
    'date': pd.to_datetime(['2024-04-01', '2024-04-01', '2024-04-02',
                            '2024-04-03', '2024-04-04']),
    'game': ['NYY @ BOS', 'LAD @ SD', 'HOU @ SEA', 'ATL @ NYM', 'CHC @ STL'],
    'bet_type': ['Moneyline', 'Moneyline', 'Over 8.5', 'Moneyline', 'Under 7.5'],
    'team': ['Yankees', 'Dodgers', 'Over', 'Braves', 'Under'],
    'odds': [-150, 120, -110, -135, -105],
    'stake': [100, 100, 110, 100, 100],
    'result': ['Win', 'Loss', 'Win', 'Win', 'Loss'],
    'profit': [66.67, -100, 100, 74.07, -100]
})

# Calculate cumulative metrics
bet_log['cumulative_profit'] = bet_log['profit'].cumsum()
bet_log['cumulative_stake'] = bet_log['stake'].cumsum()
bet_log['cumulative_roi'] = bet_log['cumulative_profit'] / bet_log['cumulative_stake']
bet_log['win'] = (bet_log['result'] == 'Win').astype(int)
bet_log['cumulative_wins'] = bet_log['win'].cumsum()
bet_log['bet_number'] = range(1, len(bet_log) + 1)
bet_log['win_rate'] = bet_log['cumulative_wins'] / bet_log['bet_number']

print("\nBet Tracker Summary:")
print(bet_log[['date', 'game', 'bet_type', 'stake', 'result',
               'profit', 'cumulative_profit', 'win_rate']])

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(bet_log['bet_number'], bet_log['cumulative_profit'],
        color='#003087', linewidth=2, marker='o', markersize=8)
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.7)

# Color points by result
for idx, row in bet_log.iterrows():
    color = '#2ECC40' if row['result'] == 'Win' else '#FF4136'
    ax.scatter(row['bet_number'], row['cumulative_profit'],
              color=color, s=100, zorder=5)

ax.set_xlabel('Bet Number', fontsize=12)
ax.set_ylabel('Cumulative Profit ($)', fontsize=12)
ax.set_title('Betting Performance Tracker\nCumulative profit over time',
             fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#2ECC40', label='Win'),
                   Patch(facecolor='#FF4136', label='Loss')]
ax.legend(handles=legend_elements, title='Result')

plt.tight_layout()
plt.show()

print(f"\nOverall Win Rate: {bet_log['win_rate'].iloc[-1]:.1%}")
print(f"Overall ROI: {bet_log['cumulative_roi'].iloc[-1]:.1%}")

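Risk of Ruin: The probability of losing your entire bankroll. Aggressive staking increases the risk of ruin even with an edge. For even-money bets of one fixed unit each, it can be approximated as:

P(ruin) ≈ ((1-p)/p)^(bankroll/unit_size)

where p is the win probability (p > 0.5) and bankroll/unit_size is the number of units in the bankroll. Bets at other odds or with varying stakes call for simulation instead.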
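
The formula is easy to evaluate for different unit sizes. The sketch below assumes even-money bets of one fixed unit each and a win probability above 50%; the function name and the sample bankroll are chosen for illustration.

```python
def risk_of_ruin(win_prob, bankroll, unit_size):
    """Gambler's-ruin approximation for even-money, fixed-unit bets."""
    if win_prob <= 0.5:
        return 1.0  # without an edge at even money, ruin is eventually certain
    n_units = bankroll / unit_size
    return ((1 - win_prob) / win_prob) ** n_units


bankroll = 1000   # hypothetical bankroll in dollars
win_prob = 0.55   # assumed long-run win rate at even money

for unit_size in (20, 50, 100):  # 2%, 5%, and 10% of bankroll
    ror = risk_of_ruin(win_prob, bankroll, unit_size)
    print(f"Unit ${unit_size} ({unit_size / bankroll:.0%} of bankroll): "
          f"risk of ruin {ror:.4%}")
```

With the same 55% win rate, moving from 2% units to 10% units raises the estimated risk of ruin from a few thousandths of a percent to about 13%, which is the quantitative case for small units.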


Python
import pandas as pd
import matplotlib.pyplot as plt

# Sample bet log
bet_log = pd.DataFrame({
    'date': pd.to_datetime(['2024-04-01', '2024-04-01', '2024-04-02',
                            '2024-04-03', '2024-04-04']),
    'game': ['NYY @ BOS', 'LAD @ SD', 'HOU @ SEA', 'ATL @ NYM', 'CHC @ STL'],
    'bet_type': ['Moneyline', 'Moneyline', 'Over 8.5', 'Moneyline', 'Under 7.5'],
    'team': ['Yankees', 'Dodgers', 'Over', 'Braves', 'Under'],
    'odds': [-150, 120, -110, -135, -105],
    'stake': [100, 100, 110, 100, 100],
    'result': ['Win', 'Loss', 'Win', 'Win', 'Loss'],
    'profit': [66.67, -100, 100, 74.07, -100]
})

# Calculate cumulative metrics
bet_log['cumulative_profit'] = bet_log['profit'].cumsum()
bet_log['cumulative_stake'] = bet_log['stake'].cumsum()
bet_log['cumulative_roi'] = bet_log['cumulative_profit'] / bet_log['cumulative_stake']
bet_log['win'] = (bet_log['result'] == 'Win').astype(int)
bet_log['cumulative_wins'] = bet_log['win'].cumsum()
bet_log['bet_number'] = range(1, len(bet_log) + 1)
bet_log['win_rate'] = bet_log['cumulative_wins'] / bet_log['bet_number']

print("\nBet Tracker Summary:")
print(bet_log[['date', 'game', 'bet_type', 'stake', 'result',
               'profit', 'cumulative_profit', 'win_rate']])

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(bet_log['bet_number'], bet_log['cumulative_profit'],
        color='#003087', linewidth=2, marker='o', markersize=8)
ax.axhline(y=0, color='gray', linestyle='--', alpha=0.7)

# Color points by result (one vectorized scatter call instead of a row loop)
point_colors = bet_log['result'].map({'Win': '#2ECC40', 'Loss': '#FF4136'}).tolist()
ax.scatter(bet_log['bet_number'], bet_log['cumulative_profit'],
           c=point_colors, s=100, zorder=5)

ax.set_xlabel('Bet Number', fontsize=12)
ax.set_ylabel('Cumulative Profit ($)', fontsize=12)
ax.set_title('Betting Performance Tracker\nCumulative profit over time',
             fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='#2ECC40', label='Win'),
                   Patch(facecolor='#FF4136', label='Loss')]
ax.legend(handles=legend_elements, title='Result')

plt.tight_layout()
plt.show()

print(f"\nOverall Win Rate: {bet_log['win_rate'].iloc[-1]:.1%}")
print(f"Overall ROI: {bet_log['cumulative_roi'].iloc[-1]:.1%}")
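
The bet log above types the profit column in by hand, which invites transcription errors. As a minimal sketch, the profit could instead be derived from the odds, stake, and result. The helper name `settle_bet` and its treatment of a push (stake returned, zero profit) are assumptions for illustration, not part of the tracker above.

```python
def settle_bet(odds, stake, result):
    """Return the profit of a settled bet at American odds (hypothetical helper)."""
    if result == 'Win':
        if odds < 0:
            # Favorite: risk abs(odds) to win 100
            return round(stake * 100 / abs(odds), 2)
        # Underdog: risk 100 to win odds
        return round(stake * odds / 100, 2)
    if result == 'Loss':
        return -float(stake)
    # Assumption: any other result (e.g. a push) returns the stake
    return 0.0

# Same odds, stakes, and results as the sample bet log
examples = [(-150, 100, 'Win'), (120, 100, 'Loss'), (-110, 110, 'Win'),
            (-135, 100, 'Win'), (-105, 100, 'Loss')]
profits = [settle_bet(odds, stake, result) for odds, stake, result in examples]
print(profits)
```

With a DataFrame like `bet_log`, the same helper could be applied row by row through `bet_log.apply(..., axis=1)` so that the profit column always agrees with the odds and stake columns.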

11.3 Ethical Considerations

Fantasy sports and sports betting analytics raise important ethical questions about responsible participation and data usage.

Responsible Gambling

Problem Gambling Awareness: Gambling addiction affects millions. Warning signs include:


  • Chasing losses (betting more to recover losses)

  • Betting with money needed for essentials

  • Lying about betting activity

  • Inability to stop despite negative consequences

Resources: National Council on Problem Gambling (1-800-522-4700), Gamblers Anonymous.

Setting Limits: Establish and enforce strict limits:


  • Maximum bankroll allocated (money you can afford to lose completely)

  • Maximum bet size per game (1-2% of bankroll)

  • Loss limits (stop betting if down X% in a period)

  • Time limits (don't let betting dominate life)
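
These limits are easier to keep when they are enforced in code rather than by willpower. The sketch below is one way to do that; the function names, the 2% per-bet cap, and the 20% loss limit are illustrative assumptions (the list above leaves the loss percentage open).

```python
def capped_stake(bankroll, proposed_stake, max_fraction=0.02):
    """Cap a proposed stake at max_fraction of the current bankroll."""
    if bankroll <= 0:
        return 0.0
    return max(0.0, min(proposed_stake, bankroll * max_fraction))

def hit_loss_limit(start_bankroll, bankroll, loss_limit=0.20):
    """Return True once losses for the period reach loss_limit of the starting bankroll."""
    return (start_bankroll - bankroll) >= loss_limit * start_bankroll

# A sizing rule suggests $125 on a $1,000 bankroll; the cap cuts it to 2%
stake = capped_stake(bankroll=1000, proposed_stake=125)
print(f"Stake after cap: ${stake:.2f}")

# Down $210 from a $1,000 start: the assumed 20% loss limit is reached
print("Stop betting:", hit_loss_limit(start_bankroll=1000, bankroll=790))
```

Applying the cap after any sizing formula means an overconfident model probability cannot push a single bet beyond the preset limit.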

Analytics as Entertainment: Treat betting as entertainment with a cost, not as a source of income. Even with an edge, variance means losing streaks will happen. Very few people profit consistently from sports betting.
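
To make the point about losing streaks concrete, the sketch below computes the probability of seeing at least one run of consecutive losses over a series of bets. It assumes independent bets with a constant win probability, which real betting only approximates; the function name and the example inputs (100 bets, 55% win rate, five straight losses) are hypothetical.

```python
def prob_losing_streak(n_bets, win_prob, streak_len):
    """P(at least one run of streak_len straight losses in n_bets independent bets)."""
    q = 1.0 - win_prob
    # state[j] = probability of sitting on j straight losses with no streak so far
    state = [1.0] + [0.0] * (streak_len - 1)
    hit = 0.0
    for _ in range(n_bets):
        new_state = [0.0] * streak_len
        new_state[0] = sum(state) * win_prob   # a win resets the run
        for j in range(streak_len - 1):
            new_state[j + 1] = state[j] * q    # a loss extends the run
        hit += state[streak_len - 1] * q       # the run reaches streak_len
        state = new_state
    return hit

p = prob_losing_streak(n_bets=100, win_prob=0.55, streak_len=5)
print(f"P(5+ straight losses in 100 bets at a 55% win rate): {p:.1%}")
```

Rerunning this with a larger `n_bets` shows how the chance of a long losing run grows with volume even when every individual bet has a positive edge.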

Data Ethics

Insider Information: Using non-public information (injuries not yet reported, lineup changes before announcement) creates ethical and potentially legal issues. Most sportsbooks prohibit betting based on material non-public information.

Model Transparency: If publishing betting models or recommendations, disclose:


  • Methodology and assumptions

  • Historical performance (including losses)

  • Potential conflicts of interest

  • Risk warnings

Fantasy Industry Practices: Daily fantasy sites have faced criticism regarding:


  • Employee betting with insider data

  • Rake structures that make profit nearly impossible for casual players

  • Advertising targeting vulnerable populations

Academic Integrity: Using sports betting as an academic exercise in modeling and optimization is valuable, but presenting betting systems without acknowledging how difficult long-term profitability is misleads students.

Legal Considerations

Sports betting legality varies by jurisdiction. As of 2024:


  • Legal and regulated in 30+ US states

  • Illegal in others (e.g., California, Texas, Georgia)

  • Daily fantasy sports have a different legal status than sports betting in some states

Always verify local laws before participating.



11.4 Exercises

Exercise 11.1: Fantasy Player Valuation

Using projection data for 10 players across all five standard hitting categories:

  1. Calculate replacement-level statistics (use the worst player's projections)
  2. Define per-point denominators for each category (assume reasonable league spreads)
  3. Calculate SGP for each player
  4. Convert SGP to auction values (12-team league, $260 budget)
  5. Visualize the relationship between a player's projected home runs and their auction value

Extension: How does the value of a player with extreme stolen base totals (50+) change if the league adds OBP as a sixth category?

Exercise 11.2: DFS Optimization

Create a simplified DFS optimizer for a slate of 20 players:

  1. Generate random projections (points) and salaries for 20 players across positions
  2. Implement a greedy algorithm that selects players by value-per-dollar
  3. Compare the greedy solution to a random selection
  4. Calculate what percentage of random lineups beat the greedy lineup
  5. Discuss: Why might the optimal lineup differ from highest value-per-dollar?

Extension: Add stack constraints (at least 3 batters from the same team must be selected).

Exercise 11.3: Implied Probability and Vig Analysis

Given these betting lines for five games:

Game 1: Team A -180, Team B +160
Game 2: Team C -110, Team D -110
Game 3: Team E +140, Team F -160
Game 4: Team G -125, Team H +105
Game 5: Team I -200, Team J +175

  1. Calculate implied probability for each team
  2. Calculate the vig for each game
  3. Identify which game has the highest and lowest vig
  4. If you believe Team B has a 42% chance to win, calculate the EV of betting $100 on them
  5. Visualize implied probabilities vs. your model probabilities (create hypothetical model probabilities)

Exercise 11.4: Bankroll Simulation

Simulate a betting season with the following parameters:

  1. Starting bankroll: $1,000
  2. 100 bets over the season
  3. Each bet has a 53% win probability (a small edge over the roughly 52.4% break-even rate at -110)
  4. Odds: -110 for all bets
  5. Three bet sizing strategies: flat $50 per bet, 5% Kelly, 2% Kelly

For each strategy:


  • Simulate 1,000 seasons (1,000 × 100 bets)

  • Calculate median ending bankroll

  • Calculate probability of ruin (ending bankroll < $100)

  • Calculate 90th percentile outcome

  • Visualize distribution of ending bankrolls

Discuss which strategy you'd recommend and why.


Fantasy baseball and sports betting analytics demonstrate how rigorous quantitative methods inform decisions under uncertainty. Whether valuing players across multiple performance dimensions, optimizing roster construction against salary constraints, or calculating expected value for betting opportunities, the principles of probability, statistics, and optimization provide frameworks for systematic decision-making.

These applications also illustrate analytics' limitations. Fantasy success depends on projection accuracy—but player performance includes irreducible randomness. Sports betting models require edge over sophisticated market prices—but even small edges require disciplined bankroll management to survive variance. No amount of analytical sophistication eliminates uncertainty or guarantees profits.

The skills developed in this chapter—converting probabilities to decisions, optimizing under constraints, managing risk—transfer to countless domains beyond sports. Data scientists in any field face similar challenges: building predictive models, quantifying uncertainty, and making optimal choices given limited information. Baseball provides a rich, accessible environment for developing these capabilities.

Chapter 12 will explore advanced topics in baseball analytics, including machine learning applications, deep learning for pitch classification, and cutting-edge research areas shaping the field's future.


