Chapter 23: International Baseball Analytics

23.1 Global Baseball Landscape

Baseball's international footprint extends far beyond Major League Baseball, with professional leagues across Asia, Latin America, and other regions producing world-class talent. Understanding the global baseball landscape is essential for modern analytics, as international players increasingly impact MLB rosters and performance.

Major International Leagues

The primary professional baseball leagues outside MLB include:

Nippon Professional Baseball (NPB) - Japan's elite league, founded in 1950
Korea Baseball Organization (KBO) - South Korea's top league, established in 1982
Chinese Professional Baseball League (CPBL) - Taiwan's premier league, started in 1990
Cuban National Series - Cuba's domestic league system
Mexican League (LMB) - Mexico's top-tier professional league
Various Caribbean leagues - Dominican, Venezuelan, and Puerto Rican winter leagues

Each league operates with unique characteristics affecting player development, performance metrics, and translation to MLB standards.

League Comparison Framework

When analyzing international baseball, we must account for several factors:

Competition level differences: NPB and KBO both offer high-quality competition; NPB is usually placed between Triple-A and MLB, and KBO closer to the Double-A to Triple-A range
Ball specifications: Different leagues use balls with varying specifications affecting flight characteristics
Park dimensions: International stadiums often have different dimensions than MLB parks
Schedule intensity: NPB plays 143 games, KBO plays 144, compared to MLB's 162
Playing style: Cultural differences influence strategic approaches
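Schedule length is the easiest of these factors to correct for. The sketch below is a minimal illustration, assuming simple linear scaling by games played and a hypothetical hitter; the function names `scale_to_schedule` and `per_600_pa` are invented for this example and do not come from any library.

```python
# Season lengths quoted in the list above
SEASON_GAMES = {"MLB": 162, "NPB": 143, "KBO": 144}


def scale_to_schedule(count, from_league, to_league="MLB"):
    """Scale a season counting stat linearly by the ratio of schedule lengths."""
    return count * SEASON_GAMES[to_league] / SEASON_GAMES[from_league]


def per_600_pa(count, pa):
    """Express a counting stat as a rate per 600 plate appearances."""
    if pa <= 0:
        raise ValueError("pa must be positive")
    return count / pa * 600


# Hypothetical hitter (assumption): 30 HR over 572 PA in a 143-game NPB season
hr_162 = scale_to_schedule(30, "NPB")
hr_600 = per_600_pa(30, 572)

print(f"HR scaled to a 162-game schedule: {hr_162:.1f}")
print(f"HR per 600 PA: {hr_600:.1f}")
```

Rate stats per plate appearance are usually the safer basis for cross-league comparison, since they do not depend on schedule length or playing time at all.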

Let's examine league statistics to understand competitive balance:

# R: Comparing league statistics across international baseball
library(tidyverse)
library(ggplot2)

# League comparison data (2023 season)
league_stats <- data.frame(
  league = c("MLB", "NPB", "KBO", "CPBL", "Mexican League"),
  avg_ba = c(.248, .249, .269, .289, .285),
  avg_obp = c(.320, .319, .336, .351, .348),
  avg_slg = c(.409, .388, .408, .426, .437),
  avg_era = c(4.33, 3.87, 4.54, 4.89, 4.72),
  avg_k_rate = c(22.5, 21.8, 19.3, 17.2, 18.5),
  avg_bb_rate = c(8.7, 8.9, 10.2, 9.8, 10.5),
  hr_per_game = c(1.19, 1.08, 1.15, 1.23, 1.28),
  games_per_season = c(162, 143, 144, 120, 114),
  teams = c(30, 12, 10, 6, 18)
)

# Approximate a wOBA-style index for each league from aggregate rates
# (a rough proxy only: true wOBA requires per-event counts)
league_stats <- league_stats %>%
  mutate(
    wOBA = 0.69 * avg_bb_rate/100 +
           0.72 * (avg_obp - avg_ba - avg_bb_rate/100) +
           0.88 * (avg_ba - (avg_slg - avg_ba)/3.5) +
           1.24 * ((avg_slg - avg_ba)/3.5 - hr_per_game/9) +
           1.56 * (hr_per_game/9)
  )

# Visualize offensive environment
ggplot(league_stats, aes(x = reorder(league, -wOBA), y = wOBA, fill = league)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = mean(league_stats$wOBA),
             linetype = "dashed", color = "red") +
  labs(title = "League Offensive Environments (2023)",
       subtitle = "wOBA comparison across international leagues",
       x = "League", y = "League Average wOBA") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

# Print comparison table
print(league_stats %>%
  select(league, avg_ba, avg_obp, avg_slg, avg_era, wOBA) %>%
  arrange(desc(wOBA)))
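The R snippet above builds its league index from aggregate rates. For comparison, wOBA as conventionally defined is a weighted sum of event counts divided by AB + BB - IBB + SF + HBP. The sketch below is a minimal illustration under stated assumptions: the walk, hit-by-pitch, single, double, and triple weights reuse the coefficients that appear in the R snippet, the home run weight of 1.95 is an assumption, and the batting line is hypothetical. Published weights are recomputed each season, so these values should be treated as placeholders.

```python
# Illustrative linear weights; real wOBA weights vary by season and league.
WEIGHTS = {
    "ubb": 0.69,
    "hbp": 0.72,
    "single": 0.88,
    "double": 1.24,
    "triple": 1.56,
    "hr": 1.95,  # assumption: not taken from the formula above
}


def woba(ab, bb, ibb, hbp, sf, singles, doubles, triples, hr, weights=WEIGHTS):
    """Weighted on-base average from event counts (unintentional walks only)."""
    numerator = (
        weights["ubb"] * (bb - ibb)
        + weights["hbp"] * hbp
        + weights["single"] * singles
        + weights["double"] * doubles
        + weights["triple"] * triples
        + weights["hr"] * hr
    )
    denominator = ab + bb - ibb + sf + hbp
    if denominator <= 0:
        raise ValueError("no plate appearances to evaluate")
    return numerator / denominator


# Hypothetical batting line (assumption, not real league data)
example_woba = woba(ab=500, bb=60, ibb=5, hbp=5, sf=5,
                    singles=90, doubles=30, triples=3, hr=25)
print(f"wOBA: {example_woba:.3f}")
```

Computing the index from counts keeps each event's contribution explicit, which makes it easier to swap in league-specific weights when they are available.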

# Python: International league environment analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# League comparison data
league_data = {
    'league': ['MLB', 'NPB', 'KBO', 'CPBL', 'Mexican League'],
    'avg_ba': [.248, .249, .269, .289, .285],
    'avg_obp': [.320, .319, .336, .351, .348],
    'avg_slg': [.409, .388, .408, .426, .437],
    'avg_era': [4.33, 3.87, 4.54, 4.89, 4.72],
    'avg_k_rate': [22.5, 21.8, 19.3, 17.2, 18.5],
    'avg_bb_rate': [8.7, 8.9, 10.2, 9.8, 10.5],
    'hr_per_game': [1.19, 1.08, 1.15, 1.23, 1.28]
}

df_leagues = pd.DataFrame(league_data)

# Calculate run environment index (REI) - normalized to MLB = 100
mlb_runs_per_game = 4.5
df_leagues['runs_per_game'] = [4.5, 4.2, 4.8, 5.1, 5.0]
df_leagues['rei'] = (df_leagues['runs_per_game'] / mlb_runs_per_game) * 100

# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: Three True Outcomes
ax1 = axes[0]
x = np.arange(len(df_leagues))
width = 0.25

ax1.bar(x - width, df_leagues['avg_k_rate'], width, label='K%', alpha=0.8)
ax1.bar(x, df_leagues['avg_bb_rate'], width, label='BB%', alpha=0.8)
ax1.bar(x + width, df_leagues['hr_per_game']*2, width, label='HR/G (scaled)', alpha=0.8)

ax1.set_xlabel('League')
ax1.set_ylabel('Percentage / Rate')
ax1.set_title('Three True Outcomes Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(df_leagues['league'], rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Plot 2: Run Environment Index
ax2 = axes[1]
colors = ['#d62728' if rei > 100 else '#1f77b4' for rei in df_leagues['rei']]
ax2.barh(df_leagues['league'], df_leagues['rei'], color=colors, alpha=0.7)
ax2.axvline(x=100, color='black', linestyle='--', linewidth=2, label='MLB Baseline')
ax2.set_xlabel('Run Environment Index (MLB = 100)')
ax2.set_title('League Run Scoring Environment')
ax2.legend()
ax2.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig('international_league_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nLeague Statistics Summary:")
print(df_leagues[['league', 'avg_ba', 'avg_obp', 'avg_slg', 'rei']])

Talent Flow Patterns

International player movement to MLB has increased dramatically over the past three decades:

NPB to MLB: Peak years 2012-2023 saw 50+ active NPB imports
KBO to MLB: Increased after 2015, with notable success stories
Cuban defectors: Major impact players since 1990s
Latin American academies: Primary pipeline for teams

Understanding these patterns helps teams identify market inefficiencies and projection opportunities.

23.2 NPB Analytics & Translation to MLB

Nippon Professional Baseball represents the highest level of competition outside MLB. Successful translation of NPB performance to MLB projections requires sophisticated analytical approaches accounting for league differences.

Historical NPB-to-MLB Performance

Notable successful NPB imports include:

Shohei Ohtani (Nippon-Ham Fighters): Two-way superstar
Yoshinobu Yamamoto (Orix Buffaloes): Elite pitcher
Masataka Yoshida (Orix Buffaloes): Contact-oriented outfielder
Seiya Suzuki (Hiroshima Carp): Five-tool outfielder
Yu Darvish (Nippon-Ham Fighters): Ace pitcher
Shota Imanaga (DeNA BayStars): Left-handed starter

Translation Factors

Research suggests several key adjustments when projecting NPB stats to MLB:

Offensive translation: Multiply NPB performance by 0.75-0.85 factor
Power adjustment: HR totals typically translate at 0.70-0.80 rate
Contact skills: High-contact NPB hitters maintain skills better
Pitching velocity: Add 1-2 mph to NPB readings due to measurement differences
Age consideration: Younger NPB players (under 27) translate better
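The factor ranges above can be turned into a quick low/high band before any model is fitted. This is a minimal sketch, assuming the offensive factor is applied to OPS (the list does not say which statistic it targets) and using a hypothetical NPB season; `translate_range` is a name invented for this example.

```python
# Factor ranges quoted in the list above, as (low, high)
OFFENSE_RANGE = (0.75, 0.85)
POWER_RANGE = (0.70, 0.80)


def translate_range(value, factor_range):
    """Return the (low, high) MLB-equivalent band for an NPB statistic."""
    low, high = factor_range
    return value * low, value * high


# Hypothetical NPB season (assumption): .900 OPS and 30 HR
ops_low, ops_high = translate_range(0.900, OFFENSE_RANGE)
hr_low, hr_high = translate_range(30, POWER_RANGE)

print(f"Projected OPS band: {ops_low:.3f} to {ops_high:.3f}")
print(f"Projected HR band: {hr_low:.0f} to {hr_high:.0f}")
```

Reporting a band rather than a point estimate reflects how wide the quoted ranges are; the width alone is worth roughly 90 points of OPS for this hitter.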

Let's build a translation model using historical data:

# R: NPB to MLB translation model
library(tidyverse)
library(broom)

# Historical NPB-to-MLB hitter transitions (final NPB season vs first 2 MLB years)
# Note: Kenta Maeda is a pitcher, so his 82-PA batting line is a very small sample
npb_mlb_hitters <- data.frame(
  player = c("Shohei Ohtani", "Masataka Yoshida", "Seiya Suzuki",
             "Kenta Maeda", "Shogo Akiyama", "Yoshitomo Tsutsugo",
             "Kosuke Fukudome", "Akinori Iwamura", "Norichika Aoki"),
  age_mlb_debut = c(23, 29, 27, 24, 31, 28, 30, 27, 30),
  npb_last_pa = c(382, 543, 516, 82, 575, 559, 606, 589, 621),
  npb_last_avg = c(.286, .335, .315, .235, .301, .272, .344, .311, .292),
  npb_last_obp = c(.358, .421, .418, .328, .376, .348, .453, .383, .348),
  npb_last_slg = c(.500, .505, .537, .353, .454, .475, .628, .495, .449),
  npb_last_hr = c(22, 21, 38, 1, 20, 29, 31, 23, 20),
  mlb_first2_pa = c(870, 520, 798, 130, 453, 626, 1089, 1055, 1142),
  mlb_first2_avg = c(.272, .280, .241, .188, .245, .197, .257, .275, .283),
  mlb_first2_obp = c(.356, .346, .331, .270, .320, .314, .359, .352, .346),
  mlb_first2_slg = c(.519, .430, .412, .313, .344, .343, .433, .418, .396),
  mlb_first2_hr = c(40, 15, 28, 1, 7, 16, 30, 24, 18)
)

# Calculate translation ratios
npb_mlb_hitters <- npb_mlb_hitters %>%
  mutate(
    avg_ratio = mlb_first2_avg / npb_last_avg,
    obp_ratio = mlb_first2_obp / npb_last_obp,
    slg_ratio = mlb_first2_slg / npb_last_slg,
    hr_ratio = (mlb_first2_hr / mlb_first2_pa * 600) / (npb_last_hr / npb_last_pa * 600),
    age_group = ifelse(age_mlb_debut <= 26, "Young", "Veteran")
  )

# Summary statistics
translation_summary <- npb_mlb_hitters %>%
  group_by(age_group) %>%
  summarise(
    n = n(),
    avg_translation = mean(avg_ratio),
    obp_translation = mean(obp_ratio),
    slg_translation = mean(slg_ratio),
    hr_translation = mean(hr_ratio),
    avg_sd = sd(avg_ratio),
    slg_sd = sd(slg_ratio)
  )

print(translation_summary)

# Build regression model for SLG translation
slg_model <- lm(slg_ratio ~ age_mlb_debut + npb_last_slg + npb_last_hr,
                data = npb_mlb_hitters)

summary(slg_model)

# Visualization
ggplot(npb_mlb_hitters, aes(x = npb_last_slg, y = mlb_first2_slg)) +
  geom_point(aes(color = age_group, size = npb_last_pa), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
  geom_abline(intercept = 0, slope = 0.85, color = "red", linetype = "dashed") +
  labs(title = "NPB to MLB Slugging Translation",
       subtitle = "Red line = 0.85x translation factor",
       x = "Final NPB Season SLG",
       y = "First 2 MLB Seasons Average SLG",
       color = "Age Group",
       size = "NPB PA") +
  theme_minimal() +
  annotate("text", x = 0.55, y = 0.35,
           label = "Most players fall\nbelow 1:1 line",
           hjust = 0, color = "darkred")

# Function to project NPB hitter to MLB
project_npb_hitter <- function(age, npb_avg, npb_obp, npb_slg, npb_hr, npb_pa = 600) {
  # Age adjustment factor
  age_factor <- ifelse(age <= 26, 0.85, 0.78)

  # Base translations
  mlb_avg <- npb_avg * age_factor
  mlb_obp <- npb_obp * (age_factor + 0.03)  # OBP translates slightly better
  mlb_slg <- npb_slg * (age_factor - 0.05)  # Power translates worse

  # HR projection
  mlb_hr <- (npb_hr / npb_pa * 600) * (age_factor - 0.10) * 0.75

  return(data.frame(
    proj_avg = round(mlb_avg, 3),
    proj_obp = round(mlb_obp, 3),
    proj_slg = round(mlb_slg, 3),
    proj_hr_per_600pa = round(mlb_hr, 1),
    age_factor = age_factor
  ))
}

# Example: Project a hypothetical NPB star
cat("\nProjection for 25-year-old NPB star (.310/.390/.550, 35 HR):\n")
print(project_npb_hitter(age = 25, npb_avg = .310, npb_obp = .390,
                         npb_slg = .550, npb_hr = 35))

# Python: Advanced NPB pitching translation model
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Historical NPB-to-MLB pitcher transitions
npb_mlb_pitchers = pd.DataFrame({
    'player': ['Yu Darvish', 'Masahiro Tanaka', 'Kenta Maeda',
               'Yoshinobu Yamamoto', 'Shota Imanaga', 'Kodai Senga',
               'Yusei Kikuchi', 'Tomoyuki Sugano', 'Shohei Ohtani'],
    'age_mlb_debut': [25, 25, 27, 25, 30, 30, 26, 31, 23],
    'npb_last_era': [1.44, 1.27, 2.10, 1.21, 2.66, 1.94, 3.08, 1.97, 2.52],
    'npb_last_whip': [0.82, 0.94, 1.03, 0.88, 1.02, 0.98, 1.20, 1.00, 1.09],
    'npb_last_k9': [11.2, 10.8, 9.5, 11.7, 9.8, 10.5, 9.2, 8.9, 11.7],
    'npb_last_bb9': [2.1, 2.3, 2.8, 1.8, 2.4, 2.6, 3.2, 1.9, 3.2],
    'npb_last_ip': [232, 212, 175, 164, 148, 144, 163, 137, 155],
    'mlb_first2_era': [3.38, 3.18, 3.73, 2.92, 2.91, 3.68, 4.50, np.nan, 2.86],
    'mlb_first2_whip': [1.12, 1.11, 1.16, 1.04, 1.09, 1.22, 1.32, np.nan, 1.16],
    'mlb_first2_k9': [10.8, 9.2, 9.8, 11.2, 10.2, 10.8, 9.5, np.nan, 11.9],
    'mlb_first2_bb9': [2.8, 2.1, 2.5, 2.4, 2.6, 3.2, 3.8, np.nan, 3.5],
    'mlb_first2_ip': [363, 314, 258, 189, 173, 166, 318, np.nan, 285]
})

# Remove players with insufficient MLB data
npb_mlb_pitchers = npb_mlb_pitchers.dropna()

# Calculate translation factors
npb_mlb_pitchers['era_ratio'] = npb_mlb_pitchers['mlb_first2_era'] / npb_mlb_pitchers['npb_last_era']
npb_mlb_pitchers['k9_ratio'] = npb_mlb_pitchers['mlb_first2_k9'] / npb_mlb_pitchers['npb_last_k9']
npb_mlb_pitchers['bb9_ratio'] = npb_mlb_pitchers['mlb_first2_bb9'] / npb_mlb_pitchers['npb_last_bb9']

print("NPB to MLB Pitcher Translation Factors:")
print(f"ERA multiplier: {npb_mlb_pitchers['era_ratio'].mean():.3f} (±{npb_mlb_pitchers['era_ratio'].std():.3f})")
print(f"K/9 retention: {npb_mlb_pitchers['k9_ratio'].mean():.3f} (±{npb_mlb_pitchers['k9_ratio'].std():.3f})")
print(f"BB/9 change: {npb_mlb_pitchers['bb9_ratio'].mean():.3f} (±{npb_mlb_pitchers['bb9_ratio'].std():.3f})")

# Build FIP-based translation model
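def calculate_fip(era, k9, bb9, hr9=1.0):
    """Estimate Fielding Independent Pitching from per-nine rates.

    `era` is accepted for call-site symmetry but is not used: FIP depends only
    on home runs, walks, and strikeouts. The 3.2 constant approximates the
    league FIP constant, which varies slightly by season.
    """
    return ((13 * hr9) + (3 * bb9) - (2 * k9)) / 9 + 3.2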

npb_mlb_pitchers['npb_fip'] = calculate_fip(
    npb_mlb_pitchers['npb_last_era'],
    npb_mlb_pitchers['npb_last_k9'],
    npb_mlb_pitchers['npb_last_bb9'],
    hr9=0.8  # Estimated NPB HR/9
)

npb_mlb_pitchers['mlb_fip'] = calculate_fip(
    npb_mlb_pitchers['mlb_first2_era'],
    npb_mlb_pitchers['mlb_first2_k9'],
    npb_mlb_pitchers['mlb_first2_bb9'],
    hr9=1.1  # Estimated MLB HR/9
)

# Regression model for FIP translation
X = npb_mlb_pitchers[['age_mlb_debut', 'npb_fip', 'npb_last_k9']].values
y = npb_mlb_pitchers['mlb_fip'].values

model = LinearRegression()
model.fit(X, y)

print(f"\nFIP Translation Model:")
print(f"R² Score: {model.score(X, y):.3f}")
print(f"Coefficients: Age={model.coef_[0]:.3f}, NPB_FIP={model.coef_[1]:.3f}, K9={model.coef_[2]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plot 1: ERA Translation
ax1 = axes[0]
ax1.scatter(npb_mlb_pitchers['npb_last_era'],
           npb_mlb_pitchers['mlb_first2_era'],
           s=100, alpha=0.6, c=npb_mlb_pitchers['age_mlb_debut'],
           cmap='viridis')
ax1.plot([0, 4], [0, 4], 'k--', alpha=0.3, label='1:1 line')
ax1.plot([0, 4], [0, 8], 'r--', alpha=0.5, label='2x ERA line')
ax1.set_xlabel('NPB Final Season ERA')
ax1.set_ylabel('MLB First 2 Seasons ERA')
ax1.set_title('NPB to MLB ERA Translation')
ax1.legend()
ax1.grid(alpha=0.3)

# Add player labels
for idx, row in npb_mlb_pitchers.iterrows():
    ax1.annotate(row['player'].split()[-1],
                (row['npb_last_era'], row['mlb_first2_era']),
                fontsize=8, alpha=0.7)

# Plot 2: K/9 Retention
ax2 = axes[1]
ax2.scatter(npb_mlb_pitchers['npb_last_k9'],
           npb_mlb_pitchers['mlb_first2_k9'],
           s=100, alpha=0.6, c=npb_mlb_pitchers['age_mlb_debut'],
           cmap='viridis')
ax2.plot([7, 13], [7, 13], 'k--', alpha=0.3, label='1:1 line')
ax2.set_xlabel('NPB Final Season K/9')
ax2.set_ylabel('MLB First 2 Seasons K/9')
ax2.set_title('Strikeout Rate Translation')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('npb_mlb_pitcher_translation.png', dpi=300, bbox_inches='tight')
plt.show()

# Projection function
def project_npb_pitcher(age, npb_era, npb_k9, npb_bb9, npb_ip):
    """Project NPB pitcher stats to MLB"""

    # Age-based adjustment
    age_factor = max(0.7, 1.0 - (age - 25) * 0.03)

    # League difficulty adjustment (signs chosen so each line matches its comment)
    era_multiplier = 1.8 - (npb_era - 2.0) * 0.5  # Better (lower) NPB ERA = bigger jump
    k9_retention = 0.92 - (min(age, 27) - 25) * 0.02  # Younger = better retention
    bb9_increase = 1.25 - (npb_k9 - 9.0) * 0.03  # Better K rate = smaller walk increase

    # Projections
    mlb_era = npb_era * era_multiplier * (1 / age_factor)
    mlb_k9 = npb_k9 * k9_retention
    mlb_bb9 = npb_bb9 * bb9_increase
    mlb_fip = calculate_fip(mlb_era, mlb_k9, mlb_bb9, hr9=1.1)

    return {
        'projected_ERA': round(mlb_era, 2),
        'projected_K9': round(mlb_k9, 1),
        'projected_BB9': round(mlb_bb9, 1),
        'projected_FIP': round(mlb_fip, 2),
        'age_factor': round(age_factor, 3),
        'confidence': 'High' if age <= 27 and npb_ip > 150 else 'Medium'
    }

# Example projection
print("\n" + "="*60)
print("Example: 26-year-old NPB ace (1.85 ERA, 11.5 K/9, 2.0 BB/9, 180 IP)")
print("="*60)
projection = project_npb_pitcher(26, 1.85, 11.5, 2.0, 180)
for key, value in projection.items():
    print(f"{key}: {value}")

Case Study: Shohei Ohtani's Two-Way Translation

Ohtani's unprecedented two-way performance presents unique analytical challenges. His 2016 NPB season, his last full one before he joined MLB in 2018:

Hitting: .322/.416/.588, 22 HR in 382 PA
Pitching: 1.86 ERA, 174 K in 140 IP, 10-4 record

His MLB transition exceeded most projections, demonstrating the importance of age, athleticism, and elite raw tools in translation models.
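The rate statistics implied by that 2016 line can be worked out directly from the figures quoted above. The sketch below only rearranges those numbers; the variable names are chosen for this example.

```python
# Ohtani's 2016 NPB line as quoted above
avg, obp, slg = 0.322, 0.416, 0.588
hr, pa = 22, 382
strikeouts, innings = 174, 140

iso = slg - avg                      # isolated power
ops = obp + slg                      # on-base plus slugging
hr_per_600 = hr / pa * 600           # home runs per 600 plate appearances
k_per_9 = strikeouts / innings * 9   # strikeouts per nine innings

print(f"ISO: {iso:.3f}")
print(f"OPS: {ops:.3f}")
print(f"HR per 600 PA: {hr_per_600:.1f}")
print(f"K/9: {k_per_9:.1f}")
```

Putting both sides of his game on a rate basis matters here because his playing time as a hitter and as a pitcher was each well short of a full-time workload.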

23.3 KBO Analytics & Notable Imports

The Korea Baseball Organization has emerged as an increasingly important source of MLB talent, particularly following the success of players like Jung Ho Kang (2015) and more recently, Ha-Seong Kim and others.

KBO League Characteristics

Key differences from MLB:

Higher offensive environment: League BA typically .260-.270 vs MLB .240-.250
Smaller parks: Many stadiums favor hitters
Different ball: KBO ball historically had lower seams
Designated hitter: Used league-wide in the KBO; MLB has used it in both leagues only since 2022, when the NL adopted it permanently
Foreign player limit: Maximum 3 per team affects competition
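Because the KBO run environment is more hitter-friendly, raw averages overstate a hitter relative to MLB even before competition level is considered. The sketch below is a minimal illustration of a league-relative index, assuming the midpoints of the batting-average ranges quoted above as league baselines and a hypothetical .320 hitter; the function names are invented for this example. It adjusts for environment only and is not a talent translation.

```python
# League baselines: midpoints of the ranges quoted above (assumption)
KBO_LEAGUE_BA = 0.265
MLB_LEAGUE_BA = 0.245


def relative_index(player_value, league_value):
    """Index a player's rate against the league average (100 = league average)."""
    return 100 * player_value / league_value


def environment_neutral(player_value, from_league_value, to_league_value):
    """Re-express a rate in another league's environment, holding the index fixed."""
    return player_value / from_league_value * to_league_value


# Hypothetical KBO hitter (assumption): .320 batting average
index = relative_index(0.320, KBO_LEAGUE_BA)
mlb_env_ba = environment_neutral(0.320, KBO_LEAGUE_BA, MLB_LEAGUE_BA)

print(f"BA index vs KBO average: {index:.1f}")
print(f"Same index in an MLB environment: {mlb_env_ba:.3f}")
```

Any competition-level discount, such as the empirical factors estimated in the translation code, would be applied on top of this environment adjustment.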

Translation Methodology

KBO translation requires different adjustments than NPB:

# R: KBO to MLB translation analysis
library(tidyverse)
library(ggplot2)

# KBO-to-MLB position player transitions
kbo_mlb_data <- data.frame(
  player = c("Jung Ho Kang", "Ha-Seong Kim", "Hyun-soo Kim",
             "Dae-ho Lee", "Tommy Joseph", "Eric Thames"),
  age_mlb = c(28, 25, 28, 34, 24, 30),
  kbo_final_avg = c(.356, .306, .318, .288, .263, .381),
  kbo_final_obp = c(.459, .397, .406, .366, .333, .497),
  kbo_final_slg = c(.739, .523, .488, .488, .470, .790),
  kbo_final_hr = c(40, 11, 11, 17, 21, 47),
  kbo_final_pa = c(564, 587, 621, 550, 587, 575),
  mlb_avg = c(.255, .242, .229, .253, .235, .247),
  mlb_obp = c(.354, .326, .299, .317, .286, .359),
  mlb_slg = c(.461, .376, .340, .413, .402, .518),
  mlb_hr_per_600 = c(27, 13, 8, 22, 20, 35),
  mlb_pa_total = c(1248, 1456, 460, 531, 712, 983)
)

# Calculate translation factors
kbo_mlb_data <- kbo_mlb_data %>%
  mutate(
    avg_translation = mlb_avg / kbo_final_avg,
    obp_translation = mlb_obp / kbo_final_obp,
    slg_translation = mlb_slg / kbo_final_slg,
    iso_kbo = kbo_final_slg - kbo_final_avg,
    iso_mlb = mlb_slg - mlb_avg,
    iso_translation = iso_mlb / iso_kbo,
    power_class = case_when(
      kbo_final_hr >= 30 ~ "Elite Power",
      kbo_final_hr >= 20 ~ "Above Average",
      TRUE ~ "Average"
    )
  )

# Summary by power class
power_translation <- kbo_mlb_data %>%
  group_by(power_class) %>%
  summarise(
    n = n(),
    avg_trans = mean(avg_translation),
    slg_trans = mean(slg_translation),
    iso_trans = mean(iso_translation)
  )

print("KBO Translation Factors by Power Level:")
print(power_translation)

# Overall translation factors
cat("\nOverall KBO to MLB Translation:\n")
cat(sprintf("AVG: %.3f (multiply KBO avg by this)\n",
            mean(kbo_mlb_data$avg_translation)))
cat(sprintf("OBP: %.3f\n", mean(kbo_mlb_data$obp_translation)))
cat(sprintf("SLG: %.3f\n", mean(kbo_mlb_data$slg_translation)))
cat(sprintf("ISO: %.3f\n", mean(kbo_mlb_data$iso_translation)))

# Visualization: Power translation
ggplot(kbo_mlb_data, aes(x = kbo_final_hr, y = mlb_hr_per_600)) +
  geom_point(aes(size = mlb_pa_total, color = age_mlb), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
  geom_abline(slope = 0.7, intercept = 0, linetype = "dashed", color = "red") +
  scale_color_gradient(low = "green", high = "orange") +
  labs(title = "KBO to MLB Home Run Translation",
       subtitle = "Red dashed line = 0.70x translation",
       x = "KBO Final Season HR",
       y = "MLB HR per 600 PA",
       size = "MLB PA",
       color = "Age at MLB Debut") +
  theme_minimal() +
  geom_text(aes(label = player), hjust = -0.1, vjust = -0.5, size = 3)

# Advanced projection system
kbo_projection_model <- lm(mlb_slg ~ kbo_final_slg + age_mlb + I(iso_kbo),
                           data = kbo_mlb_data)

cat("\n\nKBO SLG Translation Model:\n")
print(summary(kbo_projection_model))

# Create comprehensive projection function
project_kbo_player <- function(age, kbo_avg, kbo_obp, kbo_slg, kbo_hr,
                               kbo_pa, defensive_value = 0) {
  # Base translation factors (from empirical data)
  avg_factor <- 0.75
  obp_factor <- 0.82
  slg_factor <- 0.68

  # Age adjustment (peak = 26)
  age_adj <- 1 - abs(age - 26) * 0.015
  age_adj <- max(0.85, min(1.05, age_adj))

  # Sample size adjustment
  pa_confidence <- min(1, kbo_pa / 500)

  # Calculate projections
  proj_avg <- kbo_avg * avg_factor * age_adj
  proj_obp <- kbo_obp * obp_factor * age_adj
  proj_slg <- kbo_slg * slg_factor * age_adj

  # Power metrics
  kbo_iso <- kbo_slg - kbo_avg
  proj_iso <- kbo_iso * slg_factor * age_adj
  proj_hr_per_600 <- (kbo_hr / kbo_pa * 600) * slg_factor * age_adj

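  # wOBA projection (rough: approximate linear weights, rates treated as per-PA)
  woba_scale <- 1.15
  bb_rate <- proj_obp - proj_avg                   # walks (and HBP) per PA
  hr_rate <- proj_hr_per_600 / 600
  xb_rate <- max(0, proj_iso - 3 * hr_rate)        # non-HR extra bases, counted as doubles
  single_rate <- max(0, proj_avg - xb_rate - hr_rate)
  proj_woba <- 0.69 * bb_rate + 0.88 * single_rate +
    1.25 * xb_rate + 2.0 * hr_rate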

  # WAR estimation (very rough): runs above average at ~10 runs per win;
  # defensive_value is already in runs, so it is added directly
  batting_runs <- (proj_woba - 0.320) / 1.15 * 600 * 0.9
  war_estimate <- (batting_runs + defensive_value) / 10

  return(data.frame(
    projected_AVG = round(proj_avg, 3),
    projected_OBP = round(proj_obp, 3),
    projected_SLG = round(proj_slg, 3),
    projected_ISO = round(proj_iso, 3),
    projected_HR_600PA = round(proj_hr_per_600, 1),
    projected_wOBA = round(proj_woba, 3),
    estimated_WAR = round(war_estimate, 1),
    confidence = round(pa_confidence * 100, 0)
  ))
}

# Example projection
cat("\n\nExample KBO Star Projection (26 years old):\n")
cat("KBO Stats: .320/.400/.580, 35 HR in 600 PA\n")
cat("Defensive Value: +5 runs\n\n")
print(project_kbo_player(26, .320, .400, .580, 35, 600, 5))

# Python: KBO pitcher analysis and projection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# KBO pitcher MLB transitions
kbo_pitchers = pd.DataFrame({
    'player': ['Hyun-Jin Ryu', 'Kwang-Hyun Kim', 'Ha-Seong Kim',
               'Chan Ho Park', 'Jung Ho Kang'],
    'type': ['SP', 'SP', 'Position', 'SP', 'Position'],  # position players are filtered out below
    'age_mlb': [26, 32, 25, 25, 28],
    'kbo_era': [2.80, 3.13, np.nan, 3.64, np.nan],
    'kbo_k9': [7.8, 7.2, np.nan, 8.1, np.nan],
    'kbo_bb9': [2.1, 2.5, np.nan, 3.8, np.nan],
    'kbo_hr9': [0.6, 0.7, np.nan, 0.8, np.nan],
    'mlb_era': [3.17, 3.62, np.nan, 4.36, np.nan],
    'mlb_k9': [8.5, 7.8, np.nan, 7.2, np.nan],
    'mlb_bb9': [1.8, 2.4, np.nan, 3.9, np.nan],
    'mlb_hr9': [0.9, 1.1, np.nan, 1.2, np.nan]
})

# Filter out position players (copy so new columns can be added safely)
kbo_pitchers = kbo_pitchers[kbo_pitchers['type'].isin(['SP', 'RP'])].dropna().copy()

# Calculate changes (MLB minus KBO)
kbo_pitchers['era_change'] = kbo_pitchers['mlb_era'] - kbo_pitchers['kbo_era']
kbo_pitchers['k9_change'] = kbo_pitchers['mlb_k9'] - kbo_pitchers['kbo_k9']
kbo_pitchers['hr9_change'] = kbo_pitchers['mlb_hr9'] - kbo_pitchers['kbo_hr9']

print("KBO to MLB Pitcher Changes:")
print(f"Average ERA change: {kbo_pitchers['era_change'].mean():+.2f}")
print(f"Average K/9 change: {kbo_pitchers['k9_change'].mean():+.1f}")
print(f"Average HR/9 change: {kbo_pitchers['hr9_change'].mean():+.2f}")

# More comprehensive dataset with additional metrics
kbo_detailed = pd.DataFrame({
    'season': [2012, 2019, 2013, 2016, 2020],
    'pitcher': ['Ryu', 'Kim KH', 'Park', 'Oh', 'Other'],
    'kbo_whip': [1.08, 1.24, 1.35, 1.15, 1.20],
    'kbo_babip': [.275, .290, .295, .280, .285],
    'kbo_lob_pct': [75.2, 73.1, 70.5, 74.8, 72.0],
    'mlb_whip': [1.22, 1.28, 1.45, np.nan, np.nan],
    'mlb_babip': [.285, .295, .305, np.nan, np.nan],
    'mlb_lob_pct': [73.5, 71.8, 68.2, np.nan, np.nan]
})

# Advanced projection system for KBO pitchers
class KBOPitcherProjector:
    def __init__(self):
        # Empirically derived translation factors
        self.era_multiplier = 1.15  # KBO ERA typically increases 15%
        self.k9_retention = 0.98    # K rate mostly maintained
        self.bb9_multiplier = 1.05  # Slight walk increase
        self.hr9_multiplier = 1.45  # HR rate increases significantly

    def project_era_fip(self, kbo_era, kbo_k9, kbo_bb9, kbo_hr9, age):
        """Project FIP-based ERA for MLB"""
        # Age factor (peak at 27)
        age_factor = 1 + abs(age - 27) * 0.02

        # Component projections
        mlb_k9 = kbo_k9 * self.k9_retention
        mlb_bb9 = kbo_bb9 * self.bb9_multiplier
        mlb_hr9 = kbo_hr9 * self.hr9_multiplier

        # Calculate FIP from per-9 rates (divide by 9 to get the per-inning form)
        mlb_fip = ((13 * mlb_hr9) + (3 * mlb_bb9) - (2 * mlb_k9)) / 9 + 3.2

        # ERA projection (FIP + league/age adjustment)
        mlb_era = mlb_fip * age_factor * 0.98

        return {
            'projected_ERA': round(mlb_era, 2),
            'projected_FIP': round(mlb_fip, 2),
            'projected_K9': round(mlb_k9, 1),
            'projected_BB9': round(mlb_bb9, 1),
            'projected_HR9': round(mlb_hr9, 2),
            'projected_WHIP': round((mlb_bb9 + (9 - mlb_k9) * 0.3) / 9 + 0.95, 2)
        }

    def confidence_interval(self, projection, sample_size_ip):
        """Calculate confidence intervals based on sample size"""
        # Standard error decreases with more IP
        se_factor = max(0.3, 1 / np.sqrt(sample_size_ip / 100))

        era_se = projection['projected_ERA'] * se_factor * 0.15

        return {
            'ERA_lower': round(projection['projected_ERA'] - 1.96 * era_se, 2),
            'ERA_upper': round(projection['projected_ERA'] + 1.96 * era_se, 2),
            'confidence_level': 0.95
        }

# Example usage
projector = KBOPitcherProjector()

print("\n" + "="*70)
print("KBO Pitcher Projection Example:")
print("="*70)
print("KBO Stats: 2.85 ERA, 9.2 K/9, 2.3 BB/9, 0.65 HR/9")
print("Age: 27, IP: 180")
print("-"*70)

projection = projector.project_era_fip(2.85, 9.2, 2.3, 0.65, 27)
for key, value in projection.items():
    print(f"{key}: {value}")

print("\n95% Confidence Interval:")
ci = projector.confidence_interval(projection, 180)
print(f"ERA Range: {ci['ERA_lower']} - {ci['ERA_upper']}")

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

# Simulate multiple pitcher projections
ages = np.array([24, 25, 26, 27, 28, 29, 30, 31, 32])
kbo_eras = np.array([2.5, 2.7, 2.6, 2.8, 2.9, 3.0, 2.8, 3.1, 3.2])
projected_eras = []

for age, kbo_era in zip(ages, kbo_eras):
    proj = projector.project_era_fip(kbo_era, 9.0, 2.5, 0.7, age)
    projected_eras.append(proj['projected_ERA'])

ax.scatter(ages, kbo_eras, s=100, alpha=0.6, label='KBO ERA', color='blue')
ax.scatter(ages, projected_eras, s=100, alpha=0.6, label='Projected MLB ERA', color='red')
ax.plot(ages, projected_eras, '--', alpha=0.3, color='red')

ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('ERA', fontsize=12)
ax.set_title('KBO to MLB ERA Projection by Age', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
ax.set_ylim(2.0, 4.5)

plt.tight_layout()
plt.savefig('kbo_pitcher_age_curve.png', dpi=300, bbox_inches='tight')
plt.show()

Notable KBO Success Stories

Ha-Seong Kim (2021- ): Signed with San Diego Padres

KBO Final: .306/.397/.523, 11 HR

MLB Performance: Solid utility infielder, excellent defense

Key: Elite athleticism and defensive versatility

Hyun-Jin Ryu (2013- ): Dodgers, Blue Jays

KBO Final: 2.80 ERA, 210 IP

MLB Peak: 2.32 ERA (2019), All-Star

Key: Command, deception, and preparation

23.4 Cuban & Latin American Player Evaluation

Cuban and Latin American players present unique analytical challenges due to limited data availability, varying competition levels, and diverse development paths.

Data Availability Challenges

Unlike NPB and KBO, systematic statistical data from Cuban leagues and Latin American summer leagues is often:

Incomplete: Missing advanced metrics
Inconsistent: Different tracking standards
Limited access: Restricted availability
Variable quality: Competition levels vary widely

Evaluation Frameworks

Teams rely heavily on:

Showcase performances: International tournaments
Workout metrics: Measurables (velocity, exit velo, sprint speed)
Video analysis: Manual tracking of mechanics and approach
Historical comps: Similar player paths
Age verification: Critical for projections
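
Workout measurables such as these are often folded into the same 20-80 scale used for the tool grades later in this section. The sketch below shows one way that conversion could work, assuming the common convention that 50 is average and each 10 points is one standard deviation. The reference mean and standard deviation are placeholder values chosen for illustration, not measured norms, and `to_scouting_grade` is a hypothetical helper.

```python
def to_scouting_grade(value, ref_mean, ref_sd, higher_is_better=True):
    """Map a raw measurable onto the 20-80 scale (50 = average, 10 = one SD)."""
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    z = (value - ref_mean) / ref_sd
    if not higher_is_better:  # e.g. sprint times, where lower is better
        z = -z
    grade = 50 + 10 * z
    grade = max(20.0, min(80.0, grade))  # clamp to the ends of the scale
    return 5 * round(grade / 5)          # grades are commonly reported in 5s


# Placeholder reference population (illustrative only): showcase exit velocity
REF_MEAN_MPH = 104.0
REF_SD_MPH = 3.0

for velo in (101, 104, 108, 110):
    grade = to_scouting_grade(velo, REF_MEAN_MPH, REF_SD_MPH)
    print(f"{velo} mph -> grade {grade}")
```

Putting measurables on a shared scale lets them sit alongside scout-assigned grades in a single model, though the result is only as good as the reference population behind the mean and standard deviation.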

# R: Cuban/Latin American player projection framework
library(tidyverse)
library(ggplot2)

# Historical Cuban defector performance data
cuban_players <- data.frame(
  player = c("Yasiel Puig", "Yoenis Cespedes", "Jorge Soler",
             "Jose Abreu", "Aroldis Chapman", "Luis Robert",
             "Randy Arozarena", "Yordan Alvarez"),
  position = c("OF", "OF", "OF", "1B", "P", "OF", "OF", "DH"),
  age_mlb_debut = c(22, 26, 21, 27, 22, 22, 25, 22),
  signing_bonus_m = c(42, 36, 30, 68, 30.25, 26, 1.25, 2),
  showcase_exit_velo = c(105, 108, 106, 109, NA, 107, 103, 110),
  first_year_war = c(4.3, 4.8, 1.2, 6.2, 1.5, 0.8, 1.4, 4.0),
  career_war_5yr = c(11.2, 13.5, 5.8, 17.3, 11.2, 5.5, 8.2, 12.5),
  hit_tool = c(55, 55, 50, 60, NA, 60, 60, 70),
  power_tool = c(65, 70, 65, 70, NA, 65, 60, 80),
  speed_tool = c(60, 60, 40, 30, NA, 70, 60, 30)
)

# Remove pitchers for hitting analysis
cuban_hitters <- cuban_players %>% filter(position != "P")

# Analysis: Exit velocity vs MLB success
cor_exit_war <- cor(cuban_hitters$showcase_exit_velo,
                     cuban_hitters$first_year_war,
                     use = "complete.obs")

cat(sprintf("Correlation between exit velo and Year 1 WAR: %.3f\n", cor_exit_war))

# Tool grades vs performance
tool_model <- lm(career_war_5yr ~ hit_tool + power_tool + speed_tool + age_mlb_debut,
                 data = cuban_hitters)

cat("\nScout Tool Grades Predicting 5-Year WAR:\n")
print(summary(tool_model))

# Visualization
ggplot(cuban_hitters, aes(x = showcase_exit_velo, y = career_war_5yr)) +
  geom_point(aes(size = signing_bonus_m, color = age_mlb_debut), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
  geom_text(aes(label = player), hjust = -0.1, size = 3) +
  scale_size_continuous(name = "Bonus ($M)") +
  scale_color_gradient(low = "green", high = "red", name = "Debut Age") +
  labs(title = "Cuban Defector Success: Exit Velocity vs Career Value",
       subtitle = "5-Year WAR as success metric",
       x = "Showcase Exit Velocity (mph)",
       y = "Career WAR (First 5 Years)") +
  theme_minimal()

# Dominican/Venezuelan academy graduates
latin_academy <- data.frame(
  player = c("Juan Soto", "Vladimir Guerrero Jr", "Fernando Tatis Jr",
             "Rafael Devers", "Wander Franco", "Julio Rodriguez"),
  country = c("DOM", "DOM", "DOM", "DOM", "DOM", "DOM"),
  age_mlb = c(19.5, 20, 20.5, 20.5, 20, 21),
  signing_bonus_k = c(1500, 3900, 700, 1500, 3825, 1750),
  war_age21_season = c(4.0, 2.3, 4.9, 3.3, 3.5, 4.2),
  war_thru_age_24 = c(19.8, 12.5, 15.2, 13.8, 11.2, 8.5),
  exit_velo_age20 = c(109, 112, 108, 107, 105, 108)
)

# Early performance indicators
early_success_model <- lm(war_thru_age_24 ~ age_mlb + log(signing_bonus_k) +
                          exit_velo_age20,
                          data = latin_academy)

cat("\n\nLatin American Academy Success Model:\n")
print(summary(early_success_model))

# Age at debut analysis
age_performance <- cuban_hitters %>%
  mutate(age_group = ifelse(age_mlb_debut <= 23, "Young (<24)", "Older (24+)")) %>%
  group_by(age_group) %>%
  summarise(
    n = n(),
    avg_first_yr_war = mean(first_year_war, na.rm = TRUE),
    avg_5yr_war = mean(career_war_5yr, na.rm = TRUE),
    avg_bonus = mean(signing_bonus_m, na.rm = TRUE)
  )

cat("\n\nPerformance by Age at Debut:\n")
print(age_performance)

# Create projection function for Cuban/Latin players
project_cuban_latin <- function(age, exit_velo, hit_grade, power_grade,
                                speed_grade, competition_level = "showcase") {

  # Base WAR from tools (scout grades on 20-80 scale)
  tool_war <- (hit_grade - 50) * 0.15 +
              (power_grade - 50) * 0.12 +
              (speed_grade - 50) * 0.08

  # Age adjustment (younger = higher ceiling)
  age_adj <- max(0.7, 1.3 - (age - 20) * 0.05)

  # Exit velocity component
  velo_war <- (exit_velo - 100) * 0.3

  # Competition adjustment
  comp_factor <- case_when(
    competition_level == "MLB" ~ 1.0,
    competition_level == "showcase" ~ 0.85,
    competition_level == "cuban_series" ~ 0.80,
    TRUE ~ 0.75
  )

  # First year projection
  year1_war <- (tool_war + velo_war) * age_adj * comp_factor

  # 5-year projection (with development curve)
  year5_war <- year1_war * 3.2  # Average multiplier from data

  # Confidence based on data availability
  confidence <- case_when(
    competition_level == "MLB" ~ "High",
    competition_level == "showcase" & !is.na(exit_velo) ~ "Medium",
    TRUE ~ "Low"
  )

  return(data.frame(
    projected_year1_WAR = round(year1_war, 1),
    projected_5yr_WAR = round(year5_war, 1),
    age_factor = round(age_adj, 2),
    confidence = confidence
  ))
}

# Example projections
cat("\n\nExample Projection 1: 20-year-old Cuban OF\n")
cat("Exit Velo: 108 mph, Hit: 60, Power: 70, Speed: 65\n")
print(project_cuban_latin(20, 108, 60, 70, 65, "showcase"))

cat("\n\nExample Projection 2: 26-year-old established Cuban star\n")
cat("Exit Velo: 110 mph, Hit: 65, Power: 75, Speed: 50\n")
print(project_cuban_latin(26, 110, 65, 75, 50, "cuban_series"))

# Python: Advanced Latin American player tracking and projection
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns

# Comprehensive Latin American dataset
latin_players = pd.DataFrame({
    'player': ['Juan Soto', 'Vlad Jr', 'Tatis Jr', 'Acuna Jr', 'Devers',
               'Wander Franco', 'Julio Rodriguez', 'Bobby Witt Jr'],
    'country': ['DOM', 'DOM', 'DOM', 'VEN', 'DOM', 'DOM', 'DOM', 'USA'],
    'signing_age': [16, 16, 16, 16, 16, 16, 16, 18],
    'signing_bonus': [1.5, 3.9, 0.7, 4.25, 1.5, 3.825, 1.75, 7.5],  # millions
    'mlb_debut_age': [19.5, 20, 20.5, 20.5, 20.5, 20, 21, 21.5],
    'hit_grade': [70, 60, 60, 65, 60, 70, 65, 60],
    'power_grade': [70, 70, 70, 70, 65, 55, 65, 70],
    'speed_grade': [50, 30, 70, 70, 40, 60, 70, 70],
    'arm_grade': [60, 50, 60, 70, 50, 55, 70, 60],
    'field_grade': [55, 40, 60, 70, 50, 60, 70, 70],
    'war_thru_age_23': [15.5, 9.2, 11.8, 13.5, 10.2, 8.5, 7.5, 4.2],
    'avg_exit_velo': [109.5, 112.1, 108.3, 109.8, 107.2, 105.1, 107.8, 108.9]
})

# Feature engineering
latin_players['total_tools'] = (latin_players['hit_grade'] +
                                latin_players['power_grade'] +
                                latin_players['speed_grade'] +
                                latin_players['arm_grade'] +
                                latin_players['field_grade'])

latin_players['years_to_mlb'] = (latin_players['mlb_debut_age'] -
                                 latin_players['signing_age'])

latin_players['bonus_per_year'] = (latin_players['signing_bonus'] /
                                   latin_players['years_to_mlb'])

# Machine learning model for WAR prediction
features = ['hit_grade', 'power_grade', 'speed_grade', 'mlb_debut_age',
            'avg_exit_velo', 'signing_bonus']
X = latin_players[features].values
y = latin_players['war_thru_age_23'].values

# Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42,
                                 max_depth=4, min_samples_split=2)
rf_model.fit(X, y)

# Feature importance
feature_importance = pd.DataFrame({
    'feature': features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance for WAR Prediction:")
print(feature_importance)

# Cross-validation (limited by small sample)
cv_scores = cross_val_score(rf_model, X, y, cv=3,
                            scoring='neg_mean_squared_error')
print(f"\nCross-Validation RMSE: {np.sqrt(-cv_scores.mean()):.2f} WAR")

# Visualization: Tool grades vs performance
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Hit tool vs WAR
ax1 = axes[0, 0]
ax1.scatter(latin_players['hit_grade'], latin_players['war_thru_age_23'],
           s=latin_players['signing_bonus']*20, alpha=0.6, c='blue')
ax1.set_xlabel('Hit Tool Grade (20-80)')
ax1.set_ylabel('WAR Through Age 23')
ax1.set_title('Hit Tool vs Early Career Success')
ax1.grid(alpha=0.3)

# Plot 2: Power tool vs WAR
ax2 = axes[0, 1]
ax2.scatter(latin_players['power_grade'], latin_players['war_thru_age_23'],
           s=latin_players['signing_bonus']*20, alpha=0.6, c='red')
ax2.set_xlabel('Power Tool Grade (20-80)')
ax2.set_ylabel('WAR Through Age 23')
ax2.set_title('Power Tool vs Early Career Success')
ax2.grid(alpha=0.3)

# Plot 3: Speed tool vs WAR
ax3 = axes[1, 0]
ax3.scatter(latin_players['speed_grade'], latin_players['war_thru_age_23'],
           s=latin_players['signing_bonus']*20, alpha=0.6, c='green')
ax3.set_xlabel('Speed Tool Grade (20-80)')
ax3.set_ylabel('WAR Through Age 23')
ax3.set_title('Speed Tool vs Early Career Success')
ax3.grid(alpha=0.3)

# Plot 4: Age at debut vs WAR
ax4 = axes[1, 1]
scatter = ax4.scatter(latin_players['mlb_debut_age'],
                      latin_players['war_thru_age_23'],
                      s=latin_players['signing_bonus']*20,
                      c=latin_players['total_tools'],
                      cmap='viridis', alpha=0.7)
ax4.set_xlabel('Age at MLB Debut')
ax4.set_ylabel('WAR Through Age 23')
ax4.set_title('Debut Age vs Success (color = total tools)')
ax4.grid(alpha=0.3)
plt.colorbar(scatter, ax=ax4, label='Total Tool Grade')

plt.tight_layout()
plt.savefig('latin_american_tool_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Advanced projection class
class LatinAmericanProjector:
    def __init__(self, model=None):
        self.model = model if model is not None else rf_model
        self.feature_names = features

    def project_war(self, hit, power, speed, debut_age, exit_velo, bonus):
        """Project WAR through age 23"""
        input_data = np.array([[hit, power, speed, debut_age, exit_velo, bonus]])
        prediction = self.model.predict(input_data)[0]

        # Approximate 95% range from the spread of individual tree predictions
        tree_predictions = [tree.predict(input_data)[0]
                          for tree in self.model.estimators_]
        std_dev = np.std(tree_predictions)

        return {
            'projected_WAR': round(prediction, 1),
            'lower_bound': round(prediction - 1.96 * std_dev, 1),
            'upper_bound': round(prediction + 1.96 * std_dev, 1),
            'std_dev': round(std_dev, 2)
        }

    def compare_prospects(self, prospects_df):
        """Compare multiple prospects"""
        results = []
        for idx, prospect in prospects_df.iterrows():
            proj = self.project_war(
                prospect['hit_grade'],
                prospect['power_grade'],
                prospect['speed_grade'],
                prospect['mlb_debut_age'],
                prospect['avg_exit_velo'],
                prospect['signing_bonus']
            )
            results.append({
                'player': prospect.get('player', f'Prospect_{idx}'),
                **proj
            })
        return pd.DataFrame(results).sort_values('projected_WAR', ascending=False)

# Example usage
projector = LatinAmericanProjector()

print("\n" + "="*70)
print("Example Prospect Projection:")
print("="*70)
print("Hit: 65, Power: 70, Speed: 60")
print("Debut Age: 20.5, Exit Velo: 108, Bonus: $3.5M")
print("-"*70)

projection = projector.project_war(65, 70, 60, 20.5, 108, 3.5)
for key, value in projection.items():
    print(f"{key}: {value}")

# Multiple prospect comparison
prospects = pd.DataFrame({
    'player': ['Prospect A', 'Prospect B', 'Prospect C'],
    'hit_grade': [70, 60, 65],
    'power_grade': [65, 75, 70],
    'speed_grade': [60, 50, 70],
    'mlb_debut_age': [20, 21, 19.5],
    'avg_exit_velo': [108, 111, 107],
    'signing_bonus': [4.0, 2.5, 5.0]
})

print("\n" + "="*70)
print("Prospect Comparison:")
print("="*70)
comparison = projector.compare_prospects(prospects)
print(comparison.to_string(index=False))

World Baseball Classic as Evaluation Tool

The World Baseball Classic provides valuable data for international player evaluation, featuring top competition in high-pressure situations.
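
Because a WBC run covers only a handful of games, raw tournament lines are noisy. One common way to handle this is to regress an observed rate toward a prior by blending in a fixed number of pseudo plate appearances, as in the minimal sketch below. The prior OBP (.320) and the prior weight (300 PA) are illustrative assumptions rather than fitted values, the tournament line is hypothetical, and `shrink_rate` is a hypothetical helper.

```python
def shrink_rate(successes, trials, prior_rate, prior_weight):
    """Regress an observed rate toward prior_rate.

    prior_weight is the number of pseudo-trials at the prior rate blended
    with the observed sample (larger values mean more shrinkage).
    """
    if trials < 0 or successes < 0 or successes > trials:
        raise ValueError("need 0 <= successes <= trials")
    if prior_weight <= 0:
        raise ValueError("prior_weight must be positive")
    return (successes + prior_rate * prior_weight) / (trials + prior_weight)


# Hypothetical tournament line: reached base 14 times in 30 PA
times_on_base, plate_appearances = 14, 30
observed_obp = times_on_base / plate_appearances
estimate = shrink_rate(times_on_base, plate_appearances,
                       prior_rate=0.320, prior_weight=300)

print(f"Observed OBP:  {observed_obp:.3f}")
print(f"Regressed OBP: {estimate:.3f}")
```

With so few plate appearances the estimate stays close to the prior, which is the intended behavior: a strong tournament should move an evaluation only modestly unless other evidence supports it.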

# R: Cuban/Latin American player projection framework
library(tidyverse)
library(ggplot2)

# Historical Cuban defector performance data
cuban_players <- data.frame(
  player = c("Yasiel Puig", "Yoenis Cespedes", "Jorge Soler",
             "Jose Abreu", "Aroldis Chapman", "Luis Robert",
             "Randy Arozarena", "Yordan Alvarez"),
  position = c("OF", "OF", "OF", "1B", "P", "OF", "OF", "DH"),
  age_mlb_debut = c(22, 26, 21, 27, 22, 22, 25, 22),
  signing_bonus_m = c(42, 36, 30, 68, 30.25, 26, 1.25, 2),
  showcase_exit_velo = c(105, 108, 106, 109, NA, 107, 103, 110),
  first_year_war = c(4.3, 4.8, 1.2, 6.2, 1.5, 0.8, 1.4, 4.0),
  career_war_5yr = c(11.2, 13.5, 5.8, 17.3, 11.2, 5.5, 8.2, 12.5),
  hit_tool = c(55, 55, 50, 60, NA, 60, 60, 70),
  power_tool = c(65, 70, 65, 70, NA, 65, 60, 80),
  speed_tool = c(60, 60, 40, 30, NA, 70, 60, 30)
)

# Remove pitchers for hitting analysis
cuban_hitters <- cuban_players %>% filter(position != "P")

# Analysis: Exit velocity vs MLB success
cor_exit_war <- cor(cuban_hitters$showcase_exit_velo,
                     cuban_hitters$first_year_war,
                     use = "complete.obs")

cat(sprintf("Correlation between exit velo and Year 1 WAR: %.3f\n", cor_exit_war))

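# Tool grades vs performance
# Caution: 7 hitters and 5 coefficients leave only 2 residual degrees of freedom,
# so treat the fitted values as illustrative rather than as stable estimates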
tool_model <- lm(career_war_5yr ~ hit_tool + power_tool + speed_tool + age_mlb_debut,
                 data = cuban_hitters)

cat("\nScout Tool Grades Predicting 5-Year WAR:\n")
print(summary(tool_model))

# Visualization
ggplot(cuban_hitters, aes(x = showcase_exit_velo, y = career_war_5yr)) +
  geom_point(aes(size = signing_bonus_m, color = age_mlb_debut), alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
  geom_text(aes(label = player), hjust = -0.1, size = 3) +
  scale_size_continuous(name = "Bonus ($M)") +
  scale_color_gradient(low = "green", high = "red", name = "Debut Age") +
  labs(title = "Cuban Defector Success: Exit Velocity vs Career Value",
       subtitle = "5-Year WAR as success metric",
       x = "Showcase Exit Velocity (mph)",
       y = "Career WAR (First 5 Years)") +
  theme_minimal()

# Dominican/Venezuelan academy graduates
latin_academy <- data.frame(
  player = c("Juan Soto", "Vladimir Guerrero Jr", "Fernando Tatis Jr",
             "Rafael Devers", "Wander Franco", "Julio Rodriguez"),
  country = c("DOM", "DOM", "DOM", "DOM", "DOM", "DOM"),
  age_mlb = c(19.5, 20, 20.5, 20.5, 20, 21),
  signing_bonus_k = c(1500, 3900, 700, 1500, 3825, 1750),
  war_age21_season = c(4.0, 2.3, 4.9, 3.3, 3.5, 4.2),
  war_thru_age_24 = c(19.8, 12.5, 15.2, 13.8, 11.2, 8.5),
  exit_velo_age20 = c(109, 112, 108, 107, 105, 108)
)

# Early performance indicators
# Caution: 6 players and 4 coefficients leave only 2 residual degrees of freedom
early_success_model <- lm(war_thru_age_24 ~ age_mlb + log(signing_bonus_k) +
                          exit_velo_age20,
                          data = latin_academy)

cat("\n\nLatin American Academy Success Model:\n")
print(summary(early_success_model))

# Age at debut analysis
age_performance <- cuban_hitters %>%
  mutate(age_group = ifelse(age_mlb_debut <= 23, "Young (<24)", "Older (24+)")) %>%
  group_by(age_group) %>%
  summarise(
    n = n(),
    avg_first_yr_war = mean(first_year_war, na.rm = TRUE),
    avg_5yr_war = mean(career_war_5yr, na.rm = TRUE),
    avg_bonus = mean(signing_bonus_m, na.rm = TRUE)
  )

cat("\n\nPerformance by Age at Debut:\n")
print(age_performance)

# Create projection function for Cuban/Latin players
project_cuban_latin <- function(age, exit_velo, hit_grade, power_grade,
                                speed_grade, competition_level = "showcase") {

  # Base WAR from tools (scout grades on 20-80 scale)
  tool_war <- (hit_grade - 50) * 0.15 +
              (power_grade - 50) * 0.12 +
              (speed_grade - 50) * 0.08

  # Age adjustment (younger = higher ceiling)
  age_adj <- max(0.7, 1.3 - (age - 20) * 0.05)

  # Exit velocity component
  velo_war <- (exit_velo - 100) * 0.3

  # Competition adjustment
  comp_factor <- case_when(
    competition_level == "MLB" ~ 1.0,
    competition_level == "showcase" ~ 0.85,
    competition_level == "cuban_series" ~ 0.80,
    TRUE ~ 0.75
  )

  # First year projection
  year1_war <- (tool_war + velo_war) * age_adj * comp_factor

  # 5-year projection (with development curve)
  year5_war <- year1_war * 3.2  # Average multiplier from data

  # Confidence based on data availability
  confidence <- case_when(
    competition_level == "MLB" ~ "High",
    competition_level == "showcase" & !is.na(exit_velo) ~ "Medium",
    TRUE ~ "Low"
  )

  return(data.frame(
    projected_year1_WAR = round(year1_war, 1),
    projected_5yr_WAR = round(year5_war, 1),
    age_factor = round(age_adj, 2),
    confidence = confidence
  ))
}

# Example projections
cat("\n\nExample Projection 1: 20-year-old Cuban OF\n")
cat("Exit Velo: 108 mph, Hit: 60, Power: 70, Speed: 65\n")
print(project_cuban_latin(20, 108, 60, 70, 65, "showcase"))

cat("\n\nExample Projection 2: 26-year-old established Cuban star\n")
cat("Exit Velo: 110 mph, Hit: 65, Power: 75, Speed: 50\n")
print(project_cuban_latin(26, 110, 65, 75, 50, "cuban_series"))
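
To make the arithmetic inside project_cuban_latin easy to check by hand, here is a line-for-line Python sketch of the same formula. The Python function name and dictionary keys are ours; the coefficients are the ones in the R code. For Example 1, the tools contribute 5.1, exit velocity contributes 2.4, and the age and competition factors are 1.3 and 0.85.

```python
def project_cuban_latin_py(age, exit_velo, hit_grade, power_grade,
                           speed_grade, competition_level="showcase"):
    """Python mirror of the R project_cuban_latin() formula (illustrative)."""
    # Tool contribution: grades on the 20-80 scale, 50 = average
    tool_war = ((hit_grade - 50) * 0.15 +
                (power_grade - 50) * 0.12 +
                (speed_grade - 50) * 0.08)
    # Age multiplier, floored at 0.7 for older players
    age_adj = max(0.7, 1.3 - (age - 20) * 0.05)
    # Exit velocity contribution relative to 100 mph
    velo_war = (exit_velo - 100) * 0.3
    # Competition discount; unknown levels fall back to 0.75
    comp_factor = {"MLB": 1.0, "showcase": 0.85,
                   "cuban_series": 0.80}.get(competition_level, 0.75)
    year1_war = (tool_war + velo_war) * age_adj * comp_factor
    return {
        "year1_war": year1_war,
        "year5_war": year1_war * 3.2,  # same multiplier as the R code
        "age_factor": age_adj,
    }

example_1 = project_cuban_latin_py(20, 108, 60, 70, 65, "showcase")
example_2 = project_cuban_latin_py(26, 110, 65, 75, 50, "cuban_series")
print(round(example_1["year1_war"], 1), round(example_1["year5_war"], 1))
print(round(example_2["year1_war"], 1), round(example_2["year5_war"], 1))
```

Example 1 works out to about 8.3 WAR in year one, which exceeds every first_year_war value in the cuban_players table (maximum 6.2). The coefficients are therefore best treated as placeholders to be calibrated, not as fitted values.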

23.5 World Baseball Classic Analysis

The World Baseball Classic (WBC) offers a unique analytical opportunity: top international players competing at maximum intensity in a short tournament format.

WBC Performance Analytics

# R: WBC performance analysis
library(tidyverse)

# 2023 WBC key performers (sample data)
wbc_2023 <- data.frame(
  player = c("Shohei Ohtani", "Trea Turner", "Mike Trout",
             "Masataka Yoshida", "Mookie Betts", "Lars Nootbaar",
             "Randy Arozarena", "Julio Rodriguez", "Paul Goldschmidt"),
  country = c("JPN", "USA", "USA", "JPN", "USA", "JPN", "MEX", "DOM", "USA"),
  pa = c(28, 37, 33, 31, 35, 28, 42, 25, 33),
  avg = c(.435, .389, .273, .429, .263, .345, .419, .348, .200),
  obp = c(.606, .500, .433, .548, .371, .448, .500, .440, .273),
  slg = c(.739, .722, .636, .714, .421, .690, .744, .652, .343),
  hr = c(1, 2, 1, 1, 1, 2, 4, 1, 1),
  sb = c(1, 3, 0, 1, 0, 1, 3, 2, 0),
  wrc_plus = c(280, 265, 188, 295, 132, 250, 310, 245, 85),
  mlb_2023_wrc_plus = c(184, 132, 126, 126, 147, 95, 126, 136, 131)
)

# Compare WBC to MLB performance
wbc_2023 <- wbc_2023 %>%
  mutate(
    wbc_vs_mlb = wrc_plus - mlb_2023_wrc_plus,
    performance_tier = case_when(
      wbc_vs_mlb > 100 ~ "Massive Outperformance",
      wbc_vs_mlb > 50 ~ "Strong Outperformance",
      wbc_vs_mlb > 0 ~ "Slight Outperformance",
      wbc_vs_mlb > -50 ~ "Underperformance",
      TRUE ~ "Major Underperformance"
    )
  )

# Analysis
cat("WBC vs MLB Regular Season Performance:\n")
print(wbc_2023 %>%
  select(player, country, wrc_plus, mlb_2023_wrc_plus, wbc_vs_mlb, performance_tier) %>%
  arrange(desc(wbc_vs_mlb)))

# Visualization
ggplot(wbc_2023, aes(x = mlb_2023_wrc_plus, y = wrc_plus)) +
  geom_point(aes(color = country, size = pa), alpha = 0.7) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
  geom_text(aes(label = player), hjust = -0.1, size = 2.5) +
  labs(title = "WBC Performance vs 2023 MLB Season",
       subtitle = "wRC+ comparison (100 = league average)",
       x = "2023 MLB Regular Season wRC+",
       y = "2023 WBC wRC+",
       color = "Country",
       size = "WBC PA") +
  theme_minimal() +
  xlim(75, 200) +
  ylim(75, 325)

# Small sample size considerations
cat("\n\nSmall Sample Considerations:\n")
cat(sprintf("Average WBC PA: %.1f\n", mean(wbc_2023$pa)))
cat("Commonly cited stabilization points: ~60 PA for strikeout rate, ~120 PA for walk rate, ~900 AB for AVG\n")
cat(sprintf("WBC provides ~5%% of full season sample size\n"))

The WBC demonstrates the challenge of small-sample analysis while providing insights into player performance under pressure and international competition.
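
One way to make the small-sample point concrete is to regress a tournament line toward a prior before interpreting it. The sketch below is illustrative only: shrink_rate is a hypothetical helper, the .320 prior is the MLB average OBP quoted earlier in this chapter, and the 460 PA stabilization constant is an assumed value that you should replace with your own estimate.

```python
def shrink_rate(observed_rate, sample_pa, prior_rate, stabilization_pa):
    """Blend an observed rate with a prior, weighting by plate appearances.

    stabilization_pa acts like a number of "prior" plate appearances:
    when sample_pa == stabilization_pa the two sources get equal weight.
    """
    if sample_pa < 0 or stabilization_pa <= 0:
        raise ValueError("sample_pa must be >= 0 and stabilization_pa > 0")
    weight = sample_pa / (sample_pa + stabilization_pa)
    return weight * observed_rate + (1 - weight) * prior_rate

# A .606 OBP over 28 PA (the top OBP line in the wbc_2023 sample data)
wbc_obp_estimate = shrink_rate(0.606, 28, prior_rate=0.320, stabilization_pa=460)
print(f"Regressed OBP estimate: {wbc_obp_estimate:.3f}")
```

With these assumed inputs, a 28 PA hot streak moves the estimate only slightly above the prior, which is the practical meaning of a small sample.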

23.6 International Scouting Data Challenges

International player evaluation faces unique data challenges requiring specialized analytical approaches.

Key Challenges

Limited tracking data: Most international leagues lack publicly available Statcast-style ball and player tracking
Inconsistent statistical standards: Leagues differ in scoring conventions and in which statistics they publish
Video access restrictions: Limited broadcast availability
Age verification: Critical in Latin America
Cultural/language barriers: Scouting communication issues
Political factors: Cuban defection complications

Alternative Data Sources

# Python: Building composite international player evaluation
import pandas as pd
import numpy as np

class InternationalPlayerEvaluator:
    """
    Comprehensive evaluation system for international players
    combining traditional stats, physical metrics, and projections
    """

    def __init__(self):
        self.weights = {
            'stats': 0.35,
            'tools': 0.30,
            'physical': 0.20,
            'track_record': 0.15
        }

    def evaluate_stats(self, league, avg, obp, slg, age):
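        """Evaluate statistical performance with league adjustments.

        Returns an OPS+-style score (100 = MLB-average OPS), not a 20-80
        grade. `avg` is accepted for completeness but not currently used.
        """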
        league_factors = {
            'MLB': 1.00,
            'NPB': 0.85,
            'KBO': 0.78,
            'CPBL': 0.70,
            'Cuban': 0.75,
            'Mexican': 0.68
        }

        factor = league_factors.get(league, 0.65)

        # Approximate an adjusted OPS+: OPS relative to the MLB baseline,
        # discounted by the league quality factor
        ops = obp + slg
        league_avg_ops = 0.720  # MLB baseline
        adj_ops_plus = (ops / league_avg_ops) * factor * 100

        # Age adjustment
        age_curve = 1.0 - abs(age - 26) * 0.02

        return adj_ops_plus * age_curve

    def evaluate_tools(self, hit, power, speed, field, arm):
        """Evaluate 5-tool grades (20-80 scale)"""
        # Weights for different positions (can be customized)
        tool_weights = {
            'hit': 0.30,
            'power': 0.25,
            'speed': 0.15,
            'field': 0.15,
            'arm': 0.15
        }

        weighted_grade = (
            hit * tool_weights['hit'] +
            power * tool_weights['power'] +
            speed * tool_weights['speed'] +
            field * tool_weights['field'] +
            arm * tool_weights['arm']
        )

        return weighted_grade

    def evaluate_physical(self, height_in, weight_lbs, exit_velo,
                         sprint_speed, arm_velo=None):
        """Evaluate physical tools and measurables"""
        score = 50  # Start at average

        # Exit velocity component (major factor); skipped when not measured
        if exit_velo is not None:
            velo_score = (exit_velo - 87) * 2  # 87 mph = average
            score += velo_score * 0.4

        # Sprint speed component; skipped when not measured
        if sprint_speed is not None:
            # 27 ft/s = average MLB
            speed_score = (sprint_speed - 27) * 10
            score += speed_score * 0.3

        # Build/athleticism
        bmi = (weight_lbs / (height_in ** 2)) * 703
        if 22 <= bmi <= 27:  # Optimal athletic range
            score += 5

        return max(20, min(80, score))  # Clamp to 20-80 scale

    def evaluate_track_record(self, years_pro, level_reached,
                             consistency_score):
        """Evaluate professional track record"""
        level_scores = {
            'MLB': 80,
            'AAA': 70,
            'NPB': 70,
            'KBO': 65,
            'AA': 60,
            'CPBL': 60,
            'Cuban': 55,
            'A+': 50
        }

        level_score = level_scores.get(level_reached, 45)
        experience_bonus = min(10, years_pro * 2)

        return (level_score + experience_bonus) * (consistency_score / 100)

    def composite_score(self, stats_score, tools_score, physical_score,
                       track_score):
        """Calculate weighted composite score"""
        composite = (
            stats_score * self.weights['stats'] +
            tools_score * self.weights['tools'] +
            physical_score * self.weights['physical'] +
            track_score * self.weights['track_record']
        )

        return composite

    def risk_adjustment(self, composite_score, data_quality, age,
                       injury_history):
        """Adjust score for various risk factors"""
        risk_factor = 1.0

        # Data quality risk
        data_factors = {
            'high': 1.0,
            'medium': 0.92,
            'low': 0.85,
            'very_low': 0.75
        }
        risk_factor *= data_factors.get(data_quality, 0.85)

        # Age risk: decline risk after age 28, projection risk before age 22
        if age > 28:
            risk_factor *= 0.95
        elif age < 22:
            risk_factor *= 0.98  # Projection risk

        # Injury history
        if injury_history > 0:
            risk_factor *= (1 - injury_history * 0.05)

        return composite_score * risk_factor

    def generate_report(self, player_data):
        """Generate comprehensive evaluation report"""
        # Calculate component scores
        stats = self.evaluate_stats(
            player_data['league'],
            player_data['avg'],
            player_data['obp'],
            player_data['slg'],
            player_data['age']
        )

        tools = self.evaluate_tools(
            player_data['hit_grade'],
            player_data['power_grade'],
            player_data['speed_grade'],
            player_data['field_grade'],
            player_data['arm_grade']
        )

        physical = self.evaluate_physical(
            player_data['height'],
            player_data['weight'],
            player_data.get('exit_velo'),
            player_data.get('sprint_speed')
        )

        track_record = self.evaluate_track_record(
            player_data['years_pro'],
            player_data['level'],
            player_data['consistency']
        )

        # Composite score
        composite = self.composite_score(stats, tools, physical, track_record)

        # Risk-adjusted final score
        final = self.risk_adjustment(
            composite,
            player_data['data_quality'],
            player_data['age'],
            player_data.get('injury_history', 0)
        )

        # Grade assignment
        grade = self.assign_grade(final)

        return {
            'player': player_data.get('name', 'Unknown'),
            'stats_score': round(stats, 1),
            'tools_score': round(tools, 1),
            'physical_score': round(physical, 1),
            'track_record_score': round(track_record, 1),
            'composite_score': round(composite, 1),
            'final_score': round(final, 1),
            'grade': grade,
            'recommendation': self.generate_recommendation(final, player_data)
        }

    def assign_grade(self, score):
        """Convert numerical score to grade"""
        if score >= 70:
            return 'A+ (Elite)'
        elif score >= 65:
            return 'A (Plus Regular)'
        elif score >= 60:
            return 'B+ (Above Average)'
        elif score >= 55:
            return 'B (Average Starter)'
        elif score >= 50:
            return 'C+ (Platoon/Depth)'
        else:
            return 'C or below'

    def generate_recommendation(self, score, player_data):
        """Generate signing/acquisition recommendation"""
        # player_data is currently unused; kept for position- or age-specific rules
        if score >= 65:
            return "Strong pursue - potential impact player"
        elif score >= 60:
            return "Pursue - likely contributor"
        elif score >= 55:
            return "Monitor - depth/upside candidate"
        else:
            return "Pass unless at significant discount"

# Example evaluation
evaluator = InternationalPlayerEvaluator()

# Example player: NPB star
npb_star = {
    'name': 'NPB Star Candidate',
    'age': 26,
    'league': 'NPB',
    'avg': .315,
    'obp': .385,
    'slg': .545,
    'hit_grade': 60,
    'power_grade': 65,
    'speed_grade': 55,
    'field_grade': 60,
    'arm_grade': 60,
    'height': 72,  # inches
    'weight': 195,  # lbs
    'exit_velo': 107,
    'sprint_speed': 28.2,
    'years_pro': 6,
    'level': 'NPB',
    'consistency': 85,  # 0-100 scale
    'data_quality': 'high',
    'injury_history': 0
}

print("="*70)
print("INTERNATIONAL PLAYER EVALUATION REPORT")
print("="*70)

report = evaluator.generate_report(npb_star)
for key, value in report.items():
    print(f"{key.replace('_', ' ').title()}: {value}")

# Compare multiple international candidates
candidates = [
    {
        'name': 'Cuban Prospect A',
        'age': 22,
        'league': 'Cuban',
        'avg': .345,
        'obp': .420,
        'slg': .598,
        'hit_grade': 65,
        'power_grade': 70,
        'speed_grade': 60,
        'field_grade': 55,
        'arm_grade': 60,
        'height': 74,
        'weight': 215,
        'exit_velo': 110,
        'sprint_speed': 28.5,
        'years_pro': 3,
        'level': 'Cuban',
        'consistency': 75,
        'data_quality': 'low',
        'injury_history': 0
    },
    {
        'name': 'KBO Veteran B',
        'age': 29,
        'league': 'KBO',
        'avg': .298,
        'obp': .365,
        'slg': .485,
        'hit_grade': 60,
        'power_grade': 60,
        'speed_grade': 50,
        'field_grade': 65,
        'arm_grade': 65,
        'height': 70,
        'weight': 185,
        'exit_velo': 105,
        'sprint_speed': 27.5,
        'years_pro': 8,
        'level': 'KBO',
        'consistency': 90,
        'data_quality': 'medium',
        'injury_history': 1
    }
]

print("\n" + "="*70)
print("CANDIDATE COMPARISON")
print("="*70)

comparison_results = []
for candidate in candidates:
    result = evaluator.generate_report(candidate)
    comparison_results.append(result)

comparison_df = pd.DataFrame(comparison_results)
print(comparison_df[['player', 'final_score', 'grade', 'recommendation']].to_string(index=False))
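
Note that evaluate_stats returns an OPS+-style number centered on 100, while the tools and physical components sit on the 20-80 scouting scale, so the statistical component can dominate the composite. One possible remedy, offered as a sketch and not as part of the original class, is to map the statistical score onto 20-80 before weighting. The function name ops_plus_to_grade and the conversion rate of one 10-point grade step per 20 OPS+ points are assumptions for illustration.

```python
def ops_plus_to_grade(ops_plus, points_per_grade_step=20.0):
    """Map an OPS+-style score (100 = average) to the 20-80 scouting scale.

    Assumption: every `points_per_grade_step` OPS+ points above or below 100
    is worth one 10-point grade step; the result is clamped to [20, 80].
    """
    grade = 50.0 + 10.0 * (ops_plus - 100.0) / points_per_grade_step
    return max(20.0, min(80.0, grade))

# Adjusted score for the NPB example above, using the evaluate_stats formula
# with the NPB league factor of 0.85 and an age of 26 (age_curve = 1.0)
npb_adjusted = (0.385 + 0.545) / 0.720 * 0.85 * 100
print(round(npb_adjusted, 1), round(ops_plus_to_grade(npb_adjusted), 1))
```

Rescaling this way keeps all four components in comparable units, so the weights in self.weights describe relative importance and not differences in scale.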

Data Quality Framework

Organizations should establish data quality tiers:

Tier 1: Full ball and player tracking (Statcast or equivalent), verified stats (MLB, some NPB)
Tier 2: Comprehensive traditional stats, video (NPB, KBO)
Tier 3: Basic stats, limited video (CPBL, Mexican League)
Tier 4: Incomplete stats, showcase only (Cuban, some Latin leagues)
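
The tiers can be wired into the risk_adjustment step of the evaluator by translating each tier into one of its data_quality labels. The mapping below is an assumption made for illustration, since the chapter does not specify one; only the discount factors are taken from data_factors in the class above.

```python
# Assumed mapping from data quality tier to the evaluator's data_quality label
TIER_TO_DATA_QUALITY = {1: "high", 2: "medium", 3: "low", 4: "very_low"}

# Discount factors copied from InternationalPlayerEvaluator.risk_adjustment
DATA_FACTORS = {"high": 1.0, "medium": 0.92, "low": 0.85, "very_low": 0.75}

def data_quality_for_tier(tier):
    """Return the (label, discount factor) pair for a data quality tier (1-4)."""
    if tier not in TIER_TO_DATA_QUALITY:
        raise ValueError(f"Unknown data quality tier: {tier!r}")
    label = TIER_TO_DATA_QUALITY[tier]
    return label, DATA_FACTORS[label]

for tier in sorted(TIER_TO_DATA_QUALITY):
    label, factor = data_quality_for_tier(tier)
    print(f"Tier {tier}: data_quality='{label}', score multiplier {factor:.2f}")
```

The returned label can be passed as player_data['data_quality'] when calling generate_report, so that a Tier 4 showcase-only player is discounted more heavily than a Tier 2 player with full traditional statistics.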

Python

# Python: Building composite international player evaluation
import pandas as pd
import numpy as np

class InternationalPlayerEvaluator:
    """
    Comprehensive evaluation system for international players
    combining traditional stats, physical metrics, and projections
    """

    def __init__(self):
        self.weights = {
            'stats': 0.35,
            'tools': 0.30,
            'physical': 0.20,
            'track_record': 0.15
        }

    def evaluate_stats(self, league, avg, obp, slg, age):
        """Evaluate statistical performance with league adjustments"""
        league_factors = {
            'MLB': 1.00,
            'NPB': 0.85,
            'KBO': 0.78,
            'CPBL': 0.70,
            'Cuban': 0.75,
            'Mexican': 0.68
        }

        factor = league_factors.get(league, 0.65)

        # Calculate adjusted OPS+
        ops = obp + slg
        league_avg_ops = 0.720  # MLB baseline
        adj_ops_plus = (ops / league_avg_ops) * factor * 100

        # Age adjustment
        age_curve = 1.0 - abs(age - 26) * 0.02

        return adj_ops_plus * age_curve

    def evaluate_tools(self, hit, power, speed, field, arm):
        """Evaluate 5-tool grades (20-80 scale)"""
        # Weights for different positions (can be customized)
        tool_weights = {
            'hit': 0.30,
            'power': 0.25,
            'speed': 0.15,
            'field': 0.15,
            'arm': 0.15
        }

        weighted_grade = (
            hit * tool_weights['hit'] +
            power * tool_weights['power'] +
            speed * tool_weights['speed'] +
            field * tool_weights['field'] +
            arm * tool_weights['arm']
        )

        return weighted_grade

    def evaluate_physical(self, height_in, weight_lbs, exit_velo,
                         sprint_speed, arm_velo=None):
        """Evaluate physical tools and measurables"""
        score = 50  # Start at average

        # Exit velocity component (major factor)
        if exit_velo:
            velo_score = (exit_velo - 87) * 2  # 87 mph = average
            score += velo_score * 0.4

        # Sprint speed component
        if sprint_speed:
            # 27 ft/s = average MLB
            speed_score = (sprint_speed - 27) * 10
            score += speed_score * 0.3

        # Build/athleticism
        bmi = (weight_lbs / (height_in ** 2)) * 703
        if 22 <= bmi <= 27:  # Optimal athletic range
            score += 5

        return max(20, min(80, score))  # Clamp to 20-80 scale

    def evaluate_track_record(self, years_pro, level_reached,
                             consistency_score):
        """Evaluate professional track record"""
        level_scores = {
            'MLB': 80,
            'AAA': 70,
            'NPB': 70,
            'KBO': 65,
            'AA': 60,
            'CPBL': 60,
            'Cuban': 55,
            'A+': 50
        }

        level_score = level_scores.get(level_reached, 45)
        experience_bonus = min(10, years_pro * 2)

        return (level_score + experience_bonus) * (consistency_score / 100)

    def composite_score(self, stats_score, tools_score, physical_score,
                       track_score):
        """Calculate weighted composite score"""
        composite = (
            stats_score * self.weights['stats'] +
            tools_score * self.weights['tools'] +
            physical_score * self.weights['physical'] +
            track_score * self.weights['track_record']
        )

        return composite

    def risk_adjustment(self, composite_score, data_quality, age,
                       injury_history):
        """Adjust score for various risk factors"""
        risk_factor = 1.0

        # Data quality risk
        data_factors = {
            'high': 1.0,
            'medium': 0.92,
            'low': 0.85,
            'very_low': 0.75
        }
        risk_factor *= data_factors.get(data_quality, 0.85)

        # Age risk (older = more risk for prospect, less for proven player)
        if age > 28:
            risk_factor *= 0.95
        elif age < 22:
            risk_factor *= 0.98  # Projection risk

        # Injury history
        if injury_history > 0:
            risk_factor *= (1 - injury_history * 0.05)

        return composite_score * risk_factor

    def generate_report(self, player_data):
        """Generate comprehensive evaluation report"""
        # Calculate component scores
        stats = self.evaluate_stats(
            player_data['league'],
            player_data['avg'],
            player_data['obp'],
            player_data['slg'],
            player_data['age']
        )

        tools = self.evaluate_tools(
            player_data['hit_grade'],
            player_data['power_grade'],
            player_data['speed_grade'],
            player_data['field_grade'],
            player_data['arm_grade']
        )

        physical = self.evaluate_physical(
            player_data['height'],
            player_data['weight'],
            player_data.get('exit_velo'),
            player_data.get('sprint_speed')
        )

        track_record = self.evaluate_track_record(
            player_data['years_pro'],
            player_data['level'],
            player_data['consistency']
        )

        # Composite score
        composite = self.composite_score(stats, tools, physical, track_record)

        # Risk-adjusted final score
        final = self.risk_adjustment(
            composite,
            player_data['data_quality'],
            player_data['age'],
            player_data.get('injury_history', 0)
        )

        # Grade assignment
        grade = self.assign_grade(final)

        return {
            'player': player_data.get('name', 'Unknown'),
            'stats_score': round(stats, 1),
            'tools_score': round(tools, 1),
            'physical_score': round(physical, 1),
            'track_record_score': round(track_record, 1),
            'composite_score': round(composite, 1),
            'final_score': round(final, 1),
            'grade': grade,
            'recommendation': self.generate_recommendation(final, player_data)
        }

    def assign_grade(self, score):
        """Convert numerical score to grade"""
        if score >= 70:
            return 'A+ (Elite)'
        elif score >= 65:
            return 'A (Plus Regular)'
        elif score >= 60:
            return 'B+ (Above Average)'
        elif score >= 55:
            return 'B (Average Starter)'
        elif score >= 50:
            return 'C+ (Platoon/Depth)'
        else:
            return 'C or below'

    def generate_recommendation(self, score, player_data):
        """Generate signing/acquisition recommendation"""
        # Thresholds depend only on the final score; player_data is currently unused
        if score >= 65:
            return "Strong pursue - potential impact player"
        elif score >= 60:
            return "Pursue - likely contributor"
        elif score >= 55:
            return "Monitor - depth/upside candidate"
        else:
            return "Pass unless at significant discount"

# Example evaluation
evaluator = InternationalPlayerEvaluator()

# Example player: NPB star
npb_star = {
    'name': 'NPB Star Candidate',
    'age': 26,
    'league': 'NPB',
    'avg': .315,
    'obp': .385,
    'slg': .545,
    'hit_grade': 60,
    'power_grade': 65,
    'speed_grade': 55,
    'field_grade': 60,
    'arm_grade': 60,
    'height': 72,  # inches
    'weight': 195,  # lbs
    'exit_velo': 107,
    'sprint_speed': 28.2,
    'years_pro': 6,
    'level': 'NPB',
    'consistency': 85,  # 0-100 scale
    'data_quality': 'high',
    'injury_history': 0
}

print("="*70)
print("INTERNATIONAL PLAYER EVALUATION REPORT")
print("="*70)

report = evaluator.generate_report(npb_star)
for key, value in report.items():
    print(f"{key.replace('_', ' ').title()}: {value}")

# Compare multiple international candidates
candidates = [
    {
        'name': 'Cuban Prospect A',
        'age': 22,
        'league': 'Cuban',
        'avg': .345,
        'obp': .420,
        'slg': .598,
        'hit_grade': 65,
        'power_grade': 70,
        'speed_grade': 60,
        'field_grade': 55,
        'arm_grade': 60,
        'height': 74,
        'weight': 215,
        'exit_velo': 110,
        'sprint_speed': 28.5,
        'years_pro': 3,
        'level': 'Cuban',
        'consistency': 75,
        'data_quality': 'low',
        'injury_history': 0
    },
    {
        'name': 'KBO Veteran B',
        'age': 29,
        'league': 'KBO',
        'avg': .298,
        'obp': .365,
        'slg': .485,
        'hit_grade': 60,
        'power_grade': 60,
        'speed_grade': 50,
        'field_grade': 65,
        'arm_grade': 65,
        'height': 70,
        'weight': 185,
        'exit_velo': 105,
        'sprint_speed': 27.5,
        'years_pro': 8,
        'level': 'KBO',
        'consistency': 90,
        'data_quality': 'medium',
        'injury_history': 1
    }
]

print("\n" + "="*70)
print("CANDIDATE COMPARISON")
print("="*70)

import pandas as pd  # needed for the table below; harmless if already imported earlier

# Evaluate every candidate and collect the reports into one table
comparison_results = [evaluator.generate_report(c) for c in candidates]
comparison_df = pd.DataFrame(comparison_results)
print(comparison_df[['player', 'final_score', 'grade', 'recommendation']].to_string(index=False))
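
To see how the multiplicative risk factors combine, the sketch below restates the risk_adjustment logic as a standalone function and applies it to the risk inputs of the two comparison candidates. The function name risk_factor_for and the composite score of 60.0 are illustrative assumptions, not values the evaluator would produce.

```python
# Standalone restatement of the risk_adjustment multipliers used in the class.
DATA_FACTORS = {'high': 1.0, 'medium': 0.92, 'low': 0.85, 'very_low': 0.75}

def risk_factor_for(data_quality, age, injury_history):
    """Return the combined multiplicative risk factor (hypothetical helper)."""
    factor = DATA_FACTORS.get(data_quality, 0.85)
    if age > 28:
        factor *= 0.95
    elif age < 22:
        factor *= 0.98
    if injury_history > 0:
        factor *= (1 - injury_history * 0.05)
    return factor

# KBO Veteran B: medium data, age 29, one prior injury -> 0.92 * 0.95 * 0.95
kbo_factor = risk_factor_for('medium', 29, 1)
# Cuban Prospect A: low data, age 22 (no age penalty), no injuries -> 0.85
cuban_factor = risk_factor_for('low', 22, 0)

illustrative_composite = 60.0  # made-up composite score, for arithmetic only
print(f"KBO factor:   {kbo_factor:.4f} -> {illustrative_composite * kbo_factor:.1f}")
print(f"Cuban factor: {cuban_factor:.4f} -> {illustrative_composite * cuban_factor:.1f}")
```

Because the penalties multiply, three modest discounts on the veteran (8%, 5%, 5%) remove about 17% of the composite score, slightly more than the single data-quality discount applied to the Cuban prospect.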

23.7 Exercises

Exercise 23.1: NPB Translation Model

Using the provided NPB-to-MLB translation data, build a regression model to project the first-year MLB performance of a hypothetical NPB player:

Player Profile:

Age: 25
Final NPB season: .305/.380/.520, 28 HR in 550 PA
Position: Corner OF
Exit velocity: 106 mph (NPB measurement)

Tasks:

1. Apply the translation factors from Section 23.2
2. Calculate projected MLB slash line and HR total
3. Estimate first-year WAR using the projection
4. Assess confidence level and identify key uncertainties

Bonus: Compare your projection to actual performance of similar NPB players (e.g., Seiya Suzuki, Masataka Yoshida).

Exercise 23.2: KBO Pitcher Projection

A 27-year-old KBO left-handed starter has the following final season:

2.65 ERA, 1.15 WHIP
9.8 K/9, 2.8 BB/9, 0.75 HR/9
175 IP, 15-6 record

Tasks:

1. Using the KBO pitcher translation model from Section 23.3, project his MLB stats
2. Calculate projected FIP and ERA
3. Build a confidence interval for your ERA projection
4. Compare to similar KBO pitchers (e.g., Hyun-Jin Ryu, Kwang-Hyun Kim)
5. Recommend a contract structure based on projection and risk
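
As a starting point for the FIP calculation, FIP can be computed directly from per-nine rates. The sketch below uses the standard form, (13*HR + 3*BB - 2*K)/IP plus a league constant, rewritten in per-nine terms. It omits hit batsmen because the stat line above does not list them, and the constant of 3.10 is an assumed typical value, not a KBO- or MLB-specific one. It is applied to the raw KBO line, before any Section 23.3 translation, so the result is a baseline and not a projection.

```python
def fip_from_rates(k9, bb9, hr9, fip_constant=3.10):
    """Approximate FIP from per-nine rates.

    HBP is omitted and fip_constant is an assumed value; substitute the
    league-season constant for real work.
    """
    return (13 * hr9 + 3 * bb9 - 2 * k9) / 9 + fip_constant

# Raw (untranslated) KBO line from the exercise
raw_kbo_fip = fip_from_rates(k9=9.8, bb9=2.8, hr9=0.75)
print(f"Raw KBO FIP: {raw_kbo_fip:.2f}")
```

Running the same function on the translated K/9, BB/9, and HR/9 gives the projected MLB FIP, which can then be compared with the projected ERA.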

Exercise 23.3: Latin American Tool-Based Valuation

You are evaluating three Dominican Republic prospects for international signing:

Prospect A:

Age: 17
Hit: 60, Power: 70, Speed: 55, Field: 55, Arm: 60
Exit velocity: 108 mph
Asking bonus: $3.5M

Prospect B:

Age: 16
Hit: 65, Power: 60, Speed: 70, Field: 65, Arm: 60
Exit velocity: 104 mph
Asking bonus: $4.0M

Prospect C:

Age: 18
Hit: 55, Power: 75, Speed: 45, Field: 50, Arm: 55
Exit velocity: 112 mph
Asking bonus: $2.5M

Tasks:

1. Use the Latin American projection model from Section 23.4
2. Project WAR through age 23 for each prospect
3. Calculate value per dollar of bonus
4. Rank the prospects considering both ceiling and floor outcomes
5. Recommend which prospect(s) to sign and at what price

Advanced: Simulate 1,000 career paths for each prospect incorporating uncertainty and injury risk.
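
One possible skeleton for the simulation is sketched below. It is a minimal Monte Carlo loop, not the Section 23.4 model: the function name simulate_paths, the probability of reaching MLB, the WAR distribution, the injury probability, and the 50% injury penalty are all hypothetical inputs chosen only to make the example run.

```python
import random
import statistics

def simulate_paths(p_reach_mlb, mean_war, sd_war, p_major_injury,
                   n_sims=1000, seed=0):
    """Simulate WAR-through-age-23 outcomes for one prospect.

    Every parameter is a hypothetical placeholder; derive real values
    from the projection model and historical signing outcomes.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    outcomes = []
    for _ in range(n_sims):
        if rng.random() >= p_reach_mlb:
            outcomes.append(0.0)   # never reaches MLB by age 23
            continue
        war = rng.gauss(mean_war, sd_war)
        if rng.random() < p_major_injury:
            war *= 0.5             # assumed penalty for a major injury
        outcomes.append(war)
    return outcomes

# Placeholder inputs, not estimates for any prospect in the exercise
paths = simulate_paths(p_reach_mlb=0.25, mean_war=2.0, sd_war=1.5,
                       p_major_injury=0.15)
deciles = statistics.quantiles(paths, n=10)  # nine cut points
print(f"mean WAR: {statistics.mean(paths):.2f}")
print(f"10th percentile: {deciles[0]:.2f}, 90th percentile: {deciles[-1]:.2f}")
```

Reporting the 10th and 90th percentiles alongside the mean gives a floor and a ceiling for each prospect, which is what the ranking step needs. Dividing the mean by the asking bonus gives a first pass at value per dollar.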

Exercise 23.4: International League Environment Analysis

Using the league comparison data from Section 23.1, conduct a comprehensive analysis:

Tasks:

1. Calculate park-adjusted metrics for each league
2. Estimate "true talent" translation factors using regression to the mean
3. Build a Bayesian updating system that improves projections as players accumulate MLB PA
4. Create visualizations comparing league offensive environments over time (2015-2023)
5. Develop recommendations for adjusting scouting priorities based on league trends

Data Required:

League-wide statistics (provided in section)
Park factors (research or estimate)
Historical translation success rates

Deliverables:

R or Python code implementing your analysis
Report summarizing findings
Recommendations for international scouting departments
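
For the Bayesian updating task, one simple approach is a Beta-Binomial model: the translated projection acts as a prior worth some number of plate appearances, and each MLB plate appearance shifts the estimate toward observed performance. The sketch below assumes this model and treats OBP as a per-PA success rate, which is a simplification. The function name update_obp, the prior of .340, the prior weight of 300 PA, and the observed totals are all hypothetical.

```python
def update_obp(prior_obp, prior_pa, times_on_base, mlb_pa):
    """Posterior mean OBP under a Beta prior worth prior_pa plate appearances."""
    alpha = prior_obp * prior_pa + times_on_base
    beta = (1 - prior_obp) * prior_pa + (mlb_pa - times_on_base)
    return alpha / (alpha + beta)

prior = 0.340     # hypothetical translated projection
strength = 300    # hypothetical prior weight, in PA-equivalents

# Hypothetical cumulative (MLB PA, times on base) checkpoints
for pa, on_base in [(0, 0), (100, 28), (300, 93), (600, 192)]:
    estimate = update_obp(prior, strength, on_base, pa)
    print(f"after {pa:3d} PA: estimated OBP {estimate:.3f}")
```

With no MLB data the estimate equals the prior. The larger the prior weight, the more MLB plate appearances it takes to move the estimate, so leagues with less reliable translation data could reasonably be given a smaller weight.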

Summary

International baseball analytics requires sophisticated approaches to account for varying competition levels, data quality, and cultural contexts. Key takeaways:

League translation factors are essential but imperfect tools requiring continuous refinement
NPB and KBO provide the highest quality international data and most reliable translation models
Cuban and Latin American evaluation relies heavily on physical tools and limited showcase data
Age at transition significantly impacts success rates across all international sources
Data quality tiers should inform confidence levels and risk assessment
Small sample sizes in tournaments like WBC require careful statistical interpretation

As international signing and posting systems evolve, analytical approaches must adapt to incorporate new data sources, changing competitive environments, and improved measurement technologies. Organizations that excel at international player evaluation and projection gain significant competitive advantages in talent acquisition.

Further Reading:

Baseball America's International Prospect Handbook
FanGraphs International Free Agent analysis series
MLB Pipeline scouting reports and tools grades
Academic research on translation factors (Baseball Prospectus, The Hardball Times)
Statcast comparative studies across international leagues

