Baseball's international footprint extends far beyond Major League Baseball, with professional leagues across Asia, Latin America, and other regions producing world-class talent. Understanding the global baseball landscape is essential for modern analytics, as international players increasingly impact MLB rosters and performance.
Major International Leagues
The primary professional baseball leagues outside MLB include:
- Nippon Professional Baseball (NPB) - Japan's elite league, founded in 1950
- Korea Baseball Organization (KBO) - South Korea's top league, established in 1982
- Chinese Professional Baseball League (CPBL) - Taiwan's premier league, started in 1990
- Cuban National Series - Cuba's domestic league system
- Mexican League (LMB) - Mexico's top-tier professional league
- Various Caribbean leagues - Dominican, Venezuelan, and Puerto Rican winter leagues
Each league operates with unique characteristics affecting player development, performance metrics, and translation to MLB standards.
League Comparison Framework
When analyzing international baseball, we must account for several factors:
- Competition level differences: NPB and KBO represent high-quality competition, roughly equivalent to Triple-A or low MLB levels
- Ball specifications: Different leagues use balls with varying specifications affecting flight characteristics
- Park dimensions: International stadiums often have different dimensions than MLB parks
- Schedule intensity: NPB plays 143 games, KBO plays 144, compared to MLB's 162
- Playing style: Cultural differences influence strategic approaches
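The schedule-length point above can be made concrete with a small sketch. It prorates a season counting stat to another league's schedule using the game counts quoted in the list; the function name `scale_to_schedule` and the 30-homer example are illustrative assumptions, and the adjustment covers playing time only, not competition quality.

```python
# Regular-season lengths quoted in the list above.
SEASON_GAMES = {"MLB": 162, "NPB": 143, "KBO": 144}


def scale_to_schedule(total, from_league, to_league="MLB"):
    """Prorate a season counting stat to another league's schedule length."""
    return total * SEASON_GAMES[to_league] / SEASON_GAMES[from_league]


# Hypothetical example: 30 home runs in a full NPB season.
npb_hr = 30
mlb_pace_hr = scale_to_schedule(npb_hr, "NPB")
print(f"{npb_hr} HR over 143 games is a {mlb_pace_hr:.1f} HR pace over 162 games")
```

Prorating this way keeps the per-game rate fixed, so it should be applied before, not instead of, any quality-of-competition translation.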
Let's examine league statistics to understand competitive balance:
# R: Comparing league statistics across international baseball
library(tidyverse)
library(ggplot2)
# League comparison data (2023 season)
league_stats <- data.frame(
league = c("MLB", "NPB", "KBO", "CPBL", "Mexican League"),
avg_ba = c(.248, .249, .269, .289, .285),
avg_obp = c(.320, .319, .336, .351, .348),
avg_slg = c(.409, .388, .408, .426, .437),
avg_era = c(4.33, 3.87, 4.54, 4.89, 4.72),
avg_k_rate = c(22.5, 21.8, 19.3, 17.2, 18.5),
avg_bb_rate = c(8.7, 8.9, 10.2, 9.8, 10.5),
hr_per_game = c(1.19, 1.08, 1.15, 1.23, 1.28),
games_per_season = c(162, 143, 144, 120, 114),
teams = c(30, 12, 10, 6, 18)
)
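# Approximate a wOBA-style index for each league from aggregate rates
# (a crude proxy: true wOBA needs per-PA event counts, and hr_per_game / 9 is not HR per PA)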
league_stats <- league_stats %>%
mutate(
wOBA = 0.69 * avg_bb_rate/100 +
0.72 * (avg_obp - avg_ba - avg_bb_rate/100) +
0.88 * (avg_ba - (avg_slg - avg_ba)/3.5) +
1.24 * ((avg_slg - avg_ba)/3.5 - hr_per_game/9) +
1.56 * (hr_per_game/9)
)
# Visualize offensive environment
ggplot(league_stats, aes(x = reorder(league, -wOBA), y = wOBA, fill = league)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = mean(league_stats$wOBA),
linetype = "dashed", color = "red") +
labs(title = "League Offensive Environments (2023)",
subtitle = "wOBA comparison across international leagues",
x = "League", y = "League Average wOBA") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))
# Print comparison table
print(league_stats %>%
select(league, avg_ba, avg_obp, avg_slg, avg_era, wOBA) %>%
arrange(desc(wOBA)))
# Python: International league environment analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# League comparison data
league_data = {
'league': ['MLB', 'NPB', 'KBO', 'CPBL', 'Mexican League'],
'avg_ba': [.248, .249, .269, .289, .285],
'avg_obp': [.320, .319, .336, .351, .348],
'avg_slg': [.409, .388, .408, .426, .437],
'avg_era': [4.33, 3.87, 4.54, 4.89, 4.72],
'avg_k_rate': [22.5, 21.8, 19.3, 17.2, 18.5],
'avg_bb_rate': [8.7, 8.9, 10.2, 9.8, 10.5],
'hr_per_game': [1.19, 1.08, 1.15, 1.23, 1.28]
}
df_leagues = pd.DataFrame(league_data)
# Calculate run environment index (REI) - normalized to MLB = 100
mlb_runs_per_game = 4.5
df_leagues['runs_per_game'] = [4.5, 4.2, 4.8, 5.1, 5.0]
df_leagues['rei'] = (df_leagues['runs_per_game'] / mlb_runs_per_game) * 100
# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Plot 1: Three True Outcomes
ax1 = axes[0]
x = np.arange(len(df_leagues))
width = 0.25
ax1.bar(x - width, df_leagues['avg_k_rate'], width, label='K%', alpha=0.8)
ax1.bar(x, df_leagues['avg_bb_rate'], width, label='BB%', alpha=0.8)
ax1.bar(x + width, df_leagues['hr_per_game']*2, width, label='HR/G (scaled)', alpha=0.8)
ax1.set_xlabel('League')
ax1.set_ylabel('Percentage / Rate')
ax1.set_title('Three True Outcomes Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(df_leagues['league'], rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)
# Plot 2: Run Environment Index
ax2 = axes[1]
colors = ['#d62728' if rei > 100 else '#1f77b4' for rei in df_leagues['rei']]
ax2.barh(df_leagues['league'], df_leagues['rei'], color=colors, alpha=0.7)
ax2.axvline(x=100, color='black', linestyle='--', linewidth=2, label='MLB Baseline')
ax2.set_xlabel('Run Environment Index (MLB = 100)')
ax2.set_title('League Run Scoring Environment')
ax2.legend()
ax2.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('international_league_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("\nLeague Statistics Summary:")
print(df_leagues[['league', 'avg_ba', 'avg_obp', 'avg_slg', 'rei']])
Talent Flow Patterns
International player movement to MLB has increased dramatically over the past three decades:
- NPB to MLB: Peak years 2012-2023 saw 50+ active NPB imports
- KBO to MLB: Increased after 2015, with notable success stories
- Cuban defectors: Major impact players since 1990s
- Latin American academies: Primary pipeline for teams
Understanding these patterns helps teams identify market inefficiencies and projection opportunities.
Nippon Professional Baseball represents the highest level of competition outside MLB. Successful translation of NPB performance to MLB projections requires sophisticated analytical approaches accounting for league differences.
Historical NPB-to-MLB Performance
Notable successful NPB imports include:
- Shohei Ohtani (Nippon-Ham Fighters): Two-way superstar
- Yoshinobu Yamamoto (Orix Buffaloes): Elite pitcher
- Masataka Yoshida (Orix Buffaloes): Contact-oriented outfielder
- Seiya Suzuki (Hiroshima Carp): Five-tool outfielder
- Yu Darvish (Nippon-Ham Fighters): Ace pitcher
- Shota Imanaga (DeNA BayStars): Left-handed starter
Translation Factors
Research suggests several key adjustments when projecting NPB stats to MLB:
- Offensive translation: Multiply NPB performance by 0.75-0.85 factor
- Power adjustment: HR totals typically translate at 0.70-0.80 rate
- Contact skills: High-contact NPB hitters maintain skills better
- Pitching velocity: Add 1-2 mph to NPB readings due to measurement differences
- Age consideration: Younger NPB players (under 27) translate better
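As a rough illustration of how the multiplier ranges above might be applied, the sketch below turns a single NPB value into a low/high MLB-equivalent band. The helper `translate_range`, the choice to apply the general offensive range to slugging, and the .550 SLG / 35 HR season are all assumptions made for the example.

```python
# Multiplier ranges quoted in the list above (NPB -> MLB).
OFFENSE_RANGE = (0.75, 0.85)   # general offensive translation
POWER_RANGE = (0.70, 0.80)     # home run translation


def translate_range(value, factor_range):
    """Return the (low, high) MLB-equivalent band for an NPB value."""
    low, high = factor_range
    return value * low, value * high


# Hypothetical NPB season: .550 SLG and 35 HR.
slg_low, slg_high = translate_range(0.550, OFFENSE_RANGE)
hr_low, hr_high = translate_range(35, POWER_RANGE)

print(f"SLG band: {slg_low:.3f} to {slg_high:.3f}")
print(f"HR band:  {hr_low:.1f} to {hr_high:.1f}")
```

Reporting a band rather than a point estimate keeps the uncertainty in the published ranges visible in the projection.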
Let's build a translation model using historical data:
# R: NPB to MLB translation model
library(tidyverse)
library(broom)
# Historical NPB-to-MLB hitter transitions (final NPB season vs first 2 MLB years)
npb_mlb_hitters <- data.frame(
player = c("Shohei Ohtani", "Masataka Yoshida", "Seiya Suzuki",
"Kenta Maeda", "Shogo Akiyama", "Yoshitomo Tsutsugo",
"Kosuke Fukudome", "Akinori Iwamura", "Norichika Aoki"),
age_mlb_debut = c(23, 29, 27, 24, 31, 28, 30, 27, 30),
npb_last_pa = c(382, 543, 516, 82, 575, 559, 606, 589, 621),
npb_last_avg = c(.286, .335, .315, .235, .301, .272, .344, .311, .292),
npb_last_obp = c(.358, .421, .418, .328, .376, .348, .453, .383, .348),
npb_last_slg = c(.500, .505, .537, .353, .454, .475, .628, .495, .449),
npb_last_hr = c(22, 21, 38, 1, 20, 29, 31, 23, 20),
mlb_first2_pa = c(870, 520, 798, 130, 453, 626, 1089, 1055, 1142),
mlb_first2_avg = c(.272, .280, .241, .188, .245, .197, .257, .275, .283),
mlb_first2_obp = c(.356, .346, .331, .270, .320, .314, .359, .352, .346),
mlb_first2_slg = c(.519, .430, .412, .313, .344, .343, .433, .418, .396),
mlb_first2_hr = c(40, 15, 28, 1, 7, 16, 30, 24, 18)
)
# Calculate translation ratios
npb_mlb_hitters <- npb_mlb_hitters %>%
mutate(
avg_ratio = mlb_first2_avg / npb_last_avg,
obp_ratio = mlb_first2_obp / npb_last_obp,
slg_ratio = mlb_first2_slg / npb_last_slg,
hr_ratio = (mlb_first2_hr / mlb_first2_pa * 600) / (npb_last_hr / npb_last_pa * 600),
age_group = ifelse(age_mlb_debut <= 26, "Young", "Veteran")
)
# Summary statistics
translation_summary <- npb_mlb_hitters %>%
group_by(age_group) %>%
summarise(
n = n(),
avg_translation = mean(avg_ratio),
obp_translation = mean(obp_ratio),
slg_translation = mean(slg_ratio),
hr_translation = mean(hr_ratio),
avg_sd = sd(avg_ratio),
slg_sd = sd(slg_ratio)
)
print(translation_summary)
# Build regression model for SLG translation
slg_model <- lm(slg_ratio ~ age_mlb_debut + npb_last_slg + npb_last_hr,
data = npb_mlb_hitters)
summary(slg_model)
# Visualization
ggplot(npb_mlb_hitters, aes(x = npb_last_slg, y = mlb_first2_slg)) +
geom_point(aes(color = age_group, size = npb_last_pa), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
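geom_abline(intercept = 0, slope = 0.85, color = "red", linetype = "dashed") +
geom_abline(intercept = 0, slope = 1, color = "grey40", linetype = "dotted") +  # 1:1 reference used by the annotation
labs(title = "NPB to MLB Slugging Translation",
subtitle = "Red line = 0.85x translation factor; grey dotted line = 1:1",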
x = "Final NPB Season SLG",
y = "First 2 MLB Seasons Average SLG",
color = "Age Group",
size = "NPB PA") +
theme_minimal() +
annotate("text", x = 0.55, y = 0.35,
label = "Most players fall\nbelow 1:1 line",
hjust = 0, color = "darkred")
# Function to project NPB hitter to MLB
project_npb_hitter <- function(age, npb_avg, npb_obp, npb_slg, npb_hr, npb_pa = 600) {
# Age adjustment factor
age_factor <- ifelse(age <= 26, 0.85, 0.78)
# Base translations
mlb_avg <- npb_avg * age_factor
mlb_obp <- npb_obp * (age_factor + 0.03) # OBP translates slightly better
mlb_slg <- npb_slg * (age_factor - 0.05) # Power translates worse
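# HR projection: a 0.10 power discount on top of the age factor (0.75 young, 0.68 veteran),
# in line with the 0.70-0.80 HR translation range; no further multiplier is stacked on top
mlb_hr <- (npb_hr / npb_pa * 600) * (age_factor - 0.10)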
return(data.frame(
proj_avg = round(mlb_avg, 3),
proj_obp = round(mlb_obp, 3),
proj_slg = round(mlb_slg, 3),
proj_hr_per_600pa = round(mlb_hr, 1),
age_factor = age_factor
))
}
# Example: Project a hypothetical NPB star
cat("\nProjection for 25-year-old NPB star (.310/.390/.550, 35 HR):\n")
print(project_npb_hitter(age = 25, npb_avg = .310, npb_obp = .390,
npb_slg = .550, npb_hr = 35))
# Python: Advanced NPB pitching translation model
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Historical NPB-to-MLB pitcher transitions
npb_mlb_pitchers = pd.DataFrame({
'player': ['Yu Darvish', 'Masahiro Tanaka', 'Kenta Maeda',
'Yoshinobu Yamamoto', 'Shota Imanaga', 'Kodai Senga',
'Yusei Kikuchi', 'Tomoyuki Sugano', 'Shohei Ohtani'],
'age_mlb_debut': [25, 25, 27, 25, 30, 30, 26, 31, 23],
'npb_last_era': [1.44, 1.27, 2.10, 1.21, 2.66, 1.94, 3.08, 1.97, 2.52],
'npb_last_whip': [0.82, 0.94, 1.03, 0.88, 1.02, 0.98, 1.20, 1.00, 1.09],
'npb_last_k9': [11.2, 10.8, 9.5, 11.7, 9.8, 10.5, 9.2, 8.9, 11.7],
'npb_last_bb9': [2.1, 2.3, 2.8, 1.8, 2.4, 2.6, 3.2, 1.9, 3.2],
'npb_last_ip': [232, 212, 175, 164, 148, 144, 163, 137, 155],
'mlb_first2_era': [3.38, 3.18, 3.73, 2.92, 2.91, 3.68, 4.50, np.nan, 2.86],
'mlb_first2_whip': [1.12, 1.11, 1.16, 1.04, 1.09, 1.22, 1.32, np.nan, 1.16],
'mlb_first2_k9': [10.8, 9.2, 9.8, 11.2, 10.2, 10.8, 9.5, np.nan, 11.9],
'mlb_first2_bb9': [2.8, 2.1, 2.5, 2.4, 2.6, 3.2, 3.8, np.nan, 3.5],
'mlb_first2_ip': [363, 314, 258, 189, 173, 166, 318, np.nan, 285]
})
# Remove players with insufficient MLB data
npb_mlb_pitchers = npb_mlb_pitchers.dropna()
# Calculate translation factors
npb_mlb_pitchers['era_ratio'] = npb_mlb_pitchers['mlb_first2_era'] / npb_mlb_pitchers['npb_last_era']
npb_mlb_pitchers['k9_ratio'] = npb_mlb_pitchers['mlb_first2_k9'] / npb_mlb_pitchers['npb_last_k9']
npb_mlb_pitchers['bb9_ratio'] = npb_mlb_pitchers['mlb_first2_bb9'] / npb_mlb_pitchers['npb_last_bb9']
print("NPB to MLB Pitcher Translation Factors:")
print(f"ERA multiplier: {npb_mlb_pitchers['era_ratio'].mean():.3f} (±{npb_mlb_pitchers['era_ratio'].std():.3f})")
print(f"K/9 retention: {npb_mlb_pitchers['k9_ratio'].mean():.3f} (±{npb_mlb_pitchers['k9_ratio'].std():.3f})")
print(f"BB/9 change: {npb_mlb_pitchers['bb9_ratio'].mean():.3f} (±{npb_mlb_pitchers['bb9_ratio'].std():.3f})")
# Build FIP-based translation model
def calculate_fip(era, k9, bb9, hr9=1.0):
    """Approximate FIP from per-nine rates; `era` is accepted but not used."""
    return ((13 * hr9) + (3 * bb9) - (2 * k9)) / 9 + 3.2
npb_mlb_pitchers['npb_fip'] = calculate_fip(
npb_mlb_pitchers['npb_last_era'],
npb_mlb_pitchers['npb_last_k9'],
npb_mlb_pitchers['npb_last_bb9'],
hr9=0.8 # Estimated NPB HR/9
)
npb_mlb_pitchers['mlb_fip'] = calculate_fip(
npb_mlb_pitchers['mlb_first2_era'],
npb_mlb_pitchers['mlb_first2_k9'],
npb_mlb_pitchers['mlb_first2_bb9'],
hr9=1.1 # Estimated MLB HR/9
)
# Regression model for FIP translation
X = npb_mlb_pitchers[['age_mlb_debut', 'npb_fip', 'npb_last_k9']].values
y = npb_mlb_pitchers['mlb_fip'].values
model = LinearRegression()
model.fit(X, y)
print(f"\nFIP Translation Model:")
print(f"R² Score: {model.score(X, y):.3f}")
print(f"Coefficients: Age={model.coef_[0]:.3f}, NPB_FIP={model.coef_[1]:.3f}, K9={model.coef_[2]:.3f}")
print(f"Intercept: {model.intercept_:.3f}")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Plot 1: ERA Translation
ax1 = axes[0]
ax1.scatter(npb_mlb_pitchers['npb_last_era'],
npb_mlb_pitchers['mlb_first2_era'],
s=100, alpha=0.6, c=npb_mlb_pitchers['age_mlb_debut'],
cmap='viridis')
ax1.plot([0, 4], [0, 4], 'k--', alpha=0.3, label='1:1 line')
ax1.plot([0, 4], [0, 8], 'r--', alpha=0.5, label='2x ERA line')
ax1.set_xlabel('NPB Final Season ERA')
ax1.set_ylabel('MLB First 2 Seasons ERA')
ax1.set_title('NPB to MLB ERA Translation')
ax1.legend()
ax1.grid(alpha=0.3)
# Add player labels
for idx, row in npb_mlb_pitchers.iterrows():
    ax1.annotate(row['player'].split()[-1],
                 (row['npb_last_era'], row['mlb_first2_era']),
                 fontsize=8, alpha=0.7)
# Plot 2: K/9 Retention
ax2 = axes[1]
ax2.scatter(npb_mlb_pitchers['npb_last_k9'],
npb_mlb_pitchers['mlb_first2_k9'],
s=100, alpha=0.6, c=npb_mlb_pitchers['age_mlb_debut'],
cmap='viridis')
ax2.plot([7, 13], [7, 13], 'k--', alpha=0.3, label='1:1 line')
ax2.set_xlabel('NPB Final Season K/9')
ax2.set_ylabel('MLB First 2 Seasons K/9')
ax2.set_title('Strikeout Rate Translation')
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('npb_mlb_pitcher_translation.png', dpi=300, bbox_inches='tight')
plt.show()
# Projection function
def project_npb_pitcher(age, npb_era, npb_k9, npb_bb9, npb_ip):
    """Project NPB pitcher stats to MLB"""
    # Age-based adjustment
    age_factor = max(0.7, 1.0 - (age - 25) * 0.03)
    # League difficulty adjustment (signs chosen so each line matches its comment)
    era_multiplier = 1.8 - (npb_era - 2.0) * 0.5  # Better NPB ERA = bigger jump
    k9_retention = 0.92 - (min(age, 27) - 25) * 0.02  # Younger = better retention
    bb9_increase = 1.25 + (9.0 - npb_k9) * 0.03  # Better K rate = better control
    # Projections
    mlb_era = npb_era * era_multiplier * (1 / age_factor)
    mlb_k9 = npb_k9 * k9_retention
    mlb_bb9 = npb_bb9 * bb9_increase
    mlb_fip = calculate_fip(mlb_era, mlb_k9, mlb_bb9, hr9=1.1)
    return {
        'projected_ERA': round(mlb_era, 2),
        'projected_K9': round(mlb_k9, 1),
        'projected_BB9': round(mlb_bb9, 1),
        'projected_FIP': round(mlb_fip, 2),
        'age_factor': round(age_factor, 3),
        'confidence': 'High' if age <= 27 and npb_ip > 150 else 'Medium'
    }
# Example projection
print("\n" + "="*60)
print("Example: 26-year-old NPB ace (1.85 ERA, 11.5 K/9, 2.0 BB/9, 180 IP)")
print("="*60)
projection = project_npb_pitcher(26, 1.85, 11.5, 2.0, 180)
for key, value in projection.items():
    print(f"{key}: {value}")
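With so few pitchers in the sample, an averaged ERA multiplier can look more reliable than it is. The sketch below is one way to check this, assuming a simple leave-one-out scheme: each pitcher's MLB ERA is predicted from the mean ERA ratio of the other pitchers. The ERA pairs are copied from the pitcher table above; the function name `leave_one_out_errors` is hypothetical, and this check is a supplement to the regression shown earlier, not a replacement for it.

```python
from statistics import mean

# (NPB final-season ERA, MLB first-two-seasons ERA) pairs copied from the
# pitcher table above; Sugano is omitted because his MLB values are missing.
era_pairs = {
    "Darvish": (1.44, 3.38),
    "Tanaka": (1.27, 3.18),
    "Maeda": (2.10, 3.73),
    "Yamamoto": (1.21, 2.92),
    "Imanaga": (2.66, 2.91),
    "Senga": (1.94, 3.68),
    "Kikuchi": (3.08, 4.50),
    "Ohtani": (2.52, 2.86),
}


def leave_one_out_errors(pairs):
    """Predict each pitcher's MLB ERA from the mean ERA ratio of the others."""
    errors = {}
    for name, (npb_era, mlb_era) in pairs.items():
        other_ratios = [m / n for key, (n, m) in pairs.items() if key != name]
        predicted = npb_era * mean(other_ratios)
        errors[name] = predicted - mlb_era
    return errors


errors = leave_one_out_errors(era_pairs)
mae = mean(abs(e) for e in errors.values())
for name, err in errors.items():
    print(f"{name:<9} prediction error: {err:+.2f} ERA")
print(f"Mean absolute error: {mae:.2f}")
```

Large individual errors here would suggest treating any single multiplier as a wide range rather than a fixed conversion.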
Case Study: Shohei Ohtani's Two-Way Translation
Ohtani's unprecedented two-way performance presents unique analytical challenges. His NPB stats (2016):
- Hitting: .322/.416/.588, 22 HR in 382 PA
- Pitching: 1.86 ERA, 174 K in 140 IP, 10-4 record
His MLB transition exceeded most projections, demonstrating the importance of age, athleticism, and elite raw tools in translation models.
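The rate statistics implied by that 2016 line can be worked out directly. The sketch below is a small arithmetic aid, with helper names (`per_nine`, `per_600_pa`) chosen for illustration and the 140 innings treated as whole innings.

```python
def per_nine(count, innings):
    """Convert a counting stat to a per-nine-innings rate."""
    return count * 9 / innings


def per_600_pa(count, plate_appearances):
    """Prorate a counting stat to a 600-plate-appearance season."""
    return count * 600 / plate_appearances


# Figures quoted above for Ohtani's 2016 NPB season.
k9 = per_nine(174, 140)
hr_600 = per_600_pa(22, 382)
print(f"K/9: {k9:.1f}")
print(f"HR per 600 PA: {hr_600:.1f}")
```

Putting both halves of his game on a rate basis makes the part-time workload in each role comparable to full-season lines from other players.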
The Korea Baseball Organization has emerged as an increasingly important source of MLB talent, particularly following the success of players like Jung Ho Kang (2015) and more recently, Ha-Seong Kim and others.
KBO League Characteristics
Key differences from MLB:
- Higher offensive environment: League BA typically .260-.270 vs MLB .240-.250
- Smaller parks: Many stadiums favor hitters
- Different ball: KBO ball historically had lower seams
- Designated hitter: Used in both leagues (since 2022 in the NL)
- Foreign player limit: Maximum 3 per team affects competition
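One simple way to account for the different offensive environments listed above is to express a rate stat relative to its league average. The sketch below assumes league batting averages at the midpoints of the quoted ranges (.265 for KBO, .245 for MLB) and uses a hypothetical helper, `relative_index`; it ignores park effects and competition quality.

```python
def relative_index(player_value, league_value):
    """Express a rate stat relative to league average (100 = average)."""
    return 100 * player_value / league_value


# League batting averages: midpoints of the ranges quoted above (assumed).
KBO_LEAGUE_BA = 0.265
MLB_LEAGUE_BA = 0.245

# Hypothetical hitters with the same raw .300 average in each league.
kbo_ba_plus = relative_index(0.300, KBO_LEAGUE_BA)
mlb_ba_plus = relative_index(0.300, MLB_LEAGUE_BA)
print(f"KBO .300 hitter: {kbo_ba_plus:.0f} (league = 100)")
print(f"MLB .300 hitter: {mlb_ba_plus:.0f} (league = 100)")
```

The same raw average sits closer to league average in the higher-offense league, which is part of why raw KBO lines need a discount before any comparison with MLB numbers.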
Translation Methodology
KBO translation requires different adjustments than NPB:
# R: KBO to MLB translation analysis
library(tidyverse)
library(ggplot2)
# KBO-to-MLB position player transitions
kbo_mlb_data <- data.frame(
player = c("Jung Ho Kang", "Ha-Seong Kim", "Hyun-soo Kim",
"Dae-ho Lee", "Tommy Joseph", "Eric Thames"),
age_mlb = c(28, 25, 28, 34, 24, 30),
kbo_final_avg = c(.356, .306, .318, .288, .263, .381),
kbo_final_obp = c(.459, .397, .406, .366, .333, .497),
kbo_final_slg = c(.739, .523, .488, .488, .470, .790),
kbo_final_hr = c(40, 11, 11, 17, 21, 47),
kbo_final_pa = c(564, 587, 621, 550, 587, 575),
mlb_avg = c(.255, .242, .229, .253, .235, .247),
mlb_obp = c(.354, .326, .299, .317, .286, .359),
mlb_slg = c(.461, .376, .340, .413, .402, .518),
mlb_hr_per_600 = c(27, 13, 8, 22, 20, 35),
mlb_pa_total = c(1248, 1456, 460, 531, 712, 983)
)
# Calculate translation factors
kbo_mlb_data <- kbo_mlb_data %>%
mutate(
avg_translation = mlb_avg / kbo_final_avg,
obp_translation = mlb_obp / kbo_final_obp,
slg_translation = mlb_slg / kbo_final_slg,
iso_kbo = kbo_final_slg - kbo_final_avg,
iso_mlb = mlb_slg - mlb_avg,
iso_translation = iso_mlb / iso_kbo,
power_class = case_when(
kbo_final_hr >= 30 ~ "Elite Power",
kbo_final_hr >= 20 ~ "Above Average",
TRUE ~ "Average"
)
)
# Summary by power class
power_translation <- kbo_mlb_data %>%
group_by(power_class) %>%
summarise(
n = n(),
avg_trans = mean(avg_translation),
slg_trans = mean(slg_translation),
iso_trans = mean(iso_translation)
)
print("KBO Translation Factors by Power Level:")
print(power_translation)
# Overall translation factors
cat("\nOverall KBO to MLB Translation:\n")
cat(sprintf("AVG: %.3f (multiply KBO avg by this)\n",
mean(kbo_mlb_data$avg_translation)))
cat(sprintf("OBP: %.3f\n", mean(kbo_mlb_data$obp_translation)))
cat(sprintf("SLG: %.3f\n", mean(kbo_mlb_data$slg_translation)))
cat(sprintf("ISO: %.3f\n", mean(kbo_mlb_data$iso_translation)))
# Visualization: Power translation
ggplot(kbo_mlb_data, aes(x = kbo_final_hr, y = mlb_hr_per_600)) +
geom_point(aes(size = mlb_pa_total, color = age_mlb), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
geom_abline(slope = 0.7, intercept = 0, linetype = "dashed", color = "red") +
scale_color_gradient(low = "green", high = "orange") +
labs(title = "KBO to MLB Home Run Translation",
subtitle = "Red dashed line = 0.70x translation",
x = "KBO Final Season HR",
y = "MLB HR per 600 PA",
size = "MLB PA",
color = "Age at MLB Debut") +
theme_minimal() +
geom_text(aes(label = player), hjust = -0.1, vjust = -0.5, size = 3)
# Advanced projection system
kbo_projection_model <- lm(mlb_slg ~ kbo_final_slg + age_mlb + I(iso_kbo),
data = kbo_mlb_data)
cat("\n\nKBO SLG Translation Model:\n")
print(summary(kbo_projection_model))
# Create comprehensive projection function
project_kbo_player <- function(age, kbo_avg, kbo_obp, kbo_slg, kbo_hr,
kbo_pa, defensive_value = 0) {
# Base translation factors (from empirical data)
avg_factor <- 0.75
obp_factor <- 0.82
slg_factor <- 0.68
# Age adjustment (peak = 26)
age_adj <- 1 - abs(age - 26) * 0.015
age_adj <- max(0.85, min(1.05, age_adj))
# Sample size adjustment
pa_confidence <- min(1, kbo_pa / 500)
# Calculate projections
proj_avg <- kbo_avg * avg_factor * age_adj
proj_obp <- kbo_obp * obp_factor * age_adj
proj_slg <- kbo_slg * slg_factor * age_adj
# Power metrics
kbo_iso <- kbo_slg - kbo_avg
proj_iso <- kbo_iso * slg_factor * age_adj
proj_hr_per_600 <- (kbo_hr / kbo_pa * 600) * slg_factor * age_adj
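# wOBA projection (rough approximation from the slash line)
# Assumes about two extra bases per extra-base hit, so XBH rate ~ ISO / 2
xbh_rate <- proj_iso / 2
woba_bb <- (proj_obp - proj_avg) * 0.69
woba_1b <- (proj_avg - xbh_rate) * 0.88
woba_xbh <- xbh_rate * 1.60
# The linear weights are already on the OBP scale, so no further rescaling
proj_woba <- woba_bb + woba_1b + woba_xbh
# WAR estimation (very rough): batting runs above average over 600 PA,
# plus defensive_value in runs and an assumed ~20-run replacement-level credit
woba_scale <- 1.15
batting_runs <- (proj_woba - 0.320) / woba_scale * 600
replacement_runs <- 20
war_estimate <- (batting_runs + defensive_value + replacement_runs) / 10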
return(data.frame(
projected_AVG = round(proj_avg, 3),
projected_OBP = round(proj_obp, 3),
projected_SLG = round(proj_slg, 3),
projected_ISO = round(proj_iso, 3),
projected_HR_600PA = round(proj_hr_per_600, 1),
projected_wOBA = round(proj_woba, 3),
estimated_WAR = round(war_estimate, 1),
confidence = round(pa_confidence * 100, 0)
))
}
# Example projection
cat("\n\nExample KBO Star Projection (26 years old):\n")
cat("KBO Stats: .320/.400/.580, 35 HR in 600 PA\n")
cat("Defensive Value: +5 runs\n\n")
print(project_kbo_player(26, .320, .400, .580, 35, 600, 5))
# Python: KBO pitcher analysis and projection
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# KBO pitcher MLB transitions
kbo_pitchers = pd.DataFrame({
'player': ['Hyun-Jin Ryu', 'Kwang-Hyun Kim', 'Ha-Seong Kim',
'Chan Ho Park', 'Jung Ho Kang'],
'type': ['SP', 'SP', 'Position', 'SP', 'Position'], # Position players included for comparison
'age_mlb': [26, 32, 25, 25, 28],
'kbo_era': [2.80, 3.13, np.nan, 3.64, np.nan],
'kbo_k9': [7.8, 7.2, np.nan, 8.1, np.nan],
'kbo_bb9': [2.1, 2.5, np.nan, 3.8, np.nan],
'kbo_hr9': [0.6, 0.7, np.nan, 0.8, np.nan],
'mlb_era': [3.17, 3.62, np.nan, 4.36, np.nan],
'mlb_k9': [8.5, 7.8, np.nan, 7.2, np.nan],
'mlb_bb9': [1.8, 2.4, np.nan, 3.9, np.nan],
'mlb_hr9': [0.9, 1.1, np.nan, 1.2, np.nan]
})
# Filter out position players
kbo_pitchers = kbo_pitchers[kbo_pitchers['type'].isin(['SP', 'RP'])].dropna()
# Calculate ratios
kbo_pitchers['era_change'] = kbo_pitchers['mlb_era'] - kbo_pitchers['kbo_era']
kbo_pitchers['k9_change'] = kbo_pitchers['mlb_k9'] - kbo_pitchers['kbo_k9']
kbo_pitchers['hr9_change'] = kbo_pitchers['mlb_hr9'] - kbo_pitchers['kbo_hr9']
print("KBO to MLB Pitcher Changes:")
print(f"Average ERA change: {kbo_pitchers['era_change'].mean():+.2f}")
print(f"Average K/9 change: {kbo_pitchers['k9_change'].mean():+.1f}")
print(f"Average HR/9 change: {kbo_pitchers['hr9_change'].mean():+.2f}")
# More comprehensive dataset with additional metrics
kbo_detailed = pd.DataFrame({
'season': [2012, 2019, 2013, 2016, 2020],
'pitcher': ['Ryu', 'Kim KH', 'Park', 'Oh', 'Other'],
'kbo_whip': [1.08, 1.24, 1.35, 1.15, 1.20],
'kbo_babip': [.275, .290, .295, .280, .285],
'kbo_lob_pct': [75.2, 73.1, 70.5, 74.8, 72.0],
'mlb_whip': [1.22, 1.28, 1.45, np.nan, np.nan],
'mlb_babip': [.285, .295, .305, np.nan, np.nan],
'mlb_lob_pct': [73.5, 71.8, 68.2, np.nan, np.nan]
})
# Advanced projection system for KBO pitchers
class KBOPitcherProjector:
    def __init__(self):
        # Empirically derived translation factors
        self.era_multiplier = 1.15  # KBO ERA typically increases 15%
        self.k9_retention = 0.98    # K rate mostly maintained
        self.bb9_multiplier = 1.05  # Slight walk increase
        self.hr9_multiplier = 1.45  # HR rate increases significantly

    def project_era_fip(self, kbo_era, kbo_k9, kbo_bb9, kbo_hr9, age):
        """Project FIP-based ERA for MLB.

        kbo_era is accepted for reference; the projection itself is built
        from the component rates (K/9, BB/9, HR/9).
        """
        # Age factor (peak at 27)
        age_factor = 1 + abs(age - 27) * 0.02
        # Component projections
        mlb_k9 = kbo_k9 * self.k9_retention
        mlb_bb9 = kbo_bb9 * self.bb9_multiplier
        mlb_hr9 = kbo_hr9 * self.hr9_multiplier
        # Calculate FIP; inputs are per-nine rates, so divide the weighted sum by 9
        mlb_fip = ((13 * mlb_hr9) + (3 * mlb_bb9) - (2 * mlb_k9)) / 9 + 3.2
        # ERA projection (FIP + league/age adjustment)
        mlb_era = mlb_fip * age_factor * 0.98
        return {
            'projected_ERA': round(mlb_era, 2),
            'projected_FIP': round(mlb_fip, 2),
            'projected_K9': round(mlb_k9, 1),
            'projected_BB9': round(mlb_bb9, 1),
            'projected_HR9': round(mlb_hr9, 2),
            'projected_WHIP': round((mlb_bb9 + (9 - mlb_k9) * 0.3) / 9 + 0.95, 2)
        }

    def confidence_interval(self, projection, sample_size_ip):
        """Calculate an approximate interval that narrows with more innings"""
        # Standard error decreases with more IP (floored at 0.3)
        se_factor = max(0.3, 1 / np.sqrt(sample_size_ip / 100))
        era_se = projection['projected_ERA'] * se_factor * 0.15
        return {
            'ERA_lower': round(projection['projected_ERA'] - 1.96 * era_se, 2),
            'ERA_upper': round(projection['projected_ERA'] + 1.96 * era_se, 2),
            'confidence_level': 0.95
        }
# Example usage
projector = KBOPitcherProjector()
print("\n" + "="*70)
print("KBO Pitcher Projection Example:")
print("="*70)
print("KBO Stats: 2.85 ERA, 9.2 K/9, 2.3 BB/9, 0.65 HR/9")
print("Age: 27, IP: 180")
print("-"*70)
projection = projector.project_era_fip(2.85, 9.2, 2.3, 0.65, 27)
for key, value in projection.items():
    print(f"{key}: {value}")
print("\n95% Confidence Interval:")
ci = projector.confidence_interval(projection, 180)
print(f"ERA Range: {ci['ERA_lower']} - {ci['ERA_upper']}")
# Visualization
fig, ax = plt.subplots(figsize=(10, 6))
# Simulate multiple pitcher projections
ages = np.array([24, 25, 26, 27, 28, 29, 30, 31, 32])
kbo_eras = np.array([2.5, 2.7, 2.6, 2.8, 2.9, 3.0, 2.8, 3.1, 3.2])
projected_eras = []
for age, kbo_era in zip(ages, kbo_eras):
    proj = projector.project_era_fip(kbo_era, 9.0, 2.5, 0.7, age)
    projected_eras.append(proj['projected_ERA'])
ax.scatter(ages, kbo_eras, s=100, alpha=0.6, label='KBO ERA', color='blue')
ax.scatter(ages, projected_eras, s=100, alpha=0.6, label='Projected MLB ERA', color='red')
ax.plot(ages, projected_eras, '--', alpha=0.3, color='red')
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('ERA', fontsize=12)
ax.set_title('KBO to MLB ERA Projection by Age', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
ax.set_ylim(2.0, 4.5)
plt.tight_layout()
plt.savefig('kbo_pitcher_age_curve.png', dpi=300, bbox_inches='tight')
plt.show()
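As a sanity check on the FIP arithmetic, the sketch below computes FIP directly from per-nine rates. It assumes the conventional form FIP = (13*HR + 3*BB - 2*K) / IP + C, which becomes a division by 9 when the inputs are already per-nine rates. The helper name `fip_from_per9` is hypothetical, hit batters are ignored, and the constant 3.2 is the value used in this section (the real constant varies by season).

```python
def fip_from_per9(hr9: float, bb9: float, k9: float, constant: float = 3.2) -> float:
    """Approximate FIP from per-nine-inning rates (HBP ignored)."""
    # Per-nine rates are counts per 9 innings, so divide the weighted sum by 9
    return (13 * hr9 + 3 * bb9 - 2 * k9) / 9 + constant


# KBO line from the projection example (0.65 HR/9, 2.3 BB/9, 9.2 K/9),
# after applying the section's translation multipliers
mlb_hr9 = 0.65 * 1.45
mlb_bb9 = 2.3 * 1.05
mlb_k9 = 9.2 * 0.98

projected_fip = fip_from_per9(mlb_hr9, mlb_bb9, mlb_k9)
print(f"Projected FIP: {projected_fip:.2f}")
```

Without the division by 9, the same inputs would give a FIP well above 4.5, which is a quick way to spot a missing denominator.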
Notable KBO Success Stories
Ha-Seong Kim (2021- ): Signed with San Diego Padres
- KBO Final: .306/.397/.523, 30 HR
- MLB Performance: Solid utility infielder, excellent defense
- Key: Elite athleticism and defensive versatility
Hyun-Jin Ryu (2013-2023): Dodgers, Blue Jays
- KBO Career: 2.80 ERA; 210 strikeouts in his final KBO season (2012)
- MLB Peak: 2.32 ERA (2019), All-Star
- Key: Command, deception, and preparation
Cuban and Latin American players present unique analytical challenges due to limited data availability, varying competition levels, and diverse development paths.
Data Availability Challenges
Unlike NPB and KBO, systematic statistical data from Cuban leagues and Latin American summer leagues is often:
- Incomplete: Missing advanced metrics
- Inconsistent: Different tracking standards
- Limited access: Restricted availability
- Variable quality: Competition levels vary widely
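Before modeling, it helps to quantify how incomplete a scouting dataset actually is. The following is a minimal sketch using only the standard library; the records, field names, and the `completeness` helper are all hypothetical and stand in for whatever a team's own data looks like.

```python
# Hypothetical scouting records; None marks a stat that was never tracked
records = [
    {"player": "A", "avg": 0.310, "exit_velo": 104.0, "sprint_speed": None},
    {"player": "B", "avg": 0.287, "exit_velo": None, "sprint_speed": None},
    {"player": "C", "avg": None, "exit_velo": 107.5, "sprint_speed": 28.1},
]


def completeness(rows, fields):
    """Share of rows with a non-missing value, per field."""
    return {
        field: sum(row.get(field) is not None for row in rows) / len(rows)
        for field in fields
    }


coverage = completeness(records, ["avg", "exit_velo", "sprint_speed"])
for field, share in coverage.items():
    print(f"{field}: {share:.0%} of players have data")
```

Fields with low coverage are candidates for exclusion or for explicit missing-value handling rather than silent row deletion.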
Evaluation Frameworks
Teams rely heavily on:
- Showcase performances: International tournaments
- Workout metrics: Measurables (velocity, exit velo, sprint speed)
- Video analysis: Manual tracking of mechanics and approach
- Historical comps: Similar player paths
- Age verification: Critical for projections
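One simple way to combine workout measurables is to standardize each metric and average the z-scores. The sketch below illustrates that idea under stated assumptions: the prospects and their numbers are invented, both metrics are weighted equally, and the z-scores are computed only within this small group, so the result is a relative ranking and not an absolute grade.

```python
from statistics import mean, pstdev

# Hypothetical workout measurables for three prospects
prospects = {
    "Prospect A": {"exit_velo": 108.0, "sprint_speed": 28.5},
    "Prospect B": {"exit_velo": 104.0, "sprint_speed": 29.5},
    "Prospect C": {"exit_velo": 106.0, "sprint_speed": 27.0},
}


def z_scores(values):
    """Standardize values against the group mean and population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


names = list(prospects)
metrics = ["exit_velo", "sprint_speed"]

# One list of z-scores per metric, in the same order as `names`
per_metric = {m: z_scores([prospects[n][m] for n in names]) for m in metrics}

# Equal-weight composite: average of each player's z-scores
composite = {
    n: mean([per_metric[m][i] for m in metrics]) for i, n in enumerate(names)
}

ranking = sorted(composite, key=composite.get, reverse=True)
for name in ranking:
    print(f"{name}: {composite[name]:+.2f}")
```

Equal weighting is only a starting point; a team would normally weight metrics by how well each has predicted later performance.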
# R: Cuban/Latin American player projection framework
library(tidyverse)
library(ggplot2)
# Historical Cuban defector performance data
cuban_players <- data.frame(
player = c("Yasiel Puig", "Yoenis Cespedes", "Jorge Soler",
"Jose Abreu", "Aroldis Chapman", "Luis Robert",
"Randy Arozarena", "Yordan Alvarez"),
position = c("OF", "OF", "OF", "1B", "P", "OF", "OF", "DH"),
age_mlb_debut = c(22, 26, 21, 27, 22, 22, 25, 22),
signing_bonus_m = c(42, 36, 30, 68, 30.25, 26, 1.25, 2),
showcase_exit_velo = c(105, 108, 106, 109, NA, 107, 103, 110),
first_year_war = c(4.3, 4.8, 1.2, 6.2, 1.5, 0.8, 1.4, 4.0),
career_war_5yr = c(11.2, 13.5, 5.8, 17.3, 11.2, 5.5, 8.2, 12.5),
hit_tool = c(55, 55, 50, 60, NA, 60, 60, 70),
power_tool = c(65, 70, 65, 70, NA, 65, 60, 80),
speed_tool = c(60, 60, 40, 30, NA, 70, 60, 30)
)
# Remove pitchers for hitting analysis
cuban_hitters <- cuban_players %>% filter(position != "P")
# Analysis: Exit velocity vs MLB success
cor_exit_war <- cor(cuban_hitters$showcase_exit_velo,
cuban_hitters$first_year_war,
use = "complete.obs")
cat(sprintf("Correlation between exit velo and Year 1 WAR: %.3f\n", cor_exit_war))
# Tool grades vs performance
tool_model <- lm(career_war_5yr ~ hit_tool + power_tool + speed_tool + age_mlb_debut,
data = cuban_hitters)
cat("\nScout Tool Grades Predicting 5-Year WAR:\n")
print(summary(tool_model))
# Visualization
ggplot(cuban_hitters, aes(x = showcase_exit_velo, y = career_war_5yr)) +
geom_point(aes(size = signing_bonus_m, color = age_mlb_debut), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
geom_text(aes(label = player), hjust = -0.1, size = 3) +
scale_size_continuous(name = "Bonus ($M)") +
scale_color_gradient(low = "green", high = "red", name = "Debut Age") +
labs(title = "Cuban Defector Success: Exit Velocity vs Career Value",
subtitle = "5-Year WAR as success metric",
x = "Showcase Exit Velocity (mph)",
y = "Career WAR (First 5 Years)") +
theme_minimal()
# Dominican academy graduates (all six players in this sample signed out of the Dominican Republic)
latin_academy <- data.frame(
player = c("Juan Soto", "Vladimir Guerrero Jr", "Fernando Tatis Jr",
"Rafael Devers", "Wander Franco", "Julio Rodriguez"),
country = c("DOM", "DOM", "DOM", "DOM", "DOM", "DOM"),
age_mlb = c(19.5, 20, 20.5, 20.5, 20, 21),
signing_bonus_k = c(1500, 3900, 700, 1500, 3825, 1750),
war_age21_season = c(4.0, 2.3, 4.9, 3.3, 3.5, 4.2),
war_thru_age_24 = c(19.8, 12.5, 15.2, 13.8, 11.2, 8.5),
exit_velo_age20 = c(109, 112, 108, 107, 105, 108)
)
# Early performance indicators
early_success_model <- lm(war_thru_age_24 ~ age_mlb + log(signing_bonus_k) +
exit_velo_age20,
data = latin_academy)
cat("\n\nLatin American Academy Success Model:\n")
print(summary(early_success_model))
# Age at debut analysis
age_performance <- cuban_hitters %>%
mutate(age_group = ifelse(age_mlb_debut <= 23, "Young (<24)", "Older (24+)")) %>%
group_by(age_group) %>%
summarise(
n = n(),
avg_first_yr_war = mean(first_year_war, na.rm = TRUE),
avg_5yr_war = mean(career_war_5yr, na.rm = TRUE),
avg_bonus = mean(signing_bonus_m, na.rm = TRUE)
)
cat("\n\nPerformance by Age at Debut:\n")
print(age_performance)
# Create projection function for Cuban/Latin players
project_cuban_latin <- function(age, exit_velo, hit_grade, power_grade,
speed_grade, competition_level = "showcase") {
# Base WAR from tools (scout grades on 20-80 scale)
tool_war <- (hit_grade - 50) * 0.15 +
(power_grade - 50) * 0.12 +
(speed_grade - 50) * 0.08
# Age adjustment (younger = higher ceiling)
age_adj <- max(0.7, 1.3 - (age - 20) * 0.05)
# Exit velocity component
velo_war <- (exit_velo - 100) * 0.3
# Competition adjustment
comp_factor <- case_when(
competition_level == "MLB" ~ 1.0,
competition_level == "showcase" ~ 0.85,
competition_level == "cuban_series" ~ 0.80,
TRUE ~ 0.75
)
# First year projection
year1_war <- (tool_war + velo_war) * age_adj * comp_factor
# 5-year projection (with development curve)
year5_war <- year1_war * 3.2 # Average multiplier from data
# Confidence based on data availability
confidence <- case_when(
competition_level == "MLB" ~ "High",
competition_level == "showcase" & !is.na(exit_velo) ~ "Medium",
TRUE ~ "Low"
)
return(data.frame(
projected_year1_WAR = round(year1_war, 1),
projected_5yr_WAR = round(year5_war, 1),
age_factor = round(age_adj, 2),
confidence = confidence
))
}
# Example projections
cat("\n\nExample Projection 1: 20-year-old Cuban OF\n")
cat("Exit Velo: 108 mph, Hit: 60, Power: 70, Speed: 65\n")
print(project_cuban_latin(20, 108, 60, 70, 65, "showcase"))
cat("\n\nExample Projection 2: 26-year-old established Cuban star\n")
cat("Exit Velo: 110 mph, Hit: 65, Power: 75, Speed: 50\n")
print(project_cuban_latin(26, 110, 65, 75, 50, "cuban_series"))
# Python: Advanced Latin American player tracking and projection
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
# Latin American dataset (Bobby Witt Jr., a US draftee, is included as a comparison point)
latin_players = pd.DataFrame({
'player': ['Juan Soto', 'Vlad Jr', 'Tatis Jr', 'Acuna Jr', 'Devers',
'Wander Franco', 'Julio Rodriguez', 'Bobby Witt Jr'],
'country': ['DOM', 'DOM', 'DOM', 'VEN', 'DOM', 'DOM', 'DOM', 'USA'],
'signing_age': [16, 16, 16, 16, 16, 16, 16, 18],
'signing_bonus': [1.5, 3.9, 0.7, 4.25, 1.5, 3.825, 1.75, 7.5], # millions
'mlb_debut_age': [19.5, 20, 20.5, 20.5, 20.5, 20, 21, 21.5],
'hit_grade': [70, 60, 60, 65, 60, 70, 65, 60],
'power_grade': [70, 70, 70, 70, 65, 55, 65, 70],
'speed_grade': [50, 30, 70, 70, 40, 60, 70, 70],
'arm_grade': [60, 50, 60, 70, 50, 55, 70, 60],
'field_grade': [55, 40, 60, 70, 50, 60, 70, 70],
'war_thru_age_23': [15.5, 9.2, 11.8, 13.5, 10.2, 8.5, 7.5, 4.2],
'avg_exit_velo': [109.5, 112.1, 108.3, 109.8, 107.2, 105.1, 107.8, 108.9]
})
# Feature engineering
latin_players['total_tools'] = (latin_players['hit_grade'] +
latin_players['power_grade'] +
latin_players['speed_grade'] +
latin_players['arm_grade'] +
latin_players['field_grade'])
latin_players['years_to_mlb'] = (latin_players['mlb_debut_age'] -
latin_players['signing_age'])
latin_players['bonus_per_year'] = (latin_players['signing_bonus'] /
latin_players['years_to_mlb'])
# Machine learning model for WAR prediction
features = ['hit_grade', 'power_grade', 'speed_grade', 'mlb_debut_age',
'avg_exit_velo', 'signing_bonus']
X = latin_players[features].values
y = latin_players['war_thru_age_23'].values
# Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42,
max_depth=4, min_samples_split=2)
rf_model.fit(X, y)
# Feature importance
feature_importance = pd.DataFrame({
'feature': features,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importance for WAR Prediction:")
print(feature_importance)
# Cross-validation (limited by small sample)
cv_scores = cross_val_score(rf_model, X, y, cv=3,
scoring='neg_mean_squared_error')
print(f"\nCross-Validation RMSE: {np.sqrt(-cv_scores.mean()):.2f} WAR")
# Visualization: Tool grades vs performance
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Plot 1: Hit tool vs WAR
ax1 = axes[0, 0]
ax1.scatter(latin_players['hit_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='blue')
ax1.set_xlabel('Hit Tool Grade (20-80)')
ax1.set_ylabel('WAR Through Age 23')
ax1.set_title('Hit Tool vs Early Career Success')
ax1.grid(alpha=0.3)
# Plot 2: Power tool vs WAR
ax2 = axes[0, 1]
ax2.scatter(latin_players['power_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='red')
ax2.set_xlabel('Power Tool Grade (20-80)')
ax2.set_ylabel('WAR Through Age 23')
ax2.set_title('Power Tool vs Early Career Success')
ax2.grid(alpha=0.3)
# Plot 3: Speed tool vs WAR
ax3 = axes[1, 0]
ax3.scatter(latin_players['speed_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='green')
ax3.set_xlabel('Speed Tool Grade (20-80)')
ax3.set_ylabel('WAR Through Age 23')
ax3.set_title('Speed Tool vs Early Career Success')
ax3.grid(alpha=0.3)
# Plot 4: Age at debut vs WAR
ax4 = axes[1, 1]
scatter = ax4.scatter(latin_players['mlb_debut_age'],
latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20,
c=latin_players['total_tools'],
cmap='viridis', alpha=0.7)
ax4.set_xlabel('Age at MLB Debut')
ax4.set_ylabel('WAR Through Age 23')
ax4.set_title('Debut Age vs Success (color = total tools)')
ax4.grid(alpha=0.3)
plt.colorbar(scatter, ax=ax4, label='Total Tool Grade')
plt.tight_layout()
plt.savefig('latin_american_tool_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Advanced projection class
class LatinAmericanProjector:
    def __init__(self, model=None):
        # Compare against None explicitly: ensemble estimators define __len__,
        # so a plain truthiness test is unreliable
        self.model = model if model is not None else rf_model
        self.feature_names = features

    def project_war(self, hit, power, speed, debut_age, exit_velo, bonus):
        """Project WAR through age 23"""
        input_data = np.array([[hit, power, speed, debut_age, exit_velo, bonus]])
        prediction = self.model.predict(input_data)[0]
        # Approximate interval from the spread of individual tree predictions
        tree_predictions = [tree.predict(input_data)[0]
                            for tree in self.model.estimators_]
        std_dev = np.std(tree_predictions)
        return {
            'projected_WAR': round(prediction, 1),
            'lower_bound': round(prediction - 1.96 * std_dev, 1),
            'upper_bound': round(prediction + 1.96 * std_dev, 1),
            'std_dev': round(std_dev, 2)
        }

    def compare_prospects(self, prospects_df):
        """Compare multiple prospects"""
        results = []
        for idx, prospect in prospects_df.iterrows():
            proj = self.project_war(
                prospect['hit_grade'],
                prospect['power_grade'],
                prospect['speed_grade'],
                prospect['mlb_debut_age'],
                prospect['avg_exit_velo'],
                prospect['signing_bonus']
            )
            results.append({
                'player': prospect.get('player', f'Prospect_{idx}'),
                **proj
            })
        return pd.DataFrame(results).sort_values('projected_WAR', ascending=False)
# Example usage
projector = LatinAmericanProjector()
print("\n" + "="*70)
print("Example Prospect Projection:")
print("="*70)
print("Hit: 65, Power: 70, Speed: 60")
print("Debut Age: 20.5, Exit Velo: 108, Bonus: $3.5M")
print("-"*70)
projection = projector.project_war(65, 70, 60, 20.5, 108, 3.5)
for key, value in projection.items():
    print(f"{key}: {value}")
# Multiple prospect comparison
prospects = pd.DataFrame({
'player': ['Prospect A', 'Prospect B', 'Prospect C'],
'hit_grade': [70, 60, 65],
'power_grade': [65, 75, 70],
'speed_grade': [60, 50, 70],
'mlb_debut_age': [20, 21, 19.5],
'avg_exit_velo': [108, 111, 107],
'signing_bonus': [4.0, 2.5, 5.0]
})
print("\n" + "="*70)
print("Prospect Comparison:")
print("="*70)
comparison = projector.compare_prospects(prospects)
print(comparison.to_string(index=False))
World Baseball Classic as Evaluation Tool
The World Baseball Classic provides valuable data for international player evaluation, featuring top competition in high-pressure situations.
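Because a WBC run covers only a handful of games, tournament rates are usually best blended with a prior expectation before they are used in an evaluation. The sketch below shows one common approach, a weighted average that shrinks the observed rate toward a prior. The function name `shrink_rate`, the 25 PA sample, the .330 prior, and the 300 PA prior weight are all illustrative assumptions, not values from any published system.

```python
def shrink_rate(observed: float, n: int, prior: float, prior_weight: float) -> float:
    """Blend an observed rate with a prior, weighting by sample size.

    prior_weight acts like a number of 'phantom' plate appearances at the
    prior rate; larger values mean the small sample moves the estimate less.
    """
    return (observed * n + prior * prior_weight) / (n + prior_weight)


# Hypothetical hitter: .480 OBP over 25 WBC plate appearances
wbc_obp = 0.480
wbc_pa = 25
prior_obp = 0.330      # assumed expectation before the tournament
prior_weight = 300     # assumed stabilization weight, in PA

estimate = shrink_rate(wbc_obp, wbc_pa, prior_obp, prior_weight)
print(f"Raw WBC OBP: {wbc_obp:.3f}, shrunk estimate: {estimate:.3f}")
```

With these assumptions the hot tournament barely moves the estimate, which reflects how little 25 plate appearances say about true talent.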
# R: Cuban/Latin American player projection framework
library(tidyverse)
library(ggplot2)
# Historical Cuban defector performance data
cuban_players <- data.frame(
player = c("Yasiel Puig", "Yoenis Cespedes", "Jorge Soler",
"Jose Abreu", "Aroldis Chapman", "Luis Robert",
"Randy Arozarena", "Yordan Alvarez"),
position = c("OF", "OF", "OF", "1B", "P", "OF", "OF", "DH"),
age_mlb_debut = c(22, 26, 21, 27, 22, 22, 25, 22),
signing_bonus_m = c(42, 36, 30, 68, 30.25, 26, 1.25, 2),
showcase_exit_velo = c(105, 108, 106, 109, NA, 107, 103, 110),
first_year_war = c(4.3, 4.8, 1.2, 6.2, 1.5, 0.8, 1.4, 4.0),
career_war_5yr = c(11.2, 13.5, 5.8, 17.3, 11.2, 5.5, 8.2, 12.5),
hit_tool = c(55, 55, 50, 60, NA, 60, 60, 70),
power_tool = c(65, 70, 65, 70, NA, 65, 60, 80),
speed_tool = c(60, 60, 40, 30, NA, 70, 60, 30)
)
# Remove pitchers for hitting analysis
cuban_hitters <- cuban_players %>% filter(position != "P")
# Analysis: Exit velocity vs MLB success
cor_exit_war <- cor(cuban_hitters$showcase_exit_velo,
cuban_hitters$first_year_war,
use = "complete.obs")
cat(sprintf("Correlation between exit velo and Year 1 WAR: %.3f\n", cor_exit_war))
# Tool grades vs performance
tool_model <- lm(career_war_5yr ~ hit_tool + power_tool + speed_tool + age_mlb_debut,
data = cuban_hitters)
cat("\nScout Tool Grades Predicting 5-Year WAR:\n")
print(summary(tool_model))
# Visualization
ggplot(cuban_hitters, aes(x = showcase_exit_velo, y = career_war_5yr)) +
geom_point(aes(size = signing_bonus_m, color = age_mlb_debut), alpha = 0.7) +
geom_smooth(method = "lm", se = TRUE, color = "blue", alpha = 0.2) +
geom_text(aes(label = player), hjust = -0.1, size = 3) +
scale_size_continuous(name = "Bonus ($M)") +
scale_color_gradient(low = "green", high = "red", name = "Debut Age") +
labs(title = "Cuban Defector Success: Exit Velocity vs Career Value",
subtitle = "5-Year WAR as success metric",
x = "Showcase Exit Velocity (mph)",
y = "Career WAR (First 5 Years)") +
theme_minimal()
# Dominican/Venezuelan academy graduates
latin_academy <- data.frame(
player = c("Juan Soto", "Vladimir Guerrero Jr", "Fernando Tatis Jr",
"Rafael Devers", "Wander Franco", "Julio Rodriguez"),
country = c("DOM", "DOM", "DOM", "DOM", "DOM", "DOM"),
age_mlb = c(19.5, 20, 20.5, 20.5, 20, 21),
signing_bonus_k = c(1500, 3900, 700, 1500, 3825, 1750),
war_age21_season = c(4.0, 2.3, 4.9, 3.3, 3.5, 4.2),
war_thru_age_24 = c(19.8, 12.5, 15.2, 13.8, 11.2, 8.5),
exit_velo_age20 = c(109, 112, 108, 107, 105, 108)
)
# Early performance indicators
early_success_model <- lm(war_thru_age_24 ~ age_mlb + log(signing_bonus_k) +
exit_velo_age20,
data = latin_academy)
cat("\n\nLatin American Academy Success Model:\n")
print(summary(early_success_model))
# Age at debut analysis
age_performance <- cuban_hitters %>%
mutate(age_group = ifelse(age_mlb_debut <= 23, "Young (<24)", "Older (24+)")) %>%
group_by(age_group) %>%
summarise(
n = n(),
avg_first_yr_war = mean(first_year_war, na.rm = TRUE),
avg_5yr_war = mean(career_war_5yr, na.rm = TRUE),
avg_bonus = mean(signing_bonus_m, na.rm = TRUE)
)
print("\n\nPerformance by Age at Debut:")
print(age_performance)
# Create projection function for Cuban/Latin players
project_cuban_latin <- function(age, exit_velo, hit_grade, power_grade,
speed_grade, competition_level = "showcase") {
# Base WAR from tools (scout grades on 20-80 scale)
tool_war <- (hit_grade - 50) * 0.15 +
(power_grade - 50) * 0.12 +
(speed_grade - 50) * 0.08
# Age adjustment (younger = higher ceiling)
age_adj <- max(0.7, 1.3 - (age - 20) * 0.05)
# Exit velocity component
velo_war <- (exit_velo - 100) * 0.3
# Competition adjustment
comp_factor <- case_when(
competition_level == "MLB" ~ 1.0,
competition_level == "showcase" ~ 0.85,
competition_level == "cuban_series" ~ 0.80,
TRUE ~ 0.75
)
# First year projection
year1_war <- (tool_war + velo_war) * age_adj * comp_factor
# 5-year projection (with development curve)
year5_war <- year1_war * 3.2 # Average multiplier from data
# Confidence based on data availability
confidence <- case_when(
competition_level == "MLB" ~ "High",
competition_level == "showcase" & !is.na(exit_velo) ~ "Medium",
TRUE ~ "Low"
)
return(data.frame(
projected_year1_WAR = round(year1_war, 1),
projected_5yr_WAR = round(year5_war, 1),
age_factor = round(age_adj, 2),
confidence = confidence
))
}
# Example projections
cat("\n\nExample Projection 1: 20-year-old Cuban OF\n")
cat("Exit Velo: 108 mph, Hit: 60, Power: 70, Speed: 65\n")
print(project_cuban_latin(20, 108, 60, 70, 65, "showcase"))
cat("\n\nExample Projection 2: 26-year-old established Cuban star\n")
cat("Exit Velo: 110 mph, Hit: 65, Power: 75, Speed: 50\n")
print(project_cuban_latin(26, 110, 65, 75, 50, "cuban_series"))
# Python: Advanced Latin American player tracking and projection
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import seaborn as sns
# Small illustrative dataset (eight players; mostly Latin American signees)
latin_players = pd.DataFrame({
'player': ['Juan Soto', 'Vlad Jr', 'Tatis Jr', 'Acuna Jr', 'Devers',
'Wander Franco', 'Julio Rodriguez', 'Bobby Witt Jr'],
'country': ['DOM', 'DOM', 'DOM', 'VEN', 'DOM', 'DOM', 'DOM', 'USA'],
'signing_age': [16, 16, 16, 16, 16, 16, 16, 18],
'signing_bonus': [1.5, 3.9, 0.7, 4.25, 1.5, 3.825, 1.75, 7.5], # millions
'mlb_debut_age': [19.5, 20, 20.5, 20.5, 20.5, 20, 21, 21.5],
'hit_grade': [70, 60, 60, 65, 60, 70, 65, 60],
'power_grade': [70, 70, 70, 70, 65, 55, 65, 70],
'speed_grade': [50, 30, 70, 70, 40, 60, 70, 70],
'arm_grade': [60, 50, 60, 70, 50, 55, 70, 60],
'field_grade': [55, 40, 60, 70, 50, 60, 70, 70],
'war_thru_age_23': [15.5, 9.2, 11.8, 13.5, 10.2, 8.5, 7.5, 4.2],
'avg_exit_velo': [109.5, 112.1, 108.3, 109.8, 107.2, 105.1, 107.8, 108.9]
})
# Feature engineering
latin_players['total_tools'] = (latin_players['hit_grade'] +
latin_players['power_grade'] +
latin_players['speed_grade'] +
latin_players['arm_grade'] +
latin_players['field_grade'])
latin_players['years_to_mlb'] = (latin_players['mlb_debut_age'] -
latin_players['signing_age'])
latin_players['bonus_per_year'] = (latin_players['signing_bonus'] /
latin_players['years_to_mlb'])
# Machine learning model for WAR prediction
features = ['hit_grade', 'power_grade', 'speed_grade', 'mlb_debut_age',
'avg_exit_velo', 'signing_bonus']
X = latin_players[features].values
y = latin_players['war_thru_age_23'].values
# Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42,
max_depth=4, min_samples_split=2)
rf_model.fit(X, y)
# Feature importance
feature_importance = pd.DataFrame({
'feature': features,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print("Feature Importance for WAR Prediction:")
print(feature_importance)
# Cross-validation (limited by small sample)
cv_scores = cross_val_score(rf_model, X, y, cv=3,
scoring='neg_mean_squared_error')
print(f"\nCross-Validation RMSE: {np.sqrt(-cv_scores.mean()):.2f} WAR")
# Visualization: Tool grades vs performance
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Plot 1: Hit tool vs WAR
ax1 = axes[0, 0]
ax1.scatter(latin_players['hit_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='blue')
ax1.set_xlabel('Hit Tool Grade (20-80)')
ax1.set_ylabel('WAR Through Age 23')
ax1.set_title('Hit Tool vs Early Career Success')
ax1.grid(alpha=0.3)
# Plot 2: Power tool vs WAR
ax2 = axes[0, 1]
ax2.scatter(latin_players['power_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='red')
ax2.set_xlabel('Power Tool Grade (20-80)')
ax2.set_ylabel('WAR Through Age 23')
ax2.set_title('Power Tool vs Early Career Success')
ax2.grid(alpha=0.3)
# Plot 3: Speed tool vs WAR
ax3 = axes[1, 0]
ax3.scatter(latin_players['speed_grade'], latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20, alpha=0.6, c='green')
ax3.set_xlabel('Speed Tool Grade (20-80)')
ax3.set_ylabel('WAR Through Age 23')
ax3.set_title('Speed Tool vs Early Career Success')
ax3.grid(alpha=0.3)
# Plot 4: Age at debut vs WAR
ax4 = axes[1, 1]
scatter = ax4.scatter(latin_players['mlb_debut_age'],
latin_players['war_thru_age_23'],
s=latin_players['signing_bonus']*20,
c=latin_players['total_tools'],
cmap='viridis', alpha=0.7)
ax4.set_xlabel('Age at MLB Debut')
ax4.set_ylabel('WAR Through Age 23')
ax4.set_title('Debut Age vs Success (color = total tools)')
ax4.grid(alpha=0.3)
plt.colorbar(scatter, ax=ax4, label='Total Tool Grade')
plt.tight_layout()
plt.savefig('latin_american_tool_analysis.png', dpi=300, bbox_inches='tight')
plt.show()
# Advanced projection class
class LatinAmericanProjector:
    def __init__(self, model=None):
        # Explicit None check avoids relying on the truthiness of the estimator object
        self.model = model if model is not None else rf_model
        self.feature_names = features

    def project_war(self, hit, power, speed, debut_age, exit_velo, bonus):
        """Project WAR through age 23"""
        input_data = np.array([[hit, power, speed, debut_age, exit_velo, bonus]])
        prediction = self.model.predict(input_data)[0]
        # Rough uncertainty band from the spread of per-tree predictions
        tree_predictions = [tree.predict(input_data)[0]
                            for tree in self.model.estimators_]
        std_dev = np.std(tree_predictions)
        return {
            'projected_WAR': round(prediction, 1),
            'lower_bound': round(prediction - 1.96 * std_dev, 1),
            'upper_bound': round(prediction + 1.96 * std_dev, 1),
            'std_dev': round(std_dev, 2)
        }

    def compare_prospects(self, prospects_df):
        """Compare multiple prospects"""
        results = []
        for idx, prospect in prospects_df.iterrows():
            proj = self.project_war(
                prospect['hit_grade'],
                prospect['power_grade'],
                prospect['speed_grade'],
                prospect['mlb_debut_age'],
                prospect['avg_exit_velo'],
                prospect['signing_bonus']
            )
            results.append({
                'player': prospect.get('player', f'Prospect_{idx}'),
                **proj
            })
        return pd.DataFrame(results).sort_values('projected_WAR', ascending=False)
# Example usage
projector = LatinAmericanProjector()
print("\n" + "="*70)
print("Example Prospect Projection:")
print("="*70)
print("Hit: 65, Power: 70, Speed: 60")
print("Debut Age: 20.5, Exit Velo: 108, Bonus: $3.5M")
print("-"*70)
projection = projector.project_war(65, 70, 60, 20.5, 108, 3.5)
for key, value in projection.items():
    print(f"{key}: {value}")
# Multiple prospect comparison
prospects = pd.DataFrame({
'player': ['Prospect A', 'Prospect B', 'Prospect C'],
'hit_grade': [70, 60, 65],
'power_grade': [65, 75, 70],
'speed_grade': [60, 50, 70],
'mlb_debut_age': [20, 21, 19.5],
'avg_exit_velo': [108, 111, 107],
'signing_bonus': [4.0, 2.5, 5.0]
})
print("\n" + "="*70)
print("Prospect Comparison:")
print("="*70)
comparison = projector.compare_prospects(prospects)
print(comparison.to_string(index=False))
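The `project_war` method above builds its band as the mean plus or minus 1.96 standard deviations of the per-tree predictions. That spread measures how much the trees disagree, not a calibrated prediction interval, and with a skewed set of tree outputs a symmetric band can be misleading. A minimal sketch of one alternative, an empirical percentile band, is shown here. The helper names `quantile` and `percentile_band` and the example predictions are hypothetical and not part of the chapter's model.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lower_idx = int(pos)
    upper_idx = min(lower_idx + 1, len(sorted_vals) - 1)
    frac = pos - lower_idx
    return sorted_vals[lower_idx] + (sorted_vals[upper_idx] - sorted_vals[lower_idx]) * frac


def percentile_band(tree_predictions, coverage=0.95):
    """Empirical band covering `coverage` of the per-tree predictions.

    Hypothetical helper: it summarizes disagreement between trees and
    should not be read as a calibrated prediction interval.
    """
    if len(tree_predictions) < 2:
        raise ValueError("need at least two per-tree predictions")
    if not 0 < coverage < 1:
        raise ValueError("coverage must be between 0 and 1")
    ordered = sorted(tree_predictions)
    tail = (1.0 - coverage) / 2.0
    return {
        'median': quantile(ordered, 0.5),
        'lower_bound': quantile(ordered, tail),
        'upper_bound': quantile(ordered, 1.0 - tail),
    }


# Hypothetical per-tree WAR predictions for one prospect
example_preds = [4.0, 6.0, 8.0, 10.0, 12.0]
band = percentile_band(example_preds, coverage=0.5)
print(band)
```

In practice the list passed in would be the `tree_predictions` values that `project_war` already collects from `self.model.estimators_`.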
The World Baseball Classic (WBC) offers a unique analytical opportunity: top international players competing at maximum intensity in a short tournament format.
WBC Performance Analytics
# R: WBC performance analysis
library(tidyverse)
# 2023 WBC key performers (sample data)
wbc_2023 <- data.frame(
player = c("Shohei Ohtani", "Trea Turner", "Mike Trout",
"Masataka Yoshida", "Mookie Betts", "Lars Nootbaar",
"Randy Arozarena", "Julio Rodriguez", "Paul Goldschmidt"),
country = c("JPN", "USA", "USA", "JPN", "USA", "JPN", "MEX", "DOM", "USA"),
pa = c(28, 37, 33, 31, 35, 28, 42, 25, 33),
avg = c(.435, .389, .273, .429, .263, .345, .419, .348, .200),
obp = c(.606, .500, .433, .548, .371, .448, .500, .440, .273),
slg = c(.739, .722, .636, .714, .421, .690, .744, .652, .343),
hr = c(1, 2, 1, 1, 1, 2, 4, 1, 1),
sb = c(1, 3, 0, 1, 0, 1, 3, 2, 0),
wrc_plus = c(280, 265, 188, 295, 132, 250, 310, 245, 85),
mlb_2023_wrc_plus = c(184, 132, 126, 126, 147, 95, 126, 136, 131)
)
# Compare WBC to MLB performance
wbc_2023 <- wbc_2023 %>%
mutate(
wbc_vs_mlb = wrc_plus - mlb_2023_wrc_plus,
performance_tier = case_when(
wbc_vs_mlb > 100 ~ "Massive Outperformance",
wbc_vs_mlb > 50 ~ "Strong Outperformance",
wbc_vs_mlb > 0 ~ "Slight Outperformance",
wbc_vs_mlb > -50 ~ "Underperformance",
TRUE ~ "Major Underperformance"
)
)
# Analysis
cat("WBC vs MLB Regular Season Performance:\n")
print(wbc_2023 %>%
select(player, country, wrc_plus, mlb_2023_wrc_plus, wbc_vs_mlb, performance_tier) %>%
arrange(desc(wbc_vs_mlb)))
# Visualization
ggplot(wbc_2023, aes(x = mlb_2023_wrc_plus, y = wrc_plus)) +
geom_point(aes(color = country, size = pa), alpha = 0.7) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
geom_smooth(method = "lm", se = TRUE, alpha = 0.2) +
geom_text(aes(label = player), hjust = -0.1, size = 2.5) +
labs(title = "WBC Performance vs 2023 MLB Season",
subtitle = "wRC+ comparison (100 = league average)",
x = "2023 MLB Regular Season wRC+",
y = "2023 WBC wRC+",
color = "Country",
size = "WBC PA") +
theme_minimal() +
xlim(75, 200) +
ylim(75, 325)
# Small sample size considerations
cat("\n\nSmall Sample Considerations:\n")
cat(sprintf("Average WBC PA: %.1f\n", mean(wbc_2023$pa)))
cat("Commonly cited stabilization points: ~60 PA for K%, ~120 PA for BB%, ~160 AB for ISO, ~910 AB for AVG\n")
cat(sprintf("WBC provides ~5%% of full season sample size\n"))
The WBC demonstrates the challenge of small-sample analysis while providing insights into player performance under pressure and international competition.
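One common way to handle a sample this small is to regress the observed rate toward a league-average prior, weighting the prior as if it were a fixed number of extra plate appearances. The sketch below illustrates the idea with the sample OBP and PA listed for Shohei Ohtani in the WBC table and the MLB average OBP of .320 from the league comparison data. The function name `shrink_rate` and the prior weight of 300 PA are assumptions chosen for illustration, not estimated values.

```python
def shrink_rate(observed_rate, pa, prior_rate, prior_pa):
    """Regress an observed rate toward a prior.

    The prior is treated as `prior_pa` plate appearances of league-average
    performance, so more observed PA means less shrinkage.
    """
    if pa < 0 or prior_pa <= 0:
        raise ValueError("pa must be >= 0 and prior_pa must be > 0")
    return (observed_rate * pa + prior_rate * prior_pa) / (pa + prior_pa)


LEAGUE_OBP = 0.320   # MLB average OBP from the league comparison data
PRIOR_PA = 300       # hypothetical prior weight, for illustration only

wbc_obp, wbc_pa = 0.606, 28   # sample WBC line from the table above
estimate = shrink_rate(wbc_obp, wbc_pa, LEAGUE_OBP, PRIOR_PA)
print(f"Observed WBC OBP: {wbc_obp:.3f} in {wbc_pa} PA")
print(f"Shrunk estimate:  {estimate:.3f}")
```

With only 28 PA the estimate stays close to the prior, which is the intended behavior: a hot tournament moves the projection only slightly. A player-specific prior (for example, a regular-season projection) would usually be more informative than the league average used here.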
International player evaluation faces unique data challenges requiring specialized analytical approaches.
Key Challenges
- Limited Statcast data: Most international leagues lack ball/player tracking
- Inconsistent statistical standards: Different counting methods
- Video access restrictions: Limited broadcast availability
- Age verification: Critical in Latin America
- Cultural/language barriers: Scouting communication issues
- Political factors: Cuban defection complications
Alternative Data Sources
# Python: Building composite international player evaluation
import pandas as pd
import numpy as np
class InternationalPlayerEvaluator:
    """
    Comprehensive evaluation system for international players
    combining traditional stats, physical metrics, and projections
    """

    def __init__(self):
        self.weights = {
            'stats': 0.35,
            'tools': 0.30,
            'physical': 0.20,
            'track_record': 0.15
        }

    def evaluate_stats(self, league, avg, obp, slg, age):
        """Evaluate statistical performance with league adjustments"""
        # avg is accepted for completeness; the score itself is OPS-based
        league_factors = {
            'MLB': 1.00,
            'NPB': 0.85,
            'KBO': 0.78,
            'CPBL': 0.70,
            'Cuban': 0.75,
            'Mexican': 0.68
        }
        factor = league_factors.get(league, 0.65)
        # Calculate adjusted OPS+ (100 = MLB average)
        ops = obp + slg
        league_avg_ops = 0.720  # MLB baseline
        adj_ops_plus = (ops / league_avg_ops) * factor * 100
        # Age adjustment: 2% penalty per year away from age 26
        age_curve = 1.0 - abs(age - 26) * 0.02
        return adj_ops_plus * age_curve

    def evaluate_tools(self, hit, power, speed, field, arm):
        """Evaluate 5-tool grades (20-80 scale)"""
        # Weights for different positions (can be customized)
        tool_weights = {
            'hit': 0.30,
            'power': 0.25,
            'speed': 0.15,
            'field': 0.15,
            'arm': 0.15
        }
        weighted_grade = (
            hit * tool_weights['hit'] +
            power * tool_weights['power'] +
            speed * tool_weights['speed'] +
            field * tool_weights['field'] +
            arm * tool_weights['arm']
        )
        return weighted_grade

    def evaluate_physical(self, height_in, weight_lbs, exit_velo,
                          sprint_speed, arm_velo=None):
        """Evaluate physical tools and measurables"""
        # arm_velo is accepted but not yet used in the score
        score = 50  # Start at average
        # Exit velocity component (major factor)
        if exit_velo is not None:
            velo_score = (exit_velo - 87) * 2  # 87 mph = average
            score += velo_score * 0.4
        # Sprint speed component
        if sprint_speed is not None:
            # 27 ft/s = average MLB
            speed_score = (sprint_speed - 27) * 10
            score += speed_score * 0.3
        # Build/athleticism
        bmi = (weight_lbs / (height_in ** 2)) * 703
        if 22 <= bmi <= 27:  # Optimal athletic range
            score += 5
        return max(20, min(80, score))  # Clamp to 20-80 scale

    def evaluate_track_record(self, years_pro, level_reached,
                              consistency_score):
        """Evaluate professional track record"""
        level_scores = {
            'MLB': 80,
            'AAA': 70,
            'NPB': 70,
            'KBO': 65,
            'AA': 60,
            'CPBL': 60,
            'Cuban': 55,
            'A+': 50
        }
        level_score = level_scores.get(level_reached, 45)
        experience_bonus = min(10, years_pro * 2)
        return (level_score + experience_bonus) * (consistency_score / 100)

    def composite_score(self, stats_score, tools_score, physical_score,
                        track_score):
        """Calculate weighted composite score"""
        # Note: stats_score is on an OPS+-style scale (100 = MLB average),
        # while the other components are on roughly a 20-80 scale
        composite = (
            stats_score * self.weights['stats'] +
            tools_score * self.weights['tools'] +
            physical_score * self.weights['physical'] +
            track_score * self.weights['track_record']
        )
        return composite

    def risk_adjustment(self, composite_score, data_quality, age,
                        injury_history):
        """Adjust score for various risk factors"""
        risk_factor = 1.0
        # Data quality risk
        data_factors = {
            'high': 1.0,
            'medium': 0.92,
            'low': 0.85,
            'very_low': 0.75
        }
        risk_factor *= data_factors.get(data_quality, 0.85)
        # Age risk: decline risk past 28, projection risk under 22
        if age > 28:
            risk_factor *= 0.95
        elif age < 22:
            risk_factor *= 0.98  # Projection risk
        # Injury history: 5% penalty per recorded injury
        if injury_history > 0:
            risk_factor *= (1 - injury_history * 0.05)
        return composite_score * risk_factor

    def generate_report(self, player_data):
        """Generate comprehensive evaluation report"""
        # Calculate component scores
        stats = self.evaluate_stats(
            player_data['league'],
            player_data['avg'],
            player_data['obp'],
            player_data['slg'],
            player_data['age']
        )
        tools = self.evaluate_tools(
            player_data['hit_grade'],
            player_data['power_grade'],
            player_data['speed_grade'],
            player_data['field_grade'],
            player_data['arm_grade']
        )
        physical = self.evaluate_physical(
            player_data['height'],
            player_data['weight'],
            player_data.get('exit_velo'),
            player_data.get('sprint_speed')
        )
        track_record = self.evaluate_track_record(
            player_data['years_pro'],
            player_data['level'],
            player_data['consistency']
        )
        # Composite score
        composite = self.composite_score(stats, tools, physical, track_record)
        # Risk-adjusted final score
        final = self.risk_adjustment(
            composite,
            player_data['data_quality'],
            player_data['age'],
            player_data.get('injury_history', 0)
        )
        # Grade assignment
        grade = self.assign_grade(final)
        return {
            'player': player_data.get('name', 'Unknown'),
            'stats_score': round(stats, 1),
            'tools_score': round(tools, 1),
            'physical_score': round(physical, 1),
            'track_record_score': round(track_record, 1),
            'composite_score': round(composite, 1),
            'final_score': round(final, 1),
            'grade': grade,
            'recommendation': self.generate_recommendation(final, player_data)
        }

    def assign_grade(self, score):
        """Convert numerical score to grade"""
        if score >= 70:
            return 'A+ (Elite)'
        elif score >= 65:
            return 'A (Plus Regular)'
        elif score >= 60:
            return 'B+ (Above Average)'
        elif score >= 55:
            return 'B (Average Starter)'
        elif score >= 50:
            return 'C+ (Platoon/Depth)'
        else:
            return 'C or below'

    def generate_recommendation(self, score, player_data):
        """Generate signing/acquisition recommendation"""
        if score >= 65:
            return "Strong pursue - potential impact player"
        elif score >= 60:
            return "Pursue - likely contributor"
        elif score >= 55:
            return "Monitor - depth/upside candidate"
        else:
            return "Pass unless at significant discount"
# Example evaluation
evaluator = InternationalPlayerEvaluator()
# Example player: NPB star
npb_star = {
'name': 'NPB Star Candidate',
'age': 26,
'league': 'NPB',
'avg': .315,
'obp': .385,
'slg': .545,
'hit_grade': 60,
'power_grade': 65,
'speed_grade': 55,
'field_grade': 60,
'arm_grade': 60,
'height': 72, # inches
'weight': 195, # lbs
'exit_velo': 107,
'sprint_speed': 28.2,
'years_pro': 6,
'level': 'NPB',
'consistency': 85, # 0-100 scale
'data_quality': 'high',
'injury_history': 0
}
print("="*70)
print("INTERNATIONAL PLAYER EVALUATION REPORT")
print("="*70)
report = evaluator.generate_report(npb_star)
for key, value in report.items():
    print(f"{key.replace('_', ' ').title()}: {value}")
# Compare multiple international candidates
candidates = [
{
'name': 'Cuban Prospect A',
'age': 22,
'league': 'Cuban',
'avg': .345,
'obp': .420,
'slg': .598,
'hit_grade': 65,
'power_grade': 70,
'speed_grade': 60,
'field_grade': 55,
'arm_grade': 60,
'height': 74,
'weight': 215,
'exit_velo': 110,
'sprint_speed': 28.5,
'years_pro': 3,
'level': 'Cuban',
'consistency': 75,
'data_quality': 'low',
'injury_history': 0
},
{
'name': 'KBO Veteran B',
'age': 29,
'league': 'KBO',
'avg': .298,
'obp': .365,
'slg': .485,
'hit_grade': 60,
'power_grade': 60,
'speed_grade': 50,
'field_grade': 65,
'arm_grade': 65,
'height': 70,
'weight': 185,
'exit_velo': 105,
'sprint_speed': 27.5,
'years_pro': 8,
'level': 'KBO',
'consistency': 90,
'data_quality': 'medium',
'injury_history': 1
}
]
print("\n" + "="*70)
print("CANDIDATE COMPARISON")
print("="*70)
comparison_results = []
for candidate in candidates:
    result = evaluator.generate_report(candidate)
    comparison_results.append(result)
comparison_df = pd.DataFrame(comparison_results)
print(comparison_df[['player', 'final_score', 'grade', 'recommendation']].to_string(index=False))
Data Quality Framework
Organizations should establish data quality tiers:
- Tier 1: Full Statcast, verified stats (MLB, some NPB)
- Tier 2: Comprehensive traditional stats, video (NPB, KBO)
- Tier 3: Basic stats, limited video (CPBL, Mexican League)
- Tier 4: Incomplete stats, showcase only (Cuban, some Latin leagues)
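These tiers can be tied to the `data_quality` labels that `risk_adjustment` already accepts. The sketch below shows one possible mapping. The discount values are copied from the `data_factors` dictionary in `risk_adjustment`, but the tier-to-label assignment and the helper name `tier_discount` are assumptions made for illustration.

```python
# Assumed one-to-one mapping from data quality tier to the labels used
# by risk_adjustment; the chapter does not define this mapping explicitly.
TIER_TO_DATA_QUALITY = {
    1: "high",      # Full Statcast, verified stats
    2: "medium",    # Comprehensive traditional stats, video
    3: "low",       # Basic stats, limited video
    4: "very_low",  # Incomplete stats, showcase only
}

# Discount factors as they appear in risk_adjustment's data_factors
DATA_FACTORS = {"high": 1.0, "medium": 0.92, "low": 0.85, "very_low": 0.75}


def tier_discount(tier):
    """Return the (label, discount factor) pair for a data quality tier."""
    label = TIER_TO_DATA_QUALITY.get(tier)
    if label is None:
        raise ValueError(f"unknown data quality tier: {tier!r}")
    return label, DATA_FACTORS[label]


for tier in sorted(TIER_TO_DATA_QUALITY):
    label, factor = tier_discount(tier)
    print(f"Tier {tier}: data_quality='{label}', discount factor={factor:.2f}")
```

The returned label can be placed directly in a player's `data_quality` field before calling `generate_report`, so that a showcase-only evaluation is discounted more heavily than one backed by full tracking data.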
Exercise 23.1: NPB Translation Model
Using the provided NPB-to-MLB translation data, build a regression model to project the first-year MLB performance of a hypothetical NPB player:
Player Profile:
- Age: 25
- Final NPB season: .305/.380/.520, 28 HR in 550 PA
- Position: Corner OF
- Exit velocity: 106 mph (NPB measurement)
Tasks:
- Apply the translation factors from Section 23.2
- Calculate projected MLB slash line and HR total
- Estimate first-year WAR using the projection
- Assess confidence level and identify key uncertainties
Bonus: Compare your projection to actual performance of similar NPB players (e.g., Seiya Suzuki, Masataka Yoshida).
Exercise 23.2: KBO Pitcher Projection
A 27-year-old KBO left-handed starter has the following final season:
- 2.65 ERA, 1.15 WHIP
- 9.8 K/9, 2.8 BB/9, 0.75 HR/9
- 175 IP, 15-6 record
Tasks:
- Using the KBO pitcher translation model from Section 23.3, project his MLB stats
- Calculate projected FIP and ERA
- Build a confidence interval for your ERA projection
- Compare to similar KBO pitchers (e.g., Hyun-Jin Ryu, Kwang-Hyun Kim)
- Recommend a contract structure based on projection and risk
Exercise 23.3: Latin American Tool-Based Valuation
You are evaluating three Dominican Republic prospects for international signing:
Prospect A:
- Age: 17
- Hit: 60, Power: 70, Speed: 55, Field: 55, Arm: 60
- Exit velocity: 108 mph
- Asking bonus: $3.5M
Prospect B:
- Age: 16
- Hit: 65, Power: 60, Speed: 70, Field: 65, Arm: 60
- Exit velocity: 104 mph
- Asking bonus: $4.0M
Prospect C:
- Age: 18
- Hit: 55, Power: 75, Speed: 45, Field: 50, Arm: 55
- Exit velocity: 112 mph
- Asking bonus: $2.5M
Tasks:
- Use the Latin American projection model from Section 23.4
- Project WAR through age 23 for each prospect
- Calculate value per dollar of bonus
- Rank the prospects considering both ceiling and floor outcomes
- Recommend which prospect(s) to sign and at what price
Advanced: Simulate 1,000 career paths for each prospect incorporating uncertainty and injury risk.
Exercise 23.4: International League Environment Analysis
Using the league comparison data from Section 23.1, conduct a comprehensive analysis:
Tasks:
- Calculate park-adjusted metrics for each league
- Estimate "true talent" translation factors using regression to the mean
- Build a Bayesian updating system that improves projections as players accumulate MLB PA
- Create visualizations comparing league offensive environments over time (2015-2023)
- Develop recommendations for adjusting scouting priorities based on league trends
Data Required:
- League-wide statistics (provided in section)
- Park factors (research or estimate)
- Historical translation success rates
Deliverables:
- R or Python code implementing your analysis
- Report summarizing findings
- Recommendations for international scouting departments
Summary
International baseball analytics requires sophisticated approaches to account for varying competition levels, data quality, and cultural contexts. Key takeaways:
- League translation factors are essential but imperfect tools requiring continuous refinement
- NPB and KBO provide the highest quality international data and most reliable translation models
- Cuban and Latin American evaluation relies heavily on physical tools and limited showcase data
- Age at transition significantly impacts success rates across all international sources
- Data quality tiers should inform confidence levels and risk assessment
- Small sample sizes in tournaments like WBC require careful statistical interpretation
As international signing and posting systems evolve, analytical approaches must adapt to incorporate new data sources, changing competitive environments, and improved measurement technologies. Organizations that excel at international player evaluation and projection gain significant competitive advantages in talent acquisition.
Further Reading:
- Baseball America's International Prospect Handbook
- FanGraphs International Free Agent analysis series
- MLB Pipeline scouting reports and tools grades
- Academic research on translation factors (Baseball Prospectus, The Hardball Times)
- Statcast comparative studies across international leagues
---