How to Build an Expected Goals (xG) Model in R with worldfootballR

Introduction

Expected Goals (xG) has become the go‑to metric for measuring the quality of chances in football. If you’re a data‑savvy fan or analyst, building your own xG model gives you deeper insight and the flexibility to customise for any league. In this guide we walk you through a complete end‑to‑end workflow in R, using the worldfootballR package to fetch data, clean it, train a logistic regression model, and visualise the results.

What You’ll Need

  • Basic knowledge of R and tidyverse.
  • RStudio (or any R IDE).
  • Install the latest worldfootballR package.
  • Internet connection for data download.

Step 1: Install and Load Packages

install.packages(c("worldfootballR", "tidyverse", "broom", "caret"))

library(worldfootballR)
library(tidyverse)
library(broom)
library(caret)

These packages provide data access, manipulation, modelling and evaluation tools.

Step 2: Pull Shot Data with worldfootballR

worldfootballR connects to FBref and extracts match event data. For this example we’ll use the English Premier League 2022/23 season.

# Get all match URLs for the season
matches <- fb_match_urls(
  country = "ENG",
  gender = "M",
  season_end_year = 2023,
  tier = "1st"
)

# Pull shot-level data (one request per match, so this can take a while)
shots_raw <- fb_match_shooting(matches)

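The rest of this guide assumes shots_raw is a tibble with one row per shot and columns named shot_outcome, shot_body_part, x, y and team_name. Column names depend on the data source and package version, so inspect the result with glimpse(shots_raw) and rename columns to match before continuing. If your source provides no pitch coordinates, the distance and angle features in Step 3 have to be built from whatever location fields it does offer.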
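Because column names can vary, it may help to validate the input before going further. The sketch below uses a small made-up data frame in place of the real download; check_shot_columns is a hypothetical helper, not part of worldfootballR, and the column list simply mirrors the names this guide assumes.

```r
# Hypothetical helper: stop early if any column the guide relies on is missing
check_shot_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Column names assumed by the later steps of this guide
required_cols <- c("shot_outcome", "shot_body_part", "x", "y", "team_name")

# Made-up stand-in for shots_raw, for illustration only
toy_shots <- data.frame(
  shot_outcome   = c("Goal", "Saved"),
  shot_body_part = c("Right Foot", "Head"),
  x              = c(10, 18),
  y              = c(40, 35),
  team_name      = c("A", "B")
)

check_shot_columns(toy_shots, required_cols)
```

Running the same check on the real shots_raw right after the download surfaces naming mismatches before they turn into confusing errors inside mutate().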

Step 3: Clean and Engineer Features

We need a binary target (goal) and a set of explanatory variables.

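shots_clean <- shots_raw %>%
  mutate(
    goal = if_else(shot_outcome == "Goal", 1, 0),
    # Rescale coordinates to 0-100 (assumes raw x runs 0-120 and raw y runs 0-80)
    x_norm = x * 100 / 120,
    y_norm = y * 100 / 80,
    # Convert to metres (assumes a 105 m x 68 m pitch) so distances are not distorted;
    # y_m is measured from the centre of the goal (x_norm = 0, y_norm = 50)
    x_m = x_norm * 105 / 100,
    y_m = (y_norm - 50) * 68 / 100,
    # Euclidean distance in metres from the centre of the goal
    distance = sqrt(x_m^2 + y_m^2),
    # Angle in degrees subtended by the 7.32 m goal mouth at the shot location
    angle = atan2(7.32 * x_m, x_m^2 + y_m^2 - (7.32 / 2)^2) * 180 / pi,
    # Dummy variables for body part (any other body part is the reference category)
    foot = if_else(shot_body_part %in% c("Right Foot", "Left Foot"), 1, 0),
    head = if_else(shot_body_part == "Head", 1, 0)
  ) %>%
  # Keep coordinates and team name for the shot map (Step 8) and team totals (Step 9)
  select(goal, distance, angle, foot, head, x_norm, y_norm, team_name, shot_type, match_date)

# Remove rows with missing values
shots_clean <- na.omit(shots_clean)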
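To get a feel for the distance and angle features, here is a minimal base-R sketch for shot locations already expressed in metres relative to the centre of the goal. The function name shot_geometry and the example locations are illustrative assumptions; the angle is the one subtended by a 7.32 m goal mouth, which is one common choice in xG models rather than the only one.

```r
# Illustrative helper: x_m = metres from the goal line,
# y_m = metres left or right of the centre of the goal
shot_geometry <- function(x_m, y_m, goal_width = 7.32) {
  distance <- sqrt(x_m^2 + y_m^2)
  # Angle between the two lines running from the shot location to each post
  angle_rad <- atan2(goal_width * x_m, x_m^2 + y_m^2 - (goal_width / 2)^2)
  data.frame(distance = distance, angle_deg = angle_rad * 180 / pi)
}

# Penalty spot (11 m out, central) versus a wide shot from the same depth
shot_geometry(x_m = c(11, 11), y_m = c(0, 15))
```

The central shot sees a wider angle than the wide one even though both are taken 11 m from the goal line, which is the extra information the angle feature adds on top of distance.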

Step 4: Split Data into Training and Test Sets

set.seed(123)
train_index <- createDataPartition(shots_clean$goal, p = 0.7, list = FALSE)
train_data <- shots_clean[train_index, ]
test_data  <- shots_clean[-train_index, ]
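Goals are rare, so it helps if the training and test sets contain a similar share of them. The base-R sketch below illustrates the idea of a stratified 70/30 split on simulated outcomes; the data are made up and caret is not used here, so treat it as an illustration of the principle rather than a description of caret's internals.

```r
set.seed(123)

# Simulated outcome with roughly 10% goals (made-up data)
goal <- rbinom(1000, size = 1, prob = 0.1)

# Sample 70% of the row indices within each outcome class
idx_by_class <- split(seq_along(goal), goal)
train_index <- unlist(lapply(idx_by_class, function(i) {
  sample(i, size = round(0.7 * length(i)))
}))

# Share of goals in each set should be close to identical
train_rate <- mean(goal[train_index])
test_rate  <- mean(goal[-train_index])
c(train = train_rate, test = test_rate)
```

Comparing the same two rates on train_data and test_data is a quick sanity check after the real split.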

Step 5: Fit a Logistic Regression Model

Logistic regression is the classic baseline for xG because the outcome (goal vs no goal) is binary.

model_xg <- glm(goal ~ distance + angle + foot + head,
                data = train_data,
                family = binomial(link = "logit"))

# Inspect coefficients
tidy(model_xg)

Interpretation

  • Negative coefficient for distance → farther shots have lower probability.
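  • Positive coefficient for angle → a wider view of the goal increases the chance of scoring.
  • The foot and head dummies capture differences in conversion by body part, with all other body parts as the reference category; headers typically convert less often than footed shots from the same location.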
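Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below fits a one-variable model to simulated shots; the data and the "true" relationship are made up purely for illustration.

```r
set.seed(42)
n <- 2000

# Simulated shots: distance in metres, goal probability falling with distance
distance <- runif(n, min = 5, max = 35)
p_goal   <- plogis(1 - 0.15 * distance)   # made-up "true" relationship
goal     <- rbinom(n, size = 1, prob = p_goal)

fit <- glm(goal ~ distance, family = binomial(link = "logit"))

# exp(coefficient) = multiplicative change in the odds of scoring per extra metre
odds_ratio <- exp(coef(fit))["distance"]
odds_ratio
```

An odds ratio below 1 for distance corresponds to the negative coefficient described above: each extra metre multiplies the odds of scoring by that factor. The same transformation can be applied to model_xg with exp(coef(model_xg)).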

Step 6: Generate xG Values for the Test Set

test_data <- test_data %>%
  mutate(
    xG = predict(model_xg, newdata = ., type = "response")
  )

The new xG column now holds the expected‑goal probability for every shot.
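It can help to see what predict() with type = "response" does under the hood: it applies the inverse logit to the linear predictor. The sketch below checks this on a small simulated model; the data and the single hypothetical shot at 12 m are assumptions for illustration.

```r
set.seed(1)

# Made-up training data with one predictor
toy <- data.frame(distance = runif(500, min = 5, max = 35))
toy$goal <- rbinom(500, size = 1, prob = plogis(1 - 0.15 * toy$distance))

fit <- glm(goal ~ distance, data = toy, family = binomial(link = "logit"))

# xG for one hypothetical shot, via predict()
new_shot   <- data.frame(distance = 12)
xg_predict <- predict(fit, newdata = new_shot, type = "response")

# The same value by hand: inverse logit of intercept + slope * distance
b         <- coef(fit)
xg_manual <- plogis(b["(Intercept)"] + b["distance"] * 12)

all.equal(unname(xg_predict), unname(xg_manual))
```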

Step 7: Model Evaluation

Use the Brier score and a decile calibration table to check how closely the predicted probabilities match observed goal rates.

# Brier score (lower is better)
brier <- mean((test_data$goal - test_data$xG)^2)

# Calibration: group shots by deciles of predicted xG
calibration <- test_data %>%
  mutate(decile = ntile(xG, 10)) %>%
  group_by(decile) %>%
  summarise(mean_pred = mean(xG),
            mean_obs  = mean(goal))

A well‑calibrated model will have mean_pred ≈ mean_obs across deciles.
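A raw Brier score is hard to judge on its own because most shots miss, so even a naive model scores low. One option is to compare it with a baseline that gives every shot the overall conversion rate. The sketch below does this on simulated, well-calibrated predictions; the data are made up and the variable names are illustrative.

```r
set.seed(7)
n <- 5000

# Made-up predictions, with outcomes drawn so the predictions are well calibrated
xg   <- rbeta(n, shape1 = 1, shape2 = 8)
goal <- rbinom(n, size = 1, prob = xg)

brier_model <- mean((goal - xg)^2)

# Baseline: give every shot the overall conversion rate
brier_baseline <- mean((goal - mean(goal))^2)

# Brier skill score: share of the baseline error removed by the model
brier_skill <- 1 - brier_model / brier_baseline
c(model = brier_model, baseline = brier_baseline, skill = brier_skill)
```

A skill score above zero means the model beats the constant baseline; applying the same comparison to test_data puts the Brier score from this step in context.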

Step 8: Visualising Shots and xG

library(ggplot2)

# Requires x_norm and y_norm in test_data: keep them in the select() call in Step 3
ggplot(test_data, aes(x = x_norm, y = y_norm)) +
  geom_point(aes(size = xG, colour = factor(goal)), alpha = 0.7) +
  scale_size_continuous(range = c(1, 5)) +
  labs(title = "Shot Map with xG Values",
       colour = "Goal",
       size = "xG") +
  theme_minimal()

The plot shows larger circles for high‑xG chances and colours to differentiate actual goals.

Step 9: Aggregating xG for Teams or Players

# Example: total xG per team in the test period
# Requires team_name in test_data: keep it in the select() call in Step 3
team_xg <- test_data %>%
  group_by(team_name) %>%
  summarise(Total_xG = sum(xG),
            Goals = sum(goal))

Compare Total_xG vs actual goals to spot over‑ or under‑performing teams.
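One simple way to make that comparison is to subtract total xG from goals and sort by the difference. The base-R sketch below uses made-up shots for three hypothetical teams; the column name Goals_minus_xG is an illustrative choice.

```r
# Made-up shot-level xG and outcomes for three hypothetical teams
toy <- data.frame(
  team_name = c("A", "A", "A", "B", "B", "C", "C", "C"),
  xG        = c(0.10, 0.45, 0.05, 0.30, 0.20, 0.70, 0.15, 0.05),
  goal      = c(0, 1, 0, 0, 0, 1, 1, 0)
)

# Sum xG and goals per team
team_summary <- aggregate(cbind(Total_xG = xG, Goals = goal) ~ team_name,
                          data = toy, FUN = sum)

# Positive values: the team scored more than its chances suggested
team_summary$Goals_minus_xG <- team_summary$Goals - team_summary$Total_xG
team_summary[order(-team_summary$Goals_minus_xG), ]
```

Over a handful of shots these differences are mostly noise, so the comparison becomes more informative as the sample grows.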

Step 10: Exporting Results

write_csv(test_data, "xg_predictions_2022_23.csv")
write_csv(team_xg, "team_xg_summary.csv")

You now have a reusable xG pipeline that can be scheduled each week for live analysis.

Conclusion

Building an xG model in R is straightforward once you have reliable shot data. worldfootballR removes the tedious scraping step, letting you focus on feature engineering, model fitting, and insight generation. Start with this logistic baseline, experiment with additional variables (pressing intensity, goalkeeper position, expected threat), and watch your analytical edge grow.
