Earning in Esports

Author

Shubham Shrestha

Introduction

Esports has grown into a multibillion-dollar industry, with competitive gaming events awarding large prize pools. Understanding the trends and factors that influence esports earnings can provide valuable insights for players, organizations, and investors. In this analysis, we examine historical esports earnings data from 1998 to 2023, retrieved from Kaggle (https://www.kaggle.com/datasets/rankirsh/esports-earnings/data).

We will apply several machine learning models to examine and predict total earnings in the esports industry, focusing on Linear Regression and K-Nearest Neighbors (KNN) and comparing their performance before selecting the most accurate predictive model for our dataset. To do so, we will carry out data splitting, preprocessing, model training, evaluation, and finally an assessment of the results across the different models.

Data Exploration

First, let’s explore the dataset by examining its structure and attributes. We begin by importing the necessary libraries:

library(tidyverse)
library(tidymodels)
library(ggplot2)
library(ggrepel)
library(knitr)
library(corrplot)

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 

Now, we load the dataset and inspect its contents:

# Load the dataset and store it in earning
earning <- read.csv("GeneralEsportData.csv")

# Look at the first six rows of the dataset to explore the data
head(earning)
                   Game ReleaseDate    Genre TotalEarnings OfflineEarnings
1        Age of Empires        1997 Strategy     736284.75       522378.17
2     Age of Empires II        1999 Strategy    3898508.73      1361409.22
3    Age of Empires III        2005 Strategy     122256.72        44472.60
4     Age of Empires IV        2021 Strategy    1190813.44       439117.93
5 Age of Empires Online        2011 Strategy      11462.98          775.00
6      Age of Mythology        2002 Strategy     188619.58        86723.77
  PercentOffline TotalPlayers TotalTournaments
1     0.70947846          624              341
2     0.34921282         2256             1939
3     0.36376405          172              179
4     0.36875460          643              423
5     0.06760895           52               68
6     0.45978138          236              298
earning |> glimpse()
Rows: 669
Columns: 8
$ Game             <chr> "Age of Empires", "Age of Empires II", "Age of Empire…
$ ReleaseDate      <int> 1997, 1999, 2005, 2021, 2011, 2002, 2018, 2019, 2018,…
$ Genre            <chr> "Strategy", "Strategy", "Strategy", "Strategy", "Stra…
$ TotalEarnings    <dbl> 736284.75, 3898508.73, 122256.72, 1190813.44, 11462.9…
$ OfflineEarnings  <dbl> 522378.17, 1361409.22, 44472.60, 439117.93, 775.00, 8…
$ PercentOffline   <dbl> 0.70947846, 0.34921282, 0.36376405, 0.36875460, 0.067…
$ TotalPlayers     <int> 624, 2256, 172, 643, 52, 236, 14, 133, 698, 1158, 90,…
$ TotalTournaments <int> 341, 1939, 179, 423, 68, 298, 8, 52, 207, 779, 29, 22…

Running head(earning) and glimpse() lets us inspect the first few records of the dataset. Together they give an overview of the available data fields: 669 observations across 8 columns (Game, ReleaseDate, Genre, TotalEarnings, OfflineEarnings, PercentOffline, TotalPlayers, and TotalTournaments), along with their respective data types.

The dataset highlights that earnings and player engagement vary significantly between games.
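For instance, a quick check of the highest-earning titles makes this spread concrete (a simple sort using only columns present in the dataset):

earning |>
  arrange(desc(TotalEarnings)) |>
  select(Game, Genre, TotalEarnings, TotalPlayers) |>
  head(5)   # the five highest-earning games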

Now let us visualize the data to look at earnings over time.

ggplot(earning, aes(x = ReleaseDate, y = TotalEarnings / 1e6)) +
  geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
  labs(title = "Esports Total Earnings by Release Year",
       subtitle = "Games released between 2009-2017 generated highest earnings",
       x = "Release Year", 
       y = "Total Earnings (Millions USD)",
       color = "Game Genre",
       size = "Total Players") +
  theme_minimal()

This visualization helps us identify trends, potential outliers, and the overall growth trajectory of esports earnings over the analyzed period. The graph shows that games released between 2009 and 2017 generated the highest earnings.
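To make those outliers easier to identify, one option (a sketch using ggrepel, loaded above, with an arbitrary $200M cutoff) is to label only the highest-earning games:

# Label games above an arbitrary earnings threshold (here $200M)
top_games <- earning |> filter(TotalEarnings > 2e8)

ggplot(earning, aes(x = ReleaseDate, y = TotalEarnings / 1e6)) +
  geom_point(alpha = 0.6, aes(color = Genre)) +
  geom_text_repel(data = top_games, aes(label = Game), size = 3) +
  labs(x = "Release Year", y = "Total Earnings (Millions USD)") +
  theme_minimal()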

# Genre breakdown by earnings
genre_table <- earning |> 
  group_by(Genre) |> 
  summarise(
    Total_Genre_Earnings = sum(TotalEarnings),
    Avg_Earnings = mean(TotalEarnings),
    Game_Count = n()
  ) |> 
  arrange(desc(Total_Genre_Earnings))

kable(genre_table, format.args = list(big.mark = ","))
Genre                            Total_Genre_Earnings   Avg_Earnings  Game_Count
Multiplayer Online Battle Arena         643,829,101.5  22,201,003.50          29
First-Person Shooter                    483,596,197.6   3,504,320.27         138
Battle Royale                           390,379,254.7  22,963,485.57          17
Strategy                                135,819,161.6   1,997,340.61          68
Sports                                  106,413,829.4   1,085,855.40          98
Collectible Card Game                    51,485,118.4   3,028,536.37          17
Fighting Game                            39,733,007.1     210,227.55         189
Racing                                   18,528,690.7     280,737.74          66
Role-Playing Game                        15,277,937.0   1,527,793.70          10
Third-Person Shooter                      6,105,826.4     555,075.13          11
Puzzle Game                                 476,062.7      28,003.69          17
Music / Rhythm Game                         323,034.9      35,892.77           9

From this table we can see that the esports industry has been dominated by MOBA games, which hold the highest total earnings at $643.8M. Interestingly, although MOBA leads in total earnings, Battle Royale has the highest average earnings per game at $23.0M (versus $22.2M for MOBA). For a better understanding, we visualize this data in the following graph.

ggplot(genre_table, aes(x = reorder(Genre, Total_Genre_Earnings), y = Total_Genre_Earnings/1e6)) +
  geom_col(aes(fill = Game_Count)) +
  geom_text(aes(label = Game_Count), hjust = -0.1) +
  scale_fill_viridis_c() +
  coord_flip() +
  labs(title = "Total Esports Earnings by Genre",
       subtitle = "Number of games shown for each genre",
       x = "Genre",
       y = "Total Earnings (Millions $)",
       fill = "Game Count") +
  theme_minimal()

Model Analysis

Data Splitting

Before comparing different models, our primary task is to split the data into training and test sets. This allows us to evaluate model performance on unseen data.

set.seed(427)
# Hold out 35% of the data for testing; train on the remaining 65%
data_split <- initial_split(earning, prop = 0.65)
train_data <- training(data_split)
test_data  <- testing(data_split)

# 10-fold cross-validation, repeated 10 times, on the training data
test_folds <- vfold_cv(train_data, v = 10, repeats = 10)
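A quick sanity check confirms the split sizes (roughly 65% of the 669 rows for training, the rest for testing):

# Row counts for the training and test sets
nrow(train_data)
nrow(test_data)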

Data Preprocessing

To ensure good model performance, we preprocess the data using different imputation techniques and transformation strategies. Here, we create several recipes for both the Linear Regression and KNN models, each with its own data cleaning and transformation strategy, using various imputation methods to fill in missing data.

# Linear regression recipe with mean imputation
lm_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())

# Linear regression recipe with median imputation
lm_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_lincomb(all_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

# Linear regression recipe with KNN imputation
lm_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_lincomb(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# KNN recipe with KNN imputation
knn_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())

# KNN recipe with median imputation
knn_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_scale(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())

# KNN recipe with mean imputation
knn_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Rare") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())
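Before training, it can be helpful to verify what a recipe actually produces. A minimal check (not part of the modeling pipeline) preps one recipe on the training data and inspects the transformed predictors:

# Prep the median-imputation recipe and inspect the baked training data
lm_recipe2 |>
  prep(training = train_data) |>
  bake(new_data = NULL) |>
  glimpse()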

Model Creation

After preprocessing, we move on to the next stage, where we define multiple models to compare their predictive performance. Here we create a Linear Regression model and KNN models with 5 and 10 neighbors:

lm_model <- linear_reg() |>
  set_engine('lm')

knn5_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn10_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")

Data Training and Comparison

After creating the different models, the next step is to build a workflow set that combines our preprocessing recipes with the model specifications. Doing so leaves us with 9 different models for predicting the data: 3 linear regression and 6 KNN models.

knn_models <- list(
  knn5 = knn5_model,
  knn10 = knn10_model
)

lm_models <- list(
  lm_model = lm_model
)

knn_preprocessors <- list(
  knn_knn_impute = knn_recipe1,
  knn_mean_impute = knn_recipe3,
knn_median_impute = knn_recipe2
)

lm_preprocessors <- list(
  lm_knn_impute = lm_recipe3,
  lm_mean_impute = lm_recipe1,
lm_median_impute = lm_recipe2
)

knn_models <- workflow_set(knn_preprocessors, knn_models, cross = TRUE)
lm_models <-  workflow_set(lm_preprocessors, lm_models, cross = TRUE)

all_models <- lm_models |> 
  bind_rows(knn_models)
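As a quick check that the set was assembled correctly, we can list the workflow IDs; each ID combines a recipe name with a model name, giving the expected nine workflows:

# Nine workflow IDs: three recipes x one lm model, plus three recipes x two knn models
all_models$wflow_id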

Finally, we can move on to the last step: evaluating all models using cross-validation and comparing their performance metrics, particularly R-squared (rsq) and Root Mean Squared Error (RMSE).
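For intuition, both metrics reduce to simple formulas; a minimal sketch with hypothetical helper functions (rmse_manual and rsq_manual are illustrative only, not part of the pipeline):

# RMSE: square root of the average squared difference between truth and prediction
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# rsq: squared correlation between truth and prediction (the definition yardstick uses)
rsq_manual <- function(truth, estimate) {
  cor(truth, estimate)^2
}

In practice we rely on yardstick's implementations via metric_set():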

earning_metrics <- metric_set(rmse, rsq)

all_fits <- all_models |> 
  workflow_map("fit_resamples",
               resamples = test_folds,
               metrics = earning_metrics)

autoplot(all_fits,  metric = "rsq") + 
  geom_text_repel(aes(label = wflow_id))

After fitting and visualizing all the different models, we evaluate their performance using rsq, which measures the proportion of variance in TotalEarnings explained by the model. A higher rsq is better: as rsq approaches 1, a model's predictions become more accurate than those of models with lower values.

In this rsq graph we can see that the lm_median_impute_lm_model workflow achieved the highest R-squared value, showing that median imputation with linear regression is the most effective way to handle this dataset. It also demonstrated consistent, stable performance across the cross-validation folds.

If we were to go for a non-linear model, the knn_median_impute_knn10 model may be a viable alternative, albeit with slightly worse overall predictive performance.
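Besides the plot, the workflows can be ranked numerically; a quick check using workflowsets' rank_results (output column names assumed from the current workflowsets release):

# Rank all nine workflows by their cross-validated R-squared
rank_results(all_fits, rank_metric = "rsq") |>
  filter(.metric == "rsq") |>
  select(wflow_id, mean, std_err, rank)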

Creating a readable graph of the rsq results was difficult because the labels overlapped one another, so I turned to AI assistance to produce a better visualization of the models' R-squared values.

autoplot(all_fits, metric = "rsq") + 
  geom_label_repel(aes(label = wflow_id),
                   box.padding = 0.6,  
                   point.padding = 0.5,  
                   max.overlaps = Inf,  
                   hjust = 1, vjust = 1,  
                   nudge_y = -0.7,  
                   direction = "y",  
                   segment.color = "grey50") +  
  theme_minimal() + 
  labs(title = "Model Performance (R-squared)", 
       x = "Workflow Rank", 
       y = "R-squared (rsq)") +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) +  # Ensure y-axis values appear properly
  theme(
    legend.position = "bottom",
    axis.text.y = element_text(size = 12, color = "black")  # Improve y-axis text visibility
  )

Having identified lm_median_impute (linear regression with median imputation) as the best-performing workflow, we now fit it on the training data and evaluate it on the test data to confirm whether its predictions hold up, or whether we would have to start over.

best_workflow <- workflow() |>
  add_recipe(lm_recipe2) |>
  add_model(lm_model)

best_fit <- best_workflow |>
  fit(data = train_data)

best_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
8 Recipe Steps

• step_novel()
• step_impute_median()
• step_other()
• step_unknown()
• step_dummy()
• step_lincomb()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
               (Intercept)                 ReleaseDate  
                   2427903                       20710  
           OfflineEarnings              PercentOffline  
                   7593402                     -234431  
              TotalPlayers            TotalTournaments  
                   5945654                     -784903  
                 Game_Rare  Genre_First.Person.Shooter  
                    -11787                     -204118  
              Genre_Racing                Genre_Sports  
                      7174                       26751  
            Genre_Strategy                  Genre_Rare  
                     62082                      148465  
predictions <- predict(best_fit, new_data = test_data)

results <- test_data |>
  bind_cols(predictions)

metrics <- metric_set(rmse, rsq)
error_metrics <- results |>
  metrics(truth = TotalEarnings, estimate = .pred)

error_metrics
# A tibble: 2 × 3
  .metric .estimator   .estimate
  <chr>   <chr>            <dbl>
1 rmse    standard   4201628.   
2 rsq     standard         0.976

We find an rsq of 0.976 and an RMSE of about $4.2M. These results indicate very strong predictive performance from our model. The rsq of 0.976 shows that the model explains a large proportion of the variance in total esports earnings, while the RMSE means the model's predictions deviate from the actual values by $4.2M on average on the test set, which is a fair margin considering the scale of earnings in the esports industry.
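As a final visual check, a standard diagnostic is to plot predicted against actual earnings on the test set; points close to the dashed diagonal indicate accurate predictions:

# Predicted vs. actual total earnings on the test set
ggplot(results, aes(x = TotalEarnings / 1e6, y = .pred / 1e6)) +
  geom_point(alpha = 0.6) +
  geom_abline(linetype = "dashed", color = "grey50") +
  labs(title = "Predicted vs. Actual Total Earnings (Test Set)",
       x = "Actual Earnings (Millions USD)",
       y = "Predicted Earnings (Millions USD)") +
  theme_minimal()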

Conclusion

This analysis demonstrates that machine learning models can accurately forecast esports earnings, with our best model explaining 97.6% of the variance in the test data. Linear regression models consistently outperformed KNN models, indicating that the relationship between the predictors and earnings is primarily linear. In this dataset, tournament count, player base size, genre, and release timing are the most significant predictors of financial success in esports, and the dominance of the MOBA, FPS, and Battle Royale genres is a major factor in a game's earnings. Finally, the strong performance of our linear model suggests that the esports business has established predictable income patterns that can be forecasted.