# Importing the necessary libraries for our project
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(ggrepel)
library(knitr)
library(corrplot)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Earnings in Esports
1. Introduction
Esports has grown into a multibillion-dollar industry, with professional gaming events offering significant prize pools. Understanding the trends and factors that drive esports revenue can provide valuable insights for players, organizations, and investors. In this analysis, we look at historical esports profits data from 1998 to 2023, which we obtained from Kaggle (https://www.kaggle.com/datasets/rankirsh/esports-earnings/data). This dataset was created by merging information from EsportEarnings.com.
The esports industry’s rapid expansion provides an opportunity for data-driven analysis, particularly for understanding the financial structures that support its long-term sustainability and trajectory. This analysis addresses a central question: can machine learning methods accurately predict overall earnings in the esports industry? We will compare the performance of Linear Regression and K-Nearest Neighbors (KNN) models before selecting the most accurate predictive model for our dataset. To do this, we will perform data splitting, preprocessing, model training, and assessment, and finally evaluate the results of the various models. Our final goal for this analysis is to establish whether the current trend in esports earnings is predictable.
2. Dataset Variables
Before beginning the analysis, note that the dataset contains the following variables:
Game: The name of the game
ReleaseDate: The year the game was released
Genre: The genre of the game
TotalEarnings: The total prize pool allocated across tournaments
OfflineEarnings: The total earnings allocated in offline/LAN events
PercentOffline: The percentage of earnings coming from offline tournaments
TotalPlayers: The total number of players who received a prize
TotalTournaments: The total number of tournaments listed on the site
3. Data Exploration
First, let’s look at the dataset’s structure and attributes. We will use several data visualization methods to explore the data and gain a better grasp of our dataset.
We start by importing the relevant libraries:
Now, let us import the dataset and examine its contents.
# Here we are loading the dataset and storing it in `earning`
earning <- read.csv("GeneralEsportData.csv")

# Let us look at the first six rows of the dataset to explore the data
head(earning)
Game ReleaseDate Genre TotalEarnings OfflineEarnings
1 Age of Empires 1997 Strategy 736284.75 522378.17
2 Age of Empires II 1999 Strategy 3898508.73 1361409.22
3 Age of Empires III 2005 Strategy 122256.72 44472.60
4 Age of Empires IV 2021 Strategy 1190813.44 439117.93
5 Age of Empires Online 2011 Strategy 11462.98 775.00
6 Age of Mythology 2002 Strategy 188619.58 86723.77
PercentOffline TotalPlayers TotalTournaments
1 0.70947846 624 341
2 0.34921282 2256 1939
3 0.36376405 172 179
4 0.36875460 643 423
5 0.06760895 52 68
6 0.45978138 236 298
earning |> glimpse()
Rows: 669
Columns: 8
$ Game <chr> "Age of Empires", "Age of Empires II", "Age of Empire…
$ ReleaseDate <int> 1997, 1999, 2005, 2021, 2011, 2002, 2018, 2019, 2018,…
$ Genre <chr> "Strategy", "Strategy", "Strategy", "Strategy", "Stra…
$ TotalEarnings <dbl> 736284.75, 3898508.73, 122256.72, 1190813.44, 11462.9…
$ OfflineEarnings <dbl> 522378.17, 1361409.22, 44472.60, 439117.93, 775.00, 8…
$ PercentOffline <dbl> 0.70947846, 0.34921282, 0.36376405, 0.36875460, 0.067…
$ TotalPlayers <int> 624, 2256, 172, 643, 52, 236, 14, 133, 698, 1158, 90,…
$ TotalTournaments <int> 341, 1939, 179, 423, 68, 298, 8, 52, 207, 779, 29, 22…
Using head(earning) and glimpse(), we get an overview of the dataset, which contains 669 observations and 8 columns: Game, ReleaseDate, Genre, TotalEarnings, OfflineEarnings, PercentOffline, TotalPlayers, and TotalTournaments, each with its own data type.
The dataset shows that earnings and player involvement differ dramatically between games.
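As a quick check of that spread, the following minimal sketch (using dplyr’s summarise, loaded with the tidyverse) reports the range of earnings and prize-winning player counts across games:

# Quick spread of earnings and player counts across games
earning |>
  summarise(
    min_earnings = min(TotalEarnings),
    median_earnings = median(TotalEarnings),
    max_earnings = max(TotalEarnings),
    median_players = median(TotalPlayers),
    max_players = max(TotalPlayers)
  )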
Now let us visualize the data to look at earnings over time.
ggplot(earning, aes(x = ReleaseDate, y = TotalEarnings/1e6)) +
geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
labs(title = "Esports Total Earnings by Release Year",
subtitle = "Games released between 2009-2017 generated highest earnings",
x = "Release Year",
y = "Total Earnings (Millions USD)",
color = "Game Genre",
size = "Total Players") +
theme_minimal()
The visualization allows us to identify trends, potential outliers, and the overall growth trajectory of esports profits across the study period. The graph illustrates that games launched between 2009 and 2017 produced the most revenue.
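Because total earnings span several orders of magnitude, the raw-scale scatter compresses most games near the bottom of the plot. As an optional sketch, the same scatter can be drawn with a log-scaled y-axis; games with zero earnings are filtered out since they cannot be shown on a log scale:

# Optional: the same scatter on a log10 y-axis; zero-earning games are dropped
earning |>
  filter(TotalEarnings > 0) |>
  ggplot(aes(x = ReleaseDate, y = TotalEarnings)) +
  geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
  scale_y_log10(labels = scales::label_dollar()) +
  labs(title = "Esports Total Earnings by Release Year (log scale)",
       x = "Release Year",
       y = "Total Earnings (USD, log scale)") +
  theme_minimal()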
# Genre breakdown by earnings
genre_table <- earning |>
  group_by(Genre) |>
  summarise(
    Total_Genre_Earnings = sum(TotalEarnings),
    Avg_Earnings = mean(TotalEarnings),
    Game_Count = n()
  ) |>
  arrange(desc(Total_Genre_Earnings))

kable(genre_table, format.args = list(big.mark = ","))
Genre | Total_Genre_Earnings | Avg_Earnings | Game_Count |
---|---|---|---|
Multiplayer Online Battle Arena | 643,829,101.5 | 22,201,003.50 | 29 |
First-Person Shooter | 483,596,197.6 | 3,504,320.27 | 138 |
Battle Royale | 390,379,254.7 | 22,963,485.57 | 17 |
Strategy | 135,819,161.6 | 1,997,340.61 | 68 |
Sports | 106,413,829.4 | 1,085,855.40 | 98 |
Collectible Card Game | 51,485,118.4 | 3,028,536.37 | 17 |
Fighting Game | 39,733,007.1 | 210,227.55 | 189 |
Racing | 18,528,690.7 | 280,737.74 | 66 |
Role-Playing Game | 15,277,937.0 | 1,527,793.70 | 10 |
Third-Person Shooter | 6,105,826.4 | 555,075.13 | 11 |
Puzzle Game | 476,062.7 | 28,003.69 | 17 |
Music / Rhythm Game | 323,034.9 | 35,892.77 | 9 |
According to this table, the MOBA genre has dominated the esports business, earning $643.8 million in total with an average of $22.2 million per game. While MOBA has the highest total earnings, Battle Royale has the highest average earnings at roughly $22.9 million per game. For a deeper look at this data, we created the graph shown below.
ggplot(genre_table, aes(x = reorder(Genre, Total_Genre_Earnings), y = Total_Genre_Earnings/1e6)) +
geom_col(aes(fill = Game_Count)) +
geom_text(aes(label = Game_Count), hjust = -0.1) +
scale_fill_viridis_c() +
coord_flip() +
labs(title = "Total Esports Earnings by Genre",
subtitle = "Number of games shown for each genre",
x = "Genre",
y = "Total Earnings (Millions $)",
fill = "Game Count") +
theme_minimal()
4. Model Analysis
4.1 Data Splitting
Before comparing different models, the first objective is to divide our data into training and test sets. We do this to get an accurate estimate of how well each model generalizes to new, previously unseen data. Without a separate test set, we risk overfitting our models to the training data, resulting in performance estimates that do not reflect real-world applications. By comparing model performance on the previously unseen test set, we can ensure a fair comparison of predictive capabilities and choose the model that is most likely to perform well in practice.
set.seed(427)
data_split <- initial_split(earning, prop = 0.65)
train_data <- training(data_split)
test_data <- testing(data_split)

test_folds <- vfold_cv(train_data, v = 10, repeats = 10)
4.2 Data Preprocessing
In this analysis we are focusing on whether the dataset shows a predictive trend, so our main interest is in the earning-related variables. Data such as the game name is not useful for this purpose, so the first data-cleaning step will be to remove that column.
4.2.1 Data Cleaning
The game name may add extra complexity without improving our knowledge of the general predictability of esports earnings as an industry. As a result, the first stage in our data cleaning process will be to delete the “Game” column, allowing us to focus on earning-related metrics that are more likely to show predictive trends.
# Remove the Game column
earning <- earning |>
  select(-Game)
Along with deleting the “Game” column, it is critical to resolve any missing data in the dataset. Upon further inspection, we discovered fields with “NA” values. These entries denote missing data points, which can cause issues in our study and degrade the performance of our predictive models. To address this, we will use imputation techniques to fill in the missing information.
missing_values <- earning |>
  summarise_all(~ sum(is.na(.))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "MissingCount")

kable(missing_values)
Variable | MissingCount |
---|---|
ReleaseDate | 0 |
Genre | 0 |
TotalEarnings | 0 |
OfflineEarnings | 0 |
PercentOffline | 61 |
TotalPlayers | 0 |
TotalTournaments | 0 |
The table shows that PercentOffline is the only column with missing (“NA”) values. Given that PercentOffline is computed directly from OfflineEarnings and TotalEarnings, the three variables are inherently correlated, so including both OfflineEarnings and PercentOffline in our study would introduce redundancy. This strong correlation suggests that PercentOffline provides little unique predictive power beyond what is already captured by OfflineEarnings and TotalEarnings, making it a less important variable for our primary goal of predicting overall earning trends. We therefore remove PercentOffline as well.
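Before dropping the column, we can sanity-check this reasoning with a correlation matrix of the numeric variables. The following is a minimal sketch, using base R’s cor() and the corrplot package loaded at the top, computed on complete cases only so the rows with a missing PercentOffline are ignored:

# Correlation matrix of the numeric columns, using complete cases only
numeric_cols <- earning |>
  select(where(is.numeric))

corr_matrix <- cor(numeric_cols, use = "complete.obs")
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)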
earning <- earning |>
  select(-PercentOffline)
We are developing multiple distinct recipes for both the Linear Regression and K-Nearest Neighbors (KNN) models, each with its own data cleaning and transformation strategy, and leveraging various imputation approaches to fill missing data. We use this method because Linear Regression and KNN have distinct underlying mechanisms that are affected differently by preprocessing. Normalization or standardization is critical for KNN, to prevent features on larger scales from dominating the distance computations, and experimenting with different imputation strategies allows us to evaluate how missing-data handling affects the performance of each model type.
4.2.2 Linear Model Recipe
# Linear Regression recipe with mean imputation
lm_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# Linear Regression recipe with median imputation
lm_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_lincomb(all_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
# Linear Regression recipe with KNN imputation
lm_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_lincomb(all_predictors()) |>
  step_normalize(all_numeric_predictors())
4.2.3 KNN Model Recipe
# KNN recipe with KNN imputation
knn_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with median imputation
knn_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_scale(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with mean imputation
knn_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Rare") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())
4.3 Model Creation
With the data preprocessed, we can now define the machine learning models that will be used to estimate esports earnings. We begin with a linear regression model: its simplicity makes the coefficients easy to interpret, providing insight into the impact of each predictor on overall revenue, and it serves as a baseline for assessing the predictability of esports earnings through a linear method. We also build two KNN models with different numbers of neighbors, five and ten; varying k lets us evaluate the model’s sensitivity to this parameter.
A smaller k may capture more local patterns in the data, but this may introduce noise and lead to overfitting. In contrast, a higher k offers a smoother decision boundary, making the model more resilient to noise but potentially missing finer nuances in the data. By comparing the performance of these models, we hope to determine whether a linear or non-linear strategy is more suited to predicting esports revenues in our dataset, as well as the optimal number of neighbors for the KNN model.
lm_model <- linear_reg() |>
  set_engine("lm")

knn5_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn10_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")
4.4 Data Training and Comparison
Having established the models, we proceed to the next phase: creating a workflow set that combines our preprocessing recipes with the model specifications. As a result, we will have nine alternative models for predicting the data: three linear regression workflows and six KNN workflows.
knn_models <- list(
  knn5 = knn5_model,
  knn10 = knn10_model
)

lm_models <- list(
  lm_model = lm_model
)

knn_preprocessors <- list(
  knn_knn_impute = knn_recipe1,
  knn_mean_impute = knn_recipe3,
  knn_median_imput = knn_recipe2
)

lm_preprocessors <- list(
  lm_knn_impute = lm_recipe3,
  lm_mean_impute = lm_recipe1,
  lm_median_imput = lm_recipe2
)
knn_models <- workflow_set(knn_preprocessors, knn_models, cross = TRUE)
lm_models <- workflow_set(lm_preprocessors, lm_models, cross = TRUE)

all_models <- lm_models |>
  bind_rows(knn_models)
We can now proceed to the final phase: evaluating all models using cross-validation and comparing their performance metrics, particularly R-squared (rsq) and root mean squared error (RMSE).
earning_metrics <- metric_set(rmse, rsq)

all_fits <- all_models |>
  workflow_map("fit_resamples",
               resamples = test_folds,
               metrics = earning_metrics)
autoplot(all_fits, metric = "rsq") +
geom_text_repel(aes(label = wflow_id))
After fitting and visualizing all of the different models, we evaluate their performance using rsq, which measures the proportion of variance in TotalEarnings explained by each model. A larger rsq is better, so as the rsq approaches one, that model’s predictions can be considered more accurate than those of models with lower values.
In this rsq graph, we can see that lm_median_imput_model has the highest R-squared value, indicating that median imputation in linear regression handles the dataset most effectively, with consistent and steady performance across the cross-validation folds.
If we were to use a nonlinear model for the data, the knn_median_imput_knn10 model could be a plausible option, although with slightly worse overall predictive performance.
Creating a readable graph of the rsq values was difficult because the labels overlapped each other, so I turned to artificial intelligence to help produce a clearer depiction of the R-squared results for the different models.
autoplot(all_fits, metric = "rsq") +
geom_label_repel(aes(label = wflow_id),
box.padding = 0.6,
point.padding = 0.5,
max.overlaps = Inf,
hjust = 1, vjust = 1,
nudge_y = -0.7,
direction = "y",
segment.color = "grey50") +
theme_minimal() +
labs(title = "Model Performance (R-squared)",
x = "Workflow Rank",
y = "R-squared (rsq)") +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) + # Ensure y-axis values appear properly
theme(
legend.position = "bottom",
axis.text.y = element_text(size = 12, color = "black") # Improve y-axis text visibility
)
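The ranking can also be read off as a table rather than from the plot. The sketch below (a minimal example using workflowsets’ rank_results(), which is loaded with tidymodels) ranks every workflow by its mean cross-validated R-squared:

# Rank all workflows by mean cross-validated R-squared
rank_results(all_fits, rank_metric = "rsq", select_best = TRUE) |>
  filter(.metric == "rsq") |>
  select(wflow_id, mean, std_err, rank)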
We can conclude that a linear regression workflow is better suited to predicting our data. We therefore fit one of these workflows, the mean-imputation recipe (lm_recipe1) paired with lm_model, on the training data and apply it to the test data to determine whether its predictions are accurate, or whether we will have to start over.
best_workflow <- workflow() |>
  add_recipe(lm_recipe1) |>
  add_model(lm_model)

best_fit <- best_workflow |>
  fit(data = train_data)

predictions <- predict(best_fit, new_data = test_data)

results <- test_data |>
  bind_cols(predictions)

metrics <- metric_set(rmse, rsq)
error_metrics <- results |>
  metrics(truth = TotalEarnings, estimate = .pred)

error_metrics
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 2915256.
2 rsq standard 0.989
We can see that the rsq is 0.989 and the RMSE is about $2.9 million. These results show that our model has very strong predictive performance: the rsq of 0.989 indicates that the model explains a substantial fraction of the variance in total esports earnings, and the RMSE means the predictions deviate from the true values by about $2.9 million on average in the test set, which is reasonable given the scale of earnings in esports.
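As a final visual check, a small sketch using the results tibble built above plots predicted against actual earnings on the test set; points close to the dashed identity line correspond to accurate predictions:

# Predicted vs. actual TotalEarnings on the test set
ggplot(results, aes(x = TotalEarnings, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs. Actual Total Earnings (Test Set)",
       x = "Actual Total Earnings (USD)",
       y = "Predicted Total Earnings (USD)") +
  theme_minimal()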
5. Conclusion
This investigation shows that machine learning models can reliably predict esports earnings, with our top model accounting for 98.9% of the variance in the data. Linear regression models consistently outperformed KNN models, indicating that the relationship between the predictors and earnings is predominantly linear. In this dataset, tournament count, player base size, genre, and release timing appear to be the most important determinants of esports revenue. In addition, the MOBA, FPS, and Battle Royale genres dominate the revenue scene, which has a significant impact on the games’ earnings. Finally, the strong performance of our linear model suggests that the esports industry has developed a predictable earnings pattern that can be projected.