# Importing the necessary libraries for our project
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(ggrepel)
library(knitr)
library(corrplot)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Earnings in Esports
1. Introduction
Esports has grown into a multibillion-dollar industry, with professional gaming events offering significant prize pools. Understanding the trends and factors that drive esports revenue can provide valuable insights for players, organizations, and investors. In this analysis, we look at historical esports profits data from 1998 to 2023, which we obtained from Kaggle (https://www.kaggle.com/datasets/rankirsh/esports-earnings/data). This dataset was created by merging information from EsportEarnings.com.
The esports industry’s rapid expansion provides an opportunity for data-driven analysis, particularly for understanding the financial structures that support its long-term sustainability and trajectory. This analysis addresses a central question: can machine learning methods accurately predict overall earnings in the esports industry? We will compare the performance of Linear Regression and K-Nearest Neighbors (KNN) models before selecting the most accurate predictive model for our dataset. To do this, we will perform data splitting, preprocessing, model training, and assessment, and finally evaluate the results of the various models. Our final goal for this analysis is to establish whether the current trend in esports earnings is predictable.
2. Dataset Variables
Before beginning the analysis, note that the dataset contains the following variables:
Game: The name of the game
ReleaseDate: The year the game was released
Genre: The genre of the game
TotalEarnings: The total prize pool allocated across tournaments
OfflineEarnings: The total earnings allocated in offline/LAN events
PercentOffline: The percentage of earnings coming from offline tournaments
TotalPlayers: The total number of players who received a prize
TotalTournaments: The total number of tournaments listed on the site
3. Data Exploration
First, let’s look at the dataset’s structure and attributes. We will use several data visualization methods to explore the data and gain a better grasp of our dataset.
We start by importing the relevant libraries:
Now, let us import the dataset and examine its contents.
# Here we are loading the dataset and storing it in `earning`
earning <- read.csv("GeneralEsportData.csv")

# Let us look at the first six rows of the dataset to explore the data
head(earning)
Game ReleaseDate Genre TotalEarnings OfflineEarnings
1 Age of Empires 1997 Strategy 736284.75 522378.17
2 Age of Empires II 1999 Strategy 3898508.73 1361409.22
3 Age of Empires III 2005 Strategy 122256.72 44472.60
4 Age of Empires IV 2021 Strategy 1190813.44 439117.93
5 Age of Empires Online 2011 Strategy 11462.98 775.00
6 Age of Mythology 2002 Strategy 188619.58 86723.77
PercentOffline TotalPlayers TotalTournaments
1 0.70947846 624 341
2 0.34921282 2256 1939
3 0.36376405 172 179
4 0.36875460 643 423
5 0.06760895 52 68
6 0.45978138 236 298
earning |> glimpse()
Rows: 669
Columns: 8
$ Game <chr> "Age of Empires", "Age of Empires II", "Age of Empire…
$ ReleaseDate <int> 1997, 1999, 2005, 2021, 2011, 2002, 2018, 2019, 2018,…
$ Genre <chr> "Strategy", "Strategy", "Strategy", "Strategy", "Stra…
$ TotalEarnings <dbl> 736284.75, 3898508.73, 122256.72, 1190813.44, 11462.9…
$ OfflineEarnings <dbl> 522378.17, 1361409.22, 44472.60, 439117.93, 775.00, 8…
$ PercentOffline <dbl> 0.70947846, 0.34921282, 0.36376405, 0.36875460, 0.067…
$ TotalPlayers <int> 624, 2256, 172, 643, 52, 236, 14, 133, 698, 1158, 90,…
$ TotalTournaments <int> 341, 1939, 179, 423, 68, 298, 8, 52, 207, 779, 29, 22…
Using head(earning) and glimpse(), we get an overview of the dataset, which contains 669 observations and 8 columns: Game, ReleaseDate, Genre, TotalEarnings, OfflineEarnings, PercentOffline, TotalPlayers, and TotalTournaments, each with its own data type.
The dataset shows that earnings and player involvement differ dramatically between games.
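As a quick check of that spread, the following minimal sketch (using dplyr’s summarise, loaded with the tidyverse) reports the range of earnings and prize-winning player counts across games:

# Quick spread of earnings and player counts across games
earning |>
  summarise(
    min_earnings = min(TotalEarnings),
    median_earnings = median(TotalEarnings),
    max_earnings = max(TotalEarnings),
    median_players = median(TotalPlayers),
    max_players = max(TotalPlayers)
  )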
Now let us visualize the data to look at earnings over time.
ggplot(earning, aes(x = ReleaseDate, y = TotalEarnings/1e6)) +
geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
labs(title = "Esports Total Earnings by Release Year",
subtitle = "Games released between 2009-2017 generated highest earnings",
x = "Release Year",
y = "Total Earnings (Millions USD)",
color = "Game Genre",
size = "Total Players") +
theme_minimal()
The visualization allows us to identify trends, potential outliers, and the overall growth trajectory of esports profits across the study period. The graph illustrates that games launched between 2009 and 2017 produced the most revenue.
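Because total earnings span several orders of magnitude, the raw-scale scatter compresses most games near the bottom of the plot. As an optional sketch, the same scatter can be drawn with a log-scaled y-axis; games with zero earnings are filtered out since they cannot be shown on a log scale:

# Optional: the same scatter on a log10 y-axis; zero-earning games are dropped
earning |>
  filter(TotalEarnings > 0) |>
  ggplot(aes(x = ReleaseDate, y = TotalEarnings)) +
  geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
  scale_y_log10(labels = scales::label_dollar()) +
  labs(title = "Esports Total Earnings by Release Year (log scale)",
       x = "Release Year",
       y = "Total Earnings (USD, log scale)") +
  theme_minimal()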
# Genre breakdown by earnings
genre_table <- earning |>
  group_by(Genre) |>
  summarise(
    Total_Genre_Earnings = sum(TotalEarnings),
    Avg_Earnings = mean(TotalEarnings),
    Game_Count = n()
  ) |>
  arrange(desc(Total_Genre_Earnings))

kable(genre_table, format.args = list(big.mark = ","))
Genre | Total_Genre_Earnings | Avg_Earnings | Game_Count |
---|---|---|---|
Multiplayer Online Battle Arena | 643,829,101.5 | 22,201,003.50 | 29 |
First-Person Shooter | 483,596,197.6 | 3,504,320.27 | 138 |
Battle Royale | 390,379,254.7 | 22,963,485.57 | 17 |
Strategy | 135,819,161.6 | 1,997,340.61 | 68 |
Sports | 106,413,829.4 | 1,085,855.40 | 98 |
Collectible Card Game | 51,485,118.4 | 3,028,536.37 | 17 |
Fighting Game | 39,733,007.1 | 210,227.55 | 189 |
Racing | 18,528,690.7 | 280,737.74 | 66 |
Role-Playing Game | 15,277,937.0 | 1,527,793.70 | 10 |
Third-Person Shooter | 6,105,826.4 | 555,075.13 | 11 |
Puzzle Game | 476,062.7 | 28,003.69 | 17 |
Music / Rhythm Game | 323,034.9 | 35,892.77 | 9 |
According to this table, the MOBA genre has dominated the esports business, earning $643.8 million in total with an average of $22.2 million per game. While MOBA has the highest total earnings, Battle Royale has the highest average earnings at roughly $22.9 million per game. For a deeper look at this data, we created the graph shown below.
ggplot(genre_table, aes(x = reorder(Genre, Total_Genre_Earnings), y = Total_Genre_Earnings/1e6)) +
geom_col(aes(fill = Game_Count)) +
geom_text(aes(label = Game_Count), hjust = -0.1) +
scale_fill_viridis_c() +
coord_flip() +
labs(title = "Total Esports Earnings by Genre",
subtitle = "Number of games shown for each genre",
x = "Genre",
y = "Total Earnings (Millions $)",
fill = "Game Count") +
theme_minimal()
4. Model Analysis
4.1 Data Splitting
Before comparing different models, the first objective is to divide our data into training and test sets. We do this to get an accurate estimate of how well each model generalizes to new, previously unseen data. Without a separate test set, we risk overfitting our models to the training data, resulting in performance estimates that do not reflect real-world applications. By comparing model performance on the previously unseen test set, we can ensure a fair comparison of predictive capabilities and choose the model that is most likely to perform well in practice.
set.seed(427)
data_split <- initial_split(earning, prop = 0.65)
train_data <- training(data_split)
test_data <- testing(data_split)

test_folds <- vfold_cv(train_data, v = 10, repeats = 10)
4.2 Data Preprocessing
In this analysis we are focusing on whether the dataset shows a predictive trend, so our main interest is in the earning-related variables. Data such as the game name is not useful for this purpose, so the first data-cleaning step will be to remove that column.
4.2.1 Data Cleaning
The game name may add extra complexity without improving our knowledge of the general predictability of esports earnings as an industry. As a result, the first stage in our data cleaning process will be to delete the “Game” column, allowing us to focus on earning-related metrics that are more likely to show predictive trends.
# Remove the Game column
earning <- earning |>
  select(-Game)
Along with deleting the “Game” column, it is critical to resolve any missing data in the dataset. Upon further inspection, we discovered fields with “NA” values. These entries denote missing data points, which can cause issues in our study and degrade the performance of our predictive models. To address this, we will use imputation techniques to fill in the missing information.
missing_values <- earning |>
  summarise_all(~ sum(is.na(.))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "MissingCount")

kable(missing_values)
Variable | MissingCount |
---|---|
ReleaseDate | 0 |
Genre | 0 |
TotalEarnings | 0 |
OfflineEarnings | 0 |
PercentOffline | 61 |
TotalPlayers | 0 |
TotalTournaments | 0 |
The table shows that PercentOffline is the only column with missing (“NA”) values. Given that PercentOffline is computed directly from OfflineEarnings and TotalEarnings, the three variables are inherently correlated, so including both OfflineEarnings and PercentOffline in our study would introduce redundancy. This strong correlation suggests that PercentOffline provides little unique predictive power beyond what is already captured by OfflineEarnings and TotalEarnings, making it a less important variable for our primary goal of predicting overall earning trends. We therefore remove PercentOffline as well.
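Before dropping the column, we can sanity-check this reasoning with a correlation matrix of the numeric variables. The following is a minimal sketch, using base R’s cor() and the corrplot package loaded at the top, computed on complete cases only so the rows with a missing PercentOffline are ignored:

# Correlation matrix of the numeric columns, using complete cases only
numeric_cols <- earning |>
  select(where(is.numeric))

corr_matrix <- cor(numeric_cols, use = "complete.obs")
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.cex = 0.8)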
earning <- earning |>
  select(-PercentOffline)
We are developing multiple distinct recipes for both the Linear Regression and K-Nearest Neighbors (KNN) models, each with its own data cleaning and transformation strategy, and leveraging various imputation approaches to fill missing data. We use this method because Linear Regression and KNN have distinct underlying mechanisms that are affected differently by preprocessing. Normalization or standardization is critical for KNN, to prevent features on larger scales from dominating the distance computations, and experimenting with different imputation strategies allows us to evaluate how missing-data handling affects the performance of each model type.
4.2.2 Linear Model Recipe
# Linear Regression recipe with mean imputation
lm_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# Linear Regression recipe with median imputation
lm_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_lincomb(all_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
# Linear Regression recipe with KNN imputation
lm_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_lincomb(all_predictors()) |>
  step_normalize(all_numeric_predictors())
4.2.3 KNN Model Recipe
# KNN recipe with KNN imputation
knn_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with median imputation
knn_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_scale(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with mean imputation
knn_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Rare") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())
4.3 Model Creation
With the data preprocessed, we can now define the machine learning models that will be used to estimate esports earnings. We begin with a linear regression model: its simplicity makes the coefficients easy to interpret, providing insight into the impact of each predictor on overall revenue, and it serves as a baseline for assessing the predictability of esports earnings through a linear method. We also build two KNN models with different numbers of neighbors, five and ten; varying k lets us evaluate the model’s sensitivity to this parameter.
A smaller k may capture more local patterns in the data, but this may introduce noise and lead to overfitting. In contrast, a higher k offers a smoother decision boundary, making the model more resilient to noise but potentially missing finer nuances in the data. By comparing the performance of these models, we hope to determine whether a linear or non-linear strategy is more suited to predicting esports revenues in our dataset, as well as the optimal number of neighbors for the KNN model.
lm_model <- linear_reg() |>
  set_engine("lm")

knn5_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn10_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")
4.4 Data Training and Comparison
Having established the models, we proceed to the next phase: creating a workflow set that combines our preprocessing recipes with the model specifications. As a result, we will have nine alternative models for predicting the data: three linear regression workflows and six KNN workflows.
knn_models <- list(
  knn5 = knn5_model,
  knn10 = knn10_model
)

lm_models <- list(
  lm_model = lm_model
)

knn_preprocessors <- list(
  knn_knn_impute = knn_recipe1,
  knn_mean_impute = knn_recipe3,
  knn_median_imput = knn_recipe2
)

lm_preprocessors <- list(
  lm_knn_impute = lm_recipe3,
  lm_mean_impute = lm_recipe1,
  lm_median_imput = lm_recipe2
)
knn_models <- workflow_set(knn_preprocessors, knn_models, cross = TRUE)
lm_models <- workflow_set(lm_preprocessors, lm_models, cross = TRUE)

all_models <- lm_models |>
  bind_rows(knn_models)
We can now proceed to the final phase: evaluating all models using cross-validation and comparing their performance metrics, particularly R-squared (rsq) and root mean squared error (RMSE).
earning_metrics <- metric_set(rmse, rsq)

all_fits <- all_models |>
  workflow_map("fit_resamples",
               resamples = test_folds,
               metrics = earning_metrics)
autoplot(all_fits, metric = "rsq") +
geom_text_repel(aes(label = wflow_id))
After fitting and visualizing all of the different models, we evaluate their performance using rsq, which measures the proportion of variance in TotalEarnings explained by each model. A larger rsq is better, so as the rsq approaches one, that model’s predictions can be considered more accurate than those of models with lower values.
In this rsq graph, we can see that lm_median_imput_model has the highest R-squared value, indicating that median imputation in linear regression handles the dataset most effectively, with consistent and steady performance across the cross-validation folds.
If we were to use a nonlinear model for the data, the knn_median_imput_knn10 model could be a plausible option, although with slightly worse overall predictive performance.
Creating a readable graph of the rsq values was difficult because the labels overlapped each other, so I turned to artificial intelligence to help produce a clearer depiction of the R-squared results for the different models.
autoplot(all_fits, metric = "rsq") +
geom_label_repel(aes(label = wflow_id),
box.padding = 0.6,
point.padding = 0.5,
max.overlaps = Inf,
hjust = 1, vjust = 1,
nudge_y = -0.7,
direction = "y",
segment.color = "grey50") +
theme_minimal() +
labs(title = "Model Performance (R-squared)",
x = "Workflow Rank",
y = "R-squared (rsq)") +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) + # Ensure y-axis values appear properly
theme(
legend.position = "bottom",
axis.text.y = element_text(size = 12, color = "black") # Improve y-axis text visibility
)
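The ranking can also be read off as a table rather than from the plot. The sketch below (a minimal example using workflowsets’ rank_results(), which is loaded with tidymodels) ranks every workflow by its mean cross-validated R-squared:

# Rank all workflows by mean cross-validated R-squared
rank_results(all_fits, rank_metric = "rsq", select_best = TRUE) |>
  filter(.metric == "rsq") |>
  select(wflow_id, mean, std_err, rank)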
We can conclude that a linear regression workflow is better suited to predicting our data. We therefore fit one of these workflows, the mean-imputation recipe (lm_recipe1) paired with lm_model, on the training data and apply it to the test data to determine whether its predictions are accurate, or whether we will have to start over.
best_workflow <- workflow() |>
  add_recipe(lm_recipe1) |>
  add_model(lm_model)

best_fit <- best_workflow |>
  fit(data = train_data)

predictions <- predict(best_fit, new_data = test_data)

results <- test_data |>
  bind_cols(predictions)

metrics <- metric_set(rmse, rsq)
error_metrics <- results |>
  metrics(truth = TotalEarnings, estimate = .pred)

error_metrics
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 2915256.
2 rsq standard 0.989
We can see that the rsq is 0.989 and the RMSE is about $2.9 million. These results show that our model has very strong predictive performance: the rsq of 0.989 indicates that the model explains a substantial fraction of the variance in total esports earnings, and the RMSE means the predictions deviate from the true values by about $2.9 million on average in the test set, which is reasonable given the scale of earnings in esports.
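As a final visual check, a small sketch using the results tibble built above plots predicted against actual earnings on the test set; points close to the dashed identity line correspond to accurate predictions:

# Predicted vs. actual TotalEarnings on the test set
ggplot(results, aes(x = TotalEarnings, y = .pred)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs. Actual Total Earnings (Test Set)",
       x = "Actual Total Earnings (USD)",
       y = "Predicted Total Earnings (USD)") +
  theme_minimal()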
5. Conclusion
This investigation shows that machine learning models can reliably predict esports earnings, with our top model accounting for 98.9% of the variance in the data. Linear regression models consistently outperformed KNN models, indicating that the relationship between the predictors and earnings is predominantly linear. In this dataset, tournament count, player base size, genre, and release timing appear to be the most important determinants of esports revenue. In addition, the MOBA, FPS, and Battle Royale genres dominate the revenue scene, which has a significant impact on the games’ earnings. Finally, the strong performance of our linear model suggests that the esports industry has developed a predictable earnings pattern that can be projected.