library(tidyverse)
library(tidymodels)
library(ggplot2)
library(ggrepel)
library(knitr)
library(corrplot)
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Earnings in Esports
Introduction
Esports has grown into a multibillion-dollar industry, with competitive gaming events awarding large prize pools. Understanding the trends and factors that influence esports revenue can provide valuable insights for players, organizations, and investors. In this analysis, we examine historical esports earnings data from 1998 to 2023, retrieved from Kaggle (https://www.kaggle.com/datasets/rankirsh/esports-earnings/data).
We will apply several machine learning models to examine and predict overall earnings in the esports business. We will focus on Linear Regression and K-Nearest Neighbors (KNN) and compare their performance before selecting the most accurate predictive model for our dataset. To complete this analysis, we will undertake data splitting, preprocessing, model training, and assessment, and finally evaluate the results of the various models.
Data Exploration
First, let’s explore the dataset by examining its structure and attributes. We begin by importing the necessary libraries:
Now, we load the dataset and inspect its contents:
# Here we load the dataset and store it in earning
earning <- read.csv("GeneralEsportData.csv")
# Let us look at the first six rows of the dataset to explore the data
head(earning)
Game ReleaseDate Genre TotalEarnings OfflineEarnings
1 Age of Empires 1997 Strategy 736284.75 522378.17
2 Age of Empires II 1999 Strategy 3898508.73 1361409.22
3 Age of Empires III 2005 Strategy 122256.72 44472.60
4 Age of Empires IV 2021 Strategy 1190813.44 439117.93
5 Age of Empires Online 2011 Strategy 11462.98 775.00
6 Age of Mythology 2002 Strategy 188619.58 86723.77
PercentOffline TotalPlayers TotalTournaments
1 0.70947846 624 341
2 0.34921282 2256 1939
3 0.36376405 172 179
4 0.36875460 643 423
5 0.06760895 52 68
6 0.45978138 236 298
earning |> glimpse()
Rows: 669
Columns: 8
$ Game <chr> "Age of Empires", "Age of Empires II", "Age of Empire…
$ ReleaseDate <int> 1997, 1999, 2005, 2021, 2011, 2002, 2018, 2019, 2018,…
$ Genre <chr> "Strategy", "Strategy", "Strategy", "Strategy", "Stra…
$ TotalEarnings <dbl> 736284.75, 3898508.73, 122256.72, 1190813.44, 11462.9…
$ OfflineEarnings <dbl> 522378.17, 1361409.22, 44472.60, 439117.93, 775.00, 8…
$ PercentOffline <dbl> 0.70947846, 0.34921282, 0.36376405, 0.36875460, 0.067…
$ TotalPlayers <int> 624, 2256, 172, 643, 52, 236, 14, 133, 698, 1158, 90,…
$ TotalTournaments <int> 341, 1939, 179, 423, 68, 298, 8, 52, 207, 779, 29, 22…
Running head(earning) and glimpse() lets us inspect the initial rows of the dataset. Together they provide an overview of the available data fields: 669 observations and 8 columns, namely Game, ReleaseDate, Genre, TotalEarnings, OfflineEarnings, PercentOffline, TotalPlayers, and TotalTournaments, along with their respective data types.
The dataset highlights that earnings and player engagement vary significantly between games.
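Because the preprocessing recipes later in this analysis include imputation steps, it is worth checking up front how many missing values each column contains. The following is a minimal sketch using the earning data frame loaded above; the counts themselves are not reproduced here.
# Count missing values per column; the imputation steps later only matter if any of these are non-zero
earning |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  glimpse()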
Now let us visualize the data to look at earnings over time.
ggplot(earning, aes(x = ReleaseDate, y = TotalEarnings / 1e6)) +
geom_point(alpha = 0.6, aes(color = Genre, size = TotalPlayers)) +
labs(title = "Esports Total Earnings by Release Year",
subtitle = "Games released between 2009-2017 generated highest earnings",
x = "Release Year",
y = "Total Earnings (Millions USD)",
color = "Game Genre",
size = "Total Players") +
theme_minimal()
This visualization helps us identify trends, potential outliers, and the overall growth trajectory of esports earnings over the analyzed period. The graph shows that games released between 2009 and 2017 generated the highest earnings.
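To back up that observation numerically, one option is to aggregate total earnings by release year and look at the top years. This is a small sketch using the earning data frame from above; its output is not reproduced here.
# Total earnings and game counts per release year, highest-earning years first
earning |>
  group_by(ReleaseDate) |>
  summarise(Year_Earnings = sum(TotalEarnings), Games = n()) |>
  arrange(desc(Year_Earnings)) |>
  head(10)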
# Genre breakdown by earnings
genre_table <- earning |>
  group_by(Genre) |>
  summarise(
    Total_Genre_Earnings = sum(TotalEarnings),
    Avg_Earnings = mean(TotalEarnings),
    Game_Count = n()
  ) |>
  arrange(desc(Total_Genre_Earnings))

kable(genre_table, format.args = list(big.mark = ","))
| Genre | Total_Genre_Earnings | Avg_Earnings | Game_Count |
|---|---|---|---|
| Multiplayer Online Battle Arena | 643,829,101.5 | 22,201,003.50 | 29 |
| First-Person Shooter | 483,596,197.6 | 3,504,320.27 | 138 |
| Battle Royale | 390,379,254.7 | 22,963,485.57 | 17 |
| Strategy | 135,819,161.6 | 1,997,340.61 | 68 |
| Sports | 106,413,829.4 | 1,085,855.40 | 98 |
| Collectible Card Game | 51,485,118.4 | 3,028,536.37 | 17 |
| Fighting Game | 39,733,007.1 | 210,227.55 | 189 |
| Racing | 18,528,690.7 | 280,737.74 | 66 |
| Role-Playing Game | 15,277,937.0 | 1,527,793.70 | 10 |
| Third-Person Shooter | 6,105,826.4 | 555,075.13 | 11 |
| Puzzle Game | 476,062.7 | 28,003.69 | 17 |
| Music / Rhythm Game | 323,034.9 | 35,892.77 | 9 |
From this table we can see that the esports industry is dominated by MOBA games, which have the highest total earnings at $643.8M and an average of $22.2M per game. While MOBAs lead in total earnings, Battle Royale games actually have the highest average earnings at $22.9M. For a better understanding, we visualize this data in the following graph.
ggplot(genre_table, aes(x = reorder(Genre, Total_Genre_Earnings), y = Total_Genre_Earnings/1e6)) +
geom_col(aes(fill = Game_Count)) +
geom_text(aes(label = Game_Count), hjust = -0.1) +
scale_fill_viridis_c() +
coord_flip() +
labs(title = "Total Esports Earnings by Genre",
subtitle = "Number of games shown for each genre",
x = "Genre",
y = "Total Earnings (Millions $)",
fill = "Game Count") +
theme_minimal()
Model Analysis:
Data Splitting
Before comparing different models, our first task is to split the data into training and test sets. This allows us to evaluate model performance on unseen data.
set.seed(427)
data_split <- initial_split(earning, prop = 0.65)
train_data <- training(data_split)
test_data <- testing(data_split)

test_folds <- vfold_cv(train_data, v = 10, repeats = 10)
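As a quick sanity check, we can confirm how many rows landed in each split; with prop = 0.65 and 669 rows, roughly 65% should be in training. This is a small sketch and its output is not shown here.
# Row counts for the training and test sets
c(train = nrow(train_data), test = nrow(test_data))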
Data Preprocessing
To ensure good model performance, we preprocess the data using different imputation techniques and transformation strategies. Here, we create several recipes for both the Linear Regression and KNN models, each with its own data cleaning and transformation strategy and a different imputation method for filling any missing data.
# Linear Regression recipe with mean imputation
lm_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# Linear Regression recipe with median imputation
lm_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal(), -all_outcomes()) |>
  step_lincomb(all_predictors()) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())
# Linear Regression recipe with KNN imputation
lm_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_lincomb(all_predictors()) |>
  step_normalize(all_numeric_predictors())
# KNN recipe with KNN imputation
knn_recipe1 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_knn(all_numeric_predictors(), neighbors = 5) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with median imputation
knn_recipe2 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.05, other = "Rare") |>
  step_scale(all_numeric_predictors()) |>
  step_unknown(all_nominal(), -all_outcomes()) |>
  step_dummy(all_nominal_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_predictors())
# KNN recipe with mean imputation
knn_recipe3 <- recipe(TotalEarnings ~ ., data = train_data) |>
  step_novel(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_other(all_nominal_predictors(), threshold = 0.01, other = "Rare") |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_corr(all_numeric_predictors(), threshold = 0.75) |>
  step_lincomb(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())
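Before training anything, it can help to confirm what a recipe actually produces. The sketch below preps one of the recipes on the training data and inspects the processed columns; this is a sanity check only and is not part of the modelling workflow.
# Prep lm_recipe2 on the training data and look at the columns it produces
lm_recipe2 |>
  prep(training = train_data) |>
  bake(new_data = NULL) |>
  glimpse()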
Model Creation
After preprocessing, we move on to the next stage, where we define multiple models in order to compare their predictive performance. We define a Linear Regression model and KNN models with 5 and 10 neighbors:
lm_model <- linear_reg() |>
  set_engine("lm")

knn5_model <- nearest_neighbor(neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

knn10_model <- nearest_neighbor(neighbors = 10) |>
  set_engine("kknn") |>
  set_mode("regression")
Data Training and Comparison
After creating the different models, we move on to the next step: building a workflow set that combines our preprocessing recipes with the model specifications. Doing so gives us 9 different workflows for predicting the data: 3 linear regression and 6 KNN.
knn_models <- list(
  knn5 = knn5_model,
  knn10 = knn10_model
)

lm_models <- list(
  lm_model = lm_model
)

knn_preprocessors <- list(
  knn_knn_impute = knn_recipe1,
  knn_mean_impute = knn_recipe3,
  knn_median_imput = knn_recipe2
)

lm_preprocessors <- list(
  lm_knn_impute = lm_recipe3,
  lm_mean_impute = lm_recipe1,
  lm_median_imput = lm_recipe2
)
knn_models <- workflow_set(knn_preprocessors, knn_models, cross = TRUE)
lm_models <- workflow_set(lm_preprocessors, lm_models, cross = TRUE)

all_models <- lm_models |>
  bind_rows(knn_models)
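Printing the combined workflow set is a quick way to confirm it contains the expected 9 workflows (3 recipes paired with the linear model, plus 3 recipes paired with each of the two KNN models); the printed tibble is not reproduced here.
# The workflow set should have one row per recipe/model combination (9 in total)
all_models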
Finally, we can move on to the last step, which is evaluating all the models using cross-validation and comparing their performance metrics, particularly R-squared (rsq) and Root Mean Squared Error (RMSE).
earning_metrics <- metric_set(rmse, rsq)

all_fits <- all_models |>
  workflow_map("fit_resamples",
               resamples = test_folds,
               metrics = earning_metrics)
autoplot(all_fits, metric = "rsq") +
geom_text_repel(aes(label = wflow_id))
After fitting and visualizing all the different models, we evaluate their performance using rsq, which measures the proportion of variance in TotalEarnings explained by each model. A higher rsq is better: as rsq approaches 1, that model's predictions are more accurate than those of models with lower values.
In this rsq graph we can see that lm_median_imput_model achieved the highest R-squared value, showing that median imputation with linear regression is the most effective way to handle this dataset. It also demonstrated consistent, stable performance across the cross-validation folds.
If we were to go for a non-linear model, the knn_median_imput_knn10 model may be a viable alternative, albeit with slightly worse overall predictive performance.
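The same comparison can also be read as a table rather than a plot. The sketch below uses rank_results() from workflowsets (loaded with tidymodels) to rank the workflows by cross-validated rsq; the column selection is only for readability, and the resulting table is not reproduced here.
# Rank all workflows by their cross-validated R-squared
rank_results(all_fits, rank_metric = "rsq", select_best = TRUE) |>
  filter(.metric == "rsq") |>
  select(wflow_id, mean, std_err, rank)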
Creating a readable graph for the rsq results was difficult because the labels overlapped each other, so I referred to AI to create a better visualization of R-squared for the different models.
autoplot(all_fits, metric = "rsq") +
geom_label_repel(aes(label = wflow_id),
box.padding = 0.6,
point.padding = 0.5,
max.overlaps = Inf,
hjust = 1, vjust = 1,
nudge_y = -0.7,
direction = "y",
segment.color = "grey50") +
theme_minimal() +
labs(title = "Model Performance (R-squared)",
x = "Workflow Rank",
y = "R-squared (rsq)") +
scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) + # Ensure y-axis values appear properly
theme(
legend.position = "bottom",
axis.text.y = element_text(size = 12, color = "black") # Improve y-axis text visibility
)
We have established that lm_median_imput_model should be the best choice for predicting our data. Hence, we finally apply this model to the test data to confirm whether its predictions are accurate or whether we would have to start all over.
best_workflow <- workflow() |>
  add_recipe(lm_recipe2) |>
  add_model(lm_model)

best_fit <- best_workflow |>
  fit(data = train_data)

best_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
8 Recipe Steps
• step_novel()
• step_impute_median()
• step_other()
• step_unknown()
• step_dummy()
• step_lincomb()
• step_center()
• step_scale()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) ReleaseDate
2427903 20710
OfflineEarnings PercentOffline
7593402 -234431
TotalPlayers TotalTournaments
5945654 -784903
Game_Rare Genre_First.Person.Shooter
-11787 -204118
Genre_Racing Genre_Sports
7174 26751
Genre_Strategy Genre_Rare
62082 148465
predictions <- predict(best_fit, new_data = test_data)

results <- test_data |>
  bind_cols(predictions)

metrics <- metric_set(rmse, rsq)

error_metrics <- results |>
  metrics(truth = TotalEarnings, estimate = .pred)

error_metrics
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 4201628.
2 rsq standard 0.976
We find that the rsq is 0.976 and the RMSE is roughly $4.2M. These results indicate very strong predictive performance from our model. The rsq of 0.976 shows that our model explains a large proportion of the variance in total esports earnings, and the RMSE of $4.2M means that the model's predictions deviate from the actual test values by about $4.2M on average, which is a fair margin considering the scale of earnings in the esports industry.
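To complement the metrics, a predicted-versus-actual plot on the test set gives a visual sense of where the model over- or under-predicts. The following is a minimal sketch using the results data frame built above; points close to the dashed identity line indicate accurate predictions.
# Predicted vs. actual total earnings on the test set, both in millions of USD
ggplot(results, aes(x = TotalEarnings / 1e6, y = .pred / 1e6)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Predicted vs. Actual Total Earnings (Test Set)",
       x = "Actual Earnings (Millions USD)",
       y = "Predicted Earnings (Millions USD)") +
  theme_minimal()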
Conclusion
This analysis demonstrates that machine learning models can accurately forecast esports earnings, with our best model explaining 97.6% of the variance in the test data. Linear regression models consistently outperformed KNN models, indicating that the relationship between the predictors and earnings is primarily linear. In this dataset, factors such as tournament count, player base size, genre, and release timing are the most significant predictors of financial success in esports, and the dominance of the MOBA, FPS, and Battle Royale genres in the earnings landscape is a major factor in how much a game earns. Finally, the strong performance of our linear model indicates that the esports business has developed predictable income patterns that can be forecasted.