Link to notebook. The notebook includes all the code needed to create the required data and to generate the plots. Nonetheless, an iteration of this data is already included here, under the `k_justification` folder.
This notebook presents the long process of defining and testing several models before selecting one of them to be part of the final system. We define four base models:
- All metrics (`all`). This model uses all the original metrics of the profiles (over 60), so it is likely to overfit and to carry a great deal of redundancy and unnecessary information.
- All metrics + lightweight feature selection (`all_fs_simple`). We apply a simple feature selection process that removes irrelevant features ad hoc via an embedded method. The goal is to remove only the clearly redundant features, without spending much time or altering the core of the model.
- All metrics + in-depth feature selection (`all_fs_deep`). For this model we test a multi-layer (several filter, wrapper and embedded tasks) and aggressive feature selection process, whose goal is to obtain a far superior subset of metrics. The details of this feature selection are described below.
- Custom set of metrics (`custom`). For the final model we select a custom subset of metrics, based on previous experimentation.
Feature selection is a standard process in data science and machine learning, as many features in a given dataset tend to be redundant or uninformative. Reducing the number of features to the most relevant ones lowers the complexity of the model and prevents overfitting to superfluous variables. Our initial set of features comprised 65 different metrics, so the likelihood that many of them were irrelevant was high, hence the introduction of a feature selection process. Our goal is to keep only the truly meaningful features, those likely to remain relevant when computed on external benchmarks.
To improve the robustness of our approach, we conduct a new, more thorough feature selection process based on several layers:
We employ filter methods (extremely fast, based on relationships between the attributes) to remove clearly unhelpful features without requiring model intervention. This layer encompasses three substeps (a sketch of the whole filter layer is given after the list):
- Variability analysis: features with less than 1% value variability are removed, as such a narrow range of values is highly unlikely to provide the model with useful patterns for prediction.
- Redundancy analysis: we remove highly redundant features (above 90% correlation).
- Relevancy analysis: we remove features whose Mutual Information (MI) score with respect to the target variable is below 0.01, as such weak associations are very unlikely to provide meaningful insight.
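A minimal sketch of this filter layer, assuming the profile metrics live in a pandas DataFrame `X` with the target in a series `y`. The 1%, 90% and 0.01 thresholds come from the description above; the exact way "variability" is measured (coefficient of variation) and the helper name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def filter_layer(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Apply the three filter substeps: variability, redundancy and relevancy."""
    # 1. Variability analysis: drop features with less than 1% variability.
    #    (Measuring variability as the coefficient of variation is an assumption.)
    cv = X.std() / X.abs().mean().replace(0, np.nan)
    X = X.loc[:, cv.fillna(0) >= 0.01]

    # 2. Redundancy analysis: drop one feature of every pair whose absolute
    #    Pearson correlation exceeds 0.9.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > 0.9).any()]
    X = X.drop(columns=redundant)

    # 3. Relevancy analysis: drop features whose mutual information with the
    #    target is below 0.01.
    mi = mutual_info_regression(X, y, random_state=0)
    return X.loc[:, mi >= 0.01]
```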
We use the inherent capacity of our ensemble-based models (i.e., an embedded approach) to assign a relevance score to each feature based on how much it contributes to the prediction. Features under a given relevance threshold are pruned, which provides a quick filter for features the model deems unhelpful. More precisely, we remove those features whose relevance with respect to the model is below 0.0001%.
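Expressed with scikit-learn's feature importances, this pruning might look as follows. The 0.0001% threshold (1e-6 as a fraction) comes from the description above, while the choice of `GradientBoostingRegressor` as the scoring model is an assumption.

```python
from sklearn.ensemble import GradientBoostingRegressor

def embedded_layer(X, y, threshold=1e-6):
    """Drop features whose importance in a fitted ensemble is below the threshold."""
    # The estimator choice is an assumption; any model exposing
    # feature_importances_ would work the same way.
    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    keep = model.feature_importances_ >= threshold  # importances sum to 1, so 1e-6 == 0.0001%
    return X.loc[:, keep]
```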
We employ a costly wrapper method to prune more irrelevant features, now based on an exhaustive cross-validation task. This is feasible at this stage because the previous layers removed more than half of the original features, so even though the wrapper process is still costly, it can be computed. More precisely, we employ Recursive Feature Elimination with Cross-Validation (RFECV), using 5-fold cross-validation and removing a single feature at each step.
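This layer maps directly onto scikit-learn's `RFECV`. The 5 folds and the step of one feature come from the description above, whereas the estimator and scoring metric are assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV

def wrapper_layer(X, y):
    """Recursive Feature Elimination with 5-fold CV, removing one feature per step."""
    selector = RFECV(
        estimator=GradientBoostingRegressor(random_state=0),  # assumed estimator
        step=1,                                  # remove a single feature each iteration
        cv=5,                                    # 5-fold cross-validation
        scoring="neg_root_mean_squared_error",   # assumed scoring metric
        n_jobs=-1,
    )
    selector.fit(X, y)
    return X.loc[:, selector.support_]
```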
These models are instantiated with up to 17 base regressors. We test all possible combinations on our benchmark, selecting a subset of the 20 best models based on a joint analysis of RMSE, MAE and MedAE scores. This subset of models is then tested on all the remaining benchmarks, to make sure we have not made a mistake and that none of the models that underperformed at the creation stage actually outperforms the others when generalizing to other benchmarks. The results, however, are clear: the Gradient Boosting regressor with custom features is the model that performs best, both on our benchmark and on the others.
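A sketch of the per-model scoring and joint ranking used in this comparison. The three metrics match the ones named above; aggregating them into a single ranking via an average of per-metric ranks is an assumption.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

def score_model(y_true, y_pred):
    """Return the three error metrics used to rank candidate models."""
    return {
        "rmse": mean_squared_error(y_true, y_pred) ** 0.5,
        "mae": mean_absolute_error(y_true, y_pred),
        "medae": median_absolute_error(y_true, y_pred),
    }

def rank_models(results: dict) -> pd.DataFrame:
    """results maps model name -> metrics dict; lower is better for all three metrics."""
    df = pd.DataFrame(results).T
    df["avg_rank"] = df.rank().mean(axis=1)  # joint ranking over RMSE, MAE and MedAE
    return df.sort_values("avg_rank")
```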
To better illustrate this, we took the best performing models from the training stage and tested them on the external benchmarks; the top five per benchmark are listed below.
Top 5 Model Rankings: Santos Small

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.9586 |
| 2 | gradient_boosting_all | 0.9490 |
| 3 | gradient_boosting_all_fs_simple | 0.9443 |
| 4 | extra_trees_all | 0.9386 |
| 5 | extra_trees_all_fs_simple | 0.9369 |
Top 5 Model Rankings: TUS Small

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.8875 |
| 2 | catboost_all_fs_deep | 0.8676 |
| 3 | catboost_custom | 0.8425 |
| 4 | gradient_boosting_all | 0.8356 |
| 5 | catboost_all_fs_simple | 0.8272 |
Top 5 Model Rankings: TUS Big

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.9267 |
| 2 | gradient_boosting_all | 0.9155 |
| 3 | gradient_boosting_all_fs_simple | 0.9143 |
| 4 | extra_trees_all | 0.9012 |
| 5 | extra_trees_all_fs_deep | 0.9011 |
Top 5 Model Rankings: D3L

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.7788 |
| 2 | catboost_custom | 0.6327 |
| 3 | catboost_all_fs_deep | 0.6283 |
| 4 | xgboosting_custom | 0.6259 |
| 5 | gradient_boosting_all_fs_simple | 0.6054 |
Top 5 Model Rankings: Freyja

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.9624 |
| 2 | gradient_boosting_all | 0.9387 |
| 3 | gradient_boosting_all_fs_simple | 0.9363 |
| 4 | gradient_boosting_all_fs_deep | 0.9213 |
| 5 | xgboosting_custom | 0.8898 |
Top 5 Model Rankings: OM CG

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.5763 |
| 2 | catboost_custom | 0.5552 |
| 3 | extra_trees_all_fs_deep | 0.5526 |
| 4 | extra_trees_all | 0.5483 |
| 5 | extra_trees_custom | 0.5448 |
Top 5 Model Rankings: OM CR

| Rank | Model | Avg Precision |
| --- | --- | --- |
| 1 | gradient_boosting_custom | 0.5996 |
| 2 | extra_trees_all | 0.5490 |
| 3 | extra_trees_all_fs_simple | 0.5444 |
| 4 | gradient_boosting_all_fs_simple | 0.5439 |
| 5 | gradient_boosting_all | 0.5436 |
To squeeze the best performance out of the final model, we can fine-tune its hyperparameters. We will generate 48 models from the initial gradient boosting configuration, following a grid search over relevant hyperparameters. Each model will be evaluated on each of the seven benchmarks, and the best overall model will be selected (using the same criteria as before).
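A minimal sketch of this tuning step. The grid below is an illustrative assumption that happens to yield 48 combinations, not the exact grid used in the notebook, and `benchmarks` / `evaluate` are hypothetical placeholders for the notebook's own evaluation helpers.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid

# 3 * 2 * 2 * 2 * 2 = 48 hyperparameter combinations (illustrative values).
param_grid = ParameterGrid({
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [100, 200],
    "max_depth": [3, 5],
    "subsample": [0.8, 1.0],
    "min_samples_leaf": [1, 5],
})

def tune(X_train, y_train, benchmarks, evaluate):
    """Fit one model per grid point and score it on every benchmark.

    `benchmarks` maps benchmark name -> evaluation data, and `evaluate`
    returns the average precision of a fitted model on one benchmark
    (both are hypothetical placeholders).
    """
    results = []
    for params in param_grid:
        model = GradientBoostingRegressor(random_state=0, **params).fit(X_train, y_train)
        scores = {name: evaluate(model, data) for name, data in benchmarks.items()}
        results.append({"params": params, **scores})
    return results
```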