Model development

Link to notebook. The notebook includes all the code needed to generate the data and the plots. Nonetheless, an iteration of this data is already included here, under the k_justification folder.

This notebook presents the process of defining and testing several models before selecting one of them to be part of the final system. We define four base models:

  1. All metrics (all). This model contains all the original profile metrics (over 60); it is likely to overfit, as the full set carries substantial redundancy and unnecessary information.
  2. All metrics + lightweight feature selection (all_fs_simple). We apply a simple feature selection step that removes irrelevant features via an embedded method. The goal is to discard only clearly redundant features, without spending much time or altering the core of the model.
  3. All metrics + in-depth feature selection (all_fs_deep). For this model we test a multi-layer (several filter, wrapper and embedded tasks) and aggressive feature selection process, whose goal is to obtain a far superior subset of metrics. The details of this process are described below.
  4. Custom set of metrics (custom). For the final model we select a custom subset of metrics, based on previous experimentation.

Feature selection is a standard process in data science and machine learning, as many features in a given dataset tend to be redundant or uninformative. Reducing the feature set to the most relevant metrics lowers the complexity of the model and prevents overfitting to superfluous variables. Our initial set comprised 65 different metrics, so the likelihood that many of them were irrelevant was high; hence the introduction of a feature selection process. Our goal is to keep only highly meaningful features that are likely to remain relevant when computed on external benchmarks.

To improve the robustness of our approach, we conducted a new, more thorough feature selection process organized in several layers (a code sketch follows the list):

  1. We employ filter methods (extremely fast, based on relationships between the attributes) to remove clearly unhelpful features without requiring model intervention. This layer comprises three substeps:
    • Variability analysis: features with less than 1% variability are removed, as such a narrow range of values is highly unlikely to provide the model with useful patterns for prediction.
    • Redundancy analysis: we remove highly redundant features (above 90% correlation).
    • Relevancy analysis: we remove features whose Mutual Information (MI) score w.r.t. the target variable is below 0.01, as such weak associations are very unlikely to carry meaningful signal.
  2. We use the inherent capacity of our ensemble-based models (i.e. an embedded approach) to assign relevance scores to each feature based on how much it contributes to the prediction. Features under a given “relevancy threshold” are pruned. This provides a quick filter to remove features that the model deems not useful. More precisely, we remove those features whose relevance w.r.t. the model is less than 0.0001%.
  3. We employ a costly wrapper method to prune further irrelevant features, now based on an exhaustive cross-validation task. This is feasible at this stage because the previous layers removed more than half of the original features, so even though the wrapper process is still costly, it remains tractable. More precisely, we employ Recursive Feature Elimination with Cross-Validation (RFECV), with 5-fold cross-validation, removing a single feature at each step.
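
The following is a minimal sketch of how these three layers could be chained with scikit-learn, assuming a numeric feature matrix X (a pandas DataFrame) and a target series y. The thresholds match those stated above, but the exact definition of "variability" (coefficient of variation here), the choice of estimator and the helper name select_features are illustrative assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV, mutual_info_regression

def select_features(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # Layer 1a: variability analysis -- drop features with less than 1%
    # relative variability (assumed here to mean std / |mean|).
    variability = X.std() / (X.mean().abs() + 1e-12)
    X = X.loc[:, variability > 0.01]

    # Layer 1b: redundancy analysis -- drop one feature of every pair
    # whose absolute pairwise correlation exceeds 0.9.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

    # Layer 1c: relevancy analysis -- drop features whose Mutual
    # Information score w.r.t. the target is below 0.01.
    mi = mutual_info_regression(X, y, random_state=0)
    X = X.loc[:, mi >= 0.01]

    # Layer 2: embedded method -- prune features whose impurity-based
    # importance falls below the relevancy threshold (0.0001% = 1e-6).
    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    X = X.loc[:, model.feature_importances_ > 1e-6]

    # Layer 3: wrapper method -- RFECV with 5-fold cross-validation,
    # removing a single feature at each step.
    rfecv = RFECV(GradientBoostingRegressor(random_state=0), step=1,
                  cv=5, scoring="neg_root_mean_squared_error").fit(X, y)
    return X.loc[:, rfecv.support_]
```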

These models are instantiated with up to 17 base regressors. We test all possible combinations on our benchmark and select the 20 best models based on a conjoined analysis of RMSE, MAE and MedAE scores. This subset is then tested on all the remaining benchmarks, to ensure that no model that underperformed at the creation stage actually outperforms the others when generalizing elsewhere. The results are clear: the Gradient Boosting regressor with custom features performs best, both on our benchmark and on the others.
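
As a rough illustration of the selection step, the sketch below fits each candidate, scores it with RMSE, MAE and MedAE on a hold-out set, and keeps the 20 best. The aggregation rule used here (averaging per-metric ranks) and the helper name rank_models are assumptions; the exact form of the conjoined analysis lives in the notebook.

```python
import pandas as pd
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)

def rank_models(models: dict, X_train, y_train, X_test, y_test, top_k=20):
    rows = []
    for name, model in models.items():
        preds = model.fit(X_train, y_train).predict(X_test)
        rows.append({
            "model": name,
            "rmse": mean_squared_error(y_test, preds) ** 0.5,
            "mae": mean_absolute_error(y_test, preds),
            "medae": median_absolute_error(y_test, preds),
        })
    scores = pd.DataFrame(rows).set_index("model")
    # Conjoined analysis (assumed): rank per metric (lower error is
    # better) and average the three ranks; keep the top_k models.
    scores["avg_rank"] = scores.rank().mean(axis=1)
    return scores.sort_values("avg_rank").head(top_k)
```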

To better illustrate this, we took the best performing models from the training stage and tested them on the external benchmarks; the top 5 per benchmark are shown below.

Top 5 Model Rankings: Santos Small

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.9586
2     gradient_boosting_all            0.9490
3     gradient_boosting_all_fs_simple  0.9443
4     extra_trees_all                  0.9386
5     extra_trees_all_fs_simple        0.9369

Top 5 Model Rankings: TUS Small

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.8875
2     catboost_all_fs_deep             0.8676
3     catboost_custom                  0.8425
4     gradient_boosting_all            0.8356
5     catboost_all_fs_simple           0.8272

Top 5 Model Rankings: TUS Big

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.9267
2     gradient_boosting_all            0.9155
3     gradient_boosting_all_fs_simple  0.9143
4     extra_trees_all                  0.9012
5     extra_trees_all_fs_deep          0.9011

Top 5 Model Rankings: D3L

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.7788
2     catboost_custom                  0.6327
3     catboost_all_fs_deep             0.6283
4     xgboosting_custom                0.6259
5     gradient_boosting_all_fs_simple  0.6054

Top 5 Model Rankings: Freyja

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.9624
2     gradient_boosting_all            0.9387
3     gradient_boosting_all_fs_simple  0.9363
4     gradient_boosting_all_fs_deep    0.9213
5     xgboosting_custom                0.8898

Top 5 Model Rankings: OM CG

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.5763
2     catboost_custom                  0.5552
3     extra_trees_all_fs_deep          0.5526
4     extra_trees_all                  0.5483
5     extra_trees_custom               0.5448

Top 5 Model Rankings: OM CR

Rank  Model                            Avg Precision
1     gradient_boosting_custom         0.5996
2     extra_trees_all                  0.5490
3     extra_trees_all_fs_simple        0.5444
4     gradient_boosting_all_fs_simple  0.5439
5     gradient_boosting_all            0.5436

To squeeze the best performance out of the final model, we fine-tune it to obtain a better set of hyperparameters. We generate 48 models from the initial gradient boosting configuration, following a grid search over relevant parameters. Each model is evaluated on each of the seven benchmarks, and we select the best overall model (same procedure as before).
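
The model names in the tables below encode the grid values (ne = n_estimators, lr = learning_rate, md = max_depth, ss = subsample, msl = min_samples_leaf). Assuming scikit-learn's GradientBoostingRegressor, the 48 configurations could be generated as follows; the grid itself is inferred from those names:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid

# 3 * 2 * 2 * 2 * 2 = 48 configurations.
param_grid = {
    "n_estimators": [25, 50, 100],   # ne
    "learning_rate": [0.05, 0.1],    # lr
    "max_depth": [3, 5],             # md
    "subsample": [0.8, 1.0],         # ss
    "min_samples_leaf": [1, 10],     # msl
}

models = {
    "gradient_boosting_ne{n_estimators}_lr{learning_rate}_md{max_depth}"
    "_ss{subsample}_msl{min_samples_leaf}".format(**params):
        GradientBoostingRegressor(random_state=0, **params)
    for params in ParameterGrid(param_grid)
}
assert len(models) == 48
```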

Top 5 Model Rankings: Santos Small

Rank  Model                                           Avg Precision
1     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl1     0.9719
2     gradient_boosting_ne50_lr0.05_md3_ss1.0_msl1    0.9714
3     gradient_boosting_ne50_lr0.05_md3_ss1.0_msl10   0.9712
4     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl10    0.9706
5     gradient_boosting_ne50_lr0.05_md3_ss0.8_msl1    0.9659

Top 5 Model Rankings: TUS Small

Rank  Model                                           Avg Precision
1     gradient_boosting_ne50_lr0.1_md3_ss0.8_msl10    0.8958
2     gradient_boosting_ne50_lr0.05_md5_ss0.8_msl1    0.8950
3     gradient_boosting_ne50_lr0.05_md5_ss0.8_msl10   0.8944
4     gradient_boosting_ne100_lr0.1_md3_ss0.8_msl10   0.8936
5     gradient_boosting_ne25_lr0.1_md5_ss1.0_msl1     0.8936

Top 5 Model Rankings: TUS Big

Rank  Model                                           Avg Precision
1     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl10    0.9381
2     gradient_boosting_ne50_lr0.1_md3_ss0.8_msl10    0.9358
3     gradient_boosting_ne50_lr0.05_md3_ss1.0_msl10   0.9351
4     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl1     0.9350
5     gradient_boosting_ne50_lr0.05_md3_ss1.0_msl1    0.9326

Top 5 Model Rankings: D3L

Rank  Model                                           Avg Precision
1     gradient_boosting_ne50_lr0.1_md3_ss0.8_msl10    0.8005
2     gradient_boosting_ne50_lr0.05_md5_ss0.8_msl1    0.7993
3     gradient_boosting_ne100_lr0.05_md3_ss0.8_msl10  0.7980
4     gradient_boosting_ne50_lr0.1_md3_ss1.0_msl10    0.7930
5     gradient_boosting_ne100_lr0.05_md3_ss1.0_msl10  0.7843

Top 5 Model Rankings: Freyja

Rank  Model                                           Avg Precision
1     gradient_boosting_ne50_lr0.1_md3_ss1.0_msl1     0.9624
2     gradient_boosting_ne25_lr0.1_md3_ss0.8_msl1     0.9623
3     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl1     0.9578
4     gradient_boosting_ne100_lr0.05_md3_ss0.8_msl1   0.9576
5     gradient_boosting_ne100_lr0.05_md3_ss1.0_msl10  0.9570

Top 5 Model Rankings: OM CG

Rank  Model                                           Avg Precision
1     gradient_boosting_ne100_lr0.05_md3_ss0.8_msl10  0.6111
2     gradient_boosting_ne100_lr0.1_md3_ss0.8_msl10   0.6018
3     gradient_boosting_ne50_lr0.1_md3_ss0.8_msl1     0.5990
4     gradient_boosting_ne100_lr0.1_md3_ss1.0_msl10   0.5947
5     gradient_boosting_ne100_lr0.1_md3_ss1.0_msl1    0.5942

Top 5 Model Rankings: OM CR

Rank  Model                                           Avg Precision
1     gradient_boosting_ne100_lr0.05_md3_ss1.0_msl10  0.6089
2     gradient_boosting_ne100_lr0.05_md3_ss0.8_msl1   0.6034
3     gradient_boosting_ne50_lr0.05_md3_ss1.0_msl10   0.5997
4     gradient_boosting_ne50_lr0.1_md3_ss1.0_msl1     0.5996
5     gradient_boosting_ne25_lr0.1_md3_ss1.0_msl10    0.5994


The three best overall models are:
Rank  Model                                           Avg Precision
1     gradient_boosting_ne100_lr0.05_md3_ss0.8_msl10  0.8176
2     gradient_boosting_ne100_lr0.05_md3_ss1.0_msl10  0.8163
3     gradient_boosting_ne50_lr0.1_md3_ss0.8_msl10    0.8157
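
For reference, a minimal sketch of how this overall ranking could be computed, assuming the overall score is the plain mean of the per-benchmark average precision values (the aggregation rule is an assumption):

```python
import pandas as pd

def overall_ranking(per_benchmark: pd.DataFrame, top_k: int = 3) -> pd.Series:
    # per_benchmark: rows = models, columns = the seven benchmarks,
    # values = average precision. Overall score = mean across benchmarks.
    return per_benchmark.mean(axis=1).sort_values(ascending=False).head(top_k)
```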


last updated: 2026/01/07