Frank Ceballos
2 min read · Mar 6, 2020


David, I just reproduced your result using the data I presented in this article. Something funny is going on with XGBoost. With a Random Forest or an Extra Trees classifier, the Boruta algorithm selects a reasonable number of features (10–15). With XGBoost at its default parameters, the relevant-feature part of the pipeline selected only one feature. My guess is that XGBoost is grossly overfitting the data, and if so, its feature ranking is essentially random and meaningless.

When I wrote the relevant-feature part of the pipeline, I decided to cross-validate the results. The data is split into folds (5, for example), the relevant features are determined separately within each fold, and a feature is kept only if it was relevant in every fold. If the relevant features differ from fold to fold, the output will be a small number of features, or none at all. My guess is that this is what happens when the ranking is meaningless: different features get flagged as ‘relevant’ whenever the data changes slightly.
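The "relevant in every fold" rule is just a set intersection, which is why unstable selections collapse to few or no features. Here is a minimal sketch of that rule alone, not the pipeline's actual code; the function name `relevant_in_every_fold` and the feature names are hypothetical, and the per-fold selections are made up for illustration.

```python
from typing import Iterable, List, Set


def relevant_in_every_fold(fold_selections: Iterable[Iterable[str]]) -> Set[str]:
    """Return the features that were selected in every single fold."""
    selections: List[Set[str]] = [set(s) for s in fold_selections]
    if not selections:
        return set()
    return set.intersection(*selections)


# Made-up selections: a stable selector picks nearly the same features each time.
stable_folds = [{"f1", "f2", "f3", "f7"}, {"f1", "f2", "f3"}, {"f1", "f2", "f3", "f9"}]
# Made-up selections: an unstable selector picks something different in each fold.
unstable_folds = [{"f1", "f4"}, {"f2", "f5"}, {"f3", "f4"}]

print(sorted(relevant_in_every_fold(stable_folds)))    # ['f1', 'f2', 'f3']
print(sorted(relevant_in_every_fold(unstable_folds)))  # []
```

With five folds the intersection is stricter still, so a selector whose picks drift between folds will report very few features even if each fold individually selected many.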

To test this, add some shadow features to your data. To create a shadow feature, copy one of your features and randomly shuffle its values across the rows. I suggest that about 5–10% of your features be shadow features for this test. Because of the shuffling, these shadow features shouldn’t be relevant to the outcome. Then train XGBoost on your data plus the shadow features and create something like Figure 10 to see how the shadow features are ranked. Next, randomly drop 20% of your rows and repeat the procedure. Do this one or two more times.
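As a rough sketch of the data preparation for this test, the two helpers below use only the Python standard library and assume the data is held as a dict mapping column names to lists of values. The names `add_shadow_features` and `drop_rows`, the `shadow_` prefix, and the default fractions are my own choices for illustration. Training the classifier and plotting the ranking are left out.

```python
import random
from typing import Dict, List, Tuple

Table = Dict[str, List[float]]  # column name -> values, one entry per row


def add_shadow_features(table: Table, fraction: float = 0.1, seed: int = 0) -> Table:
    """Return a copy of `table` with shuffled copies of a few randomly chosen columns."""
    rng = random.Random(seed)
    names = list(table)
    n_shadow = max(1, round(fraction * len(names)))  # always add at least one
    out = {name: list(values) for name, values in table.items()}
    for name in rng.sample(names, n_shadow):
        shuffled = list(table[name])
        rng.shuffle(shuffled)  # breaks any link between this column and the target
        out[f"shadow_{name}"] = shuffled
    return out


def drop_rows(
    table: Table, target: List[int], fraction: float = 0.2, seed: int = 0
) -> Tuple[Table, List[int]]:
    """Randomly drop `fraction` of the rows from every column and from the target."""
    rng = random.Random(seed)
    n_rows = len(target)
    n_keep = n_rows - round(fraction * n_rows)
    keep = sorted(rng.sample(range(n_rows), n_keep))
    sub = {name: [values[i] for i in keep] for name, values in table.items()}
    return sub, [target[i] for i in keep]


# Toy data: 20 numeric columns, 50 rows, and a binary target.
data_rng = random.Random(42)
toy = {f"f{j}": [data_rng.random() for _ in range(50)] for j in range(20)}
y = [data_rng.randint(0, 1) for _ in range(50)]

with_shadows = add_shadow_features(toy, fraction=0.1, seed=1)
subset, y_subset = drop_rows(with_shadows, y, fraction=0.2, seed=1)
print(len(with_shadows), len(y_subset))  # 22 columns, 40 rows
```

Using a different `seed` for each repetition gives a different random subset of rows, which is what makes the later comparison of rankings informative.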

Then compare how the features are ranked across the different datasets with shadow features. If the rankings are completely different, my guess is that XGBoost is overfitting pretty badly. And if you see any shadow features ranked highly, that is a strong sign of overfitting. Try to reduce the overfitting and run the test again.
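One possible way to make this comparison less of an eyeball exercise is to measure how much the top-k features overlap between two runs and to list any shadow features that reach the top k. The sketch below assumes you already have each run's importances as a dict of feature name to score, and that shadow features carry a `shadow_` prefix. Both are assumptions of mine, the helper names are hypothetical, and the numbers are invented for illustration.

```python
from typing import Dict, List


def rank_features(importances: Dict[str, float]) -> List[str]:
    """Feature names sorted from most to least important."""
    return sorted(importances, key=lambda name: importances[name], reverse=True)


def top_k_overlap(rank_a: List[str], rank_b: List[str], k: int) -> float:
    """Jaccard overlap of the top-k features of two rankings (1.0 means identical sets); k >= 1."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    return len(a & b) / len(a | b)


def shadows_in_top_k(ranking: List[str], k: int, prefix: str = "shadow_") -> List[str]:
    """Shadow features that made it into the top k of a ranking."""
    return [name for name in ranking[:k] if name.startswith(prefix)]


# Invented importances from two hypothetical runs on different row subsets.
run_a = {"f1": 0.40, "f2": 0.30, "f3": 0.15, "shadow_f2": 0.10, "f4": 0.05}
run_b = {"f1": 0.35, "f3": 0.30, "f2": 0.20, "f4": 0.10, "shadow_f2": 0.05}

rank_a, rank_b = rank_features(run_a), rank_features(run_b)
print(top_k_overlap(rank_a, rank_b, k=3))  # 1.0
print(shadows_in_top_k(rank_a, k=3))       # []
```

An overlap near 1.0 across runs with no shadow features near the top suggests a stable ranking. A low overlap, or shadow features showing up in the top k, is the pattern described above.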

Good luck!
