Hello Randy,
Excellent article and great results. I would like to mention something I noticed in your code that might leak some information from the test set into the training set. I don't know how much this would change your results.
Usually when I scale data I follow this order:
- Create train and test set
- Scale data using train set
- Scale test set data based on training set scaling
To be as specific as possible:
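from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.30, random_state=12345)
# Initialize the StandardScaler and fit it on the training set only
std_scale = StandardScaler().fit(X_train)
# Scale both sets with the parameters learned from the training set
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)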
You followed this order:
- Scale data
- Create train and test set
If the effects of this data leakage are large, it will give you artificially good performance on your test set. If you think about it, you want your test set to share no information with your training set. By scaling first and splitting later, the mean and standard deviation used for scaling are computed from all rows, including the ones that end up in the test set, so your train and test sets are tied together by the same scaling parameters.
I noticed this after spending a few hours trying to replicate your results with my own code. I can get a high recall score, but my precision on the test set is very poor (less than 6%). I came to the conclusion that we built our train and test sets entirely differently.
This is how I build my train and test set:
- Create train and test set by randomly sampling data without replacement.
- Scale data based on train set.
- Balance train set and leave the test set untouched.
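To make those three steps concrete, here is a minimal sketch of how I would write them. The synthetic data, the variable names, and the choice of random undersampling to balance the training set are all assumptions for illustration. They are not taken from your code, and other balancing methods would fit the same outline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)

# Synthetic stand-in data (hypothetical): 10,000 rows, roughly 2% positives.
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.02).astype(int)

# 1. Create train and test sets by random sampling without replacement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=12345)

# 2. Fit the scaler on the training set only, then apply it to both sets.
std_scale = StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

# 3. Balance the training set only (assumed method: random undersampling
#    of the majority class). The test set keeps its natural class ratio.
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
bal_idx = np.concatenate([pos_idx, neg_keep])
rng.shuffle(bal_idx)

X_train_bal = X_train_std[bal_idx]
y_train_bal = y_train[bal_idx]
```

A model would then be fit on X_train_bal and y_train_bal, and evaluated on X_test_std and y_test, which still have the original imbalance.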
The way you build your test set is a bit artificial, in the sense that in a real-world scenario you won't have a nicely balanced test set. You will be getting all kinds of normal transactions that might appear fraudulent. For example, consider the size of your test set: about 300 cases, of which 150 are not fraud and 150 are fraud. I find it hard to believe that those 150 non-fraudulent cases really capture the essence of the 284,000+ non-fraudulent cases, which is why I didn't balance the test set. However, I'm not entirely sure my approach is correct. If you continue working on this problem and have any improvements, please share them with us.
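To illustrate why a balanced test set can flatter precision, here is a small arithmetic sketch. The recall of 0.90, the false positive rate of 0.03, and the 85,000 non-fraud test cases are hypothetical numbers I picked for illustration, not measurements from either of our models. The helper precision_from_rates is likewise just a name I made up.

```python
def precision_from_rates(recall, false_positive_rate, n_pos, n_neg):
    """Precision implied by a fixed recall and false positive rate
    on a test set with n_pos positive and n_neg negative cases."""
    true_positives = recall * n_pos
    false_positives = false_positive_rate * n_neg
    return true_positives / (true_positives + false_positives)

# Hypothetical classifier: catches 90% of fraud, flags 3% of normal cases.
recall, fpr = 0.90, 0.03

# Balanced test set: 150 fraud, 150 non-fraud.
balanced = precision_from_rates(recall, fpr, n_pos=150, n_neg=150)

# Unbalanced test set: same 150 fraud, but 85,000 non-fraud (assumed size).
unbalanced = precision_from_rates(recall, fpr, n_pos=150, n_neg=85_000)

print(f"balanced precision:   {balanced:.3f}")
print(f"unbalanced precision: {unbalanced:.3f}")
```

With these assumed rates, the same classifier goes from a precision of about 0.97 on the balanced set to about 0.05 on the unbalanced one, simply because the number of false positives grows with the number of normal transactions while the number of true positives stays fixed.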
Thanks for introducing me to such a cool data set.
-Frank