Common Statistical Pitfalls

Which common statistical pitfalls can trap us in machine learning? You may already know some of them as trade-offs, general biases, or cognitive nuances.

It's a list of reminders worth repeating.

💣 Correlation is not Causation

👉 Resist the inclination to explain correlated variables as though one causes the other. A hidden third variable may be driving both.
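To make this concrete, here is a minimal sketch on simulated data. The variable names (temperature, ice cream sales, sunburns) and all the numbers are made up for illustration. A hidden confounder drives two variables, so they correlate strongly, yet setting one of them independently leaves the other unchanged.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical confounder: temperature drives BOTH variables below.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 0.5) for t in temperature]
sunburns = [t + rng.gauss(0, 0.5) for t in temperature]

# Observational data: the two variables move together.
observed = pearson(ice_cream, sunburns)

# Simulated intervention: ice cream sales are set at random, independent
# of temperature. Sunburns still depend only on temperature.
forced_ice_cream = [rng.gauss(0, 1) for _ in range(n)]
sunburns_after = [t + rng.gauss(0, 0.5) for t in temperature]
intervened = pearson(forced_ice_cream, sunburns_after)

print(f"observed correlation:   {observed:.2f}")
print(f"after the intervention: {intervened:.2f}")
```

With these settings the observed correlation should come out high and the post-intervention one close to zero. In real projects we usually cannot intervene this cleanly, which is exactly why the causal story needs extra scrutiny.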

💣 Beware of Non-Representative Samples and Non-Representative Training Data

👉 Carefully examine whether bad data is giving you false comfort. Estimates from a skewed sample can look precise and still be wrong.
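A small simulation shows how quietly this can go wrong. Everything here is an assumption for illustration: a made-up population with an "urban" and a "rural" group, and a convenience sample that only reaches the urban group.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 70% urban (values around 50), 30% rural (around 30).
population = [("urban", rng.gauss(50, 5)) for _ in range(7000)]
population += [("rural", rng.gauss(30, 5)) for _ in range(3000)]

true_mean = mean(value for _, value in population)

# A simple random sample covers both groups in roughly the right proportions.
random_sample = rng.sample(population, 500)
random_mean = mean(value for _, value in random_sample)

# A convenience sample that only ever reaches urban respondents.
urban_only = [row for row in population if row[0] == "urban"]
biased_sample = rng.sample(urban_only, 500)
biased_mean = mean(value for _, value in biased_sample)

# A cheap sanity check: compare group shares in the sample and the population.
rural_share_population = sum(g == "rural" for g, _ in population) / len(population)
rural_share_biased = sum(g == "rural" for g, _ in biased_sample) / len(biased_sample)

print(f"true mean:          {true_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
print(f"rural share: population {rural_share_population:.0%}, "
      f"biased sample {rural_share_biased:.0%}")
```

The biased sample is the same size as the random one, so nothing about its size warns us. Comparing group shares against what we know about the population is one simple way to catch the problem.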

💣 Oooof… Data Leakage!

👉 Ensure that every feature used to train the model WILL also be available at prediction time! (There are two types of data leakage: target leakage, where a feature carries information that is only known once the outcome is known, and train-test contamination, where test data influences training.)
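Both kinds of leakage can be sketched in a few lines. This is a toy setup, not a recipe: the labels are coin flips, the "model" is a deliberately naive memorizer, and the feature names in the second part are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical records: (row_id, feature, label). Labels are coin flips,
# so no honest model should beat roughly 50% accuracy on unseen rows.
rows = [(i, rng.random(), rng.randint(0, 1)) for i in range(2000)]
train, test = rows[:1500], rows[1500:]

def fit_memorizer(train_rows):
    """A deliberately naive 'model' that memorizes feature -> label."""
    return {feature: label for _, feature, label in train_rows}

def accuracy(model, test_rows):
    # Unseen feature values fall back to predicting 0.
    hits = sum(model.get(feature, 0) == label for _, feature, label in test_rows)
    return hits / len(test_rows)

def overlap(train_rows, test_rows):
    """Row ids that appear in both sets (train-test contamination)."""
    train_ids = {row_id for row_id, _, _ in train_rows}
    return [row_id for row_id, _, _ in test_rows if row_id in train_ids]

model = fit_memorizer(train)
clean_acc = accuracy(model, test)

# Train-test contamination: half of this test set was also seen in training.
contaminated_test = test[:250] + train[:250]
contaminated_acc = accuracy(model, contaminated_test)

print(f"clean test accuracy:        {clean_acc:.2f}")
print(f"contaminated test accuracy: {contaminated_acc:.2f}")
print(f"overlapping rows: clean {len(overlap(train, test))}, "
      f"contaminated {len(overlap(train, contaminated_test))}")

# Target leakage: keep only features that are known BEFORE the prediction
# is made. The names and flags below are hypothetical.
available_at_prediction = {"age": True, "plan_type": True, "refund_issued": False}
safe_features = [name for name, ok in available_at_prediction.items() if ok]
print("features safe to train on:", safe_features)
```

The contaminated score looks better even though the model learned nothing, which is what makes leakage so tempting to miss. A row-id overlap check like the one above is cheap to run before trusting any test metric.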

💣 The best for last… Overfitting

👉 Be rigorous about examining the model's ability to predict both the training data and new observations it has not seen. A large gap between the two is the warning sign.
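Here is a minimal sketch of that check, assuming a made-up one-dimensional classification task with 20% label noise and a hand-rolled k-nearest-neighbors classifier. With k=1 the model memorizes the training set, noise included; a larger k smooths the noise out.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def make_data(n):
    """Hypothetical task: label is 1 when x > 0.5, with 20% of labels flipped."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

def knn_predict(train_rows, x, k):
    """Majority vote among the k training points closest to x (k is odd)."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train_rows, rows, k):
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in rows)
    return hits / len(rows)

train, test = make_data(600), make_data(600)

results = {}
for k in (1, 25):
    train_acc = accuracy(train, train, k)  # each point is its own neighbor
    test_acc = accuracy(train, test, k)    # new, unseen observations
    results[k] = (train_acc, test_acc)
    print(f"k={k:>2}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

With k=1 the training accuracy is perfect while the test accuracy is clearly lower. With k=25 the two numbers should land close together. Looking at only the training score would have picked the wrong model.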

What other pitfalls should we keep in mind?

Check out what others have said about this subject on LinkedIn.
