Day 15 - Cross Validation

By Jerin Lalichan 


Cross Validation

    Cross-validation is a technique for assessing how well a statistical analysis generalizes to an independent data set. It evaluates machine learning models by training several models on different subsets of the available input data and testing each on the complementary subset. This makes overfitting easy to detect.

The different types of cross-validation techniques are:

    1. K-Fold Cross Validation

    2. Leave P-out Cross Validation

    3. Leave One-out Cross Validation

    4. Repeated Random Sub-sampling Method

    5. Holdout Method

Among these, K-Fold cross-validation is the most commonly used.
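
    If scikit-learn is used (an assumption on my part; the post does not name a library), each of these techniques maps onto a ready-made splitter. A minimal sketch:

        # Hedged sketch: assumes scikit-learn. Each splitter below corresponds
        # to one of the techniques listed above.
        from sklearn.model_selection import (
            KFold,             # 1. K-Fold Cross Validation
            LeavePOut,         # 2. Leave P-out Cross Validation
            LeaveOneOut,       # 3. Leave One-out Cross Validation
            ShuffleSplit,      # 4. Repeated Random Sub-sampling Method
            train_test_split,  # 5. Holdout Method (a single split)
        )

        kfold = KFold(n_splits=5, shuffle=True, random_state=42)
        lpo = LeavePOut(p=2)
        loo = LeaveOneOut()
        shuffle_split = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)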


Why do we need cross-validation?

    We usually split the dataset into training and testing sets. But the resulting accuracy and other metrics depend heavily on how the split is done: the shuffling, which part of the data ends up in the training set, and so on.
Hence, a single split does not represent the model's ability to generalize to unseen data. This leads to the need for cross-validation.
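
    A small illustration of this, assuming scikit-learn and its built-in Iris dataset (my choices, not the post's): the same model, scored on differently shuffled splits, reports noticeably different accuracies.

        # Hedged sketch: assumes scikit-learn; the dataset and model are
        # arbitrary choices for illustration.
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)

        # The reported accuracy changes with how the single split is shuffled.
        for seed in (0, 1, 2, 3, 4):
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.3, random_state=seed
            )
            model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
            print(f"random_state={seed}: test accuracy = {model.score(X_test, y_test):.3f}")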

K-Fold Cross Validation

    The first step is to separate the test dataset for the final evaluation. Cross-validation is to be performed on the training dataset only.     

(Figure: 5-fold cross-validation)



    The complete training data set is first divided into k equal parts. One part is set aside as the hold-out (validation) set, and the remaining k-1 parts are used to train the model. The hold-out set is then used to test the trained model. This procedure is repeated k times, with a different part serving as the hold-out set each time, so every data point is used for validation exactly once. A sketch of this procedure is shown below.
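
    A minimal sketch of this procedure, assuming scikit-learn (the Iris dataset and logistic regression are arbitrary choices): the test set is held out first, and 5-fold cross-validation runs on the training data only.

        # Hedged sketch: assumes scikit-learn; dataset and model are placeholders.
        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import KFold, train_test_split

        X, y = load_iris(return_X_y=True)

        # Step 1: set aside the test set for the final evaluation.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        # Step 2: split the training data into k parts and rotate the hold-out fold.
        kfold = KFold(n_splits=5, shuffle=True, random_state=42)
        fold_scores = []
        for fold, (train_idx, val_idx) in enumerate(kfold.split(X_train), start=1):
            model = LogisticRegression(max_iter=1000)
            model.fit(X_train[train_idx], y_train[train_idx])        # train on k-1 parts
            score = model.score(X_train[val_idx], y_train[val_idx])  # score on the hold-out part
            fold_scores.append(score)
            print(f"Fold {fold}: validation accuracy = {score:.3f}")

        print(f"Mean CV accuracy: {np.mean(fold_scores):.3f}")

        # Step 3: the untouched test set is used only once, for the final evaluation.
        final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print(f"Test accuracy: {final_model.score(X_test, y_test):.3f}")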

   The value of K is typically 3 or 5. Larger values such as 10 or 15 can also be used, but they require more computation and take longer to run.
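
    For what it's worth, scikit-learn's cross_val_score helper (again an assumption; the post does not mention it) reduces the choice of K to a single parameter, and each increase in K means that many more model fits:

        # Hedged sketch: assumes scikit-learn; cv=k fits the model k times.
        from sklearn.datasets import load_iris
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        X, y = load_iris(return_X_y=True)
        model = LogisticRegression(max_iter=1000)

        for k in (3, 5, 10):
            scores = cross_val_score(model, X, y, cv=k)  # k separate fits
            print(f"K={k}: mean accuracy = {scores.mean():.3f} over {k} folds")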





   
  I am doing a challenge, #66DaysofData, in which I will be learning something new from the Data Science field for 66 days, and I will be posting the daily topics on my LinkedIn, on my GitHub repository, and on my blog as well.


Stay Curious!  




