The Datamatics

Showing posts from August, 2022

Day 5 - Performance metrics in Machine Learning - Classification

- August 31, 2022

By Jerin Lalichan Performance metrics in ML Evaluating the performance of a model is important. Performance metrics are measures that quantify how well a model performs during the training and testing phases. In machine learning, there are generally two kinds of performance metrics in use: those for regression models and those for classification models. Below are the most popular metrics in use: Classification Metrics Confusion Matrix (not a metric, but the basis for others) It is a tabulation of ground truth vs. predicted values in the form of a matrix. It is not exactly a performance metric but forms the basis for other metrics. Each cell holds one count, which serves as an evaluation factor. TP This indicates how many positive cases are predicted correctly. FP This indicates the number of cases in which the value is actually negative but predicted as positive. This factor represents Type-I error in statistics. FN This indicates the val...
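The cell counts described in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `confusion_counts`, the label encoding (1 for positive, 0 for negative), and the sample labels are all hypothetical and not taken from the original post.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count the four confusion-matrix cells for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # positive case predicted correctly
        elif predicted == positive:
            fp += 1  # actually negative, predicted positive (Type-I error)
        elif actual == positive:
            fn += 1  # actually positive, predicted negative
        else:
            tn += 1  # negative case predicted correctly
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical labels, made up for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

Metrics derived from the confusion matrix are ratios of these four counts, which is why the matrix is treated as a basis rather than a metric in its own right.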

Day 4 - Performance metrics in Machine Learning - Regression

- August 30, 2022

By Jerin Lalichan Performance metrics in ML Evaluating the performance of a model is important. Performance metrics are measures that quantify how well a model performs during the training and testing phases. In machine learning, there are generally two kinds of performance metrics in use: those for regression models and those for classification models. Below are the most popular metrics in use: Regression Metrics 1. Mean Squared Error (MSE) It is simply the average of the squared differences between the actual and predicted values. Because of the squaring, large errors are weighted disproportionately. For the same reason, MSE is very sensitive to outliers. 2. Mean Absolute Error (MAE) Mean Absolute Error is the average of the absolute differences between the ground truth (actual values) and the predicted values. Since there is no squaring, large errors are not exaggerated. Also, ...
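To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The function names and the sample values are assumptions chosen for illustration; the last prediction is deliberately far off so that the effect of squaring is visible.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: the last prediction misses by 10 and acts like an outlier.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 17.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```

In this made-up example the single large miss contributes 100 of the 105 total squared error but only 10 of the 13 total absolute error, which is the sensitivity to outliers described above.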

Day 3 - Advantages and disadvantages of Linear regression

- August 29, 2022

By Jerin Lalichan 1. Advantages of Linear Regression It is easy: Implementing linear regression is easy, and the output coefficients are simple to interpret. Over-fitting is easy to avoid: Over-fitting can occur with linear regression; however, it can be prevented with cross-validation, regularisation (L1 and L2), and dimensionality reduction techniques. Image credit: javatpoint.com 2. Disadvantages of Linear Regression It assumes a linear relationship: A linear model is one in which the independent variables are assumed to linearly explain the dependent variable, such as a = bx + c. No exponentials, logarithms, powers, etc. are permitted. Even though this is a great simplification, the real world is not linear. Using a linear model therefore either requires us to ignore some patterns or forces us to apply complex transformations to achieve a linear representation. Data must be independent: In the general case, that is not...

Day 2 - Gradient descent - #66DaysofData Challenge

- August 28, 2022

By Jerin Lalichan 1. Gradient Descent Gradient descent is an optimization algorithm that is commonly used to train machine learning models and neural networks. Optimization here means minimizing the cost function of the model. Types of Gradient descent Batch Gradient descent Here, all the training samples are used for the calculation in each iteration, which makes it time-consuming on a large dataset, i.e., it uses all samples for one forward pass and then adjusts the weights. Stochastic Gradient descent Here, not all the samples but one randomly selected sample goes through the forward pass, and then the weights are adjusted. Mini-batch Gradient descent Here, a randomly selected batch of samples goes through the forward pass, and then the weights are adjusted. Gradient descent in Linear Regression In linear regression, the cost function is the mean squared error or the root mean squared error. The parameters (θ) are the Slope and ...
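As a rough sketch of the batch variant applied to linear regression, the snippet below fits a line by minimizing the mean squared error. Everything specific here is an assumption made for illustration rather than the post's own code: the function name, the learning rate, the number of epochs, the zero initialization, and the toy data generated from y = 2x + 1.

```python
def batch_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = slope * x + intercept by minimizing MSE over the full dataset."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass over ALL samples (this is what makes it "batch").
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to each parameter.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        # Step against the gradient.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical noise-free data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = batch_gradient_descent(xs, ys)
print(round(slope, 4), round(intercept, 4))
```

The stochastic and mini-batch variants would differ only in which samples feed the `errors` list on each update: a single random sample, or a random subset, instead of the whole dataset.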

Day 1 - Normal Distribution, Z-score, Standard Normal distribution - #66DaysofData Challenge

- August 27, 2022

By Jerin Lalichan 1. Normal Distribution (Gaussian Distribution) How data is distributed can tell us a lot about it. There are over 20 different types of data distributions commonly used in data science to develop a model. The normal distribution is one of the most common distributions seen in nature, for example, in the heights of human beings. It is also called the bell curve. Image credits: Statistics by Jim Properties: The parameters of a bell curve are: Mean, which determines where the peak of the curve is centered. Standard deviation, which determines how wide the curve is. The mean, median, and mode are the same. The distribution is symmetrical; in other words, the data points are symmetrically distributed around the mean value. 2. Z score It is a number that represents how far a point lies from the mean, measured in standard deviations. So farther the ...
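The z-score described above can be sketched with Python's standard library. The sample heights are made up, and the choice of the population standard deviation (`statistics.pstdev`) rather than the sample version is an assumption for this illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return, for each value, its distance from the mean in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption here)
    return [(v - mu) / sigma for v in values]


# Hypothetical heights in centimetres.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
scores = z_scores(heights)
print([round(z, 3) for z in scores])
```

A value equal to the mean gets a z-score of 0, and values on opposite sides of the mean at the same distance get z-scores of equal size and opposite sign.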

#66DaysOfData? Here is why you should also accept the challenge.

- August 26, 2022

By Jerin Lalichan "A 2009 study, published in the European Journal of Social Psychology found that it takes anywhere between 18 to 254 days for someone to develop a new habit. Additionally, the study found that, on average, it takes about 66 days for a new behavior to become automatic"

What exactly is Data Science? Who is a data scientist? Explained in a simple way.

- August 18, 2022

By Jerin Lalichan "Data Scientist" is one of the top three emerging jobs on LinkedIn. Over the past decade, job opportunities for data scientists have increased exponentially, making data science one of the hottest career options at present. If you don't know what exactly this buzzword is, keep reading.