Understanding Outliers: Impact, Detection, and Remedies
Introduction
Outliers are data points that significantly deviate from the average or typical values within a dataset. These observations, though rare, can greatly influence statistical analyses and machine learning models if not properly addressed. In this blog, we will explore what outliers are, why they can be dangerous, their effects on machine learning models, and effective methods to detect and treat outliers.
What are Outliers?
Outliers are data points that lie far away from the majority of the data, either above or below the expected range. They can arise due to various reasons, such as measurement errors, experimental anomalies, or truly exceptional observations. Outliers can distort statistical analyses, affecting the accuracy and reliability of the results.
When are Outliers Dangerous?
Outliers can be particularly dangerous when they exert a disproportionate influence on the analysis or modeling results. They can skew measures of central tendency such as the mean (the median, by contrast, is relatively robust to them), leading to biased estimates. In regression analysis, outliers can significantly affect the slope and intercept of the regression line, distorting the relationship between variables. Outliers can also impact clustering algorithms by affecting the distance metrics and the formation of clusters.
Which machine learning models are more sensitive to outliers, and which ones are less affected by them?
Outliers can have a varying impact on different machine learning models. Some models are more sensitive to outliers, while others are more robust and less affected by extreme values. Here’s a breakdown:
Models Affected by Outliers:
a. Linear Regression: Linear regression models can be significantly affected by outliers, as they heavily rely on minimizing the squared differences between predicted and actual values. Outliers can introduce a high level of error and result in biased coefficient estimates.
b. K-means Clustering: K-means clustering can be influenced by outliers, as they can distort the distances between data points and the formation of clusters. Outliers can be mistakenly assigned to clusters or result in the creation of separate clusters.
c. Support Vector Machines (SVM): SVM models can be sensitive to outliers, especially in cases where the margin between classes is narrow. Outliers lying near the decision boundary can affect the placement of the boundary, leading to misclassifications.
Models Less Affected by Outliers:
a. Decision Trees: Decision trees are relatively robust to outliers. The hierarchical splitting process focuses on finding optimal splits based on impurity measures, such as Gini index or entropy, rather than on the exact values of individual data points.
b. Random Forests: Random forests, which are an ensemble of decision trees, are also less affected by outliers. The averaging of multiple trees helps mitigate the impact of outliers, as the errors introduced by outliers tend to average out.
c. Naive Bayes: Naive Bayes models are generally less sensitive to outliers since they make strong independence assumptions between features. Outliers might not disrupt the probabilistic calculations as much as in other models.
How can outliers be treated?
Outliers can be addressed using various techniques depending on the specific goals of the analysis and the nature of the data. It is important to choose an appropriate approach to maintain the integrity of the data while mitigating the impact of outliers.
Trimming: Trimming involves removing a certain percentage of extreme values from both ends of the data distribution. This approach discards the outliers entirely, which can reduce their influence on statistical analyses. By trimming the dataset, the extreme values are eliminated, and the analysis focuses on the majority of the data. However, it’s essential to carefully consider the percentage to be trimmed to avoid removing too much data.
Capping: Capping, also known as Winsorization, sets a predefined threshold beyond which the outlier values are replaced with the nearest acceptable value within that threshold. Capping prevents the complete removal of outliers and instead modifies their values to align them with the nearby observations. This approach helps control the impact of outliers while retaining their presence in the dataset. Capping can be done symmetrically, by capping both lower and upper extremes, or asymmetrically if there is a specific directionality to the outliers.
What are the approaches used for outlier detection?
Detecting outliers in different types of distributions is crucial for accurate data analysis. Several methods can be employed based on the distribution characteristics. Let’s explore the techniques and formulas used to detect outliers in various distributions:
Normal Distribution:
Normal distribution, also known as the Gaussian distribution or bell curve, is a statistical concept that describes a specific pattern of data distribution. In a normal distribution, the data is symmetrically distributed around the mean, creating a characteristic bell-shaped curve.
In a dataset that follows a normal distribution, the following rules apply to the bell curve:
1. The mean of the data represents the center of the bell curve and is also the highest point on the curve.
2. Approximately 68.3% of the data points fall within one standard deviation of the mean, covering the range from (Mean - Standard Deviation) to (Mean + Standard Deviation).
3. Roughly 95.5% of the data points lie within two standard deviations of the mean, encompassing the range from (Mean - 2 * Standard Deviation) to (Mean + 2 * Standard Deviation).
4. Almost 99.7% of the data points fall within three standard deviations of the mean, covering the range from (Mean - 3 * Standard Deviation) to (Mean + 3 * Standard Deviation).
These rules, often referred to as the empirical rule or the 68–95–99.7 rule, provide a guideline for understanding the distribution of data in a normal distribution and help identify the expected range of values around the mean.
In a normal distribution, outliers can be detected using the z-score approach. The formula for calculating the z-score is:

z = (x - μ) / σ

Here, z represents the z-score, x is the data point, μ is the mean, and σ is the standard deviation. A data point whose z-score exceeds a chosen threshold in absolute value (commonly 2 or 3) is flagged as an outlier.
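As a quick sketch of the z-score rule, here it is applied to a small illustrative sample (the values below are made up for demonstration, not taken from the Placement dataset):

```python
import numpy as np

# Small illustrative sample; 2.0 is an obvious low outlier
data = np.array([6.5, 7.0, 7.2, 6.8, 7.1, 6.9, 7.3, 2.0])

# z-score of each point: (x - mean) / std
z = (data - data.mean()) / data.std()

# Flag points more than 2 standard deviations from the mean
outliers = data[np.abs(z) > 2]
print(outliers)  # [2.]
```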
Let’s perform outlier detection and removal using Python.
Import the necessary libraries and the dataset. Here I am using the Placement dataset.
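A sketch of the loading step follows. The CSV itself lives in the linked repository, so the `placement.csv` filename is an assumption; to keep the sketch self-contained, it falls back to synthetic data with the same shape and column names when the file is absent:

```python
import numpy as np
import pandas as pd

try:
    # Hypothetical filename; the actual CSV is in the linked repository
    df = pd.read_csv('placement.csv')
except FileNotFoundError:
    # Synthetic stand-in mimicking the dataset's shape and distributions
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        'cgpa': rng.normal(6.96, 0.62, 1000),                # roughly normal
        'placement_exam_marks': rng.exponential(17.0, 1000), # right-skewed
        'placed': rng.integers(0, 2, 1000),
    })

print(df.shape)
```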
The dataset consists of 1000 rows and 3 columns.
Create a plot to visualize the distribution of data in the columns.
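A minimal plotting sketch (again on synthetic stand-in data, since the original CSV isn't bundled here):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Placement data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'cgpa': rng.normal(6.96, 0.62, 1000),
    'placement_exam_marks': rng.exponential(17.0, 1000),
})

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(df['cgpa'], bins=30)
axes[0].set_title('cgpa (roughly normal)')
axes[1].hist(df['placement_exam_marks'], bins=30)
axes[1].set_title('placement_exam_marks (right-skewed)')
fig.savefig('distributions.png')
```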
Upon observing the data, it becomes evident that the “cgpa” column follows a normal distribution, while the “placement_exam_marks” column exhibits right-skewness. As a result, the z-score approach is applicable only to the “cgpa” column.
Calculate the minimum, maximum, mean, and standard deviation for the “cgpa” column.
Determine the boundary values.
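The two steps above can be sketched as follows. This runs on synthetic stand-in data, so the printed numbers will differ from those reported for the actual dataset:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "cgpa" column
rng = np.random.default_rng(42)
df = pd.DataFrame({'cgpa': rng.normal(6.96, 0.62, 1000)})

mean, std = df['cgpa'].mean(), df['cgpa'].std()
print('min :', df['cgpa'].min())
print('max :', df['cgpa'].max())
print('mean:', mean)
print('std :', std)

# Boundary values at mean +/- 3 standard deviations
upper = mean + 3 * std
lower = mean - 3 * std
print('upper boundary:', upper)
print('lower boundary:', lower)
```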
The upper boundary is determined to be 8.8089, while the lower boundary is identified as 5.1135.
Let’s identify the outliers in our dataset by considering all values above 8.8 and below 5.11 as outliers.
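A sketch of the filtering step (on the same synthetic stand-in data):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "cgpa" column
rng = np.random.default_rng(42)
df = pd.DataFrame({'cgpa': rng.normal(6.96, 0.62, 1000)})

upper = df['cgpa'].mean() + 3 * df['cgpa'].std()
lower = df['cgpa'].mean() - 3 * df['cgpa'].std()

# Rows whose cgpa falls outside the 3-sigma boundaries
outliers = df[(df['cgpa'] > upper) | (df['cgpa'] < lower)]
print(outliers)
```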
Upon examining our dataset, we observe that there are three rows with values below the lower boundary and two rows with values above the upper boundary.
To address the outliers, we will employ the trimming approach and eliminate all rows with outliers in our dataset.
After applying the trimming approach, the dataset has been reduced to 995 rows and 3 columns, excluding the rows that contained outliers.
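The trimming step can be sketched as below (synthetic stand-in data, so the resulting row count will differ from the 995 reported for the actual dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "cgpa" column
rng = np.random.default_rng(42)
df = pd.DataFrame({'cgpa': rng.normal(6.96, 0.62, 1000)})

upper = df['cgpa'].mean() + 3 * df['cgpa'].std()
lower = df['cgpa'].mean() - 3 * df['cgpa'].std()

# Trimming: keep only rows inside the boundaries, dropping outliers entirely
trimmed = df[(df['cgpa'] >= lower) & (df['cgpa'] <= upper)]
print(trimmed.shape)
```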
Let’s proceed with the second approach, which involves applying the capping method to address the outliers in the dataset.
To handle outliers using the capping method, we will replace all values above the upper limit with the upper limit value and all values below the lower limit with the lower limit value.
The shape of the dataset remains unchanged after applying the capping method: because outlier values are replaced rather than removed, the number of rows and columns stays the same.
Now, let’s examine the minimum and maximum values of the dataset after applying the capping method. The maximum value should correspond to the highest boundary, while the minimum value should align with the lowest boundary.
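Capping and the follow-up check can be sketched together. Here `Series.clip` does the replacement in one call; as before, the data is a synthetic stand-in:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the "cgpa" column
rng = np.random.default_rng(42)
df = pd.DataFrame({'cgpa': rng.normal(6.96, 0.62, 1000)})

upper = df['cgpa'].mean() + 3 * df['cgpa'].std()
lower = df['cgpa'].mean() - 3 * df['cgpa'].std()

# Capping (Winsorization): replace values beyond the boundaries
# with the boundary values themselves
capped = df.copy()
capped['cgpa'] = capped['cgpa'].clip(lower=lower, upper=upper)

print(capped.shape)  # same shape as the original df
print('min:', capped['cgpa'].min())  # at or above the lower boundary
print('max:', capped['cgpa'].max())  # at or below the upper boundary
```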
Conclusion
In conclusion, handling outliers is essential in data analysis and machine learning. Outliers can significantly impact the accuracy and reliability of our models, leading to skewed results. In this blog, we focused on the z-score method for outlier detection and treated outliers by applying appropriate techniques such as trimming and capping.
code: https://github.com/taherafirdose/100-days-of-Machine-Learning/tree/master/Handling%20Outliers