Treating Outliers using IQR and Percentile approach — Part 2
In the previous post, we explored the concept of outliers and discussed how certain models are influenced by them. We also examined one way of treating outliers, based on the normal distribution. In this post, we look at two alternative approaches: the Interquartile Range (IQR) method and the Percentile method.
Skewed Distribution (IQR method): A skewed distribution is one where the values are not symmetrically distributed around the mean. The data is concentrated towards one side, resulting in a long tail on either the left or the right side of the distribution.
Right-skewed data, also known as positively skewed data, refers to a type of data distribution where the majority of the values are concentrated towards the left (lower values) of the distribution, while a few larger values are present on the right (higher values) side. This results in a long tail extending towards the right.
In a right-skewed distribution:
· The mean is usually greater than the median.
· The median is closer to the lower end of the data range.
· The mode, or the most frequently occurring value, tends to be smaller than the median.
Visually, a right-skewed distribution appears asymmetric, with a stretched or elongated tail on the right-hand side. The bulk of the data is situated towards the left, indicating a prevalence of smaller values, while a few extreme values on the right contribute to the skewness.
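To make these properties concrete, here is a minimal sketch that checks them on a synthetic right-skewed sample. The log-normal data, the seed, and the variable names are assumptions made for illustration; they are not the dataset used in this post.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (log-normal); a stand-in, not the blog's data.
rng = np.random.default_rng(seed=0)
sample = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=1000))

mean_value = sample.mean()
median_value = sample.median()
skewness = sample.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
```

For a sample like this, the mean should come out larger than the median and the skewness should be positive, because the few large values in the right tail pull the mean upwards.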
To treat outliers in a right-skewed distribution using the Interquartile Range (IQR) method, follow these steps:
1. Calculate the first quartile (Q1) and third quartile (Q3) of the dataset.
2. Calculate the IQR by subtracting Q1 from Q3, i.e., IQR = Q3 - Q1.
3. Identify the lower limit by subtracting 1.5 times the IQR from Q1, i.e., Lower Limit = Q1 - 1.5 * IQR.
4. Identify the upper limit by adding 1.5 times the IQR to Q3, i.e., Upper Limit = Q3 + 1.5 * IQR.
5. Values below the lower limit or above the upper limit are considered outliers.
6. Treat the outliers by either removing them from the dataset or replacing them with a suitable value.
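The steps above can be sketched as a small helper. The function name `iqr_bounds`, the parameter `k`, and the toy values are hypothetical and chosen only for illustration; pandas computes quantiles with linear interpolation by default.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower_limit, upper_limit) using the k * IQR rule."""
    q1 = values.quantile(0.25)  # first quartile
    q3 = values.quantile(0.75)  # third quartile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data with one obviously extreme value.
marks = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
lower_limit, upper_limit = iqr_bounds(marks)

# Step 5: anything outside the limits is flagged as an outlier.
outliers = marks[(marks < lower_limit) | (marks > upper_limit)]
print(lower_limit, upper_limit, outliers.tolist())
```

Here Q1 is 3 and Q3 is 7, so the IQR is 4 and the limits are -3 and 13; only the value 100 is flagged.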
Let’s perform outlier detection and removal on right-skewed data using Python.
We will work with the same data as used in part 1 and focus on addressing outliers in the “placement_exam_marks” column, which exhibits a right-skewed distribution.
The dataset consists of 1000 rows and 3 columns.
Create a plot to visualize the distribution of data in the columns.
It is evident that the “cgpa” column is approximately normally distributed, while the “placement_exam_marks” column is right-skewed. For this approach we will work with the “placement_exam_marks” column.
Calculate the minimum, maximum, mean, and standard deviation for the “placement_exam_marks” column.
We will proceed with the steps outlined earlier to handle outliers in the “placement_exam_marks” column, considering its right-skewed distribution.
Calculate the 25th and 75th percentiles, the IQR, and the upper and lower limits.
Let’s examine the dataset to identify any values in the “placement_exam_marks” column that lie above the upper limit or below the lower limit.
The analysis shows 114 rows in which the “placement_exam_marks” value exceeds the upper limit, and no rows with values below the lower limit.
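A check of this kind can be sketched with boolean masks. Only the column name comes from the dataset; the nine marks below are made-up values, so the counts differ from those reported above.

```python
import pandas as pd

# Hypothetical marks; only the column name is taken from the article.
df = pd.DataFrame({"placement_exam_marks": [8, 12, 15, 18, 20, 22, 25, 30, 95]})
col = df["placement_exam_marks"]

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Rows on either side of the limits.
above = df[col > upper_limit]
below = df[col < lower_limit]
print(len(above), len(below))
```

With these toy values the limits are 0 and 40, so one row lies above the upper limit and none lie below the lower limit.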
Trimming is not ideal here because it discards roughly 11% of the data (114 of 1,000 rows), but we will still perform it as an exercise to see how it is implemented. Capping is generally the better choice in a case like this because it keeps every row.
Trimming removes all data points that fall below the lower limit or above the upper limit. After trimming, the data frame is reduced to 886 rows and 3 columns, reflecting the removal of the outlier rows.
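As a rough illustration of trimming, the sketch below keeps only the rows inside the limits. The marks are hypothetical and the limits are recomputed inline so the snippet runs on its own; `Series.between` is inclusive at both ends by default.

```python
import pandas as pd

# Hypothetical marks; only the column name is taken from the article.
df = pd.DataFrame({"placement_exam_marks": [8, 12, 15, 18, 20, 22, 25, 30, 95]})
col = df["placement_exam_marks"]

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming: keep rows whose value lies within [lower_limit, upper_limit].
trimmed = df[col.between(lower_limit, upper_limit)]
print(df.shape, "->", trimmed.shape)
```

The original frame is left untouched; `trimmed` is a new, shorter frame, which makes it easy to compare the two afterwards.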
Let’s generate visualizations to compare the plots before and after the removal of outliers.
The plots show that the outliers have been removed.
Given how much data trimming would discard, capping suits this situation better. It replaces all values above the upper limit with the upper limit and all values below the lower limit with the lower limit. The dataframe shape stays unchanged and no rows are lost, although the capped values themselves no longer reflect the original measurements.
The maximum value in the column now equals the upper limit, which shows that every value above it has been capped. Because no values fell below the lower limit, the minimum is unchanged.
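One way to sketch capping is with `Series.clip`, which replaces values outside the limits with the nearest limit. The marks below are hypothetical, and `clip` is one possible choice here; the same result could be reached with `np.where`.

```python
import pandas as pd

# Hypothetical marks; only the column name is taken from the article.
df = pd.DataFrame(
    {"placement_exam_marks": [8.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 30.0, 95.0]}
)
col = df["placement_exam_marks"]

q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
lower_limit, upper_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Capping: values beyond a limit are set to that limit; the row count is kept.
capped = df.copy()
capped["placement_exam_marks"] = col.clip(lower=lower_limit, upper=upper_limit)

print(capped.shape, capped["placement_exam_marks"].min(), capped["placement_exam_marks"].max())
```

In this toy sample nothing falls below the lower limit, so only the maximum changes: 95 becomes the upper limit of 40, while the minimum stays at 8.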
Let’s visually examine the dataset to confirm the removal of outliers and the application of capping by plotting the data before and after the process.
Percentile Approach:
The percentile method is a technique used to treat outliers by identifying and capping extreme values based on a specified percentage threshold. It involves calculating the threshold values based on percentiles and replacing any data points that exceed these thresholds with the corresponding threshold values.
The process of using the percentile method to handle outliers typically involves the following steps:
1. Determine the percentage threshold: Select a percentage value that represents the threshold for extreme values. Common choices are the 95th percentile paired with the 5th percentile (5% of data points in each tail are treated as outliers) or the 99th percentile paired with the 1st percentile (1% in each tail). The choice of threshold depends on the specific dataset and the desired level of outlier removal.
2. Calculate the threshold values: Use the chosen percentile to calculate the upper and lower thresholds. For example, if the 95th percentile is chosen, the upper threshold is the value below which 95% of the data falls (the 95th percentile), and the lower threshold is the value above which 95% of the data falls (the 5th percentile).
3. Identify and replace outliers: Iterate through the dataset and identify any data points that exceed the upper or lower threshold. Replace these outliers with the corresponding threshold value.
By applying the percentile method, extreme values that lie beyond the selected threshold are effectively replaced with more representative values. This helps mitigate the impact of outliers on the data analysis and modeling process, allowing for more robust and reliable results.
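The three steps can be sketched as a single helper. The function name `cap_by_percentile`, its parameters, and the synthetic normal data are assumptions made for this example, not part of the dataset used later.

```python
import numpy as np
import pandas as pd


def cap_by_percentile(
    values: pd.Series, lower_pct: float = 0.01, upper_pct: float = 0.99
) -> pd.Series:
    """Cap values at the given lower and upper quantiles (fractions in [0, 1])."""
    lower = values.quantile(lower_pct)
    upper = values.quantile(upper_pct)
    return values.clip(lower=lower, upper=upper)


# Synthetic data used only to exercise the helper.
rng = np.random.default_rng(seed=1)
data = pd.Series(rng.normal(loc=50.0, scale=10.0, size=1000))

# 5th / 95th percentile thresholds.
capped = cap_by_percentile(data, 0.05, 0.95)
print(data.min(), data.max(), "->", capped.min(), capped.max())
```

After capping, the smallest and largest values equal the 5th and 95th percentiles of the original data, and the number of rows is unchanged.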
Let’s perform outlier detection and treatment using the percentile approach in Python.
We will work with a new dataset called “heightweight” and import the required libraries. Let’s determine the shape of the dataframe, which provides information about the number of rows and columns in the dataset.
We will generate visualizations to examine the “height” and “weight” columns and identify any potential outliers present in the dataset.
It is evident from the visualizations that the “height” column contains a higher number of outliers compared to the “weight” column. Therefore, we will focus our attention on addressing outliers specifically in the “height” column.
To detect outliers in the “height” column, we will utilize the 99th percentile and 1st percentile as the boundary values. Any data points above the 99th percentile or below the 1st percentile will be considered outliers.
The analysis identifies 100 rows with values above the 99th percentile and 100 rows with values below the 1st percentile. These rows are treated as outliers.
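A count like this can be sketched as follows. The heights are synthetic, and the row count of 10,000 is an assumption chosen so that 1% per tail works out to 100 rows; the size of the real dataset is not stated here.

```python
import numpy as np
import pandas as pd

# Synthetic heights; the distribution parameters and row count are assumptions.
rng = np.random.default_rng(seed=42)
df = pd.DataFrame({"height": rng.normal(loc=66.0, scale=4.0, size=10_000)})

upper_limit = df["height"].quantile(0.99)
lower_limit = df["height"].quantile(0.01)

# Count rows strictly beyond each boundary.
n_above = int((df["height"] > upper_limit).sum())
n_below = int((df["height"] < lower_limit).sum())
print(n_above, n_below)
```

Because the boundaries are percentiles of the same column, the share of flagged rows is fixed by the chosen percentiles rather than by how extreme the values are. With 10,000 distinct values, each tail holds 100 rows.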
Since we have already explored trimming, I will proceed directly to capping the “height” column. This replaces all values above the 99th percentile with the upper limit and all values below the 1st percentile with the lower limit.
Let’s generate visualizations to compare the plots before and after capping.
Conclusion
Outliers can significantly impact statistical analyses and machine learning models, distorting results and hindering accurate predictions. Understanding what outliers are, when they are dangerous, and how to handle them is crucial for robust data analysis. Detecting them with the z-score, IQR, or percentile approach, and then treating them by trimming or capping, can improve the reliability of the analysis and the performance of machine learning models.