Filling missing values with Mean and Median

Tahera Firdose
8 min read · May 29, 2023


Data analysis and machine learning often involve working with datasets that may contain missing values. Handling missing data is a crucial step in the data preprocessing phase, as it can significantly impact the accuracy and reliability of our models. One common approach to dealing with missing values is to replace them with the mean or median of the available data. In this blog post, we will explore the process of filling missing values with mean and median, and discuss their advantages and limitations.

Why do we fill missing values with mean and median?

When confronted with missing values, we have several options for handling them, such as removing rows with missing data, using more elaborate imputation techniques, or building models that can handle missingness. However, filling missing values with the mean or median is a straightforward and widely used approach that is easy to implement. It lets us keep the rows that would otherwise be dropped, so the shape of the dataset stays intact.

Filling missing values with the mean

The mean is the average value, calculated by summing all the available values and dividing by the number of available (non-missing) observations. Filling missing values with the mean is a simple way to estimate the unknown values based on the existing data. Here’s a step-by-step process to fill missing values with the mean:

  1. Identify the columns or variables that contain missing values.
  2. Calculate the mean of each column containing missing values.
  3. Replace the missing values in each column with the respective mean value.
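The three steps above can be sketched in pandas roughly as follows. The DataFrame, its column names, and its values are made up for illustration; only `isnull()`, `mean()`, and `fillna()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 170.0],
    "weight": [50.0, np.nan, 70.0, 60.0],
    "id": [1, 2, 3, 4],
})

# Step 1: identify the columns that contain missing values.
cols_with_na = [col for col in df.columns if df[col].isnull().any()]

# Step 2: compute the mean of each such column (NaNs are skipped by default).
means = df[cols_with_na].mean()

# Step 3: replace the missing values with the respective column mean.
# Passing a Series to fillna() matches fill values to columns by name.
df_filled = df.fillna(means)

print(df_filled)
```

Columns without missing values (here `id`) are left untouched because they have no entry in `means`.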

Filling missing values with the median

The median is the middle value in a sorted dataset. It represents the central tendency and is less sensitive to outliers than the mean. Filling missing values with the median follows the same procedure as filling with the mean:

  1. Identify the columns or variables with missing values.
  2. Calculate the median of each column containing missing values.
  3. Replace the missing values in each column with the respective median value.
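To see why the median is described as less sensitive to outliers, here is a small sketch on a made-up series with one extreme value. The numbers are hypothetical and chosen only to make the difference visible.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one missing entry.
income = pd.Series([30.0, 32.0, 35.0, 38.0, 1000.0, np.nan])

mean_value = income.mean()      # pulled upward by the 1000.0 outlier
median_value = income.median()  # middle of the five observed values

filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(f"mean = {mean_value}, median = {median_value}")
# prints: mean = 227.0, median = 35.0
```

The mean-filled entry (227.0) is larger than every typical value in the series, while the median-filled entry (35.0) sits among them.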

When to fill missing values with the mean

  1. Normally distributed data: If the data follows a normal distribution, filling missing values with the mean is a reasonable choice. The mean represents the central tendency of the data and aligns well with the overall distribution.
  2. Continuous or interval data: Mean is suitable for continuous or interval variables since it takes into account the magnitude of the values and provides a balanced estimate.

When to fill missing values with the median

  1. Skewed or non-normal data: If the data is heavily skewed or has outliers, the median is a better choice as it is robust to extreme values. The median represents the middle value and is less influenced by extreme observations.

To demonstrate the process, let’s consider the Titanic dataset, which has missing values in the ‘Age’ and ‘Fare’ columns.

Let’s load our dataset into a Pandas DataFrame and examine the missing values.
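A minimal sketch of this inspection step is shown here. The data file used in this post is not reproduced, so the ten rows below are a hypothetical stand-in, and the ‘Survived’ column is an assumption; the percentages it prints are not the Titanic figures.

```python
import numpy as np
import pandas as pd

# Stand-in for the Titanic data: these ten rows are hypothetical.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 4.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, np.nan, 21.08, 11.13, 30.07, 16.70],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# Number of missing values per column.
missing_count = df.isnull().sum()

# Share of missing values per column, as a percentage.
# isnull() gives booleans, and the mean of booleans is the fraction of True.
missing_pct = df.isnull().mean() * 100

print(pd.DataFrame({"missing": missing_count, "percent": missing_pct}))
```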

We can observe that the ‘Age’ and ‘Fare’ columns contain missing values: about 19.86% of the ‘Age’ values and around 5% of the ‘Fare’ values are missing.

We will split the data into training and testing sets, with 80% of the data allocated for training and 20% for testing.
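The split could look roughly like this, using scikit-learn's `train_test_split`. The data is synthetic, the target column name ‘Survived’ and the `random_state` value are assumptions. Splitting before imputing means the fill values can later be computed from the training rows alone, which keeps information from the test set out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; 'Survived' is assumed to be the target column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.normal(30, 12, size=100),
    "Fare": rng.exponential(30, size=100),
    "Survived": rng.integers(0, 2, size=100),
})
# Blank out 20% of the ages to mimic missing data.
df.loc[df.sample(frac=0.2, random_state=0).index, "Age"] = np.nan

X = df.drop(columns=["Survived"])
y = df["Survived"]

# 80% of the rows for training, 20% for testing; random_state is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

print(X_train.shape, X_test.shape)
# prints: (80, 2) (20, 2)
```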

To fill missing values with the mean, we will use the fillna() method from Pandas, which replaces missing values with a specified value. In this case, we replace the missing values in the ‘Age’ and ‘Fare’ columns with the mean of each column.

The calls fillna(X_train[‘Age’].mean()) and fillna(X_train[‘Fare’].mean()) replace the missing values in each column with that column's mean. In the resulting DataFrame, the columns X_train[‘Age_mean’] and X_train[‘Fare_mean’] contain the filled data.
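A minimal sketch of the calls described above, on small hypothetical frames in place of the real split. The column names `Age_mean` and `Fare_mean` follow the text; filling the test set with the training means is an assumption about the intended workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames standing in for the split data.
X_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [10.0, 20.0, np.nan, 30.0],
})
X_test = pd.DataFrame({"Age": [np.nan, 25.0], "Fare": [15.0, np.nan]})

# Means are computed on the training set only.
mean_age = X_train["Age"].mean()
mean_fare = X_train["Fare"].mean()

# New columns hold the filled data; the original columns stay untouched.
X_train["Age_mean"] = X_train["Age"].fillna(mean_age)
X_train["Fare_mean"] = X_train["Fare"].fillna(mean_fare)

# Assumption: the test set is filled with the training means as well.
X_test["Age_mean"] = X_test["Age"].fillna(mean_age)
X_test["Fare_mean"] = X_test["Fare"].fillna(mean_fare)

print(X_train)
```

Keeping the original columns next to the filled ones makes it easy to compare them afterwards.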

To fill missing values with the median, we will follow a similar approach using the fillna() method. However, this time we will replace the missing values with the median of each column.
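The median version differs only in the statistic. The sketch below uses a loop over the two columns to keep it compact; the data and the `_median` column suffix are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical training data with one missing value per column.
X_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 28.0, 65.0],
    "Fare": [10.0, 20.0, np.nan, 30.0, 500.0],
})

medians = {}
for col in ["Age", "Fare"]:
    # median() skips NaN values by default.
    medians[col] = X_train[col].median()
    X_train[f"{col}_median"] = X_train[col].fillna(medians[col])

print(medians)
# prints: {'Age': 34.0, 'Fare': 25.0}
```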

Comparing the Results

After filling the missing values with mean and median, it’s essential to compare the original dataset with the filled datasets to observe any changes. We can use descriptive statistics or visualizations to assess the impact of filling missing values.

We can clearly see that both mean and median imputation reduce the variance of the ‘Age’ and ‘Fare’ variables. This is expected: every filled value sits at the center of the distribution, so the filled columns vary less than the observed data.
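The variance comparison can be reproduced in spirit on synthetic data. The sketch below is not the Titanic data: the ages are drawn from a normal distribution and roughly 20% are removed at random, both of which are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical ages, with roughly 20% of the values knocked out at random.
age = pd.Series(rng.normal(30, 14, size=500), name="Age")
age = age.mask(rng.random(500) < 0.2)

age_mean = age.fillna(age.mean())
age_median = age.fillna(age.median())

# With mean filling, the sum of squared deviations is unchanged (the filled
# values add zero deviation) while the number of rows grows, so the
# variance must shrink.
variances = pd.Series({
    "original": age.var(),
    "mean-filled": age_mean.var(),
    "median-filled": age_median.var(),
})
print(variances)
```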

Let's visualize this further by plotting a distplot.

The distribution of the ‘Age’ variable changes substantially after mean and median imputation because a large share of its values, nearly 20% of the data, was missing.

Since a considerable portion of the data was initially missing, the imputed values are introduced to fill those gaps. As a result, the imputed values tend to cluster around the central tendency (mean or median), altering the original distribution of the ‘Age’ variable. This shift can be visually observed when comparing the histograms or density plots before and after imputation.
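The clustering described above can also be checked numerically, without a plot. The sketch below uses `np.histogram` on synthetic data as a stand-in for the density plot; the data, the bin count, and the missing rate are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical ages with about 20% missing at random.
age = pd.Series(rng.normal(30, 14, size=1000))
age = age.mask(rng.random(1000) < 0.2)

age_mean = age.fillna(age.mean())

# Same bin edges for both histograms so the counts are comparable.
bins = np.linspace(age.min(), age.max(), 21)
before, _ = np.histogram(age.dropna(), bins=bins)
after, _ = np.histogram(age_mean, bins=bins)

# All filled values are identical, so they land in a single bin.
changed_bins = np.flatnonzero(after != before)
print("bins that changed:", changed_bins)
print("extra values in that bin:", (after - before)[changed_bins])
```

Every filled value falls into the one bin that contains the mean, which is the spike that shows up in the density plot after imputation.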

Let us now draw the density plot for the ‘Fare’ column.

The distributions of the ‘Fare’ column before and after imputation with both mean and median are virtually identical, resulting in overlapping graphs. This suggests that the imputation process using either the mean or median did not substantially impact the distribution of the ‘Fare’ variable. The similarity between the distributions indicates that the missing values in the ‘Fare’ column were effectively filled without significantly altering the overall pattern of the data.

Additionally, the covariance analysis shows that imputing the ‘Age’ column substantially changes its covariance with the other columns. This is a red flag regarding the reliability of the imputed values: imputation may introduce bias and distort the relationships between variables.
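A covariance check along these lines can be sketched with `DataFrame.cov()`. The data below is synthetic, with an assumed positive relationship between ‘Age’ and ‘Fare’ that is not taken from the Titanic data; it only illustrates how the covariance shrinks once constant values are filled in.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Hypothetical data with a built-in positive Age/Fare relationship.
age = rng.normal(30, 14, size=n)
fare = 2.0 * age + rng.normal(0, 10, size=n)
df = pd.DataFrame({"Age": age, "Fare": fare})

# Remove roughly 20% of the ages at random.
df["Age"] = df["Age"].mask(rng.random(n) < 0.2)

df["Age_mean"] = df["Age"].fillna(df["Age"].mean())
df["Age_median"] = df["Age"].fillna(df["Age"].median())

# cov() ignores rows where either value of a pair is missing.
cov = df.cov()
print(cov.loc[["Age", "Age_mean", "Age_median"], "Fare"])
```

The filled rows carry no information about how ‘Age’ moves with ‘Fare’, so the covariance of the filled columns with ‘Fare’ is pulled toward zero.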

As discussed earlier, the choice between mean and median imputation depends on the nature of the data and the underlying assumptions: the mean suits roughly symmetric data, while the median is the safer choice for skewed data or data with outliers.

The box plot analysis further highlights the impact of imputation on the ‘Age’ column. After imputation, the interquartile range (IQR) for ‘Age’ becomes smaller, indicating a reduced spread of values compared to the original data. Because the whiskers of a box plot are drawn relative to the IQR, this narrower box causes values that previously fell inside the whiskers to be flagged as outliers, even though those values themselves have not changed.
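The box plot effect can be checked with numbers instead of a figure. In the sketch below, `iqr_and_outliers` is a hypothetical helper that applies the usual 1.5 × IQR whisker rule, and the data is synthetic.

```python
import numpy as np
import pandas as pd


def iqr_and_outliers(s: pd.Series):
    """Return the IQR and the number of points beyond the 1.5*IQR whiskers."""
    s = s.dropna()
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    outside = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    return iqr, int(outside.sum())


rng = np.random.default_rng(3)

# Hypothetical ages with about 20% missing at random.
age = pd.Series(rng.normal(30, 14, size=1000))
age = age.mask(rng.random(1000) < 0.2)
age_median = age.fillna(age.median())

iqr_before, out_before = iqr_and_outliers(age)
iqr_after, out_after = iqr_and_outliers(age_median)

print(f"IQR: {iqr_before:.2f} -> {iqr_after:.2f}")
print(f"points outside whiskers: {out_before} -> {out_after}")
```

The observed values are identical in both series; only the quartiles move, and with them the whisker limits.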

In contrast, the box plot for the ‘Fare’ column does not reveal any outliers after imputation. The absence of outliers suggests that the imputation process did not introduce extreme or unexpected values in the ‘Fare’ variable.

Another approach to filling missing values is the SimpleImputer class from the scikit-learn library. This class provides a convenient way to handle missing data using various imputation strategies, such as the mean or the median.
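A minimal sketch of `SimpleImputer` with the mean strategy, on small hypothetical frames. The imputer is fitted on the training data and then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames.
X_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [10.0, 20.0, np.nan, 30.0],
})
X_test = pd.DataFrame({"Age": [np.nan, 25.0], "Fare": [15.0, np.nan]})

imputer = SimpleImputer(strategy="mean")  # "median" is also supported

# fit_transform learns the column means from the training data and fills it.
X_train_filled = imputer.fit_transform(X_train)

# transform reuses those training means on the test data.
X_test_filled = imputer.transform(X_test)

# statistics_ holds the learned fill value for each column.
print(imputer.statistics_)
```

By default the transformed result is a NumPy array rather than a DataFrame, so the column names are not carried over.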

We can also impute different columns with different strategies using the SimpleImputer class. In this case, let's impute the 'Age' column with the median strategy and the 'Fare' column with the mean strategy.

Finally, we fit the imputers and transform the data.
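One way to apply a different strategy per column is to wrap two `SimpleImputer` instances in scikit-learn's `ColumnTransformer`. Using `ColumnTransformer` here is an assumption about how the per-column setup is wired, and the transformer names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames.
X_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 28.0, 65.0],
    "Fare": [10.0, 20.0, np.nan, 30.0, 100.0],
})
X_test = pd.DataFrame({"Age": [np.nan, 25.0], "Fare": [15.0, np.nan]})

# Assumption: ColumnTransformer routes each column to its own imputer.
trf = ColumnTransformer(
    transformers=[
        ("age_imputer", SimpleImputer(strategy="median"), ["Age"]),
        ("fare_imputer", SimpleImputer(strategy="mean"), ["Fare"]),
    ],
    remainder="passthrough",
)

# Fit on the training data, then apply the learned values to both sets.
X_train_filled = trf.fit_transform(X_train)
X_test_filled = trf.transform(X_test)

# Learned fill values: median of 'Age' and mean of 'Fare' in the training set.
age_fill = trf.named_transformers_["age_imputer"].statistics_[0]
fare_fill = trf.named_transformers_["fare_imputer"].statistics_[0]
print(age_fill, fare_fill)
# prints: 34.0 40.0
```

The output columns follow the order of the transformers, so here the first column is ‘Age’ and the second is ‘Fare’.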

Advantages of using mean and median

  1. Simplicity: The mean and median are simple statistical measures that can be easily calculated and implemented.
  2. Roughly preserves the data distribution when few values are missing: If only a small share of a column is missing, as with ‘Fare’ above, filling with the mean or median leaves the overall distribution largely unchanged. With a larger share, as with ‘Age’, the distribution can shift noticeably.
  3. Minimizes data loss: By replacing missing values with the mean or median, we can retain more data points compared to other approaches that involve removing rows with missing values.

Limitations and considerations

  1. Distortion of variance: Filling missing values with the mean or median can result in a reduction of variance, potentially underestimating the variability in the dataset. This may affect the accuracy of certain statistical analyses.
  2. Bias towards central tendency: If the missing values are not missing at random and are related to the underlying data pattern, filling them with the mean or median may introduce bias into the dataset.
  3. Impact on correlations: Imputing missing values with the mean or median can affect the correlation structure between variables, potentially leading to misleading results.

Conclusion

Filling missing values with the mean or median is a practical and widely used approach in data preprocessing. It provides a simple way to handle missing data without discarding rows. However, it is essential to consider the limitations and potential biases associated with this method, especially when a large share of a column is missing.
