Introduction to Machine Learning
At its core, Machine Learning (ML) is a branch of artificial intelligence that focuses on building systems that learn or improve performance based on the data they consume.

Handling Missing Values

The Simple Definition

Data Preprocessing is the act of cleaning and formatting raw data into a form that a machine learning algorithm can actually work with.

Missing Values are simply the blank cells in your dataset. In Pandas, you will usually see them shown as NaN (Not a Number), None, or NaT (for missing dates). Databases and some other tools call the same thing NULL.
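Before deciding what to do about missing values, it helps to count them. The sketch below uses a small made-up table (the column names and values are illustrative only) and the Pandas isna() method to count blanks per column.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; np.nan and None both mark blank cells.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# isna() is True wherever a cell is missing; sum() counts the Trues per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Share of rows that contain at least one missing value.
share_incomplete_rows = df.isna().any(axis=1).mean()
print(share_incomplete_rows)
```

Checking the share of incomplete rows early gives a rough sense of how costly it would be to delete them.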

Why is this the “Most Important Step”?

Imagine trying to bake a cake, but the recipe has a coffee stain covering how many eggs you need. If you just guess randomly or skip the eggs entirely, the cake will fail.

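Machine Learning algorithms are the same way. If you feed most Scikit-Learn estimators a dataset that contains blank cells, they will refuse to train and raise an error, because the math they rely on is not defined for a blank space. A few estimators, such as HistGradientBoostingClassifier, can handle NaN on their own, but they are the exception. Therefore, you must decide how to handle these missing pieces before you can train your model.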
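To see the failure for yourself, here is a minimal sketch that assumes Scikit-Learn is installed and uses LinearRegression as an example of an estimator without built-in NaN support. The data is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: one feature column with one blank (NaN) cell.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

try:
    LinearRegression().fit(X, y)
    error_message = None
except ValueError as exc:
    # Scikit-Learn validates its input and refuses to train on NaN.
    error_message = str(exc)

print(error_message)
```

The exact wording of the message differs between Scikit-Learn versions, but it points at the NaN values in the input.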

How to Handle Missing Values (Step-by-Step)

When you find missing data in your Pandas DataFrame, you generally have two main choices: Drop it or Fill it.

Option 1: Dropping the Data (Deletion)

This is the simplest approach: if a row has a missing value, you just delete the entire row from your dataset.

  • When to use it: Only when you have a massive dataset and very few missing values. If you have 100,000 records and only 10 are missing an age, deleting those 10 rows won’t hurt your model’s ability to learn.
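  • The Danger: If you drop too much data, you are throwing away valuable examples that the model needs to learn the patterns! Deleting a row also discards every value in that row that was filled in, and if the blanks are not random, the rows that remain may no longer represent your real data.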
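In Pandas, deletion is usually done with dropna(). The following is a minimal sketch on a made-up table (column names and values are assumptions for illustration), including a quick check of how many rows the deletion costs.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with two blank cells.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, 52],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "city": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
})

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

# Drop a row only if "age" is missing; blanks in other columns are kept.
rows_with_age = df.dropna(subset=["age"])

# Check how much data the deletion costs before committing to it.
rows_lost = len(df) - len(complete_rows)
print(f"Dropped {rows_lost} of {len(df)} rows")
```

dropna() returns a new DataFrame and leaves df unchanged, so you can compare the two before deciding.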

Option 2: Filling in the Blanks (Imputation)

Instead of throwing the data away, we make an educated guess to fill in the blank cell. This is called Imputation. How do we guess? We use basic statistics!

  • Mean (The Average): If a few customers didn’t provide their age, you calculate the average age of all the other customers and put that number in the blank spots.
  • Median (The Middle): If your data has extreme outliers (like one customer who is 110 years old), the average gets pulled toward them. Instead, you line up all the ages from lowest to highest and pick the middle value (or the average of the two middle values when the count is even) to fill the blanks.
  • Mode (The Most Frequent): What if the missing data is text, like “City”? You can’t calculate the average of a word! Instead, you use the Mode, which simply fills the blank with the most common city in your dataset.
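The three statistics above map onto the Pandas methods mean(), median(), and mode(), combined with fillna(). Here is a small sketch with made-up data (the 110-year-old customer is included on purpose to show how the mean and median differ).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing age, one missing city, one outlier (110).
df = pd.DataFrame({
    "age": [22, 25, np.nan, 31, 110],
    "city": ["Pune", "Delhi", "Pune", None, "Pune"],
})

# Missing values are skipped when these statistics are computed.
mean_age = df["age"].mean()              # pulled upward by the outlier
median_age = df["age"].median()          # barely affected by the outlier
most_common_city = df["city"].mode()[0]  # mode() returns a Series; take the first entry

# Fill numeric blanks with the median and text blanks with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(median_age)
filled["city"] = filled["city"].fillna(most_common_city)

print(mean_age, median_age, most_common_city)
```

Here the mean is 47.0 while the median is 28.0, which is much closer to a typical customer in this table.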

Real-Life Examples & Practical Use Cases

Where do these missing values come from in the real world, and how are they handled?

  • Medical Diagnosis: A hospital is predicting patient readmissions. Some patient files are missing “Blood Pressure” readings because a nurse forgot to write it down.
  • The Fix: A data scientist uses Pandas to impute (fill) those blanks with the median blood pressure of patients of the same age and gender.
  • E-Commerce Surveys: An online store sends out a 20-question survey. Many users get bored and leave the last 5 questions blank.
  • The Fix: If a user left 90% of the survey blank, the data scientist will likely choose to drop that specific user’s row entirely, as it doesn’t provide enough useful data to learn from.
  • IoT (Internet of Things) Sensors: A smart thermostat records the temperature every minute, but the Wi-Fi drops out for an hour, leaving 60 blank rows of data.
  • The Fix: The system looks at the temperature right before the Wi-Fi dropped and right after it reconnected, and fills the blanks with the average between those two points.
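The three fixes above could be sketched in Pandas roughly as follows. All table contents, column names, and the "at least half answered" threshold are assumptions made for illustration, not values from a real project.

```python
import numpy as np
import pandas as pd

# --- Medical: fill blanks with the median of similar patients ---
patients = pd.DataFrame({
    "age_group": ["60s", "60s", "60s", "30s", "30s", "30s"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "systolic_bp": [140.0, np.nan, 150.0, 118.0, 122.0, np.nan],
})
# transform("median") gives each row the median of its own age/gender group.
group_median = patients.groupby(["age_group", "gender"])["systolic_bp"].transform("median")
patients["systolic_bp"] = patients["systolic_bp"].fillna(group_median)

# --- Survey: drop respondents who answered too little ---
survey = pd.DataFrame({
    "q1": [5, np.nan, 4],
    "q2": [3, np.nan, np.nan],
    "q3": [4, np.nan, 2],
    "q4": [5, 1, 3],
})
# Keep rows with at least half of the questions answered (a judgment call).
min_answers = survey.shape[1] // 2
survey_kept = survey.dropna(thresh=min_answers)

# --- Sensor: fill a gap using the readings on either side of it ---
temps = pd.Series([20.0, np.nan, np.nan, np.nan, 22.0])
# Average of the last reading before the gap and the first reading after it.
flat_fill = temps.fillna((temps.ffill() + temps.bfill()) / 2)
# A common alternative: linear interpolation, which ramps from 20 to 22.
ramp_fill = temps.interpolate()

print(patients, survey_kept, flat_fill, ramp_fill, sep="\n\n")
```

The flat fill matches the description above (every blank gets the same averaged value), while interpolate() is a different technique that spreads the change evenly across the gap. Which one fits better depends on how the measured quantity behaves.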

Summary

Data Preprocessing is the critical step of cleaning your data before training a model. Handling Missing Values is a massive part of this. Because most algorithms raise an error when they see blank data (NaN), we use tools like Pandas to either safely Drop the incomplete rows or intelligently Impute (fill) the blanks using the mean, the median, or the most frequent value.