Cleaning Data in Data Science: A Practical Guide That Skips the Fluff
Here's something nobody tells you before you get into data science: you're gonna spend most of your time cleaning data. Not building cool models. Not making pretty dashboards. Cleaning. The usual estimate is 60-80% of project time goes to data prep. When I first heard that number, I thought it was exaggerated. It's not.
So let's talk about what actually comes up when you're knee-deep in a messy dataset.
Why You Can't Skip This Part
Garbage in, garbage out. Yeah, everybody says it. Doesn't make it less true though.
Train an XGBoost model on dirty data and it'll give you predictions that look really confident and are completely wrong. Missing values crash your pipeline at 2am. Duplicate rows make your accuracy metrics look better than they should. And inconsistent formatting? That's the one that really gets you. Your SQL join returns zero rows and you spend three hours before realizing "United States" and "US" don't match.
Cleaning data in data science isn't the fun part. But if you skip it, nothing else works.
The Stuff You'll Actually Encounter
1. Missing Values
Every. Single. Real. Dataset. Has them.
The question isn't whether you've got missing data. It's what you do about it.
import pandas as pd
df = pd.read_csv('sales_data.csv')
# First, see what's actually missing
print(df.isnull().sum())
print(df.isnull().mean() * 100) # percentage missing per column
Now you've got choices:
- Just drop the rows: `df = df.dropna()`. Only do this if missing data is rare (under 5%) and actually random. If one demographic has way more gaps than another, dropping introduces bias.
- Fill with something reasonable: `df['price'] = df['price'].fillna(df['price'].median())`. Median for numbers, mode for categories. Don't use the mean if you've got outliers; they'll drag it toward the extremes.
- Forward or backward fill: `df['temp'] = df['temp'].ffill()`. This makes sense for time series: the last recorded temperature is probably a decent guess for the next missing one. (The older `fillna(method='ffill')` spelling is deprecated in recent pandas.)
- Flag it and fill: add a column like `df['price_missing'] = df['price'].isnull()` before you fill. That way your model can learn from the pattern of what's missing.
Real talk, there's no single right approach here. It depends on WHY the data is missing. Sensor glitched out randomly? Interpolate, probably fine. Survey respondents deliberately skipped a sensitive question? That's a pattern in itself. The missingness IS information.
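The flag-then-fill pattern above is worth seeing end to end. A minimal sketch on a made-up DataFrame (column names are just illustrative):

```python
import pandas as pd
import numpy as np

# Toy data with one gap in 'price'
df = pd.DataFrame({'price': [10.0, np.nan, 30.0, 50.0]})

# Record which rows were missing BEFORE filling,
# so the missingness pattern survives as a feature
df['price_missing'] = df['price'].isnull()

# Median is robust to outliers, unlike the mean
df['price'] = df['price'].fillna(df['price'].median())

print(df)
```

The order matters: if you fill first, the information about which rows were gaps is gone for good.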
2. Duplicates
These pop up constantly when you're merging data from multiple sources. Way more common than most people expect.
# How many duplicates?
print(df.duplicated().sum())
# Let me see them
print(df[df.duplicated(keep=False)])
# Get rid of them
df = df.drop_duplicates()
# Or be specific about what counts as a duplicate
df = df.drop_duplicates(subset=['email', 'order_date'], keep='last')
Pay attention to keep='first' versus keep='last'. If later rows are corrections to earlier ones, keep last. If later rows are accidental re-entries, keep first. Getting this wrong quietly distorts your whole analysis.
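Here's the keep='last' case on toy data, assuming later rows are corrections (the column names and values are made up for illustration):

```python
import pandas as pd

# Two entries for the same email + date; the second row corrects the first
df = pd.DataFrame({
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
    'order_date': ['2024-03-01', '2024-03-01', '2024-03-02'],
    'amount': [100, 120, 50],  # 120 is the corrected amount
})

# keep='last' retains the correction, drops the stale row
deduped = df.drop_duplicates(subset=['email', 'order_date'], keep='last')
print(deduped)
```

If your rows aren't already in entry order, sort by a timestamp column first so that "last" actually means "most recent."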
3. Formatting That's All Over the Place
This is the one that makes me want to throw my laptop. Same data point, five different spellings:
- "United States", "US", "U.S.", "USA", "united states", "US."
- "$1,234.56", "1234.56", "$1234.56", "1,234.56 USD"
- "2024-03-17", "03/17/2024", "March 17, 2024", "17-Mar-24"
# Get text under control
df['country'] = df['country'].str.strip().str.lower()
df['country'] = df['country'].replace({
    'us': 'united states',
    'u.s.': 'united states',
    'usa': 'united states',
    'us.': 'united states'
})
# Wrangle dates into a single format (format='mixed' needs pandas 2.0+)
df['date'] = pd.to_datetime(df['date'], format='mixed')
# Strip the junk out of price strings: keep only digits and the decimal point
df['price'] = df['price'].str.replace(r'[^\d.]', '', regex=True).astype(float)
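One cheap way to confirm a mapping like the country one actually worked: run value_counts() afterward and eyeball the result for stragglers. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical raw values after merging a few sources
df = pd.DataFrame({'country': [' US ', 'usa', 'United States', 'u.s.']})

# Normalize casing and whitespace, then map the known variants
df['country'] = df['country'].str.strip().str.lower()
df['country'] = df['country'].replace({
    'us': 'united states',
    'u.s.': 'united states',
    'usa': 'united states',
})

# Any variant the mapping missed shows up as its own row here
counts = df['country'].value_counts()
print(counts)
```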
If you're working with a smallish dataset and don't want to write code, our data cleaner can handle a lot of these common formatting fixes in the browser.
4. Outliers
Important thing to remember: an outlier isn't automatically an error. A 7-foot-tall person is unusual. Real, but unusual. A recorded height of 700 feet? That's a typo. You need context to know which is which.
# The classic IQR approach
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(f"Found {len(outliers)} outliers")
Once you've found them:
- Look at them first. Seriously. Before doing anything, check if they're legit or broken.
- Cap or floor them (winsorize): Replace anything past your threshold with the threshold itself
- Log transform the column: Pulls in the tails on skewed data, makes outliers less extreme
- Delete them: But only when you're really sure they're errors, not just unusual values
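Capping and log-transforming can both be sketched in a few lines, reusing the IQR fences from above (the salary numbers are invented for illustration):

```python
import pandas as pd
import numpy as np

# Toy salaries with one extreme value
s = pd.Series([40_000, 52_000, 61_000, 58_000, 950_000])

# Same IQR fences as the detection step
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorize: replace anything past the fences with the fence value
capped = s.clip(lower=lower, upper=upper)

# Or compress the tail instead (log1p also handles zeros safely)
logged = np.log1p(s)

print(capped.tolist())
```

clip keeps every row; it just tames the extremes, which is usually less destructive than deleting.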
5. Data Types That Make No Sense
Numbers stored as strings. Dates saved as integers. True/false values recorded as "Yes" and "No" text. This stuff breaks your calculations in ways that don't always throw errors — sometimes it just silently gives you wrong answers.
# What types do we have?
print(df.dtypes)
# Fix them
df['age'] = pd.to_numeric(df['age'], errors='coerce') # bad values become NaN
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['is_active'] = df['is_active'].map({'Yes': True, 'No': False, 'Y': True, 'N': False})
That errors='coerce' thing is a lifesaver. Instead of blowing up when it hits a weird value, it just turns it into NaN. Then you deal with it in the missing values step. Clean.
My Cleaning Workflow
Every dataset is its own special mess, but this order works for most of them:
- Look around first. Run `df.info()`, `df.describe()`, `df.head(20)`. Get a feel for what you're working with before you touch anything.
- Fix data types. Do this early. Everything else gets easier when columns are the right type.
- Kill duplicates. No point cleaning rows you're about to delete anyway.
- Standardize formatting. Text casing, date formats, category names — make everything consistent.
- Deal with missing values. Now that types and formats are sorted, you can make smarter imputation decisions.
- Handle outliers. With duplicates gone and types fixed, your outlier detection is actually accurate now.
- Validate everything. Do the ranges make sense? Do the row counts match what you expected? Quick sanity checks save you hours of debugging later.
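That last validation step can literally be a handful of assert statements at the end of your cleaning script. A sketch, with ranges and column names that are obviously hypothetical:

```python
import pandas as pd

# Stand-in for a freshly cleaned dataset
df = pd.DataFrame({
    'age': [25, 34, 41],
    'price': [9.99, 24.50, 15.00],
    'signup_date': pd.to_datetime(['2023-01-05', '2023-06-12', '2024-02-01']),
})

# Cheap assertions that catch silent cleaning mistakes
assert df['age'].between(0, 120).all(), "age out of plausible range"
assert (df['price'] >= 0).all(), "negative prices snuck in"
assert df['signup_date'].max() <= pd.Timestamp.now(), "date in the future"
assert df.duplicated().sum() == 0, "duplicates survived cleaning"

print("all checks passed")
```

If any of these fire, you find out now, not three hours into model training.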
Not Everything Has to Be Python
Pandas is the default tool and it's great. But it's not your only option:
- OpenRefine — free, visual, built specifically for data cleaning. Really good if you don't want to write code.
- Excel or Google Sheets — totally fine for smaller datasets. TRIM, CLEAN, PROPER, Find & Replace — those functions do more than people give them credit for.
- SQL — if your data's already in a database, cleaning it in SQL is often faster than exporting to Python and back.
- dbt — for production pipelines. You define cleaning transformations as version-controlled SQL models. Super clean approach for teams.
- Browser-based cleaners — good for quick CSV fixes when you don't want to spin up a whole Jupyter notebook.
The Thing Nobody Tells You
Cleaning data in data science isn't a step you do once before the "real work" starts. It IS the real work. And it's not linear either — you'll be halfway through building a model, find some new data problem, and end up back in the cleaning step. That's normal. That's how it works for everyone.
The data scientists who build the best models? They're not smarter about algorithms. They're more thorough about cleaning. Start by understanding your data, clean it methodically, and write down what you changed and why. Your future self and your teammates will thank you.