How to Clean and Process Data Online
Remove duplicates, fix formatting, standardise values, and prepare messy data for analysis with our free Data Cleaner tool. Supports CSV, TSV, and JSON.
Steps
Upload your data
Upload a CSV, TSV, or JSON file, or paste your data directly. The tool parses the structure and gives you a preview of the data with its detected column types (text, number, date, boolean).
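The tool's detection heuristics are not documented, so the sketch below is an illustrative assumption of how column types might be inferred from string values: try the most specific pattern first, and fall back to text.

```python
import re

# Hypothetical type-detection sketch; the patterns and type names are
# illustrative assumptions, not the tool's actual heuristics.
def detect_type(values):
    """Classify a column of strings as number, date, boolean, or text."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "text"
    if all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in non_empty):
        return "number"
    if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in non_empty):
        return "date"
    if all(v.lower() in {"true", "false", "yes", "no"} for v in non_empty):
        return "boolean"
    return "text"

print(detect_type(["1", "2.5", "-3"]))  # number
print(detect_type(["2024-01-15", ""]))  # date
```

Empty cells are skipped during detection so that a column of dates with a few blanks is still recognised as a date column.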
Review data quality issues
The data profiling panel shows detected issues: missing values by column (and what percentage of rows are affected), duplicate rows, inconsistent formatting (dates in mixed formats, phone numbers with and without country codes), leading or trailing whitespace, and outlier values.
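A minimal sketch of the kind of profiling described above, for rows represented as dicts. The report keys and thresholds are assumptions for illustration, not the tool's actual output format.

```python
from collections import Counter

# Illustrative profiling sketch: missing values per column, duplicate
# rows, and rows with leading/trailing whitespace.
def profile(rows):
    """Return a simple data-quality report for a list of row dicts."""
    columns = rows[0].keys()
    n = len(rows)
    missing = {
        col: round(100 * sum(1 for r in rows if not r[col].strip()) / n, 1)
        for col in columns
    }
    seen = Counter(tuple(r.values()) for r in rows)
    duplicates = sum(c - 1 for c in seen.values() if c > 1)
    whitespace = {
        col: sum(1 for r in rows if r[col] != r[col].strip())
        for col in columns
    }
    return {"missing_pct": missing, "duplicate_rows": duplicates,
            "whitespace_rows": whitespace}

rows = [
    {"city": "London", "age": "34"},
    {"city": " London", "age": ""},
    {"city": "London", "age": "34"},
]
print(profile(rows))
```

On this sample it reports one duplicate row, 33.3% missing values in `age`, and one `city` value with leading whitespace.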
Select cleaning operations
Choose which cleaning operations to apply: Remove duplicate rows, Trim whitespace from text columns, Standardise date formats (convert all dates to YYYY-MM-DD), Normalise text case (all lowercase, title case), Remove rows with too many empty values, Replace empty values with a specified default, or Remove specific columns.
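A few of the operations listed above can be sketched as small functions applied in sequence. The function names and row representation are illustrative assumptions, not the tool's API.

```python
# Illustrative sketches of three cleaning operations: trim whitespace,
# remove duplicate rows, and replace empty values with a default.
def trim_whitespace(rows):
    return [{k: v.strip() for k, v in r.items()} for r in rows]

def remove_duplicates(rows):
    seen, out = set(), []
    for r in rows:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def fill_empty(rows, column, default):
    return [{**r, column: r[column] or default} for r in rows]

rows = [{"name": " Ada "}, {"name": "Ada"}, {"name": ""}]
rows = trim_whitespace(rows)
rows = remove_duplicates(rows)
rows = fill_empty(rows, "name", "Unknown")
print(rows)  # [{'name': 'Ada'}, {'name': 'Unknown'}]
```

Note that order matters: trimming whitespace before deduplication lets ' Ada ' and 'Ada' collapse into one row.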
Preview the cleaned data
Preview the result of your selected operations before applying. The diff view shows which rows changed and how. Verify that the operations produced the intended result without unintended side effects.
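A row-level diff of the kind shown in the preview can be sketched as a positional comparison of rows before and after cleaning. The output format here is an illustration only.

```python
# Illustrative diff sketch: compare rows positionally and report
# (index, before, after) for each changed row.
def diff_rows(before, after):
    changes = []
    for i, (b, a) in enumerate(zip(before, after)):
        if b != a:
            changes.append((i, b, a))
    return changes

before = [{"city": " London"}, {"city": "Paris"}]
after = [{"city": "London"}, {"city": "Paris"}]
print(diff_rows(before, after))  # [(0, {'city': ' London'}, {'city': 'London'})]
```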
Download the clean data
Download the cleaned data in the same format (CSV or JSON) or export to a different format. The tool also generates a cleaning summary report showing how many rows were affected by each operation.
Common Data Quality Problems and Their Causes
Data quality problems fall into predictable categories.

Structural problems: inconsistent column names (some use underscores, some use camelCase, some contain typos), mixed data types in a column (mostly numbers but some text like 'N/A' or 'unknown'), and dates in different formats (01/15/2024, 2024-01-15, and 15th January 2024 all in the same column).

Content problems: duplicate records created when form submissions are processed twice, leading or trailing whitespace creating non-matching values ('London' ≠ ' London'), inconsistent categorical values ('UK', 'United Kingdom', and 'England' all meaning the same thing), and values that are technically valid but logically impossible (negative ages, future birthdates, revenue totals that do not match line-item sums).

Knowing these common patterns helps you look for them systematically rather than discovering them when analysis produces unexpected results.
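Two of the content problems above can be sketched in a few lines: mapping inconsistent categorical values to a canonical form, and treating sentinel text like 'N/A' in a numeric column as missing. The mapping and sentinel list are illustrative assumptions for one dataset.

```python
# Hypothetical canonical mapping for inconsistent country labels;
# a real mapping would be built per dataset.
COUNTRY_MAP = {"UK": "United Kingdom", "England": "United Kingdom",
               "United Kingdom": "United Kingdom"}

def normalise_country(value):
    return COUNTRY_MAP.get(value.strip(), value.strip())

def to_number(value):
    """Treat sentinel text in a numeric column as missing (None)."""
    cleaned = value.strip()
    if cleaned.lower() in {"n/a", "unknown", ""}:
        return None
    return float(cleaned)

print(normalise_country(" England"))  # United Kingdom
print(to_number("N/A"))               # None
print(to_number("42"))                # 42.0
```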
Data Cleaning for Different Downstream Uses
The level and type of cleaning needed depends on what you plan to do with the data. For statistical analysis: ensure correct data types, handle outliers, verify that distributions make sense, and decide on a principled approach to missing values. For machine learning: more aggressive cleaning is typically needed — handle missing values (most ML algorithms cannot handle nulls), encode categorical variables, normalise numeric ranges, and consider how to handle outliers (remove them or cap them). For database import: ensure values conform to the schema constraints — text lengths, required fields, foreign key relationships, and unique constraints. For reporting and visualisation: focus on aggregation errors, missing category labels, and date format consistency. For API integration: ensure data types match the API's expectations — particularly important for dates, booleans, and numeric precision.
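For the database-import case above, pre-checking rows against schema-style constraints can be sketched as below. The constraint names (`required`, `max_length`) and the example schema are assumptions for illustration.

```python
# Illustrative schema validation sketch: check each row against
# simple constraints before attempting a database import.
def validate(row, schema):
    """Return a list of constraint violations for one row."""
    errors = []
    for col, rules in schema.items():
        value = row.get(col, "")
        if rules.get("required") and not value.strip():
            errors.append(f"{col}: required")
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"{col}: exceeds {rules['max_length']} chars")
    return errors

schema = {"email": {"required": True}, "name": {"max_length": 50}}
print(validate({"email": "", "name": "Ada"}, schema))  # ['email: required']
```

Catching these violations before import is cheaper than debugging a failed bulk insert.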
Frequently Asked Questions
What is data cleaning and why is it necessary?
Data cleaning (also called data cleansing or data scrubbing) is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset before it is used for analysis. Real-world data is almost always dirty: it comes from multiple sources with different conventions, it is entered by humans who make typos, it contains legacy formats from old systems, and it accumulates errors over time. Analysis built on dirty data produces inaccurate insights — the principle of 'garbage in, garbage out' applies directly. Studies suggest data professionals spend 60–80% of their time on data cleaning rather than actual analysis.
How should I handle missing values?
The right approach to missing values depends on context. Options: remove rows with missing values (appropriate if the missing data is random and affects only a small percentage of rows); fill with a default value (0 for numeric counts, 'Unknown' for category labels); fill with the column mean or median (appropriate for numeric data where missing values are likely near the average); fill with the previous or next value (appropriate for time series where values change slowly); or leave as null and handle programmatically in your analysis. Never blindly replace nulls with zeros — a null revenue figure is different from a zero revenue figure.
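Two of the fill strategies above can be sketched with the standard library: median fill for numeric data and forward fill for time series. The function names are illustrative.

```python
import statistics

# Illustrative fill strategies for missing (None) values.
def fill_median(values):
    """Replace None with the median of the present values."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    return [med if v is None else v for v in values]

def forward_fill(values):
    """Replace None with the most recent non-missing value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

print(fill_median([10, None, 30]))       # [10, 20.0, 30]
print(forward_fill([1, None, None, 4]))  # [1, 1, 1, 4]
```

As the answer above stresses, pick the strategy per column based on what a missing value means there, not one blanket rule for the whole dataset.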
What is the difference between data cleaning and data transformation?
Data cleaning fixes errors and inconsistencies in existing data without changing its meaning: fixing typos, standardising formats, removing duplicates, and filling missing values. Data transformation changes the structure or content of data to suit an analytical purpose: converting units (kilometres to miles), creating derived columns (calculating profit from revenue and cost), aggregating rows (summing daily data to monthly totals), and reshaping data (pivoting from long to wide format). Both are part of data preparation, but cleaning comes first — transform clean data, not dirty data.
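The clean-then-transform order can be sketched as below: cleaning trims the raw values without changing their meaning, then transformation derives a new column from them. The column names are illustrative assumptions.

```python
# Cleaning step: trim whitespace without changing meaning.
def clean(rows):
    return [{k: v.strip() for k, v in r.items()} for r in rows]

# Transformation step: derive a profit column from revenue and cost.
def add_profit(rows):
    return [{**r, "profit": float(r["revenue"]) - float(r["cost"])}
            for r in rows]

rows = [{"revenue": " 100 ", "cost": "40"}]
print(add_profit(clean(rows)))
# [{'revenue': '100', 'cost': '40', 'profit': 60.0}]
```

Run in the other order, `float(" 100 ")` would happen to work here, but messier values (thousands separators, sentinel text) would make the derived column wrong — which is why cleaning comes first.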