Section 3: Data Cleaning

Learning Objectives

By the end of this section, students will be able to:

  • Apply business context to data cleaning decisions
  • Import and validate data from multiple sources
  • Identify and handle missing values appropriately
  • Detect and manage different types of outliers
  • Standardize data formats and ensure consistency
  • Document cleaning processes for reproducibility

Business Context as Foundation

Data cleaning begins not with formulas but with business understanding. The most sophisticated technical approaches fail when disconnected from the underlying problem you’re solving. Every dataset arrives wrapped in context: how it was collected, who recorded it, what incentives shaped its creation, and which decisions it will inform.

Why do so many analysts skip directly to removing nulls and fixing formats? The temptation to start working immediately often leads to technically correct but business-irrelevant transformations. A marketing analyst working with customer transaction data must understand seasonal patterns before deciding whether to fill missing December sales with yearly averages or flag them as anomalous holiday behavior.

The business problem dictates cleaning priorities. Consider two scenarios using the same property dataset. A real estate investor seeking undervalued properties needs accurate square footage and recent sale prices. A city planner studying housing density cares more about zoning classifications and building ages. Same data, different cleaning requirements.

Reading Data from Multiple Sources

Modern analytics rarely involves single, pristine spreadsheets. Data arrives from disparate systems with incompatible formats, conflicting definitions, and varying quality standards. Excel provides reliable tools for ingesting this chaos, but success requires understanding each source’s quirks.

Data Import Best Practices

Excel Import Wizard Settings:

  • Text format for zip codes (preserves leading zeros)
  • Date format for temporal data
  • Custom delimiters for inconsistent separators
  • Error handling for malformed records

A real estate analyst receives monthly MLS (Multiple Listing Service) exports, county tax assessments, and Zillow API responses. The MLS uses ‘NR’ for not reported, the county uses -999 for missing values, and Zillow returns empty cells. Without documenting these source-specific null indicators, the analyst might calculate average home prices including -999 values, producing absurd results like negative property values in premium neighborhoods.
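
To make the idea concrete, here is a minimal Python sketch of normalizing source-specific null markers before computing averages. The marker values ('NR', -999, empty) mirror the MLS, county, and Zillow examples above; the price list itself is hypothetical.

```python
# Source-specific missing-value markers, as documented per data source
NULL_MARKERS = {"NR", "-999", ""}

def normalize_nulls(value):
    """Map any source-specific missing-value marker to None."""
    if value is None:
        return None
    if str(value).strip() in NULL_MARKERS:
        return None
    return value

prices = ["450000", "NR", "-999", "", "525000"]
cleaned = [normalize_nulls(p) for p in prices]

# Averaging only real values avoids the negative-price absurdity above
valid = [float(p) for p in cleaned if p is not None]
avg = sum(valid) / len(valid)
```

Without the normalization step, the -999 sentinel would be averaged in as a real price, which is exactly the failure mode described above.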

Data Validation Checklist

After importing data, verify what actually loaded:

  1. Record Count: Expected number of properties
  2. Column Completeness: All required fields present
  3. Data Types: Numbers formatted as numbers, dates as dates
  4. Memory Usage: File size reasonable for analysis
  5. Null Detection: Identify missing value patterns
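
The checklist above can also be run programmatically. Below is a minimal Python sketch that checks record count, column completeness, data types, and null patterns on a small hypothetical CSV export; the column names and expected row count are assumptions for illustration.

```python
import csv
import io

# Hypothetical import: three property records, with some blanks
raw = """property_id,sqft,sale_price
P001,2500,450000
P002,,525000
P003,1800,
"""
REQUIRED = {"property_id", "sqft", "sale_price"}
EXPECTED_ROWS = 3

rows = list(csv.DictReader(io.StringIO(raw)))
assert len(rows) == EXPECTED_ROWS               # 1. record count
assert REQUIRED <= set(rows[0].keys())          # 2. column completeness
# 3. data types: every non-empty sqft value parses as a number
assert all(r["sqft"].isdigit() for r in rows if r["sqft"])
# 5. null detection: count blanks per column
missing = {c: sum(1 for r in rows if not r[c]) for c in REQUIRED}
```

Each failed assertion points to the exact checklist item that needs investigation before analysis proceeds.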

Missing Value Detection and Strategy

Missing data tells stories. Random gaps differ fundamentally from systematic absences. A dataset showing missing pool information for all condos isn’t experiencing data quality issues—condos rarely have private pools. Understanding why data is missing guides appropriate handling.

The pattern of missingness matters more than the percentage. Missing Completely at Random (MCAR) allows straightforward deletion or imputation. Missing at Random (MAR) requires careful modeling. Missing Not at Random (MNAR) indicates the absence itself carries information.

Consider property tax records where high-value homes consistently lack renovation dates. Wealthy owners might renovate without permits, avoiding reassessment. Simply filling these gaps with neighborhood averages would miss this behavioral signal. The missingness becomes a feature indicating possible unpermitted improvements.
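
One way to preserve that behavioral signal is to encode the absence itself as a feature rather than imputing over it. The sketch below illustrates this for the renovation-date example; the record layout and field names are hypothetical.

```python
# MNAR handling: keep a flag for the missingness instead of filling the gap
records = [
    {"price": 2500000, "renovation_date": None},
    {"price": 450000,  "renovation_date": "2015-06-01"},
]
for r in records:
    # A missing renovation date may signal unpermitted work, so retain it
    r["renovation_date_missing"] = r["renovation_date"] is None

flags = [r["renovation_date_missing"] for r in records]
```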

Missing Value Analysis in Excel

Step 1: Identify Missing Patterns

Use Excel’s conditional formatting to highlight empty cells:

  1. Select data range
  2. Home → Conditional Formatting → Highlight Cells Rules → Blanks
  3. Review patterns by column and row

Step 2: Calculate Missing Percentages

Create summary statistics for each column:

Column A: Property ID (0% missing)
Column B: Square Footage (15% missing)  
Column C: Bedrooms (5% missing)
Column D: Sale Price (2% missing)

Step 3: Apply Appropriate Strategies

  • Numerical fields: Use median or mean imputation
  • Categorical fields: Use mode or “Unknown” category
  • Critical fields: Remove records with missing values
  • Meaningful zeros: Replace missing with zero (HOA fees)
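
The four strategies above can be sketched in a few lines of Python. The field values below are hypothetical; note that each column gets a different treatment depending on its meaning, not a single blanket rule.

```python
from statistics import median

sqft = [2500, 1800, None, 3200, None]           # numerical -> median
styles = ["Ranch", None, "Colonial", "Ranch"]   # categorical -> "Unknown"
hoa = [250, None, 0, None]                      # meaningful zero -> 0
prices = [450000, None, 525000]                 # critical -> drop record

sqft_med = median(v for v in sqft if v is not None)
sqft_clean = [v if v is not None else sqft_med for v in sqft]
styles_clean = [v if v is not None else "Unknown" for v in styles]
hoa_clean = [v if v is not None else 0 for v in hoa]
kept = [p for p in prices if p is not None]
```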

Type Conversion and Validation

Data types seem mundane until they cause analysis failures. The text ‘1,250,000’ isn’t a number Excel can average. The date ‘13/25/2024’ has no thirteenth month; Excel stores it as text, silently excluding it from date calculations. Type mismatches compound throughout analysis pipelines, producing errors far from their source.

Real estate data particularly suffers from inconsistent formatting. Square footage might appear as ‘2,500 sq ft’, ‘2500SF’, or ‘2.5k sqft’. Lot sizes mix acres and square feet. Prices include dollar signs, commas, and occasionally text like ‘Call for price’.

Data Cleaning Workflow

RAW DATA INPUT                    TYPE CONVERSION PROCESS
┌─────────────────────────┐      ┌─────────────────────────┐
│ Price: "$1,250,000"     │ ────▶│ Remove $ and commas     │
│ SqFt: "2,500 sq ft"     │      │ Extract numbers only    │
│ Bedrooms: "3.5"         │      │ Handle decimals         │
│ Date: "13/25/2024"      │      │ Validate date format    │
│ Lot: "0.5 acres"        │      │ Convert units           │
└─────────────────────────┘      └─────────────────────────┘
                                         │
                                         ▼
CLEAN DATA OUTPUT               VALIDATION CHECKS
┌─────────────────────────┐      ┌─────────────────────────┐
│ Price: 1250000 (number) │      │ Price > 0               │
│ SqFt: 2500 (number)     │      │ SqFt > 0                │
│ Bedrooms: 3.5 (number)  │      │ Bedrooms >= 0           │
│ Date: ERROR (invalid)   │      │ Date in valid range     │
│ Lot: 21780 (sqft)       │      │ Lot > 0                 │
└─────────────────────────┘      └─────────────────────────┘
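
The conversion process in the diagram can be sketched as a set of small parsers. This is an illustrative Python version, not Excel's own mechanism; the acre-to-square-foot factor is standard (43,560), and shorthand like ‘2.5k sqft’ would need additional handling.

```python
import re
from datetime import datetime

SQFT_PER_ACRE = 43560

def parse_price(text):
    """'$1,250,000' -> 1250000.0; non-numeric listings like
    'Call for price' -> None."""
    digits = re.sub(r"[^0-9.]", "", text)
    return float(digits) if digits else None

def parse_lot(text):
    """Convert '0.5 acres' to square feet; assumes acres or sqft units."""
    value = float(re.sub(r"[^0-9.]", "", text))
    return value * SQFT_PER_ACRE if "acre" in text.lower() else value

def parse_date(text):
    """Return a date for valid MM/DD/YYYY input, else None
    ('13/25/2024' fails validation, matching the diagram's ERROR)."""
    try:
        return datetime.strptime(text, "%m/%d/%Y").date()
    except ValueError:
        return None
```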

Excel Text-to-Columns Tool

For complex text formatting, use Excel’s Text to Columns feature:

  1. Select column with mixed formats
  2. Data → Text to Columns
  3. Choose delimiter (space, comma, etc.)
  4. Specify data types for each column
  5. Preview results before applying

When should data type issues trigger investigation versus automatic correction? A property listed with 50 bedrooms could be a typo for 5, or it might be a boarding house. The business context determines whether to cap values, investigate manually, or create a separate category for unusual properties.

Identifying Outliers: Technical, Fundamental, and Statistical

Outliers exist in three dimensions, each requiring different treatment. Technical outliers violate physical constraints—negative square footage, 200-story residential buildings, or sale dates in the future. These demand correction or removal.

Fundamental outliers contradict business logic without being impossible. A studio apartment priced above neighborhood mansions suggests data quality issues, though celebrity provenance or historical significance might justify extreme valuations. These require investigation.

Statistical outliers fall beyond expected distributions but might represent genuine extreme values. Luxury properties naturally create right-skewed price distributions. Removing them would bias analysis toward middle-market properties, misrepresenting the actual market.

Outlier Type   Detection Method        Treatment Strategy   Real Estate Example
─────────────  ─────────────────────   ──────────────────   ───────────────────
Technical      Constraint validation   Remove or correct    Negative lot size
Fundamental    Business rules          Investigate cause    $10M mobile home
Statistical    Z-score, IQR            Context-dependent    Beachfront mansion

Z-score Method in Excel

The Z-score measures how many standard deviations a value lies from the mean. Values beyond ±3 standard deviations are typically considered outliers.

Z-score Formula

Z = (X - μ) / σ

Where:

  • X = individual data point
  • μ = mean of the dataset
  • σ = standard deviation of the dataset
  • Threshold = ±3 for outlier detection

Excel Implementation:

  1. Calculate mean: =AVERAGE(data_range)
  2. Calculate standard deviation: =STDEV.S(data_range)
  3. Calculate Z-score: =(value - mean) / stdev
  4. Flag outliers: =ABS(z_score) > 3
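
The same four steps translate directly to Python. This is an illustrative sketch using a hypothetical price list; `stdev` is the sample standard deviation, matching Excel's =STDEV.S.

```python
from statistics import mean, stdev

# Hypothetical prices: 20 mid-market properties plus one extreme value
prices = list(range(400000, 600000, 10000)) + [2500000]

mu = mean(prices)        # step 1: =AVERAGE(data_range)
sigma = stdev(prices)    # step 2: sample std dev, like =STDEV.S(data_range)
zscores = [(x - mu) / sigma for x in prices]          # step 3
outliers = [x for x, z in zip(prices, zscores) if abs(z) > 3]  # step 4
```

Note that the extreme value inflates both the mean and the standard deviation, which is why the IQR method below is often preferred for skewed data.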

IQR Method in Excel

The Interquartile Range (IQR) method identifies outliers using quartile boundaries. It calculates the range between the 25th and 75th percentiles, then flags values beyond 1.5×IQR from these boundaries.

IQR Outlier Detection

Lower Bound = Q1 - 1.5 × IQR

Upper Bound = Q3 + 1.5 × IQR

Where:

  • Q1 = 25th percentile
  • Q3 = 75th percentile
  • IQR = Q3 - Q1
  • Values outside bounds are outliers

Excel Implementation:

  1. Calculate Q1: =QUARTILE.INC(data_range, 1)
  2. Calculate Q3: =QUARTILE.INC(data_range, 3)
  3. Calculate IQR: =Q3 - Q1
  4. Calculate bounds: =Q1 - 1.5*IQR and =Q3 + 1.5*IQR
  5. Flag outliers: =OR(value < lower_bound, value > upper_bound)
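
Here is the same five-step IQR procedure as a Python sketch, on the same hypothetical price list as the Z-score example. The `method="inclusive"` option matches Excel's =QUARTILE.INC interpolation.

```python
from statistics import quantiles

prices = list(range(400000, 600000, 10000)) + [2500000]

q1, _, q3 = quantiles(prices, n=4, method="inclusive")  # steps 1-2
iqr = q3 - q1                                           # step 3
lower = q1 - 1.5 * iqr                                  # step 4
upper = q3 + 1.5 * iqr
outliers = [x for x in prices if x < lower or x > upper]  # step 5
```

Because quartiles ignore the magnitude of extreme values, the IQR bounds are far less distorted by the single luxury price than the Z-score mean and standard deviation were.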

Segment-Aware Outlier Detection

For real estate data, apply outlier detection within property type segments to avoid incorrectly flagging all luxury properties:

  1. Group by property type (single-family, condo, townhouse)
  2. Calculate IQR within each group
  3. Apply asymmetric bounds for right-skewed data
  4. Flag outliers within context
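
A minimal Python sketch of the grouped approach, with hypothetical records: the same $900,000 price is an outlier among condos but perfectly normal among single-family homes, which is the point of segmenting first.

```python
from statistics import quantiles

records = [
    ("condo", 300000), ("condo", 320000), ("condo", 310000),
    ("condo", 330000), ("condo", 900000),            # extreme for condos
    ("single-family", 800000), ("single-family", 850000),
    ("single-family", 820000), ("single-family", 870000),
    ("single-family", 900000),                        # normal here
]

# Step 1: group by property type
by_type = {}
for ptype, price in records:
    by_type.setdefault(ptype, []).append(price)

# Steps 2-4: IQR bounds within each group
flagged = []
for ptype, prices in by_type.items():
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged += [(ptype, p) for p in prices if p < lo or p > hi]
```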

Merging Datasets

Real analysis requires combining multiple data sources. Property listings need census demographics, school ratings, and crime statistics. Each dataset uses different geographic identifiers, temporal granularities, and update frequencies.

The merge operation itself becomes a data quality checkpoint. Unmatched records reveal geocoding errors, boundary changes, or timing misalignments. A property might have sold between the tax assessment and listing database updates, creating conflicting information.

Excel VLOOKUP and INDEX/MATCH

VLOOKUP for simple joins:

=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

INDEX/MATCH for flexible joins:

=INDEX(return_array, MATCH(lookup_value, lookup_array, 0))

Handling merge conflicts:

When tax records show 1,800 square feet but listings claim 2,100, which source takes precedence? The answer depends on your analysis goals and source reliability.
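
An exact-match join like VLOOKUP, plus a simple precedence rule for conflicts, can be sketched in Python as follows. The source names, record values, and the trust-tax-over-listings rule are all hypothetical illustrations of the decision described above.

```python
# Two sources keyed by property ID, with a conflicting and an unmatched record
tax = {"P001": {"sqft": 1800}, "P002": {"sqft": 2400}}
listing = {"P001": {"sqft": 2100}, "P003": {"sqft": 1500}}

merged = {}
for pid in tax.keys() | listing.keys():
    t, l = tax.get(pid), listing.get(pid)
    # Precedence rule: trust tax records over listings when both exist
    merged[pid] = (t or l)["sqft"]

# Records present in only one source are a data-quality signal worth logging
unmatched = (tax.keys() | listing.keys()) - (tax.keys() & listing.keys())
```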

Geographic Joins

Geographic joins present particular challenges. Property addresses rarely match perfectly across systems. ‘123 Main St’ versus ‘123 Main Street’ versus ‘123 Main St Unit A’ represent the same location with different representations.

Address standardization steps:

  1. Expand common abbreviations (St → Street, Ave → Avenue)
  2. Standardize unit indicators (Apt, Unit, #)
  3. Normalize case and spacing
  4. Use fuzzy matching for approximate matches
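
The four steps above can be sketched in Python. The abbreviation list is deliberately partial, and the similarity ratio from `difflib` stands in for a real fuzzy-matching library.

```python
import re
from difflib import SequenceMatcher

# Steps 1-2: partial abbreviation map (expand as sources require)
ABBREV = {"st": "street", "ave": "avenue", "apt": "unit", "#": "unit"}

def standardize(address):
    """Expand abbreviations, normalize case and spacing (step 3)."""
    addr = address.lower().replace("#", " # ")
    tokens = [ABBREV.get(t.strip("."), t.strip(".")) for t in addr.split()]
    return " ".join(tokens)

same = standardize("123 Main St") == standardize("123  Main Street")  # True

# Step 4: fuzzy matching for approximate matches (typos, variant spellings)
ratio = SequenceMatcher(None, standardize("123 Main St"),
                        standardize("123 Maine Street")).ratio()
```

In practice a similarity threshold (say, ratio above 0.9) routes near-matches to manual review rather than automatic merging.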

Data Consistency and Standardization

Consistency enables analysis. The same concept represented differently across records creates false patterns or hides real ones. Standardization transforms varied representations into uniform formats without losing information.

Text fields particularly require attention. Property descriptions mix abbreviations (‘BR’ vs ‘Bedroom’), pluralizations (‘bath’ vs ‘baths’), and regionalisms (‘garden apartment’ vs ‘ground floor flat’). Cities appear as ‘NY’, ‘New York’, ‘NYC’, and ‘New York City’.

Excel Standardization Tools

Find and Replace for systematic changes:

  1. Ctrl+H to open Find and Replace dialog
  2. Find: “BR” → Replace: “Bedroom”
  3. Use wildcards for pattern matching
  4. Apply to entire column or selected range

Data Validation for consistency:

  1. Select data range
  2. Data → Data Validation
  3. Choose validation criteria
  4. Set error alerts for violations

Creating mapping tables:

Build lookup tables for complex standardization rules:

Original  Standardized
────────  ────────────────
BR        Bedroom
Bath      Bathroom
SF        Square Feet
AC        Air Conditioning
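
The mapping table above can be applied as a single whole-word replacement pass. This Python sketch mirrors Excel's Find and Replace but guards against partial matches; the sample description is hypothetical.

```python
import re

MAPPING = {"BR": "Bedroom", "Bath": "Bathroom", "SF": "Square Feet",
           "AC": "Air Conditioning"}

def apply_mapping(text):
    # \b word boundaries ensure 'BR' doesn't match inside 'BRICK'
    pattern = re.compile(r"\b(" + "|".join(MAPPING) + r")\b")
    return pattern.sub(lambda m: MAPPING[m.group(1)], text)

desc = apply_mapping("3 BR, 2 Bath, 1500 SF, central AC")
```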

Documentation and Reproducibility

Data cleaning creates derived datasets whose provenance must be tracked. Which records were removed? What values were imputed? How were outliers handled? Without documentation, cleaned data becomes a black box that other analysts cannot validate or extend.

Excel Documentation Practices

Worksheet organization:

  1. Raw Data - Original imported data
  2. Cleaning Steps - Intermediate transformations
  3. Clean Data - Final cleaned dataset
  4. Documentation - Cleaning log and decisions

Cleaning log template:

Step  Action             Records Affected  Business Rule                Notes
────  ─────────────────  ────────────────  ───────────────────────────  ───────────────────────
1     Remove duplicates  15                Keep most recent             Based on listing date
2     Fix price format   200               Remove $ and commas          Convert text to number
3     Impute bedrooms    50                Use median by property type  Missing for condos only

Version control:

  • Save intermediate files with timestamps
  • Use descriptive filenames
  • Document major decisions in comments
  • Create backup copies before major changes

Validation Through Analysis

Cleaning succeeds when it enables accurate analysis. Statistical summaries reveal remaining issues that technical checks miss. Visualizations expose patterns that suggest data quality problems. The analytical process itself validates cleaning effectiveness.

Excel Validation Techniques

Cross-tabulations uncover impossible combinations:

Properties with zero bedrooms shouldn’t have primary bathroom features. Downtown locations shouldn’t have multi-acre lots. These logical inconsistencies indicate either cleaning errors or fundamental data problems requiring investigation.

Pivot tables for pattern analysis:

  1. Create pivot table with key dimensions
  2. Look for unexpected patterns or gaps
  3. Identify data quality issues
  4. Validate cleaning decisions

Time series analysis reveals temporal anomalies:

A sudden spike in average prices might indicate market heat or might reveal that luxury listings switched from ‘Call for price’ to actual values in that month. Understanding these patterns prevents misinterpretation.

Distributional analysis confirms transformations worked correctly:

After standardizing square footage across multiple listing sources, the combined distribution should be smooth without artificial gaps or spikes at conversion boundaries.

Iterative Data Cleaning Process

The iterative nature of cleaning means returning to earlier steps as analysis reveals new issues. Each analytical question might expose previously unnoticed data quality problems. This feedback loop gradually improves dataset quality while deepening business understanding.

Key insight: Data cleaning is not a one-time task but an ongoing process that evolves with your analysis needs.


© 2025 Prof. Tim Frenzel. All rights reserved. | Version 1.0.8