Section 7: Linear Regression Fundamentals

Learning Objectives

By the end of this section, students will be able to:

Understand the mathematical foundation of linear regression
Apply regression analysis to real estate valuation problems
Interpret regression coefficients and statistical measures
Identify and address common regression assumptions
Use Excel for regression analysis and diagnostics

Introduction

Linear regression provides the statistical framework for translating property attributes into price predictions. This section teaches how to build and interpret regression models for real estate valuation.

Property valuation has evolved from gut instinct and comparable sales to mathematical models that capture the hedonic value of every characteristic, from granite countertops to proximity to transit stations. Linear regression forms the foundation of this transformation, providing the statistical framework that translates property attributes into price predictions.

Real estate professionals encounter regression daily, whether they recognize it or not. The “price per square foot” metric represents the simplest regression model: a single variable explaining price variation. Modern analysis extends this concept to dozens of variables simultaneously, capturing the complex interplay of location, physical attributes, and market conditions that determine property value.

Mathematical Foundation

Simple Linear Regression

Linear regression quantifies the relationship between a dependent variable (typically sale price or rent) and one or more independent variables (property characteristics). The model assumes that changes in property attributes produce proportional changes in value.

Simple Linear Regression Equation

Y = β₀ + β₁X + ε

Where:

Y = dependent variable (sale price)
X = independent variable (square footage)
β₀ = intercept (base value when X = 0)
β₁ = slope coefficient (price change per unit of X)
ε = error term (unexplained variation)

The slope coefficient β₁ represents the marginal contribution of the independent variable to property value. If β₁ equals 150 for square footage, each additional square foot adds $150 to the expected sale price.

Multiple Linear Regression

Multiple regression extends this framework to numerous variables simultaneously:

Multiple Linear Regression Equation

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

Where:

X₁, X₂, …, Xₙ = independent variables (bedrooms, bathrooms, age, etc.)
β₁, β₂, …, βₙ = partial regression coefficients
Each β represents the effect of its X while holding other variables constant

Key Regression Statistics

R-squared Formula

R² = 1 - (SSE/SST)

Where:

SSE = Sum of Squared Errors (residuals)
SST = Total Sum of Squares
R² = proportion of variance explained by the model

Standard Error Formula

SE = √(SSE/(n-k-1))

Where:

SSE = Sum of Squared Errors
n = number of observations
k = number of independent variables
SE = average prediction error

Regression Assumptions

Linearity:

Relationships between variables follow straight lines
Property age often violates this assumption
Add polynomial terms or segment by age ranges

Independence:

One property’s value doesn’t influence another’s
Spatial autocorrelation causes neighboring properties to share similar values
Use spatial regression techniques or clustered standard errors

Homoscedasticity:

Constant error variance across all prediction levels
Real estate shows heteroscedasticity: errors grow with property value
Use log transformations or weighted least squares

No Multicollinearity:

Independent variables remain uncorrelated
Real estate variables naturally correlate
Use Variance Inflation Factors (VIF) to detect problems

Normal Residuals:

Enable valid hypothesis testing and confidence intervals
Real estate residuals often show positive skewness
Use log transformations or robust standard errors

Interactive Regression Assumptions Guide

Explore the five key regression assumptions and how they apply to real estate data. Each assumption shows the ideal condition, common violations in property markets, and practical solutions:

Excel Implementation

Simple Regression: Scatter Chart with Trend Line

For basic regression analysis, Excel’s scatter chart with trend line provides an intuitive approach:

Step 1: Prepare Data - Column A: Independent variable (square footage) - Column B: Dependent variable (sale price) - Include headers in row 1 - Ensure no empty cells or blank rows

Step 2: Create Scatter Chart 1. Select both data columns including headers 2. Insert → Charts → Scatter (X,Y) or Bubble Chart 3. Choose the first scatter chart option 4. Excel automatically plots the data points

Step 3: Add Trend Line 1. Right-click on any data point 2. Select “Add Trendline” 3. Choose “Linear” trend line 4. Check “Display Equation on chart” 5. Check “Display R-squared value on chart”

Step 4: Format and Interpret - The equation shows: Y = β₀ + β₁X - R² value indicates model fit quality - Use this for quick analysis and visualization

Simple Regression: Excel Functions

For more detailed analysis, use individual Excel functions:

SLOPE Function:

=SLOPE(price_range, sqft_range)

INTERCEPT Function:

=INTERCEPT(price_range, sqft_range)

RSQ Function:

=RSQ(price_range, sqft_range)

CORREL Function:

=CORREL(price_range, sqft_range)

Multiple Regression: Data Analysis ToolPak

For multiple regression analysis, Excel’s Data Analysis ToolPak provides comprehensive statistical output:

Step 1: Enable Data Analysis ToolPak 1. File → Options → Add-ins 2. Select “Analysis ToolPak” from the list 3. Click “Go” and check the box 4. Click “OK” to enable

Step 2: Prepare Data for Multiple Regression - Column A: Dependent variable (sale price) - Columns B, C, D, etc.: Independent variables (sqft, bedrooms, age, etc.) - Include headers in row 1 - Critical: No empty cells or blank columns - Critical: No text values in numeric columns - Critical: All data must be contiguous

Step 3: Run Regression Analysis 1. Data → Data Analysis → Regression 2. Input Y Range: Select dependent variable column (including header) 3. Input X Range: Select all independent variable columns (including headers) 4. Check “Labels” if headers are included 5. Select output location 6. Check desired options: - Residuals (for diagnostic plots) - Standardized Residuals - Residual Plots - Line Fit Plots 7. Click “OK”

Excel Troubleshooting Tips

Common Error Prevention: - Never leave empty columns between variables - Never leave empty cells in data ranges - Never include text in numeric columns - Always include headers when checking “Labels” - Always select contiguous ranges for X variables

Data Preparation Checklist: - [ ] All data is numeric (no text, no empty cells) - [ ] Headers are in row 1 - [ ] No blank rows or columns - [ ] Data ranges are contiguous - [ ] Dependent variable is in first column - [ ] Independent variables are in adjacent columns

Error Messages and Solutions: - “Input range contains non-numeric data” → Check for text or empty cells - “Regression analysis failed” → Verify data ranges are contiguous - “Labels option requires headers” → Ensure row 1 contains headers - “Insufficient data” → Check that you have more observations than variables

Interpreting Regression Output

Model Fit Statistics

R-squared (R²):

Definition: Proportion of variance in the dependent variable explained by the model
Range: 0 to 1 (0% to 100%)
Real Estate Interpretation: R² of 0.75 means the model explains 75% of property price variation
Context: Higher values indicate better fit, but perfect fit (R² = 1) suggests overfitting

Adjusted R-squared:

Definition: R² adjusted for the number of variables in the model
Purpose: Prevents overfitting by penalizing unnecessary variables
Real Estate Application: More reliable than R² when comparing models with different numbers of features
Rule of Thumb: Adjusted R² should be close to R²; large differences indicate overfitting

Standard Error:

Definition: Average prediction error in the same units as the dependent variable
Real Estate Context: Standard error of $50,000 means typical predictions are within $50,000 of actual prices
Practical Impact: On $500,000 predictions, this represents a 10% margin of error
Business Use: Helps assess prediction reliability for investment decisions

Variable Significance Tests

t-statistic:

Calculation: Coefficient ÷ Standard Error of Coefficient
Interpretation: Measures how many standard errors the coefficient is from zero
Significance Threshold: |t-statistic| > 2 suggests the variable significantly affects price
Real Estate Example: t-statistic of 3.2 for square footage means this variable strongly influences property value

p-value:

Definition: Probability that the coefficient equals zero (no effect)
Significance Level: p-value < 0.05 confirms the variable affects price
Real Estate Context: p-value of 0.02 for bedroom count means there’s only a 2% chance bedrooms don’t influence price
Business Decision: Variables with p-value > 0.05 may be removed from the model

F-statistic:

Purpose: Tests overall model significance
Interpretation: Determines if the model explains more variance than random chance
Real Estate Application: F-statistic with p-value < 0.05 confirms the model has predictive power
Context: Low F-statistic suggests the model performs no better than using the mean price

Practical Real Estate Examples

High-Quality Model Indicators:

R-squared > 0.70 (explains most price variation)
Adjusted R-squared close to R-squared (no overfitting)
Standard error < 15% of average property price
Most variables have p-values < 0.05
F-statistic p-value < 0.001

Model Improvement Signals:

Low R-squared suggests missing important variables
Large gap between R-squared and adjusted R-squared indicates unnecessary variables
High standard error relative to property values suggests poor prediction accuracy
Many variables with p-values > 0.05 may indicate irrelevant features

Model Diagnostics

Residual Analysis

Creating Residual Plots in Excel:

Run regression with residual output option checked
Create scatter plot with predicted values on X-axis, residuals on Y-axis
Add horizontal line at Y=0
Look for patterns suggesting assumption violations

Common Patterns:

Random scatter around zero: Good model fit
Funnel shapes: Signal heteroscedasticity
Curves: Suggest missing polynomial terms
Clusters: Indicate missing categorical variables

Excel Diagnostic Tools

Residual Plots: - Check “Residual Plots” option in Data Analysis ToolPak - Excel creates plots for each independent variable - Look for patterns that violate regression assumptions

Standardized Residuals: - Check “Standardized Residuals” option - Values beyond ±2 indicate potential outliers - Values beyond ±3 are definite outliers

Example: Apartment Rent Analysis

Building a regression model for apartment rents in downtown Chicago.

Model Variables: - Dependent: Monthly rent - Independent: Square footage, floor number, building age, distance to subway

Excel Implementation Steps: 1. Prepare data with rent in column A, other variables in columns B-E 2. Use Data Analysis ToolPak → Regression 3. Select rent as Y range, other variables as X range 4. Check “Labels” and “Residuals” options 5. Review output for model quality

Results: - Rent = 800 + 1.5(SqFt) + 15(Floor) - 8(Age) - 200(SubwayDist) - R-squared = 0.72 - Each square foot adds $1.50 to rent - Each floor up adds $15 - Each year of building age reduces rent by $8 - Each mile from subway reduces rent by $200