Section 7: Linear Regression Fundamentals
Learning Objectives
By the end of this section, students will be able to:
- Understand the mathematical foundation of linear regression
- Apply regression analysis to real estate valuation problems
- Interpret regression coefficients and statistical measures
- Identify and address common regression assumptions
- Use Excel for regression analysis and diagnostics
Introduction
Linear regression provides the statistical framework for translating property attributes into price predictions. This section teaches how to build and interpret regression models for real estate valuation.
Property valuation has evolved from gut instinct and comparable sales to mathematical models that capture the hedonic value of every characteristic, from granite countertops to proximity to transit stations. Linear regression forms the foundation of this transformation, providing the statistical framework that translates property attributes into price predictions.
Real estate professionals encounter regression daily, whether they recognize it or not. The “price per square foot” metric represents the simplest regression model: a single variable explaining price variation. Modern analysis extends this concept to dozens of variables simultaneously, capturing the complex interplay of location, physical attributes, and market conditions that determine property value.
Mathematical Foundation
Simple Linear Regression
Linear regression quantifies the relationship between a dependent variable (typically sale price or rent) and one or more independent variables (property characteristics). The model assumes that changes in property attributes produce proportional changes in value.
Y = β₀ + β₁X + ε
Where:
- Y = dependent variable (sale price)
- X = independent variable (square footage)
- β₀ = intercept (base value when X = 0)
- β₁ = slope coefficient (price change per unit of X)
- ε = error term (unexplained variation)
The slope coefficient β₁ represents the marginal contribution of the independent variable to property value. If β₁ equals 150 for square footage, each additional square foot adds $150 to the expected sale price.
Multiple Linear Regression
Multiple regression extends this framework to numerous variables simultaneously:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
Where:
- X₁, X₂, …, Xₙ = independent variables (bedrooms, bathrooms, age, etc.)
- β₁, β₂, …, βₙ = partial regression coefficients
- Each β represents the effect of its X while holding other variables constant
Key Regression Statistics
R² = 1 - (SSE/SST)
Where:
- SSE = Sum of Squared Errors (residuals)
- SST = Total Sum of Squares
- R² = proportion of variance explained by the model
SE = √(SSE/(n-k-1))
Where:
- SSE = Sum of Squared Errors
- n = number of observations
- k = number of independent variables
- SE = average prediction error
Regression Assumptions
Linearity:
- Relationships between variables follow straight lines
- Property age often violates this assumption
- Add polynomial terms or segment by age ranges
Independence:
- One property’s value doesn’t influence another’s
- Spatial autocorrelation causes neighboring properties to share similar values
- Use spatial regression techniques or clustered standard errors
Homoscedasticity:
- Constant error variance across all prediction levels
- Real estate shows heteroscedasticity: errors grow with property value
- Use log transformations or weighted least squares
No Multicollinearity:
- Independent variables remain uncorrelated
- Real estate variables naturally correlate
- Use Variance Inflation Factors (VIF) to detect problems
Normal Residuals:
- Enable valid hypothesis testing and confidence intervals
- Real estate residuals often show positive skewness
- Use log transformations or robust standard errors
Explore the five key regression assumptions and how they apply to real estate data. Each assumption shows the ideal condition, common violations in property markets, and practical solutions:
Excel Implementation
Simple Regression: Scatter Chart with Trend Line
For basic regression analysis, Excel’s scatter chart with trend line provides an intuitive approach:
Step 1: Prepare Data - Column A: Independent variable (square footage) - Column B: Dependent variable (sale price) - Include headers in row 1 - Ensure no empty cells or blank rows
Step 2: Create Scatter Chart 1. Select both data columns including headers 2. Insert → Charts → Scatter (X,Y) or Bubble Chart 3. Choose the first scatter chart option 4. Excel automatically plots the data points
Step 3: Add Trend Line 1. Right-click on any data point 2. Select “Add Trendline” 3. Choose “Linear” trend line 4. Check “Display Equation on chart” 5. Check “Display R-squared value on chart”
Step 4: Format and Interpret - The equation shows: Y = β₀ + β₁X - R² value indicates model fit quality - Use this for quick analysis and visualization
Simple Regression: Excel Functions
For more detailed analysis, use individual Excel functions:
SLOPE Function:
=SLOPE(price_range, sqft_range)
INTERCEPT Function:
=INTERCEPT(price_range, sqft_range)
RSQ Function:
=RSQ(price_range, sqft_range)
CORREL Function:
=CORREL(price_range, sqft_range)
Multiple Regression: Data Analysis ToolPak
For multiple regression analysis, Excel’s Data Analysis ToolPak provides comprehensive statistical output:
Step 1: Enable Data Analysis ToolPak 1. File → Options → Add-ins 2. Select “Analysis ToolPak” from the list 3. Click “Go” and check the box 4. Click “OK” to enable
Step 2: Prepare Data for Multiple Regression - Column A: Dependent variable (sale price) - Columns B, C, D, etc.: Independent variables (sqft, bedrooms, age, etc.) - Include headers in row 1 - Critical: No empty cells or blank columns - Critical: No text values in numeric columns - Critical: All data must be contiguous
Step 3: Run Regression Analysis 1. Data → Data Analysis → Regression 2. Input Y Range: Select dependent variable column (including header) 3. Input X Range: Select all independent variable columns (including headers) 4. Check “Labels” if headers are included 5. Select output location 6. Check desired options: - Residuals (for diagnostic plots) - Standardized Residuals - Residual Plots - Line Fit Plots 7. Click “OK”
Excel Troubleshooting Tips
Common Error Prevention: - Never leave empty columns between variables - Never leave empty cells in data ranges - Never include text in numeric columns - Always include headers when checking “Labels” - Always select contiguous ranges for X variables
Data Preparation Checklist: - [ ] All data is numeric (no text, no empty cells) - [ ] Headers are in row 1 - [ ] No blank rows or columns - [ ] Data ranges are contiguous - [ ] Dependent variable is in first column - [ ] Independent variables are in adjacent columns
Error Messages and Solutions: - “Input range contains non-numeric data” → Check for text or empty cells - “Regression analysis failed” → Verify data ranges are contiguous - “Labels option requires headers” → Ensure row 1 contains headers - “Insufficient data” → Check that you have more observations than variables
Interpreting Regression Output
Model Fit Statistics
R-squared (R²):
- Definition: Proportion of variance in the dependent variable explained by the model
- Range: 0 to 1 (0% to 100%)
- Real Estate Interpretation: R² of 0.75 means the model explains 75% of property price variation
- Context: Higher values indicate better fit, but perfect fit (R² = 1) suggests overfitting
Adjusted R-squared:
- Definition: R² adjusted for the number of variables in the model
- Purpose: Prevents overfitting by penalizing unnecessary variables
- Real Estate Application: More reliable than R² when comparing models with different numbers of features
- Rule of Thumb: Adjusted R² should be close to R²; large differences indicate overfitting
Standard Error:
- Definition: Average prediction error in the same units as the dependent variable
- Real Estate Context: Standard error of $50,000 means typical predictions are within $50,000 of actual prices
- Practical Impact: On $500,000 predictions, this represents a 10% margin of error
- Business Use: Helps assess prediction reliability for investment decisions
Variable Significance Tests
t-statistic:
- Calculation: Coefficient ÷ Standard Error of Coefficient
- Interpretation: Measures how many standard errors the coefficient is from zero
- Significance Threshold: |t-statistic| > 2 suggests the variable significantly affects price
- Real Estate Example: t-statistic of 3.2 for square footage means this variable strongly influences property value
p-value:
- Definition: Probability that the coefficient equals zero (no effect)
- Significance Level: p-value < 0.05 confirms the variable affects price
- Real Estate Context: p-value of 0.02 for bedroom count means there’s only a 2% chance bedrooms don’t influence price
- Business Decision: Variables with p-value > 0.05 may be removed from the model
F-statistic:
- Purpose: Tests overall model significance
- Interpretation: Determines if the model explains more variance than random chance
- Real Estate Application: F-statistic with p-value < 0.05 confirms the model has predictive power
- Context: Low F-statistic suggests the model performs no better than using the mean price
Practical Real Estate Examples
High-Quality Model Indicators:
- R-squared > 0.70 (explains most price variation)
- Adjusted R-squared close to R-squared (no overfitting)
- Standard error < 15% of average property price
- Most variables have p-values < 0.05
- F-statistic p-value < 0.001
Model Improvement Signals:
- Low R-squared suggests missing important variables
- Large gap between R-squared and adjusted R-squared indicates unnecessary variables
- High standard error relative to property values suggests poor prediction accuracy
- Many variables with p-values > 0.05 may indicate irrelevant features
Model Diagnostics
Residual Analysis
Creating Residual Plots in Excel:
- Run regression with residual output option checked
- Create scatter plot with predicted values on X-axis, residuals on Y-axis
- Add horizontal line at Y=0
- Look for patterns suggesting assumption violations
Common Patterns:
- Random scatter around zero: Good model fit
- Funnel shapes: Signal heteroscedasticity
- Curves: Suggest missing polynomial terms
- Clusters: Indicate missing categorical variables
Excel Diagnostic Tools
Residual Plots: - Check “Residual Plots” option in Data Analysis ToolPak - Excel creates plots for each independent variable - Look for patterns that violate regression assumptions
Standardized Residuals: - Check “Standardized Residuals” option - Values beyond ±2 indicate potential outliers - Values beyond ±3 are definite outliers
Example: Apartment Rent Analysis
Building a regression model for apartment rents in downtown Chicago.
Model Variables: - Dependent: Monthly rent - Independent: Square footage, floor number, building age, distance to subway
Excel Implementation Steps: 1. Prepare data with rent in column A, other variables in columns B-E 2. Use Data Analysis ToolPak → Regression 3. Select rent as Y range, other variables as X range 4. Check “Labels” and “Residuals” options 5. Review output for model quality
Results: - Rent = 800 + 1.5(SqFt) + 15(Floor) - 8(Age) - 200(SubwayDist) - R-squared = 0.72 - Each square foot adds $1.50 to rent - Each floor up adds $15 - Each year of building age reduces rent by $8 - Each mile from subway reduces rent by $200
© 2025 Prof. Tim Frenzel. All rights reserved. | Version 1.0.6