Housing Price Regression

10 min readFeb 25, 2021

In this blog post, I will be walking you through my steps to clean data and build a multivariate regression to predict house sale prices in King’s County.

My goal behind this regression is to provide a tool for real estate investors (primarily in King’s County) to aid in asset valuation, during both buy and sell side activity. The idea is that this tool can be leveraged to identify lower priced houses for entry into the market. Additionally, the model may provide some insight into whether renovations are correlated with higher sale prices, and if certain metrics provided by King’s County metrics pertaining to house quality are correlated with sale price / reliable for valuation purposes.

Overall process will follow the following steps:

Import data, handle missing values / initial data cleaning
Explore data by looking at relationships with predictor
Transform and preprocess to improve results

All data used throughout this exercise comes from kc_house_data.csv, a dataset provided by Flatiron School.

As a first step before importing the data, make sure all necessary libraries are imported, and build helper functions that will be helpful throughout this exercise.

The first function is a forward-backward feature selection to

"""
Function to perform a forward-backward feature selection
based on p-value from statsmodels.api.OLS
Parameters:
    X: pandas.Dataframe with candidate features
    y: list-like with the target
    initial_list: list of features to start with (column names of X)
    threshold_in: include a feature if its p-val < threshold_in
    threshold_out: exclude a feature if its p-val > threshold_out
    verbose: whether to print the sequence of inclusions and   exclusions
Returns: list of selected features
Note: always set threshold_in < threshold_out to avoid infinite looping
"""
def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out = 0.05, 
                       verbose=True):
    
    included = list(initial_list)
    while True:
        changed=False
        
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype='float64')
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))# backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max() # null if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

Next function is one to identify multi-collinearity within a given set of predictors:

"""
Function to identify multi-collinearity based on a threshold value.
Parameters:
    data_pred: predictor variables
    percent_correlated: threshold percentage to determine the level of correlation 
                        to identify
Returns: Dataframe of correlated pairs, with duplicates removed
"""
def multi_collinearity(data_pred, percent_correlated=0.75):
    df = data_pred.corr().abs().stack().reset_index().sort_values(0, ascending=False)
    
    # zip the variable name columns (named level_0 and level_1 by default) in a new column named "pairs"
    df['pairs'] = list(zip(df.level_0, df.level_1))
    
    # set index to pairs
    df.set_index(['pairs'], inplace=True)
    
    # drop level columns
    df.drop(columns=['level_1', 'level_0'], inplace=True)
    
    # rename correlation column as cc rather than 0
    df.columns = ['cc']
    
    # drop duplicates
    df.drop_duplicates(inplace=True)
    
    return df[(df.cc > percent_correlated) & (df.cc < 1)]

Import and evaluate the data:

# import raw dataset
raw_data = pd.read_csv('kc_house_data.csv')# examine the raw dataset
raw_data.info()

From first glance, we can see that waterfront, yr_renovated, and view have missing values. Additionally, we can drop the id column as this is not relevant to predicting sale price. Drop id column and handle missing values. I chose to fill missing waterfront and missing yr_renovated values with 0s as I thought these likely related to houses without these features or any renovations. I am also going to drop all missing view values as there are only 63 that are missing.

clean_data = raw_data.drop('id', axis=1)clean_data['waterfront'] = clean_data['waterfront'].fillna(value=0.)
clean_data['yr_renovated'] = clean_data['yr_renovated'].fillna(value=0.)

Check for duplicates

clean_data.duplicated().sum()

Once we have handled all missing and duplicated data, we need to handle the date column. Currently this column is a string. We are going to set this to a datetime column and then split year, month, day into their own columns.

#convert column to datetime
clean_data['date'] = pd.to_datetime(clean_data['date'], 
                                    infer_datetime_format=True)# create indidivual columns for year, month, day
clean_data['year_sold'] = clean_data['date'].map(lambda x: x.year)
clean_data['month_sold'] = clean_data['date'].map(lambda x: x.month)
clean_data['day_sold'] = clean_data['date'].map(lambda x: x.day)# drop date column
clean_data = clean_data.drop('date', axis=1)

Now that we have an initially clean dataset with missing values, duplicates, weird values, and all string columns translated to numerical columns, we can start looking at all distributions. I have chosen to look at them all at once using the below. I also use the color purple because I think it is a good change from the blue / red everyone always uses.

clean_data.hist(figsize=(25, 25), bins='auto', color='purple');

Now that we have a view of all our variables, we can describe the distributions and start making some observations. Below are the obs that I made:

Description of variables and distributions

Next step is to start plotting all variables against price. I am using the below function to help, given there are a number of sns.regplots that will need to be created throughout.

"""
Function to plot y vs. a chosen set of x variables: includes regression line
Parameters:
    data: name of dataframe variables are coming from
    target: string of target variable
    predictors: dataframe of predictors
    nrows: number of rows of subplots
    ncols: number of cols of subplots
    figsize: size of the figure
    type: allows chooser to select between regression or distplot
"""
def plot_viz(data, target, predictors, nrows=1, ncols=1, figsize=(10, 4), type='reg'):
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    
    if type == 'reg':
        for ax, feature in zip(axes.flatten(), predictors):
            sns.regplot(x=feature, y=target, data=data, ax=ax, color='purple', scatter_kws={'alpha':0.2});
    
    plt.tight_layout()

I’ve decided to leave if statement in for future functionality to build in additional graphs. Right now, this only produces regplots

Next, plot all predictor variables against price.

# evaluate relationships with price
predictors = clean_data.drop('price', axis=1)
plot_viz(data=clean_data,
         target='price',
         predictors=predictors,
         nrows=5,
         ncols=2,
         figsize=(15, 15))

Example of plotting vs. target variable (price)

Look closer at the distribution of our predictor variable. I divide by 1000 to make the scale easier to read:

sns.distplot(clean_data['price']/1000, color='purple')
plt.tight_layout()

Looking at the distribution above, we can see we have a lot of right skew. Distribution would likely benefit from a transformation. In this case, will be using a boxcox transformation. When transforming the data, I set a variable, fitted_lambda_0 to the output lambda variable. This output lambda value is the power adjustment used in the transformation. If lambda is 0, a log transformation is used.

Transform target and plot distribution to compare

# boxcox transformation
clean_data['price_boxcox'], fitted_lamba_0 = stats.boxcox(x=clean_data['price'])
print(f'Fitted lambda value: {fitted_lamba_0}')# plot price_boxcox
sns.distplot(clean_data['price_boxcox'], color='purple')

From a visual inspection, we can see that the transformed distribution is more normal than the non-transformed data. Now drop price as we won’t need it going forward.

# drop price for now
preprocessed = clean_data.drop('price', axis=1)

Replot continuous variables against transformed price to see any improvement

# plot all continuos against log_price
variables = ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']
variables_2 = ['yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15', 'zipcode']plot_viz(data=preprocessed,
         target='log_price',
         predictors=variables,
         nrows=2,
         ncols=2,
         figsize=(10, 10),
         type='reg')plot_viz(data=preprocessed,
         target='log_price',
         predictors=variables_2,
         nrows=3,
         ncols=3,
         figsize=(10, 10),
         type='reg')

As seen from above, our predictors are also looking far more linearly related with this transformation now.

Drop columns that do not conform to linearity requirement, identify multicollinearity, and remove correlated predictors.

# drop clear non-linear columns
preprocessed = preprocessed.drop(['sqft_lot', 
                                  'yr_built', 
                                  'zipcode', 
                                  'lat', 
                                  'long', 
                                  'sqft_lot15'], axis=1)# identify and remove any multi-collinearity
data_pred = preprocessed.drop(['log_price', 'price_boxcox'], axis=1)
correlated_predictors = multi_collinearity(data_pred, 0.75)
correlated_predictors

Drop correlated predictors

# drop sqft_living, year_sold, sqft_above
preprocessed = preprocessed.drop(['sqft_living', 'year_sold', 'sqft_above'], axis=1)

Given large number of zeros in yr_renovated column, convert to a categorical column representing a renovation vs. no renovation

# create renovated column
preprocessed['renovated'] = preprocessed['yr_renovated'].apply(lambda x: 1 if x > 0 else 0)# drop yr_renovated column
preprocessed.drop('yr_renovated', axis=1, inplace=True)

Now perform additional preprocessing on remaining continuous columns

Plot continuous vs. price to try and improve results

# cont. columns
numerical = ['sqft_basement', 'sqft_living15']
numerical_vars = preprocessed_2[numerical]# plot numerical columns against price and see if transformations improve results
plot_viz(data=preprocessed_2,
         target='price_boxcox',
         predictors=numerical_vars,
         nrows=1,
         ncols=2,
         type='reg')

Remaining cont variables vs. price_boxcox

Both relationships are looking linearly related with price_boxcox. sqft_living15 looks like it would benefit from a transformation of some sort. sqft_basement poses an issue given the number of 0s or non-basement entries that exist within the dataset. The range in house prices for houses with no basements, is similar to the entire range of houses with basements, highlighting issues leaving the 0s in the data.

Check the distributions.

# plot distributions
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
for ax, feature in zip(axes.flatten(), numerical_vars):
    sns.distplot(preprocessed_2[feature], ax=ax, color='purple');plt.tight_layout()

Given the number of zeros in sqft_basement, convert to cat column like yr_renovated. I also keep track of all cat columns that were one-hot encoded with ohe_list

preprocessed_2['basement'] = preprocessed_2['sqft_basement'].apply(lambda x: 1 if x > 0 else 0)
ohe_list = []
ohe_list.append('basement')# drop sqft_basement
preprocessed_2.drop('sqft_basement', axis=1, inplace=True)

Transform sqft_living15 in the same way as price with boxcox.

# transform sqft_living_15 using boxcox transformation
preprocessed_2['t_sqft_liv_15'], fitted_lambda = stats.boxcox(preprocessed_2['sqft_living15'])# drop sqft_living15
preprocessed_2.drop('sqft_living15', axis=1, inplace=True)

Replot against price

As you can see from above, this transformation results in a much more linear relationship.

Now we have dropped non-linear variables and handled continuous variables, we need to handle categorical variables. First step is to identify which are ordinal vs. non-ordinal. Non-ordinal will be one-hot encoded. Ordinal will be kept as a single column. Identify cat columns and then plot boxplots against price to determine if ordinal or not.

cat_cols = ['bedrooms', 'bathrooms', 'floors', 'waterfront', 'view', 'condition', 
            'grade', 'month_sold', 'day_sold', 'renovated']
cat_predictors = preprocessed[cat_cols]fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 15))
for ax, feature in zip(axes.flatten(), cat_predictors):
    sns.boxplot(x=feature, y='price_boxcox', data=preprocessed, ax=ax, color='darkred');plt.tight_layout()

Identify which cat columns are ordinal and will be left as a single column.

# identify ordinal
ordinal_cat_cols = ['bedrooms', 'bathrooms', 'grade']
ordinal_predictors = preprocessed_2[ordinal_cat_cols]

After removing outliers as you see fit, replot vs. price.

# replot
plot_viz(data=preprocessed_2,
         target='price_boxcox',
         predictors=ordinal_predictors,
         nrows=1,
         ncols=3,
         figsize=(15, 5),
         type='reg')

After additional preprocessing to remove outliers, one-hot encode the non-ordinal categorical columns.

# create dummy variables
basement_dummies = pd.get_dummies(preprocessed_2['basement'], prefix='basement', drop_first=True)
floors_dummies = pd.get_dummies(preprocessed_2['floors_10'], prefix='floors', drop_first=True)
water_dummies = pd.get_dummies(preprocessed_2['waterfront'], prefix='water', drop_first=True)
view_dummies = pd.get_dummies(preprocessed_2['view'], prefix='view', drop_first=True)
condition_dummies = pd.get_dummies(preprocessed_2['condition'], prefix='condition', drop_first=True)
month_dummies = pd.get_dummies(preprocessed_2['month_sold'], prefix='month', drop_first=True)
day_dummies = pd.get_dummies(preprocessed_2['day_sold'], prefix='day', drop_first=True)
reno_dummies = pd.get_dummies(preprocessed_2['renovated'], prefix='reno', drop_first=True)

Next, drop columns, and concat to a final dataframe for feature selection and regression.

# drop columns
data_fin_2 = preprocessed_2.drop(ohe_list, axis=1)# concat dummy variables with remaining columns
data_fin_2 = pd.concat([data_fin_2,
                        basement_dummies,
                        floors_dummies,
                        water_dummies,
                        view_dummies,
                        condition_dummies,
                        month_dummies,
                        day_dummies,
                        reno_dummies], axis=1)# run stepwise selection to select features
y = data_fin_2['price_boxcox']
X = data_fin_2.drop('price_boxcox', axis=1)
result = stepwise_selection(X, y, verbose=True)

Run a regression using these selected features

# run regression with these features
X = data_fin_2[result]
y = data_fin_2['price_boxcox']
predictors = sm.add_constant(X)
model = sm.OLS(y, predictors).fit()model.summary()

We can see our p-vals are significant, but we still need to check assumptions on the residuals.

Use the below to create residual plots for all variables.

# evaluate residuals to check assumptions
dataset = pd.concat([y, X], axis=1)
results = []
y = 'price_boxcox'
for idx, column in enumerate(dataset.columns):
    print (f"KC Housing DataSet - Regression Analysis and Diagnostics for {y}~{column}")
    print ("-------------------------------------------------------------------------------------")f = f'{y}~{column}'
    model = smf.ols(formula=f, data=dataset).fit()
    
    fig, axes = plt.subplots(figsize=(15,12))
    fig = sm.graphics.plot_regress_exog(model, column, fig=fig)
    fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
    fig.tight_layout()
    plt.show()
    
    results.append([column, model.rsquared, model.params[0], model.params[1], model.pvalues[1], sms.jarque_bera(model.resid)[0]])

One that code is run, the results can be stored in a dataframe making it easy to compare to other iterations.

# store results in dataframe to make future comparison easy
model_4_results = pd.DataFrame(results, columns=['ind_var', 
                                                  'r_squared', 
                                                  'intercept', 
                                                  'slope', 
                                                  'p-value', 
                                                  'normality (JB)'])

Looking at these results in conjunction with the q-q plots and other residual plots, it is easy to determine which variables do not meet normality or homoscedastcity assumptions.

Drop violating variables and run regression with remaining predictors

X = X.drop(['floors_30',
            'floors_25',
            'view_4',
            'view_2',
            'view_3',
            'view_1',
            'water_1', 
            'condition_2', 
            'bathrooms'], axis=1)# run regression with these features removed
y = data_fin_2['price_boxcox']
predictors = sm.add_constant(X)
model = sm.OLS(y, predictors).fit()model.summary()

Finally, an option is to run KFold Cross Val to see if our MSE and R2 are in line.

# cross validation to make sure we are not overfitting
regression = LinearRegression()
y = dataset['price_boxcox']
crossvalidation = KFold(n_splits=4, shuffle=True, random_state=1)
score = np.mean(cross_val_score(regression, predictors, y, scoring='r2', cv=crossvalidation))
score# now calc MSE
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score
mse = make_scorer(mean_squared_error)
cv_results = cross_val_score(regression, X, predictors, cv=5, scoring=mse)
mse_score = cv_results.mean()

And just like that you have a final model, with a decent level of preprocessing included, transformations on the target, one-hot encodings, and outlier handling. One thing to notice is that the effect size / coefficients are relatively small. Additionally, there are additional items to try layering in such as interaction items and polynomial terms, but for now this is a good spot / fairly well working model.

Housing Price Regression

Written by Kai Graham