Exploratory Data Analysis – House Prices – Part 1

This is part of a series.


In this article we are going to perform an Exploratory Data Analysis, a.k.a. EDA, of the dataset "House Prices: Advanced Regression Techniques".

In this Part 1 we will:

  • Understand the problem
  • Explore the data and deal with missing values

In Part 2 we will:

  • Prepare the data
  • Select and transform variables, especially categorical ones

The Problem

This is the description of the problem on Kaggle:

"Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."

So, we are going to explore the dataset, try to get some insights from it, and use some tools to transform the data into formats that make more sense.

Initial Exploration and First Insights

In this section, we are going to make an initial exploration of the dataset.

This EDA was performed on a Jupyter Notebook. You can download the notebook for this Part 1 of the EDA, but note that the notebook is rawer and doesn't include the explanations.

Importing Libraries

We begin by importing the libraries we are going to use:

  • The standard math module provides access to the mathematical functions.
  • The NumPy lib is fundamental for any kind of scientific computing with Python.
  • pandas is a must-have tool for data analysis and manipulation.
  • matplotlib is the most complete package in Python when it comes to data visualizations.
  • seaborn is built on top of matplotlib as a higher-level set of visualization tools; it is not as powerful as matplotlib, but it is much easier to work with and delivers a lot with little code.
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

Loading Data

Since we have tabular data, we are going to use pandas to load the data and take a first look at it.

To load the data, since the format is CSV (Comma-Separated Values), we use the read_csv() function from pandas.

Then we print its shape, which is 1168×81, meaning we have 1168 rows (records) and 81 columns (features).

Actually, we have 1169 rows in the CSV file, but the header that describes the columns doesn’t count.

And we actually have 79 features, since one of the columns is SalePrice (the target we will try to predict with a model) and another is Id, which we will not use and will get rid of later.

The dataset can be downloaded from Homes Dataset.

train = pd.read_csv('../data/raw/train.csv')
train.shape
(1168, 81)

Looking at the Data

First, I recommend reading this brief description of each column.

Using the head() function from pandas with an argument of 3, we can take a look at the first 3 records.

The .T means Transpose, this way we visualize rows as columns and vice-versa.

Notice how it doesn’t show all of the columns in the middle and only displays ... because there are too many of them.

train.head(3).T
                    0       1       2
Id                893    1106     414
MSSubClass         20      60      30
MSZoning           RL      RL      RM
LotFrontage        70      98      56
LotArea          8414   12256    8960
...               ...     ...     ...
MoSold              2       4       3
YrSold           2006    2010    2010
SaleType           WD      WD      WD
SaleCondition  Normal  Normal  Normal
SalePrice      154500  325000  115000

81 rows × 3 columns
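The .T transpose trick can be sketched on a toy frame (hypothetical data, only to illustrate the shape change):

```python
import pandas as pd

# toy frame: 3 rows x 2 columns (illustrative values, not from the dataset)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

print(df.shape)    # (3, 2)
print(df.T.shape)  # (2, 3) -- rows and columns swapped
```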

The info() method from pandas will give you a summary of the data.

Notice how Alley has only 70 non-null values, meaning it is missing for most of the 1168 records.

We can also visualize the data types.

train.info()

RangeIndex: 1168 entries, 0 to 1167
Data columns (total 81 columns):
Id               1168 non-null int64
MSSubClass       1168 non-null int64
MSZoning         1168 non-null object
LotFrontage      964 non-null float64
LotArea          1168 non-null int64
Street           1168 non-null object
Alley            70 non-null object
LotShape         1168 non-null object
LandContour      1168 non-null object
Utilities        1168 non-null object
LotConfig        1168 non-null object
LandSlope        1168 non-null object
Neighborhood     1168 non-null object
Condition1       1168 non-null object
Condition2       1168 non-null object
BldgType         1168 non-null object
HouseStyle       1168 non-null object
OverallQual      1168 non-null int64
OverallCond      1168 non-null int64
YearBuilt        1168 non-null int64
YearRemodAdd     1168 non-null int64
RoofStyle        1168 non-null object
RoofMatl         1168 non-null object
Exterior1st      1168 non-null object
Exterior2nd      1168 non-null object
MasVnrType       1160 non-null object
MasVnrArea       1160 non-null float64
ExterQual        1168 non-null object
ExterCond        1168 non-null object
Foundation       1168 non-null object
BsmtQual         1138 non-null object
BsmtCond         1138 non-null object
BsmtExposure     1137 non-null object
BsmtFinType1     1138 non-null object
BsmtFinSF1       1168 non-null int64
BsmtFinType2     1137 non-null object
BsmtFinSF2       1168 non-null int64
BsmtUnfSF        1168 non-null int64
TotalBsmtSF      1168 non-null int64
Heating          1168 non-null object
HeatingQC        1168 non-null object
CentralAir       1168 non-null object
Electrical       1167 non-null object
1stFlrSF         1168 non-null int64
2ndFlrSF         1168 non-null int64
LowQualFinSF     1168 non-null int64
GrLivArea        1168 non-null int64
BsmtFullBath     1168 non-null int64
BsmtHalfBath     1168 non-null int64
FullBath         1168 non-null int64
HalfBath         1168 non-null int64
BedroomAbvGr     1168 non-null int64
KitchenAbvGr     1168 non-null int64
KitchenQual      1168 non-null object
TotRmsAbvGrd     1168 non-null int64
Functional       1168 non-null object
Fireplaces       1168 non-null int64
FireplaceQu      617 non-null object
GarageType       1099 non-null object
GarageYrBlt      1099 non-null float64
GarageFinish     1099 non-null object
GarageCars       1168 non-null int64
GarageArea       1168 non-null int64
GarageQual       1099 non-null object
GarageCond       1099 non-null object
PavedDrive       1168 non-null object
WoodDeckSF       1168 non-null int64
OpenPorchSF      1168 non-null int64
EnclosedPorch    1168 non-null int64
3SsnPorch        1168 non-null int64
ScreenPorch      1168 non-null int64
PoolArea         1168 non-null int64
PoolQC           4 non-null object
Fence            217 non-null object
MiscFeature      39 non-null object
MiscVal          1168 non-null int64
MoSold           1168 non-null int64
YrSold           1168 non-null int64
SaleType         1168 non-null object
SaleCondition    1168 non-null object
SalePrice        1168 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 739.2+ KB

The describe() method is good for getting first insights into the data.

It automatically gives you descriptive statistics for each feature: number of non-NA/null observations, mean, standard deviation, the min value, the quartiles, and the max value.

Note that the calculations don’t take NaN values into consideration.

For LotFrontage, for instance, it uses only the 964 non-null values, and excludes the other 204 null observations.
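This NaN exclusion is easy to verify on a toy Series (illustrative values, not from the dataset):

```python
import numpy as np
import pandas as pd

# one of the three values is missing
s = pd.Series([1.0, np.nan, 3.0])

print(s.count())  # 2 -- the NaN is not counted
print(s.mean())   # 2.0 -- mean of the two non-null values, not of three
```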

train.describe().T
               count           mean           std      min        25%       50%        75%       max
Id            1168.0     720.240582    420.237685      1.0     355.75     716.5    1080.25    1460.0
MSSubClass    1168.0      56.699486     41.814065     20.0      20.00      50.0      70.00     190.0
LotFrontage    964.0      70.271784     25.019386     21.0      59.00      69.5      80.00     313.0
LotArea       1168.0   10597.720890  10684.958323   1477.0    7560.00    9463.0   11601.50  215245.0
OverallQual   1168.0       6.095034      1.403402      1.0       5.00       6.0       7.00      10.0
OverallCond   1168.0       5.594178      1.116842      1.0       5.00       5.0       6.00       9.0
YearBuilt     1168.0    1971.120719     30.279560   1872.0    1954.00    1972.0    2000.00    2009.0
YearRemodAdd  1168.0    1985.200342     20.498566   1950.0    1968.00    1994.0    2004.00    2010.0
MasVnrArea    1160.0     104.620690    183.996031      0.0       0.00       0.0     166.25    1600.0
BsmtFinSF1    1168.0     444.345890    466.278751      0.0       0.00     384.0     706.50    5644.0
BsmtFinSF2    1168.0      46.869863    162.324086      0.0       0.00       0.0       0.00    1474.0
BsmtUnfSF     1168.0     562.949486    445.605458      0.0     216.00     464.5     808.50    2336.0
TotalBsmtSF   1168.0    1054.165240    448.848911      0.0     792.75     984.0    1299.00    6110.0
1stFlrSF      1168.0    1161.268836    393.541120    334.0     873.50    1079.5    1392.00    4692.0
2ndFlrSF      1168.0     351.218322    437.334802      0.0       0.00       0.0     730.50    2065.0
LowQualFinSF  1168.0       5.653253     48.068312      0.0       0.00       0.0       0.00     572.0
GrLivArea     1168.0    1518.140411    534.904019    334.0    1133.25    1467.5    1775.25    5642.0
BsmtFullBath  1168.0       0.426370      0.523376      0.0       0.00       0.0       1.00       3.0
BsmtHalfBath  1168.0       0.061644      0.244146      0.0       0.00       0.0       0.00       2.0
FullBath      1168.0       1.561644      0.555074      0.0       1.00       2.0       2.00       3.0
HalfBath      1168.0       0.386130      0.504356      0.0       0.00       0.0       1.00       2.0
BedroomAbvGr  1168.0       2.865582      0.817491      0.0       2.00       3.0       3.00       8.0
KitchenAbvGr  1168.0       1.046233      0.218084      1.0       1.00       1.0       1.00       3.0
TotRmsAbvGrd  1168.0       6.532534      1.627412      2.0       5.00       6.0       7.00      14.0
Fireplaces    1168.0       0.612158      0.640872      0.0       0.00       1.0       1.00       3.0
GarageYrBlt   1099.0    1978.586897     24.608158   1900.0    1962.00    1980.0    2002.00    2010.0
GarageCars    1168.0       1.761130      0.759039      0.0       1.00       2.0       2.00       4.0
GarageArea    1168.0     473.000000    218.795260      0.0     318.75     479.5     577.00    1418.0
WoodDeckSF    1168.0      92.618151    122.796184      0.0       0.00       0.0     168.00     736.0
OpenPorchSF   1168.0      45.256849     64.120769      0.0       0.00      24.0      68.00     523.0
EnclosedPorch 1168.0      20.790240     58.308987      0.0       0.00       0.0       0.00     330.0
3SsnPorch     1168.0       3.323630     27.261055      0.0       0.00       0.0       0.00     407.0
ScreenPorch   1168.0      14.023116     52.498520      0.0       0.00       0.0       0.00     410.0
PoolArea      1168.0       1.934075     33.192538      0.0       0.00       0.0       0.00     648.0
MiscVal       1168.0      42.092466    538.941473      0.0       0.00       0.0       0.00   15500.0
MoSold        1168.0       6.377568      2.727010      1.0       5.00       6.0       8.00      12.0
YrSold        1168.0    2007.815068      1.327339   2006.0    2007.00    2008.0    2009.00    2010.0
SalePrice     1168.0  181081.876712  81131.228007  34900.0  129975.00  162950.0  214000.00  755000.0

Data Cleaning

In this section, we will perform some Data Cleaning.

The Id column

The Id column is just an arbitrary identifier, with no relationship to SalePrice.

So let's remove it:

train.drop(columns=['Id'], inplace=True)

Missing values

When we used info() to see the data summary, we could see many columns had a bunch of missing data.

Let’s see which columns have missing values and the proportion in each one of them.

isna() from pandas flags the missing entries in each column, and sum() then adds them up to give a total per column.

columns_with_miss = train.isna().sum()
#filtering only the columns with at least 1 missing value
columns_with_miss = columns_with_miss[columns_with_miss!=0]
#The number of columns with missing values
print('Columns with missing values:', len(columns_with_miss))
#sorting the columns by the number of missing values descending
columns_with_miss.sort_values(ascending=False)
Columns with missing values: 19

PoolQC          1164
MiscFeature     1129
Alley           1098
Fence            951
FireplaceQu      551
LotFrontage      204
GarageYrBlt       69
GarageType        69
GarageFinish      69
GarageQual        69
GarageCond        69
BsmtFinType2      31
BsmtExposure      31
BsmtFinType1      30
BsmtCond          30
BsmtQual          30
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64

Out of 80 columns, 19 have missing values.
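The isna().sum() pattern used above can be checked on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# toy frame with a missing value in 'b' only (hypothetical data)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1.0, np.nan, 3.0]})

missing = df.isna().sum()        # a: 0, b: 1
missing = missing[missing != 0]  # keep only columns with at least 1 missing value
print(missing)                   # only 'b' remains, with a count of 1
```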

Missing values per se are not a big problem, but columns with a high number of missing values can cause distortions.

This is the case for:

  • PoolQC: Pool quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • Alley: Type of alley access to property
  • Fence: Fence quality

Let’s drop them from the dataset for now.

# Removing columns
train.drop(columns=['PoolQC', 'MiscFeature', 'Alley', 'Fence'], inplace=True)

FireplaceQu has 551 missing values, which is also pretty high.

In this case, though, the missing values have a meaning: "No Fireplace".

FireplaceQu has the following categories:

  • Ex Excellent – Exceptional Masonry Fireplace
  • Gd Good – Masonry Fireplace in main level
  • TA Average – Prefabricated Fireplace in main living area or Masonry Fireplace in basement
  • Fa Fair – Prefabricated Fireplace in basement
  • Po Poor – Ben Franklin Stove
  • NA No Fireplace

Let’s check the correlation between FireplaceQu and SalePrice, to see how important this feature is in order to determine the price.

First, we will replace the missing values with 0.

Then, we encode the categories into numbers from 1 to 5.

train['FireplaceQu'] = train['FireplaceQu'].fillna(0)
train['FireplaceQu'] = train['FireplaceQu'].replace({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})
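The same fillna/replace pattern, sketched on a toy Series (hypothetical values) so the resulting codes are visible:

```python
import numpy as np
import pandas as pd

# a NaN stands for "no fireplace" in this toy example
quality = pd.Series(['Gd', np.nan, 'Ex', 'Po'])

quality = quality.fillna(0)  # NaN -> 0
quality = quality.replace({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})

print(quality.tolist())  # [4, 0, 5, 1]
```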

Using a barplot, we can see how the mean SalePrice increases with the FireplaceQu category.

It is also worth noting how much higher the value is when the house has an Excellent fireplace.

This means we should keep FireplaceQu as a feature.

sns.set(style="whitegrid")
sns.barplot(x='FireplaceQu', y="SalePrice", data=train)

[Figure: bar plot of mean SalePrice by FireplaceQu category]
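seaborn's barplot shows the mean of the y variable per category; the same aggregate can be computed directly with groupby, sketched here on toy numbers (not the real dataset):

```python
import pandas as pd

# hypothetical prices for two fireplace-quality codes
toy = pd.DataFrame({'FireplaceQu': [0, 0, 4, 4],
                    'SalePrice': [100000, 120000, 200000, 220000]})

means = toy.groupby('FireplaceQu')['SalePrice'].mean()
print(means)  # 0 -> 110000.0, 4 -> 210000.0
```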

Missing values in numeric columns

Another feature with a high number of missing values is LotFrontage, with 204 missing values.

Let’s see the correlation between the remaining features with missing values and the SalePrice.

columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
c = list(columns_with_miss.index)
c.append('SalePrice')
train[c].corr()
             LotFrontage  MasVnrArea  GarageYrBlt  SalePrice
LotFrontage     1.000000    0.196649     0.089542   0.371839
MasVnrArea      0.196649    1.000000     0.253348   0.478724
GarageYrBlt     0.089542    0.253348     1.000000   0.496575
SalePrice       0.371839    0.478724     0.496575   1.000000
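corr() computes pairwise Pearson correlations, skipping rows where either value is missing; a minimal sketch with perfectly correlated toy columns:

```python
import pandas as pd

# y is exactly 2 * x, so the Pearson correlation is 1.0 (toy data)
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

corr_xy = df.corr().loc['x', 'y']
print(corr_xy)  # approximately 1.0
```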

Note that LotFrontage, MasVnrArea, and GarageYrBlt have a positive correlation with SalePrice, but this correlation isn’t very strong.

To simplify this analysis, we will remove these columns for now:

cols_to_be_removed = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea']
train.drop(columns=cols_to_be_removed, inplace=True)

Finally, these are the remaining columns with missing values:

columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f'Columns with missing values: {len(columns_with_miss)}')
columns_with_miss.sort_values(ascending=False)
Columns with missing values: 11

GarageCond      69
GarageQual      69
GarageFinish    69
GarageType      69
BsmtFinType2    31
BsmtExposure    31
BsmtFinType1    30
BsmtCond        30
BsmtQual        30
MasVnrType       8
Electrical       1
dtype: int64

Conclusion

In this part 1 we dealt with missing values and removed the following columns: ‘Id’, ‘PoolQC’, ‘MiscFeature’, ‘Alley’, ‘Fence’, ‘LotFrontage’, ‘GarageYrBlt’, ‘MasVnrArea’.

Please note that the removed columns are not necessarily useless; they may still contribute to a final model.

After a first round of analysis and hypothesis testing, if you need to improve your future model further, you can consider revisiting these columns and understanding them better to see how they fit into the problem.

Data Analysis and Machine Learning is NOT a straight path.

It is a process where you iterate and keep testing ideas until you have the result you want, or until you find out that the result you need is not possible.

In Part 2 (the final part of the EDA) we will see ways to handle the missing values in the other 11 columns.

We will also explore categorical variables.