This is the third article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.
The first two articles were the Exploratory Data Analysis (EDA) of the dataset; the full list of articles in the series:
- Exploratory Data Analysis – House Prices – Part 1
- Exploratory Data Analysis – House Prices – Part 2
- Data Science Project: Data Cleaning Script – House Prices DataSet
- Data Science Project: Machine Learning Model – House Prices Dataset
- Data Science Project: House Prices Dataset – API
- Data Science and Machine Learning Project: House Prices Dataset
This article converts the final data-cleaning decisions made in the Jupyter Notebook into a single Python script that reads the data in CSV format and writes the cleaned data back out as a CSV.
Data Cleaning Script
You can save the script in a file named ‘data_cleaning.py’ and execute it directly with python3 data_cleaning.py or python data_cleaning.py, depending on your installation.
You only need the pandas library, which comes installed by default with Anaconda.
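If you are not using Anaconda, you can confirm that pandas is importable before running the script with a quick check like this (any reasonably recent pandas release works):

```python
# Verify that pandas is importable and show which version is installed
import pandas as pd

print(pd.__version__)
```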
The script expects the train file (unzip it to get the CSV file).
The output will be a file named ‘cleaned_data.csv’.
It will also print the shape of the original data and the shape of the new cleaned data.
Original Data: (1168, 81)
After Cleaning: (1168, 73)
import os

import pandas as pd


def clean_data(df, output_file='cleaned_data.csv'):
    """Performs an initial clean on a dataframe.

    Args:
        df (pd.DataFrame): A dataframe to clean.
        output_file (str): Path of the CSV file to write the cleaned data to.
            Defaults to 'cleaned_data.csv'.

    Returns:
        pd.DataFrame: The cleaned dataframe.
    """
    # Removes columns with missing-value issues
    cols_to_be_removed = ['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence',
                          'LotFrontage', 'GarageYrBlt', 'MasVnrArea']
    df.drop(columns=cols_to_be_removed, inplace=True)

    # Transforms ordinal columns to numerical
    ordinal_cols = ['FireplaceQu', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond',
                    'HeatingQC', 'KitchenQual', 'GarageQual', 'GarageCond']
    for col in ordinal_cols:
        df[col] = df[col].fillna(0)
        df[col] = df[col].replace({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})

    # Replaces the NaN with the 'NA' category
    for col in ['GarageType', 'GarageFinish', 'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1']:
        df[col] = df[col].fillna('NA')

    # Replaces the NaN with the 'None' category
    df['MasVnrType'] = df['MasVnrType'].fillna('None')

    # Imputes with the most frequent value
    df['Electrical'] = df['Electrical'].fillna('SBrkr')

    # Saves a copy; index=False keeps the row index out of the CSV
    df.to_csv(output_file, index=False)

    return df


if __name__ == "__main__":
    # Reads the file train.csv from the current directory
    train_file = 'train.csv'
    if os.path.exists(train_file):
        df = pd.read_csv(train_file)
        print(f'Original Data: {df.shape}')
        cleaned_df = clean_data(df)
        print(f'After Cleaning: {cleaned_df.shape}')
    else:
        print(f'File not found: {train_file}')
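As a quick illustration of the ordinal conversion step, here is a minimal sketch on a toy quality column (the mapping is the same one the script applies to columns like 'FireplaceQu'):

```python
import pandas as pd

# Toy quality column with a missing value, as found in e.g. 'FireplaceQu'
quality = pd.Series(['Gd', 'TA', None, 'Ex', 'Po'])

# Same ordinal mapping used by the cleaning script
quality = quality.fillna(0).replace({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5})

print(quality.tolist())  # [4, 3, 0, 5, 1]
```

Missing values become 0, placing "no fireplace" below the worst observed quality ('Po' = 1) on the same numeric scale.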