Data Science Project: House Prices DataSet – Data Cleaning Script

This is the third article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

The first two articles were the Exploratory Data Analysis (EDA) on the dataset:

This article converts the final decisions made to clean the data in the Jupyter Notebook into a single Python script that will take the data in CSV format and write the cleaned data also as a CSV.

Data Cleaning Script

You can save the script on a file ‘data_cleaning.py’ and execute it directly with python3 data_cleaning.py or python data_cleaning.py, depending on your installation.

You just need the pandas library installed, which comes by default on Anaconda.

The script expects the train file(unzip it to have the CSV file).

The output will be a file named ‘cleaned_data.csv’.

It will also print the shape of the original data and the shape of the new cleaned data.

Original Data: (1168, 81)
After Cleaning: (1168, 73)
import os
import pandas as pd

# writes the output on 'cleaned_data.csv' by default
def clean_data(df, output_file='cleaned_data.csv'):
    """Makes an initial clean in a dataframe.

    Args:
        df (pd.DataFrame): A dataframe to clean.

    Returns:
        pd.DataFrame: the cleaned dataframe.
    """

    # Removes columns with missing values issues
    cols_to_be_removed = ['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'LotFrontage',
    'GarageYrBlt', 'MasVnrArea']
    df.drop(columns=cols_to_be_removed, inplace=True)

    # Transforms ordinal columns to numerical
    ordinal_cols = ['FireplaceQu', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 
    'HeatingQC', 'KitchenQual', 'GarageQual', 'GarageCond']
    for col in ordinal_cols:
        df[col].fillna(0, inplace=True)
        df[col].replace({'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}, inplace=True)

    # Replace the NaN with NA
    for c in ['GarageType', 'GarageFinish', 'BsmtFinType2', 'BsmtExposure', 'BsmtFinType1']:
        df[c].fillna('NA', inplace=True)

    # Replace the NaN with None
    df['MasVnrType'].fillna('None', inplace=True)

    # Imputes with most frequent value
    df['Electrical'].fillna('SBrkr', inplace=True)

    # Saves a copy
    cleaned_data = os.path.join(output_file)
    df.to_csv(cleaned_data)

    return df

if __name__ == "__main__":
    # Reads the file train.csv
    train_file = os.path.join('train.csv')

    if os.path.exists(train_file):
        df = pd.read_csv(train_file)
        print(f'Original Data: {df.shape}')
        cleaned_df = clean_data(df)
        print(f'After Cleaning: {cleaned_df.shape}')
    else:
        print(f'File not found {train_file}')