Data Science Project: Machine Learning Model

This is the fourth article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

The first three articles covered the Exploratory Data Analysis (EDA) and the cleaning of the dataset.

The output of the first three articles is the cleaned_dataset (you have to unzip the file to use the CSV) that we are going to use to generate the Machine Learning Model.

Training the Machine Learning Model

You can save the script in a file named train_model.py and execute it directly with python3 train_model.py or python train_model.py, depending on your installation.

It expects a file called ‘cleaned_data.csv’ (you can download it from the link above in ZIP format) in the same folder and will output three other files:

model.pkl: the model in binary format generated by pickle that we can reuse later
train.csv: the train data after the split of the original data into train and test
test.csv: the test data after the split of the original data into train and test

The output on the terminal will be similar to this:

Train data for modeling: (934, 74)
Test data for predictions: (234, 74)
Training the model ...
Testing the model ...
Average Price Test: 175652.0128205128
RMSE: 10552.188828855931
Model saved at model.pkl

It means the model used 934 data points to train and 234 data points to test.
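The 80/20 split behind those two numbers can be sketched on a toy DataFrame. This is only an illustration: the 1168 rows are inferred from 934 + 234, and the column values are placeholders rather than real house prices.

```python
import pandas as pd

# Toy frame with as many rows as the dataset (934 + 234 = 1168);
# the values are placeholders, not real prices.
data = pd.DataFrame({"SalePrice": range(1168)})

# Sample 80% of the rows, keeping their original index labels.
train = data.sample(frac=0.8, random_state=30)

# Every row that was not sampled goes to the test set.
test = data.drop(train.index)

# No index label should appear in both sets.
overlap = set(train.index) & set(test.index)

print(train.shape, test.shape)  # (934, 1) (234, 1)
print(len(overlap))             # 0
```

Checking that the overlap is empty is a cheap way to confirm that no row is used both to train and to test the model.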

The average Sale Price in the test set is 175k dollars.

The RMSE (root-mean-square error) is a good metric for understanding the output because it is expressed on the same scale as your dependent variable, which is Sale Price in this case.

An RMSE of 10552 means that our predictions typically missed the correct Sale Prices by a bit over 10k dollars (RMSE weights large errors more heavily than a plain average of the errors would).

Considering an average of 175k, missing the mark by about 10k, or roughly 6%, is not too bad.
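To make the metric concrete, here is a minimal sketch of how RMSE is computed and how it compares to the average price. The helper name rmse and the two toy prices are assumptions for illustration; the last two figures are the ones printed in the terminal output above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, in the same units as y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical prices: each prediction is off by exactly 10k dollars.
actual = [200_000, 150_000]
predicted = [210_000, 140_000]
print(rmse(actual, predicted))  # 10000.0

# Figures reported by the script: RMSE relative to the average test price.
relative_error = 10552.188828855931 / 175652.0128205128
print(f"{relative_error:.1%}")  # 6.0%
```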

The Training Script

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pickle

def create_train_test_data(dataset):
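    # split the data: sample 80% for training, keep the rest for testing
    data_train = dataset.sample(frac=0.8, random_state=30)
    # drop the sampled rows by their original index labels *before*
    # resetting the index, otherwise the train and test sets overlap
    data_test = dataset.drop(data_train.index).reset_index(drop=True)
    data_train = data_train.reset_index(drop=True)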

    # save the data
    data_train.to_csv('train.csv', index=False)
    data_test.to_csv('test.csv', index=False)

    print(f"Train data for modeling: {data_train.shape}")
    print(f"Test data for predictions: {data_test.shape}")

def train_model(x_train, y_train):

    print("Training the model ...")

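    # One-hot encode every column, then fit a linear regression.
    # Categories unseen during training are ignored at prediction time.
    model = Pipeline(steps=[
        ("one_hot_encoding", OneHotEncoder(handle_unknown='ignore')),
        ("linear_regression", LinearRegression())
    ])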
    model.fit(x_train, y_train)

    return model

def accuracy(model, x_test, y_test):
    # Despite the name, this returns the RMSE on the test set (lower is better)
    print("Testing the model ...")
    predictions = model.predict(x_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    return rmse

def export_model(model):
    # Save the model
    pkl_path = 'model.pkl'
    with open(pkl_path, 'wb') as file:
        pickle.dump(model, file)
        print(f"Model saved at {pkl_path}")

def main():
    # Load the whole data
    data = pd.read_csv('cleaned_data.csv', keep_default_na=False, index_col=0)

    # Split train/test
    # Creates train.csv and test.csv
    create_train_test_data(data)

    # Loads the data for the model training
    train = pd.read_csv('train.csv', keep_default_na=False)
    x_train = train.drop(columns=['SalePrice'])
    y_train = train['SalePrice']

    # Loads the data for the model testing
    test = pd.read_csv('test.csv', keep_default_na=False)
    x_test = test.drop(columns=['SalePrice'])
    y_test = test['SalePrice']

    # Train and Test
    model = train_model(x_train, y_train)
    rmse_test = accuracy(model, x_test, y_test)

    print(f"Average Price Test: {y_test.mean()}")
    print(f"RMSE: {rmse_test}")

    # Save the model
    export_model(model)

if __name__ == '__main__':
    main()
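
Reusing the saved model later could look like the sketch below. It is not part of the original script: it trains a tiny pipeline of the same shape on a toy column (the Neighborhood values and prices are made up), sends it through pickle using an in-memory buffer instead of model.pkl, and then predicts on a category the encoder never saw.

```python
import io
import pickle

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the housing data (hypothetical column and values).
train = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B"],
    "SalePrice": [100_000, 110_000, 200_000, 210_000],
})

pipeline = Pipeline(steps=[
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ("regressor", LinearRegression()),
])
pipeline.fit(train[["Neighborhood"]], train["SalePrice"])

# Round-trip through pickle, as the script does with model.pkl
# (an in-memory buffer replaces the file here).
buffer = io.BytesIO()
pickle.dump(pipeline, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

# "C" never appeared in training; handle_unknown="ignore" encodes it
# as all zeros instead of raising an error.
new_rows = pd.DataFrame({"Neighborhood": ["A", "C"]})
predictions = restored.predict(new_rows)

# The restored pipeline should behave exactly like the original one.
same_as_original = np.allclose(predictions, pipeline.predict(new_rows))
print(predictions.shape, same_as_original)
```

With the real file, the same idea applies: open model.pkl in 'rb' mode, call pickle.load, and pass a DataFrame with the same feature columns used in training (everything except SalePrice) to predict. Only unpickle files you trust, since pickle can execute arbitrary code on load.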