<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Data Science</title>
	<atom:link href="https://renanmf.com/category/data-science/feed/" rel="self" type="application/rss+xml" />
	<link></link>
	<description>Software development, machine learning</description>
	<lastBuildDate>Tue, 27 Jul 2021 12:03:55 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://renanmf.com/wp-content/uploads/2020/03/cropped-android-chrome-512x512-2-32x32.png</url>
	<title>Data Science</title>
	<link></link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>State of Data Science 2021: Popularity of Python</title>
		<link>https://renanmf.com/state-of-data-science-2021-popularity-of-python/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Tue, 27 Jul 2021 12:03:55 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=3747</guid>

					<description><![CDATA[<p>Python continues to be an excellent choice if you are entering the data science field. Python still dominates and is the most popular language, particularly among younger generations. 88% of students surveyed are learning Python in preparation for a data science career. 63% of the respondents said they use it frequently or always. 71% of [&#8230;]</p>
<p>The content <a href="https://renanmf.com/state-of-data-science-2021-popularity-of-python/">State of Data Science 2021: Popularity of Python</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Python continues to be an excellent choice if you are entering the data science field.</p>
<p>Python still dominates and is the most popular language, particularly among younger generations.</p>
<p>88% of students surveyed are learning Python in preparation for a data science career.</p>
<p>63% of the respondents said they use it frequently or always.</p>
<p>71% of educators are teaching Python.</p>
<p><img decoding="async" src="https://renanmf.com/wp-content/uploads/2021/07/most_used_language_data.jpeg" alt="" /></p>
<p>It is also interesting to notice SQL ranking in 2nd place, right after Python.</p>
<p>Most structured data still lives in relational databases, so a good knowledge of both Python and SQL is a must to deal with data.</p>
<p>The good news is that both are very accessible and a good place to begin working with code.</p>
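<p>As a minimal sketch of how the two work together (a hypothetical example using Python&#8217;s built-in <code>sqlite3</code> module and pandas, not tied to any particular dataset), SQL can do the aggregation and hand the result to Python for analysis:</p>

```python
# Hypothetical example: SQL aggregates the data, pandas picks up the result.
import sqlite3

import pandas as pd

# an in-memory database stands in for a real relational database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("a", 10.0), ("b", 20.0), ("a", 5.0)],
)
conn.commit()

# SQL does the grouping; pandas receives a DataFrame ready for analysis
df = pd.read_sql_query(
    "SELECT product, SUM(amount) AS total FROM sales GROUP BY product",
    conn,
)
print(df)
```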
<h2>Comments about the other languages</h2>
<p>R is an alternative to Python, but I don&#8217;t see any advantage in learning it if you are already on the Python path, since R won&#8217;t bring anything to the table that Python doesn&#8217;t.</p>
<p>Then we have JavaScript and HTML/CSS, which makes sense: since your results won&#8217;t live in a Word document on your computer, a good way to display them is on the web, with nice interactivity.</p>
<p>Bash/Shell is super useful: the command line is one of the most powerful tools in a coder&#8217;s tool belt. Not only that, but many data engineering tools like Hadoop rely heavily on command-line interfaces, which can be easily automated with a shell script.</p>
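<p>You can also drive those command-line tools from Python itself. A toy sketch (assuming a Unix-like system where <code>wc</code> is available) of automating a CLI with the standard library:</p>

```python
# Toy example: driving a command-line tool (wc -l) from Python.
# In practice the same pattern automates data-tool CLIs such as hdfs or hive.
import subprocess

result = subprocess.run(
    ["wc", "-l"],          # count lines; a stand-in for any data-tool CLI
    input="a\nb\nc\n",     # three lines piped to stdin
    capture_output=True,
    text=True,
    check=True,
)
line_count = int(result.stdout.split()[0])
print(line_count)
```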
<p>If you are wondering why Java ranks so high on this list: Hadoop, Hive, HDFS, and similar tools are written in Java, and many data pipelines depend on JVM-powered tools like Kafka.</p>
<p>So while you may never touch Java as a Data Scientist, you will most likely have to deal with it as a Data Engineer at some point.</p>
<p>C/C++ ranks high due to the number of libraries coded in these languages for high performance.</p>
<p>Python&#8217;s most used Machine Learning frameworks and libraries, like Pandas, are implemented in C/C++, while Python just provides a nicer API to work with.</p>
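<p>A quick, informal way to see this in action (a sketch; the exact timings will vary by machine) is to compare NumPy&#8217;s C-backed sum with a pure-Python loop over the same data:</p>

```python
# Informal comparison: NumPy's C-implemented sum vs. a pure-Python loop.
import time

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

start = time.perf_counter()
fast = data.sum()  # executed in compiled C code
c_time = time.perf_counter() - start

start = time.perf_counter()
slow = sum(float(x) for x in data)  # interpreted Python loop
py_time = time.perf_counter() - start

# both give the same result; the C-backed version is much faster
print(fast == slow, c_time < py_time)
```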
<p>The other languages (C#, TypeScript, PHP, Rust, Julia and Go), although they have their place, of course, would not be the subject of further studies from my point of view at the moment.</p>
<p>They are used for more specific use cases or simply fall into &quot;that&#8217;s what my team and I know best&quot;.</p>
<p>The best contender here would be Julia as a replacement for Python, but it still has a long way to go before deserving the time and energy to learn it.</p>
<p>Go would be the high-level, performant alternative to Java, but it doesn&#8217;t yet have an ecosystem with as many tools behind it.</p>
<p>So, out of this list, the ones I think will pay you the most dividends for your investment in time and effort are Python, SQL, JavaScript, HTML/CSS, Bash/Shell, and Java.</p>
<p>These languages are more than enough to put you in any stage of a Data Science project or pipeline.</p>
<p>You can read the full report on <a href="https://www.anaconda.com/state-of-data-science-2021">State of Data Science 2021</a></p>
<p>The content <a href="https://renanmf.com/state-of-data-science-2021-popularity-of-python/">State of Data Science 2021: Popularity of Python</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science and Machine Learning Project: House Prices Dataset</title>
		<link>https://renanmf.com/data-science-machine-learning-house-prices/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Tue, 23 Feb 2021 13:26:03 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=3016</guid>

					<description><![CDATA[<p>This is a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques. You can download a PDF version of this Data Science and Machine Learning Project with the full source code repository linked in the book. In this series we begin with [&#8230;]</p>
<p>The content <a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>.</p>
<p><a href="https://renanmf.com/book-ds-ml-project-house-prices/">You can download a PDF version of this Data Science and Machine Learning Project with the full source code repository linked in the book.</a></p>
<p>In this series we begin with the EDA (Exploratory Data Analysis) of the data, we create a script to clean the data, then we use the cleaned data to create a Machine Learning Model, and finally we use the Machine Learning model to implement a prediction API:</p>
<ul>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></li>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></li>
<li><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></li>
<li><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></li>
<li><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></li>
</ul>
<p>You can download the complete code in the <a href="https://github.com/renanmouraf/data-science-house-prices">Github Repository</a> with clear instructions to execute this end-to-end project.</p>
<p><strong>&gt;&gt;&gt;You can also watch how to run this project on YouTube&lt;&lt;&lt;</strong></p>
<p><iframe width="560" height="315" src="https://www.youtube.com/embed/xEfCyb-0Wsk" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></p>
<p>The content <a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science Project: House Prices Dataset &#8211; API</title>
		<link>https://renanmf.com/data-science-project-house-prices-dataset-api/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Tue, 16 Feb 2021 20:41:21 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[FastAPI]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[fastapi]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=3014</guid>

					<description><![CDATA[<p>This is the 5th and final article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques. The first four articles were the Exploratory Data Analysis (EDA), Cleaning of the dataset, and the Machine Learning model: Exploratory Data Analysis – House [&#8230;]</p>
<p>The content <a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is the 5th and final article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>.</p>
<p>The first four articles covered the Exploratory Data Analysis (EDA), the cleaning of the dataset, and the Machine Learning model:</p>
<ul>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></li>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></li>
<li><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></li>
<li><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></li>
<li><strong><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></strong></li>
<li><a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a></li>
</ul>
<p>The output of the fourth article is the <a href="https://renanmf.com/wp-content/uploads/2021/02/model.zip" title="model">Machine Learning Model</a> (you have to unzip the file) that we are going to use in the API.</p>
<h2>Class HousePriceModel</h2>
<p>Save this script in a file named <code>predict.py</code>.</p>
<p>This file contains the class <code>HousePriceModel</code>, which is used to load the Machine Learning model and make the predictions.</p>
<pre><code class="language-python"># the pickle lib is used to load the machine learning model
import pickle
import pandas as pd

class HousePriceModel():

    def __init__(self):
        self.model = self.load_model()
        self.preds = None

    def load_model(self):
        # uses the file model.pkl
        pkl_filename = &#039;model.pkl&#039;

        try:
            with open(pkl_filename, &#039;rb&#039;) as file:
                pickle_model = pickle.load(file)
        except (OSError, pickle.UnpicklingError):
            print(f&#039;Error loading the model at {pkl_filename}&#039;)
            return None

        return pickle_model

    def predict(self, data):

        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(data, index=[0])

        # makes the predictions using the loaded model
        self.preds = self.model.predict(data)
        return self.preds</code></pre>
<h2>The API with FastAPI</h2>
<p>To run the API:</p>
<pre><code>uvicorn api:app</code></pre>
<p>Expected output:</p>
<pre><code>INFO:     Started server process [56652]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)</code></pre>
<p>The API was created with the framework <a href="https://fastapi.tiangolo.com/">FastAPI</a>.</p>
<p>The &quot;/predict&quot; endpoint will give you a prediction based on a sample.</p>
<pre><code class="language-python">from fastapi import FastAPI
from datetime import datetime
from predict import HousePriceModel

app = FastAPI()

@app.get(&quot;/&quot;)
def root():
    return {&quot;status&quot;: &quot;online&quot;}

@app.post(&quot;/predict&quot;)
def predict(inputs: dict):

    # note: the model is loaded on every request to keep the example simple
    model = HousePriceModel()

    start = datetime.today()
    pred = model.predict(inputs)[0]
    # measures the prediction latency; not returned to the client
    dur = (datetime.today() - start).total_seconds()

    return pred</code></pre>
<h2>Testing the API</h2>
<p>You can save the script in a file <code>test_api.py</code> and execute it directly with <code>python3 test_api.py</code> or <code>python test_api.py</code>, depending on your installation.</p>
<p>Remember to execute this test in a second terminal while the first one runs the API server.</p>
<p>Expected output:</p>
<pre><code>The actual Sale Price: 109000
The predicted Sale Price: 109000.01144237864</code></pre>
<p>The code to test the API:</p>
<pre><code class="language-python"># import requests library to make API calls
import requests
from predict import HousePriceModel

# a sample input with all the features we 
# used to train the model
sample_input = {&#039;MSSubClass&#039;: 20, &#039;MSZoning&#039;: &#039;RL&#039;, 
&#039;LotArea&#039;: 7922, &#039;Street&#039;: &#039;Pave&#039;, 
&#039;LotShape&#039;: &#039;Reg&#039;, &#039;LandContour&#039;: &#039;Lvl&#039;, 
&#039;Utilities&#039;: &#039;AllPub&#039;, &#039;LotConfig&#039;: &#039;Inside&#039;, 
&#039;LandSlope&#039;: &#039;Gtl&#039;, &#039;Neighborhood&#039;: &#039;NAmes&#039;, 
&#039;Condition1&#039;: &#039;Norm&#039;, &#039;Condition2&#039;: &#039;Norm&#039;, 
&#039;BldgType&#039;: &#039;1Fam&#039;, &#039;HouseStyle&#039;: &#039;1Story&#039;, 
&#039;OverallQual&#039;: 5, &#039;OverallCond&#039;: 7, 
&#039;YearBuilt&#039;: 1953, &#039;YearRemodAdd&#039;: 2007, 
&#039;RoofStyle&#039;: &#039;Gable&#039;, &#039;RoofMatl&#039;: &#039;CompShg&#039;, 
&#039;Exterior1st&#039;: &#039;VinylSd&#039;, &#039;Exterior2nd&#039;: &#039;VinylSd&#039;, 
&#039;MasVnrType&#039;: &#039;None&#039;, &#039;ExterQual&#039;: 3,
&#039;ExterCond&#039;: 4, &#039;Foundation&#039;: &#039;CBlock&#039;, 
&#039;BsmtQual&#039;: 3, &#039;BsmtCond&#039;: 3, 
&#039;BsmtExposure&#039;: &#039;No&#039;, &#039;BsmtFinType1&#039;: &#039;GLQ&#039;, 
&#039;BsmtFinSF1&#039;: 731, &#039;BsmtFinType2&#039;: &#039;Unf&#039;, 
&#039;BsmtFinSF2&#039;: 0, &#039;BsmtUnfSF&#039;: 326, 
&#039;TotalBsmtSF&#039;: 1057, &#039;Heating&#039;: &#039;GasA&#039;, 
&#039;HeatingQC&#039;: 3, &#039;CentralAir&#039;: &#039;Y&#039;, 
&#039;Electrical&#039;: &#039;SBrkr&#039;, &#039;1stFlrSF&#039;: 1057, 
&#039;2ndFlrSF&#039;: 0, &#039;LowQualFinSF&#039;: 0, 
&#039;GrLivArea&#039;: 1057, &#039;BsmtFullBath&#039;: 1, 
&#039;BsmtHalfBath&#039;: 0, &#039;FullBath&#039;: 1, 
&#039;HalfBath&#039;: 0, &#039;BedroomAbvGr&#039;: 3, 
&#039;KitchenAbvGr&#039;: 1, &#039;KitchenQual&#039;: 4, 
&#039;TotRmsAbvGrd&#039;: 5, &#039;Functional&#039;: &#039;Typ&#039;, 
&#039;Fireplaces&#039;: 0, &#039;FireplaceQu&#039;: 0, 
&#039;GarageType&#039;: &#039;Detchd&#039;, &#039;GarageFinish&#039;: &#039;Unf&#039;,
&#039;GarageCars&#039;: 1, &#039;GarageArea&#039;: 246, 
&#039;GarageQual&#039;: 3, &#039;GarageCond&#039;: 3, 
&#039;PavedDrive&#039;: &#039;Y&#039;, &#039;WoodDeckSF&#039;: 0, 
&#039;OpenPorchSF&#039;: 52, &#039;EnclosedPorch&#039;: 0, 
&#039;3SsnPorch&#039;: 0, &#039;ScreenPorch&#039;: 0, 
&#039;PoolArea&#039;: 0, &#039;MiscVal&#039;: 0, &#039;MoSold&#039;: 1,
&#039;YrSold&#039;: 2010, &#039;SaleType&#039;: &#039;WD&#039;, 
&#039;SaleCondition&#039;: &#039;Abnorml&#039;}

def run_prediction_from_sample():

    url = &quot;http://127.0.0.1:8000/predict&quot;
    headers = {&quot;Content-Type&quot;: &quot;application/json&quot;,
               &quot;Accept&quot;: &quot;text/plain&quot;}

    response = requests.post(url, headers=headers,
                             json=sample_input)
    print(&quot;The actual Sale Price: 109000&quot;)
    print(f&quot;The predicted Sale Price: {response.text}&quot;)

if __name__ == &quot;__main__&quot;:
    run_prediction_from_sample()</code></pre>
<p>The content <a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science Project: Machine Learning Model &#8211; House Prices Dataset</title>
		<link>https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Wed, 10 Feb 2021 14:34:50 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=2992</guid>

					<description><![CDATA[<p>This is the fourth article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques. The first three articles were the Exploratory Data Analysis (EDA) and cleaning of the dataset: Exploratory Data Analysis – House Prices – Part 1 Exploratory Data [&#8230;]</p>
<p>The content <a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model &#8211; House Prices Dataset</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is the fourth article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>.</p>
<p>The first three articles covered the Exploratory Data Analysis (EDA) and the cleaning of the dataset:</p>
<ul>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></li>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></li>
<li><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></li>
<li><strong><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></strong></li>
<li><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></li>
<li><a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a></li>
</ul>
<p>The output of the first three articles is the <a href="https://renanmf.com/wp-content/uploads/2021/02/cleaned_data.zip">cleaned_dataset</a> (you have to unzip the file to use the CSV) that we are going to use to generate the Machine Learning Model.</p>
<h2>Training the Machine Learning Model</h2>
<p>You can save the script in a file <code>train_model.py</code> and execute it directly with <code>python3 train_model.py</code> or <code>python train_model.py</code>, depending on your installation.</p>
<p>It expects a file called &#8216;cleaned_data.csv&#8217; (you can download it at the link above in ZIP format) in the same folder, and will output three other files:</p>
<ul>
<li>model.pkl: the model in binary format generated by pickle that we can reuse later</li>
<li>train.csv: the <strong>train</strong> data after the split of the original data into train and test</li>
<li>test.csv: the <strong>test</strong> data after the split of the original data into train and test</li>
</ul>
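<p>The <code>model.pkl</code> round trip is just standard pickle serialization. A minimal sketch of the save/load cycle (with a plain dict and a hypothetical <code>model_demo.pkl</code> file standing in for the fitted sklearn Pipeline and <code>model.pkl</code>):</p>

```python
# Sketch of the save/load cycle behind model.pkl.
import pickle

model = {"weights": [1.0, 2.0]}  # stand-in for a fitted sklearn Pipeline

# save the model in binary format (what the training script does)
with open("model_demo.pkl", "wb") as f:
    pickle.dump(model, f)

# reload it later for new predictions (what the API does)
with open("model_demo.pkl", "rb") as f:
    reloaded = pickle.load(f)

print(reloaded == model)
```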
<p>The output on the terminal will be similar to this:</p>
<pre><code>Train data for modeling: (934, 74)
Test data for predictions: (234, 74)
Training the model ...
Testing the model ...
Average Price Test: 175652.0128205128
RMSE: 10552.188828855931
Model saved at model.pkl</code></pre>
<p>This means the model used 934 data points for training and 234 data points for testing.</p>
<p>The average Sale Price in the test set is 175k dollars.</p>
<p>The RMSE (root-mean-square error) is a good metric to understand the output because you can read it on the same scale as your dependent variable, which in this case is Sale Price.</p>
<p>An RMSE of 10552 means that, on average, we missed the correct Sale Prices by a bit over 10k dollars.</p>
<p>Considering an average of 175k, missing the mark by 10k on average is not too bad.</p>
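<p>To make the metric concrete, here is a tiny worked example (with made-up prices, not the dataset&#8217;s) showing that the RMSE comes out in the same dollar units as the target:</p>

```python
# Tiny illustration of RMSE with made-up Sale Prices.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100_000.0, 200_000.0])  # actual prices
y_pred = np.array([110_000.0, 190_000.0])  # predictions, each off by 10k

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)  # 10000.0 -- an average miss of 10k dollars
```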
<h2>The Training Script</h2>
<pre><code class="language-python">import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pickle

def create_train_test_data(dataset):
    # load and split the data
    data_train = dataset.sample(frac=0.8, random_state=30)
    # drop the sampled rows *before* resetting the index,
    # so that train and test do not overlap
    data_test = dataset.drop(data_train.index).reset_index(drop=True)
    data_train = data_train.reset_index(drop=True)

    # save the data
    data_train.to_csv(&#039;train.csv&#039;, index=False)
    data_test.to_csv(&#039;test.csv&#039;, index=False)

    print(f&quot;Train data for modeling: {data_train.shape}&quot;)
    print(f&quot;Test data for predictions: {data_test.shape}&quot;)

def train_model(x_train, y_train):

    print(&quot;Training the model ...&quot;)

    model = Pipeline(steps=[
        (&quot;one-hot encoding&quot;, OneHotEncoder(handle_unknown=&#039;ignore&#039;)),
        (&quot;linear model&quot;, LinearRegression())
    ])
    model.fit(x_train, y_train)

    return model

def accuracy(model, x_test, y_test):
    print(&quot;Testing the model ...&quot;)
    predictions = model.predict(x_test)
    tree_mse = mean_squared_error(y_test, predictions)
    tree_rmse = np.sqrt(tree_mse)
    return tree_rmse

def export_model(model):
    # Save the model
    pkl_path = &#039;model.pkl&#039;
    with open(pkl_path, &#039;wb&#039;) as file:
        pickle.dump(model, file)
        print(f&quot;Model saved at {pkl_path}&quot;)

def main():
    # Load the whole data
    data = pd.read_csv(&#039;cleaned_data.csv&#039;, keep_default_na=False, index_col=0)

    # Split train/test
    # Creates train.csv and test.csv
    create_train_test_data(data)

    # Loads the data for the model training
    train = pd.read_csv(&#039;train.csv&#039;, keep_default_na=False)
    x_train = train.drop(columns=[&#039;SalePrice&#039;])
    y_train = train[&#039;SalePrice&#039;]

    # Loads the data for the model testing
    test = pd.read_csv(&#039;test.csv&#039;, keep_default_na=False)
    x_test = test.drop(columns=[&#039;SalePrice&#039;])
    y_test = test[&#039;SalePrice&#039;]

    # Train and Test
    model = train_model(x_train, y_train)
    rmse_test = accuracy(model, x_test, y_test)

    print(f&quot;Average Price Test: {y_test.mean()}&quot;)
    print(f&quot;RMSE: {rmse_test}&quot;)

    # Save the model
    export_model(model)

if __name__ == &#039;__main__&#039;:
    main()</code></pre>
<p>The content <a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model &#8211; House Prices Dataset</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science Project: House Prices DataSet &#8211; Data Cleaning Script</title>
		<link>https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Wed, 03 Feb 2021 11:49:45 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=2962</guid>

					<description><![CDATA[<p>This is the third article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques. The first two articles were the Exploratory Data Analysis (EDA) on the dataset: Exploratory Data Analysis – House Prices – Part 1 Exploratory Data Analysis – [&#8230;]</p>
<p>The content <a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: House Prices DataSet &#8211; Data Cleaning Script</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is the third article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>.</p>
<p>The first two articles were the Exploratory Data Analysis (EDA) on the dataset:</p>
<ul>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></li>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></li>
<li><strong><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></strong></li>
<li><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></li>
<li><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></li>
<li><a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a></li>
</ul>
<p>This article converts the final decisions made to clean the data in the <a href="https://renanmf.com/wp-content/uploads/2021/01/Exploratory-Data-Analysis-House-Prices-Part1AND2.zip">Jupyter Notebook</a> into a single Python script that will take the data in CSV format and write the cleaned data also as a CSV.</p>
<h2>Data Cleaning Script</h2>
<p>You can save the script in a file <code>data_cleaning.py</code> and execute it directly with <code>python3 data_cleaning.py</code> or <code>python data_cleaning.py</code>, depending on your installation.</p>
<p>You just need the <a href="https://pypi.org/project/pandas/">pandas</a> library installed, which comes by default on <a href="https://www.anaconda.com/">Anaconda</a>.</p>
<p>The script expects the <a href="https://renanmf.com/wp-content/uploads/2021/01/train.zip">train file</a> (unzip it to get the CSV file).</p>
<p>The output will be a file named &#8216;cleaned_data.csv&#8217;.</p>
<p>It will also print the shape of the original data and the shape of the new cleaned data.</p>
<pre><code>Original Data: (1168, 81)
After Cleaning: (1168, 73)</code></pre>
<pre><code class="language-python">import os
import pandas as pd

# writes the output on &#039;cleaned_data.csv&#039; by default
def clean_data(df, output_file=&#039;cleaned_data.csv&#039;):
    &quot;&quot;&quot;Makes an initial clean in a dataframe.

    Args:
        df (pd.DataFrame): A dataframe to clean.
        output_file (str): Path where the cleaned CSV is saved.

    Returns:
        pd.DataFrame: the cleaned dataframe.
    &quot;&quot;&quot;

    # Removes columns with missing values issues
    cols_to_be_removed = [&#039;Id&#039;, &#039;PoolQC&#039;, &#039;MiscFeature&#039;, &#039;Alley&#039;, &#039;Fence&#039;, &#039;LotFrontage&#039;,
    &#039;GarageYrBlt&#039;, &#039;MasVnrArea&#039;]
    df.drop(columns=cols_to_be_removed, inplace=True)

    # Transforms ordinal columns to numerical
    ordinal_cols = [&#039;FireplaceQu&#039;, &#039;ExterQual&#039;, &#039;ExterCond&#039;, &#039;BsmtQual&#039;, &#039;BsmtCond&#039;, 
    &#039;HeatingQC&#039;, &#039;KitchenQual&#039;, &#039;GarageQual&#039;, &#039;GarageCond&#039;]
    for col in ordinal_cols:
        df[col].fillna(0, inplace=True)
        df[col].replace({&#039;Po&#039;: 1, &#039;Fa&#039;: 2, &#039;TA&#039;: 3, &#039;Gd&#039;: 4, &#039;Ex&#039;: 5}, inplace=True)

    # Replace the NaN with NA
    for c in [&#039;GarageType&#039;, &#039;GarageFinish&#039;, &#039;BsmtFinType2&#039;, &#039;BsmtExposure&#039;, &#039;BsmtFinType1&#039;]:
        df[c].fillna(&#039;NA&#039;, inplace=True)

    # Replace the NaN with None
    df[&#039;MasVnrType&#039;].fillna(&#039;None&#039;, inplace=True)

    # Imputes with most frequent value
    df[&#039;Electrical&#039;].fillna(&#039;SBrkr&#039;, inplace=True)

    # Saves a copy of the cleaned data
    df.to_csv(output_file)

    return df

if __name__ == &quot;__main__&quot;:
    # Reads the file train.csv
    train_file = &#039;train.csv&#039;

    if os.path.exists(train_file):
        df = pd.read_csv(train_file)
        print(f&#039;Original Data: {df.shape}&#039;)
        cleaned_df = clean_data(df)
        print(f&#039;After Cleaning: {cleaned_df.shape}&#039;)
    else:
        print(f&#039;File not found {train_file}&#039;)</code></pre>
<p>The content <a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: House Prices DataSet &#8211; Data Cleaning Script</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Exploratory Data Analysis – House Prices – Part 2</title>
		<link>https://renanmf.com/exploratory-data-analysis-house-prices-part-two/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Wed, 27 Jan 2021 22:21:06 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=2934</guid>

					<description><![CDATA[<p>This is part of a series: Exploratory Data Analysis – House Prices – Part 1 Exploratory Data Analysis – House Prices – Part 2 Data Science Project: Data Cleaning Script – House Prices DataSet Data Science Project: Machine Learning Model – House Prices Dataset Data Science Project: House Prices Dataset &#8211; API Data Science and [&#8230;]</p>
<p>The content <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is part of a series:</p>
<ul>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></li>
<li><strong><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></strong></li>
<li><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></li>
<li><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></li>
<li><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></li>
<li><a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a></li>
</ul>
<hr />
<p>In this article, we will finish the Exploratory Data Analysis (EDA) and the data cleaning of the dataset <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>.</p>
<p>In <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Part 1</a> we:</p>
<ul>
<li>Understood the problem</li>
<li>Explored the data and dealt with missing values</li>
</ul>
<p>In this post we will:</p>
<ul>
<li>Prepare the data</li>
<li>Select and transform variables, especially categorical ones</li>
</ul>
<p>You can download the complete <a href="https://renanmf.com/wp-content/uploads/2021/01/Exploratory-Data-Analysis-House-Prices-Part1AND2.zip">Jupyter Notebook</a> covering parts 1 and 2 of the EDA, but the notebook contains only code and doesn&#8217;t have the explanations.</p>
<p>The following steps are a direct continuation of the ones in <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Part 1</a>.</p>
<h2>Categorical variables</h2>
<p>Let&#8217;s work on the categorical variables of our dataset.</p>
<h3>Dealing with missing values</h3>
<p>First, we fill the categorical NaN values that we know how to fill thanks to the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=data_description.txt">description file</a>.</p>
<pre><code class="language-python"># Fills NA in place of NaN
for c in [&#039;GarageType&#039;, &#039;GarageFinish&#039;, &#039;BsmtFinType2&#039;, &#039;BsmtExposure&#039;, &#039;BsmtFinType1&#039;]:
    train[c].fillna(&#039;NA&#039;, inplace=True)

# Fills None in place of NaN
train[&#039;MasVnrType&#039;].fillna(&#039;None&#039;, inplace=True)</code></pre>
<p>With this, we have only 5 columns with missing values left in our dataset.</p>
<pre><code class="language-python">columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f&#039;Columns with missing values: {len(columns_with_miss)}&#039;)
columns_with_miss.sort_values(ascending=False)</code></pre>
<pre><code>Columns with missing values: 5

GarageCond    69
GarageQual    69
BsmtCond      30
BsmtQual      30
Electrical     1
dtype: int64</code></pre>
<h3>Ordinal</h3>
<p>Also by reading the <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=data_description.txt">description file</a>, we can identify other variables that use a system similar to FireplaceQu to categorize quality: Poor, Good, Excellent, etc.</p>
<p>We are going to replicate the treatment we gave to FireplaceQu to these variables according to the following descriptions:</p>
<p>ExterQual: Evaluates the quality of the material on the exterior</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
</ul>
<p>ExterCond: Evaluates the present condition of the material on the exterior</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
</ul>
<p>BsmtQual: Evaluates the height of the basement</p>
<ul>
<li>Ex Excellent (100+ inches)</li>
<li>Gd Good (90-99 inches)</li>
<li>TA Typical (80-89 inches)</li>
<li>Fa Fair (70-79 inches)</li>
<li>Po Poor ( &lt; 70 inches)</li>
<li>NA No Basement</li>
</ul>
<p>BsmtCond: Evaluates the general condition of the basement</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Typical &#8211; slight dampness allowed</li>
<li>Fa Fair &#8211; dampness or some cracking or settling</li>
<li>Po Poor &#8211; Severe cracking, settling, or wetness</li>
<li>NA No Basement</li>
</ul>
<p>HeatingQC: Heating quality and condition</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
</ul>
<p>KitchenQual: Kitchen quality</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
</ul>
<p>GarageQual: Garage quality</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
<li>NA No Garage</li>
</ul>
<p>GarageCond: Garage condition</p>
<ul>
<li>Ex Excellent</li>
<li>Gd Good</li>
<li>TA Average/Typical</li>
<li>Fa Fair</li>
<li>Po Poor</li>
<li>NA No Garage</li>
</ul>
<pre><code class="language-python">ord_cols = [&#039;ExterQual&#039;, &#039;ExterCond&#039;, &#039;BsmtQual&#039;, &#039;BsmtCond&#039;, &#039;HeatingQC&#039;, &#039;KitchenQual&#039;, &#039;GarageQual&#039;, &#039;GarageCond&#039;]
for col in ord_cols:
    train[col].fillna(0, inplace=True)
    train[col].replace({&#039;Po&#039;: 1, &#039;Fa&#039;: 2, &#039;TA&#039;: 3, &#039;Gd&#039;: 4, &#039;Ex&#039;: 5}, inplace=True)</code></pre>
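<p>As a side note, the same Po-to-Ex mapping can also be produced with an ordered pandas <code>Categorical</code>. This is just a sketch of an equivalent approach, shown on a small hypothetical series rather than the real data:</p>

```python
import pandas as pd

# The quality scale from the description file, from worst to best
quality_scale = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
s = pd.Series(['Gd', 'TA', 'Ex', 'Po'])

# .codes gives 0-based positions in the declared order; NaN becomes -1,
# so adding 1 also maps missing values to 0, matching the loop above
codes = pd.Categorical(s, categories=quality_scale, ordered=True).codes + 1
print(list(codes))
```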
<p>Let&#8217;s now plot the relationship between these variables and SalePrice.</p>
<pre><code class="language-python">ord_cols = [&#039;ExterQual&#039;, &#039;ExterCond&#039;, &#039;BsmtQual&#039;, &#039;BsmtCond&#039;, &#039;HeatingQC&#039;, &#039;KitchenQual&#039;, &#039;GarageQual&#039;, &#039;GarageCond&#039;]
f, axes = plt.subplots(2, 4, figsize=(15, 10), sharey=True)

for r in range(0, 2):
    for c in range(0, 4):
        sns.barplot(x=ord_cols.pop(), y=&quot;SalePrice&quot;, data=train, ax=axes[r][c])

plt.tight_layout()
plt.show()</code></pre>
<p><img decoding="async" src="https://renanmf.com/wp-content/uploads/2021/01/output_20_0-1024x679.png" alt="correlation saleprice one" /></p>
<p>As you can see, the better the category of a variable, the higher the price, which means these variables will be important for a prediction model.</p>
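<p>The visual trend can also be summarized as a single number per column with <code>DataFrame.corr()</code>. This is only a sketch on a tiny made-up frame standing in for <code>train</code> after the encoding above; the values are hypothetical:</p>

```python
import pandas as pd

# Made-up miniature of the encoded data (Po=1 ... Ex=5)
df = pd.DataFrame({
    'ExterQual':   [3, 4, 5, 2, 3, 4],
    'KitchenQual': [3, 3, 5, 2, 4, 4],
    'SalePrice':   [130000, 180000, 320000, 90000, 150000, 210000],
})

# Pearson correlation of each ordinal column with the target
corr = df.corr()['SalePrice'].drop('SalePrice').sort_values(ascending=False)
print(corr)
```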
<h3>Nominal</h3>
<p>Other categorical variables don&#8217;t seem to follow any clear ordering.</p>
<p>Let&#8217;s see how many values these columns can assume:</p>
<pre><code class="language-python">cols = train.columns
num_cols = train._get_numeric_data().columns
nom_cols = list(set(cols) - set(num_cols))
print(f&#039;Nominal columns: {len(nom_cols)}&#039;)

value_counts = {}
for c in nom_cols:
    value_counts[c] = len(train[c].value_counts())

sorted_value_counts = {k: v for k, v in sorted(value_counts.items(), key=lambda item: item[1])}
sorted_value_counts</code></pre>
<pre><code>Nominal columns: 31

{'CentralAir': 2,
 'Street': 2,
 'Utilities': 2,
 'LandSlope': 3,
 'PavedDrive': 3,
 'MasVnrType': 4,
 'GarageFinish': 4,
 'LotShape': 4,
 'LandContour': 4,
 'BsmtCond': 5,
 'MSZoning': 5,
 'Electrical': 5,
 'Heating': 5,
 'BldgType': 5,
 'BsmtExposure': 5,
 'LotConfig': 5,
 'Foundation': 6,
 'RoofStyle': 6,
 'SaleCondition': 6,
 'BsmtFinType2': 7,
 'Functional': 7,
 'GarageType': 7,
 'BsmtFinType1': 7,
 'RoofMatl': 7,
 'HouseStyle': 8,
 'Condition2': 8,
 'SaleType': 9,
 'Condition1': 9,
 'Exterior1st': 15,
 'Exterior2nd': 16,
 'Neighborhood': 25}</code></pre>
<p>Some categorical variables, like Neighborhood, can assume many different values.</p>
<p>To simplify, let&#8217;s analyze only variables with 6 different values or fewer.</p>
<pre><code class="language-python">nom_cols_less_than_6 = []
for c in nom_cols:
    n_values = len(train[c].value_counts())
    if n_values &lt; 7:
        nom_cols_less_than_6.append(c)

print(f&#039;Nominal columns with 6 or fewer values: {len(nom_cols_less_than_6)}&#039;)</code></pre>
<pre><code>Nominal columns with 6 or fewer values: 19</code></pre>
<p>Plotting against SalePrice to have a better idea of how they affect it:</p>
<pre><code class="language-python">ncols = 3
nrows = math.ceil(len(nom_cols_less_than_6) / ncols)
f, axes = plt.subplots(nrows, ncols, figsize=(15, 30))

for r in range(0, nrows):
    for c in range(0, ncols):
        if not nom_cols_less_than_6:
            continue
        sns.barplot(x=nom_cols_less_than_6.pop(), y=&quot;SalePrice&quot;, data=train, ax=axes[r][c])

plt.tight_layout()
plt.show()</code></pre>
<p><img decoding="async" src="https://renanmf.com/wp-content/uploads/2021/01/output_27_0.png" alt="correlation saleprice two" /></p>
<p>We can see a good correlation of many of these columns with the target variable.</p>
<p>For now, let&#8217;s keep them.</p>
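<p>The number behind each bar in these plots is just the mean SalePrice per category, which you can also get directly with a groupby. A sketch on hypothetical values:</p>

```python
import pandas as pd

# Hypothetical stand-in for two of the columns above
df = pd.DataFrame({
    'CentralAir': ['Y', 'Y', 'N', 'Y', 'N'],
    'SalePrice':  [200000, 180000, 90000, 220000, 110000],
})

# Mean sale price per category, i.e. the height of each bar
means = df.groupby('CentralAir')['SalePrice'].mean()
print(means)
```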
<p>We still have NaN in &#8216;Electrical&#8217;.</p>
<p>As we could see in the plot above, &#8216;SBrkr&#8217; is the most frequent value in &#8216;Electrical&#8217;.</p>
<p>Let&#8217;s use this value to replace NaN in Electrical.</p>
<pre><code class="language-python"># Imputes the most frequent value in place of NaN

train[&#039;Electrical&#039;].fillna(&#039;SBrkr&#039;, inplace=True)</code></pre>
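<p>Instead of hard-coding &#8216;SBrkr&#8217;, the most frequent value can also be looked up with <code>Series.mode()</code>. A minimal sketch, on a tiny hypothetical series rather than the real column:</p>

```python
import pandas as pd

# Tiny stand-in for the 'Electrical' column
s = pd.Series(['SBrkr', 'SBrkr', 'FuseA', None, 'SBrkr'])

# mode() ignores NaN and returns the most frequent value(s); take the first
most_frequent = s.mode()[0]
s = s.fillna(most_frequent)
print(most_frequent)
```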
<h3>Zero values</h3>
<p>Another quick check is to see how many columns have many values equal to 0.</p>
<pre><code class="language-python">train.isin([0]).sum().sort_values(ascending=False).head(25)</code></pre>
<pre><code>PoolArea         1164
LowQualFinSF     1148
3SsnPorch        1148
MiscVal          1131
BsmtHalfBath     1097
ScreenPorch      1079
BsmtFinSF2       1033
EnclosedPorch    1007
HalfBath          727
BsmtFullBath      686
2ndFlrSF          655
WoodDeckSF        610
Fireplaces        551
FireplaceQu       551
OpenPorchSF       534
BsmtFinSF1        382
BsmtUnfSF          98
GarageCars         69
GarageArea         69
GarageCond         69
GarageQual         69
TotalBsmtSF        30
BsmtCond           30
BsmtQual           30
FullBath            8
dtype: int64</code></pre>
<p>In this case, even though there are many 0&#8217;s, they have meaning.</p>
<p>For instance, PoolArea (Pool area in square feet) equals 0 means that the house doesn&#8217;t have any pool area.</p>
<p>This is meaningful information about each house, so we are going to keep these columns.</p>
<h2>Outliers</h2>
<p>We can also take a look at the outliers in the numeric variables.</p>
<pre><code class="language-python"># Get only numerical columns
numerical_columns = list(train.dtypes[train.dtypes == &#039;int64&#039;].index)
len(numerical_columns)</code></pre>
<pre><code>42</code></pre>
<pre><code class="language-python"># Create the plot grid
rows = 7
columns = 6

fig, axes = plt.subplots(rows,columns, figsize=(30,30))

for i, column in enumerate(numerical_columns):
    # Walk the grid row by row: r = i // columns, c = i % columns
    r, c = divmod(i, columns)
    sns.boxplot(x=train[column], ax=axes[r, c])</code></pre>
<p><img decoding="async" src="https://renanmf.com/wp-content/uploads/2021/01/output_36_0-1024x1017.png" alt="outliers" /></p>
<p>There are a lot of outliers in the dataset. </p>
<p>But if we check the data <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=data_description.txt">description file</a>, we see that some numerical variables are actually categorical variables that were stored (encoded) as numbers.</p>
<p>So some of these apparent outliers are actually categorical values with only a single example in a given category.</p>
<p>Let&#8217;s keep these outliers.</p>
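<p>For reference, if we did want to flag numeric outliers, a common sketch is the 1.5&#215;IQR rule, shown here on a hypothetical column with one extreme value:</p>

```python
import pandas as pd

# Hypothetical lot areas with one extreme value
s = pd.Series([8414, 12256, 8960, 9500, 10200, 215245])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Values beyond 1.5 * IQR from the quartiles are flagged
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())
```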
<h2>Saving cleaned data</h2>
<p>Let&#8217;s see what the cleaned data looks like and how many columns we have left.</p>
<p>We have no more missing values:</p>
<pre><code class="language-python">columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f&#039;Columns with missing values: {len(columns_with_miss)}&#039;)
columns_with_miss.sort_values(ascending=False)</code></pre>
<pre><code>Columns with missing values: 0

Series([], dtype: int64)</code></pre>
<p>After cleaning the data, we are left with 73 columns out of the initial 81.</p>
<pre><code class="language-python">train.shape</code></pre>
<pre><code>(1168, 73)</code></pre>
<p>Let&#8217;s take a look at the first 3 records of the cleaned data.</p>
<pre><code class="language-python">train.head(3).T</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
<th>2</th>
</tr>
</thead>
<tbody>
<tr>
<th>MSSubClass</th>
<td>20</td>
<td>60</td>
<td>30</td>
</tr>
<tr>
<th>MSZoning</th>
<td>RL</td>
<td>RL</td>
<td>RM</td>
</tr>
<tr>
<th>LotArea</th>
<td>8414</td>
<td>12256</td>
<td>8960</td>
</tr>
<tr>
<th>Street</th>
<td>Pave</td>
<td>Pave</td>
<td>Pave</td>
</tr>
<tr>
<th>LotShape</th>
<td>Reg</td>
<td>IR1</td>
<td>Reg</td>
</tr>
<tr>
<th>&#8230;</th>
<td>&#8230;</td>
<td>&#8230;</td>
<td>&#8230;</td>
</tr>
<tr>
<th>MoSold</th>
<td>2</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<th>YrSold</th>
<td>2006</td>
<td>2010</td>
<td>2010</td>
</tr>
<tr>
<th>SaleType</th>
<td>WD</td>
<td>WD</td>
<td>WD</td>
</tr>
<tr>
<th>SaleCondition</th>
<td>Normal</td>
<td>Normal</td>
<td>Normal</td>
</tr>
<tr>
<th>SalePrice</th>
<td>154500</td>
<td>325000</td>
<td>115000</td>
</tr>
</tbody>
</table>
<p>73 rows × 3 columns</p>
</div>
<p>We can see a summary of the data showing that, for all the 1168 records, there isn&#8217;t a single missing (null) value.</p>
<pre><code class="language-python">train.info()</code></pre>
<pre><code><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 73 columns):
MSSubClass       1168 non-null int64
MSZoning         1168 non-null object
LotArea          1168 non-null int64
Street           1168 non-null object
LotShape         1168 non-null object
LandContour      1168 non-null object
Utilities        1168 non-null object
LotConfig        1168 non-null object
LandSlope        1168 non-null object
Neighborhood     1168 non-null object
Condition1       1168 non-null object
Condition2       1168 non-null object
BldgType         1168 non-null object
HouseStyle       1168 non-null object
OverallQual      1168 non-null int64
OverallCond      1168 non-null int64
YearBuilt        1168 non-null int64
YearRemodAdd     1168 non-null int64
RoofStyle        1168 non-null object
RoofMatl         1168 non-null object
Exterior1st      1168 non-null object
Exterior2nd      1168 non-null object
MasVnrType       1168 non-null object
ExterQual        1168 non-null int64
ExterCond        1168 non-null int64
Foundation       1168 non-null object
BsmtQual         1168 non-null int64
BsmtCond         1168 non-null object
BsmtExposure     1168 non-null object
BsmtFinType1     1168 non-null object
BsmtFinSF1       1168 non-null int64
BsmtFinType2     1168 non-null object
BsmtFinSF2       1168 non-null int64
BsmtUnfSF        1168 non-null int64
TotalBsmtSF      1168 non-null int64
Heating          1168 non-null object
HeatingQC        1168 non-null int64
CentralAir       1168 non-null object
Electrical       1168 non-null object
1stFlrSF         1168 non-null int64
2ndFlrSF         1168 non-null int64
LowQualFinSF     1168 non-null int64
GrLivArea        1168 non-null int64
BsmtFullBath     1168 non-null int64
BsmtHalfBath     1168 non-null int64
FullBath         1168 non-null int64
HalfBath         1168 non-null int64
BedroomAbvGr     1168 non-null int64
KitchenAbvGr     1168 non-null int64
KitchenQual      1168 non-null int64
TotRmsAbvGrd     1168 non-null int64
Functional       1168 non-null object
Fireplaces       1168 non-null int64
FireplaceQu      1168 non-null int64
GarageType       1168 non-null object
GarageFinish     1168 non-null object
GarageCars       1168 non-null int64
GarageArea       1168 non-null int64
GarageQual       1168 non-null int64
GarageCond       1168 non-null int64
PavedDrive       1168 non-null object
WoodDeckSF       1168 non-null int64
OpenPorchSF      1168 non-null int64
EnclosedPorch    1168 non-null int64
3SsnPorch        1168 non-null int64
ScreenPorch      1168 non-null int64
PoolArea         1168 non-null int64
MiscVal          1168 non-null int64
MoSold           1168 non-null int64
YrSold           1168 non-null int64
SaleType         1168 non-null object
SaleCondition    1168 non-null object
SalePrice        1168 non-null int64
dtypes: int64(42), object(31)
memory usage: 666.2+ KB</code></pre>
<p>Finally, let&#8217;s save the cleaned data in a separate file.</p>
<pre><code class="language-python">train.to_csv(&#039;train-cleaned.csv&#039;)</code></pre>
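<p>One detail to be aware of: by default <code>to_csv</code> also writes the DataFrame index as an unnamed first column, which comes back as an extra column on the next <code>read_csv</code> unless handled; passing <code>index=False</code> avoids that. A minimal sketch on a made-up frame:</p>

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Default: the index becomes an unnamed leading column
with_index = df.to_csv()
# index=False keeps only the data columns
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```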
<h2>Conclusions</h2>
<p>In <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Part 1</a> we dealt with missing values and removed the following columns: ‘Id’, ‘PoolQC’, ‘MiscFeature’, ‘Alley’, ‘Fence’, ‘LotFrontage’, ‘GarageYrBlt’, ‘MasVnrArea’.</p>
<p>In this Part 2 we:</p>
<ul>
<li>
<p>Replaced the NaN with NA in the following columns: &#8216;GarageType&#8217;, &#8216;GarageFinish&#8217;, &#8216;BsmtFinType2&#8217;, &#8216;BsmtExposure&#8217;, &#8216;BsmtFinType1&#8217;.</p>
</li>
<li>
<p>Replaced the NaN with None in &#8216;MasVnrType&#8217;.</p>
</li>
<li>
<p>Imputed the most frequent value in place of NaN in &#8216;Electrical&#8217;.</p>
</li>
</ul>
<p>We are going to use this data to create our Machine Learning model and predict the house prices in the next post of this series.</p>
<p>Remember you can download the complete <a href="https://renanmf.com/wp-content/uploads/2021/01/Exploratory-Data-Analysis-House-Prices-Part1AND2.zip">Jupyter Notebook</a> covering parts 1 and 2 of the EDA, but the notebook contains only code and doesn&#8217;t have the explanations.</p>
<p>The content <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Exploratory Data Analysis &#8211; House Prices &#8211; Part 1</title>
		<link>https://renanmf.com/exploratory-data-analysis-house-prices-part-one/</link>
		
		<dc:creator><![CDATA[Renan Moura]]></dc:creator>
		<pubDate>Wed, 20 Jan 2021 18:36:46 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://renanmf.com/?p=2886</guid>

					<description><![CDATA[<p>This is part of a series: Exploratory Data Analysis – House Prices – Part 1 Exploratory Data Analysis – House Prices – Part 2 Data Science Project: Data Cleaning Script – House Prices DataSet Data Science Project: Machine Learning Model – House Prices Dataset Data Science Project: House Prices Dataset &#8211; API Data Science and [&#8230;]</p>
<p>The content <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis &#8211; House Prices &#8211; Part 1</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>This is part of a series:</p>
<ul>
<li><strong><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis – House Prices – Part 1</a></strong></li>
<li><a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Exploratory Data Analysis – House Prices – Part 2</a></li>
<li><a href="https://renanmf.com/data-science-project-data-cleaning-house-prices-dataset/">Data Science Project: Data Cleaning Script – House Prices DataSet</a></li>
<li><a href="https://renanmf.com/data-science-project-machine-learning-model-house-prices-dataset/">Data Science Project: Machine Learning Model – House Prices Dataset</a></li>
<li><a href="https://renanmf.com/data-science-project-house-prices-dataset-api/">Data Science Project: House Prices Dataset &#8211; API</a></li>
<li><a href="https://renanmf.com/data-science-machine-learning-house-prices/">Data Science and Machine Learning Project: House Prices Dataset</a></li>
</ul>
<hr />
<p>In this article we are going to do an Exploratory Data Analysis, a.k.a EDA, of the dataset &quot;<a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">House Prices: Advanced Regression Techniques</a>&quot;.</p>
<p>In this Part 1 we will:</p>
<ul>
<li>Understand the problem</li>
<li>Explore the data and deal with missing values</li>
</ul>
<p>In <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Part 2</a> we will:</p>
<ul>
<li>Prepare the data</li>
<li>Select and transform variables, especially categorical ones</li>
</ul>
<h2>The Problem</h2>
<p>This is the description of the problem on Kaggle:</p>
<p>&quot;Ask a home buyer to describe their dream house, and they probably won&#8217;t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition&#8217;s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.</p>
<p>With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.&quot;</p>
<p>So, we are going to explore the dataset, try to get some insights from it, and use some tools to transform the data into formats that make more sense.</p>
<h2>Initial Exploration and First Insights</h2>
<p>In this section, we are going to make an initial exploration of the dataset.</p>
<p>This EDA was performed on a <a href="https://jupyter.org/">Jupyter Notebook</a> and you can <a href="https://renanmf.com/wp-content/uploads/2021/01/1.0-Exploratory-Data-Analysis-House-Prices-Part1.zip">download the notebook</a> of this part 1 of the EDA, but the notebook contains only raw code and doesn&#8217;t have the explanations.</p>
<h3>Importing Libraries</h3>
<p>We begin by importing the libs we are going to use:</p>
<ul>
<li>The standard <a href="https://docs.python.org/3/library/math.html">math</a> module provides access to the mathematical functions.</li>
<li>The <a href="https://numpy.org/">NumPy</a> lib is fundamental for any kind of scientific computing with Python.</li>
<li><a href="https://pandas.pydata.org/">pandas</a> is a must-have tool for data analysis and manipulation.</li>
<li><a href="https://matplotlib.org/">matplotlib</a> is the most complete package in Python when it comes to data visualizations.</li>
<li><a href="https://seaborn.pydata.org/">seaborn</a> is a higher-level set of visualization tools built on top of matplotlib: not as powerful, but much easier to work with, and it delivers a lot with little work.</li>
</ul>
<pre><code class="language-python">import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline</code></pre>
<h3>Loading Data</h3>
<p>Since we have tabular data, we are going to use <em>pandas</em> to load the data and take a first look at it.</p>
<p>To load the data, since the format is CSV (Comma-Separated Values), we use the <code>read_csv()</code> function from pandas.</p>
<p>Then we print its shape, which is 1168&#215;81, meaning we have 1168 rows (records) and 81 columns (features).</p>
<p>Actually, we have 1169 rows in the CSV file, but the header that describes the columns doesn&#8217;t count.</p>
<p>And we actually have 79 features, since one of the columns is <code>SalePrice</code>, the target we will try to predict with a model, and another is <code>Id</code>, which we will not use and will get rid of later.</p>
<p>The dataset can be downloaded from <a href="https://renanmf.com/wp-content/uploads/2021/01/train.zip">Homes Dataset</a>.</p>
<pre><code class="language-python">train = pd.read_csv(&#039;../data/raw/train.csv&#039;)
train.shape</code></pre>
<pre><code>(1168, 81)</code></pre>
<h3>Looking at the Data</h3>
<p>First, I recommend you to read <a href="https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data">this brief description of each column</a>.</p>
<p>Using the <code>head()</code> function from pandas with an argument of 3, we can take a look at the first 3 records.</p>
<p>The <code>.T</code> means <em>transpose</em>; this way we visualize rows as columns and vice versa.</p>
<p>Notice how it doesn&#8217;t show all of the columns in the middle and only displays <code>...</code> because there are too many of them.</p>
<pre><code class="language-python">train.head(3).T</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
<th>2</th>
</tr>
</thead>
<tbody>
<tr>
<th>Id</th>
<td>893</td>
<td>1106</td>
<td>414</td>
</tr>
<tr>
<th>MSSubClass</th>
<td>20</td>
<td>60</td>
<td>30</td>
</tr>
<tr>
<th>MSZoning</th>
<td>RL</td>
<td>RL</td>
<td>RM</td>
</tr>
<tr>
<th>LotFrontage</th>
<td>70</td>
<td>98</td>
<td>56</td>
</tr>
<tr>
<th>LotArea</th>
<td>8414</td>
<td>12256</td>
<td>8960</td>
</tr>
<tr>
<th>&#8230;</th>
<td>&#8230;</td>
<td>&#8230;</td>
<td>&#8230;</td>
</tr>
<tr>
<th>MoSold</th>
<td>2</td>
<td>4</td>
<td>3</td>
</tr>
<tr>
<th>YrSold</th>
<td>2006</td>
<td>2010</td>
<td>2010</td>
</tr>
<tr>
<th>SaleType</th>
<td>WD</td>
<td>WD</td>
<td>WD</td>
</tr>
<tr>
<th>SaleCondition</th>
<td>Normal</td>
<td>Normal</td>
<td>Normal</td>
</tr>
<tr>
<th>SalePrice</th>
<td>154500</td>
<td>325000</td>
<td>115000</td>
</tr>
</tbody>
</table>
<p>81 rows × 3 columns</p>
</div>
<p>The <code>info()</code> method from pandas will give you a summary of the data.</p>
<p>Notice how <code>Alley</code> has 70 non-null values, meaning it doesn&#8217;t have a value for most of the 1168 records.</p>
<p>We can also visualize the data types.</p>
<pre><code class="language-python">train.info()</code></pre>
<pre><code><class 'pandas.core.frame.DataFrame'>
RangeIndex: 1168 entries, 0 to 1167
Data columns (total 81 columns):
Id               1168 non-null int64
MSSubClass       1168 non-null int64
MSZoning         1168 non-null object
LotFrontage      964 non-null float64
LotArea          1168 non-null int64
Street           1168 non-null object
Alley            70 non-null object
LotShape         1168 non-null object
LandContour      1168 non-null object
Utilities        1168 non-null object
LotConfig        1168 non-null object
LandSlope        1168 non-null object
Neighborhood     1168 non-null object
Condition1       1168 non-null object
Condition2       1168 non-null object
BldgType         1168 non-null object
HouseStyle       1168 non-null object
OverallQual      1168 non-null int64
OverallCond      1168 non-null int64
YearBuilt        1168 non-null int64
YearRemodAdd     1168 non-null int64
RoofStyle        1168 non-null object
RoofMatl         1168 non-null object
Exterior1st      1168 non-null object
Exterior2nd      1168 non-null object
MasVnrType       1160 non-null object
MasVnrArea       1160 non-null float64
ExterQual        1168 non-null object
ExterCond        1168 non-null object
Foundation       1168 non-null object
BsmtQual         1138 non-null object
BsmtCond         1138 non-null object
BsmtExposure     1137 non-null object
BsmtFinType1     1138 non-null object
BsmtFinSF1       1168 non-null int64
BsmtFinType2     1137 non-null object
BsmtFinSF2       1168 non-null int64
BsmtUnfSF        1168 non-null int64
TotalBsmtSF      1168 non-null int64
Heating          1168 non-null object
HeatingQC        1168 non-null object
CentralAir       1168 non-null object
Electrical       1167 non-null object
1stFlrSF         1168 non-null int64
2ndFlrSF         1168 non-null int64
LowQualFinSF     1168 non-null int64
GrLivArea        1168 non-null int64
BsmtFullBath     1168 non-null int64
BsmtHalfBath     1168 non-null int64
FullBath         1168 non-null int64
HalfBath         1168 non-null int64
BedroomAbvGr     1168 non-null int64
KitchenAbvGr     1168 non-null int64
KitchenQual      1168 non-null object
TotRmsAbvGrd     1168 non-null int64
Functional       1168 non-null object
Fireplaces       1168 non-null int64
FireplaceQu      617 non-null object
GarageType       1099 non-null object
GarageYrBlt      1099 non-null float64
GarageFinish     1099 non-null object
GarageCars       1168 non-null int64
GarageArea       1168 non-null int64
GarageQual       1099 non-null object
GarageCond       1099 non-null object
PavedDrive       1168 non-null object
WoodDeckSF       1168 non-null int64
OpenPorchSF      1168 non-null int64
EnclosedPorch    1168 non-null int64
3SsnPorch        1168 non-null int64
ScreenPorch      1168 non-null int64
PoolArea         1168 non-null int64
PoolQC           4 non-null object
Fence            217 non-null object
MiscFeature      39 non-null object
MiscVal          1168 non-null int64
MoSold           1168 non-null int64
YrSold           1168 non-null int64
SaleType         1168 non-null object
SaleCondition    1168 non-null object
SalePrice        1168 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 739.2+ KB</code></pre>
<p>The <code>describe()</code> method is a good way to get first insights into the data.</p>
<p>It automatically gives you descriptive statistics for each feature: number of non-NA/null observations, <em>mean</em>, <em>standard deviation</em>, the <em>min</em> value, the <em>quartiles</em>, and the <em>max</em> value.</p>
<p>Note that the calculations don&#8217;t take <code>NaN</code> values into consideration.</p>
<p>For <code>LotFrontage</code>, for instance, it uses only the 964 non-null values, and excludes the other 204 null observations.</p>
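<p>The NaN exclusion is easy to verify on a tiny made-up column (hypothetical values, not the real data):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
s = pd.Series([70.0, np.nan, 56.0, 98.0], name='LotFrontage')

desc = s.describe()
print(desc['count'])  # the NaN is not counted
print(desc['mean'])   # mean of the 3 non-null values only
```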
<pre><code class="language-python">train.describe().T</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>count</th>
<th>mean</th>
<th>std</th>
<th>min</th>
<th>25%</th>
<th>50%</th>
<th>75%</th>
<th>max</th>
</tr>
</thead>
<tbody>
<tr>
<th>Id</th>
<td>1168.0</td>
<td>720.240582</td>
<td>420.237685</td>
<td>1.0</td>
<td>355.75</td>
<td>716.5</td>
<td>1080.25</td>
<td>1460.0</td>
</tr>
<tr>
<th>MSSubClass</th>
<td>1168.0</td>
<td>56.699486</td>
<td>41.814065</td>
<td>20.0</td>
<td>20.00</td>
<td>50.0</td>
<td>70.00</td>
<td>190.0</td>
</tr>
<tr>
<th>LotFrontage</th>
<td>964.0</td>
<td>70.271784</td>
<td>25.019386</td>
<td>21.0</td>
<td>59.00</td>
<td>69.5</td>
<td>80.00</td>
<td>313.0</td>
</tr>
<tr>
<th>LotArea</th>
<td>1168.0</td>
<td>10597.720890</td>
<td>10684.958323</td>
<td>1477.0</td>
<td>7560.00</td>
<td>9463.0</td>
<td>11601.50</td>
<td>215245.0</td>
</tr>
<tr>
<th>OverallQual</th>
<td>1168.0</td>
<td>6.095034</td>
<td>1.403402</td>
<td>1.0</td>
<td>5.00</td>
<td>6.0</td>
<td>7.00</td>
<td>10.0</td>
</tr>
<tr>
<th>OverallCond</th>
<td>1168.0</td>
<td>5.594178</td>
<td>1.116842</td>
<td>1.0</td>
<td>5.00</td>
<td>5.0</td>
<td>6.00</td>
<td>9.0</td>
</tr>
<tr>
<th>YearBuilt</th>
<td>1168.0</td>
<td>1971.120719</td>
<td>30.279560</td>
<td>1872.0</td>
<td>1954.00</td>
<td>1972.0</td>
<td>2000.00</td>
<td>2009.0</td>
</tr>
<tr>
<th>YearRemodAdd</th>
<td>1168.0</td>
<td>1985.200342</td>
<td>20.498566</td>
<td>1950.0</td>
<td>1968.00</td>
<td>1994.0</td>
<td>2004.00</td>
<td>2010.0</td>
</tr>
<tr>
<th>MasVnrArea</th>
<td>1160.0</td>
<td>104.620690</td>
<td>183.996031</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>166.25</td>
<td>1600.0</td>
</tr>
<tr>
<th>BsmtFinSF1</th>
<td>1168.0</td>
<td>444.345890</td>
<td>466.278751</td>
<td>0.0</td>
<td>0.00</td>
<td>384.0</td>
<td>706.50</td>
<td>5644.0</td>
</tr>
<tr>
<th>BsmtFinSF2</th>
<td>1168.0</td>
<td>46.869863</td>
<td>162.324086</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>1474.0</td>
</tr>
<tr>
<th>BsmtUnfSF</th>
<td>1168.0</td>
<td>562.949486</td>
<td>445.605458</td>
<td>0.0</td>
<td>216.00</td>
<td>464.5</td>
<td>808.50</td>
<td>2336.0</td>
</tr>
<tr>
<th>TotalBsmtSF</th>
<td>1168.0</td>
<td>1054.165240</td>
<td>448.848911</td>
<td>0.0</td>
<td>792.75</td>
<td>984.0</td>
<td>1299.00</td>
<td>6110.0</td>
</tr>
<tr>
<th>1stFlrSF</th>
<td>1168.0</td>
<td>1161.268836</td>
<td>393.541120</td>
<td>334.0</td>
<td>873.50</td>
<td>1079.5</td>
<td>1392.00</td>
<td>4692.0</td>
</tr>
<tr>
<th>2ndFlrSF</th>
<td>1168.0</td>
<td>351.218322</td>
<td>437.334802</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>730.50</td>
<td>2065.0</td>
</tr>
<tr>
<th>LowQualFinSF</th>
<td>1168.0</td>
<td>5.653253</td>
<td>48.068312</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>572.0</td>
</tr>
<tr>
<th>GrLivArea</th>
<td>1168.0</td>
<td>1518.140411</td>
<td>534.904019</td>
<td>334.0</td>
<td>1133.25</td>
<td>1467.5</td>
<td>1775.25</td>
<td>5642.0</td>
</tr>
<tr>
<th>BsmtFullBath</th>
<td>1168.0</td>
<td>0.426370</td>
<td>0.523376</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>1.00</td>
<td>3.0</td>
</tr>
<tr>
<th>BsmtHalfBath</th>
<td>1168.0</td>
<td>0.061644</td>
<td>0.244146</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>2.0</td>
</tr>
<tr>
<th>FullBath</th>
<td>1168.0</td>
<td>1.561644</td>
<td>0.555074</td>
<td>0.0</td>
<td>1.00</td>
<td>2.0</td>
<td>2.00</td>
<td>3.0</td>
</tr>
<tr>
<th>HalfBath</th>
<td>1168.0</td>
<td>0.386130</td>
<td>0.504356</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>1.00</td>
<td>2.0</td>
</tr>
<tr>
<th>BedroomAbvGr</th>
<td>1168.0</td>
<td>2.865582</td>
<td>0.817491</td>
<td>0.0</td>
<td>2.00</td>
<td>3.0</td>
<td>3.00</td>
<td>8.0</td>
</tr>
<tr>
<th>KitchenAbvGr</th>
<td>1168.0</td>
<td>1.046233</td>
<td>0.218084</td>
<td>1.0</td>
<td>1.00</td>
<td>1.0</td>
<td>1.00</td>
<td>3.0</td>
</tr>
<tr>
<th>TotRmsAbvGrd</th>
<td>1168.0</td>
<td>6.532534</td>
<td>1.627412</td>
<td>2.0</td>
<td>5.00</td>
<td>6.0</td>
<td>7.00</td>
<td>14.0</td>
</tr>
<tr>
<th>Fireplaces</th>
<td>1168.0</td>
<td>0.612158</td>
<td>0.640872</td>
<td>0.0</td>
<td>0.00</td>
<td>1.0</td>
<td>1.00</td>
<td>3.0</td>
</tr>
<tr>
<th>GarageYrBlt</th>
<td>1099.0</td>
<td>1978.586897</td>
<td>24.608158</td>
<td>1900.0</td>
<td>1962.00</td>
<td>1980.0</td>
<td>2002.00</td>
<td>2010.0</td>
</tr>
<tr>
<th>GarageCars</th>
<td>1168.0</td>
<td>1.761130</td>
<td>0.759039</td>
<td>0.0</td>
<td>1.00</td>
<td>2.0</td>
<td>2.00</td>
<td>4.0</td>
</tr>
<tr>
<th>GarageArea</th>
<td>1168.0</td>
<td>473.000000</td>
<td>218.795260</td>
<td>0.0</td>
<td>318.75</td>
<td>479.5</td>
<td>577.00</td>
<td>1418.0</td>
</tr>
<tr>
<th>WoodDeckSF</th>
<td>1168.0</td>
<td>92.618151</td>
<td>122.796184</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>168.00</td>
<td>736.0</td>
</tr>
<tr>
<th>OpenPorchSF</th>
<td>1168.0</td>
<td>45.256849</td>
<td>64.120769</td>
<td>0.0</td>
<td>0.00</td>
<td>24.0</td>
<td>68.00</td>
<td>523.0</td>
</tr>
<tr>
<th>EnclosedPorch</th>
<td>1168.0</td>
<td>20.790240</td>
<td>58.308987</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>330.0</td>
</tr>
<tr>
<th>3SsnPorch</th>
<td>1168.0</td>
<td>3.323630</td>
<td>27.261055</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>407.0</td>
</tr>
<tr>
<th>ScreenPorch</th>
<td>1168.0</td>
<td>14.023116</td>
<td>52.498520</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>410.0</td>
</tr>
<tr>
<th>PoolArea</th>
<td>1168.0</td>
<td>1.934075</td>
<td>33.192538</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>648.0</td>
</tr>
<tr>
<th>MiscVal</th>
<td>1168.0</td>
<td>42.092466</td>
<td>538.941473</td>
<td>0.0</td>
<td>0.00</td>
<td>0.0</td>
<td>0.00</td>
<td>15500.0</td>
</tr>
<tr>
<th>MoSold</th>
<td>1168.0</td>
<td>6.377568</td>
<td>2.727010</td>
<td>1.0</td>
<td>5.00</td>
<td>6.0</td>
<td>8.00</td>
<td>12.0</td>
</tr>
<tr>
<th>YrSold</th>
<td>1168.0</td>
<td>2007.815068</td>
<td>1.327339</td>
<td>2006.0</td>
<td>2007.00</td>
<td>2008.0</td>
<td>2009.00</td>
<td>2010.0</td>
</tr>
<tr>
<th>SalePrice</th>
<td>1168.0</td>
<td>181081.876712</td>
<td>81131.228007</td>
<td>34900.0</td>
<td>129975.00</td>
<td>162950.0</td>
<td>214000.00</td>
<td>755000.0</td>
</tr>
</tbody>
</table>
</div>
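<p>As a quick standalone check that these statistics really skip <code>NaN</code>, here is a minimal sketch on a toy Series (not the competition data):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Toy Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])
stats = s.describe()

print(stats[&#039;count&#039;])  # 2.0: only the two non-null values are counted
print(stats[&#039;mean&#039;])   # 2.0: mean of 1.0 and 3.0, the NaN is excluded</code></pre>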
<h2>Data Cleaning</h2>
<p>In this section, we will perform some Data Cleaning.</p>
<h3>The <code>Id</code> column</h3>
<p>The <code>Id</code> column is just a row identifier and has no relationship to <code>SalePrice</code>.</p>
<p>So let&#8217;s remove it:</p>
<pre><code class="language-python">train.drop(columns=[&#039;Id&#039;], inplace=True)</code></pre>
<h3>Missing values</h3>
<p>When we used <code>info()</code> to see the data summary, we could see that many columns had missing values.</p>
<p>Let&#8217;s see which columns have missing values and the proportion in each one of them.</p>
<p><code>isna()</code> from pandas returns a boolean mask marking the missing values; <code>sum()</code> then counts them for each column.</p>
<pre><code class="language-python">columns_with_miss = train.isna().sum()
#filtering only the columns with at least 1 missing value
columns_with_miss = columns_with_miss[columns_with_miss!=0]
#The number of columns with missing values
print(&#039;Columns with missing values:&#039;, len(columns_with_miss))
#sorting the columns by the number of missing values descending
columns_with_miss.sort_values(ascending=False)</code></pre>
<pre><code>Columns with missing values: 19

PoolQC          1164
MiscFeature     1129
Alley           1098
Fence            951
FireplaceQu      551
LotFrontage      204
GarageYrBlt       69
GarageType        69
GarageFinish      69
GarageQual        69
GarageCond        69
BsmtFinType2      31
BsmtExposure      31
BsmtFinType1      30
BsmtCond          30
BsmtQual          30
MasVnrArea         8
MasVnrType         8
Electrical         1
dtype: int64</code></pre>
<p>Out of 80 columns, 19 have missing values. </p>
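<p>The counts above can also be expressed as proportions; a hedged sketch of the same pattern using percentages (toy DataFrame with illustrative values, not the competition data):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

# Toy stand-in for `train`
df = pd.DataFrame({
    &#039;PoolQC&#039;:     [np.nan, np.nan, &#039;Gd&#039;, np.nan],
    &#039;Electrical&#039;: [&#039;SBrkr&#039;, &#039;SBrkr&#039;, np.nan, &#039;FuseA&#039;],
    &#039;SalePrice&#039;:  [200000, 150000, 180000, 210000],
})

# isna() gives booleans; the mean of booleans is the fraction missing
missing_pct = df.isna().mean() * 100
missing_pct = missing_pct[missing_pct &gt; 0].sort_values(ascending=False)
print(missing_pct)  # PoolQC: 75.0, Electrical: 25.0</code></pre>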
<p>Missing values per se are not a big problem, but columns with a high number of missing values can cause distortions.</p>
<p>This is the case for:</p>
<ul>
<li>PoolQC: Pool quality</li>
<li>MiscFeature: Miscellaneous feature not covered in other categories</li>
<li>Alley: Type of alley access to property</li>
<li>Fence: Fence quality</li>
</ul>
<p>Let&#8217;s drop them from the dataset for now.</p>
<pre><code class="language-python"># Removing columns
train.drop(columns=[&#039;PoolQC&#039;, &#039;MiscFeature&#039;, &#039;Alley&#039;, &#039;Fence&#039;], inplace=True)</code></pre>
<p>FireplaceQu has 551 missing values, which is also pretty high.</p>
<p>In this case, however, a missing value has a meaning: &quot;No Fireplace&quot;.</p>
<p>Fireplace has the following categories:</p>
<ul>
<li>Ex Excellent &#8211; Exceptional Masonry Fireplace</li>
<li>Gd Good &#8211; Masonry Fireplace in main level</li>
<li>TA Average &#8211; Prefabricated Fireplace in main living area or Masonry Fireplace in basement</li>
<li>Fa Fair &#8211; Prefabricated Fireplace in basement</li>
<li>Po Poor &#8211; Ben Franklin Stove</li>
<li>NA No Fireplace</li>
</ul>
<p>Let&#8217;s check the correlation between FireplaceQu and SalePrice, to see how important this feature is in order to determine the price.</p>
<p>First, we will replace the missing values with 0.</p>
<p>Then, we encode the categories into numbers from 1 to 5.</p>
<pre><code class="language-python">train[&#039;FireplaceQu&#039;].fillna(0, inplace=True)
train[&#039;FireplaceQu&#039;].replace({&#039;Po&#039;: 1, &#039;Fa&#039;: 2, &#039;TA&#039;: 3, &#039;Gd&#039;: 4, &#039;Ex&#039;: 5}, inplace=True)</code></pre>
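<p>Newer pandas versions may warn about <code>inplace</code> operations on a single column; an equivalent assignment-based version can be sketched with <code>map()</code> (toy Series with illustrative values):</p>
<pre><code class="language-python">import numpy as np
import pandas as pd

quality_map = {&#039;Po&#039;: 1, &#039;Fa&#039;: 2, &#039;TA&#039;: 3, &#039;Gd&#039;: 4, &#039;Ex&#039;: 5}

# Toy stand-in for train[&#039;FireplaceQu&#039;]
fireplace = pd.Series([&#039;Gd&#039;, np.nan, &#039;Ex&#039;, &#039;TA&#039;])

# map() encodes the categories; the unmapped NaN becomes 0 (no fireplace)
encoded = fireplace.map(quality_map).fillna(0).astype(int)
print(encoded.tolist())  # [4, 0, 5, 3]</code></pre>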
<p>Using a barplot, we can see how SalePrice increases with the fireplace quality category.</p>
<p>It is also worth noting how much higher the value is when the house has an Excellent fireplace.</p>
<p>This means we should keep FireplaceQu as a feature.</p>
<pre><code class="language-python">sns.set(style=&quot;whitegrid&quot;)
sns.barplot(x=&#039;FireplaceQu&#039;, y=&quot;SalePrice&quot;, data=train)</code></pre>
<p><img decoding="async" src="https://renanmf.com/wp-content/uploads/2021/01/output_22_1.png" alt="png" /></p>
<h3>Missing values in numeric columns</h3>
<p>Another feature with a high number of missing values is LotFrontage, with a count of 204.</p>
<p>Let’s see the correlation between the remaining features with missing values and the SalePrice.</p>
<pre><code class="language-python">columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
c = list(columns_with_miss.index)
c.append(&#039;SalePrice&#039;)
train[c].corr()</code></pre>
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>LotFrontage</th>
<th>MasVnrArea</th>
<th>GarageYrBlt</th>
<th>SalePrice</th>
</tr>
</thead>
<tbody>
<tr>
<th>LotFrontage</th>
<td>1.000000</td>
<td>0.196649</td>
<td>0.089542</td>
<td>0.371839</td>
</tr>
<tr>
<th>MasVnrArea</th>
<td>0.196649</td>
<td>1.000000</td>
<td>0.253348</td>
<td>0.478724</td>
</tr>
<tr>
<th>GarageYrBlt</th>
<td>0.089542</td>
<td>0.253348</td>
<td>1.000000</td>
<td>0.496575</td>
</tr>
<tr>
<th>SalePrice</th>
<td>0.371839</td>
<td>0.478724</td>
<td>0.496575</td>
<td>1.000000</td>
</tr>
</tbody>
</table>
</div>
<p>Note that LotFrontage, MasVnrArea, and GarageYrBlt have a positive correlation with SalePrice, but this correlation isn&#8217;t very strong.</p>
<p>To simplify this analysis, we will remove these columns for now:</p>
<pre><code class="language-python">cols_to_be_removed = [&#039;LotFrontage&#039;, &#039;GarageYrBlt&#039;, &#039;MasVnrArea&#039;]
train.drop(columns=cols_to_be_removed, inplace=True)</code></pre>
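<p>One way to make the &quot;not very strong&quot; judgement explicit is a numeric cutoff; a hedged sketch using the correlation values from the table above (the 0.5 threshold is chosen purely for illustration):</p>
<pre><code class="language-python">import pandas as pd

# Correlations of each column with SalePrice, taken from the table above
corr_with_price = pd.Series({
    &#039;LotFrontage&#039;: 0.371839,
    &#039;MasVnrArea&#039;: 0.478724,
    &#039;GarageYrBlt&#039;: 0.496575,
})

threshold = 0.5  # illustrative cutoff
weak = corr_with_price[corr_with_price &lt; threshold].index.tolist()
print(weak)  # [&#039;LotFrontage&#039;, &#039;MasVnrArea&#039;, &#039;GarageYrBlt&#039;]</code></pre>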
<p>Finally, these are the remaining columns with missing values:</p>
<pre><code class="language-python">columns_with_miss = train.isna().sum()
columns_with_miss = columns_with_miss[columns_with_miss!=0]
print(f&#039;Columns with missing values: {len(columns_with_miss)}&#039;)
columns_with_miss.sort_values(ascending=False)</code></pre>
<pre><code>Columns with missing values: 11

GarageCond      69
GarageQual      69
GarageFinish    69
GarageType      69
BsmtFinType2    31
BsmtExposure    31
BsmtFinType1    30
BsmtCond        30
BsmtQual        30
MasVnrType       8
Electrical       1
dtype: int64</code></pre>
<h2>Conclusion</h2>
<p>In this part 1 we dealt with missing values and removed the following columns: &#8216;Id&#8217;, &#8216;PoolQC&#8217;, &#8216;MiscFeature&#8217;, &#8216;Alley&#8217;, &#8216;Fence&#8217;, &#8216;LotFrontage&#8217;, &#8216;GarageYrBlt&#8217;, &#8216;MasVnrArea&#8217;.</p>
<p>Please note that the removed columns are not necessarily useless; they might still contribute to a final model.</p>
<p>After the first round of analysis and testing of the hypothesis, if you ever need to improve your future model further, you can consider reevaluating these columns and understand them better to see how they fit into the problem.</p>
<p>Data Analysis and Machine Learning is NOT a straight path.</p>
<p>It is a process where you iterate and keep testing ideas until you have the result you want, or until you find out that the result you need is not achievable.</p>
<p>In <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-two/">Part 2</a> (the final part of the EDA) we will see ways to handle the missing values in the other 11 columns.</p>
<p>We will also explore categorical variables.</p>
<p>The content <a href="https://renanmf.com/exploratory-data-analysis-house-prices-part-one/">Exploratory Data Analysis &#8211; House Prices &#8211; Part 1</a> is from <a href="https://renanmf.com">Renan Moura - Software Engineering</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
