Machine Learning

State of Data Science 2021: Popularity of Python

Renan Moura — Tue, 27 Jul 2021 12:03:55 +0000

Python continues to be an excellent choice if you are entering the data science field.

Python still dominates and is the most popular language, particularly among younger generations.

88% of students surveyed are learning Python in preparation for a data science career.

63% of the respondents said they use it frequently or always.

71% of educators are teaching Python.

It is also interesting to notice SQL ranking 2nd, right after Python.

Most structured data still lives in relational databases, so a good knowledge of both Python and SQL is a must to deal with data.

The good news is that they are both very accessible and good to begin working with code.
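
As a minimal sketch of how the two languages work together, the example below uses Python’s built-in sqlite3 module to run a SQL aggregation and then continues in plain Python. The table name, columns, and rows are made up for illustration and are not from the survey.

```python
import sqlite3

# In-memory database with a hypothetical "houses" table (illustrative data only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (neighborhood TEXT, sale_price INTEGER)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [("NAmes", 100000), ("NAmes", 120000), ("OldTown", 90000)],
)

# SQL does the grouping and aggregation inside the database...
rows = conn.execute(
    "SELECT neighborhood, AVG(sale_price) FROM houses "
    "GROUP BY neighborhood ORDER BY neighborhood"
).fetchall()
conn.close()

# ...and Python takes over for whatever comes next
avg_by_neighborhood = dict(rows)
print(avg_by_neighborhood)
```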

Comments about the other languages

R is an alternative to Python, but I don’t see any advantage in learning it if you are already on the Python path, since R won’t bring anything to the table that Python doesn’t.

Then we have JavaScript and HTML/CSS, which makes sense: your results won’t live in a Word document on your computer, and a good way to display them is on the web with nice interactivity.

Bash/Shell are super useful. The command line is one of the most powerful tools in a coder’s tool belt, and many data engineering tools like Hadoop rely heavily on command line interfaces that can be easily automated with a shell script.

If you are wondering why Java ranks so high in this list, Hadoop, Hive, HDFS, etc. are made in Java, for instance, and many data pipelines depend on JVM powered tools like Kafka.

So while you may never touch Java as a Data Scientist, you will most probably have to deal with it as a Data Engineer at some point.

C/C++ ranks high due to the number of libraries coded in these languages for high performance.

Python’s most used Machine Learning frameworks and libraries, like Pandas, have their performance-critical parts implemented in C/C++ (or compiled to C through Cython), while Python just provides a nicer API to work with.

The other languages (C#, TypeScript, PHP, Rust, Julia and Go), although they have their place, of course, would not be the subject of further studies from my point of view at the moment.

They are used for more specific use cases or simply fall into "that’s what my team and I know best".

The best contender here to replace Python would be Julia, but it still has a way to go before deserving the time and energy to learn it.

Go would be the high-level, performant alternative to Java, but it doesn’t yet have an ecosystem with as many tools behind it.

So, out of this list, the ones I think will pay you the most dividends for your investment in time and effort are Python, SQL, JavaScript, HTML/CSS, Bash/Shell, and Java.

These languages are more than enough to put you in any stage of a Data Science project or pipeline.

You can read the full report on State of Data Science 2021.

The content State of Data Science 2021: Popularity of Python is from Renan Moura - Software Engineering.

Data Science and Machine Learning Project: House Prices Dataset

Renan Moura — Tue, 23 Feb 2021 13:26:03 +0000

This is a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

You can download a PDF version of this Data Science and Machine Learning Project with the full source code repository linked in the book.

In this series we begin with the EDA (Exploratory Data Analysis) of the data, then create a script to clean the data, use the cleaned data to create a Machine Learning model, and finally use that model to implement a prediction API.

You can download the complete code in the Github Repository with clear instructions to execute this end-to-end project.

>>>You can also watch how to run this project on Youtube<<<


Data Science Project: House Prices Dataset – API

Renan Moura — Tue, 16 Feb 2021 20:41:21 +0000

This is the 5th and final article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

The first four articles covered the Exploratory Data Analysis (EDA), the cleaning of the dataset, and the Machine Learning model.

The output of the fourth article is the Machine Learning Model (you have to unzip the file) that we are going to use in the API.

Class HousePriceModel

Save this script in a file named predict.py.

This file has the class HousePriceModel and is used to load the Machine Learning model and make the predictions.

# the pickle lib is used to load the machine learning model
import pickle
import pandas as pd

class HousePriceModel():

    def __init__(self):
        self.model = self.load_model()
        self.preds = None

    def load_model(self):
        # uses the file model.pkl
        pkl_filename = 'model.pkl'

        try:
            with open(pkl_filename, 'rb') as file:
                pickle_model = pickle.load(file)
        except Exception as error:
            # catch load failures without swallowing KeyboardInterrupt/SystemExit
            print(f'Error loading the model at {pkl_filename}: {error}')
            return None

        return pickle_model

    def predict(self, data):

        if not isinstance(data, pd.DataFrame):
            data = pd.DataFrame(data, index=[0])

        # makes the predictions using the loaded model
        self.preds = self.model.predict(data)
        return self.preds
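
The load_model method relies on pickle round-tripping a Python object through a binary file. Here is a minimal sketch of that mechanism, assuming a plain dictionary as a stand-in for the trained model and a temporary file instead of model.pkl (both are illustrative choices, not part of the project):

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; the real project pickles a scikit-learn pipeline
fake_model = {"intercept": 1000.0, "coefficients": {"GrLivArea": 100.0}}

# Write the object in binary mode, as the training script does with model.pkl
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as file:
    pickle.dump(fake_model, file)
    pkl_path = file.name

# Read it back in binary mode ('rb'), as load_model does
with open(pkl_path, "rb") as file:
    loaded_model = pickle.load(file)

# Clean up the temporary file
os.remove(pkl_path)
print(loaded_model == fake_model)
```

Keep in mind that pickle.load can execute arbitrary code, so only unpickle files you trust.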

The API with FastAPI

To run the API:

uvicorn api:app

Expected output:

INFO:     Started server process [56652]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

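The API was created with the framework FastAPI. Save the script below in a file named api.py, since the uvicorn api:app command looks for the app object in a module named api.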

The "/predict" endpoint will give you a prediction based on a sample.

from fastapi import FastAPI
from datetime import datetime
from predict import HousePriceModel

app = FastAPI()

@app.get("/")
def root():
    return {"status": "online"}

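@app.post("/predict")
def predict(inputs: dict):

    # loads model.pkl through the helper class from predict.py
    model = HousePriceModel()

    # time the prediction; dur is measured here but not returned to the client
    start = datetime.today()
    pred = model.predict(inputs)[0]
    dur = (datetime.today() - start).total_seconds()

    # convert the NumPy value to a plain float so it serializes cleanly to JSON
    return float(pred)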
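
Note that the endpoint above builds a new HousePriceModel on every request, so model.pkl is read from disk each time. One possible way to avoid that, sketched here with a hypothetical load_model_once function and a dummy object standing in for the real loading step, is to cache the loaded object with functools.lru_cache:

```python
from functools import lru_cache

load_calls = []  # records each time the expensive load actually runs

@lru_cache(maxsize=1)
def load_model_once():
    """Hypothetical helper: build the model object once and reuse it."""
    load_calls.append("loaded")
    # Stand-in for HousePriceModel(), which would read model.pkl from disk
    return {"name": "dummy model"}

# Simulate three requests asking for the model
models = [load_model_once() for _ in range(3)]

print(len(load_calls))          # how many times the loader body ran
print(models[0] is models[2])   # whether the requests share one object
```

In the API, the endpoint could then call such a helper instead of instantiating the class directly. This is a design suggestion, not something the original project does.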

Testing the API

You can save the script in a file named test_api.py and execute it directly with python3 test_api.py or python test_api.py, depending on your installation.

Remember to execute this test on a second terminal while the first one runs the server for the actual API.

Expected output:

The actual Sale Price: 109000
The predicted Sale Price: 109000.01144237864

The code to test the API:

# import requests library to make API calls
import requests

# a sample input with all the features we 
# used to train the model
sample_input = {'MSSubClass': 20, 'MSZoning': 'RL', 
'LotArea': 7922, 'Street': 'Pave', 
'LotShape': 'Reg', 'LandContour': 'Lvl', 
'Utilities': 'AllPub', 'LotConfig': 'Inside', 
'LandSlope': 'Gtl', 'Neighborhood': 'NAmes', 
'Condition1': 'Norm', 'Condition2': 'Norm', 
'BldgType': '1Fam', 'HouseStyle': '1Story', 
'OverallQual': 5, 'OverallCond': 7, 
'YearBuilt': 1953, 'YearRemodAdd': 2007, 
'RoofStyle': 'Gable', 'RoofMatl': 'CompShg', 
'Exterior1st': 'VinylSd', 'Exterior2nd': 'VinylSd', 
'MasVnrType': 'None', 'ExterQual': 3,
'ExterCond': 4, 'Foundation': 'CBlock', 
'BsmtQual': 3, 'BsmtCond': 3, 
'BsmtExposure': 'No', 'BsmtFinType1': 'GLQ', 
'BsmtFinSF1': 731, 'BsmtFinType2': 'Unf', 
'BsmtFinSF2': 0, 'BsmtUnfSF': 326, 
'TotalBsmtSF': 1057, 'Heating': 'GasA', 
'HeatingQC': 3, 'CentralAir': 'Y', 
'Electrical': 'SBrkr', '1stFlrSF': 1057, 
'2ndFlrSF': 0, 'LowQualFinSF': 0, 
'GrLivArea': 1057, 'BsmtFullBath': 1, 
'BsmtHalfBath': 0, 'FullBath': 1, 
'HalfBath': 0, 'BedroomAbvGr': 3, 
'KitchenAbvGr': 1, 'KitchenQual': 4, 
'TotRmsAbvGrd': 5, 'Functional': 'Typ', 
'Fireplaces': 0, 'FireplaceQu': 0, 
'GarageType': 'Detchd', 'GarageFinish': 'Unf',
'GarageCars': 1, 'GarageArea': 246, 
'GarageQual': 3, 'GarageCond': 3, 
'PavedDrive': 'Y', 'WoodDeckSF': 0, 
'OpenPorchSF': 52, 'EnclosedPorch': 0, 
'3SsnPorch': 0, 'ScreenPorch': 0, 
'PoolArea': 0, 'MiscVal': 0, 'MoSold': 1,
'YrSold': 2010, 'SaleType': 'WD', 
'SaleCondition': 'Abnorml'}

def run_prediction_from_sample():

    url = "http://127.0.0.1:8000/predict"
    headers = {"Content-Type": "application/json",
               "Accept": "text/plain"}

    # send the sample as a JSON body to the /predict endpoint
    response = requests.post(url, headers=headers,
                             json=sample_input)
    print("The actual Sale Price: 109000")
    print(f"The predicted Sale Price: {response.text}")

if __name__ == "__main__":
    run_prediction_from_sample()


Data Science Project: Machine Learning Model – House Prices Dataset

Renan Moura — Wed, 10 Feb 2021 14:34:50 +0000

This is the fourth article in a series on Data Science and Machine Learning applied to a House Prices dataset from the Kaggle competition House Prices: Advanced Regression Techniques.

The first three articles covered the Exploratory Data Analysis (EDA) and the cleaning of the dataset.

The output of the first three articles is the cleaned_dataset (you have to unzip the file to use the CSV) that we are going to use to generate the Machine Learning Model.

Training the Machine Learning Model

You can save the script in a file named train_model.py and execute it directly with python3 train_model.py or python train_model.py, depending on your installation.

It expects you to have a file called ‘cleaned_data.csv’ (you can download it from the link above in ZIP format) in the same folder and will output three other files:

model.pkl: the model in binary format generated by pickle that we can reuse later
train.csv: the train data after the split of the original data into train and test
test.csv: the test data after the split of the original data into train and test

The output on the terminal will be similar to this:

Train data for modeling: (934, 74)
Test data for predictions: (234, 74)
Training the model ...
Testing the model ...
Average Price Test: 175652.0128205128
RMSE: 10552.188828855931
Model saved at model.pkl

It means the model used 934 data points to train and 234 data points to test.

The average Sale Price in the test set is 175k dollars.

The RMSE (root-mean-square error) is a good metric to understand the output because you can read it on the same scale as your dependent variable, which is Sale Price in this case.

An RMSE of 10552 means that our predictions typically missed the correct Sale Price by roughly 10k dollars (RMSE weighs large errors more heavily than a plain average of the misses would).

Considering an average of 175k, missing the mark by about 10k is not too bad.
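
To make the metric concrete, here is a small sketch of how RMSE is computed, using made-up prices rather than the project’s data. It also computes the mean absolute error (MAE) for comparison, since RMSE penalizes large misses more than a plain average does.

```python
import math

# Made-up actual and predicted prices, for illustration only
actual = [100000, 150000, 200000, 250000]
predicted = [110000, 140000, 200000, 280000]

errors = [p - a for a, p in zip(actual, predicted)]

# RMSE: square the errors, average them, then take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# MAE: plain average of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
```

In this toy example the single 30k miss pulls the RMSE well above the MAE, which is why the two numbers should not be read as interchangeable.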

The Training Script

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
import pickle

def create_train_test_data(dataset):
    # load and split the data
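    data_train = dataset.sample(frac=0.8, random_state=30)
    # drop by the original labels (before resetting the index) so train and test do not overlap
    data_test = dataset.drop(data_train.index).reset_index(drop=True)
    data_train = data_train.reset_index(drop=True)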

    # save the data
    data_train.to_csv('train.csv', index=False)
    data_test.to_csv('test.csv', index=False)

    print(f"Train data for modeling: {data_train.shape}")
    print(f"Test data for predictions: {data_test.shape}")

def train_model(x_train, y_train):

    print("Training the model ...")

    model = Pipeline(steps=[
        ("one-hot encoding", OneHotEncoder(handle_unknown='ignore')),
        ("linear model", LinearRegression())
    ])
    model.fit(x_train, y_train)

    return model

def accuracy(model, x_test, y_test):
    # despite the name, this returns the RMSE (lower is better), not an accuracy score
    print("Testing the model ...")
    predictions = model.predict(x_test)
    mse = mean_squared_error(y_test, predictions)
    rmse = np.sqrt(mse)
    return rmse

def export_model(model):
    # Save the model
    pkl_path = 'model.pkl'
    with open(pkl_path, 'wb') as file:
        pickle.dump(model, file)
        print(f"Model saved at {pkl_path}")

def main():
    # Load the whole data
    data = pd.read_csv('cleaned_data.csv', keep_default_na=False, index_col=0)

    # Split train/test
    # Creates train.csv and test.csv
    create_train_test_data(data)

    # Loads the data for the model training
    train = pd.read_csv('train.csv', keep_default_na=False)
    x_train = train.drop(columns=['SalePrice'])
    y_train = train['SalePrice']

    # Loads the data for the model testing
    test = pd.read_csv('test.csv', keep_default_na=False)
    x_test = test.drop(columns=['SalePrice'])
    y_test = test['SalePrice']

    # Train and Test
    model = train_model(x_train, y_train)
    rmse_test = accuracy(model, x_test, y_test)

    print(f"Average Price Test: {y_test.mean()}")
    print(f"RMSE: {rmse_test}")

    # Save the model
    export_model(model)

if __name__ == '__main__':
    main()
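
The OneHotEncoder step in the pipeline turns each distinct value of a column into its own 0/1 column, and handle_unknown='ignore' makes a value that was not seen during training encode as all zeros instead of raising an error. Here is a pure-Python sketch of that idea. It is not scikit-learn’s actual implementation, and the function names are hypothetical.

```python
def fit_categories(values):
    """Collect the distinct categories seen during training, in sorted order."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value as a 0/1 list; unseen values become all zeros."""
    return [1 if value == category else 0 for category in categories]

# Categories learned from hypothetical training values of the 'Street' column
categories = fit_categories(["Pave", "Grvl", "Pave"])

print(categories)
print(one_hot("Pave", categories))
print(one_hot("Dirt", categories))  # a value unseen at training time
```

Note that the training script applies the encoder to every column, so numeric features such as LotArea are also treated as categories.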


Computer Science or Computer Engineering for Machine Learning/AI

Renan Moura — Wed, 30 Sep 2020 22:30:06 +0000

I received a question from a reader by e-mail about which degree to pursue to get into Machine Learning/Artificial Intelligence.

This is the e-mail Jeremy sent me.

Hi I will start by saying sorry for the intrusive direct e-mail. I got your e-mail address from being one of your twitter followers and reading you great book on python. Please could you help me answer this question…? Computer science or computer engineering for a career in machine learning/Ai. I work full time in another industry ( which I dislike with a passion) so need to focus all my spare time on the right path for me. I’m in my forties now so wish to chase my dream, which I should of done years ago but due to circumstances I was unable. Kind Regards Jeremy

My personal opinion on this question:

For starters, you should focus on becoming a good programmer, not an expert, but a good one.

Programming is a skill that will make your life much easier in all the steps of the machine learning pipeline.

If you have no experience with programming, I have a free Python Guide For Beginners.

Python is the main language to work with Machine Learning today.

Then you should try to learn Machine Learning on your own with some courses online to see how you like it.

It will take you a couple of months to finish an intro course on Machine Learning and then you can work on some projects on your own.

Here is a guide with resources to learn ML online: How to Learn Machine Learning and Deep Learning: a guide for Software Engineers.

That said, if you want to work on AI and want a more formal education, the choice of CS vs. CE depends a lot on the university you are attending, since the overall curriculum changes a lot from one to another.

Pretty much all CS and CE courses have AI classes, so that is not an issue.

But to give you a final answer, I would say Computer Science because, usually, Computer Engineering has lots of classes related to electronics/hardware which are not in line with your initial focus, so go with CS.


How to Learn Machine Learning and Deep Learning: a guide for Software Engineers

Renan Moura — Tue, 28 Jan 2020 17:48:33 +0000

Introduction

The subject of Artificial Intelligence piques my interest and I’m constantly studying and trying new things in this field.

It is remarkable how the technologies related to Natural Language Processing, Computer Vision and such have emerged and evolved into solutions used by millions of users every day.

Even though people use the term "Artificial Intelligence", we are still far away from something as advanced as a Skynet from the Terminator movies.

The most common subfield of AI used today is Machine Learning, which, in turn, has Deep Learning as a subfield that has been growing steeply for quite some time now.

In this guide, I aim to describe a path for software engineers to begin understanding how Machine Learning works and how to apply it to their projects.

Yeah, you can just go to Google’s APIs or Amazon and pick some magical API to do Speech Recognition for you, but there is incredible value in knowing how it works, why it works and, even more, how to make your own API as a Service and tune it to your specific needs.

Remember, as a developer, every tool is a new power.

I’ve read, watched and gone through all these resources until the end, and even got a paid certification for some. The certification is not necessary to learn, but I find myself more engaged to finish when I have a deadline and an assessment to prove I actually learned the material.

Let’s dive into the topics.

Python

Python is the main language these days when working with Data Science, Machine Learning, and Deep Learning.

If you need a crash course on Python, here is your guide: The Python Guide for Beginners.

The Basics: Math!

Maybe you never had the chance to study some college-level math, or you did study it but you can’t remember most of the stuff because JavaScript and CSS took all the memory of those topics away.

There are 3 topics you must know beforehand, or at least have a decent grasp of to follow any good material on ML and DL: Linear Algebra, Calculus and Statistics.

If you’d like to go deep in learning the math needed for ML and DL, you can look for MIT OpenCourseWare classes like Professor Strang’s renowned Linear Algebra class.

I watched it in college in parallel with my regular class and it is very good.

But, let’s face it, most people have no time for that or the patience.

So I will give you the crash course for the 3 topics mentioned above.

Linear Algebra

Just watch the whole series Essence of Linear Algebra from the Youtube channel 3Blue1Brown.

His visual explanations make once-hard concepts incredibly easy!

It covers far less content than Professor Strang’s class, but it is enough to begin with, and you can go after other topics as you advance in ML and DL.

Calculus

Guess what?

3Blue1Brown also has a whole series on Calculus on Youtube for you to watch for free: Essence of Calculus.

Again, he is very good at giving you the intuition of why and how rather than just throwing some random equations at your face.

Statistics

This is a whole field that, in my opinion, you can learn as needed. A good reference is Practical Statistics for Data Scientists: 50 Essential Concepts.

An objective book with some good examples for every concept.

Fast to read too.

As the title implies, it is more suitable for Data Scientists, but understanding some basics of statistics is always good and that is what this book is for.

You won’t become a statistician after reading it, but you will learn some good stuff.

The Bypassed: Machine Learning

Everybody wants to jump straight into Deep Learning and be the cool guy training a single model for a week on a 12GB GPU.

But to get Deep Learning right, you need to go through Machine Learning first!

Start from the beginning

The concepts, the train of thought, the "feeling" of how things work start here and there is no one else more capable of teaching those concepts than Professor Andrew Ng in his course Machine Learning.

You may think this course is old and outdated. Technology-wise, maybe, but conceptually it is better than anything else out there.

Professor Ng makes it easy to understand the math applied in every technique he teaches and gives you a solid understanding of what happens underneath in a very short and concise course.

All the exercises are made in Octave, a free version of Matlab of sorts, and you finish the course implementing your own Neural Network!

The syntax in Octave is easy to grasp for any programmer, so don’t let that be a barrier for you.

Once you finish the course, you will have implemented all the major algorithms and will be able to solve several prediction problems.

Random Forests

I said all the major algorithms, right?

Actually, there is but one flaw in Andrew Ng’s course: he doesn’t cover Random Forests.

An awesome complement to his course is fast.ai’s Introduction to Machine Learning for Coders.

Jeremy Howard goes super practical on the missing piece in Ng’s course covering a topic that is, for many classical problems, the best solution out there.

Fast.ai’s approach is what is called Top-Down, meaning they show you how to solve the problem and then explain why it worked, which is the total opposite of what we are used to in school.

Jeremy also uses real-world tools and libraries, so you learn by coding in industry-tested solutions.

Deep Learning

Finally!

The reason why we are all here, Deep Learning!

Again, the best resource for it is Professor Ng’s course, actually, a series of courses.

The Deep Learning Specialization is composed of 5 courses in total, going from the basics to specific topics such as language, images, and time-series data.

One nice thing is that he continues from the very end of his classical Machine Learning course, so it just feels like an extension of the first course.

The math, the concepts, the notion of how and why it works, he delivers it all very concisely like few I’ve seen.

~~The only drawback is that he uses Tensorflow 1.x (Google’s DL Framework) in this course, but that’s minimal detail in my opinion since the explanations and exercises are so well delivered.~~

~~You can pick up the most recent version of the framework relatively easy and to do so there is the final piece of this guide, a book.~~

UPDATE APRIL 2021: The course was updated and now features Tensorflow 2 and some extra topics.

Too much stuff, give me something faster

This book might be the only thing you need to start: Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.

It covers a lot, from classical Machine Learning to the most recent Deep Learning topics. Good examples and exercises using industry-grade frameworks and libraries.

I dare say that, if you are really in a rush, you can skip everything I said before and just go for the book.

You will miss a good amount of information contained on the other resources mentioned, but the practical and actionable knowledge from Géron’s book is enough to work on many ideas for your next project.

If you feel limited after only reading the book, go back and study the rest of the material, it will fill in the gaps you might have and give you a more solid understanding.

What about Framework X or Y?

"Hey, I’ve heard about PyTorch and that other framework or library X everybody talks about".

As a Software Engineer, you know better than anyone how fast technology evolves.

Don’t go crazy over that. After you learn the basics in this guide, you can easily go to, for instance, the PyTorch documentation, or that of any other library or framework, and learn how to use it in a week or two.

The techniques and the concepts are all the same; it is only a matter of syntax, application, or even personal taste for any given tool.

Conclusion

To wrap it up, I want to say that, even though it might seem like a lot, I tried to remove all the noise. At the end of the process, you will feel confident that you understand what is happening behind the curtains and the jargon, and you will even be able to read some papers published in the field to keep up with the latest advances.

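TL;DR Here is the list of resources mentioned, in sequence:

The Python Guide for Beginners
Professor Strang’s Linear Algebra class (MIT OpenCourseWare)
Essence of Linear Algebra (3Blue1Brown)
Essence of Calculus (3Blue1Brown)
Practical Statistics for Data Scientists: 50 Essential Concepts
Machine Learning (Professor Andrew Ng)
Introduction to Machine Learning for Coders (fast.ai)
Deep Learning Specialization (Professor Andrew Ng)
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (Aurélien Géron)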

Watch on Youtube

You can also watch this content on Youtube.
