Machine learning with Scikit-learn
This course will explain how to use scikit-learn to do advanced machine learning. If you are aiming to work as a professional data scientist, you need to master scikit-learn!
You are expected to have some familiarity with statistics and Python programming. You don’t need to be an expert, but you should understand what a Gaussian distribution is, be able to write loops and functions in Python, and know the basics of maximum likelihood estimation. The course focuses entirely on the Python implementation, and the math behind it is omitted as much as possible.
The objective of this course is to give you a good understanding of scikit-learn, so that you can identify which technique to use for a particular problem. If you follow this course, you should be able to handle a machine learning interview quite well, although for that you will need to study the math in more detail.
We’ll start by explaining the machine learning problem, its methodology, and its terminology, and by describing the differences between AI, machine learning (ML), statistics, and data mining. Scikit-learn, being a Python library, benefits from Python’s simplicity and power. We’ll then explain how to install scikit-learn and its dependencies, and show how we can use Pandas data in scikit-learn while also benefiting from SciPy and NumPy. Finally, we’ll show how to create synthetic datasets with scikit-learn, specifically tailored for regression, classification, and clustering.
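As a small preview of the synthetic data generators, here is a minimal sketch. The sample counts, feature counts, noise level, and random seeds are arbitrary values chosen for illustration, not the ones used in the course.

```python
from sklearn.datasets import make_blobs, make_classification, make_regression

# Regression: 100 samples, 3 features, continuous target with Gaussian noise.
X_reg, y_reg = make_regression(
    n_samples=100, n_features=3, noise=5.0, random_state=0
)

# Classification: 100 samples, 4 features, binary (0/1) labels.
X_clf, y_clf = make_classification(
    n_samples=100, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Clustering: 100 samples in 2 dimensions, drawn around 3 centers.
X_blobs, blob_labels = make_blobs(
    n_samples=100, centers=3, n_features=2, random_state=0
)

print(X_reg.shape, X_clf.shape, X_blobs.shape)
```

Each generator returns a feature matrix and a target (or cluster label) array, which is the same layout every scikit-learn estimator expects.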
In essence, machine learning can be divided into two big groups: supervised and unsupervised learning. In supervised learning we have a target variable (which can be continuous or categorical) and we want to use certain features to predict it. Scikit-learn provides estimators for both classification and regression problems. We will start by discussing one of the simplest classifiers, “Naive Bayes”. We will then see some powerful regression techniques that, via a special trick called regularization, help us get much better linear estimators. We will then analyze Support Vector Machines, a powerful technique for both regression and classification, and use classification and regression trees to estimate very complex models. We will also see how to combine many simple estimators into a single model that is more robust out of sample, using so-called “ensemble” methods: in particular random forests, random trees, and boosting methods. These methods win many of today’s data science competitions.
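To give a flavor of the supervised part, here is a minimal sketch on synthetic data. It assumes GaussianNB as the Naive Bayes variant, a random forest as the ensemble, and ridge regression as the regularized linear model. These are illustrative picks, and all sizes and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# --- Classification: a simple classifier versus an ensemble ---
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the accuracy on the held-out (out-of-sample) data.
    scores[name] = model.score(X_test, y_test)
print(scores)

# --- Regression: regularization shrinks the linear coefficients ---
# Few samples relative to the number of features, so plain least squares
# tends to overfit.
X_r, y_r = make_regression(
    n_samples=40, n_features=20, noise=10.0, random_state=0
)
ols = LinearRegression().fit(X_r, y_r)
ridge = Ridge(alpha=10.0).fit(X_r, y_r)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)  # the ridge coefficient vector is the shorter one
```

Every estimator follows the same fit and score pattern, which is what makes it easy to swap one technique for another on the same problem.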
We will see how we can use all these techniques for online data, image classification, sales data, and more. We will also use real datasets from Kaggle, such as SMS spam data and house prices in the United States, to show you what to expect when working with real data.
In unsupervised learning, on the other hand, we have a set of features but no outcome or target variable, and we attempt to learn from the data itself: whether it has outliers, whether it can be split into groups, whether some of the features can be removed, and so on. For example, we will see k-means, which is the simplest algorithm for grouping observations into clusters, and we will see that sometimes there are better techniques, such as DBSCAN. We will then explain how to use principal components to reduce the dimensionality of a dataset. Finally, we will use some very powerful scikit-learn functions that learn the density of the data and are able to identify outliers.
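The unsupervised techniques mentioned above can be sketched in a few lines. The data is synthetic, the parameters (such as eps for DBSCAN) are arbitrary, and LocalOutlierFactor is only an assumed example of a density-based outlier detector, since the text does not name the specific functions used.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor

# 200 observations with 5 features and no target variable
# (the true cluster labels are discarded).
X, _ = make_blobs(
    n_samples=200, centers=3, n_features=5, cluster_std=0.5, random_state=0
)

# k-means: the number of clusters must be chosen in advance.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: finds the number of clusters itself; label -1 marks noise points.
dbscan_labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X)

# PCA: project the 5 features onto the first 2 principal components.
X_2d = PCA(n_components=2).fit_transform(X)

# Local Outlier Factor: compares the local density of each point with that
# of its neighbors; returns -1 for outliers and 1 for inliers.
outlier_flags = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

print(X_2d.shape, sorted(set(kmeans_labels.tolist())))
```

Note that none of these calls receive a target array: each one learns only from the feature matrix.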
I try to keep this course as up to date as possible, especially since scikit-learn is constantly being updated. For example, neural networks were added in the latest release. I have tried to keep the examples as simple as possible, with a small number of observations (samples) and features (variables). In real situations we will use hundreds of features and thousands of samples, and most of the methods presented here scale really well to those scenarios. I don’t want this course to focus on very realistic examples, because I think they obscure what we are trying to achieve in each example. Nevertheless, some more complex examples will be added as additional exercises.