Data Cleaning in Python

Preprocessing, structuring and normalizing data
Instructor:
Taimoor Khan
1,799 students enrolled
English [Auto]
Data cleaning, or cleansing, is a preprocessing step that makes data more consistent and of higher quality before predictive models are trained.

Data cleaning, or data cleansing, is very important for building intelligent automated systems. It is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of the data, and it is essential for building reliable machine learning models that produce good results: no matter how good the model is, its results cannot be trusted if the data feeding it is flawed. Beginners in machine learning usually start with publicly available datasets in which such issues have already been dealt with, so the data is ready for training models and yields good results. That is far from how data looks in the real world. Common problems include missing values, noise values (univariate outliers), multivariate outliers, duplicated records, features that need standardizing and normalizing, and categorical features that must be handled. Raw datasets with these issues cannot be put to good use without knowing the data cleaning and preprocessing steps, and data acquired directly from multiple online sources for building useful applications is even more exposed to such problems. Learning data cleansing skills therefore helps users carry out useful analysis on their own business data. The phrase 'garbage in, garbage out' captures the point: unless the issues in the data are sorted out, even the most efficient model will produce unreliable results.

In this course, we discuss the common problems found in data coming from different sources, and we discuss and implement how to resolve them. Each concept has three components: a theoretical explanation, a mathematical evaluation and code. Lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while lectures numbered *.2.* cover the practical code for it. In *.1.*, the first (*) refers to the section number, while the second (*) refers to the lecture number within a section. All the code is written in Python using Jupyter Notebook.

Introduction

1
Introduction

The lecture introduces the course and what we are going to cover in general.

2
Quality of Data

In this lecture, we discuss the characteristics of good quality data. This is important to know beforehand so that we can set the criteria against which we will attempt to improve the quality of our data.

3
Missing Values, Noise and Outliers

In this lecture we introduce dataset preprocessing and the kinds of issues that can be found in data. Real-world data is expected to have all of these issues. In some of the datasets available online such issues have already been taken care of, but not in all of them. It is therefore important to learn about these issues and rectify them before the data is used to train a model.

4
Examples of Anomalies

In this lecture we walk through examples of anomalies, namely missing values and noise values (univariate outliers), using a small example dataset.

5
Instructor

About the instructor.

Detecting Missing and Noise Values (Univariate Outliers)

1
2.1.1 Anomaly Detection (Median)

In this lecture we discuss anomaly detection i.e., univariate outlier with the help of median. The median gives us a range of normality that we can apply to all values of a feature.
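
As a minimal sketch of the idea, the snippet below uses the median absolute deviation to turn the median into a range of normality; the toy data and the width k are illustrative assumptions, not the course's own dataset or settings.

```python
import pandas as pd

# Toy feature with one obvious noise value (illustrative, not the course dataset)
age = pd.Series([23, 25, 27, 24, 26, 250, 22])

median = age.median()
mad = (age - median).abs().median()   # median absolute deviation
k = 3                                 # assumed width of the range of normality
lower, upper = median - k * mad, median + k * mad

print(f"Range of normality: [{lower}, {upper}]")
print(age[(age < lower) | (age > upper)])   # 250 is flagged as a noise value
```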

2
2.2.1 Implementing Detection of Missing Values

In this lecture, we implement the detection of missing values in the dataset.
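
A minimal sketch of what such a check can look like with pandas; the toy DataFrame is illustrative, while the course works with its own dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; a real dataset would be loaded with pd.read_csv
df = pd.DataFrame({"age": [23, np.nan, 27, 24],
                   "salary": [50_000, 48_000, np.nan, 52_000]})

print(df.isnull().sum())              # number of missing values per column
print(df[df.isnull().any(axis=1)])    # rows containing at least one missing value
```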

3
2.2.2 Implementing Median based Detection (Global Context)

In this lecture we use the median approach to define the range of normality for the detection of noise values.


4
2.2.3 Implementing Median based Detection (Local Context)

In this lecture, we detect noise values with the help of the median using the local context.

5
2.1.2 Anomaly Detection (Mean)

In this lecture, we discuss detecting noise values with the help of mean to determine the range of normality.

6
2.2.4 Implementing Mean based Detection of Noise values

In this lecture, we implement the detection of noise values with the help of mean considering both the local and global context.

7
2.1.3 Anomaly Detection (Z-score)

In this lecture, we discuss and evaluate the detection of a noise value or univariate outlier with the help of z-score.

8
2.2.5 Implementing Z-score based Detection

In this lecture, we implement the detection of noise values using z-score based approach.
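
A small sketch of the z-score approach, assuming the commonly used cut-off of 3; the generated data and the threshold are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 100 ordinary values around 25 plus one extreme value (illustrative data)
values = pd.Series(np.append(rng.normal(25, 2, 100), [250]))

z = (values - values.mean()) / values.std()   # z-score of each value
print(values[z.abs() > 3])                    # only the value 250 is flagged
```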

9
2.1.4 Anomaly Detection (Interquartile Range)

In this lecture, we discuss and evaluate the interquartile range for the detection of noise values.

10
2.2.6 Implementing Interquartile Range for Noise Detection

In this lecture we implement the interquartile range to define the range of normality, then use its lower and upper limits to compare all the values of the feature and identify noise values that are very much unlike the rest.
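
A compact sketch of the IQR fences, using the conventional 1.5 × IQR limits; the data is illustrative.

```python
import pandas as pd

values = pd.Series([23, 25, 27, 24, 26, 250, 22])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual 1.5 * IQR fences

print(values[(values < lower) | (values > upper)])   # 250 falls outside the fences
```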

Handling Missing and Noise Values (Univariate Outliers)

1
3.1.1 Approaches to Handle Anomalies

In this lecture, we discuss the types of approaches used for handling or processing anomalies in the data, particularly missing values and noise values (univariate outliers).

2
3.1.2 Deletion Strategy

In this lecture, we discuss the deletion strategy as an approach for removing records or features (rows or columns) that contain anomalies.

3
3.2.1 Deleting Missing Values

In this lecture, we handle the missing values with the deletion strategy i.e., to get rid of all the rows with the missing values.
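
A minimal pandas sketch of the deletion strategy; the toy data is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, np.nan, 27],
                   "salary": [50_000, 48_000, np.nan]})

cleaned = df.dropna()   # drop every row that contains a missing value
# df.dropna(axis=1) would drop columns instead, when a whole feature is too sparse to keep
print(cleaned)
```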

4
3.1.3 Global and Local Context

In this lecture, we differentiate between the global and the local context for handling an anomaly. In the global context, all the values in the dataset are considered when handling an anomaly, while in the local context (restricted to a concept) only the instances that fall into that local context are used.

5
3.1.4 Replacement Strategy

In this lecture, we discuss the replacement strategy for handling anomalies both with the global and local context.

6
3.1.5 Statistical Measures

In this lecture we discuss the use of statistical measures like median, mode and mean for handling missing and noise values.

7
3.2.2 Implementing Imputation with Mode

In this lecture, we demonstrate how and when to use the mode value for the imputation of a missing or noise value in the dataset.

8
3.2.3 Implementing Imputation with Median and Mean

In this lecture, we make use of the statistical measures mean and median for the imputation of noise values in the dataset.
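
A small sketch of replacement-based imputation with pandas, assuming the mode for a categorical column and the median for a numeric one; the columns and values are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Lahore", None, "Karachi", "Lahore"],
                   "age": [23, np.nan, 27, 24]})

df["city"] = df["city"].fillna(df["city"].mode()[0])   # mode for a categorical feature
df["age"] = df["age"].fillna(df["age"].median())       # median (or .mean()) for a numeric one
print(df)
```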

Multivariate Outliers

1
4.1.1 Multivariate Outliers

In this lecture, we discuss multivariate outliers. These are records with noise values in more than one feature, so the whole object or instance is considered an outlier. The types of multivariate outliers and their detection techniques are discussed. Because they have issues in more than one feature, such records are generally filtered out rather than having their values fixed.

2
4.1.2 Local Outlier Factor

In this lecture, we discuss the use of the Local Outlier Factor for detecting multivariate outliers.

3
4.2.1 Implementing LOF for Outlier Detection

In this lecture, we implement and demonstrate the use of Local Outlier Factor for the detection of multivariate outliers in a dataset.
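
A brief sketch of how LOF can be applied, assuming scikit-learn's LocalOutlierFactor; the points and the n_neighbors value are illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[25, 50], [26, 52], [24, 49], [25, 51], [90, 400]])  # last row is unlike the rest

lof = LocalOutlierFactor(n_neighbors=2)   # small n_neighbors for this tiny example
labels = lof.fit_predict(X)               # -1 marks a multivariate outlier, 1 marks an inlier
print(X[labels == -1])
```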

4
4.1.3 Clustering for Multivariate Outlier Detection

In this lecture, we briefly introduce clustering and a particular clustering technique, DBSCAN, that is very effective at identifying multivariate outliers. The working mechanism of the algorithm is also explained.

5
4.2.2 Implementing DBSCAN Clustering for Outlier Detection

In this lecture, we implement the DBSCAN algorithm to cluster our data; our interest, however, is in the points that the algorithm identifies as outliers and leaves out of every cluster.
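
A brief sketch, assuming scikit-learn's DBSCAN implementation; the points and the eps/min_samples settings are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.05], [8.0, 9.0]])  # last point is isolated

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # illustrative settings
print(X[db.labels_ == -1])                   # label -1 means the point was left out of every cluster
```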

6
4.1.4 Data Visualization for Outlier Detection

In this lecture, we discuss how data visualization can help in detecting multivariate outliers in the data. We also discuss how to come up with good features to use for visualization.

7
4.2.3 Implementing Data Visualization

In this lecture, we use the first two non-ID numeric columns of the dataset to plot a visualization of the data, which helps us manually inspect the outliers in it.
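
A minimal sketch of such a plot, assuming matplotlib and two illustrative numeric columns.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 25, 27, 24, 26, 90],
                   "salary": [50, 52, 55, 51, 53, 400]})

# A simple scatter of two numeric features; the isolated point stands out visually
df.plot.scatter(x="age", y="salary")
plt.show()
```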

Anomalies in Textual data

1
5.1.1 Normalizing Text Anomalies

This lecture introduces the various types of anomalies that can be found in textual data. It also demonstrates an example of what we want to achieve when textual data is passed through these normalization steps.

2
5.2.1 Lowercase, Whitespaces, Punctuations

In this lecture, we implement the removal of whitespaces and punctuations while the text is converted to lowercase.
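
One possible way to implement these three steps in plain Python; the sample string is illustrative and the lecture's own code may differ.

```python
import string

text = "  Data   Cleaning, in PYTHON!!  "

text = text.lower()                                                # lowercase
text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
text = " ".join(text.split())                                      # collapse extra whitespace
print(text)   # "data cleaning in python"
```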

3
5.2.2 Stopwords Removal

In this lecture, we implement the removal of stop words from the dataset using the ENGLISH_STOP_WORDS list.
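
A short sketch, assuming the ENGLISH_STOP_WORDS list from sklearn.feature_extraction.text; the sample sentence is illustrative.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

text = "this is a small example of the stop word removal step"
filtered = " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)
print(filtered)   # "small example stop word removal step"
```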

4
5.1.2 Regular Expressions

In this lecture, we briefly discuss regular expression basics that should be sufficient for removing unwanted domain-specific stopwords.

5
5.2.4 Implementing Regular Expressions for Filtering stopwords

In this lecture, we make use of regular expressions to identify and remove domain-specific stopwords.
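
A small sketch of the idea with Python's re module; the patterns below (URLs, mentions, hashtags, 'RT') are illustrative domain-specific tokens, not necessarily the ones used in the lecture.

```python
import re

text = "RT @user123 great results http://t.co/abc #ml"

text = re.sub(r"http\S+", "", text)   # strip URLs
text = re.sub(r"[@#]\w+", "", text)   # strip mentions and hashtags
text = re.sub(r"\bRT\b", "", text)    # strip a domain-specific token
print(" ".join(text.split()))         # "great results"
```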

6
5.2.3 Stemming and Lemmatization

In this lecture, we implement the stemming and lemmatization of words so that the number of features can be reduced while words having similar meanings are grouped together to be represented as a single feature.
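
A brief sketch with NLTK's PorterStemmer and WordNetLemmatizer; the word list is illustrative, and the lemmatizer needs the WordNet corpus downloaded.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet corpus required by the lemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
words = ["studies", "studying", "better", "cars"]

print([stemmer.stem(w) for w in words])           # ['studi', 'studi', 'better', 'car']
print([lemmatizer.lemmatize(w) for w in words])   # ['study', 'studying', 'better', 'car']
```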

7
Parts-of-speech (POS) Tagging

In this lecture we use the NLTK library to find the part-of-speech labels for the words in a sentence. POS tags have many other purposes as well; they are much more relevant in dialog and question-answering based systems. However, they are also frequently used for filtering out unwanted parts of speech. For example, in sentiment analysis, the word types that are not expected to hold user opinions are filtered out by their part of speech rather than by checking each word individually.
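
A minimal NLTK sketch; the sentence is illustrative, and the required tagger model is downloaded first (the resource name can vary slightly across NLTK versions).

```python
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model used by pos_tag

tokens = nltk.word_tokenize("The model cleans noisy data quickly")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('model', 'NN'), ('cleans', 'VBZ'), ('noisy', 'JJ'), ('data', 'NNS'), ('quickly', 'RB')]
```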

8
5.2.6 Text Segmentation and Tokenization

In this lecture, we implement the separation of text into segments, i.e., sentences, and then of segments into words or tokens. For this purpose, we make use of the NLTK library, which is far more effective at identifying tokens and segments than simply splitting text on periods (.) and spaces.
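
A short sketch with NLTK's sent_tokenize and word_tokenize; the sample text is illustrative and shows why naive splitting on periods fails.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)   # Punkt sentence models (resource name may vary by NLTK version)

text = "Dr. Smith cleaned the data. It took 2.5 hours."
print(sent_tokenize(text))   # two sentences, despite the periods in 'Dr.' and '2.5'
print(word_tokenize(text))
```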

Structuring Textual Documents

1
6.1.1 Structuring Textual Data

The lecture introduces the need for structuring textual documents and converting raw text into numerical values that can be easily consumed by machine learning techniques.

2
6.1.2 Bag-of-Words (BoW) Approach

In this lecture, we discuss the bag-of-words (BoW) approach for converting textual data into vectors. It is called the BoW approach because it loses the position-related information about words and retains only their frequency-related information.

3
6.1.3 Binary and TF-IDF Representation

In this lecture, we discuss the representation schemes for converting textual data into a structured format: words or tokens become features, and each document is represented by numbers for those features.

4
6.2.1 Implementing One Document Corpus Representation

In this lecture, we take a single document and use a count vectorizer to convert it into a structured format. With such a small dataset, students can see how the words become columns and how the document holds a numeric value for each word (feature) showing its frequency of occurrence in the document.
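
A minimal sketch, assuming scikit-learn's CountVectorizer; the one-document corpus is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["data cleaning makes data more reliable"]   # a one-document corpus

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())   # each word becomes a column (older sklearn: get_feature_names)
print(X.toarray())                          # [[1 2 1 1 1]] -- 'data' occurs twice
```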

5
6.2.2 Implementing Multi-doc Corpus Representation

In this lecture, we extend our study to include more documents in the corpus and see how they can be converted into a structured format. The idea of starting with such small datasets is to let students cross-check the numbers and verify them.

6
6.2.3 Tuning Parameters to Improve Representation

In this lecture, we tune the vectorizer parameters to find suitable settings, and a suitable number of features, for training a good model.

7
6.2.4 Implementing TF-IDF Representation Scheme

In this lecture, we implement the TF-IDF based text representation scheme for structuring textual data. TF-IDF is the most commonly used text representation approach.
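
A brief sketch, assuming scikit-learn's TfidfVectorizer; the three-document corpus is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["data cleaning improves data quality",
          "models need clean data",
          "outliers harm model quality"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))   # rarer words get higher weights than words shared across documents
```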

8
6.2.5 Implementing Dummy Dataset Representation

In this lecture, the process of vectorizing textual documents is applied to a dummy dataset with a few documents.

9
6.2.6 Implementing UCI Repository Dataset Representation

In this lecture, the process of vectorizing textual documents is applied to a dataset from the UCI Repository.

Feature Scaling (Normalization)

1
7.1.1 Why Feature Scaling

In this lecture, we discuss the need for feature scaling.

2
7.1.2 Feature Normalization (Min Max Scaler)

In this lecture, we discuss the normalization scheme for feature scaling and how it is affected by outliers.

3
7.2.1 Implementing Feature Normalization

In this lecture, we demonstrate the implementation of feature scaling using the normalization strategy. It makes use of the MinMaxScaler class from sklearn.preprocessing.
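
A minimal sketch with MinMaxScaler; the single-feature array is illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[20.0], [30.0], [40.0], [400.0]])   # one numeric feature with one extreme value

scaler = MinMaxScaler()                  # rescales each feature to the [0, 1] range
print(scaler.fit_transform(X).round(3))
# the extreme value (400) squeezes the ordinary values towards 0, motivating the robust scaler later
```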

4
7.1.3 Feature Standardization (Standard Scaler)

In this lecture, we discuss how to use standardization scheme for scaling the values of a feature.

5
7.2.2 Implementing Feature Standardization

In this lecture, we implement the conversion of feature values into their standardized values using StandardScaler from sklearn.preprocessing.

6
7.1.4 Robust Feature Scaler

In this lecture, we discuss another feature scaling scheme that makes use of the interquartile range and is therefore not affected by outliers.

7
7.2.3 Implementation of Robust Scaler

This lecture demonstrates the implementation of the RobustScaler using sklearn.preprocessing.
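
A short sketch contrasting RobustScaler with StandardScaler on the same illustrative feature.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[20.0], [30.0], [40.0], [400.0]])

print(StandardScaler().fit_transform(X).round(2))   # mean and std are pulled by the extreme value
print(RobustScaler().fit_transform(X).round(2))     # median and IQR are not distorted by it
```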

Data Acquisition

1
Data Acquisition from Webpages

In this lecture, we implement the acquisition of useful data from a webpage into Python variables. At times it is desirable to prepare our own dataset, as the publicly available datasets may not cover the kind of analysis a user wants to perform. Data acquisition may therefore be a task that comes before preprocessing. Using the text-based preprocessing discussed in the earlier modules, the acquired text can be cleaned of markup and other unwanted sequences.
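
A minimal sketch of such acquisition, assuming the commonly used requests and BeautifulSoup libraries and a placeholder URL; the lecture's own approach may differ.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration; the lecture works with its own example page
url = "https://example.com"
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]   # plain text, free of HTML tags
print(paragraphs[:3])
```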

4.1 out of 5
20 Ratings

Detailed Rating

5 stars: 9
4 stars: 8
3 stars: 1
2 stars: 3
1 star: 0
30-Day Money-Back Guarantee

Includes

4 hours on-demand video
Full lifetime access
Access on mobile and TV
Certificate of Completion