Data cleaning, or data cleansing, is very important for building intelligent automated systems. Data cleansing is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of the data. It is essential for building reliable machine learning models that produce good results; otherwise, no matter how good the model is, its results cannot be trusted. Beginners in machine learning usually start by working with publicly available datasets that have already been checked for such issues and are therefore ready to be used for training models and getting good results. But that is far from how data looks in the real world. Common problems with data include missing values, noise values (univariate outliers), multivariate outliers and data duplication; related preprocessing tasks include improving the quality of the data by standardizing and normalizing it and dealing with categorical features. Raw datasets that suffer from such issues cannot be put to good use without knowing the data cleaning and preprocessing steps. Data acquired directly from online sources for building useful applications is even more exposed to these problems. Therefore, learning data cleansing skills helps users perform useful analysis of their business data. The term 'garbage in, garbage out' refers to the fact that without sorting out the issues in the data, no matter how efficient the model is, the results will be unreliable.
In this course, we discuss the common problems with data coming from different sources. We also discuss and implement how to resolve these issues effectively. Each concept has three components: a theoretical explanation, a mathematical evaluation and code. The lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while the lectures numbered *.2.* cover the practical code for each concept. In *.1.*, the first (*) refers to the section number, while the second (*) refers to the lecture number within a section. All the code is written in Python using Jupyter Notebook.
Introduction
The lecture introduces the course and what we are going to cover in general.
In this lecture, we discuss the characteristics of good quality data. This is very important to know beforehand, in order to set the criteria against which we will attempt to improve the quality of our data.
In this lecture, we introduce dataset preprocessing and the kinds of issues that can be found in data. Real-world data is expected to have all these issues. In some of the datasets that are available online, such issues have already been taken care of, but not in all of them. Therefore, it is important to learn about these issues and rectify them before using the data to train a model.
In this lecture, we discuss examples of anomalies, namely missing values and noise values (i.e., univariate outliers), using a small example dataset.
About instructor
Detecting Missing and Noise Values (Univariate Outliers)
In this lecture, we discuss anomaly detection, i.e., univariate outlier detection, with the help of the median. The median gives us a range of normality against which we can compare all the values of a feature.
In this lecture, we implement the detection of missing values in the dataset.
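A minimal sketch of this step, assuming a hypothetical CSV file data.csv loaded into a pandas DataFrame (the lecture's actual dataset may differ):

    import pandas as pd

    df = pd.read_csv("data.csv")   # hypothetical dataset file

    # Count the missing values in each column.
    print(df.isnull().sum())

    # Show the rows that contain at least one missing value.
    print(df[df.isnull().any(axis=1)])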
In this lecture we use the median approach to define the range of normality for the detection of noise values.
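One way to sketch a median-based range of normality in code; the exact rule used in the lecture may differ. Here we assume a hypothetical numeric column age and use the median absolute deviation (MAD) with a multiplier of 3:

    import pandas as pd

    df = pd.read_csv("data.csv")           # hypothetical dataset file
    x = df["age"]                          # hypothetical numeric feature

    # Define a range of normality around the median using the MAD;
    # the multiplier of 3 is an assumption.
    median = x.median()
    mad = (x - median).abs().median()
    lower, upper = median - 3 * mad, median + 3 * mad

    # Values outside the range are flagged as noise (univariate outliers).
    print(x[(x < lower) | (x > upper)])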
In this lecture, we detect missing values with the help of median using the local context.
In this lecture, we discuss detecting noise values with the help of the mean to determine the range of normality.
In this lecture, we implement the detection of noise values with the help of the mean, considering both the local and the global context.
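A rough sketch of the mean-based range of normality in both contexts, assuming hypothetical columns age (numeric) and gender (categorical) and a factor of 2 standard deviations:

    import pandas as pd

    df = pd.read_csv("data.csv")    # hypothetical dataset file

    # Global context: range of normality from the mean and standard
    # deviation of the whole column (the factor of 2 is an assumption).
    mean, std = df["age"].mean(), df["age"].std()
    global_noise = df[(df["age"] < mean - 2 * std) | (df["age"] > mean + 2 * std)]

    # Local context: the same rule computed separately within each group
    # of a hypothetical categorical feature such as "gender".
    m = df.groupby("gender")["age"].transform("mean")
    s = df.groupby("gender")["age"].transform("std")
    local_noise = df[(df["age"] < m - 2 * s) | (df["age"] > m + 2 * s)]

    print(global_noise)
    print(local_noise)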
In this lecture, we discuss and evaluate the detection of a noise value, or univariate outlier, with the help of the z-score.
In this lecture, we implement the detection of noise values using the z-score based approach.
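A minimal z-score sketch, assuming a hypothetical numeric column age and the common threshold of 3:

    import pandas as pd

    df = pd.read_csv("data.csv")           # hypothetical dataset file
    x = df["age"]                          # hypothetical numeric feature

    # z-score: how many standard deviations a value lies from the mean.
    z = (x - x.mean()) / x.std()

    # Values with |z| above the threshold are treated as noise values.
    print(x[z.abs() > 3])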
In this lecture, we discuss and evaluate the interquartile range for the detection of noise values.
In this lecture, we implement the interquartile range to define the range of normality. We then use its lower and upper limits to compare all the values of the feature and identify noise values that are very unlike the rest of the values.
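A brief sketch of the interquartile-range rule, assuming a hypothetical numeric column age and the usual 1.5 x IQR fences:

    import pandas as pd

    df = pd.read_csv("data.csv")           # hypothetical dataset file
    x = df["age"]                          # hypothetical numeric feature

    # Interquartile range and the usual 1.5 * IQR fences.
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Values outside the fences are treated as noise values.
    print(x[(x < lower) | (x > upper)])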
Handling Missing and Noise Values (Univariate Outliers)
In this lecture, we discuss the types of approaches that are used for handling or processing anomalies in the data particularly missing values and noise values i.e., univariate outliers.
In this lecture, we discuss the deletion strategy as an approach for processing records or features, i.e., rows or columns, that have anomalies.
In this lecture, we handle the missing values with the deletion strategy, i.e., we get rid of all the rows with missing values.
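A small sketch of the deletion strategy with pandas, assuming a hypothetical data.csv; the 70% threshold for dropping columns is an illustrative assumption:

    import pandas as pd

    df = pd.read_csv("data.csv")             # hypothetical dataset file

    # Drop every row that has at least one missing value.
    df_rows = df.dropna()

    # Alternatively, drop columns that have too many missing values
    # (here: keep only columns with at least 70% non-missing entries).
    df_cols = df.dropna(axis=1, thresh=int(0.7 * len(df)))

    print(df.shape, df_rows.shape, df_cols.shape)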
In this lecture, we differentiate between the global and the local context for handling an anomaly. In the global context, all the values of the dataset are considered when handling an anomaly, while in the local context, i.e., when restricted to a concept, only the instances that fall into that local context are used for handling the anomaly.
In this lecture, we discuss the replacement strategy for handling anomalies both with the global and local context.
In this lecture we discuss the use of statistical measures like median, mode and mean for handling missing and noise values.
In this lecture, we demonstrate how and when to use the mode value for the imputation of a missing or noise value in the dataset.
In this lecture, we make use of the statistical measures mean and median for the imputation of noise values in the dataset.
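A short sketch of mode, median and mean imputation, assuming hypothetical columns gender, age and salary and IQR-based fences for marking noise values:

    import pandas as pd

    df = pd.read_csv("data.csv")    # hypothetical dataset file

    # Mode imputation suits categorical features; mean/median suit numeric ones.
    df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
    df["age"] = df["age"].fillna(df["age"].median())

    # Noise values can be handled the same way: detect them first (here with
    # the IQR fences) and then replace them with a central value.
    q1, q3 = df["salary"].quantile(0.25), df["salary"].quantile(0.75)
    iqr = q3 - q1
    noisy = (df["salary"] < q1 - 1.5 * iqr) | (df["salary"] > q3 + 1.5 * iqr)
    df.loc[noisy, "salary"] = df.loc[~noisy, "salary"].mean()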
Multivariate Outliers
In this lecture, we discuss multivariate outliers. These are outliers that have noise values in more than one feature of a record, so the whole object or instance is considered an outlier. The types of multivariate outliers and their detection techniques are discussed. They are generally filtered out rather than repaired, since they have issues with more than one feature.
In this lecture, we discuss the use of the Local Outlier Factor for detecting multivariate outliers.
In this lecture, we implement and demonstrate the use of Local Outlier Factor for the detection of multivariate outliers in a dataset.
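A minimal Local Outlier Factor sketch with scikit-learn, assuming hypothetical numeric columns age and salary; n_neighbors=20 is simply the library default written out explicitly:

    import pandas as pd
    from sklearn.neighbors import LocalOutlierFactor

    df = pd.read_csv("data.csv")                # hypothetical dataset file
    X = df[["age", "salary"]].dropna()          # hypothetical numeric features

    # LOF compares the local density of each point with that of its
    # neighbours; fit_predict returns -1 for outliers and 1 for inliers.
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)

    print(X[labels == -1])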
In this lecture, we briefly introduce clustering and a particular clustering technique that is very effective at identifying multivariate outliers, namely DBSCAN. The working mechanism of the algorithm is also explained.
In this lecture, we implement the DBSCAN algorithm for clustering our data; however, our interest is in the points that the algorithm identifies as outliers and keeps out of the clusters.
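A brief DBSCAN sketch with scikit-learn, assuming the same hypothetical columns; eps and min_samples are assumptions that need tuning for a real dataset:

    import pandas as pd
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("data.csv")                # hypothetical dataset file
    X = StandardScaler().fit_transform(df[["age", "salary"]].dropna())

    # DBSCAN labels points that belong to no dense region as -1 (noise).
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

    print("multivariate outliers:", (labels == -1).sum())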
In this lecture, we discuss how data visualization can be helpful for detecting multivariate outliers in the data. We also discuss how to come up with good features that can be used for visualization purposes.
In this lecture, we use the first two non-ID numeric columns of the dataset to plot a visualization of the data. This helps us manually inspect the outliers in the data.
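A minimal plotting sketch with matplotlib, assuming the two chosen columns are named age and salary:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data.csv")     # hypothetical dataset file

    # Scatter plot of the first two non-ID numeric columns (names assumed).
    plt.scatter(df["age"], df["salary"], s=10)
    plt.xlabel("age")
    plt.ylabel("salary")
    plt.title("Manual inspection for multivariate outliers")
    plt.show()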
Anomalies in Textual data
This lecture introduces the various types of anomalies that can be found in textual data. It also demonstrates an example of what we want to achieve when textual data is passed through these normalization steps.
In this lecture, we implement the removal of whitespace and punctuation while the text is converted to lowercase.
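A small sketch of these three normalization steps on a toy string:

    import string

    text = "  An Example Sentence,   with Punctuation!  "   # toy example

    # Lowercase, strip punctuation, and collapse repeated whitespace.
    clean = text.lower().translate(str.maketrans("", "", string.punctuation))
    clean = " ".join(clean.split())

    print(clean)    # an example sentence with punctuation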
In this lecture, we implement the removal of stop words from the dataset using the ENGLISH_STOP_WORDS list.
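A minimal sketch using scikit-learn's built-in ENGLISH_STOP_WORDS list on a toy sentence:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    text = "this is a simple example of removing the stop words"

    # Keep only the tokens that do not appear in the built-in stop word list.
    tokens = [w for w in text.split() if w not in ENGLISH_STOP_WORDS]

    print(" ".join(tokens))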
In this lecture, we briefly discuss regular expression basics that should be sufficient for removing unwanted domain-specific stopwords.
In this lecture, we make use of regular expressions for identifying and removing domain-specific stopwords.
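A short sketch of regex-based removal; the patterns below (retweet markers, mentions, URLs, hashtags) are illustrative assumptions about what counts as a domain-specific stopword:

    import re

    text = "RT @user123 Great product!!! http://example.com #review"

    # Remove retweet markers, user mentions, URLs and hashtags.
    for pattern in [r"\bRT\b", r"@\w+", r"http\S+", r"#\w+"]:
        text = re.sub(pattern, "", text)

    print(" ".join(text.split()))    # Great product!!!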
In this lecture, we implement the stemming and lemmatization of words so that the number of features can be reduced while words having similar meanings are grouped together to be represented as a single feature.
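A minimal stemming and lemmatization sketch with NLTK; the word list is a toy example and the lecture may use different words or stemmers:

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)     # lemmatizer dictionary
    nltk.download("omw-1.4", quiet=True)

    words = ["studies", "studying", "cars"]
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

    # Stemming chops off suffixes; lemmatization maps words to dictionary forms.
    print([stemmer.stem(w) for w in words])
    print([lemmatizer.lemmatize(w) for w in words])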
In this lecture, we use the NLTK library to find the part-of-speech (POS) tags for the words in a sentence. POS tags have many other purposes as well; they are much more relevant in dialogue and question-answering systems. However, they are also frequently used for filtering out unwanted parts of speech. For example, in sentiment analysis, word types that are not expected to hold user opinions are filtered out by their part of speech rather than by looking at each word individually.
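A brief POS-tagging sketch with NLTK on a toy sentence:

    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("The camera takes amazing pictures")

    # Each token is paired with a part-of-speech tag (adjectives are JJ, for
    # example); such tags can then be used to keep or drop certain word types.
    print(nltk.pos_tag(tokens))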
In this lecture, we implement the separation of text into segments, i.e., sentences, and then segments into words or tokens. For this purpose, we make use of the NLTK library, which is much more effective at identifying tokens and segments than splitting text on periods (.) and spaces.
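A small sketch of sentence and word tokenization with NLTK on a toy string containing an abbreviation:

    import nltk

    nltk.download("punkt", quiet=True)

    text = "Mr. Smith arrived late. He missed the meeting."

    # Naively splitting on periods would break "Mr."; NLTK handles such cases.
    sentences = nltk.sent_tokenize(text)
    tokens = [nltk.word_tokenize(s) for s in sentences]

    print(sentences)
    print(tokens)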
Structuring Textual Documents
The lecture introduces the need for structuring textual documents and converting raw text into numerical values that can be easily consumed by machine learning techniques.
In this lecture, we discuss the bag-of-words (BoW) approach for converting textual data into vectors. It is called the BoW approach because it loses the position-related information about words and retains only their frequency-related information.
In this lecture, we discuss representation schemes for converting textual data into a structured format by turning words or tokens into features and representing them with numbers for each document.
In this lecture, we take a single document and use CountVectorizer to convert it into a structured format. With such a small dataset, students can see how the words become columns and how the document holds a numeric value for each word, i.e., feature, showing its frequency of occurrence in the document.
In this lecture, we extend our study to include more documents in the corpus and see how they can be converted into a structured format. The idea of starting with these small datasets is for students to cross-check and verify the numbers.
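A minimal CountVectorizer sketch on a toy two-document corpus:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat"]

    # Every distinct word becomes a column; every document becomes a row of counts.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    print(X.toarray())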
In this lecture, we tune the vectorizer parameters, such as the number of features to keep, to find a suitable representation for training a good model.
In this lecture, we implement the TF-IDF based text representation scheme for structuring textual data. TF-IDF is the most commonly used text representation approach.
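A minimal TfidfVectorizer sketch on a toy three-document corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "dogs and cats make good pets"]

    # Words that occur in many documents get lower weights; words that are
    # distinctive for a particular document get higher weights.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))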
In this lecture, the process of vectorizing the textual documents is applied to a dummy dataset with a few documents.
In this lecture, the process of vectorizing the textual documents is applied to a dataset from the UCI Repository.
Feature Scaling (Normalization)
In this lecture, we discuss the need for feature scaling.
In this lecture, we discuss the normalization scheme for feature scaling and how it is affected by outliers.
In this lecture, we demonstrate the implementation of feature scaling using the normalization strategy. It makes use of the MinMaxScaler class from sklearn.preprocessing.
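A tiny sketch of normalization with MinMaxScaler on a toy feature containing an outlier:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[1.0], [5.0], [10.0], [100.0]])   # toy feature with an outlier

    # Rescales the feature to [0, 1]; note how the outlier squeezes the
    # remaining values towards zero.
    print(MinMaxScaler().fit_transform(X))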
In this lecture, we discuss how to use the standardization scheme for scaling the values of a feature.
In this lecture, we implement the conversion of feature values into their standardized values using StandardScaler from sklearn.preprocessing.
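A tiny standardization sketch with StandardScaler on the same toy feature:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0], [5.0], [10.0], [100.0]])   # same toy feature

    # Subtracts the mean and divides by the standard deviation, giving a
    # feature with zero mean and unit variance.
    print(StandardScaler().fit_transform(X))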
In this lecture, we discuss another feature scaling scheme that makes use of the interquartile range and is therefore not affected by outliers.
This lecture demonstrates the implementation of the RobustScaler using sklearn.preprocessing.
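A tiny RobustScaler sketch on the same toy feature, for comparison with the two previous scalers:

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0], [5.0], [10.0], [100.0]])   # same toy feature

    # Centres on the median and scales by the interquartile range, so the
    # outlier has far less influence than with min-max or standard scaling.
    print(RobustScaler().fit_transform(X))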
Data Acquisition
In this lecture, we implement the acquisition of useful data from a webpage into Python variables. At times it is desirable to prepare our own dataset, as publicly available datasets may not cover the kind of analysis that a user wants to perform. Therefore, data acquisition may also be a task to complete before preprocessing. Using the text-based preprocessing discussed in the earlier module, the acquired textual data can be cleaned of code and other unwanted sequences.
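A minimal acquisition sketch using the requests and BeautifulSoup libraries; the URL and the choice of paragraph tags are assumptions, and the lecture's actual source and parsing approach may differ:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/articles"      # hypothetical page to scrape
    html = requests.get(url, timeout=10).text

    # Parse the HTML and keep only the visible text of paragraph tags.
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

    print(paragraphs[:5])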