Data scientists spend only 20 percent of their time building machine learning algorithms and the other 80 percent finding, cleaning, and reorganizing huge amounts of data. That is largely because many rely on graphical tools such as Excel to process their data. If you use a programming language such as Python instead, you can drastically reduce the time it takes to process your data and get it ready for use in your project. This course shows how Python can be used to manage, clean, and organize huge amounts of data.
The course assumes basic knowledge of variables, functions, for loops, and conditionals. You will be given access to a million records of raw historical weather data, and you will use Python at every step of working with that dataset: batch downloading and extracting the data files, loading thousands of files with pandas, cleaning the data, concatenating and joining data from different sources, converting between fields, aggregating, filtering on conditions, and many other data processing operations. On top of that, you will learn how to calculate statistics and visualize the final data. The course also includes a series of exercises in which you are given sample data and practice what you have learned by cleaning and reorganizing it with Python.
Getting Started
You will learn how to install Python through the Anaconda distribution, which installs not only Python itself but also the other libraries needed for data analysis and visualization, such as pandas, matplotlib, numpy, and scipy.
You will learn how to use the Spyder environment to write Python scripts, and how to use IPython, an enhanced interactive shell for typing and executing Python code that is tailored to data analysis applications.
Downloading Many Files with Python
Short lecture introducing you to this section of the course.
You will learn how to write Python code that establishes a connection to an FTP server and accesses the files of the FTP site.
You will learn how to use the Spyder editor for executing complete scripts of Python code.
You will learn how to create a custom FTP function that logs in to an FTP site and generates a list of file names contained in the site.
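As a rough illustration, here is a minimal sketch of such a function using Python's standard ftplib module; the host name, credentials, and directory are placeholders rather than the course's actual FTP site.

    from ftplib import FTP

    def list_ftp_files(host, directory, user="anonymous", password=""):
        """Log in to an FTP site and return the names of the files it contains."""
        ftp = FTP(host)
        ftp.login(user=user, passwd=password)
        ftp.cwd(directory)
        filenames = ftp.nlst()   # list entries in the current remote directory
        ftp.quit()
        return filenames

    # Placeholder host and path:
    # print(list_ftp_files("ftp.example.com", "/pub/data"))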
You will learn the Python code that downloads a single file from an FTP site.
Something to keep in mind for the next lecture.
Here we start building our data analysis program.
In this lecture, we will build an FTP function that logs in to the FTP site and downloads a given range of files from it.
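A hedged sketch of what such a function could look like, again with ftplib; the host, remote directory, and destination folder below are placeholder values.

    import os
    from ftplib import FTP

    def download_ftp_range(host, directory, dest_folder, start, end):
        """Log in to an FTP site and download the files at positions start..end-1."""
        ftp = FTP(host)
        ftp.login()                      # anonymous login
        ftp.cwd(directory)
        os.makedirs(dest_folder, exist_ok=True)
        for name in ftp.nlst()[start:end]:
            local_path = os.path.join(dest_folder, name)
            with open(local_path, "wb") as fh:
                ftp.retrbinary("RETR " + name, fh.write)   # binary download
        ftp.quit()

    # download_ftp_range("ftp.example.com", "/pub/data", "downloads", 0, 10)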
Extracting Data from Archive Files
You will learn how to extract various types of archive files using the patool library and a for loop.
You will learn how to extract RAR archive files.
Here you will write a function that fetches the archive files downloaded by the FTP function and extracts them all into a local directory.
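A minimal sketch of that kind of function, assuming the archives were saved to a local "downloads" folder; the folder names are illustrative only.

    import glob
    import os
    import patoolib

    def extract_all(archive_folder="downloads", out_folder="extracted"):
        """Extract every archive found in archive_folder into out_folder."""
        os.makedirs(out_folder, exist_ok=True)
        for archive in glob.glob(os.path.join(archive_folder, "*")):
            patoolib.extract_archive(archive, outdir=out_folder)

    # extract_all()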
Working with TXT and CSV Files
Short lecture introducing you to this section of the course.
You will learn how to easily read CSV and delimited TXT files using the pandas library and use their data inside Python.
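For instance, a small sketch of the pandas readers involved; the file names and the tab delimiter are assumptions for illustration.

    import pandas as pd

    df_csv = pd.read_csv("weather.csv")              # comma-separated file
    df_txt = pd.read_csv("weather.txt", sep="\t")    # tab-delimited TXT file
    print(df_csv.head())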
You will learn how to export data from Python to CSV and TXT files.
You will learn how to open data from TXT files whose columns have fixed widths.
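A brief sketch using pandas' fixed-width reader; the file name, column widths, and column names are made up for illustration.

    import pandas as pd

    # Each column occupies a fixed number of characters, e.g. 10, 8, and 6.
    df = pd.read_fwf("stations.txt", widths=[10, 8, 6],
                     names=["station", "date", "temp"])
    print(df.head())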
You will learn how to quickly export a pandas dataframe into an HTML file.
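A short sketch of the export side, covering CSV, delimited TXT, and HTML output; the output file names are placeholders.

    import pandas as pd

    df = pd.DataFrame({"station": ["A", "B"], "temp": [21.5, 19.2]})
    df.to_csv("output.csv", index=False)              # CSV export
    df.to_csv("output.txt", sep="\t", index=False)    # tab-delimited TXT export
    df.to_html("output.html")                         # quick HTML table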
Getting Started with Pandas
We already used the pandas library in the previous section. Here you will be given an official tour of the pandas data analysis library.
You will create a function that grabs all the TXT files in a folder, opens each one as a dataframe, adds a column to each dataframe, and exports the updated dataframes back to CSV files.
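A hedged sketch of such a function; the folder names, the tab delimiter, and the added column are assumptions for illustration.

    import glob
    import os
    import pandas as pd

    def add_column_to_txts(folder, out_folder="csv_out"):
        """Open every TXT file in folder as a dataframe, add a column, export to CSV."""
        os.makedirs(out_folder, exist_ok=True)
        for path in glob.glob(os.path.join(folder, "*.txt")):
            df = pd.read_csv(path, sep="\t")
            df["source_file"] = os.path.basename(path)   # the added column
            out_name = os.path.splitext(os.path.basename(path))[0] + ".csv"
            df.to_csv(os.path.join(out_folder, out_name), index=False)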
Merging Data
You will write a function that gets all the CSV files and concatenates them vertically with the pandas concat function, producing a single CSV file containing everything.
You will write a function that joins columns from one pandas dataframe onto another.
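A minimal sketch of vertical concatenation with pandas; the folder and file names are placeholders.

    import glob
    import pandas as pd

    frames = [pd.read_csv(path) for path in glob.glob("csv_out/*.csv")]
    combined = pd.concat(frames, ignore_index=True)   # stack the rows vertically
    combined.to_csv("all_data.csv", index=False)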
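A small sketch of that idea; the example dataframes and the "station" key column are made up for illustration.

    import pandas as pd

    measurements = pd.DataFrame({"station": ["A", "B"], "temp": [21.5, 19.2]})
    locations = pd.DataFrame({"station": ["A", "B"], "lat": [40.7, 34.1]})
    joined = measurements.merge(locations, on="station", how="left")
    print(joined)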
Data Aggregation
You will learn how to use pandas pivoting by creating a pivoted dataframe from a large CSV file, aggregating the data values in the process.
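One way to pivot while aggregating is pandas' pivot_table; the file and column names below are assumptions for illustration.

    import pandas as pd

    df = pd.read_csv("all_data.csv")
    pivoted = df.pivot_table(index="station", columns="month",
                             values="temp", aggfunc="mean")
    pivoted.to_csv("pivoted.csv")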
Visualizing Data
You will learn how to use the visualization features available in Python and generate graphs using the matplotlib and seaborn libraries.
You will expand on this by creating different kinds of visualizations from pandas dataframes and adding labels and legends to the generated graphs.
You will learn how to create a function that accesses the pivoted dataframe, generates a graph representing the data, and saves the graph to a PNG image file.
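A sketch of what that kind of function might look like; the pivoted CSV, its column layout, and the axis labels are hypothetical.

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    def plot_pivoted(csv_file="pivoted.csv", png_file="temperatures.png"):
        """Plot a pivoted dataframe and save the graph to a PNG file."""
        pivoted = pd.read_csv(csv_file, index_col="station")
        sns.set_theme()                       # apply seaborn styling
        ax = pivoted.plot(kind="bar")         # one bar group per station
        ax.set_xlabel("Station")
        ax.set_ylabel("Mean temperature")
        ax.legend(title="Month")
        plt.tight_layout()
        plt.savefig(png_file, dpi=150)

    # plot_pivoted()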
Mapping Spatial Data
You will learn how to create a point KML file using the simplekml library and display the file in Google Earth.
You will create a function that grabs the data from a pandas dataframe and creates a KML file using the latitude and the longitude information contained in the dataframe.
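A hedged sketch using simplekml; the "station", "lat", and "lon" column names are assumptions about the dataframe's layout.

    import pandas as pd
    import simplekml

    def dataframe_to_kml(df, out_file="stations.kml"):
        """Create a point KML file from a dataframe with latitude/longitude columns."""
        kml = simplekml.Kml()
        for _, row in df.iterrows():
            kml.newpoint(name=str(row["station"]),
                         coords=[(row["lon"], row["lat"])])   # (longitude, latitude)
        kml.save(out_file)

    # dataframe_to_kml(pd.read_csv("stations.csv"))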
Putting Everything Together
You will learn how to make your script interact with a user who runs it.
You will learn how to execute all the functions of the program with a single click.
You will learn how to make your program more user-friendly by integrating user input.
You will learn how to convert your program into a Python module so you can import it in other scripts.
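A minimal sketch of the module pattern; the file name, function, and prompts are illustrative only.

    # Saved as, e.g., weather_tools.py, this can be imported from other scripts
    # with "import weather_tools" while still being runnable on its own.

    def run_pipeline(start, end):
        """Placeholder for the full download-extract-clean-visualize workflow."""
        print(f"Processing files {start} to {end}...")

    if __name__ == "__main__":
        # Runs only when the file is executed directly, not when it is imported.
        first = int(input("First file index: "))
        last = int(input("Last file index: "))
        run_pipeline(first, last)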
Bonus Section: Using Python in Jupyter Notebooks to Boost Productivity
Setting up Jupyter and learning how to use its keyboard shortcuts.
Learn how to handle the problem of joining raw data that has no key column on which to base the join.
Learn how to apply various operations, including inline visualizations, in a browser-based Jupyter notebook.