Very rarely in a data science project is data easily available as part of a package. It's more typical for the data to be in a file, in a database, or extracted from documents such as web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and tidy it using the tidyverse package. This usually involves several, often complicated, steps to convert data from its raw form to the tidy form that greatly facilitates the rest of the analysis. We refer to this process as data wrangling.

In this course, we will cover several common steps of the data wrangling process, including importing data into R from files, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all these wrangling steps necessary in a single analysis, but a data scientist will likely face them all at some point.

HarvardX has partnered with DataCamp for all assignments. This allows students to program directly in a browser-based interface. You will not need to download any special software, but an up-to-date browser is recommended.

What you'll learn:

  • Importing data into R from different file formats
  • Web scraping
  • How to tidy data using the tidyverse to better facilitate analysis
  • String processing with regular expressions (regex)
  • Wrangling data using dplyr
  • How to work with date and time formats
  • Text mining

This course is part of the HarvardX Data Science Professional Certificate program.

Meet The Faculty

Rafael Irizarry

Professor of Biostatistics, T.H. Chan School of Public Health

Rafael Irizarry is a Professor of Biostatistics at the Harvard T.H. Chan School of Public Health and a Professor of Biostatistics and Computational Biology at the Dana Farber Cancer Institute. For the past 15 years, Dr. Irizarry’s research has focused on the analysis of genomics data. During this time, he has also taught several classes, all related to applied statistics. Dr. Irizarry is one of the founders of the Bioconductor Project, an open source and open development software project for the analysis of genomic data. His publications on these topics have been highly cited, and his software implementations have been widely downloaded.
