What you'll learn

  • Importing data into R from different file formats
  • Web scraping
  • How to tidy data using the tidyverse to better facilitate analysis
  • String processing with regular expressions (regex) 
  • Wrangling data using dplyr
  • Working with dates and times, and text mining

Course description

In a data science project, data is very rarely available ready to use as part of a package. More typically, the data is in a file or a database, or must be extracted from documents such as web pages, tweets, or PDFs. In these cases, the first step is to import the data into R and tidy it using the tidyverse package. This usually involves several, often complicated, steps to convert the data from its raw form to the tidy form that greatly facilitates the rest of the analysis. We refer to this process as data wrangling.

In this course, we will cover several common steps of the data wrangling process, including importing data into R from files, tidying data, string processing, HTML parsing, working with dates and times, and text mining. Rarely are all of these wrangling steps necessary in a single analysis, but a data scientist will likely face them all at some point.

HarvardX has partnered with DataCamp for all assignments. This allows students to program directly in a browser-based interface. You will not need to download any special software, but an up-to-date browser is recommended.

This course is part of the HarvardX Data Science Professional Certificate program.


  • Rafael Irizarry
    Professor of Biostatistics, T.H. Chan School of Public Health

Associated Schools

  • Harvard T.H. Chan School of Public Health
