The members of our data science team here at Cardinal Path are frequent users of R, and tend to favor it over any other programming language out there today. Since R was designed for statistical analysis, it has a lot of really useful features, such as a vast network of package creators and maintainers and robust integration with Tableau. In this post, the Cardinal Path data science team outlines their favorite R packages of 2016.
googlesheets:
The googlesheets package is an easy way to access and manage Google Sheets from R. I like being able to connect directly to data stored in a Google Sheet: when I pass my R program to someone else, the data source (a URL) stays the same. Also, since Google Sheets can be shared with anyone, results that my programs write back to a Google Sheet update automatically, so everyone sees the most up-to-date version without my having to email new files around. I love that it works with the pipe operator (%>%), so many of its operations feel intuitive if you are familiar with dplyr.
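A minimal sketch of the workflow described above, assuming the googlesheets package is installed and authorized; the sheet title "Q4 Campaign Data" is a made-up placeholder, not a real sheet:

```r
# Sketch only: requires a Google account and network access.
# "Q4 Campaign Data" is a hypothetical sheet title.
library(googlesheets)
library(dplyr)

# The first call opens a browser window to authorize access
gs_auth()

# Register a sheet by its title, then read a worksheet into a data frame,
# chaining the steps with the pipe operator
campaign <- gs_title("Q4 Campaign Data") %>%
  gs_read(ws = 1)

head(campaign)
```

Because the sheet is addressed by title (or URL), the same script works unchanged on a colleague's machine once they have access to the sheet.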
knitr (or rmarkdown):
This is used to create R Markdown documents. An R Markdown document renders to a structured file (HTML, PDF, MS Word, and more) that keeps your comments, your R code, and your R code's output in one place. It looks great and provides a straightforward way to document your work. In addition, you can start using it with minimal changes to your current workflow: rather than commenting with #, comment with #', and a single line of code, spin("FILENAME.R"), will turn your script into an R Markdown document. I'm a sucker for documentation, especially where it requires minimal effort.
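Here is a small, self-contained sketch of that workflow; the script contents are made up for illustration, and knit = FALSE stops after generating the .Rmd so you can inspect it:

```r
# Sketch of knitr::spin(): lines starting with #' become Markdown text,
# everything else stays as code chunks.
library(knitr)

script <- tempfile(fileext = ".R")
writeLines(c(
  "#' # Monthly summary",
  "#' The chunk below computes a quick mean.",
  "mean(c(1, 2, 3))"
), script)

# knit = FALSE generates the .Rmd without rendering it further
spin(script, knit = FALSE)

# spin() writes the .Rmd next to the input script
rmd <- sub("\\.R$", ".Rmd", script)
readLines(rmd)
```

With knit = TRUE (the default), spin() goes all the way to the rendered output in one step.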
Bonus points: In addition to loving data science, I’m also an avid knitter, and knitr has fun function names like spin, knit, hook_plot_html, kable, and stitch.
lubridate:
One of my least favorite tasks on any platform is manipulating time and date information. You can end up with a variety of formats depending on where your data was exported from, and the conversions needed to make the format consistent are never pleasant. The lubridate package describes itself as having "a consistent and memorable syntax that makes working with dates easy and fun". That might be a slight overstatement, but I have found it a great way to work with time and date information. This is another package that is part of the Hadleyverse.
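A quick sketch of what that syntax looks like in practice; the date strings below are made-up examples of the inconsistent formats you might get from different exports:

```r
# lubridate's parsing helpers are named after the order of the
# components in the raw string
library(lubridate)

mdy("01/15/2016")               # month-day-year
dmy("15 Jan 2016")              # day-month-year
ymd_hms("2016-01-15 09:30:00")  # full timestamp

# Extracting and shifting components is just as direct
d <- ymd("2016-01-15")
month(d)       # 1
d + months(2)  # 2016-03-15
```

All three parse calls return the same underlying Date/POSIXct types, so once the data is in, downstream code no longer cares which format it arrived in.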
One of my greatest revelations when I started using R was that you can not only connect to a SQL database, but also run SQL statements inside of R. Sometimes the way you think about data maps better to SQL syntax than to R syntax, and this package makes it possible to use both flexibly, to match your own requirements.
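The section above doesn't name the package, so as one illustration of running SQL from inside R, here is a sketch using DBI with an in-memory RSQLite database (both chosen for this example, not necessarily what the author used):

```r
# Sketch: run a SQL query against a data frame from inside R,
# using DBI + RSQLite purely for illustration.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# Think in SQL when it fits the question better than R syntax
heavy <- dbGetQuery(con, "
  SELECT cyl, COUNT(*) AS n, AVG(mpg) AS avg_mpg
  FROM mtcars
  WHERE wt > 3
  GROUP BY cyl
")
heavy

dbDisconnect(con)
```

The result comes back as an ordinary data frame, so you can move between SQL and R syntax freely within the same script.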
readxl:
It can be tempting to carry out many of your data manipulation tasks in Excel, since it is familiar, user friendly, and usually pretty quick. However, you can improve the process by bringing the Excel data into R and using a package such as dplyr to carry out that manipulation. This approach also lets your colleagues clearly see all of the changes you have made to the data and, of equal importance, makes the work fully reproducible on their machines. In addition, with some minor tweaks, you should be able to reuse the same script for Excel workbooks that have a similar format (no one likes unnecessary, manual rework!).
There are several R packages that let you bring Excel data in, but my favorite is readxl (also part of the Hadleyverse). You can read in an Excel workbook (both xlsx and xls files are supported), which consists of a list of sheets. These sheets can be cleaned and transformed into data frames (the tabular data structure within R). Of course, it always helps if your Excel data is as tidy as possible before you read it in!
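A short sketch of that workflow, using the datasets.xlsx file that ships with readxl so the example is reproducible without any external spreadsheet:

```r
# Read an Excel workbook with readxl, then hand the result to dplyr
library(readxl)
library(dplyr)

# readxl bundles an example workbook for exactly this purpose
path <- readxl_example("datasets.xlsx")

# List the worksheets, then read one into a data frame
excel_sheets(path)
iris_tbl <- read_excel(path, sheet = "iris")

# From here, the usual dplyr verbs apply
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length))
```

Swapping in your own workbook is just a matter of changing path and the sheet name, which is what makes the same script easy to reuse for similarly formatted files.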