Sharing Your R Code


I’ve written a couple of posts in the past about the programming language R, which can be used to help predict outcomes and measure the impact of certain actions on your business goals. R is also very useful for making large tasks more manageable, repeatable, and semi-automated. In this post, I’d like to outline the best ways to share your R code. There are many reasons for writing functions and sharing your code, and this post provides a brief introduction to how you would carry out these tasks, using examples to show why.

Let’s say you’ve started writing some R scripts. For example, you want to write an R program to pull in some data and clean a variable called ‘NAMES’, capitalizing and removing any non-alphanumeric characters. How should you do this in R, and how will you share this useful function once it is written?

Write your code

First, you will need some code to carry out the task at hand. To capitalize all letters, we use the function toupper. To deal with unwanted non-alphanumeric characters, we use the function gsub with the [[:punct:]] character class, which replaces each punctuation character with a space. Here is a small sample program:

data <- read.csv("filename.csv", stringsAsFactors = FALSE)

data$NAMES <- toupper(data$NAMES)

data$NAMES <- gsub("[[:punct:]]", " ", data$NAMES)

Great! Now we have some R code to run whenever we need to clean the NAMES field in our dataset.
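To see what these two calls do without needing a CSV file, here is a minimal sketch on a made-up vector of names (the sample values and the variable names names_raw, names_clean, and names_strict are invented for illustration). It also shows a stricter variant, as one possible alternative, that deletes everything except letters, digits, and spaces instead of replacing punctuation with a space.

```r
# Hypothetical sample values, standing in for data$NAMES
names_raw <- c("o'brien, pat", "smith-jones", "lee")

# Convert to uppercase, then replace each punctuation character with a space
names_clean <- gsub("[[:punct:]]", " ", toupper(names_raw))

# Alternative (an assumption, not the approach used in this post):
# delete every character that is not a letter, a digit, or a space
names_strict <- gsub("[^[:alnum:] ]", "", toupper(names_raw))

print(names_clean)
print(names_strict)
```

Note that the first version can leave double spaces behind (a comma followed by a space becomes two spaces), so which variant fits best depends on how the cleaned names will be used.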

Comment your code

You may decide, after writing and cleaning so much code, that you deserve a tropical vacation. But that means someone else on your team will have to take over the task of cleaning names in the dataset. You pass your code off to them, and if they are new to R, they will likely have no clue what is going on.

One excellent reason to comment your R code is that when others read it, your comments explain what is happening and why. (This also means there is less chance that they will interrupt you with emails while you are lounging poolside!)

Comments follow a “#”. For example, for the code written in step 1:

# Read in the input dataset
data <- read.csv("filename.csv", stringsAsFactors = FALSE)

# Convert all letters to uppercase
data$NAMES <- toupper(data$NAMES)

# Replace each punctuation character with a space
data$NAMES <- gsub("[[:punct:]]", " ", data$NAMES)

Now, when you send your code to your wonderful colleague who is covering for you, they will understand exactly what your code is doing. This makes covering and transitioning projects go a lot more smoothly.

Functionalize your code

Let’s say you find yourself using the code you wrote above on an almost weekly basis. Every time you pull in a new dataset, you use the code above to clean the “NAMES” variable. Sometimes, though, the dataset has a different name, or the variable you are cleaning has a different name. Each time, you copy and paste the code from step 1 and change “data” to “newdata” or “NAMES” to “FIRSTLAST”.

When you find yourself using a piece of code regularly, you should write it as a user-defined function. To do this, you need to change the format to accept function inputs, and add a line of code to return the changed data. Below is what the code from step 1 would look like as a function:

# cleannames is an R function that takes an input dataset (data) and
# the name of the variable holding the "names" field, and cleans it.
# It converts all letters to uppercase (toupper) and replaces each
# punctuation character with a space (gsub), then returns the clean
# dataset.
cleannames <- function(data, variablename) {
  data[[variablename]] <- gsub("[[:punct:]]", " ",
                               toupper(data[[variablename]]))
  return(data)
}

newdata <- cleannames(newdata, "FIRSTLAST")

Now, whenever you need to clean a ‘names’ variable, you can do so using a function with whatever dataset and variable name you have available. Your code is now ready to be used and reused in contexts other than the one originally set up in step 1!
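As a rough illustration of that reuse, the sketch below applies a version of the cleannames function above to two small, made-up data frames whose name columns are called different things. The data frames (customers, staff) and their values are hypothetical, and the early check that the column exists is an added assumption rather than part of the original function.

```r
# A version of cleannames, repeated here so this block runs on its own
cleannames <- function(data, variablename) {
  # Assumption (not in the original): stop early if the column is missing
  stopifnot(variablename %in% names(data))
  data[[variablename]] <- gsub("[[:punct:]]", " ",
                               toupper(data[[variablename]]))
  return(data)
}

# Two hypothetical datasets with differently named columns
customers <- data.frame(NAMES = c("ann-marie", "j. doe"),
                        stringsAsFactors = FALSE)
staff <- data.frame(FIRSTLAST = c("o'neil"),
                    stringsAsFactors = FALSE)

# The same function handles both, with no copying and pasting
customers <- cleannames(customers, "NAMES")
staff <- cleannames(staff, "FIRSTLAST")

print(customers)
print(staff)
```

Only the arguments change between the two calls, which is the main payoff of wrapping the code in a function.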

At this point, when you send your code to your colleague, they’ll understand what your code is doing and can reuse it on their own data. This makes covering projects, transitioning projects, and generally sharing code go much more smoothly.

Start using git

Once you have some useful code, you can put it in a git repository hosted on GitHub or Bitbucket. Both are great ways to store your code for future download, keep track of versions, keep tabs on any bugs, and allow others to contribute to your code.

When someone else wants to use or contribute to your R code, they can do so through GitHub or Bitbucket. More information on using git with R can be found here.

Create an R Package

You are now frequently using your function cleannames. You have also written several other functions you use on a regular basis: cleanphonenumbers, cleanzipcodes, and cleanages. You may also find your colleagues asking if you can share your code with them, as all these handy tools you have built are so useful and will save them loads of time too. Now is the right time to start thinking about writing an R package.

An R package is a collection of related functions that can be reused many times. Some examples of R packages you might download from CRAN (the repository where R packages and their documentation are stored and accessed) include dplyr, forecast, tidyr, and so on. But you can also write your own!

This is getting into some advanced-level R coding. The most comprehensive instructions can be found here, but if you’re just getting started on writing R packages, this tutorial is a good place to start.
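To give a feel for what a function looks like once it lives in a package, here is a minimal sketch that assumes the widely used roxygen2 convention of documenting functions with #' comments placed directly above them. The choice of roxygen2 and the wording of the documentation are assumptions for illustration, not something prescribed in this post; in a package, a file like this would typically sit in the package's R/ folder.

```r
#' Clean a names column
#'
#' Converts the column to uppercase and replaces each punctuation
#' character with a space.
#'
#' @param data A data frame.
#' @param variablename Name of the column to clean, given as a string.
#' @return The data frame with the cleaned column.
#' @export
cleannames <- function(data, variablename) {
  data[[variablename]] <- gsub("[[:punct:]]", " ",
                               toupper(data[[variablename]]))
  return(data)
}
```

The #' lines are ordinary comments as far as R is concerned, so the function behaves exactly as before; tools built around roxygen2 read them to generate the package's help pages.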

Conclusion

You can use R at its most basic level to write code to complete tasks. You should always comment and provide documentation for your code. Once you get to the point where you are sharing code across your team, it’s a good idea to start functionalizing it. To facilitate easier collaboration on code, it’s very helpful to start using git. And finally, once you have a collection of similar useful functions, it’s an even better idea to create an R package.