Quick Tutorial on R for Data Science

Aruna Singh
Artificial Intelligence in Plain English
7 min read · Dec 13, 2020


If you are thinking of getting into R, this tutorial will give you a brief idea of how to begin. In it, I have tried to give a basic insight into data science using R.

Installation of R and R Studio

You can download the detailed setup for R and R Studio from “here”.

After downloading and installing the aforementioned software, you are all set to begin your programming journey with R. Now, you may open R Studio, click on File, New File, and lastly on R Script.

Installing Packages and Importing Libraries

Let’s understand packages & libraries and how they play a significant role in R programming.

A package is a collection of functions (often bundled with data and documentation) developed to perform certain tasks. So, instead of writing tens or hundreds of lines of code just to perform a simple operation such as finding a square root, a programmer directly calls a readily available function such as sqrt(), which ships with base R. Strictly speaking, a library in R is the folder where installed packages are stored, and library() is the function that loads a package from it, although in everyday use the two words are often used interchangeably. Basically, packages expand the functionality available in R.

The list of useful packages is long, but don’t fret too much about memorizing the name of every one of them. For this purpose, we have collections like “tidyverse”, which bundles many commonly used data science packages such as ggplot2, dplyr, tidyr and readr. So let’s install this package and load it into our program with the lines below:

#Install package (needed only once)
install.packages("tidyverse")
#Load core tidyverse packages
library(tidyverse) #or require(tidyverse)
search() #to check which packages are currently attached

Now that we have covered libraries and packages, let’s look at matrices: they are a way to represent tables and a stepping stone toward data frames.

Build your first matrix

For analysis, we will be using a data set of the top 10 highest-paid NBA players over the past 10 years. The “Basketball Dataset” comes from section 4 on SuperDataScience. The goal is simply to investigate the trends and patterns in the players’ performance over those 10 years.

mymatrix = matrix(data = my_data, nrow = n_rows, ncol = n_cols, byrow = FALSE) #my_data is a vector of values (not a file path); n_rows and n_cols are the numbers of rows and columns
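To see how matrix() fills its cells, here is a minimal sketch. The numbers are made up for illustration and are not taken from the Basketball Dataset, and the names salaries, by_row and by_col are my own.

```r
# Hypothetical values for illustration only (not from the Basketball Dataset)
salaries <- c(10, 12, 15,   # player A, seasons 1 to 3
              20, 22, 25)   # player B, seasons 1 to 3

# byrow = TRUE fills the matrix one row at a time
by_row <- matrix(data = salaries, nrow = 2, ncol = 3, byrow = TRUE)

# byrow = FALSE (the default) fills it one column at a time
by_col <- matrix(data = salaries, nrow = 2, ncol = 3, byrow = FALSE)

print(by_row)
print(by_col)
```

With byrow = TRUE each player ends up on a separate row, which is usually what you want when the values are typed in player by player.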

There are some functions we will be using to tweak the matrix.

Building from vectors (one-dimensional):

  • rbind(): To combine vectors as the rows of a matrix.
  • cbind(): To combine vectors as the columns of a matrix.
  • names(): To assign names to the elements of a vector.

Two-dimensional matrix:

  • rownames(): To assign row names to a two-dimensional matrix.
  • colnames(): To assign column names to a two-dimensional matrix.
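The functions listed above can be tried out on two small vectors. This is only a sketch with made-up figures; player_a, player_b, as_rows and as_cols are hypothetical names.

```r
# Hypothetical figures for illustration only
player_a <- c(10, 12, 15)
player_b <- c(20, 22, 25)

# names() labels the elements of a vector
names(player_a) <- c("2018", "2019", "2020")

# rbind() stacks the vectors as rows; cbind() places them side by side as columns
as_rows <- rbind(player_a, player_b)
as_cols <- cbind(player_a, player_b)

# rownames() and colnames() set the dimension names of a matrix
rownames(as_rows) <- c("PlayerA", "PlayerB")
colnames(as_rows) <- c("2018", "2019", "2020")

print(as_rows)
print(dim(as_cols))
```

Once the dimension names are set, a cell can be looked up by name, for example as_rows["PlayerB", "2019"].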

Matrix Operations:

  • round(x, n): To round the values of x, such as the result of a matrix calculation, to n decimal places.
  • t(): To transpose a matrix, swapping its rows and columns:
trans_mymatrix <- t(mymatrix)
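Here is a small sketch of both operations together, assuming two matrices of the same shape. The points and games figures are invented for the example.

```r
# Hypothetical totals and game counts, for illustration only
points <- matrix(c(2000, 1500, 1800, 1200), nrow = 2, byrow = TRUE,
                 dimnames = list(c("PlayerA", "PlayerB"), c("2019", "2020")))
games <- matrix(c(82, 70, 75, 66), nrow = 2, byrow = TRUE,
                dimnames = dimnames(points))

# Matrices of the same shape are divided element by element;
# round() then keeps 1 decimal place
points_per_game <- round(points / games, 1)

# t() swaps rows and columns, so seasons become rows and players become columns
by_season <- t(points_per_game)

print(points_per_game)
print(by_season)
```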

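Visualizing with matplot():

matplot() is a basic plotting function that draws each column of a matrix as its own series. Combined with legend(), it lets you tell the series apart by symbol and color.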

matplot(t(mymatrix), type="b", pch=15:18, col=c(1:4,6))
legend("bottomleft", inset= 0.01, legend= Players, pch=15:18, col=c(1:4,6))
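For a version you can run without the course data, here is a sketch with a made-up matrix. The values and the Seasons vector are assumptions for illustration, and Players is defined here as a plain vector of names.

```r
# Hypothetical salaries (in millions); rows are players, columns are seasons
Players <- c("PlayerA", "PlayerB", "PlayerC")
Seasons <- c("2018", "2019", "2020", "2021")
mymatrix <- matrix(c(10, 12, 15, 18,
                     20, 22, 25, 24,
                      5,  9, 14, 21),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Players, Seasons))

# matplot() draws one series per column, so transpose to get one line per player
matplot(t(mymatrix), type = "b", pch = 15:17, col = 1:3,
        xlab = "Season index", ylab = "Salary (millions)")

# The legend reuses the same pch and col so its symbols match the lines
legend("topleft", inset = 0.01, legend = Players, pch = 15:17, col = 1:3)
```

Keeping pch and col identical in both calls is what makes the legend trustworthy; if they differ, the legend will label the wrong lines.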

Subsetting: To select specific elements, rows or columns using square brackets. The idea is easiest to see on a vector first:

x <- c("a", "b", "c", "d", "e") 
x[c(1,2)] #"a", "b"
x[1] #"a"
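The same square brackets work on a matrix, with a row index before the comma and a column index after it. A minimal sketch on a made-up matrix m:

```r
# Hypothetical matrix for illustration only
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("PlayerA", "PlayerB"), c("2018", "2019", "2020")))

cell <- m[1, 2]                      # single cell: row 1, column 2
one_row <- m["PlayerB", ]            # a whole row, returned as a named vector
two_cols <- m[, c("2018", "2020")]   # two columns, still a matrix
kept <- m[1, , drop = FALSE]         # drop = FALSE keeps a 1 x 3 matrix

print(cell)
print(one_row)
print(two_cols)
print(kept)
```

By default R simplifies a single row or column to a vector; drop = FALSE is the way to keep the matrix shape when later code expects one.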

So far, we have discussed libraries and packages, and matrices and their various operations. Moving on, we will look at what a data frame is and how to work with it. You may wonder what the difference between a matrix and a data frame is. They are very similar in that both are two-dimensional objects, but the core difference is that in a matrix all the data must have the same type, whereas each column of a data frame can have its own type.
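The type difference is easy to check for yourself. A small sketch with made-up values (m and df are hypothetical names):

```r
# Mixing numbers and text in a matrix coerces every cell to character
m <- cbind(c(1, 2), c("a", "b"))
print(class(m[1, 1]))

# A data frame keeps a separate type for each column
# (stringsAsFactors = FALSE keeps text as character on older R versions too)
df <- data.frame(id = c(1, 2), label = c("a", "b"), stringsAsFactors = FALSE)
print(sapply(df, class))
```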

Importing the Data set

For analysis, we will be using a demographic dataset from section 5 on SuperDataScience to analyze the world’s demographic trends. Suppose you are employed as a Data Scientist and are required to produce a scatterplot to illustrate Birth rate and Internet Usage statistics by country. After downloading the data set from the above link and placing it in a directory of your choice, you are ready to import the data into your R script.

mydf = read.csv("<path>") #Replace <path> with the path of the file
#or pick the file interactively:
mydf = read.csv(file.choose())

If you have stored your file in the working directory, you can refer to it by file name alone. To do that, first set the folder where you saved the file as the working directory:

setwd("<path>") #Replace <path> with the folder that contains the file
mydf = read.csv("Demographic_data.csv") #Data set will be stored in the mydf data frame

In my case, the file is stored as “Demographic_data.csv”. CSV means comma-separated values: in this format, each element of data is separated by a comma, although other characters, such as semicolons or tabs, can also be used as separators. For more detail, you can go through the R documentation by using the following command:

help(read.csv) #or
?read.csv
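To illustrate the point about separators, here is a sketch that writes two tiny temporary files and reads them back. The file contents are made up, and the sep argument is the part of read.csv() being shown.

```r
# Write two small temporary files with made-up values, for illustration only
comma_file <- tempfile(fileext = ".csv")
semi_file <- tempfile(fileext = ".csv")
writeLines(c("Country,Birth.rate", "A,10.2", "B,35.5"), comma_file)
writeLines(c("Country;Birth.rate", "A;10.2", "B;35.5"), semi_file)

# read.csv() assumes a comma separator by default
comma_df <- read.csv(comma_file)

# The sep argument handles files that use another separator
semi_df <- read.csv(semi_file, sep = ";")

print(comma_df)
print(semi_df)
```

If a file loads as a single wide column, a wrong separator is a likely cause, and sep is the first thing to check.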

Exploring your dataset

Now we have our data set imported, but in most cases we cannot use it directly: it may not be correctly ordered, or it may contain features that are not required for our analysis. Before tidying it up, let’s first have a look at our data. You can use any or all of the commands listed below to do so.

head(mydf) #or tail(mydf) 
summary(mydf) #or str(mydf)
nrow(mydf) #or ncol(mydf) #to get no of rows and cols
[Figure: output of summary(mydf)]
[Figure: output of str(mydf)]
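If you do not have the file at hand, the same commands can be tried on a small stand-in. The demo data frame below is made up; only its column names imitate the demographic data.

```r
# Small made-up data frame standing in for the demographic data
demo <- data.frame(
  Country.Name = c("A", "B", "C", "D"),
  Birth.rate = c(10.2, 35.5, 12.9, 44.1),
  Internet.users = c(78.9, 5.9, 88.0, 19.1),
  stringsAsFactors = FALSE
)

first_two <- head(demo, n = 2)   # first two rows
last_two <- tail(demo, n = 2)    # last two rows
print(first_two)
print(last_two)

str(demo)              # column types and a preview of the values
print(summary(demo))   # per-column summary statistics
print(dim(demo))       # number of rows and columns in one call
```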

“$” is a way to access a single column of the data frame by name.

mydf$Internet.users

To get the levels of a certain column, use levels(). The column must be a factor, which is R’s type for categorical data. If it is not, convert it with factor(), as shown below.

mydf$Income.Group <- factor(mydf$Income.Group)
levels(mydf$Income.Group)
[1] "High income" "Low income" "Lower middle income" "Upper middle income"

To check if the dataset is a data frame or not, use is.data.frame(mydf)

To delete a column from a dataset, use mydf$Income.Group <- NULL (NULL must be written in upper case).
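These steps can be tried end to end on a tiny example. The demo data frame and its values are made up for illustration.

```r
# Made-up data frame for illustration only
demo <- data.frame(
  Country.Name = c("A", "B", "C"),
  Income.Group = c("Low income", "High income", "Low income"),
  stringsAsFactors = FALSE
)

# A character column has no levels yet, so this prints NULL
print(levels(demo$Income.Group))

# After conversion the levels are listed in alphabetical order
demo$Income.Group <- factor(demo$Income.Group)
income_levels <- levels(demo$Income.Group)
print(income_levels)

print(is.data.frame(demo))

# Assigning NULL (upper case) drops the column
demo$Income.Group <- NULL
print(names(demo))
```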

Tidying up your dataset

Filtering: Let’s filter the data based on a Birthrate greater than 4.

> gt4 <- mydf$Birth.rate > 4
> filtered_data <- mydf[gt4,]
> head(filtered_data)

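Here is the same filtering pattern as a self-contained sketch. The demo data frame and the thresholds of 30 and 10 are made up for the example.

```r
# Made-up data frame for illustration only
demo <- data.frame(
  Country.Name = c("A", "B", "C", "D"),
  Birth.rate = c(10.2, 35.5, 12.9, 44.1),
  Internet.users = c(78.9, 5.9, 88.0, 19.1)
)

# A logical vector with one TRUE or FALSE per row
high_birth <- demo$Birth.rate > 30

# Keep the rows where the condition is TRUE;
# the empty slot after the comma keeps all columns
filtered <- demo[high_birth, ]

# Conditions can be combined with & (and) or | (or)
both <- demo[demo$Birth.rate > 30 & demo$Internet.users < 10, ]

print(filtered)
print(both)
```

The comma inside the brackets matters: without it, R would treat the logical vector as a column selector instead of a row selector.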
Merge: Let’s merge the two data frames (mydf & stats) on the country code, which is stored in the Country.Code column of mydf and in the code column of stats.

> merged_data <- merge(mydf, stats, by.x = "Country.Code", by.y = "code")
> head(merged_data)

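A runnable sketch of the merge, using two made-up data frames whose key columns have different names. mydf_demo, stats_demo and the region column are hypothetical.

```r
# Two made-up data frames whose key columns have different names
mydf_demo <- data.frame(Country.Code = c("AAA", "BBB", "CCC"),
                        Birth.rate = c(10.2, 35.5, 12.9))
stats_demo <- data.frame(code = c("AAA", "BBB", "DDD"),
                         region = c("Europe", "Africa", "Asia"))

# by.x and by.y name the key column in the first and second data frame;
# by default only rows with a match in both are kept
inner <- merge(mydf_demo, stats_demo, by.x = "Country.Code", by.y = "code")

# all.x = TRUE also keeps unmatched rows from the first data frame,
# filling the missing columns with NA
left <- merge(mydf_demo, stats_demo, by.x = "Country.Code", by.y = "code",
              all.x = TRUE)

print(inner)
print(left)
```

Comparing nrow() before and after a merge is a quick way to notice rows that silently dropped out because their codes did not match.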
Data Visualization and Analysis

For visualization, we will mostly use “ggplot2”, which is useful for almost every kind of graph. The “dplyr” package complements it: it is used to filter and summarize the data before plotting rather than to draw graphs. A ggplot2 graph is built from seven layers:

  • theme: non-data styling such as fonts, backgrounds and gridlines that makes your chart look polished.
  • coordinates: the coordinate system and axis limits, used for example to zoom in on part of the chart or to flip the axes.
  • facets: split the data into subsets and draw each one as its own small panel, so you can compare groups at different granularities.
  • statistics: transformations applied to the data to create new variables before they are drawn, such as binning for a histogram.
  • geometries: the shapes that draw the data. Is it a dot or is it a line? Is it a circle? Is it a square?
  • aesthetics: how the data maps to what you see. What goes on x and y? The size of what? The color of what?
  • data: the center of a graph.

In this section, we will look at movie ratings by critics and audiences, as well as movie budgets, for the years 2007–2011, and explore different ways of creating graphs with ggplot2. To begin with, let's use the “Movie Ratings” data set from section 6 on SuperDataScience.

Import the file and assign proper column names to make the graph readable. Then create a ggplot with Critic Rating along the x coordinate and Audience Rating along the y coordinate.

Let’s add aesthetics to the ggplot

> ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, color=Genre)) + geom_point()

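To try this without the course file, here is a sketch on a made-up data frame. movies_demo and its values are invented; only the column names follow the example above, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Made-up ratings for illustration only (not the Movie Ratings data set)
movies_demo <- data.frame(
  CriticRating = c(20, 45, 60, 75, 85, 90),
  AudienceRating = c(35, 50, 55, 80, 70, 95),
  Genre = c("Action", "Comedy", "Action", "Drama", "Comedy", "Drama")
)

# aes() maps columns to x, y and color; geom_point() draws one point per row
p <- ggplot(data = movies_demo,
            aes(x = CriticRating, y = AudienceRating, color = Genre)) +
  geom_point()

print(p)
```

Because Genre is mapped to color, ggplot2 picks one color per genre and adds the legend on its own.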
Overriding aesthetics

> g <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating, color=Genre)) 
> g + geom_point(aes(size=CriticRating))

Mapping & Setting: mapping ties a visual property to a column of the data and is done inside aes(), which is what we have done so far; setting fixes a property to one constant value and is done outside aes(). Let's see both:

> g <- ggplot(data = movies, aes(x=CriticRating, y=AudienceRating)) 
> g + geom_point(aes(color=Genre)) #mapping
> g + geom_point(color="Dark Green") #setting

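The difference is easiest to see side by side. A sketch on the same kind of made-up data, assuming ggplot2 is installed; movies_demo, mapped, fixed and mistake are hypothetical names.

```r
library(ggplot2)

# Made-up ratings for illustration only
movies_demo <- data.frame(
  CriticRating = c(20, 45, 60, 75, 85, 90),
  AudienceRating = c(35, 50, 55, 80, 70, 95),
  Genre = c("Action", "Comedy", "Action", "Drama", "Comedy", "Drama")
)

g <- ggplot(data = movies_demo, aes(x = CriticRating, y = AudienceRating))

# Mapping: color varies with the data, so ggplot2 adds a legend
mapped <- g + geom_point(aes(color = Genre))

# Setting: one fixed color for every point, no legend
fixed <- g + geom_point(color = "darkgreen")

# Common mistake: a constant inside aes() is treated as data, not as a color
# name, so the points are not dark green and an odd legend appears
mistake <- g + geom_point(aes(color = "darkgreen"))

print(mapped)
print(fixed)
```

A handy rule of thumb: if the value is a column name, it belongs inside aes(); if it is a literal such as a color name or a number, it belongs outside.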
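Histogram and Density Charts: a histogram needs only an x aesthetic, because the y axis is the count that ggplot2 computes for each bin; a density chart is its smoothed counterpart. A sketch using the CriticRating column (the binwidth of 10 is an arbitrary choice):

> s <- ggplot(data = movies, aes(x=CriticRating))
> s + geom_histogram(binwidth=10, aes(fill=Genre), color="Black") #histogram
> s + geom_density(aes(fill=Genre), position="stack") #density chart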

You are all set to begin some extensive EDA and model building in R. All you need to do is write a few lines of code and boom, the magic is in front of you.

Thank you for reading the article. I hope it will be helpful for beginners.


As a BIE at Amazon, I explore why we call data the new oil by interpreting it and generating meaningful insights.