# Lesson #7 -- Intro to R and the tidyverse

Today we transition from python to *R*. Broadly the lesson today will be structured in three sections:
*  How to set up *R* in Colaboratory
*  A review of the topics we covered in the python lessons, but applied to *R*
*  Applying these actions to datasets using the package *tidyverse*


# Getting set up for *R* in Colaboratory

Let's start by loading our extension that allows us to use *R* in Colaboratory.

[link to *R colab*](https://colab.to/r)

This will open a new notebook that is specific to *R*.

Once the new notebook opens, we can check which version of *R* has been installed.

In [None]:
version

               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          2.1                         
year           2022                        
month          06                          
day            23                          
svn rev        82513                       
language       R                           
version.string R version 4.2.1 (2022-06-23)
nickname       Funny-Looking Kid           

And we can check which packages come pre-installed.

After hitting play, let's check if *tidyverse* and *ggplot2* (which we will use next week) are included in the installed packages.

In [None]:
str(allPackage <- installed.packages())
allPackage[, c(1,3:5)]

Since we do not see *tidyverse*, we can install it directly. First, we need to install the package from the server using *install.packages()*, then we need to load the package from our library using *library()*.

It is important to remember that installing and loading packages are two different actions in *R*. Installing downloads the commands onto your computer (or cloud), and loading activates the commands in the current session. It is best practice to only load from the library packages that you will be using. There might be overlapping commands between packages.

In [None]:
install.packages("tidyverse")
library(tidyverse)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.7      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Given the output from *library* we see that additional packages were included when loading *tidyverse*. One of these is *ggplot2*, which we will be using next week.

Additionally, we see the commands that are conflicting with other commands already loaded. For instance, when we call filter(), we will get the *dplyr* version as it is 'masking' the *stats* version.
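When two packages share a command name, the `package::command()` form lets us say exactly which version we want. The sketch below is only an illustration: the vector `x` and the data frame `toy` are made-up names, and it assumes *dplyr* has already been installed (it does not need to be loaded for `::` to work).

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter; here, a 3-point moving average
moving_avg <- stats::filter(x, rep(1/3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
toy <- data.frame(x = x)
large_rows <- dplyr::filter(toy, x > 3)

moving_avg
large_rows
```

Writing the package name out is optional once we know which version is masking which, but it can make code easier to read when conflicts exist.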

Now we are ready to dig in.

# Covering the Basics

We have spent the last 6 weeks covering 9 chapters of Py4E. While we will not go over everything again here, we can re-orient ourselves to get into the *R* mindset.

First, let's start with assignment. In *R* we can use either '=' or '<-' to assign values to variables.

In [None]:
w = 5
d <- 4
t <- w * d
t

Additionally, what we see here is that in *R* we do not need to use a print command. By simply calling the variable, in this case 't', the associated value is printed.

We can also construct conditional if statements.

In [None]:
x <- 5
if (x == 5){
    eval = "same"
    }else{
    eval = "different"
    }

eval


But, like our code in week 3, maybe we are interested in people that are the same height, taller, or shorter. We can have multiple branches in *R* as well, but the syntax is different. We need to use 'else if' instead of 'elif'.

In [None]:
x <- 3
if (x == 5){
      eval = "same"
    }else if (x < 5){
      eval = "less"
    }else{
      eval = "more"
    }

eval

We can now change the value of x to check our evaluation.

Next, let's think about writing our own functions in *R*. Again, the logic is the same, but the syntax is a bit different.

To do this, we will be using the *function* command.

In [None]:
mean_val = function(a, b)
    {
      add = a + b
      return(add/2)
    }

mean_val(5, 13)

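Notice the *return* command in the newly defined function: it signals to *R* which value the function hands back when it is called. This is not the same as print(). The value shows up on screen only because we called mean_val() at the top level without assigning the result, so *R* prints it automatically, just as it does when we call a variable by name. Strictly, an *R* function returns its last evaluated expression even without *return*, but writing it out makes the intent explicit.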
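To see the difference between returning and printing, here is a minimal sketch. The function name `mean_val_implicit` is made up for this illustration; it relies on the fact that an *R* function hands back its last evaluated expression when no *return* is written.

```r
# Hypothetical variant of mean_val() without an explicit return()
mean_val_implicit <- function(a, b) {
  (a + b) / 2   # the last evaluated expression becomes the return value
}

result <- mean_val_implicit(5, 13)   # assigned to a variable, so nothing is printed
result                               # calling the variable prints the value
```

Whether to write *return* explicitly is a style choice; for short functions many *R* users leave it out.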

Next, let's move to for loops. We can start with a list of numbers, but instead of the square brackets, we will use the *c(x,x,x)* structure. In *R* this creates a vector, which plays the role that a list of numbers played in our python lessons.

In [None]:
height = c(5.4,7,6.2,5.9,6,5)
height

In [None]:
height = c(5.4,7,6.2,5.9,6,5)
total = 0
for(i in height)
  {
      total = total + i
  }

total

Similar to week 4, we can make this more interesting, and we can calculate multiple items in our for loop in order to calculate the mean height in our list.

In [None]:
height = c(5.4,7,6.2,5.9,6,5)
total = 0
count = 0
for(i in height)
  {
      total = total + i
      count = count +1
      mean = total/count
  }

total
count
mean

We can also generate a random list of numbers and feed it into our for loop.

In [None]:
height = runif(n=100, min = 4, max = 8)
total = 0
count = 0
for(i in height)
  {
      total = total + i
      count = count +1
      mean = total/count
  }

total
count
mean
height

To make this a little less burdensome to read, since our list is 100 numbers, we can do some basic calculations for 'height'.

In [None]:
height = runif(n=100, min = 4, max = 8)
total = 0
count = 0
for(i in height)
  {
      total = total + i
      count = count +1
      mean = total/count
  }

total
count
mean
(max <- max(height))
(min <- min(height))
(sd <- sd(height))

Finally, we can traverse the list and place the normalized height back into the list.

In [None]:
height = runif(n=100, min = 4, max = 8)
total = 0
count = 0
for(i in height)
  {
      total = total + i
      count = count +1
      mean = total/count
  }

n_height <- height
for(i in 1:length(n_height))
  {
    n_height[i] = n_height[i]/mean
  }

(max <- max(height))
(min <- min(height))
(sd <- sd(height))

(n_max <- max(n_height))
(n_min <- min(n_height))
(n_sd <- sd(n_height))

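This additionally shows that assignment makes a copy in *R*. Once we assign 'height' to the new variable 'n_height', the two are independent: changing 'n_height' leaves 'height' untouched, which is why the two sets of summary statistics differ. This is unlike python, where assigning a list to a second name gives two names for the same list.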
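The loops above are useful for seeing the logic, but *R* can also operate on a whole vector at once. The following is a sketch of the same calculations without loops; the seed value and the names `mean_height` and `height_copy` are arbitrary choices for this example.

```r
set.seed(42)  # arbitrary seed so the random heights are reproducible
height <- runif(n = 100, min = 4, max = 8)

# Same result as accumulating total and count in the for loop
mean_height <- sum(height) / length(height)

# Dividing a vector by a single number divides every element
n_height <- height / mean_height

# Assignment copies: changing the copy leaves the original untouched
height_copy <- height
height_copy[1] <- 0

mean_height
mean(n_height)
height[1]
height_copy[1]
```

Because every height is divided by the mean, the normalized heights average out to 1.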

# The *tidyverse*

As we start to use the *tidyverse* package, we also move from scalar values and lists to dataframes and tables. In this way, we are starting to engage with whole datasets rather than individual values or subsets of data.

Why *tidyverse*?

The *tidyverse* package is an incredibly powerful tool. As we saw above when loading the *tidyverse* package, it is not just one package but a set of 8 packages. This suite approach is what makes *tidyverse* so powerful. First, the packages are made for data science, meaning they are built around datasets rather than individual values or lists. Second, the packages share an underlying design philosophy, grammar, and data structures, meaning they work together efficiently and effectively.

While we are broadly introducing *tidyverse* and will use a command or two from many of the packages, we will focus mainly on processing data today -- i.e., getting datasets into *R*, cleaning those datasets, and manipulating and joining datasets. Next week we will focus on constructing data visualizations.

In particular today, we will focus on the *tidyverse* packages:
*  *tibble* -- which allows us to construct data tables and is the main format we will be dealing with in the *tidyverse*
*  *readr* -- which offers a 'fast and friendly' way to read data files such as CSVs into *R*
*  *dplyr* -- which offers a set of tools to clean and manipulate our tibbles
*  *tidyr* -- which offers a set of tools to reshape and pivot tibbles


Let's hop in!

# tibble

First, we will start by constructing our own data table and then convert it to a tibble.


In [None]:
data <- data.frame(
      name = c("Graham","Nick","Anna","Jiho"),
      gender = c("M","M","F","M"),
      height = c(6,6.1,5.7,5.9),
      year = c(3,3,3,3)
)

data2 <- as_tibble(data)
data2
data

name,gender,height,year
<chr>,<chr>,<dbl>,<dbl>
Graham,M,6.0,3
Nick,M,6.1,3
Anna,F,5.7,3
Jiho,M,5.9,3


name,gender,height,year
<chr>,<chr>,<dbl>,<dbl>
Graham,M,6.0,3
Nick,M,6.1,3
Anna,F,5.7,3
Jiho,M,5.9,3


We can also construct a tibble directly using *tibble()*.

In [None]:
data <- tibble(
      name = c("Graham","Nick","Anna","Jiho"),
      gender = c("M","M","F","M"),
      height = c(6,6.1,5.7,5.9),
      year = c(3,3,3,3)
)

data

name,gender,height,year
<chr>,<chr>,<dbl>,<dbl>
Graham,M,6.0,3
Nick,M,6.1,3
Anna,F,5.7,3
Jiho,M,5.9,3


*tibble()* does less automatically than *data.frame()*. *tibble()* takes the inputs as you have entered them -- it does not change the type of the inputs (e.g. from strings to factors, which *data.frame()* did by default before R 4.0). *tibble()* does not change the names of variables and it never creates row.names().

This seems like a step backwards -- why is it more powerful then?

What it gives you is control. While we might find more bugs and have to do more cleaning ourselves when starting out, you are in control of the cleaning. *tibble()* will not guess what you want. You have to tell *tibble()* what you want. This means bugs are caught earlier in preprocessing, not later on when the outputs of your analysis seem off.
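One place where the difference is easy to see is column names. The sketch below uses a made-up column name containing a space; *data.frame()* rewrites it into a syntactically valid name by default, while *tibble()* keeps it as typed.

```r
library(tibble)

# A column name with a space in it (hypothetical example)
df <- data.frame(`my height` = c(6, 6.1))
tb <- tibble(`my height` = c(6, 6.1))

names(df)   # "my.height" -- data.frame() has altered the name
names(tb)   # "my height" -- tibble() keeps it as entered
```

Neither behavior is wrong, but with the tibble you always know the names are the ones you supplied.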

Before moving on, we can talk about basic ways to call information in tibbles. We can do this using the following format:
```
tibble['row number','column number']
```

*R* does not use the same counting convention as *python*. In *R*, row #1 is position 1.


In [None]:
data[1,]          #first row

name,gender,height,year
<chr>,<chr>,<dbl>,<dbl>
Graham,M,6,3


In [None]:
data[,3]          #third column

height
<dbl>
6.0
6.1
5.7
5.9


Additionally, we can call multiple rows or columns.

In [None]:
data[c(1,3),]          #first and third rows

name,gender,height,year
<chr>,<chr>,<dbl>,<dbl>
Graham,M,6.0,3
Anna,F,5.7,3


We can also call the columns directly by their name.

In [None]:
data[,"gender"]

gender
<chr>
M
M
F
M
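Notice that single brackets always hand back a tibble, even for one column. If we want the plain values instead, *R* offers double brackets and the `$` operator. Below is a minimal sketch using a made-up tibble called `people`.

```r
library(tibble)

# Small illustrative tibble (hypothetical values)
people <- tibble(
  name   = c("Graham", "Nick", "Anna"),
  height = c(6, 6.1, 5.7)
)

col_as_tibble <- people[, "height"]    # still a tibble with one column
col_as_vector <- people[["height"]]    # a plain numeric vector
col_dollar    <- people$height         # same vector, shorter to type

col_as_tibble
col_as_vector
mean(col_dollar)    # functions such as mean() expect the vector form
```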


# readr

Next, we will shift to *readr*, which allows us to upload pre-existing datasets into *R* for use.

We will use the following format:


```
data.name <- read_csv("path to file")
```

In Colaboratory we will have to upload our file in the left-hand bar and enter the path given the file's location in the cloud, as we have done earlier in the class.


In [None]:
data <- read_csv("/content/Lesson_7_EJ_attend_list.csv")
data

When reading the file, *R* gives us important information about our tibble. First, how many rows and columns were read. This is our first chance to see if the file uploaded correctly. Second, the delimiter. In this case ',' -- which makes sense as we have uploaded a Comma Separated Values file.

A quick side note -- if you are analyzing text data, using a CSV file, and your tibble structure is incorrect (i.e., too many rows or columns) there is probably an unquoted comma in your text that is causing issues. You can convert your file to a TSV (i.e., a Tab Separated Values file) or remove all commas from your input file.

Third, we are given a count of the types of variables in the tibble. In this case, three character variables (and the variables listed out) and 71 double or numeric variables (and the variables listed out). Because tibbles are 'lazy', this is our first chance to see if our data has uploaded as the right type.

Let's say we want to further specify our types of data. More specifically, Position_ID should be an integer and Part_sec should be a factor.

In [None]:
data <- read_csv("/content/Lesson_7_EJ_attend_list.csv",
            col_types = cols(
                       Position_ID = col_integer(),
                       Part_sec = col_factor(levels = c("App","Del","Pol"))
            ))
data
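If you want to try *col_types* without uploading a file, *read_csv()* can also read text typed directly into the notebook. The sketch below assumes *readr* version 2.0 or later (needed for wrapping the text in `I()`); the `Comment` column and its contents are invented for this example. It also shows that a comma inside a text field is read correctly when the field is wrapped in double quotes.

```r
library(readr)

# Invented CSV text; the Comment column is hypothetical
csv_text <- 'Position_ID,Part_sec,Comment
1,App,"attended, but left early"
2,Del,on time
3,Pol,"absent, excused"'

toy <- read_csv(
  I(csv_text),                       # I() marks the string as literal data
  col_types = cols(
    Position_ID = col_integer(),
    Part_sec    = col_factor(levels = c("App", "Del", "Pol")),
    Comment     = col_character()
  )
)

toy
```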

# dplyr

Now that we have our data read into *R* and structured as a tibble, it is time to clean and manipulate our data using *dplyr*.

A command that is central to working with *dplyr* is the pipe (it originates in the *magrittr* package and becomes available when we load *tidyverse*):

```
%>%
```

The pipe command allows us to make multiple, sequenced manipulations to the same dataset.
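To make the idea concrete: the pipe takes whatever is on its left and passes it as the first argument to the command on its right. The sketch below, using a made-up tibble `toy`, writes the same two steps once as nested calls and once with pipes.

```r
library(dplyr)

# Small illustrative tibble (hypothetical values)
toy <- tibble(
  name  = c("a", "b", "c"),
  score = c(3, 1, 2),
  extra = c(TRUE, FALSE, TRUE)
)

# Nested calls: read from the inside out
nested <- arrange(select(toy, name, score), desc(score))

# Piped calls: read from top to bottom
piped <- toy %>%
  select(name, score) %>%
  arrange(desc(score))

nested
piped
```

Both versions produce the same table; the piped version simply lists the steps in the order they happen.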

For instance, let's say we are only interested in the first two meetings of the dataset (i.e., the meetings occurring in 2011). We can call the dataset, and pipe it to the select() command.


In [None]:
data_2011 <- data %>%
  select(Part_Position, Position, Position_ID, Part_sec, d_2011_10, d_2011_11)

data_2011

If we wanted to be more efficient, we could call all columns between Part_Position and d_2011_11 using ":".

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11)

data_2011

Given our new dataset, maybe we are interested in the total number of meetings that an individual has attended in 2011. For this we can use the mutate() command. We can just build off our current code using another pipe.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11)%>%
  mutate(t_2011 = d_2011_10+d_2011_11)

data_2011

Now that we have t_2011, let's rearrange our dataset in order of total attendance. We can use the arrange() function to do this. Additionally, if we want the participants with the greatest attendance at the top of our tibble, we can nest the desc() command into the arrange command.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  arrange(desc(t_2011))

data_2011

Next, let's say we are interested in the participant's sector in our analysis, but we want to run the sectors as individual dummy variables rather than as a single factor.

We can use a conditional mutate command to do this. The basic format for this is as follows:

```
mutate(data, location_of_mutation = ifelse(evaluation, output_if_true, output_if_false))
```



In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  mutate(App_ID = ifelse(Part_sec =="App",1,0))%>%
  arrange(desc(t_2011))

data_2011

We can see that a new column App_ID has now been added to our dataset. Furthermore, when the evaluation is TRUE, a 1 has been placed in the column, and a 0 otherwise.

Now, let's add two more columns, one for Del_ID and one for Pol_ID.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  mutate(App_ID = ifelse(Part_sec =="App",1,0))%>%
  mutate(Del_ID = ifelse(Part_sec =="Del",1,0))%>%
  mutate(Pol_ID = ifelse(Part_sec =="Pol",1,0))%>%
  arrange(desc(t_2011))

data_2011

Finally, for illustrative purposes, let's say we want each of the Part_sec factors in all capital letters. With the conditional mutation function we can also overwrite current columns.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  mutate(App_ID = ifelse(Part_sec =="App",1,0))%>%
  mutate(Del_ID = ifelse(Part_sec =="Del",1,0))%>%
  mutate(Pol_ID = ifelse(Part_sec =="Pol",1,0))%>%
  mutate(Part_sec = ifelse(Part_sec =="App","APP",0))%>%
  mutate(Part_sec = ifelse(Part_sec =="Del","DEL",0))%>%
  mutate(Part_sec = ifelse(Part_sec =="Pol","POL",0))%>%
  arrange(desc(t_2011))

data_2011

What happened? Everything is 0.

Because we are overwriting the current column, our output_if_false state must be more flexible. Right now, if the evaluation is FALSE, it overwrites the value with 0. Thus, by the next line of code there is nothing left that is the same as 'Del'.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  mutate(App_ID = ifelse(Part_sec =="App",1,0))%>%
  mutate(Del_ID = ifelse(Part_sec =="Del",1,0))%>%
  mutate(Pol_ID = ifelse(Part_sec =="Pol",1,0))%>%
  mutate(Part_sec = ifelse(Part_sec =="App","APP",Part_sec))%>%
  mutate(Part_sec = ifelse(Part_sec =="Del","DEL",Part_sec))%>%
  mutate(Part_sec = ifelse(Part_sec =="Pol","POL",Part_sec))%>%
  arrange(desc(t_2011))

data_2011

Hmm... now what is happening?

Because we established Part_sec as a factor, we are seeing the underlying integer codes of the factor when ifelse() turns it back into a string variable. This is how *R* stores factor variables. Despite seeing the strings before, *R* is recognizing them as factors, numbered in the order that we established them in read_csv().

This is not a problem, but we have to adjust our code.

In [None]:
data_2011 <- data %>%
  select(Part_Position:d_2011_11) %>%
  mutate(t_2011 = d_2011_10+d_2011_11) %>%
  mutate(App_ID = ifelse(Part_sec =="App",1,0))%>%
  mutate(Del_ID = ifelse(Part_sec =="Del",1,0))%>%
  mutate(Pol_ID = ifelse(Part_sec =="Pol",1,0))%>%
  mutate(Part_sec = ifelse(Part_sec =="App","APP",Part_sec))%>%
  mutate(Part_sec = ifelse(Part_sec ==2,"DEL",Part_sec))%>%
  mutate(Part_sec = ifelse(Part_sec ==3,"POL",Part_sec))%>%
  arrange(desc(t_2011))

data_2011
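Matching on the numbers 2 and 3 works, but it depends on remembering the order of the factor levels. One alternative, sketched below on a made-up tibble `toy`, is to convert the factor to text first with as.character() and then capitalize it with toupper(). This is just one possible approach, not the only way to handle factors.

```r
library(dplyr)

# Small illustrative tibble with the same factor levels as Part_sec
toy <- tibble(
  Part_sec = factor(c("App", "Del", "Pol", "App"),
                    levels = c("App", "Del", "Pol"))
)

# ifelse() drops the factor labels, leaving the integer codes as text
codes <- ifelse(toy$Part_sec == "App", "APP", toy$Part_sec)
codes

# Converting to character first keeps the labels, so toupper() can do the rest
toy_upper <- toy %>%
  mutate(Part_sec = toupper(as.character(Part_sec)))

toy_upper
```

Here a single mutate() replaces the three conditional ones, and nothing needs to change if a new sector is added later.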

Maybe we are only interested in the Appointed Citizens (i.e., APP). We can use the filter() command.

In [None]:
data_App <- data_2011 %>%
  filter(Part_sec == 'APP')

data_App

Conversely, maybe we are interested in everyone except the Appointed Citizens (i.e., APP). We can use the bang equals (i.e., !=).

In [None]:
data_Pol_Del <- data_2011 %>%
  filter(Part_sec != 'APP')

data_Pol_Del

Finally, maybe we are interested in the groups of actors based on their sector affiliation. We can use the summarize() command along with the group_by() command to look at differences across these groups.

In [None]:
data_summ <- data_2011 %>%
  group_by(Part_sec) %>%    #identifying the column(s) by which the data should be grouped
  summarise(mean = mean(t_2011))

data_summ

We can add as many summary statistics as we desire into the same summarize() command.

In [None]:
data_summ <- data_2011 %>%
  group_by(Part_sec) %>%
  summarise(mean = mean(t_2011), sd = sd(t_2011), max = max(t_2011), min = min(t_2011), count = n() )

data_summ

Part_sec,mean,sd,max,min,count
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
APP,1.166667,0.5773503,2,0,12
DEL,1.142857,0.6900656,2,0,7
POL,0.5,0.7071068,1,0,2


# tidyr

The final package that we will be covering today in the *tidyverse* is *tidyr*. While *dplyr* is focused on manipulating and constructing data, *tidyr* is focused on sorting and shaping data.

For instance, let's move back to our full dataset. In our toy dataset, we only looked at 2 of the 71 meetings. Say we are interested in calculating the total attendance across all 71 meetings for each participant. We could use mutate() again, but that would require us to type in all 71 column names.

Instead we can leverage the pivot_longer() command in *tidyr*.

In [None]:
new_data <- data %>%
  group_by(Part_Position,Position,Position_ID,Part_sec) %>%
  pivot_longer(cols = c('d_2011_10':'d_2021_3'),
                names_to = 'meeting',
                values_to = 'attend')

new_data

Now that we have restructured the data, we can utilize the summarize() command to sum() the total number of meetings each position has attended.

In [None]:
new_data <- data %>%
  group_by(Part_Position,Position,Position_ID,Part_sec) %>%
  pivot_longer(cols = c('d_2011_10':'d_2021_3'),
                names_to = 'meeting',
                values_to = 'attend')%>%
  summarize(t_attend = sum(attend))

new_data

[1m[22m`summarise()` has grouped output by 'Part_Position', 'Position', 'Position_ID'.
You can override using the `.groups` argument.


Part_Position,Position,Position_ID,Part_sec,t_attend
<chr>,<chr>,<int>,<fct>,<dbl>
Affected Communities 1,Aff_Comm_1,9,App,40
Affected Communities 2,Aff_Comm_2,10,App,23
Department of Business and Economic Development,Dept_Econ_Dev,5,Del,34
Department of Health and Mental Hygiene,Dept_Health,1,Del,49
Department of Housing and Community Development,Dept_Housing,3,Del,34
Department of Planning,Dept_Planning,4,Del,40
Department of the Environment,Dept_Enviro,0,Del,63
Department of Transportation,Dept_Transport,6,Del,39
Public Interest 1,Public_1,11,App,10
Public Interest 10,Public_10,20,App,13


Finally we can ungroup() and regroup to calculate the same parameters as we had calculated for the 2011 sub-dataset.

In [None]:
data_full_summ <- new_data %>%
  ungroup() %>%
  group_by(Part_sec) %>%
  summarise(mean = mean(t_attend), sd = sd(t_attend), max = max(t_attend), min = min(t_attend), count = n() )

data_full_summ

Part_sec,mean,sd,max,min,count
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
App,28.25,10.80509,42,10,12
Del,46.85714,14.1118,69,34,7
Pol,13.0,11.31371,21,5,2
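Because the full dataset is large, it can help to watch the same reshape-then-summarize pattern on a table small enough to check by hand. The sketch below uses an invented three-meeting table called `wide`; it selects the meeting columns with starts_with() rather than a column range, and uses the `.groups = "drop"` argument so the result comes back ungrouped.

```r
library(dplyr)
library(tidyr)

# Invented attendance table: one row per position, one column per meeting
wide <- tibble(
  Position  = c("A", "B"),
  d_2011_10 = c(1, 0),
  d_2011_11 = c(1, 0),
  d_2011_12 = c(0, 1)
)

# One row per position-meeting pair
long <- wide %>%
  pivot_longer(cols = starts_with("d_"),
               names_to = "meeting",
               values_to = "attend")

# Total attendance per position, returned without grouping
totals <- long %>%
  group_by(Position) %>%
  summarise(t_attend = sum(attend), .groups = "drop")

long
totals
```

Two positions and three meetings give six rows in the long table, which makes it easy to confirm that the totals are right.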


# Merging Datasets

There is one last procedure that I want to show you: the join functions. These are very valuable when it comes to merging datasets together. The basic format is as follows:

```
left_join(left_hand_data, right_hand_data, by = "joining_variable")
```

Let's say we are interested in joining the t_attend we just calculated back onto our original dataset. In this case

left_hand_data = *data*

right_hand_data = *new_data*

joining_variable = *Position*

Why are left_hand_data and right_hand_data important? Because we are using *left_join*, the left_hand_data will be prioritized. This means, if there is an observation (row) in the right_hand_data that does not have a match in the left_hand_data, that row will be dropped. Conversely, if there is an observation in the left_hand_data that does not have a match in the right_hand_data, an 'NA' will be entered in the columns that come from the right_hand_data.

While we have matching observations in our example, this might have large impacts on your dataset when the observations do not match.
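To see what happens when observations do not match, here is a small sketch with two invented tables, `left_tbl` and `right_tbl`. Position "C" appears only on the left and position "D" only on the right.

```r
library(dplyr)

# Invented tables for illustration
left_tbl <- tibble(
  Position = c("A", "B", "C"),
  Part_sec = c("App", "Del", "Pol")
)
right_tbl <- tibble(
  Position = c("A", "B", "D"),
  t_attend = c(10, 20, 30)
)

joined <- left_join(left_tbl, right_tbl, by = "Position")
joined
```

The result keeps all three rows from the left table. Position "C" receives an NA for t_attend, and position "D" does not appear at all.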

In [None]:
merge_data <- data %>%
  left_join(new_data, by="Position")

merge_data

We have the t_attend merged correctly, but now we also have a bunch of information that is repeated. We can use select() to refine new_data, so we are only adding what we want.

In [None]:
merge_data <- data %>%
  left_join(select(new_data, t_attend, Position), by="Position")

merge_data

It is still retaining some information that we do not want.

The comment at the top of the tibble helps us explain why:

```
Adding missing grouping variables: `Part_Position`, `Position_ID`
```

Our dataset *new_data* is still grouped, so *R* is adding these grouped variables back in. We will have to ungroup the dataset before using it.
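The same behavior can be reproduced on a tiny example. In the sketch below, `grouped_toy` is an invented tibble grouped by column `a`; selecting only column `c` still brings `a` along until the grouping is removed.

```r
library(dplyr)

# Invented tibble, grouped by column a
grouped_toy <- tibble(a = c(1, 1, 2), b = c(3, 4, 5), c = c(6, 7, 8)) %>%
  group_by(a)

# The grouping column is added back (with a message)
names(select(grouped_toy, c))

# After ungroup(), select() returns only what was asked for
names(select(ungroup(grouped_toy), c))
```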

In [None]:
new_data2 <- new_data %>% ungroup()

merge_data <- data %>%
  left_join(select(new_data2, t_attend, Position), by="Position")

merge_data

Now our merge is correct.