The first big hurdle

Getting data from your computer into R so you can analyze it.

Types of data that R can read

What a csv file looks like

file name ends with a .csv

What a csv file looks like on the inside

file name ends with a .csv

What an Excel file looks like

file name ends with a .xls or .xlsx

What an Excel file looks like on the inside

file name ends with a .xls or .xlsx

Importing CSV and Excel files

Two ways to get data

Get the URL

Right click the link of the data and click Copy Link Address

read.csv()

The function is read.csv() and put the URL address in “” and add the stringsAsFactors=F

df_csv <- read.csv("https://data.ct.gov/api/views/iyru-82zq/rows.csv?accessType=DOWNLOAD", stringsAsFactors=F)

stringsAsFactors=F

Why?

Blame statisticians.

Back when R was created the users weren’t using it as we use it now, with all these different strings.

The other way to import the data: Download it

When you right click on the link, instead of clicking Copy Link Address– this time, click Save Link As…

Save to the directory you’re working in

After saving to the directory, click on the circle arrow on the right to refresh the files to make sure it’s there.

Note: How to change directories in RStudio

Either by typing setwd("/directory/where/you/want") or by clicking in the menu up top Session > Set Working Directory > Choose Directory…

Importing local csv data

Just like before, except instead of the URL, it’s the name of the file.

Note: This will only work if the working directory is set to where the csv file is.

df_csv <- read.csv("Admissions_to_DMHAS_Addiction_Treatment_by_Town__Year__and_Month.csv", stringsAsFactors=F)

Importing Excel files

First, install the tidyverse package if you have not yet done so.

That will have readxl as part of the group of packages.

#install.packages("tidyverse")
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(readxl)

Unlike a csv, you can’t just copy and paste the URL for an Excel sheet.

You gotta download the file first.

Right click the link of the Excel data link and click Save File As…

read_excel()

Excel spreadsheets have multiple sheets and it’s best to explore what it looks like in Excel first because read_excel() requires specific sheets to be referred to when importing.

Give it a shot with the first sheet.

df_xl <- read_excel("StatisticsSummary.xls", sheet=1)

Check it

View(df_xl)

This isn’t right.

The problem with Excel files is that people love to format it in ways that make it look nice in Excel but makes no sense in R.

read_excel() again

But this time we’ll add skip=2 so it skips the first rows when bringing in the data.

df_xl <- read_excel("StatisticsSummary.xls", sheet=1, skip=2)

Much better

View(df_xl)

Not there yet, though. In anticipation of the work we’re going to be doing, we should go ahead and simplify the column names so there are no characters or spaces.

Cleaning (part 1)

We’ll use the make.names() function on the column names. This function makes syntactically valid names out of character vectors.

colnames(df_xl) <- make.names(colnames(df_xl) )

Check it

Alright, that’s a bit better.

Still, some oddness in the names but that’s because enters were replaced with periods.

Change the name of a single column

We need to get the exact name of the Year column.

colnames(df_xl)
##  [1] "Fiscal.Year......7.1.6.30" "Accessions"               
##  [3] "Autopsies"                 "Exam.inations"            
##  [5] "Other.Cases"               "TOTAL"                    
##  [7] "Cremations"                "X..incl.crem"             
##  [9] "Homicides"                 "Suicide"                  
## [11] "Accidents"                 "Undetermined"             
## [13] "ALL"                       "U.20"                     
## [15] "U.17"                      "SIDS"                     
## [17] "Clinicals"

Copy Fiscal.Year......7.1.6.30 and paste it into `colnames(dataframe_name)[colnames(dataframe_name) == ‘ColumnNameToBeChanged’] <- ‘NewColumnName’

colnames(df_xl)[colnames(df_xl) == 'Fiscal.Year......7.1.6.30'] <- 'Year'

Check it

colnames(df_xl)
##  [1] "Year"          "Accessions"    "Autopsies"     "Exam.inations"
##  [5] "Other.Cases"   "TOTAL"         "Cremations"    "X..incl.crem" 
##  [9] "Homicides"     "Suicide"       "Accidents"     "Undetermined" 
## [13] "ALL"           "U.20"          "U.17"          "SIDS"         
## [17] "Clinicals"

Fix the other names if you want. I’m going to leave them as is for now.

Is the df_xl sheet clean enough to work with?

Scroll down to the bottom of the data.

Not clean yet. There are a bunch of NAs.

That might give us some issues later on so let’s take care of it now.

Eliminating NAs

Easiest way to get rid of NAs is to subset or filter out the NAs based on one column.

Let’s use the Year column.

There are two ways: subset() or filter()

  1. Base R
df_xl <- subset(df_xl, !is.na(Year))
  1. dplyr
df_xl <- filter(df_xl, !is.na(Year))

What’s the difference? Personal preference for this one. There’s not much difference.

Check it

No NAs at the bottom.

It took a few lines of code but the data’s cleaned up enough to analyze or visualize with.