Getting data from your computer into R so you can analyze it.
file name ends with a .csv
file name ends with a .csv
file name ends with a .xls or .xlsx
file name ends with a .xls or .xlsx
Importing Excel is complicated, readxl package is needed
There are more other packages that handle Excel files and can build extra sheets, but we won’t be needing them
The function is read.csv()
and put the URL address in “” and add the stringsAsFactors=F
df_csv <- read.csv("https://data.ct.gov/api/views/iyru-82zq/rows.csv?accessType=DOWNLOAD", stringsAsFactors=F)
Why?
Blame statisticians.
Back when R was created the users weren’t using it as we use it now, with all these different strings.
When you right click on the link, instead of clicking Copy Link Address– this time, click Save Link As…
Save to the directory you’re working in
After saving to the directory, click on the circle arrow on the right to refresh the files to make sure it’s there.
Either by typing setwd("/directory/where/you/want")
or by clicking in the menu up top Session > Set Working Directory > Choose Directory…
Just like before, except instead of the URL, it’s the name of the file.
Note: This will only work if the working directory is set to where the csv file is.
df_csv <- read.csv("Admissions_to_DMHAS_Addiction_Treatment_by_Town__Year__and_Month.csv", stringsAsFactors=F)
First, install the tidyverse
package if you have not yet done so.
That will have readxl
as part of the group of packages.
#install.packages("tidyverse")
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(readxl)
Unlike a csv, you can’t just copy and paste the URL for an Excel sheet.
You gotta download the file first.
Right click the link of the Excel data link and click Save File As…
Excel spreadsheets have multiple sheets and it’s best to explore what it looks like in Excel first because read_excel()
requires specific sheets to be referred to when importing.
Give it a shot with the first sheet.
df_xl <- read_excel("StatisticsSummary.xls", sheet=1)
Check it
View(df_xl)
This isn’t right.
The problem with Excel files is that people love to format it in ways that make it look nice in Excel but makes no sense in R.
But this time we’ll add skip=2
so it skips the first rows when bringing in the data.
df_xl <- read_excel("StatisticsSummary.xls", sheet=1, skip=2)
Much better
View(df_xl)
Not there yet, though. In anticipation of the work we’re going to be doing, we should go ahead and simplify the column names so there are no characters or spaces.
We’ll use the make.names()
function on the column names. This function makes syntactically valid names out of character vectors.
colnames(df_xl) <- make.names(colnames(df_xl) )
Check it
Alright, that’s a bit better.
Still, some oddness in the names but that’s because enters were replaced with periods.
We need to get the exact name of the Year column.
colnames(df_xl)
## [1] "Fiscal.Year......7.1.6.30" "Accessions"
## [3] "Autopsies" "Exam.inations"
## [5] "Other.Cases" "TOTAL"
## [7] "Cremations" "X..incl.crem"
## [9] "Homicides" "Suicide"
## [11] "Accidents" "Undetermined"
## [13] "ALL" "U.20"
## [15] "U.17" "SIDS"
## [17] "Clinicals"
Copy Fiscal.Year......7.1.6.30
and paste it into `colnames(dataframe_name)[colnames(dataframe_name) == ‘ColumnNameToBeChanged’] <- ‘NewColumnName’
colnames(df_xl)[colnames(df_xl) == 'Fiscal.Year......7.1.6.30'] <- 'Year'
Check it
colnames(df_xl)
## [1] "Year" "Accessions" "Autopsies" "Exam.inations"
## [5] "Other.Cases" "TOTAL" "Cremations" "X..incl.crem"
## [9] "Homicides" "Suicide" "Accidents" "Undetermined"
## [13] "ALL" "U.20" "U.17" "SIDS"
## [17] "Clinicals"
Fix the other names if you want. I’m going to leave them as is for now.
Scroll down to the bottom of the data.
Not clean yet. There are a bunch of NA
s.
That might give us some issues later on so let’s take care of it now.
Easiest way to get rid of NAs is to subset or filter out the NA
s based on one column.
Let’s use the Year
column.
There are two ways: subset()
or filter()
df_xl <- subset(df_xl, !is.na(Year))
df_xl <- filter(df_xl, !is.na(Year))
What’s the difference? Personal preference for this one. There’s not much difference.
No NAs
at the bottom.
It took a few lines of code but the data’s cleaned up enough to analyze or visualize with.