An exercise in cleaning data with R.
We’ll be using OpenRefine in conjunction with R.
We’ll be replicating the analysis that went into the story: The bars where drivers got drunk before OUI arrests by Dan Adams and Matt Rocheleau.
Note: This data is property of The Boston Globe and is meant to be used for research purposes only and not for publication.
Let’s load the data in and take a look at what we’re working with.
bars <- read.csv("https://docs.google.com/spreadsheets/d/15TJZhuh7jKoO9fDr8LIcBr8T3vf_sUJudBD9L5o0rDM/pub?output=csv", stringsAsFactors=F)
library(knitr)
kable(head(bars,10))
Restaurant.name | offense_date | citstzip | staddress | comments | ins_date | X | seq_01 | citation |
---|---|---|---|---|---|---|---|---|
Bell & Anchor Restaurant | 1/1/2013 | Great Barrington, MA 01230 | | | 17-Jan-13 | NA | 1 | R0590639 |
Bella Notte | 1/1/2013 | Bernardston, MA 01337 | | | 08-Jan-13 | NA | 1 | R1126876 |
Belle Notte | 1/1/2013 | Bernardston, MA 01337 | 199 Huckle Hill Road | | 10-May-13 | NA | 1 | R1848185 |
Main Street Bar & Grille | 1/1/2013 | Plymouth, MA 02360 | Main Street | | 19-Jul-13 | NA | NA | |
Majestic Dragon | 1/1/2013 | Ipswich | Rte 1A | | | NA | NA | |
Naga | 1/1/2013 | Cambridge, Ma. | | | 17-Sep-14 | NA | NA | |
LOS RECUERDOS | 1/2/2013 | CHELSEA, MA. 02150 | 14-16A FIFTH STREET | | 24-Jun-14 | NA | NA | |
43 Church | 1/3/2013 | Salem, MA 01970 | | | 07-Jan-13 | NA | 1 | R0893843 |
Gallagher’s | 1/3/2013 | Worcester, MA 01608 | | | 30-Jan-13 | NA | NA | |
Slades Bar & Grill | 1/3/2013 | Roxbury Crossing, MA 02120 | 958 Tremont St | | 04-Oct-13 | NA | NA | |
This is definitely one of those problematic data sets.
Restaurant names are inconsistent, and sometimes there’s a dot or a star where there’s no data. Instead of NA it might be “n/a”.
The addresses are inconsistent, too. Sometimes the state is abbreviated, sometimes it’s spelled out. Sometimes it’s just a road.
I don’t have a lot of confidence in the consistency of the names, either.
Let’s take a peek.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
restaurants <- bars %>%
group_by(Restaurant.name) %>%
summarize(Total.Incidents=n()) %>%
arrange(desc(Total.Incidents))
kable(head(restaurants,10))
Restaurant.name | Total.Incidents |
---|---|
Private | 164 |
NONE | 54 |
private | 47 |
 | 26 |
private residence | 26 |
Private Res. | 23 |
pvt. residence | 20 |
British Beer Company | 19 |
99 Restaurant | 18 |
The Grog | 18 |
It looks like there are multiple ways to spell out private residences, among probably many more off spellings, casings, or abbreviations.
Count how many different restaurant names there are with the command length(unique(bars$Restaurant.name)).
So, 3029. Let’s see if we can make that number smaller.
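Before running that on the full data, it helps to see how counting unique values works on a toy vector (the names below just stand in for entries from bars$Restaurant.name):

```r
# Toy vector with inconsistent spellings, standing in for bars$Restaurant.name
names_vec <- c("Private", "private", "Private Res.", "The Grog", "The Grog")

# How many distinct strings are there?
length(unique(names_vec))
# [1] 4

# Tally incidents per spelling, most frequent first
sort(table(names_vec), decreasing = TRUE)
```

"Private" and "private" count as two distinct names here, which is exactly the kind of inflation we're about to fix.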
We’re going to use a package called rrefine.
It’s pretty much a wrapper for OpenRefine, which used to be known as Google Refine.
First, let’s install and open OpenRefine. It runs in a browser on your own machine (you’ll need to make sure you’ve got the latest version of Java installed).
# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
## Skipping install of 'rrefine' from a github remote, the SHA1 (3f8f813b) has not changed since last install.
## Use `force = TRUE` to force installation
library(rrefine)
# You'll write the CSV to your computer
write.csv(bars, file = "data/bars.csv", row.names = FALSE)
# And then upload that file to the OpenRefine program
refine_upload(file = "data/bars.csv", project.name = "bars_cleanup", open.browser = TRUE)
This will take a minute or two to load, depending on the size of the file.
But you’ll see a new tab pop open pointing at 127.0.0.1:3333 with the project name at the top.
Click on that.
Alright, from here, let’s clean things up a bit.
Jump right in and click on the blue arrow next to the Restaurant.name column header and select Edit cells > Cluster and edit….
OpenRefine uses a handful of methods and keying functions to detect when a string might match another.
It clusters them together; check the Merge? box next to a cluster, and the strings on the left will be turned into the string on the right under New Cell Value. You can manually change that value if you like, but OpenRefine guesses the one you’ll most likely choose by picking the variant with the most instances.
Go down through the 440 clusters one by one and merge them.
You’ll see that Toby Keith’s could also be called Toby Keith’s I Love this Bar and Grill.
So fill that in with the full-length name when you come across it.
Others: Fridays = TGI Friday’s.
Takes forever, right?
You could click the Select All button on the bottom right, if you want.
But that’s the benefit of using OpenRefine, there’s more control and you can see patterns that the machine can’t.
We’ll go over how to approximate this with a function, but once again, it won’t be as exact as you are with this method.
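To get a sense of what the clustering is doing under the hood, here's a rough sketch of a keying function in the spirit of OpenRefine's "fingerprint" method (this is a simplified approximation, not OpenRefine's actual implementation): lowercase the string, strip punctuation, sort the words, and glue them back together. Names that produce the same key are candidates for merging.

```r
# A rough take on a "fingerprint" key: lowercase, drop punctuation,
# split into words, sort and de-duplicate them, and glue back together
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  sapply(strsplit(x, "\\s+"),
         function(words) paste(sort(unique(words)), collapse = " "))
}

fingerprint(c("Private Res.", "res private", "PRIVATE RES"))
# All three collapse to the key "private res", so they'd cluster together
```

The human advantage OpenRefine preserves is the Merge? checkbox: a key collision is only a suggestion, and you decide whether the cluster is really the same place.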
Go through all the Methods and Distance Functions in the pulldown menus at the top to ensure you’ve got everything covered.
You have a lot of data so it’ll take a long time to cluster through each pulldown.
Let’s go further by clicking the triangle next to Restaurant.name and choosing Facet > Text facet.
This way you can look at the names alphabetically to see if you missed anything.
If you want to change one of them, just hover over the name and click on the edit link.
And the text will be updated and the count will be refreshed.
This way you’ll spot more obvious similar names than the algorithm did.
It’s a long process. I got partway through the B’s before I stopped.
Let’s see how far along we’ve gotten, though.
Bring the project back into R.
# I had to rename this project because running this script creates a new project, and bringing it back in confuses the package when there are two "bars_cleanup" projects
# In short, use the uncommented version of the command below and ignore the one with "bars_cleanup2", since that won't work for you
# bars_clean <- refine_export(project.name = "bars_cleanup")
bars_clean <- refine_export(project.name = "bars_cleanup2")
length(unique(bars_clean$Restaurant.name))
## [1] 2746
Nice.
We’ve eliminated nearly 300 duplicate names.
Another way to do it would be if you had a clean list of the official names of all the restaurants in Massachusetts.
Then you could use the adist()
function to figure out which bad version of a name approximates the good version.
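A minimal sketch of that approach, assuming a hypothetical clean_names vector of official names: adist() returns a matrix of edit distances between every messy/clean pair, and which.min per row picks the closest official name.

```r
# Hypothetical official list; the messy names mimic entries in the data
clean_names <- c("Private Residence", "British Beer Company", "The Grog")
messy_names <- c("pvt. residence", "British Beer Co", "the grog")

# Edit distance between every messy/clean pair (case-insensitive)
d <- adist(messy_names, clean_names, ignore.case = TRUE)

# For each messy name, take the clean name with the smallest distance
clean_names[apply(d, 1, which.min)]
# [1] "Private Residence" "British Beer Company" "The Grog"
```

In practice you’d also want a distance cutoff, so a name with no good match isn’t forced onto the nearest official one.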
We need to try to extract the town names from the citstzip column.
That’s going to be very difficult.
head(bars_clean)
## # A tibble: 6 × 9
## Restaurant.name offense_date citstzip
## <chr> <chr> <chr>
## 1 Bell & Anchor Restaurant 1/1/2013 Great Barrington, MA 01230
## 2 Bella Notte 1/1/2013 Bernardston, MA 01337
## 3 Belle Notte 1/1/2013 Bernardston, MA 01337
## 4 Main Street Bar & Grille 1/1/2013 Plymouth, MA 02360
## 5 Majestic Dragon 1/1/2013 Ipswich
## 6 NAGA 1/1/2013 Cambridge, Ma.
## # ... with 6 more variables: staddress <chr>, comments <chr>,
## # ins_date <chr>, X <chr>, seq_01 <int>, citation <chr>
Ok, it looks like we’ve got Town, State and Zipcode at most. Sometimes it’s just the Town. Sometimes it’s Town and State.
We want Town, State, and Zipcode to stand alone in their own columns.
So, let’s do this.
bars_clean$zip <- gsub("[[:alpha:]]", "", bars_clean$citstzip)
head(bars_clean$zip)
## [1] " , 01230" ", 01337" ", 01337" ", 02360" "" ", ."
Close, but we still have to get rid of the spaces, commas, and periods.
bars_clean$zip <- gsub(",", "", bars_clean$zip)
bars_clean$zip <- gsub(" ", "", bars_clean$zip)
bars_clean$zip <- gsub(".", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("*", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("/", "", bars_clean$zip, fixed=TRUE)
head(bars_clean$zip)
## [1] "01230" "01337" "01337" "02360" "" ""
Nice.
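As an aside, those five gsub() calls can be collapsed into one: a negated character class, [^0-9], removes everything that isn’t a digit in a single pass. A quick sketch on sample strings:

```r
# Sample citstzip-style strings
citstzip <- c("Great Barrington, MA 01230", "Ipswich", "Cambridge, Ma.")

# Strip everything that isn't a digit in one pass
zip <- gsub("[^0-9]", "", citstzip)
zip
# [1] "01230" ""      ""
```

The chained version is easier to reason about step by step, though, which matters when you come back to the script later.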
Alright, let’s try to guess what state the addresses have.
bars_clean$state <- ""
# Dots in the patterns are escaped so they match literal periods, not any character
bars_clean$state <- ifelse(grepl(" RI", bars_clean$citstzip, ignore.case = TRUE), "Rhode Island", bars_clean$state)
bars_clean$state <- ifelse(grepl(" MA", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("M\\.A\\.", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl(",MA", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("MA\\.", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl(" NH", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl("N\\.H\\.", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl("New Hampshire", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl(" N H", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl(" CT", bars_clean$citstzip, ignore.case = TRUE), "Connecticut", bars_clean$state)
bars_clean$state <- ifelse(grepl("Rhode Island", bars_clean$citstzip, ignore.case = TRUE), "Rhode Island", bars_clean$state)
bars_clean$state <- ifelse(grepl(" VT", bars_clean$citstzip, ignore.case = TRUE), "Vermont", bars_clean$state)
There’s no clear way to do this.
Just set up some rules and see if they work.
If some rows are missed, add more rules until all of them are caught.
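One way to keep that pile of rules manageable is a named vector of patterns plus a loop, so adding a rule is one line instead of a whole ifelse(). A sketch on sample strings (the patterns here are illustrative, with dots escaped so they match literal periods):

```r
# Patterns mapped to full state names; later matches overwrite earlier ones
state_rules <- c(" RI"       = "Rhode Island",
                 " MA"       = "Massachusetts",
                 "M\\.A\\."  = "Massachusetts",
                 " NH"       = "New Hampshire",
                 "N\\.H\\."  = "New Hampshire",
                 " CT"       = "Connecticut",
                 " VT"       = "Vermont")

citstzip <- c("Salem, MA 01970", "Nashua, N.H.", "Providence, RI")
state <- rep("", length(citstzip))
for (pattern in names(state_rules)) {
  hit <- grepl(pattern, citstzip, ignore.case = TRUE)
  state[hit] <- state_rules[[pattern]]
}
state
# [1] "Massachusetts" "New Hampshire" "Rhode Island"
```

The output is the same as the repeated-ifelse() version; the gain is that each new edge case you discover becomes a one-line addition to state_rules.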
Finally, the town column.
bars_clean$town <- gsub(",.*", "", bars_clean$citstzip)
bars_clean$town <- gsub(" MA.*", "", bars_clean$town, ignore.case=T)
bars_clean$town <- gsub("S.", "South ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("N.", "North ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("W.", "West ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("E.", "East ", bars_clean$town, fixed=TRUE)
library(stringr)
bars_clean$town <- str_to_title(bars_clean$town)
head(bars_clean[c("zip", "state", "town")],10)
## # A tibble: 10 × 3
## zip state town
## <chr> <chr> <chr>
## 1 01230 Massachusetts Great Barrington
## 2 01337 Massachusetts Bernardston
## 3 01337 Massachusetts Bernardston
## 4 02360 Massachusetts Plymouth
## 5 Ipswich
## 6 Massachusetts Cambridge
## 7 02150 Massachusetts Chelsea
## 8 01970 Massachusetts Salem
## 9 01608 Massachusetts Worcester
## 10 02120 Massachusetts Roxbury Crossing
Not bad!
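One caveat on the fixed = TRUE replacements above: they match the letter-dot pair anywhere in the string, not just at the start of the town name. Anchoring the pattern with ^ and requiring a trailing space is safer. A sketch on sample towns:

```r
towns <- c("S. Boston", "N. Adams", "W. Springfield", "E. Bridgewater")

# Anchor each abbreviation to the start of the string so an "S." buried
# inside a longer name can't be touched
towns <- sub("^S\\. ", "South ", towns)
towns <- sub("^N\\. ", "North ", towns)
towns <- sub("^W\\. ", "West ",  towns)
towns <- sub("^E\\. ", "East ",  towns)

towns
# [1] "South Boston" "North Adams" "West Springfield" "East Bridgewater"
```

sub() replaces only the first match, which is all we need when the pattern is anchored to the start anyway.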
It’ll take a lot of time, but now we can see the clear steps of what we did.
If you’re really set on power cleaning data, then I suggest mastering Regular Expressions.
We’ve used them a few times; for example, gsub("[[:alpha:]]", "", bars_clean$citstzip)
got rid of any letters in a string, leaving behind numbers and special characters.
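The POSIX character classes each target one kind of character, so swapping the class changes what gets stripped. A quick sketch on a sample string:

```r
x <- "Great Barrington, MA 01230"

gsub("[[:alpha:]]", "", x)  # drop letters, keep spaces, punctuation, digits
gsub("[[:digit:]]", "", x)  # drop digits -> "Great Barrington, MA "
gsub("[[:punct:]]", "", x)  # drop punctuation -> "Great Barrington MA 01230"
```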
There are many great resources. Check this one out, if interested.