An exercise in cleaning data with R.

We’ll be using OpenRefine in conjunction with R.

We’ll be replicating the analysis that went into the story “The bars where drivers got drunk before OUI arrests” by Dan Adams and Matt Rocheleau.

Note: This data is property of The Boston Globe and is meant to be used for research purposes only and not for publication.

Let’s load the data in and take a look at what we’re working with.

bars <- read.csv("https://docs.google.com/spreadsheets/d/15TJZhuh7jKoO9fDr8LIcBr8T3vf_sUJudBD9L5o0rDM/pub?output=csv", stringsAsFactors=F)
library(knitr)
kable(head(bars,10))
| Restaurant.name | offense_date | citstzip | staddress | comments | ins_date | X | seq_01 | citation |
|:---|:---|:---|:---|:---|:---|:---|---:|:---|
| Bell & Anchor Restaurant | 1/1/2013 | Great Barrington, MA 01230 | | | 17-Jan-13 | NA | 1 | R0590639 |
| Bella Notte | 1/1/2013 | Bernardston, MA 01337 | | | 08-Jan-13 | NA | 1 | R1126876 |
| Belle Notte | 1/1/2013 | Bernardston, MA 01337 | 199 Huckle Hill Road | | 10-May-13 | NA | 1 | R1848185 |
| Main Street Bar & Grille | 1/1/2013 | Plymouth, MA 02360 | Main Street | | 19-Jul-13 | NA | NA | |
| Majestic Dragon | 1/1/2013 | Ipswich | Rte 1A | | | NA | NA | |
| Naga | 1/1/2013 | Cambridge, Ma. | | | 17-Sep-14 | NA | NA | |
| LOS RECUERDOS | 1/2/2013 | CHELSEA, MA. 02150 | 14-16A FIFTH STREET | | 24-Jun-14 | NA | NA | |
| 43 Church | 1/3/2013 | Salem, MA 01970 | | | 07-Jan-13 | NA | 1 | R0893843 |
| Gallagher’s | 1/3/2013 | Worcester, MA 01608 | | | 30-Jan-13 | NA | NA | |
| Slades Bar & Grill | 1/3/2013 | Roxbury Crossing, MA 02120 | 958 Tremont St | | 04-Oct-13 | NA | NA | |
This is definitely one of those problematic data sets.

Restaurant names are inconsistent, sometimes there’s a dot or a star where there’s no data, and missing values sometimes show up as “n/a” instead of NA.

The addresses are inconsistent. Sometimes the state is abbreviated, sometimes it’s spelled out. Sometimes it’s just a road.

I don’t have a lot of confidence in the consistency in the names, either.

Let’s take a peek.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
restaurants <- bars %>%
  group_by(Restaurant.name) %>%
  summarize(Total.Incidents=n()) %>%
  arrange(desc(Total.Incidents))

kable(head(restaurants,10))
| Restaurant.name | Total.Incidents |
|:---|---:|
| Private | 164 |
| NONE | 54 |
| private | 47 |
| | 26 |
| private residence | 26 |
| Private Res. | 23 |
| pvt. residence | 20 |
| British Beer Company | 19 |
| 99 Restaurant | 18 |
| The Grog | 18 |
It looks like there are multiple ways to refer to private residences, and there are probably many more alternate spellings, casings, and abbreviations lurking in the data.

Count how many different restaurant names there are with the command length(unique(bars$Restaurant.name)).

So, 3029. Let’s see if we can make that number smaller.
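Before we reach for fuzzy matching, a cheap first pass is simple normalization: trimming whitespace, lower-casing, and stripping punctuation collapses variants like “Private” and “private” for free. This normalize_name() helper is a hypothetical pre-cleaning step, not part of the original workflow:

```r
# Hypothetical pre-cleaning helper: collapse trivial variants
# before any fuzzy matching.
normalize_name <- function(x) {
  x <- trimws(x)                  # strip leading/trailing spaces
  x <- tolower(x)                 # fold case: "Private" == "private"
  gsub("[[:punct:]]", "", x)      # drop periods, commas, apostrophes
}

normalize_name(c("Private", " private ", "Private Res."))
# "private" "private" "private res"
```

This alone won’t merge “private res” with “private residence”, but it shrinks the problem before the heavier tools come out.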

Fuzzy matching

We’re going to use a package called rrefine.

It’s pretty much a wrapper for OpenRefine, which used to be known as Google Refine.

First, let’s install and open OpenRefine. It runs locally in your browser. (You’ll need to make sure you’ve got the latest version of Java installed.)

# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
## Skipping install of 'rrefine' from a github remote, the SHA1 (3f8f813b) has not changed since last install.
##   Use `force = TRUE` to force installation
library(rrefine)

# You'll write the CSV to your computer
write.csv(bars, file = "data/bars.csv", row.names = FALSE)

# And then upload that file to the OpenRefine program
refine_upload(file = "data/bars.csv", project.name = "bars_cleanup", open.browser = TRUE)

This will take a minute or two to load, depending on the size of the file.

But you’ll see a new tab pop open pointing at 127.0.0.1:3333 with the project name at the top.

Click on that project name link.

Alright, from here, let’s clean things up a bit.

Jump right in and click on the blue arrow next to the Restaurant.name column header and select Edit cells > Cluster and edit….

Cluster and edit

You’ll get this window

OpenRefine uses a handful of methods and keying functions to detect when a string might match another.

It clusters them together; when you check the Merge? box next to a cluster, the strings on the left are replaced with the string shown under New Cell Value. You could manually change that value if you like, but OpenRefine guesses the one you’ll most likely choose by picking the spelling with the most instances.
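OpenRefine’s default keying method, “fingerprint”, can be sketched in R to show the idea: lowercase the string, strip punctuation, split it into tokens, then sort and de-duplicate the tokens. Strings that share a fingerprint land in the same cluster. This is a simplified sketch; the real implementation also normalizes accented characters, among other refinements:

```r
# Simplified sketch of OpenRefine's "fingerprint" keying method:
# strings that reduce to the same key would be clustered together.
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)       # remove punctuation
  tokens <- strsplit(x, "\\s+")         # split on whitespace
  sapply(tokens, function(t) paste(sort(unique(t)), collapse = " "))
}

fingerprint(c("Gallagher's Bar", "BAR gallaghers", "Gallaghers Bar."))
# all three reduce to "bar gallaghers", so they'd fall into one cluster
```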

Go down through the 440 clusters one by one and merge them.

You’ll see that Toby Keith’s could also be called Toby Keith’s I Love this Bar and Grill

So fill that in with the full-length name when you come across it.

Others: Fridays = TGI Friday’s.

Takes forever, right?

You could click the Select All button on the bottom right, if you want.

But that’s the benefit of using OpenRefine: you have more control and can see patterns that the machine can’t.

We’ll go over how to do this with a function later, but once again, it won’t be as exacting as doing it by hand like this.

Go through all the Methods and Distance Functions in the pulldown menus at the top to ensure you’ve got everything covered.

You have a lot of data, so working through the clusters in each pulldown will take a long time.

Another level of cleaning in OpenRefine

Let’s go further by selecting the triangle next to Restaurant.name and choosing Facet > Text facet.

This groups up the names.

This way you can look at the names alphabetically to see if you missed anything.

If you want to change one of them, just hover over the name and click on the edit link.

Type in what you want.

And the text will be updated and the count will be refreshed.

This way you’ll spot obvious similar names that the algorithm missed.

It’s a long process. I got partway through the B’s before I stopped.

Let’s see how far along we’ve gotten, though.

Bring the project back into R.

# I had to rename this project because re-running this script creates a second "bars_cleanup" project, and two projects with the same name confuse the package.

# In short, use the commented-out line with "bars_cleanup" below and ignore the "bars_cleanup2" version, since that won't work for you.

# bars_clean <- refine_export(project.name = "bars_cleanup")

bars_clean <- refine_export(project.name = "bars_cleanup2")

length(unique(bars_clean$Restaurant.name))
## [1] 2746

Nice.

We’ve consolidated nearly 300 name variants (3,029 down to 2,746).

Another way to do it would be if you had a clean list for the official name of all the restaurants in Massachusetts.

Then you could use the adist() function to figure out which bad version of a name best approximates the good version of the name.
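To make that concrete, here’s a sketch with adist() and a made-up reference list (clean_names here is invented for illustration, not a real registry):

```r
# Sketch: match messy names to a clean reference list with adist().
# "clean_names" is a made-up list for illustration only.
clean_names <- c("Private Residence", "British Beer Company", "The Grog")
messy <- c("pvt. residence", "Brit Beer Co", "the grog")

# adist() returns a matrix of edit distances:
# one row per messy name, one column per clean name.
d <- adist(messy, clean_names, ignore.case = TRUE)

# For each messy name, pick the clean name with the smallest distance
clean_names[apply(d, 1, which.min)]
# "Private Residence" "British Beer Company" "The Grog"
```

In practice you’d also want a distance cutoff, so a name with no good match isn’t forced onto the nearest bad one.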

Splitting up the address column

We need to try to extract the town names from the citstzip column.

That’s going to be very difficult.

head(bars_clean)
## # A tibble: 6 × 9
##            Restaurant.name offense_date                   citstzip
##                      <chr>        <chr>                      <chr>
## 1 Bell & Anchor Restaurant     1/1/2013 Great Barrington, MA 01230
## 2              Bella Notte     1/1/2013      Bernardston, MA 01337
## 3              Belle Notte     1/1/2013      Bernardston, MA 01337
## 4 Main Street Bar & Grille     1/1/2013         Plymouth, MA 02360
## 5          Majestic Dragon     1/1/2013                    Ipswich
## 6                     NAGA     1/1/2013             Cambridge, Ma.
## # ... with 6 more variables: staddress <chr>, comments <chr>,
## #   ins_date <chr>, X <chr>, seq_01 <int>, citation <chr>

Ok, it looks like we’ve got Town, State and Zipcode at most. Sometimes it’s just the Town. Sometimes it’s Town and State.

We want Town, State, and Zipcode to stand alone in their own columns.

So, let’s do this.

bars_clean$zip <- gsub("[[:alpha:]]", "", bars_clean$citstzip)
head(bars_clean$zip)
## [1] " ,  01230" ",  01337"  ",  01337"  ",  02360"  ""          ", ."

Close, but we still have to get rid of the spaces, commas, periods, and other stray characters.

bars_clean$zip <- gsub(",", "", bars_clean$zip)
bars_clean$zip <- gsub(" ", "", bars_clean$zip)
bars_clean$zip <- gsub(".", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("*", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("/", "", bars_clean$zip, fixed=TRUE)

head(bars_clean$zip)
## [1] "01230" "01337" "01337" "02360" ""      ""

Nice.
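By the way, the chain of gsub() calls above can be collapsed into a single negated character class that removes everything except digits. A minimal equivalent:

```r
# One-step version of the zip cleanup above:
# remove every character that is not a digit 0-9.
zips <- c(" ,  01230", ",  01337", "", ", .", "*02360/")
gsub("[^0-9]", "", zips)
# "01230" "01337" ""      ""      "02360"
```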

Alright, let’s try to guess what state the addresses have.

bars_clean$state <- ""
bars_clean$state <- ifelse(grepl(" RI", bars_clean$citstzip,  ignore.case = TRUE), "Rhode Island", bars_clean$state)

bars_clean$state <- ifelse(grepl(" MA", bars_clean$citstzip,  ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("M\\.A\\.", bars_clean$citstzip,  ignore.case = TRUE), "Massachusetts", bars_clean$state)

bars_clean$state <- ifelse(grepl(",MA", bars_clean$citstzip,  ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("MA\\.", bars_clean$citstzip,  ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl(" NH", bars_clean$citstzip,  ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl("N\\.H\\.", bars_clean$citstzip,  ignore.case = TRUE), "New Hampshire", bars_clean$state)

bars_clean$state <- ifelse(grepl("New Hampshire", bars_clean$citstzip,  ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl(" N H", bars_clean$citstzip,  ignore.case = TRUE), "New Hampshire", bars_clean$state)

bars_clean$state <- ifelse(grepl(" CT", bars_clean$citstzip,  ignore.case = TRUE), "Connecticut", bars_clean$state)
bars_clean$state <- ifelse(grepl("Rhode Island", bars_clean$citstzip,  ignore.case = TRUE), "Rhode Island", bars_clean$state)

bars_clean$state <- ifelse(grepl(" VT", bars_clean$citstzip,  ignore.case = TRUE), "Vermont", bars_clean$state)

There’s no clear way to do this.

Just set up some rules and see if it works.

If it missed some, then add more rules until all of them are caught.
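The chain of ifelse() calls can also be written with dplyr’s case_when(). One caveat: in the ifelse() chain above, later assignments overwrite earlier ones, while in case_when() the first matching condition wins, so order the patterns accordingly. A sketch, with the rules condensed into combined patterns:

```r
library(dplyr)

# case_when() version of the state-guessing rules: the first
# matching condition wins, so more specific patterns come first.
guess_state <- function(citstzip) {
  case_when(
    grepl(" RI|Rhode Island",       citstzip, ignore.case = TRUE) ~ "Rhode Island",
    grepl(" NH| N H|New Hampshire", citstzip, ignore.case = TRUE) ~ "New Hampshire",
    grepl(" CT",                    citstzip, ignore.case = TRUE) ~ "Connecticut",
    grepl(" VT",                    citstzip, ignore.case = TRUE) ~ "Vermont",
    grepl(" MA|,MA",                citstzip, ignore.case = TRUE) ~ "Massachusetts",
    TRUE ~ ""
  )
}

guess_state(c("Great Barrington, MA 01230", "Nashua, NH", "Ipswich"))
# "Massachusetts" "New Hampshire" ""
```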

Finally, the town column.

bars_clean$town <- gsub(",.*", "", bars_clean$citstzip)
bars_clean$town <- gsub(" MA.*", "", bars_clean$town, ignore.case=T)
bars_clean$town <- gsub("S.", "South ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("N.", "North ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("W.", "West ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("E.", "East ", bars_clean$town, fixed=TRUE)

library(stringr)

bars_clean$town <- str_to_title(bars_clean$town)

head(bars_clean[c("zip", "state", "town")],10)
## # A tibble: 10 × 3
##      zip         state             town
##    <chr>         <chr>            <chr>
## 1  01230 Massachusetts Great Barrington
## 2  01337 Massachusetts      Bernardston
## 3  01337 Massachusetts      Bernardston
## 4  02360 Massachusetts         Plymouth
## 5                               Ipswich
## 6        Massachusetts        Cambridge
## 7  02150 Massachusetts          Chelsea
## 8  01970 Massachusetts            Salem
## 9  01608 Massachusetts        Worcester
## 10 02120 Massachusetts Roxbury Crossing

Not bad!

It takes a lot of time, but now every step we took is spelled out and reproducible.

If you’re really set on power cleaning data, then I suggest mastering Regular Expressions.

We’ve used them a few times already; for example, gsub("[[:alpha:]]", "", bars_clean$citstzip) got rid of any letters in a string, leaving behind numbers and special characters.
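For reference, here are the POSIX character classes this walkthrough leans on, applied to one sample address:

```r
# A few POSIX character classes side by side:
x <- "Plymouth, MA 02360"

gsub("[[:alpha:]]", "", x)   # drop letters      -> ",  02360"
gsub("[[:digit:]]", "", x)   # drop digits       -> "Plymouth, MA "
gsub("[[:punct:]]", "", x)   # drop punctuation  -> "Plymouth MA 02360"
```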

There are many great resources online if you want to dive deeper.