An exercise in cleaning data with R.
We’ll be using OpenRefine in conjunction with R.
We’ll be replicating the analysis that went into the story: The bars where drivers got drunk before OUI arrests by Dan Adams and Matt Rocheleau.
Note: This data is property of The Boston Globe and is meant to be used for research purposes only and not for publication.
Let’s load the data in and take a look at what we’re working with.
bars <- read.csv("https://docs.google.com/spreadsheets/d/15TJZhuh7jKoO9fDr8LIcBr8T3vf_sUJudBD9L5o0rDM/pub?output=csv", stringsAsFactors=F)
library(knitr)
kable(head(bars,10))
Restaurant.name | offense_date | citstzip | staddress | comments | ins_date | X | seq_01 | citation |
---|---|---|---|---|---|---|---|---|
Bell & Anchor Restaurant | 1/1/2013 | Great Barrington, MA 01230 | | | 17-Jan-13 | NA | 1 | R0590639 |
Bella Notte | 1/1/2013 | Bernardston, MA 01337 | | | 08-Jan-13 | NA | 1 | R1126876 |
Belle Notte | 1/1/2013 | Bernardston, MA 01337 | 199 Huckle Hill Road | | 10-May-13 | NA | 1 | R1848185 |
Main Street Bar & Grille | 1/1/2013 | Plymouth, MA 02360 | Main Street | | 19-Jul-13 | NA | NA | |
Majestic Dragon | 1/1/2013 | Ipswich | Rte 1A | | | NA | NA | |
Naga | 1/1/2013 | Cambridge, Ma. | | | 17-Sep-14 | NA | NA | |
LOS RECUERDOS | 1/2/2013 | CHELSEA, MA. 02150 | 14-16A FIFTH STREET | | 24-Jun-14 | NA | NA | |
43 Church | 1/3/2013 | Salem, MA 01970 | | | 07-Jan-13 | NA | 1 | R0893843 |
Gallagher’s | 1/3/2013 | Worcester, MA 01608 | | | 30-Jan-13 | NA | NA | |
Slades Bar & Grill | 1/3/2013 | Roxbury Crossing, MA 02120 | 958 Tremont St | | 04-Oct-13 | NA | NA | |
This is definitely one of those problematic data sets.
Restaurant names are inconsistent, and sometimes there’s a dot or a star where there’s no data. Instead of NA it might be “n/a”.
The addresses are inconsistent, too. Sometimes the state is abbreviated, sometimes it’s spelled out. Sometimes it’s just a road.
I don’t have a lot of confidence in the consistency of the names, either.
Let’s take a peek.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
restaurants <- bars %>%
group_by(Restaurant.name) %>%
summarize(Total.Incidents=n()) %>%
arrange(desc(Total.Incidents))
kable(head(restaurants,10))
Restaurant.name | Total.Incidents |
---|---|
Private | 164 |
NONE | 54 |
private | 47 |
 | 26 |
private residence | 26 |
Private Res. | 23 |
pvt. residence | 20 |
British Beer Company | 19 |
99 Restaurant | 18 |
The Grog | 18 |
It looks like there are multiple ways to spell out private residences, among probably many more off spellings, casings, or abbreviations.
Count how many different restaurant names there are with the command length(unique(bars$Restaurant.name)).
So, 3029. Let’s see if we can make that number smaller.
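Before running that on the full data, it helps to see how counting unique values works on a toy vector (the names below just stand in for entries from bars$Restaurant.name):

```r
# Toy vector with inconsistent spellings, standing in for bars$Restaurant.name
names_vec <- c("Private", "private", "Private Res.", "The Grog", "The Grog")

# How many distinct strings are there?
length(unique(names_vec))
# [1] 4

# Tally incidents per spelling, most frequent first
sort(table(names_vec), decreasing = TRUE)
```

"Private" and "private" count as two distinct names here, which is exactly the kind of inflation we're about to fix.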
We’re going to use a package called rrefine.
It’s pretty much a wrapper for OpenRefine, which used to be known as Google Refine.
First, let’s install and open OpenRefine. It runs in a browser on your own machine (you’ll need to make sure you’ve got the latest version of Java installed).
# install.packages("devtools")
devtools::install_github("vpnagraj/rrefine")
## Skipping install of 'rrefine' from a github remote, the SHA1 (3f8f813b) has not changed since last install.
## Use `force = TRUE` to force installation
library(rrefine)
# You'll write the CSV to your computer
write.csv(bars, file = "data/bars.csv", row.names = FALSE)
# And then upload that file to the OpenRefine program
refine_upload(file = "data/bars.csv", project.name = "bars_cleanup", open.browser = TRUE)
This will take a minute or two to load, depending on the size of the file.
But you’ll see a new tab pop open pointing at 127.0.0.1:3333 with the project name at the top.
Click on that.
Alright, from here, let’s clean things up a bit.
Jump right in and click on the blue arrow next to the Restaurant.name column header and select Edit cells > Cluster and edit….
OpenRefine uses a handful of methods and keying functions to detect when a string might match another.
It clusters them together; check the Merge? box next to a cluster, and the strings on the left will be turned into the string on the right under New Cell Value. You can manually change that value if you like, but OpenRefine guesses the one you’ll most likely choose by picking the variant with the most instances.
Go down through the 440 clusters one by one and merge them.
You’ll see that Toby Keith’s could also be called Toby Keith’s I Love this Bar and Grill.
So fill that in with the full-length name when you come across it.
Others: Fridays = TGI Friday’s.
Takes forever, right?
You could click the Select All button on the bottom right, if you want.
But that’s the benefit of using OpenRefine, there’s more control and you can see patterns that the machine can’t.
We’ll go over how to approximate this with a function, but once again, it won’t be as exact as you are with this method.
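To get a sense of what the clustering is doing under the hood, here's a rough sketch of a keying function in the spirit of OpenRefine's "fingerprint" method (this is a simplified approximation, not OpenRefine's actual implementation): lowercase the string, strip punctuation, sort the words, and glue them back together. Names that produce the same key are candidates for merging.

```r
# A rough take on a "fingerprint" key: lowercase, drop punctuation,
# split into words, sort and de-duplicate them, and glue back together
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  sapply(strsplit(x, "\\s+"),
         function(words) paste(sort(unique(words)), collapse = " "))
}

fingerprint(c("Private Res.", "res private", "PRIVATE RES"))
# All three collapse to the key "private res", so they'd cluster together
```

The human advantage OpenRefine preserves is the Merge? checkbox: a key collision is only a suggestion, and you decide whether the cluster is really the same place.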
Go through all the Methods and Distance Functions in the pulldown menus at the top to ensure you’ve got everything covered.
You have a lot of data so it’ll take a long time to cluster through each pulldown.
Let’s go further by clicking the triangle next to Restaurant.name and choosing Facet > Text facet.
This way you can look at the names alphabetically to see if you missed anything.
If you want to change one of them, just hover over the name and click on the edit link.
And the text will be updated and the count will be refreshed.
This way you’ll spot more obvious similar names than the algorithm did.
It’s a long process. I got partway through the B’s before I stopped.
Let’s see how far along we’ve gotten, though.
Bring the project back into R.
# I had to rename this project because running this script creates a new project, and bringing it back in confuses the package when there are two "bars_cleanup" projects
# In short, use the uncommented version of the command below and ignore the one with "bars_cleanup2", since that won't work for you
# bars_clean <- refine_export(project.name = "bars_cleanup")
bars_clean <- refine_export(project.name = "bars_cleanup2")
length(unique(bars_clean$Restaurant.name))
## [1] 2746
Nice.
We’ve eliminated nearly 300 duplicate names.
Another way to do it would be if you had a clean list of the official names of all the restaurants in Massachusetts.
Then you could use the adist()
function to figure out which bad version of a name approximates the good version.
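A minimal sketch of that approach, assuming a hypothetical clean_names vector of official names: adist() returns a matrix of edit distances between every messy/clean pair, and which.min per row picks the closest official name.

```r
# Hypothetical official list; the messy names mimic entries in the data
clean_names <- c("Private Residence", "British Beer Company", "The Grog")
messy_names <- c("pvt. residence", "British Beer Co", "the grog")

# Edit distance between every messy/clean pair (case-insensitive)
d <- adist(messy_names, clean_names, ignore.case = TRUE)

# For each messy name, take the clean name with the smallest distance
clean_names[apply(d, 1, which.min)]
# [1] "Private Residence" "British Beer Company" "The Grog"
```

In practice you’d also want a distance cutoff, so a name with no good match isn’t forced onto the nearest official one.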
We need to try to extract the town names from the citstzip column.
That’s going to be very difficult.
head(bars_clean)
## # A tibble: 6 × 9
## Restaurant.name offense_date citstzip
## <chr> <chr> <chr>
## 1 Bell & Anchor Restaurant 1/1/2013 Great Barrington, MA 01230
## 2 Bella Notte 1/1/2013 Bernardston, MA 01337
## 3 Belle Notte 1/1/2013 Bernardston, MA 01337
## 4 Main Street Bar & Grille 1/1/2013 Plymouth, MA 02360
## 5 Majestic Dragon 1/1/2013 Ipswich
## 6 NAGA 1/1/2013 Cambridge, Ma.
## # ... with 6 more variables: staddress <chr>, comments <chr>,
## # ins_date <chr>, X <chr>, seq_01 <int>, citation <chr>
Ok, it looks like we’ve got Town, State and Zipcode at most. Sometimes it’s just the Town. Sometimes it’s Town and State.
We want Town, State, and Zipcode to stand alone in their own columns.
So, let’s do this.
bars_clean$zip <- gsub("[[:alpha:]]", "", bars_clean$citstzip)
head(bars_clean$zip)
## [1] " , 01230" ", 01337" ", 01337" ", 02360" "" ", ."
Close, but we still have to get rid of the spaces, commas, and periods.
bars_clean$zip <- gsub(",", "", bars_clean$zip)
bars_clean$zip <- gsub(" ", "", bars_clean$zip)
bars_clean$zip <- gsub(".", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("*", "", bars_clean$zip, fixed=TRUE)
bars_clean$zip <- gsub("/", "", bars_clean$zip, fixed=TRUE)
head(bars_clean$zip)
## [1] "01230" "01337" "01337" "02360" "" ""
Nice.
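As an aside, those five gsub() calls can be collapsed into one: a negated character class, [^0-9], removes everything that isn’t a digit in a single pass. A quick sketch on sample strings:

```r
# Sample citstzip-style strings
citstzip <- c("Great Barrington, MA 01230", "Ipswich", "Cambridge, Ma.")

# Strip everything that isn't a digit in one pass
zip <- gsub("[^0-9]", "", citstzip)
zip
# [1] "01230" ""      ""
```

The chained version is easier to reason about step by step, though, which matters when you come back to the script later.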
Alright, let’s try to guess what state the addresses have.
bars_clean$state <- ""
# Dots in the patterns are escaped so they match literal periods, not any character
bars_clean$state <- ifelse(grepl(" RI", bars_clean$citstzip, ignore.case = TRUE), "Rhode Island", bars_clean$state)
bars_clean$state <- ifelse(grepl(" MA", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("M\\.A\\.", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl(",MA", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl("MA\\.", bars_clean$citstzip, ignore.case = TRUE), "Massachusetts", bars_clean$state)
bars_clean$state <- ifelse(grepl(" NH", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl("N\\.H\\.", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl("New Hampshire", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl(" N H", bars_clean$citstzip, ignore.case = TRUE), "New Hampshire", bars_clean$state)
bars_clean$state <- ifelse(grepl(" CT", bars_clean$citstzip, ignore.case = TRUE), "Connecticut", bars_clean$state)
bars_clean$state <- ifelse(grepl("Rhode Island", bars_clean$citstzip, ignore.case = TRUE), "Rhode Island", bars_clean$state)
bars_clean$state <- ifelse(grepl(" VT", bars_clean$citstzip, ignore.case = TRUE), "Vermont", bars_clean$state)
There’s no clear way to do this.
Just set up some rules and see if they work.
If some rows are missed, add more rules until all of them are caught.
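One way to keep that pile of rules manageable is a named vector of patterns plus a loop, so adding a rule is one line instead of a whole ifelse(). A sketch on sample strings (the patterns here are illustrative, with dots escaped so they match literal periods):

```r
# Patterns mapped to full state names; later matches overwrite earlier ones
state_rules <- c(" RI"       = "Rhode Island",
                 " MA"       = "Massachusetts",
                 "M\\.A\\."  = "Massachusetts",
                 " NH"       = "New Hampshire",
                 "N\\.H\\."  = "New Hampshire",
                 " CT"       = "Connecticut",
                 " VT"       = "Vermont")

citstzip <- c("Salem, MA 01970", "Nashua, N.H.", "Providence, RI")
state <- rep("", length(citstzip))
for (pattern in names(state_rules)) {
  hit <- grepl(pattern, citstzip, ignore.case = TRUE)
  state[hit] <- state_rules[[pattern]]
}
state
# [1] "Massachusetts" "New Hampshire" "Rhode Island"
```

The output is the same as the repeated-ifelse() version; the gain is that each new edge case you discover becomes a one-line addition to state_rules.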
Finally, the town column.
bars_clean$town <- gsub(",.*", "", bars_clean$citstzip)
bars_clean$town <- gsub(" MA.*", "", bars_clean$town, ignore.case=T)
bars_clean$town <- gsub("S.", "South ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("N.", "North ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("W.", "West ", bars_clean$town, fixed=TRUE)
bars_clean$town <- gsub("E.", "East ", bars_clean$town, fixed=TRUE)
library(stringr)
bars_clean$town <- str_to_title(bars_clean$town)
head(bars_clean[c("zip", "state", "town")],10)
## # A tibble: 10 × 3
## zip state town
## <chr> <chr> <chr>
## 1 01230 Massachusetts Great Barrington
## 2 01337 Massachusetts Bernardston
## 3 01337 Massachusetts Bernardston
## 4 02360 Massachusetts Plymouth
## 5 Ipswich
## 6 Massachusetts Cambridge
## 7 02150 Massachusetts Chelsea
## 8 01970 Massachusetts Salem
## 9 01608 Massachusetts Worcester
## 10 02120 Massachusetts Roxbury Crossing
Not bad!
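One caveat on the fixed = TRUE replacements above: they match the letter-dot pair anywhere in the string, not just at the start of the town name. Anchoring the pattern with ^ and requiring a trailing space is safer. A sketch on sample towns:

```r
towns <- c("S. Boston", "N. Adams", "W. Springfield", "E. Bridgewater")

# Anchor each abbreviation to the start of the string so an "S." buried
# inside a longer name can't be touched
towns <- sub("^S\\. ", "South ", towns)
towns <- sub("^N\\. ", "North ", towns)
towns <- sub("^W\\. ", "West ",  towns)
towns <- sub("^E\\. ", "East ",  towns)

towns
# [1] "South Boston" "North Adams" "West Springfield" "East Bridgewater"
```

sub() replaces only the first match, which is all we need when the pattern is anchored to the start anyway.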
It’ll take a lot of time, but now we can see the clear steps of what we did.
If you’re really set on power cleaning data, then I suggest mastering Regular Expressions.
We’ve used them a few times; for example, gsub("[[:alpha:]]", "", bars_clean$citstzip)
got rid of any letters in a string, leaving behind numbers and special characters.
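The POSIX character classes each target one kind of character, so swapping the class changes what gets stripped. A quick sketch on a sample string:

```r
x <- "Great Barrington, MA 01230"

gsub("[[:alpha:]]", "", x)  # drop letters, keep spaces, punctuation, digits
gsub("[[:digit:]]", "", x)  # drop digits -> "Great Barrington, MA "
gsub("[[:punct:]]", "", x)  # drop punctuation -> "Great Barrington MA 01230"
```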
There are many great resources. Check this one out, if interested.