Bringing two data sets together might be one of the most important processes in a data journalist’s tool kit. Joining data can reveal patterns and provide meaningful insight.
It takes creative thinking to come up with a data set to compare another to. It takes additional skill to prepare the two data sets to join successfully.
Joining data is a tough process and an even tougher concept to process at first.
But when you’re successful, it’s one of the most satisfying feelings ever.
We’ll be learning how to join data by replicating part of the data analysis process used in the Pulitzer-award winning investigative series from the Bristol Herald Courier by reporter Daniel Gilbert.
One reporter at a small local newspaper was able to dig into a story that had a large impact on its readers, revealing mismanagement of natural-gas royalties owed to thousands of land owners in southwest Virginia, spurring state lawmaker action.
If you want to skip ahead to the R code, you can do so via the table of contents on the left.
It comes down to royalties for leasing mineral rights to oil and gas companies.
Who owns it?
It used to be that whoever owned the land where the well was owned all the rights to the oil, no matter where the oil technically was.
That inevitably lead to some hard feelings.
That lead to the idea of “compulsory pooling”.
The Virginia Gas and Oil Act was passed in 1990 as way to develop the state’s coalbed methane without having to deal with who actually owns the land it came from. A form of Eminent Domain called force pooling allowed gas companies to pull the gas on land they didn’t have permission of.
If they couldn’t find out who land owners or the property was under dispute, the royalties were put in a state escrow.
Gilbert attended a hearing of the Virginia Gas and Oil Board and met landowners who were incensed that corporations were draining their gas against their will and not paying them for the intrusion. It wasn’t a new issue. “Two prior managing editors had spiked the story,” he recalled. “Royalties, methane gas, escrow accounts– it’s not the sexiest story.”
With these facts alone, he could have written a stellar story giving voice to citizens’ complaints, and shining a light on a little-known regulatory agency.
That, in many newsrooms, would have been plenty for a “he said/she said” story.
But Gilbert wanted to use data to dig deeper to determine the scope of the issue.
Are landowners not getting paid for gas extracted from wells on their land?
Whenever a well produced natural gas, the energy company was supposed to make a monthly payment into a corresponding escrow account.
Companies report monthly gas production numbers to the state for an online database. The escrow account is maintained by Wachovia Bank, which reports monthly statements to the Division of Gas and Oil.
So, there were two data sets. One reporting production figures per well and the other reporting payments made per well.
Match the production records with the payment schedules to see who had– and had not– been paid.
After FOIAing the data, Gilbert received spreadsheets with thousands of rows.
He started with one month’s worth of data. He looked in one spreadsheet for a well and then looked in another to see if that well existed in the other.
One by one.
Control-f
Control-f
Control-f
At this rate it was going to take months. He convinced his editor to send him to an IRE data bootcamp to learn how to do it more efficiently.
Gilbert spent weeks making sure the names of the gas wells on the escrow statement matched the well names on the production report.
This required 1) cleaning the data so it could successfully join and 2) joining the data
Often the data was off by a single white space, by a hyphen, or by a combination of letters.
He was combining and comparing the data to answer for the following goal:
Reveal the accounts that correspond to where oil or gas has been produced, but royalties have not been paid.
It took him weeks to clean and compare years of data. But we’re just going to look at one month.
But first…
Like the very first GIF above, joins work when the joints are lined up perfectly.
Sometimes, there are gaps in the join– and that’s OK!
Here’s what that would look like.
Imagine that these are two data sets. The black canvas and the sand. The join is the glue.
When the sand is thrown on the canvas, the sand that matches the glue stays on the canvas, creating a new data set– that doge painting.
The leftover black canvas data that didn’t have glue and thus didn’t match with the sand was left blank after the join. Also, the sand that didn’t match up with the glue simply fell down on the ground.
In R, this would be the equivalent of left_join()
in R with the dplyr package.
The black canvas would be NA
rows and the sand that doesn’t match would just exist anymore.
If you did full_join()
with R instead, the leftover sand would be appended to the bottom of the data frame.
There are a whole bunch of other types of joins that we won’t get into now, but you can look them up if you want.
inner_join()
semi_join()
anti_join()
Select January and 2009. Let the page reload.
It will download a file called ‘frmPrint.aspx’ – find it on your computer and rename it to ‘well_production.xls’
If you want, you can download this entire working directory on Github if you’d like to follow along that way.
We’re going to use the readxl package to handle importing Excel files and the dplyr package for joining.
library(readxl)
library(dplyr)
library(knitr)
# Importing the data
pay <- read_excel("data/1_January 09_Wachovia.xls")
prod <- read_excel("data/well_production.xls")
kable(head(pay))
Line | Unit ID | Operator | VGOB ID | Accounting Period | Beg Mkt Val | Additions | Income | Distributions | End Mkt Val |
---|---|---|---|---|---|---|---|---|---|
1 | EH18 | APPALACHIAN ENERGY | 4 | 2009-01-012009-01-31 | 33935.31 | 195.53 | 1.66 | 0 | 34132.50 |
2 | AA10 | CNX | 12 | 2009-01-012009-01-31 | 19903.15 | 0.00 | 0.97 | 0 | 19904.12 |
3 | X10 | CNX | 14 | 2009-01-012009-01-31 | 167836.09 | 0.00 | 8.18 | 0 | 167844.27 |
4 | X11 | CNX | 17 | 2009-01-012009-01-31 | 140848.11 | 0.00 | 6.86 | 0 | 140854.97 |
5 | Y9 | CNX | 31 | 2009-01-012009-01-31 | 284973.27 | 0.00 | 13.89 | 0 | 284987.16 |
6 | AA8 | CNX | 32 | 2009-01-012009-01-31 | 44449.86 | 0.00 | 2.17 | 0 | 44452.03 |
kable(head(prod))
NA | NA | NA | NA | NA | NA |
---|---|---|---|---|---|
Company | File# | Operation Name | County | Month | Production |
Appalachian Energy | A05BU | EH-142 | BUCHANAN | 01/2009 | 0.000000 |
Appalachian Energy | A12BU | EH-135 | BUCHANAN | 01/2009 | 87.000000 |
Appalachian Energy | A14BU | EH-62 | BUCHANAN | 01/2009 | 118.000000 |
Appalachian Energy | A32BU | EH-141 | BUCHANAN | 01/2009 | 27.000000 |
Appalachian Energy | A38BU | EH-59 | BUCHANAN | 01/2009 | 221.000000 |
This is what the data looks like. This is just the first five rows of the two data sets.
If you use View(pay)
and View(prod)
, you can scroll around and notice some oddities.
Data rarely comes in pristine format. For example, in the second data set, there are no column names. Those appear in the first row instead.
We gotta do some work.
# prod's column names are in row 1, so fix it
colnames(prod) <- prod[1,]
# now delete the first row of prod
prod <- prod[-1,]
# Let's get rid of any other NAs in the data frame
prod <- filter(prod, !is.na(County))
# After row 734 in pay, the rest of the data is summary. We don't need that.
# This say keeps the pay dataframe specifically from rows 1 through 734
pay <- pay[1:734,]
kable(head(pay))
Line | Unit ID | Operator | VGOB ID | Accounting Period | Beg Mkt Val | Additions | Income | Distributions | End Mkt Val |
---|---|---|---|---|---|---|---|---|---|
1 | EH18 | APPALACHIAN ENERGY | 4 | 2009-01-012009-01-31 | 33935.31 | 195.53 | 1.66 | 0 | 34132.50 |
2 | AA10 | CNX | 12 | 2009-01-012009-01-31 | 19903.15 | 0.00 | 0.97 | 0 | 19904.12 |
3 | X10 | CNX | 14 | 2009-01-012009-01-31 | 167836.09 | 0.00 | 8.18 | 0 | 167844.27 |
4 | X11 | CNX | 17 | 2009-01-012009-01-31 | 140848.11 | 0.00 | 6.86 | 0 | 140854.97 |
5 | Y9 | CNX | 31 | 2009-01-012009-01-31 | 284973.27 | 0.00 | 13.89 | 0 | 284987.16 |
6 | AA8 | CNX | 32 | 2009-01-012009-01-31 | 44449.86 | 0.00 | 2.17 | 0 | 44452.03 |
kable(head(prod))
Company | File# | Operation Name | County | Month | Production |
---|---|---|---|---|---|
Appalachian Energy | A05BU | EH-142 | BUCHANAN | 01/2009 | 0.000000 |
Appalachian Energy | A12BU | EH-135 | BUCHANAN | 01/2009 | 87.000000 |
Appalachian Energy | A14BU | EH-62 | BUCHANAN | 01/2009 | 118.000000 |
Appalachian Energy | A32BU | EH-141 | BUCHANAN | 01/2009 | 27.000000 |
Appalachian Energy | A38BU | EH-59 | BUCHANAN | 01/2009 | 221.000000 |
Appalachian Energy | A39BU | EH-58 | BUCHANAN | 01/2009 | 221.000000 |
Which column do we want to join at?
It appears the “Unit ID” column in pay
and “Operation Name” in prod
are the well IDs we want to match.
However, you can see that there’s an extra “-” in the prod version.
If you explore the rest of the data in pay
and prod
with View()
you’ll see similar issues.
We need to strip out any special characters such as dashes and spaces so that we’re just left with the letters and the numbers. That will ensure better-quality matches.
We’ll use the gsub()
function which is from Base R (as in you don’t need to bring in a special library).
# First, we need to prep the data we want to join
# Take out the special characters so all we're left with are the letters and numbers
# gsub() is pretty much look for "pattern" with "something else" within dataframe$column
# We'll use regex "[[:punct:]]" and create a new column in each data frame called join
prod$join <- gsub("[[:punct:]]", "", prod$`Operation Name`)
pay$join <- gsub("[[:punct:]]", "", pay$`Unit ID`)
kable(head(prod))
Company | File# | Operation Name | County | Month | Production | join |
---|---|---|---|---|---|---|
Appalachian Energy | A05BU | EH-142 | BUCHANAN | 01/2009 | 0.000000 | EH142 |
Appalachian Energy | A12BU | EH-135 | BUCHANAN | 01/2009 | 87.000000 | EH135 |
Appalachian Energy | A14BU | EH-62 | BUCHANAN | 01/2009 | 118.000000 | EH62 |
Appalachian Energy | A32BU | EH-141 | BUCHANAN | 01/2009 | 27.000000 | EH141 |
Appalachian Energy | A38BU | EH-59 | BUCHANAN | 01/2009 | 221.000000 | EH59 |
Appalachian Energy | A39BU | EH-58 | BUCHANAN | 01/2009 | 221.000000 | EH58 |
kable(head(pay))
Line | Unit ID | Operator | VGOB ID | Accounting Period | Beg Mkt Val | Additions | Income | Distributions | End Mkt Val | join |
---|---|---|---|---|---|---|---|---|---|---|
1 | EH18 | APPALACHIAN ENERGY | 4 | 2009-01-012009-01-31 | 33935.31 | 195.53 | 1.66 | 0 | 34132.50 | EH18 |
2 | AA10 | CNX | 12 | 2009-01-012009-01-31 | 19903.15 | 0.00 | 0.97 | 0 | 19904.12 | AA10 |
3 | X10 | CNX | 14 | 2009-01-012009-01-31 | 167836.09 | 0.00 | 8.18 | 0 | 167844.27 | X10 |
4 | X11 | CNX | 17 | 2009-01-012009-01-31 | 140848.11 | 0.00 | 6.86 | 0 | 140854.97 | X11 |
5 | Y9 | CNX | 31 | 2009-01-012009-01-31 | 284973.27 | 0.00 | 13.89 | 0 | 284987.16 | Y9 |
6 | AA8 | CNX | 32 | 2009-01-012009-01-31 | 44449.86 | 0.00 | 2.17 | 0 | 44452.03 | AA8 |
We’ve created a new column called join in each data frame.
This is the glue.
Let’s make the pay
dataframe the canvas and the sand will be the production
dataframe. We’ll use the left_join()
function from dplyr
# Let's join with pay as the basis dataframe
# And select just a handful of columns we care about
pay_check <- left_join(pay, prod) %>%
select(`Unit ID`, Operator, Income, join, Production)
library(DT)
datatable(pay_check)
Alright, some success.
There are about 734 wells listed in the pay
data set.
You can see in the blank rows (like from 2-10) that the rows on the right half are blank. That means there wasn’t any data in prod
that matched the join data.
Let’s narrow it down further on just the data that successfully joined.
# Let's focus on the ones that successfully joined
pay_check <- filter(pay_check, !is.na(Production))
datatable(pay_check)
That’s interesting.
Based on a month of data with cursory cleaning of columns, we see that there were 156 wells that produced oil.
But if you look in the income column, a good number of wells received $0.
How many?
total <- pay_check %>%
filter(Income==0) %>%
summarize(Total=n())
total
## # A tibble: 1 × 1
## Total
## <int>
## 1 37
That means 24 percent of the wells on peoples’ property listed as producing oil did not get the money they should have in January of 2009.
That’s in line with what Gilbert’s reporting found– which was that of about 750 active individual accounts in the escrow, between 22 percent and 55 percent received no royalty payments.
And this was just a cursory cleaning of the data.
In an actual investigation, a reporter would make sure there were no instances data didn’t join when it should have.
The story doesn’t end after the data analysis. Gilbert had to talk to sources and officials to figure out the why.
Why did this happen?