Pick a data set data on Connecticut’s Open Data Portal and interview someone at the originating agency about the background of the data.
Explain in your paper:
Length: 600-1,000 words
DUE: Before class on March 8
Think of the code in your script as a concise way to explain to me and the computer what you intend to find.
An analysis is not just jumping from A to Z. There’s A to B to C to D to E, etc.
Each step builds upon the success of the previous step.
So each question I ask can be answered by a series of queries, you just have to deconstruct it.
There will be quite a few “write-this-query” from scratch questions. A couple dozen examles are listed below.
But since you have a cheat sheet, it shouldn’t be too difficult to pull that off.
Here are some other example questions and example answers.
Example:
Assume the data above is just a glimpse at a large data set of Boston 311 calls. Which neighborhood gets the most calls about Loud Parties/Music/People?
Partial query
boston_calls %>%
filter(TYPE=="Loud Parties/Music/People") %>%
summarize(...) %>%
...(desc(...)) %>%
head(10)
Replace the … with code.
Answer
boston_calls %>%
filter(TYPE=="Loud Parties/Music/People") %>%
summarize(total=n()) %>%
arrange(desc(total)) %>%
head(10)
Instead of writing a query to solve my question, read the code and figure out what the question I’m asking is.
This is still based on the Boston 311 calls data from above.
library(lubridate)
boston_calls$open_dt <- ymd_hms(boston_calls$open_dt)
summary <- boston_calls %>%
mutate(Year=year(boston_calls$open_dt)) %>%
group_by(neighborhood, TYPE, Year) %>%
summarize(total=n()) %>%
arrange(neighborhood, TYPE, Year)
An acceptable answer (notice how every line in the query is referred to)
Figure out the number of complaints over time by neighborhood and by type and year and sort it by neighborhood, type, and year.
These answers would not get full points.
How many complaints there are by neighborhood and type?
The answer above did not mention the year aspect of the analysis nor how it was sorted.
Two data sets with similar topics. Police-involved shootings in San Francisco and in Dallas.
Make the case that one data set is better than another.
Provide 3 concrete examples of stories that are possible to research and tell because of the better data set’s structure.
Answer
Dallas data set is better because it includes separate fields for race of the suspect and police officer, and a column for whether the suspect waas armed.
I’ll start with a query and list out the code but it won’t work.
This is based on the 311 calls data from above.
Find the 5 neighborhoods with the most complaints
The result should be
## # A tibble: 5 × 2
## neighborhood Total
## <chr> <int>
## 1 Jamaica Plain 13
## 2 Downtown / Financial District 12
## 3 South Boston / South Boston Waterfront 11
## 4 Roxbury 8
## 5 Dorchester 6
But I’m getting
## # A tibble: 5 × 2
## neighborhood Total
## <chr> <int>
## 1 South Boston 1
## 2 Mission Hill 3
## 3 Roslindale 3
## 4 Allston / Brighton 4
## 5 Beacon Hill 4
This is the query.
calls %>%
group_by(neighborhood) %>%
summarize(Total=n()) %>%
arrange(Total) %>%
head(5)
What’s wrong?
Fix it.
Answer
It’s being sorted but not sorted in descending order.
calls %>%
group_by(neighborhood) %>%
summarize(Total=n()) %>%
arrange(desc(Total)) %>%
head(5)
A walkthrough of questions and the matching R and dplyr commands to answer them.
Try to answer each question to the best of your ability before checking for what the code is. Then see how you were off if you were. That’s the most effective way to make the connection.
Some previous tutorials to get you to this point
Import this data from the Boston city portal into R.
payroll <- read.csv("http://andrewbatran.com/ccsu-2017/slides/class6/bostonpayroll2013.csv", stringsAsFactors=F)
There are many, but we’re using one in particular for data wrangling.
library(dplyr)
# Version 1
str(payroll)
## 'data.frame': 22297 obs. of 16 variables:
## $ NAME : chr "Darosa, Baltazar" "Creaven, Jacqueline D" "Cawley, Stephen C" "Danilecki, John H" ...
## $ TITLE : chr "Police Officer" "Police Lieutenant" "Police Lieutenant" "Police Captain" ...
## $ DEPARTMENT : chr "Boston Police Department" "Boston Police Department" "Boston Police Department" "Boston Police Department" ...
## $ REGULAR : chr "$66,829" "$104,661" "$44,632" "$121,244" ...
## $ RETRO : chr "$0" "$0" "$0" "$0" ...
## $ OTHER : chr "$209,237" "$79,248" "$216,037" "$15,663" ...
## $ OVERTIME : chr "$10,473" "$47,572" "$10,158" "$71,198" ...
## $ INJURED : chr "$0" "$0" "$0" "$0" ...
## $ DETAIL : chr "$629" "$33,918" "$4,046" "$41,943" ...
## $ QUINN : chr "$6,724" "$23,721" "$6,031" "$15,156" ...
## $ TOTAL.EARNINGS: chr "$293,892" "$289,121" "$280,904" "$265,203" ...
## $ Community : chr "Brockton" "Boston" "Quincy" "Walpole" ...
## $ ZIPCode : int 2302 2132 2169 2081 2135 2186 2339 2176 1702 2043 ...
## $ State : chr "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
## $ X : chr "" "" "" "" ...
## $ X.1 : chr "" "" "" "" ...
# Version 2
glimpse(payroll)
## Observations: 22,297
## Variables: 16
## $ NAME <chr> "Darosa, Baltazar", "Creaven, Jacqueline D", "C...
## $ TITLE <chr> "Police Officer", "Police Lieutenant", "Police ...
## $ DEPARTMENT <chr> "Boston Police Department", "Boston Police Depa...
## $ REGULAR <chr> "$66,829", "$104,661", "$44,632", "$121,244", "...
## $ RETRO <chr> "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",...
## $ OTHER <chr> "$209,237", "$79,248", "$216,037", "$15,663", "...
## $ OVERTIME <chr> "$10,473", "$47,572", "$10,158", "$71,198", "$8...
## $ INJURED <chr> "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",...
## $ DETAIL <chr> "$629", "$33,918", "$4,046", "$41,943", "$41,84...
## $ QUINN <chr> "$6,724", "$23,721", "$6,031", "$15,156", "$12,...
## $ TOTAL.EARNINGS <chr> "$293,892", "$289,121", "$280,904", "$265,203",...
## $ Community <chr> "Brockton", "Boston", "Quincy", "Walpole", "Bos...
## $ ZIPCode <int> 2302, 2132, 2169, 2081, 2135, 2186, 2339, 2176,...
## $ State <chr> "Massachusetts", "Massachusetts", "Massachusett...
## $ X <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
## $ X.1 <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
# option 1
nrow(payroll)
## [1] 22297
# option 2
payroll %>% summarize(count=n())
## count
## 1 22297
ncol(payroll)
## [1] 16
colnames(payroll)
## [1] "NAME" "TITLE" "DEPARTMENT" "REGULAR"
## [5] "RETRO" "OTHER" "OVERTIME" "INJURED"
## [9] "DETAIL" "QUINN" "TOTAL.EARNINGS" "Community"
## [13] "ZIPCode" "State" "X" "X.1"
head(payroll)
## NAME TITLE
## 1 Darosa, Baltazar Police Officer
## 2 Creaven, Jacqueline D Police Lieutenant
## 3 Cawley, Stephen C Police Lieutenant
## 4 Danilecki, John H Police Captain
## 5 Kervin, Timothy M. Police Lieutenant/Hdq Dispatch
## 6 Hosein, Haseeb Police Lieutenant
## DEPARTMENT REGULAR RETRO OTHER OVERTIME INJURED
## 1 Boston Police Department $66,829 $0 $209,237 $10,473 $0
## 2 Boston Police Department $104,661 $0 $79,248 $47,572 $0
## 3 Boston Police Department $44,632 $0 $216,037 $10,158 $0
## 4 Boston Police Department $121,244 $0 $15,663 $71,198 $0
## 5 Boston Police Department $104,588 $0 $13,494 $88,906 $0
## 6 Boston Police Department $105,261 $0 $12,883 $47,437 $0
## DETAIL QUINN TOTAL.EARNINGS Community ZIPCode State X X.1
## 1 $629 $6,724 $293,892 Brockton 2302 Massachusetts
## 2 $33,918 $23,721 $289,121 Boston 2132 Massachusetts
## 3 $4,046 $6,031 $280,904 Quincy 2169 Massachusetts
## 4 $41,943 $15,156 $265,203 Walpole 2081 Massachusetts
## 5 $41,840 $12,988 $261,815 Boston 2135 Massachusetts
## 6 $76,310 $13,129 $255,020 Milton 2186 Massachusetts
head(payroll, 3)
## NAME TITLE DEPARTMENT
## 1 Darosa, Baltazar Police Officer Boston Police Department
## 2 Creaven, Jacqueline D Police Lieutenant Boston Police Department
## 3 Cawley, Stephen C Police Lieutenant Boston Police Department
## REGULAR RETRO OTHER OVERTIME INJURED DETAIL QUINN TOTAL.EARNINGS
## 1 $66,829 $0 $209,237 $10,473 $0 $629 $6,724 $293,892
## 2 $104,661 $0 $79,248 $47,572 $0 $33,918 $23,721 $289,121
## 3 $44,632 $0 $216,037 $10,158 $0 $4,046 $6,031 $280,904
## Community ZIPCode State X X.1
## 1 Brockton 2302 Massachusetts
## 2 Boston 2132 Massachusetts
## 3 Quincy 2169 Massachusetts
Transform the data set by taking out the dollar signs ($) and commas (,).
Use the functions gsub()
and as.numeric()
.
Show the structure of the dataframe to verify.
payroll$REGULAR <- gsub("\\$", "", payroll$REGULAR)
payroll$REGULAR <- gsub(",", "", payroll$REGULAR)
payroll$REGULAR <- as.numeric(payroll$REGULAR)
payroll$RETRO <- gsub("\\$", "", payroll$RETRO)
payroll$RETRO <- gsub(",", "", payroll$RETRO)
payroll$RETRO <- as.numeric(payroll$RETRO)
payroll$OTHER <- gsub("\\$", "", payroll$OTHER)
payroll$OTHER <- gsub(",", "", payroll$OTHER)
payroll$OTHER <- as.numeric(payroll$OTHER)
payroll$OTHER <- gsub("\\$", "", payroll$OTHER)
payroll$OTHER <- gsub(",", "", payroll$OTHER)
payroll$OTHER <- as.numeric(payroll$OTHER)
payroll$OVERTIME <- gsub("\\$", "", payroll$OVERTIME)
payroll$OVERTIME <- gsub(",", "", payroll$OVERTIME)
payroll$OVERTIME <- as.numeric(payroll$OVERTIME)
payroll$INJURED <- gsub("\\$", "", payroll$INJURED)
payroll$INJURED <- gsub(",", "", payroll$INJURED)
payroll$INJURED <- as.numeric(payroll$INJURED)
payroll$DETAIL <- gsub("\\$", "", payroll$DETAIL)
payroll$DETAIL <- gsub(",", "", payroll$DETAIL)
payroll$DETAIL <- as.numeric(payroll$DETAIL)
payroll$QUINN <- gsub("\\$", "", payroll$QUINN)
payroll$QUINN <- gsub(",", "", payroll$QUINN)
payroll$QUINN <- as.numeric(payroll$QUINN)
payroll$TOTAL.EARNINGS <- gsub("\\$", "", payroll$TOTAL.EARNINGS)
payroll$TOTAL.EARNINGS <- gsub(",", "", payroll$TOTAL.EARNINGS)
payroll$TOTAL.EARNINGS <- as.numeric(payroll$TOTAL.EARNINGS)
str(payroll)
## 'data.frame': 22297 obs. of 16 variables:
## $ NAME : chr "Darosa, Baltazar" "Creaven, Jacqueline D" "Cawley, Stephen C" "Danilecki, John H" ...
## $ TITLE : chr "Police Officer" "Police Lieutenant" "Police Lieutenant" "Police Captain" ...
## $ DEPARTMENT : chr "Boston Police Department" "Boston Police Department" "Boston Police Department" "Boston Police Department" ...
## $ REGULAR : num 66829 104661 44632 121244 104588 ...
## $ RETRO : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OTHER : num 209237 79248 216037 15663 13494 ...
## $ OVERTIME : num 10473 47572 10158 71198 88906 ...
## $ INJURED : num 0 0 0 0 0 0 0 0 0 0 ...
## $ DETAIL : num 629 33918 4046 41943 41840 ...
## $ QUINN : num 6724 23721 6031 15156 12988 ...
## $ TOTAL.EARNINGS: num 293892 289121 280904 265203 261815 ...
## $ Community : chr "Brockton" "Boston" "Quincy" "Walpole" ...
## $ ZIPCode : int 2302 2132 2169 2081 2135 2186 2339 2176 1702 2043 ...
## $ State : chr "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
## $ X : chr "" "" "" "" ...
## $ X.1 : chr "" "" "" "" ...
payroll %>%
arrange(desc(OVERTIME)) %>%
head(1)
## NAME TITLE DEPARTMENT
## 1 Stratton, Michael J Police Sergeant (Det) Boston Police Department
## REGULAR RETRO OTHER OVERTIME INJURED DETAIL QUINN TOTAL.EARNINGS
## 1 86675 0 9586 123810 7965 2244 4732 235013
## Community ZIPCode State X X.1
## 1 Hopkinton 1748 Massachusetts
Michael, a police sergeant detective in the Boston police.
Show just the name, title, and total earnings columns of the top 10.
payroll %>%
arrange(desc(TOTAL.EARNINGS)) %>%
select(NAME, TITLE, TOTAL.EARNINGS) %>%
head(10)
## NAME TITLE TOTAL.EARNINGS
## 1 Darosa, Baltazar Police Officer 293892
## 2 Creaven, Jacqueline D Police Lieutenant 289121
## 3 Cawley, Stephen C Police Lieutenant 280904
## 4 Danilecki, John H Police Captain 265203
## 5 Kervin, Timothy M. Police Lieutenant/Hdq Dispatch 261815
## 6 Hosein, Haseeb Police Lieutenant 255020
## 7 Lee, Thomas F Police Captain/DDC 252543
## 8 Assad, Mark L Police Sergeant (Det) 247964
## 9 Eversley, Eric V Police Lieutenant (Det) 247201
## 10 Crossen, Patrick J Police Captain/DDC 245734
Show just the name, title, and total earnings columns of the top 5 in that department.
payroll %>%
filter(DEPARTMENT=="Mayor's Office") %>%
arrange(desc(TOTAL.EARNINGS)) %>%
select(NAME, TITLE, TOTAL.EARNINGS) %>%
head(5)
## NAME TITLE TOTAL.EARNINGS
## 1 Menino, Thomas M. Mayor 175000
## 2 Weiss, Mitchell Chief Of Staff 149136
## 3 Kineavy, Michael J. Chief Policy & Planning 147812
## 4 Harrington, Patrick S Special Assistant 123000
## 5 Leibowitz, Howard Richard Special Assistant 119437
You’ll need to create a new column adding them up before filtering and sorting.
Note: Exclude officers who made no money from overtime and detail.
payroll %>%
filter(DEPARTMENT=="Boston Police Department") %>%
mutate(ot_det= OVERTIME+DETAIL) %>%
filter(ot_det > 0) %>%
arrange(ot_det) %>%
select(NAME, TITLE, ot_det) %>%
head(5)
## NAME TITLE ot_det
## 1 Molina, Yajaira Lee Police Clerk And Typist 32
## 2 Colon Jr., Johnny Police Clerk And Typist 54
## 3 McDonough, James F. Jr Building Custodian 56
## 4 Fish, Michael A. PoliceSergeant/SupvCourtCases 71
## 5 Artesani, Gregory J Police Clerk And Typist 88
payroll %>%
summarize(overtime_total=sum(OVERTIME))
## overtime_total
## 1 90805141
Create a new dataframe called avg_df
by aggregating the data.
Which 5 departments make the most (total) money on average?
avg_df <- payroll %>%
group_by(DEPARTMENT) %>%
summarize(total=mean(TOTAL.EARNINGS, na.rm=T), ot=mean(OVERTIME, na.rm=T)) %>%
arrange(desc(total))
head(avg_df, 5)
## # A tibble: 5 × 3
## DEPARTMENT total ot
## <chr> <dbl> <dbl>
## 1 Boston Fire Department 105210.08 12059.925
## 2 Boston Police Department 99814.27 18639.451
## 3 Administration and Finance 81830.75 0.000
## 4 Women's Commission 79272.00 0.000
## 5 Dpt of Innovation & Technology 73597.83 1583.129
# Version 1
payroll %>%
group_by(DEPARTMENT) %>%
summarize(employees=n()) %>%
arrange(desc(employees)) %>%
head(10)
## # A tibble: 10 × 2
## DEPARTMENT employees
## <chr> <int>
## 1 BPS 13266
## 2 Boston Police Department 3074
## 3 Boston Fire Department 1687
## 4 Boston Cntr - Youth & Families 618
## 5 Boston Public Library 563
## 6 Public Works Department 464
## 7 Transportation Department 372
## 8 Parks Department 300
## 9 Inspectional Services Dept 237
## 10 Property Management 226
# Version 2
payroll %>%
group_by(DEPARTMENT) %>%
tally(sort=TRUE) %>%
head(10)
## # A tibble: 10 × 2
## DEPARTMENT n
## <chr> <int>
## 1 BPS 13266
## 2 Boston Police Department 3074
## 3 Boston Fire Department 1687
## 4 Boston Cntr - Youth & Families 618
## 5 Boston Public Library 563
## 6 Public Works Department 464
## 7 Transportation Department 372
## 8 Parks Department 300
## 9 Inspectional Services Dept 237
## 10 Property Management 226
As in total earnings.
payroll %>%
filter(DEPARTMENT=="Elderly Commission" | DEPARTMENT=="Transportation Department") %>%
group_by(DEPARTMENT) %>%
summarize(max_pay=max(TOTAL.EARNINGS))
## # A tibble: 2 × 2
## DEPARTMENT max_pay
## <chr> <dbl>
## 1 Elderly Commission 91539
## 2 Transportation Department 149832
Based on total earnings.
payroll %>%
filter(TITLE=="Admin Secretary" & DEPARTMENT=="Registry Division") %>%
summarize(avg_pay=mean(TOTAL.EARNINGS))
## avg_pay
## 1 51766
And what is their title? Based on total pay.
Store the result in the most_paid
data frame and show the first 10 rows.
most_paid <- payroll %>%
group_by(DEPARTMENT) %>%
mutate(rank=rank(desc(TOTAL.EARNINGS))) %>%
filter(rank==1) %>%
select(NAME, DEPARTMENT, TOTAL.EARNINGS)
head(most_paid, 10)
## Source: local data frame [10 x 3]
## Groups: DEPARTMENT [10]
##
## NAME DEPARTMENT TOTAL.EARNINGS
## <chr> <chr> <dbl>
## 1 Darosa, Baltazar Boston Police Department 293892
## 2 Quinlan, Richard Boston Fire Department 222698
## 3 Bonds, Jonathan D BPS 221687
## 4 Ryan, Amy E Boston Public Library 196626
## 5 Menino, Thomas M. Mayor's Office 175000
## 6 Donovan, John Dept of Voter Mobilization 162275
## 7 Oates, William Dpt of Innovation & Technology 154875
## 8 Weenick, Meredith Treasury-Treasury Division 153238
## 9 Weise, Susan M. Law Department 150890
## 10 Bartosiak, Michael A. Property Management 149990
Save it to a employee_titles
data frame object.
Show 10 rows with the most department titles.
employee_titles <- payroll %>%
group_by(DEPARTMENT) %>%
summarize(total_employees=n(), total_titles=n_distinct(TITLE)) %>%
arrange(desc(total_titles))
head(employee_titles,10)
## # A tibble: 10 × 3
## DEPARTMENT total_employees total_titles
## <chr> <int> <int>
## 1 BPS 13266 297
## 2 Boston Police Department 3074 177
## 3 Boston Fire Department 1687 138
## 4 Boston Public Library 563 138
## 5 Public Works Department 464 91
## 6 Boston Cntr - Youth & Families 618 84
## 7 Neighborhood Development 163 74
## 8 Property Management 226 69
## 9 Inspectional Services Dept 237 60
## 10 Parks Department 300 60
Save it to a ot_diff
data frame object.
Show the first 10 rows sorted by smallest difference.
Note: ignore departments if average overtime pay is zero dollars.
ot_diff <- payroll %>%
group_by(DEPARTMENT) %>%
summarize(avg_ot=mean(OVERTIME), avg_total=mean(TOTAL.EARNINGS)) %>%
filter(avg_ot>0) %>%
mutate(diff=avg_total-avg_ot) %>%
arrange(diff)
head(ot_diff,10)
## # A tibble: 10 × 4
## DEPARTMENT avg_ot avg_total diff
## <chr> <dbl> <dbl> <dbl>
## 1 Boston Cntr - Youth & Families 441.68608 32563.09 32121.41
## 2 Parks Department 2965.80667 38624.69 35658.88
## 3 Boston City Council 7.91129 36201.37 36193.46
## 4 Mayor's Office-Public Info 266.89286 38658.54 38391.64
## 5 Dept of Voter Mobilization 8596.69231 49046.54 40449.85
## 6 Public Works Department 6776.05388 47588.57 40812.52
## 7 Boston Public Library 1169.52398 42725.11 41555.58
## 8 Elderly Commission 341.34667 42366.16 42024.81
## 9 Cemetery Division 8100.50000 50319.81 42219.31
## 10 Transportation Department 3645.13441 48442.69 44797.55
payroll %>%
filter(DEPARTMENT=="Boston Police Department") %>%
summarize(OT_Total=sum(OVERTIME, na.rm=T), Total=sum(TOTAL.EARNINGS, na.rm=T)) %>%
mutate(Percent_OT=OT_Total/Total*100)
## OT_Total Total Percent_OT
## 1 57297673 306829064 18.67413