Midterm pt I (40 percent of grade)

Pick a data set data on Connecticut’s Open Data Portal and interview someone at the originating agency about the background of the data.

Explain in your paper:

Length: 600-1,000 words

DUE: Before class on March 8

Concepts as code

Think of the code in your script as a concise way to explain to me and the computer what you intend to find.

An analysis is not just jumping from A to Z. There’s A to B to C to D to E, etc.

Each step builds upon the success of the previous step.

So each question I ask can be answered by a series of queries, you just have to deconstruct it.

Midterm pt II example questions (60 percent of grade)

1. Answer these questions with R queries

There will be quite a few “write-this-query” from scratch questions. A couple dozen examles are listed below.

But since you have a cheat sheet, it shouldn’t be too difficult to pull that off.

Here are some other example questions and example answers.

2. Complete this query

  • You will be given a partial query and you must fill in the missing lines
  • You’ll be start with a narrative prompt, a sample of source data, and anticipated results

Example:

Assume the data above is just a glimpse at a large data set of Boston 311 calls. Which neighborhood gets the most calls about Loud Parties/Music/People?

Partial query

boston_calls %>%
filter(TYPE=="Loud Parties/Music/People") %>%
summarize(...) %>%
...(desc(...)) %>%
head(10)

Replace the with code.

Answer

boston_calls %>%
filter(TYPE=="Loud Parties/Music/People") %>%
summarize(total=n()) %>%
arrange(desc(total)) %>%
head(10)

3. Translate these lines of code into human language

Instead of writing a query to solve my question, read the code and figure out what the question I’m asking is.

This is still based on the Boston 311 calls data from above.

library(lubridate)
boston_calls$open_dt <- ymd_hms(boston_calls$open_dt)

summary <- boston_calls %>%
mutate(Year=year(boston_calls$open_dt)) %>%
group_by(neighborhood, TYPE, Year) %>%
summarize(total=n()) %>%
arrange(neighborhood, TYPE, Year)

An acceptable answer (notice how every line in the query is referred to)

Figure out the number of complaints over time by neighborhood and by type and year and sort it by neighborhood, type, and year.

These answers would not get full points.

How many complaints there are by neighborhood and type?

The answer above did not mention the year aspect of the analysis nor how it was sorted.

4. Explain why one data set is superior to another

Two data sets with similar topics. Police-involved shootings in San Francisco and in Dallas.

Make the case that one data set is better than another.

Provide 3 concrete examples of stories that are possible to research and tell because of the better data set’s structure.

Answer

Dallas data set is better because it includes separate fields for race of the suspect and police officer, and a column for whether the suspect waas armed.

  • Can filter the list to figure out officer-involved shootings in which black suspects were unarmed
  • Can summarize percent of those shot by race
  • Can figure out if Black officers shoot Cops supects at a higher or lesser rate than White cops

5. What’s wrong with my code?

I’ll start with a query and list out the code but it won’t work.

This is based on the 311 calls data from above.

Find the 5 neighborhoods with the most complaints

The result should be

## # A tibble: 5 × 2
##                             neighborhood Total
##                                    <chr> <int>
## 1                          Jamaica Plain    13
## 2          Downtown / Financial District    12
## 3 South Boston / South Boston Waterfront    11
## 4                                Roxbury     8
## 5                             Dorchester     6

But I’m getting

## # A tibble: 5 × 2
##         neighborhood Total
##                <chr> <int>
## 1       South Boston     1
## 2       Mission Hill     3
## 3         Roslindale     3
## 4 Allston / Brighton     4
## 5        Beacon Hill     4

This is the query.

calls %>%
  group_by(neighborhood) %>%
  summarize(Total=n()) %>%
  arrange(Total) %>%
  head(5)

What’s wrong?

Fix it.

Answer

It’s being sorted but not sorted in descending order.

calls %>%
  group_by(neighborhood) %>%
  summarize(Total=n()) %>%
  arrange(desc(Total)) %>%
  head(5) 

R query examples

A walkthrough of questions and the matching R and dplyr commands to answer them.

Try to answer each question to the best of your ability before checking for what the code is. Then see how you were off if you were. That’s the most effective way to make the connection.

Some previous tutorials to get you to this point

1. Import the Boston payroll data?

Import this data from the Boston city portal into R.

payroll <- read.csv("http://andrewbatran.com/ccsu-2017/slides/class6/bostonpayroll2013.csv", stringsAsFactors=F)

2. What library do you need to load?

There are many, but we’re using one in particular for data wrangling.

library(dplyr)

3. What is the structure?

# Version 1
str(payroll)
## 'data.frame':    22297 obs. of  16 variables:
##  $ NAME          : chr  "Darosa, Baltazar" "Creaven, Jacqueline D" "Cawley, Stephen C" "Danilecki, John H" ...
##  $ TITLE         : chr  "Police Officer" "Police Lieutenant" "Police Lieutenant" "Police Captain" ...
##  $ DEPARTMENT    : chr  "Boston Police Department" "Boston Police Department" "Boston Police Department" "Boston Police Department" ...
##  $ REGULAR       : chr  "$66,829" "$104,661" "$44,632" "$121,244" ...
##  $ RETRO         : chr  "$0" "$0" "$0" "$0" ...
##  $ OTHER         : chr  "$209,237" "$79,248" "$216,037" "$15,663" ...
##  $ OVERTIME      : chr  "$10,473" "$47,572" "$10,158" "$71,198" ...
##  $ INJURED       : chr  "$0" "$0" "$0" "$0" ...
##  $ DETAIL        : chr  "$629" "$33,918" "$4,046" "$41,943" ...
##  $ QUINN         : chr  "$6,724" "$23,721" "$6,031" "$15,156" ...
##  $ TOTAL.EARNINGS: chr  "$293,892" "$289,121" "$280,904" "$265,203" ...
##  $ Community     : chr  "Brockton" "Boston" "Quincy" "Walpole" ...
##  $ ZIPCode       : int  2302 2132 2169 2081 2135 2186 2339 2176 1702 2043 ...
##  $ State         : chr  "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
##  $ X             : chr  "" "" "" "" ...
##  $ X.1           : chr  "" "" "" "" ...
# Version 2
glimpse(payroll)
## Observations: 22,297
## Variables: 16
## $ NAME           <chr> "Darosa, Baltazar", "Creaven, Jacqueline D", "C...
## $ TITLE          <chr> "Police Officer", "Police Lieutenant", "Police ...
## $ DEPARTMENT     <chr> "Boston Police Department", "Boston Police Depa...
## $ REGULAR        <chr> "$66,829", "$104,661", "$44,632", "$121,244", "...
## $ RETRO          <chr> "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",...
## $ OTHER          <chr> "$209,237", "$79,248", "$216,037", "$15,663", "...
## $ OVERTIME       <chr> "$10,473", "$47,572", "$10,158", "$71,198", "$8...
## $ INJURED        <chr> "$0", "$0", "$0", "$0", "$0", "$0", "$0", "$0",...
## $ DETAIL         <chr> "$629", "$33,918", "$4,046", "$41,943", "$41,84...
## $ QUINN          <chr> "$6,724", "$23,721", "$6,031", "$15,156", "$12,...
## $ TOTAL.EARNINGS <chr> "$293,892", "$289,121", "$280,904", "$265,203",...
## $ Community      <chr> "Brockton", "Boston", "Quincy", "Walpole", "Bos...
## $ ZIPCode        <int> 2302, 2132, 2169, 2081, 2135, 2186, 2339, 2176,...
## $ State          <chr> "Massachusetts", "Massachusetts", "Massachusett...
## $ X              <chr> "", "", "", "", "", "", "", "", "", "", "", "",...
## $ X.1            <chr> "", "", "", "", "", "", "", "", "", "", "", "",...

4. How many employees are in this data frame?

# option 1
nrow(payroll)
## [1] 22297
# option 2
payroll %>% summarize(count=n())
##   count
## 1 22297

5. How many columns are there?

ncol(payroll)
## [1] 16

6. What are the column names?

colnames(payroll)
##  [1] "NAME"           "TITLE"          "DEPARTMENT"     "REGULAR"       
##  [5] "RETRO"          "OTHER"          "OVERTIME"       "INJURED"       
##  [9] "DETAIL"         "QUINN"          "TOTAL.EARNINGS" "Community"     
## [13] "ZIPCode"        "State"          "X"              "X.1"

7. What’s in the first 6 rows of data?

head(payroll)
##                    NAME                          TITLE
## 1      Darosa, Baltazar                 Police Officer
## 2 Creaven, Jacqueline D              Police Lieutenant
## 3     Cawley, Stephen C              Police Lieutenant
## 4     Danilecki, John H                 Police Captain
## 5    Kervin, Timothy M. Police Lieutenant/Hdq Dispatch
## 6        Hosein, Haseeb              Police Lieutenant
##                 DEPARTMENT  REGULAR RETRO    OTHER OVERTIME INJURED
## 1 Boston Police Department  $66,829    $0 $209,237  $10,473      $0
## 2 Boston Police Department $104,661    $0  $79,248  $47,572      $0
## 3 Boston Police Department  $44,632    $0 $216,037  $10,158      $0
## 4 Boston Police Department $121,244    $0  $15,663  $71,198      $0
## 5 Boston Police Department $104,588    $0  $13,494  $88,906      $0
## 6 Boston Police Department $105,261    $0  $12,883  $47,437      $0
##    DETAIL   QUINN TOTAL.EARNINGS Community ZIPCode         State X X.1
## 1    $629  $6,724       $293,892  Brockton    2302 Massachusetts      
## 2 $33,918 $23,721       $289,121    Boston    2132 Massachusetts      
## 3  $4,046  $6,031       $280,904    Quincy    2169 Massachusetts      
## 4 $41,943 $15,156       $265,203   Walpole    2081 Massachusetts      
## 5 $41,840 $12,988       $261,815    Boston    2135 Massachusetts      
## 6 $76,310 $13,129       $255,020    Milton    2186 Massachusetts

7b. What’s in the first 3 rows of data?

head(payroll, 3)
##                    NAME             TITLE               DEPARTMENT
## 1      Darosa, Baltazar    Police Officer Boston Police Department
## 2 Creaven, Jacqueline D Police Lieutenant Boston Police Department
## 3     Cawley, Stephen C Police Lieutenant Boston Police Department
##    REGULAR RETRO    OTHER OVERTIME INJURED  DETAIL   QUINN TOTAL.EARNINGS
## 1  $66,829    $0 $209,237  $10,473      $0    $629  $6,724       $293,892
## 2 $104,661    $0  $79,248  $47,572      $0 $33,918 $23,721       $289,121
## 3  $44,632    $0 $216,037  $10,158      $0  $4,046  $6,031       $280,904
##   Community ZIPCode         State X X.1
## 1  Brockton    2302 Massachusetts      
## 2    Boston    2132 Massachusetts      
## 3    Quincy    2169 Massachusetts

8. Turn the strings to numbers?

Transform the data set by taking out the dollar signs ($) and commas (,).

Use the functions gsub() and as.numeric().

Show the structure of the dataframe to verify.

payroll$REGULAR <- gsub("\\$", "", payroll$REGULAR)
payroll$REGULAR <- gsub(",", "", payroll$REGULAR)
payroll$REGULAR <- as.numeric(payroll$REGULAR)
payroll$RETRO <- gsub("\\$", "", payroll$RETRO)
payroll$RETRO <- gsub(",", "", payroll$RETRO)
payroll$RETRO <- as.numeric(payroll$RETRO)
payroll$OTHER <- gsub("\\$", "", payroll$OTHER)
payroll$OTHER <- gsub(",", "", payroll$OTHER)
payroll$OTHER <- as.numeric(payroll$OTHER)
payroll$OTHER <- gsub("\\$", "", payroll$OTHER)
payroll$OTHER <- gsub(",", "", payroll$OTHER)
payroll$OTHER <- as.numeric(payroll$OTHER)
payroll$OVERTIME <- gsub("\\$", "", payroll$OVERTIME)
payroll$OVERTIME <- gsub(",", "", payroll$OVERTIME)
payroll$OVERTIME <- as.numeric(payroll$OVERTIME)
payroll$INJURED <- gsub("\\$", "", payroll$INJURED)
payroll$INJURED <- gsub(",", "", payroll$INJURED)
payroll$INJURED <- as.numeric(payroll$INJURED)
payroll$DETAIL <- gsub("\\$", "", payroll$DETAIL)
payroll$DETAIL <- gsub(",", "", payroll$DETAIL)
payroll$DETAIL <- as.numeric(payroll$DETAIL)
payroll$QUINN <- gsub("\\$", "", payroll$QUINN)
payroll$QUINN <- gsub(",", "", payroll$QUINN)
payroll$QUINN <- as.numeric(payroll$QUINN)
payroll$TOTAL.EARNINGS <- gsub("\\$", "", payroll$TOTAL.EARNINGS)
payroll$TOTAL.EARNINGS <- gsub(",", "", payroll$TOTAL.EARNINGS)
payroll$TOTAL.EARNINGS <- as.numeric(payroll$TOTAL.EARNINGS)

str(payroll)
## 'data.frame':    22297 obs. of  16 variables:
##  $ NAME          : chr  "Darosa, Baltazar" "Creaven, Jacqueline D" "Cawley, Stephen C" "Danilecki, John H" ...
##  $ TITLE         : chr  "Police Officer" "Police Lieutenant" "Police Lieutenant" "Police Captain" ...
##  $ DEPARTMENT    : chr  "Boston Police Department" "Boston Police Department" "Boston Police Department" "Boston Police Department" ...
##  $ REGULAR       : num  66829 104661 44632 121244 104588 ...
##  $ RETRO         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OTHER         : num  209237 79248 216037 15663 13494 ...
##  $ OVERTIME      : num  10473 47572 10158 71198 88906 ...
##  $ INJURED       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DETAIL        : num  629 33918 4046 41943 41840 ...
##  $ QUINN         : num  6724 23721 6031 15156 12988 ...
##  $ TOTAL.EARNINGS: num  293892 289121 280904 265203 261815 ...
##  $ Community     : chr  "Brockton" "Boston" "Quincy" "Walpole" ...
##  $ ZIPCode       : int  2302 2132 2169 2081 2135 2186 2339 2176 1702 2043 ...
##  $ State         : chr  "Massachusetts" "Massachusetts" "Massachusetts" "Massachusetts" ...
##  $ X             : chr  "" "" "" "" ...
##  $ X.1           : chr  "" "" "" "" ...

9. Who is paid the most in overtime?

payroll %>%
  arrange(desc(OVERTIME)) %>%
  head(1)
##                  NAME                 TITLE               DEPARTMENT
## 1 Stratton, Michael J Police Sergeant (Det) Boston Police Department
##   REGULAR RETRO OTHER OVERTIME INJURED DETAIL QUINN TOTAL.EARNINGS
## 1   86675     0  9586   123810    7965   2244  4732         235013
##   Community ZIPCode         State X X.1
## 1 Hopkinton    1748 Massachusetts

Michael, a police sergeant detective in the Boston police.

10. Who are the 10 most paid in total earnings?

Show just the name, title, and total earnings columns of the top 10.

payroll %>%
  arrange(desc(TOTAL.EARNINGS)) %>%
  select(NAME, TITLE, TOTAL.EARNINGS) %>%
  head(10)
##                     NAME                          TITLE TOTAL.EARNINGS
## 1       Darosa, Baltazar                 Police Officer         293892
## 2  Creaven, Jacqueline D              Police Lieutenant         289121
## 3      Cawley, Stephen C              Police Lieutenant         280904
## 4      Danilecki, John H                 Police Captain         265203
## 5     Kervin, Timothy M. Police Lieutenant/Hdq Dispatch         261815
## 6         Hosein, Haseeb              Police Lieutenant         255020
## 7          Lee, Thomas F             Police Captain/DDC         252543
## 8          Assad, Mark L          Police Sergeant (Det)         247964
## 9       Eversley, Eric V        Police Lieutenant (Det)         247201
## 10    Crossen, Patrick J             Police Captain/DDC         245734

11. Who are the top 5 most-paid employees (total earnings) in the Mayor’s Office?

Show just the name, title, and total earnings columns of the top 5 in that department.

payroll %>%
  filter(DEPARTMENT=="Mayor's Office") %>%
  arrange(desc(TOTAL.EARNINGS)) %>%
  select(NAME, TITLE, TOTAL.EARNINGS) %>%
  head(5)
##                        NAME                   TITLE TOTAL.EARNINGS
## 1         Menino, Thomas M.                   Mayor         175000
## 2           Weiss, Mitchell          Chief Of Staff         149136
## 3       Kineavy, Michael J. Chief Policy & Planning         147812
## 4     Harrington, Patrick S       Special Assistant         123000
## 5 Leibowitz, Howard Richard       Special Assistant         119437

12. Which 5 Boston police officers make the least amount in Overtime and Detail money?

You’ll need to create a new column adding them up before filtering and sorting.

Note: Exclude officers who made no money from overtime and detail.

payroll %>%
  filter(DEPARTMENT=="Boston Police Department") %>%
  mutate(ot_det= OVERTIME+DETAIL) %>%
  filter(ot_det > 0) %>%
  arrange(ot_det) %>%
  select(NAME, TITLE, ot_det) %>%
  head(5)
##                  NAME                         TITLE ot_det
## 1 Molina, Yajaira Lee       Police Clerk And Typist     32
## 2   Colon Jr., Johnny       Police Clerk And Typist     54
## 3 McDonough, James F.         Jr Building Custodian     56
## 4    Fish, Michael A. PoliceSergeant/SupvCourtCases     71
## 5 Artesani, Gregory J       Police Clerk And Typist     88

13. How much money is Boston spending on overtime pay?

payroll %>%
  summarize(overtime_total=sum(OVERTIME))
##   overtime_total
## 1       90805141

14. What’s the average total pay and overtime pay by department?

Create a new dataframe called avg_df by aggregating the data.

Which 5 departments make the most (total) money on average?

avg_df <- payroll %>%
  group_by(DEPARTMENT) %>%
  summarize(total=mean(TOTAL.EARNINGS, na.rm=T), ot=mean(OVERTIME, na.rm=T)) %>%
  arrange(desc(total))
  
head(avg_df, 5)
## # A tibble: 5 × 3
##                       DEPARTMENT     total        ot
##                            <chr>     <dbl>     <dbl>
## 1         Boston Fire Department 105210.08 12059.925
## 2       Boston Police Department  99814.27 18639.451
## 3     Administration and Finance  81830.75     0.000
## 4             Women's Commission  79272.00     0.000
## 5 Dpt of Innovation & Technology  73597.83  1583.129

15. Which 10 departments have the most employees?

# Version 1
payroll %>%
  group_by(DEPARTMENT) %>%
  summarize(employees=n()) %>%
  arrange(desc(employees)) %>%
  head(10)
## # A tibble: 10 × 2
##                        DEPARTMENT employees
##                             <chr>     <int>
## 1                             BPS     13266
## 2        Boston Police Department      3074
## 3          Boston Fire Department      1687
## 4  Boston Cntr - Youth & Families       618
## 5           Boston Public Library       563
## 6         Public Works Department       464
## 7       Transportation Department       372
## 8                Parks Department       300
## 9      Inspectional Services Dept       237
## 10            Property Management       226
# Version 2
payroll %>%
  group_by(DEPARTMENT) %>%
  tally(sort=TRUE) %>%
  head(10)
## # A tibble: 10 × 2
##                        DEPARTMENT     n
##                             <chr> <int>
## 1                             BPS 13266
## 2        Boston Police Department  3074
## 3          Boston Fire Department  1687
## 4  Boston Cntr - Youth & Families   618
## 5           Boston Public Library   563
## 6         Public Works Department   464
## 7       Transportation Department   372
## 8                Parks Department   300
## 9      Inspectional Services Dept   237
## 10            Property Management   226

16. What’s the max total pay in the Elderly Commission and Transportation Department?

As in total earnings.

payroll %>%
  filter(DEPARTMENT=="Elderly Commission" | DEPARTMENT=="Transportation Department") %>%
  group_by(DEPARTMENT) %>%
  summarize(max_pay=max(TOTAL.EARNINGS))
## # A tibble: 2 × 2
##                  DEPARTMENT max_pay
##                       <chr>   <dbl>
## 1        Elderly Commission   91539
## 2 Transportation Department  149832

17. What’s the average pay for an Admin Secretary in the Registry Division?

Based on total earnings.

payroll %>%
  filter(TITLE=="Admin Secretary" & DEPARTMENT=="Registry Division") %>%
  summarize(avg_pay=mean(TOTAL.EARNINGS))
##   avg_pay
## 1   51766

18. Who’s the most paid person in each department?

And what is their title? Based on total pay.

Store the result in the most_paid data frame and show the first 10 rows.

most_paid <- payroll %>%
  group_by(DEPARTMENT) %>%
  mutate(rank=rank(desc(TOTAL.EARNINGS))) %>%
  filter(rank==1) %>%
  select(NAME, DEPARTMENT, TOTAL.EARNINGS)
head(most_paid, 10)
## Source: local data frame [10 x 3]
## Groups: DEPARTMENT [10]
## 
##                     NAME                     DEPARTMENT TOTAL.EARNINGS
##                    <chr>                          <chr>          <dbl>
## 1       Darosa, Baltazar       Boston Police Department         293892
## 2       Quinlan, Richard         Boston Fire Department         222698
## 3     Bonds, Jonathan  D                            BPS         221687
## 4            Ryan, Amy E          Boston Public Library         196626
## 5      Menino, Thomas M.                 Mayor's Office         175000
## 6          Donovan, John     Dept of Voter Mobilization         162275
## 7         Oates, William Dpt of Innovation & Technology         154875
## 8      Weenick, Meredith     Treasury-Treasury Division         153238
## 9        Weise, Susan M.                 Law Department         150890
## 10 Bartosiak, Michael A.            Property Management         149990

19. How many employees and titles are there per department?

Save it to a employee_titles data frame object.

Show 10 rows with the most department titles.

employee_titles <- payroll %>%
  group_by(DEPARTMENT) %>%
  summarize(total_employees=n(), total_titles=n_distinct(TITLE)) %>%
  arrange(desc(total_titles))
  
  head(employee_titles,10)
## # A tibble: 10 × 3
##                        DEPARTMENT total_employees total_titles
##                             <chr>           <int>        <int>
## 1                             BPS           13266          297
## 2        Boston Police Department            3074          177
## 3          Boston Fire Department            1687          138
## 4           Boston Public Library             563          138
## 5         Public Works Department             464           91
## 6  Boston Cntr - Youth & Families             618           84
## 7        Neighborhood Development             163           74
## 8             Property Management             226           69
## 9      Inspectional Services Dept             237           60
## 10               Parks Department             300           60

20. What’s the average difference in pay between regular pay and overtime by department?

Save it to a ot_diffdata frame object.

Show the first 10 rows sorted by smallest difference.

Note: ignore departments if average overtime pay is zero dollars.

ot_diff <- payroll %>%
  group_by(DEPARTMENT) %>%
  summarize(avg_ot=mean(OVERTIME), avg_total=mean(TOTAL.EARNINGS)) %>%
  filter(avg_ot>0) %>%
  mutate(diff=avg_total-avg_ot) %>%
  arrange(diff)
  
  head(ot_diff,10)
## # A tibble: 10 × 4
##                        DEPARTMENT     avg_ot avg_total     diff
##                             <chr>      <dbl>     <dbl>    <dbl>
## 1  Boston Cntr - Youth & Families  441.68608  32563.09 32121.41
## 2                Parks Department 2965.80667  38624.69 35658.88
## 3             Boston City Council    7.91129  36201.37 36193.46
## 4      Mayor's Office-Public Info  266.89286  38658.54 38391.64
## 5      Dept of Voter Mobilization 8596.69231  49046.54 40449.85
## 6         Public Works Department 6776.05388  47588.57 40812.52
## 7           Boston Public Library 1169.52398  42725.11 41555.58
## 8              Elderly Commission  341.34667  42366.16 42024.81
## 9               Cemetery Division 8100.50000  50319.81 42219.31
## 10      Transportation Department 3645.13441  48442.69 44797.55

21. How much of a percent is Police overtime in Total Earnings?

payroll %>%
  filter(DEPARTMENT=="Boston Police Department") %>%
  summarize(OT_Total=sum(OVERTIME, na.rm=T), Total=sum(TOTAL.EARNINGS, na.rm=T)) %>%
  mutate(Percent_OT=OT_Total/Total*100)
##   OT_Total     Total Percent_OT
## 1 57297673 306829064   18.67413