stats19 icon indicating copy to clipboard operation
stats19 copied to clipboard

Reproducing DfT pedestrian stats sheet

Open BlaiseKelly opened this issue 1 year ago • 15 comments

This is maybe not an issue a such, but to use this package for an analysis I wanted to make sure I was using the right data so I decided to try to reproduce the 2018-2022 sections of the government summary sheet for pedestrian casualties: https://www.gov.uk/government/statistics/reported-road-casualties-great-britain-pedestrian-factsheet-2022/reported-road-casualties-in-great-britain-pedestrian-factsheet-2022#how-far-do-pedestrians-travel. This could also be future test dataset for the package once discrepancies have been ironed out?

My version is output in the fork: https://github.com/BlaiseKelly/stats19 and the qmd file is in the tests folder tests/Pedestrian_factsheet_2022.qmd

There are quite some differences. No doubt some to do with errors I have made. Maybe the first paragraph is a good one to post here as this will likely feed into many of the other stats.

DfT get: "an average of 8 pedestrians died and 109 were seriously injured (adjusted) per week in reported road collisions"

I get 41 pedestrians died and 505 seriously injured, which does seem quite high. Below how I did it:

library(stats19) 
library(dplyr)
library(lubridate)

yrz <-  c(2018,2019,2020, 2021,2022)
col_list <- list()
cas_list <- list()
for(y in yrz){

## request collision data
crashes = get_stats19(year = y, type = "collision", ask = FALSE, format = TRUE, output_format = "data.frame") %>%
  select(-accident_year)

## request casualty
casualties = get_stats19(year = y, type = "casualty", ask = FALSE, format = TRUE, output_format = "data.frame")%>%
  select(-accident_year)


## add each year to a list
  col_list[[as.character(y)]] <- crashes
  cas_list[[as.character(y)]] <- casualties

}

## extract from list
all_cra <- do.call(rbind,col_list)
all_cas <- do.call(rbind,cas_list)

## casualty type data fundamental to this analysis
crash_cas <- all_cra %>% 
  left_join(all_cas, by = "accident_reference")

When the casualty data is joined there are multiple matches for the accident reference. This is consistent with the number_of_casualties column and there is a row for each casualty, so adding a count column and summing this up should give the correct casualty totals?

dths_per_wk <- crash_cas %>% 
  mutate(wk = isoweek(date)) %>% ## calculate the day of week, Monday is 1
  select(wk, accident_severity, casualty_type, number_of_casualties,accident_reference) %>% ## pick out the columns needed
  mutate(count = 1) %>% ## there are multiple rows for each casualty, add a 1 so can sum up the number for each circumstance
  filter(casualty_type == "Pedestrian") %>%  ## pick out only pedestrian casualties
  group_by(wk, accident_severity) %>%
  summarise(casualties = sum(count))

  ## filter for only fatal collisions
fatal_wk <- dths_per_wk %>% 
  filter(accident_severity == "Fatal")

## filter for only serious injuries
serious_wk <- dths_per_wk %>% 
  filter(accident_severity == "Serious")

This gives two dataframes of the total deaths per week for each injury severity level.

Now pick out the mean of this for each injury severity.

## output single value
fatalities = round(mean(fatal_wk$casualties))
serious = round(mean(serious_wk$casualties))

## stats document said 8
fatalities 
## stats doument said 109
serious

Amongst some other queries, the report refers to different age bands and there doesn't appear to be contributory factors in the csvs this package accesses?

BlaiseKelly avatar Jun 24 '24 18:06 BlaiseKelly

I had left out year:

dths_per_wk <- crash_cas %>% 
  mutate(wk = isoweek(date),
         yr = year(date)) %>%
  select(wk,yr, accident_severity, casualty_type, number_of_casualties,accident_reference) %>% 
  mutate(count = 1) %>% 
  filter(casualty_type == report_casualty) %>%  
  group_by(wk, yr, accident_severity) %>%
  summarise(casualties = sum(count))

fatal_wk <- dths_per_wk %>% 
  filter(accident_severity == "Fatal")

## filter for only serious injuries
serious_wk <- dths_per_wk %>% 
  filter(accident_severity == "Serious")

## stats document said 8
fatalities 
## stats doument said 109
serious

BlaiseKelly avatar Jun 25 '24 05:06 BlaiseKelly

"In 2022, 385 pedestrians were killed in Great Britain, whilst 5,901 were reported to be seriously injured (adjusted) and 13,041 slightly injured (adjusted)."

I get:

## 2022
severity_2022 <- crash_cas %>% 
  select(accident_severity, casualty_type, date, number_of_casualties,accident_reference) %>% 
  mutate(accident_year = year(date)) %>% 
  filter(casualty_type == "Pedestrian" & accident_year == 2022) %>% 
  mutate(count = 1) %>% 
  group_by(accident_severity) %>% 
  summarise(casualties = sum(number_of_casualties)) 

fatal_2022 <- filter(severity_2022, accident_severity == "Fatal")

serious_2022 <- filter(severity_2022, accident_severity == "Serious")

slight_2022 <- filter(severity_2022, accident_severity == "Slight")

fatal = round(fatal_2022$casualties)

serious = round(serious_2022$casualties)

slight = round(serious_2022$casualties)

fatal: 556 serious: 6986 slight: 14485

BlaiseKelly avatar Jun 25 '24 05:06 BlaiseKelly

This is awesome! Big thumbs up to reproducing official stats based on the STATS19 datasets and interesting that the numbers differ. It may be worth emailing the relevant address for clarification, if we can confirm that the results do indeed differ. I cannot off the top of my head think of any clear explanation for this so cc others involved in STATS19 data analysis @wengraf, @layik and @rogerbeecham.

Robinlovelace avatar Jun 25 '24 13:06 Robinlovelace

Also cc @PublicHealthDataGeek FYI

Robinlovelace avatar Jun 25 '24 13:06 Robinlovelace

Hi @Robinlovelace and @BlaiseKelly. I just had time to see this thread. As Robin said this is very interesting. Here is my reprex:

I get 41 pedestrians died and 505 seriously injured

library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#

##### same code as above...

## stats document said 8
fatalities 
#> [1] 41
## stats doument said 109
serious
#> [1] 505

Created on 2024-07-17 with reprex v2.1.1

We are not saying this is an issue for the package right?

layik avatar Jul 17 '24 13:07 layik

I'll have a go with MAST and see what that says. Will report back.

wengraf avatar Jul 17 '24 13:07 wengraf

Thanks for taking a look!

I don't think an issue with the package as such, although it seems the underlying data they have used might be different (i.e. age bands and where are the 'contribitory factors'?)

BlaiseKelly avatar Jul 17 '24 13:07 BlaiseKelly

Using MAST, I get 2018 Pedestrian fatalities over the 5 years of 2018-22. That averages out to 404 per annum, and 7.76 per week. I've looked at unadjusted serious (not adjusted like in the quote), and that is 25509, 5102 and 98.11 respectively...that sounds like it will be inline with the adjusted serious numbers they have. So I would say MAST is in line with what @BlaiseKelly says the DfT said (I'll admit, I've not gone back to the source material you're comparing to).

CFs are not in the public data...you need special permission to access those, and they have to be quite carefully interpreted.

The public data has single year of age and also age bands.

Let me now see if I can get my R code to make the DfT result...

wengraf avatar Jul 17 '24 13:07 wengraf

I've had a crack at this, and I get the MAST/DfT result...

library(stats19) 
library(dplyr)
library(lubridate)

yrz <-  c(2018,2019,2020, 2021,2022)
col_list <- list()
cas_list <- list()
for(y in yrz){
  ## request collision data
  crashes = get_stats19(year = y, type = "collision", ask = FALSE, format = TRUE, output_format = "data.frame")
  
  ## request casualty
  casualties = get_stats19(year = y, type = "casualty", ask = FALSE, format = TRUE, output_format = "data.frame")
  casualties$lsoa_of_casualty <- as.character(casualties$lsoa_of_casualty)
  
  ## add each year to a list
  col_list[[as.character(y)]] <- crashes
  cas_list[[as.character(y)]] <- casualties
  
  message(y)
}

## extract from list
all_cra <- bind_rows(col_list)
all_cas <- bind_rows(cas_list)

## Check total casualties and total crashes against MAST.
# Total casualties should be 693028
print(nrow(all_cas))
# Total crashes should be 538461
print(nrow(all_cra))

cas_ped <- subset(all_cas, all_cas$casualty_class == "Pedestrian")

# Clearly an issue with formatting numbers (see Age_of_casualty)...see issue #235

# Make a summary table of pedestrian casualties by severity, then make annual and weekly averages.
summary.df <- cas_ped %>% group_by(casualty_severity) %>% dplyr::summarise("count" = n())

summary.df$annual_average <- summary.df$count/5
summary.df$weekly_average <- summary.df$annual_average/52

There is definitely an issue with formatting numbers, which I've raised previously (#235 ), which is stopping access to single year of age of casualty. For the purpose of reproducing this DfT stat, I cannot see the logic of needing a table join or much of what follows from that...I suspect something weird is going on in all of that...

wengraf avatar Jul 17 '24 14:07 wengraf

Thank you @wengraf! I see, I actually cannot see where the flaw (if any) is in @BlaiseKelly's code. The left join is required to get the "date" of the accident (I think). So I got the date out and worked with "casualty_severity" but still getting close to some 5x the reported/your results. It looks like assigning the casualties to weeks rather than years makes a difference which cannot be right.

@BlaiseKelly wk values do contain number 53. I will leave it there.

layik avatar Jul 17 '24 16:07 layik

Thank you @wengraf! I see, I actually cannot see where the flaw (if any) is in @BlaiseKelly's code. The left join is required to get the "date" of the accident (I think).

For some reason, @BlaiseKelly drops accident year from the downloading function. Don't drop that, and then you have accident year.

I haven't followed through the code from the left_join onwards, but I cannot see the reasoning for it in the first place. You join tables when your base unit (casualty, crash or vehicle) doesn't have some field from another you'd like (e.g., engine capacity of vehicle of casualty, or similar). I can't see how that's needed to reproduce the stat here.

I've not assigned anything to weeks...the initial quote is a five year average turned into a weekly average. In fact, you don't need date or year at all (other than for the initial downloading loop).

wengraf avatar Jul 17 '24 16:07 wengraf

That is it then. The casualties are assigned in @BlaiseKelly code 5x hence.

layik avatar Jul 17 '24 16:07 layik

Please close this when you are happy @BlaiseKelly.

layik avatar Jul 17 '24 16:07 layik

Phew! Sounds like a resolution... This could be hugely useful not only in showing the validity of the package but also in the official stats. I suggest making it a vignette from the updated correct result, that would really be closure for me..

Robinlovelace avatar Jul 17 '24 20:07 Robinlovelace

Thanks for all your efforts looking at this.

Yes I agree left_join was maybe unnecessary for this stat, but this was an example from the qmd trying to reproduce the full report https://github.com/BlaiseKelly/stats19/blob/master/tests%2FPedestrian_factsheet_2022.qmd. Further down this it is necessary to join the three datasets together (I think) so I thought might as well do it at the start to have one big data frame.

Will have a proper look at this in the next few days and hopefully close and definitely try to leave it as an example of some form as @Robinlovelace suggests.

BlaiseKelly avatar Jul 17 '24 21:07 BlaiseKelly

I had another go at finishing this off. Still not quite there but maybe useful for rrsrr. See Pedestrian_factsheet_2023_draft.qmd

BlaiseKelly avatar Sep 01 '25 18:09 BlaiseKelly

For sure useful, perfect timing, thanks loads Blaise!

Robinlovelace avatar Sep 01 '25 18:09 Robinlovelace

Have pushed the latest version of this based on the updated data released a few days ago (vignettes/Pedestrian_factsheet_2024.qmd). Pretty good agreement, but a few stats are slightly out @matthewtranter-dft maybe you would be able to help with them?

The main disagreement is the time of day plot (chart 4). Saturday and Sunday match the chart but Monday to Friday is out by around a factor of 10, it was the same for the 2023 version I was working on also. The totals in the table that feeds the plot match the totals in table 3. I have opened a seperate issue with a reprex https://github.com/ropensci/stats19/issues/262

In table 3 there is an 'other' vehicle that should be added to the car data I think. Is there a dataset that gives the categories? I created my own table.

Section 4 on rates per mile. The sheet NTS0101: https://www.gov.uk/government/statistical-data-sets/nts01-average-number-of-trips-made-and-distance-travelled doesn't appear to have any walking or population data. I used NTS0303 and this population table: https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationestimatestimeseriesdataset/current/pop.csv however the output stats are a bit different. What table/s were used for the report?

Also minor comment/query: not all references to serious and slight have '(adjusted)' after, but I think all the serious and slight summary stats use adjusted data?

BlaiseKelly avatar Sep 27 '25 12:09 BlaiseKelly

Chart 4 time of day is now sorted see https://github.com/ropensci/stats19/issues/262 which was the major discrepancy.

BlaiseKelly avatar Sep 28 '25 08:09 BlaiseKelly

Latest version uploaded here so people don't have to run the code to view https://rpubs.com/Blaise/1349258

BlaiseKelly avatar Sep 28 '25 09:09 BlaiseKelly

Latest version uploaded here so people don't have to run the code to view https://rpubs.com/Blaise/1349258

The results there look amazing, great job @BlaiseKelly ! This one is impressive and new knowledge for me:

Image

Robinlovelace avatar Sep 29 '25 09:09 Robinlovelace

Hi - good work on the time of day issue, perhaps that is one where we could be clearer.

NTS0303 sounds right for the walking data (currently away from the office but happy to check more thoroughly when back next week). Population figures we will have got from ONS, but might be easier to compare actual figures here.

Table 3 - we would be counting cars as vehicle_type categories 8,9,10 - as per the variable list published here: https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2024.xlsx

Hope that helps, but if not happy to try again

matthewtranter-dft avatar Sep 29 '25 23:09 matthewtranter-dft

Latest version of this factsheet works for psst years also, Output for 2023: https://rpubs.com/Blaise/stats19_pfs_2023 and 2024: https://rpubs.com/Blaise/stats19_pfs_2024

BlaiseKelly avatar Oct 07 '25 15:10 BlaiseKelly

This is basically done now, right @BlaiseKelly? Suggestion: set eval=FALSE in the vignette but add a message saying "see rendered results at location or try reproducing the results by downloading the source code of this vignette and setting eval=TRUE" at the beginning. That will speed up checks but retain reproducibility.

Robinlovelace avatar Oct 08 '25 09:10 Robinlovelace

This is basically done now, right @BlaiseKelly? Suggestion: set eval=FALSE in the vignette but add a message saying "see rendered results at location or try reproducing the results by downloading the source code of this vignette and setting eval=TRUE" at the beginning. That will speed up checks but retain reproducibility.

Fixed in https://github.com/ropensci/stats19/pull/275/commits/3ed19eea317b26e775ba50237b17bb308b7588c3 we can close this issue after it's merged I think.

Robinlovelace avatar Oct 08 '25 09:10 Robinlovelace

This is basically done now, right @BlaiseKelly? Suggestion: set eval=FALSE in the vignette but add a message saying "see rendered results at location or try reproducing the results by downloading the source code of this vignette and setting eval=TRUE" at the beginning. That will speed up checks but retain reproducibility.

Yes, no further work planned from my side. Main improvement would be to find the source DfT use for the travel and population data. Once https://github.com/ropensci/stats19/issues/260 is sorted table 4 should correct itself.

So happy to close once we get the checks to pass/turn them off.

BlaiseKelly avatar Oct 08 '25 10:10 BlaiseKelly

Also it is now just Pedestrian_factsheet.Rmd. Pedestrian_factsheet_2024.Rmd can be removed. WIll do that now.

BlaiseKelly avatar Oct 08 '25 10:10 BlaiseKelly