stats19
2023 data
Hi:
I'm struggling to format the 2023 data (actually, I want 2004 to 2023, so I would prefer to use dft-road-casualty-statistics-casualty-1979-latest-published-year.csv etc.).
library(readr)
library(stats19)
downloader::download("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-casualty-1979-latest-published-year.csv", "cas.csv")
cas.df <- read_csv("cas.csv")
cas.df <- format_casualties(cas.df)
The code above gives 13 warning messages, and puts in an awful lot of NAs.
I tried this:
x = get_stats19(2023, silent = FALSE, type = "collision", output_format = "data.frame")
with the plan that I'd loop through the relevant years and bind_rows. Unfortunately, it can't find the 2023 data. From what I can work out, the function (or some other function nested within it) refers to a table of links, and the 2023 links aren't in that table yet. Would it be possible to make updating that table something users can do themselves? Or, if the single year isn't found when using get_stats19(), to fall back on the 1979-to-latest-year set of files in the first instance?
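One way to get 2004 to 2023 without the per-year links is to read the combined 1979-to-latest file once and filter it by year. A minimal sketch, assuming the year column is called accident_year (check names() on the real file, since column names can differ between releases). subset_years() is a hypothetical helper, not part of stats19, and the toy data frame stands in for the real CSV:

```r
# Hypothetical helper (not part of stats19): keep only the rows whose
# year falls within the requested range.
subset_years <- function(df, years, year_col = "accident_year") {
  if (!year_col %in% names(df)) {
    stop("Column '", year_col, "' not found; check names(df)")
  }
  df[df[[year_col]] %in% years, , drop = FALSE]
}

# Toy stand-in for the combined 1979-to-latest casualty file
cas_all <- data.frame(
  accident_year = c(1999L, 2004L, 2015L, 2023L),
  casualty_severity = c(3L, 2L, 3L, 1L)
)

# Keep 2004 to 2023 only; the 1999 row is dropped
cas_2004_2023 <- subset_years(cas_all, 2004:2023)
```

With the real data, cas_all would instead come from reading the downloaded cas.csv, and the filtered result could then be passed to the formatting step.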
Boom, great it's out there, thanks for the nudge. On the case! Cc @layik, could be a good 'bonus' one for our hackathon on Thursday: https://github.com/Robinlovelace/netvishack/
Hi @Robinlovelace: if it is going to be a quick update, I'll wait before I continue the piece of work I'm currently doing. Thanks!
Good motivation to be fast, will try to do by end of play today.
@Robinlovelace I think the NAs are being introduced by one line within the format_stats19() function.
Specifically the line highlighted in yellow below.
I think the line above that line of code successfully retains the original value in cases where NAs would otherwise be introduced (for example, for a variable where -1 isn't declared as a level in the schema).
The yellow highlighted line introduces NAs because it is (often) trying to convert characters to integers, since the original class was integer (at least in some cases).
When I declared a version of format_stats19() with that yellow line of code commented out, the formatting appears to work correctly when using format_stats19() directly:
format_stats19(casualties_2023, type = "Casualty")
Though I would note that format_stats19() doesn't exist within the current version of the stats19 package, and I obtained the function through https://github.com/ropensci/stats19/blob/master/R/format.R
There might be a good reason why that function isn't available within the current version of the package.
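The coercion problem described above can be reproduced in isolation. The sketch below is not the package's code; the lookup table and values are made up purely to illustrate how converting labelled (character) values back to an integer class produces NAs:

```r
# Raw integer codes as they might appear in the CSV; -1 is assumed here
# not to be declared as a level in the schema
raw <- c(1L, 2L, -1L)

# Toy code-to-label lookup (illustrative only, not the real schema)
lookup <- c("1" = "Male", "2" = "Female")

# Step 1: replace codes with labels, keeping the original value where
# there is no matching label
labelled <- unname(lookup[as.character(raw)])
missing_label <- is.na(labelled)
labelled[missing_label] <- as.character(raw)[missing_label]

# Step 2: coerce back to the original class (integer). Every label that
# is not a number becomes NA, with an "NAs introduced by coercion" warning
# (suppressed here so the example runs quietly)
coerced <- suppressWarnings(as.integer(labelled))
```

After step 1 the labels are intact and -1 is preserved as "-1"; after step 2 only -1 survives and the labels are lost, which would match the pattern of many NAs reported above.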
I think there is some urgency to getting 2023 and the partial 2024 stats into scope for stats19. In response to a claim by the Secretary of State for Wales in Parliament, I looked at the data from 2019 through to the partial data for 2024. Making loads of assumptions and taking Welsh statistics as a proportion of total statistics, it seems that the introduction of 20mph speed limits has reduced fatalities in the 2024 data. Robust statistical analysis would have three potential benefits:
- Quantifying the benefits in terms of lives saved (and cost savings) would enhance driver compliance
- Geospatial statistics would enable a refinement of the restriction zones making them more resistant to a change of Senedd government reversing the policy. A machine learning approach to zones relative to shops, schools, play areas, pedestrian routes and housing etc could be envisaged.
- Getting the application refined would open the way for a similar policy change to be rolled out across England with precision.
I will look to fix this in the coming week.
Thanks @Robinlovelace. In the end, I modified the code of the package in an ugly way to get my particular project over the line, but it would be great if the package could handle year change on its own with less or no human action in the future. I think provisional data (and how to code in appropriate guidance on use) should be a separate issue.
I was using the version that I cloned in July and was able to use the 2023 data with no problem. Digging into the code it seems it is the line @R-M-J-P mentions that is the difference? Not sure when that line was added.
Hi @wengraf, @BlaiseKelly, @ar0berts and @R-M-J-P, with apologies for slow start on this, see starter on updates here: #254
Basically: lots of things have changed, including the xlsx sheet with variable names from DfT that is the source of truth. If anyone wants to have a bash with that, feel free to check out the branch with
gh pr checkout 254
And open the schema_new.Rmd file and start hacking! Comments/suggestions welcome, will aim to get back on it later today or early next week.
Fixed now I think. Please everyone test and let me know either way :pray:
