stats19
Open data spec changes
Heads-up @BlaiseKelly and others: I've just realised that @matthewtranter-dft kindly sent a spreadsheet outlining the changes that have been made to the open data. Many thanks Matt, this is very useful! I've sent an email on the topic and am tracking it here.
Thanks Robin! I think the 'schema_new.Rmd' script takes care of this? So hopefully there aren't any manual amendments still needed?
Although in the Pedestrian_Factsheet_2024 vignette, the junctions table (table 4) has some codes that haven't been matched to junctions, so something has changed there. I don't think this table covers that? Anyone got a heads-up on where to look? I can create a reprex if necessary.
Reprex would be useful to work out what's going on with the codes that don't match.
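As a starting point for such a reprex, here is a minimal sketch of one way to list codes that occur in the data but have no label in the schema. The helper name `unmatched_codes`, the `code` column and the toy values are all assumptions for illustration; they are not the package's API or real STATS19 codes.

```r
# Hypothetical helper: codes present in the data but absent from the schema.
unmatched_codes <- function(values, schema_codes) {
  sort(setdiff(unique(values), schema_codes))
}

# Toy data, made up for illustration (not real STATS19 codes or labels)
toy_schema <- data.frame(
  code = c(-1, 1, 2),
  label = c("missing", "type A", "type B"),
  stringsAsFactors = FALSE
)
toy_data <- data.frame(junction_detail = c(1, 2, 2, 9, -1, 9))

# Any code listed here would come out unlabelled after formatting
unmatched <- unmatched_codes(toy_data$junction_detail, toy_schema$code)
print(unmatched)
```

Running the same comparison against the real schema and collision table should show which codes the formatting step cannot label.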
I think it was just "Accident" left over in the format_collisions function. But still doing a few tests. Will close when finished.
As mentioned in issue https://github.com/ropensci/stats19/issues/260 (off topic) there were some issues with the data guide, shown in reprex below:
library(openxlsx)
library(dplyr) # provides filter() for data frames, used below
# get definitions
schema_dft = read.xlsx("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2024.xlsx")
# junction details
j_detail <- schema_dft[schema_dft$field.name == "junction_detail",]
j_detail$label
[1] "Not at junction or within 20 metres" "T or staggered junction"
[3] "Crossroads" "Junction with more than four arms (not roundabout)"
[5] "Using private drive or entrance" "unknown (self reported)"
[7] "Data missing or out of range"
# getting the latest collision data
collisions <- read.csv("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv")
# junction detail "Data missing or out of range" (-1)
c_df <- filter(collisions, junction_detail < 0)
# find historic codes included in "Data missing or out of range"
unique(c_df$junction_detail_historic)
9 -1 99
So the historic codes only have 3 repeated values?
# find historic codes included in "Data missing or out of range"
unique(c_df$junction_detail_historic)
9 -1 99
One for @matthewtranter-dft and team I guess.
@matthewtranter-dft I also noticed a lot of the local_authority_districts were coming out as 'code deprecated'. Testing with the 2024 data, it seems this year is the issue: all the codes are -1.
# get last year's data
cra_2024 <- read.csv("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-2024.csv")
# only -1
unique(cra_2024$local_authority_district)
-1
# get last 5 years
cra_L5Y <- read.csv("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv")
# seems better
unique(cra_L5Y$local_authority_district)
[1] 240 -1 243 241 124 128 70 71 228 231 232 189 323 364 346 257 601 605 612 583 633 394 460 462 544 470 472 473 474 565 498 505 169 129 130 286 596 635
[39] 424 139 640 643 480 182 184 187 181 610 606 607 608 100 102 106 110 114 91 95 210 211 213 215 147 148 150 300 305 309 200 202 203 204 206 146 16 30
[77] 18 28 19 2 20 27 32 6 4 11 31 29 15 26 25 3 12 23 9 7 22 17 14 24 21 5 13 10 1 395 391 393 392 390 63 65 61 60
[115] 62 321 325 586 588 580 581 556 554 552 555 551 453 455 456 450 463 622 620 491 497 502 495 499 496 490 492 501 33 438 435 436 434 433 542 531 538 536
[153] 532 540 84 82 79 363 365 351 355 352 353 356 407 404 401 341 484 253 251 258 512 514 510 515 517 40 516 291 292 294 290 557 563 558 564 559 560 274
[191] 921 924 927 931 938 939 910 923 935 940 925 926 932 722 723 724 725 751 744 731 734 733 753 161 233 284 611 421 475 479 500 420 641 645 647 644 476 478
[229] 186 185 180 101 104 109 112 90 92 93 302 306 570 8 64 329 324 328 585 584 451 454 461 457 623 494 437 535 533 543 541 539 75 76 366 362 354 400
[267] 402 406 384 381 385 342 345 481 255 412 511 518 277 270 278 57 919 752 743 746 730 742 285 327 582 459 621 624 625 431 73 83 85 350 382 340 343 344
[305] 347 482 256 513 293 276 432 303 307 930 912 918 741 245 368 471 646 107 322 320 589 72 360 386 483 413 410 562 273 914 911 934 745 587 477 149 458 361
[343] 38 928 929 937 916 721 452 383 254 922 750 430 530 77 405 380 485 913 720 740 493 367 252 642 74 732 80 250 915 917 920 609 936 933 941
# filter the last 5 years for 2024: only -1 again
cra_L5Y_2024 <- cra_L5Y[cra_L5Y$collision_year == 2024,]
unique(cra_L5Y_2024$local_authority_district)
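To see at a glance which years are affected, one option is to compute the share of -1 codes per year. This is a minimal sketch on a made-up data frame; the column names follow the reprex above, but the values are toy data, not DfT figures.

```r
# Toy stand-in for the collision table (values are made up)
toy <- data.frame(
  collision_year = c(2023, 2023, 2024, 2024),
  local_authority_district = c(240, -1, -1, -1)
)

# Proportion of rows coded -1 within each year
share_missing <- tapply(
  toy$local_authority_district == -1,
  toy$collision_year,
  mean
)
print(share_missing)
```

A year where the share is exactly 1 has no usable local_authority_district codes at all, which is the pattern reported for 2024.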
So the historic codes only have 3 repeated values?
# find historic codes includes in Data missing
unique(c_df$junction_detail_historic)
9 -1 99
One for @matthewtranter-dft and team I guess.
Yes, I think it should just be -1 and 99; 9 (which is "Other Junction") should not be there. @matthewtranter-dft has already replied to say they have identified the issue. But since I was posting the latest example anyway, I thought I should make a separate issue out of it, as it is not really related to the adjusted values.
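The check described here can be sketched as follows: keep the rows recoded to -1 whose historic code is neither -1 nor 99. The data frame is a toy example; only the column names and the values -1, 9 and 99 come from the reprex, and the other value is an arbitrary placeholder.

```r
# Toy rows (made up); column names follow the open data
toy <- data.frame(
  junction_detail = c(-1, -1, -1, 3, -1),
  junction_detail_historic = c(9, -1, 99, 3, 9)
)

# Historic codes that are expected to map to "Data missing or out of range"
expected_missing <- c(-1, 99)

# Rows coded as missing whose historic code suggests a real category
suspect <- toy[
  toy$junction_detail == -1 &
    !(toy$junction_detail_historic %in% expected_missing),
]
print(table(suspect$junction_detail_historic))
```

On the real data, any historic code appearing in this table would be a candidate for the incorrect mapping.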
I also noticed a lot of the local_authority_districts were coming out as 'code deprecated'.
An issue with LAD codes is that LAD boundaries keep changing, so a code is only correct for a particular LAD data release, for a limited period, until it is replaced by a new boundary. That means there would need to be a new column for this every year or so. Spatial joins and police forces are probably more useful anyway. So I think I approve of LAD codes being deprecated.
There is an awful lot of risk of error in doing long-term local area analysis based on these sorts of boundaries. A couple of years ago the adjustments to the boundaries of the Bournemouth area (and giving the new entity the same name as one of its historic component areas) caused all sorts of grief for people. Proceed with extreme caution. Crashes within political or police entities that no longer exist should keep the old ONS code (i.e., what it was then), and not be updated to what it ought to have been now (because to do so ascribes responsibility/oversight to a body that didn't have it).
This is good to know, so it sounds like this is not an issue (although the junction_detail one still is, I think).
So would the best approach for a LA analysis be to spatially group the data based on a specific shape file? Or use something more uniform along the lines of https://cran.r-project.org/web/packages/zonebuilder/vignettes/paper.html?
If a modern shapefile approach associates a given crash to an entity that did not have responsibility at the time (e.g., because it didn't exist), I think that would be quite problematic. You need to link old ONS areas to the modern one, and display as related but distinct (e.g., with a change in line colour, etc.). So, for example, you might plot the old Scottish police forces as trend lines, then have a single line once a single Scottish force was set up, and have the other lines end. Similarly, you could plot the old Scottish forces, and continue to the present day as separate lines, but then label the lines things like "Police Scotland (old Northern area)". The basic thing is you want to make clear that someone with a RS responsibility (police, LA) wasn't on the hook for something they weren't.
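A minimal sketch of that labelling idea, assuming a hand-made lookup from each area code recorded at the time to its successor. All area names, counts and column names are hypothetical.

```r
# Toy counts by year and by the area code recorded at the time (made up)
counts <- data.frame(
  year = c(2011, 2011, 2012, 2012, 2014),
  area_at_time = c("A", "B", "A", "B", "AB"),
  collisions = c(10, 20, 12, 18, 29),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: each historic area and the body that replaced it
lookup <- data.frame(
  area_at_time = c("A", "B", "AB"),
  successor = c("AB", "AB", "AB"),
  stringsAsFactors = FALSE
)

linked <- merge(counts, lookup, by = "area_at_time")

# Keep predecessor series distinct but visibly related to the successor
linked$series <- ifelse(
  linked$area_at_time == linked$successor,
  linked$successor,
  paste0(linked$successor, " (old ", linked$area_at_time, " area)")
)
print(linked[order(linked$year), c("year", "series", "collisions")])
```

Plotting one line per `series` would then end the old lines when the new body takes over, without attributing earlier crashes to it.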
As mentioned in issue #260 (off topic) there were some issues with the data guide, shown in reprex below:
library(openxlsx)
library(dplyr) # provides filter() for data frames, used below
# get definitions
schema_dft = read.xlsx("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2024.xlsx")
# junction details
j_detail <- schema_dft[schema_dft$field.name == "junction_detail",]
j_detail$label
[1] "Not at junction or within 20 metres" "T or staggered junction"
[3] "Crossroads" "Junction with more than four arms (not roundabout)"
[5] "Using private drive or entrance" "unknown (self reported)"
[7] "Data missing or out of range"
# getting the latest collision data
collisions <- read.csv("https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv")
# junction detail "Data missing or out of range" (-1)
c_df <- filter(collisions, junction_detail < 0)
# find historic codes included in "Data missing or out of range"
unique(c_df$junction_detail_historic)
9 -1 99
Still wondering about this, and then we can close. Basically, in data-guide-2024.xlsx roundabouts have been dropped and there is no longer an "Other junction" category. But in https://www.gov.uk/government/statistics/reported-road-casualties-great-britain-pedestrian-factsheet-2024/reported-road-casualties-in-great-britain-pedestrian-factsheet-2024 "Other junction" is still there. Its numbers seem to match those for "Data missing or out of range", but surely the two are different classifications? @matthewtranter-dft thanks!
Hello - sorry, I can't remember where we were up to on this. There's a coding issue with the 'other junction' (that code is being mapped incorrectly to missing in the open data) that should be fixed next week - apologies, and it's only thanks to your work that we have noticed this one!
Many thanks for the feedback Matt, it means a lot. I may get in touch separately on this, as it is a good example of how open source ecosystems can support good government stats, especially when you're as responsive and open as you are. Big credit to your team, and huge thanks again to @BlaiseKelly for the detective work on this! Please let us know when the new datasets are out and we can patch this package accordingly and submit to CRAN.