Some police forces used alternative grid references for eastings and northings ~1979-1981 and 1986

Open cmcaine opened this issue 6 years ago • 22 comments

E.g. Hounslow isn't really in the sea to the west of Glasgow.

screen

The London Boroughs and some other geographical areas did this. If we can find out what system they were using we could fix this.

1986 looks like a CRS issue too, but I don't really know what's going on.

We can also observe a lot of other errors with the early geocoding in this sequence of images:

plot001

Source code:

library(dplyr)
library(sf)
library(tmap)

BNG = 27700 # EPSG code for the British National Grid

# smaller_s19 is stats19 1979-2004, serious and fatal only, drop NAs on coordinates
year_maps = smaller_s19 %>%
  mutate(year = lubridate::year(date)) %>%
  st_as_sf(coords = c("location_easting_osgr", "location_northing_osgr"), crs = BNG) %>%
  qtm() + tm_facets(along = "year")

# And copy the image files out of /tmp before this finishes ;)
tmap_animation(year_maps, "stats19-osgr-locations-over-time.mpg")

cmcaine avatar Aug 07 '19 17:08 cmcaine

Well found @cmcaine. At the request of @mem48, I recall we added a warning saying that locations may not be accurate before 2005. It would be amazing if we could rectify the issues in the code. I think that analysing the crashes with clearly errant points could lead to a solution. One question: do the errors also affect the longitude and latitude columns? I rarely use data before 2005 but clearly it's very important so I (and I imagine other users of the data) am very grateful to you for raising this issue.

Robinlovelace avatar Aug 09 '19 06:08 Robinlovelace

There's no longitude or latitude data at all until 1999, by which time the eastings and northings look accurate anyway.

Some number of these will be transcription errors, but others (London) certainly look like systematic use of an alternative (or truncated) grid reference system, so I think there is a chance of fixing those.

Another challenge is that the data for all accidents has fewer obviously missing areas:

image

It seems unlikely that these places really had no serious or fatal collisions in a year. Perhaps police forces used a different system for recording more serious crashes in those areas?

cmcaine avatar Aug 09 '19 10:08 cmcaine

We had this problem with the cyipt project: some of the older data uses less precise grid references, and lots of data ended up in the sea.

The British National Grid has not changed since 1936, and I'm not aware of any regional grids in the UK. So I suspect there is some ad hoc truncation, where police have left out the initial few digits because they would always be the same in their area of interest.

mem48 avatar Aug 09 '19 10:08 mem48

@cmcaine I've had a deeper dive and this does not seem to be a simple scaling problem. I can't figure out how the coordinates are supposed to map to the BNG. I suggest making an enquiry with the DfT; they may have some historical context that we are missing.

mem48 avatar Aug 09 '19 13:08 mem48

Thanks for looking at it, Malcolm.

Sent to DfT:

Hello,

As described and illustrated in this github issue1, I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I also note that some areas, including much of Scotland, Wales and the North West appear to have reported "Slight" accidents/collisions to STATS19 but not "Serious" or "Fatal" collisions. This is also illustrated in the linked issue.

I would appreciate your help in identifying: 1) if this omission is documented; 2) if data on serious and fatal collisions in these locations is available elsewhere.

Cheers, Colin Caine PhD Student, School of Geography, University of Leeds

cmcaine avatar Aug 09 '19 14:08 cmcaine

Just picking up the thread on this after getting back from holiday yesterday. I suspect there are some systematic errors that can be fixed, and likely some random errors that cannot. I think asking the DfT is a good plan (have you heard anything @cmcaine? can follow up if not) and, if they don't know either, would suggest a collaborative project aimed at doing an even deeper dive than @mem48 did to identify dodgy coordinates (please share analysis code you used for this if you have it).

Quantifying (e.g. range, standard deviation) and plotting differences between expected (based on recent data) and recorded Easting and Northing distributions for each force/year combination in which dodgy coordinates are found should help at least identify the region/years in which there is a pattern to the error.
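A minimal sketch of that comparison in base R, under stated assumptions: the toy `crashes` table, the helper names `coord_summary()` and `median_shift()`, and the choice of 2004 as the reference year are all hypothetical, and only the column names follow the stats19 output. It is a starting point for spotting force/year groups whose coordinates sit far from where recent data says they should be, not a tested method.

```r
# Toy data standing in for the accidents table; values are made up.
crashes = data.frame(
  police_force = c("A", "A", "A", "B", "B", "B"),
  year = c(1979, 1979, 2004, 1979, 1979, 2004),
  location_easting_osgr = c(52000, 53000, 520000, 430000, 431000, 430500),
  location_northing_osgr = c(18000, 17500, 180000, 433000, 434000, 433500)
)

# Range and standard deviation of one coordinate column per force/year group.
coord_summary = function(d, col) {
  groups = split(d[[col]], list(d$police_force, d$year), drop = TRUE)
  data.frame(
    group = names(groups),
    min = sapply(groups, min, na.rm = TRUE),
    max = sapply(groups, max, na.rm = TRUE),
    sd = sapply(groups, sd, na.rm = TRUE),
    row.names = NULL
  )
}

# Difference between each force/year median and the same force's median in a
# reference year (assumed to be reliably geocoded).
median_shift = function(d, col, ref_year) {
  med = aggregate(d[[col]], list(force = d$police_force, year = d$year),
                  median, na.rm = TRUE)
  ref = med[med$year == ref_year, c("force", "x")]
  names(ref)[2] = "ref"
  out = merge(med, ref, by = "force")
  out$shift = out$x - out$ref
  out[order(out$force, out$year), ]
}

east = coord_summary(crashes, "location_easting_osgr")
shifts = median_shift(crashes, "location_easting_osgr", ref_year = 2004)
shifts
```

In the toy data, force "A" in 1979 shows a large negative shift while force "B" shows none, which is the kind of pattern that would point to a systematic rather than random error.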

Robinlovelace avatar Aug 17 '19 15:08 Robinlovelace

I have not received any response yet. Please feel free to follow up, I don't have any contacts at the DfT. Contacting the Metropolitan Police or the Mayor's Office for Policing and Crime might be sensible, too.

It's pretty clear from the maps that it's mostly crime in London that is systematically miscoded in the first two years.

cmcaine avatar Aug 17 '19 16:08 cmcaine

I couldn't find any pattern in the London data. It wasn't even roughly in the shape of London, so I think we are going to need expert help.

mem48 avatar Aug 19 '19 08:08 mem48

I think it should be discoverable, though. 1981 was only 38 years ago. I'm sure the Met Police or MOPAC could reach out to some retired officers for us if they felt like it.

I've sent a similar email to the ONS as well and attached a sample CSV of the eastings and northings.

I attach here a zip of a CSV of all of the eastings and northings for 1979-1981 in case anyone wants to get the data without using R (mostly for the convenience of our external friends). Each observation includes the easting and northing, local authority district name, road class and road number.

stats19_osgr_1979_1981.zip

The exact text of the email sent to the ONS is:

To: [email protected] Subject: Help interpreting unusual grid references in DfT STATS19 data

Hello,

As described and illustrated in this github issue1, I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I attach a zipped CSV sample of 50k (out of ~750k) observations from the stats19 dataset 1979-1981 for your convenience. A zipped csv containing all the rows is linked from the github issue. Each observation includes the easting and northing, local authority district name, road class and road number.

Cheers, Colin Caine PhD Student, School of Geography, University of Leeds

cmcaine avatar Aug 19 '19 12:08 cmcaine

Perhaps the coordinates just assume that they're on the OS map for their particular area of London.

If there were enough different maps for different areas of London, then the plotted points would not form a scaled but recognisable outline of London.

cmcaine avatar Aug 19 '19 12:08 cmcaine

That is possible. There may also have been a conversion error that scrambled the data. For example, BNG coordinates can be stored like this: TQ1234. If these had been converted to numbers incorrectly, they may have become garbled.
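To make the conversion point concrete, here is a rough base R sketch of how a lettered grid reference maps to numeric eastings and northings, and what is lost if the letters are simply stripped. The function name `bng_ref_to_en()` is made up for illustration; the letter arithmetic is the standard two-letter 100 km square layout (a 5 x 5 grid, skipping "I"). Nothing here is known to be what any police force actually did.

```r
# Convert a lettered BNG reference such as "TQ1234" to numeric easting and
# northing (south-west corner of the referenced square), in metres.
bng_ref_to_en = function(ref) {
  ref = toupper(gsub(" ", "", ref))
  letters25 = setdiff(LETTERS, "I") # the grid lettering skips "I"
  l1 = match(substr(ref, 1, 1), letters25) - 1
  l2 = match(substr(ref, 2, 2), letters25) - 1
  digits = substr(ref, 3, nchar(ref))
  stopifnot(!is.na(l1), !is.na(l2),
            nchar(digits) %% 2 == 0, grepl("^[0-9]*$", digits))
  half = nchar(digits) / 2
  # Position of the 100 km square, counted from the grid's false origin.
  e100 = ((l1 - 2) %% 5) * 5 + (l2 %% 5)
  n100 = (19 - (l1 %/% 5) * 5) - (l2 %/% 5)
  # Fewer digits means a coarser reference, so scale them up to metres.
  scale = 10^(5 - half)
  e = if (half > 0) as.numeric(substr(digits, 1, half)) else 0
  n = if (half > 0) as.numeric(substr(digits, half + 1, 2 * half)) else 0
  c(easting = e100 * 1e5 + e * scale, northing = n100 * 1e5 + n * scale)
}

proper = bng_ref_to_en("TQ1234")

# A naive conversion that drops the letters keeps only the within-square
# digits, so the 100 km square (and the digit split) is lost.
naive = as.numeric(gsub("[A-Z]", "", "TQ1234"))
```

If something like the naive conversion happened upstream, points from different 100 km squares would be overlaid on each other, which would be hard to undo without another location field.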

mem48 avatar Aug 20 '19 09:08 mem48

I plotted all the crashes on the A4 in 1979:

A4

There are some random points, but there is clearly a road across the map. The interesting bit is the bottom right, which suggests more than one coordinate system is in use.

mem48 avatar Aug 20 '19 09:08 mem48

Just catching up with this. I think one technical issue is also that format_sf fails on the output of get_stats19 for those years. I can take this to a new ticket if I am right:

Reprex to show that all London accidents return empty using `format_sf`:
dd = "~/code/saferactive/ignored/"
acc7904 = stats19::get_stats19(1979, data_dir = dd)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Files identified: Stats19-Data1979-2004.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data1979-2004.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ~/code/saferactive/ignored//Stats19-Data1979-2004/Vehicles7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/code/saferactive/ignored//Stats19-Data1979-2004/Casualty7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Reading in:
#> /home/layik/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
# acc7904 = stats19::format_sf(acc7904, lonlat = TRUE)
l = acc7904[acc7904$local_authority_district == "London", ]
nrow(l)
#> [1] 638746
l = stats19::format_sf(l, lonlat = TRUE)
#> 638746 rows removed with no coordinates
#> Warning in min(cc[[1]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in min(cc[[2]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in max(cc[[1]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in max(cc[[2]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
nrow(l) == 0 # TRUE
#> [1] TRUE

Created on 2020-07-01 by the reprex package (v0.3.0)

In terms of a systematic error or help with this ticket, I will also send an email to our tech contacts in DfT and cc @Robinlovelace and invite them to contribute here if possible.

layik avatar Jul 01 '20 12:07 layik

I sent an email to our tech contacts in DfT too @cmcaine and @mem48. Will update this ticket if I dig anything out. Great data analysis/insights.

layik avatar Jul 01 '20 12:07 layik

Just come to this - just terrible geo-coding, and no error checking at the time, not an alternative CRS. Unless you can match to main road name and work from that, just learn to live with it. I'd be wary of the idea of "correction" too....

wengraf avatar Nov 02 '20 17:11 wengraf

I think different conventions were used in different forces. Confident there are ways to improve on the assumption of 'bog standard' 27700 (e.g. by dividing coords by 10) for some places, but not a priority!
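One hedged way to test an idea like "divide by 10" for a given place: take a bounding box you trust for that area (for example from post-2005 records) and see what share of the old points lands inside it under each candidate scale factor. The sketch below uses base R only; `share_inside()` is a hypothetical helper, the bounding box is a rough guess at Greater London, and the coordinates are invented.

```r
# Share of points falling inside an expected bounding box under each
# candidate scale factor. bbox is a named vector: xmin, xmax, ymin, ymax.
share_inside = function(e, n, bbox, factors = c(1, 0.1, 10)) {
  out = sapply(factors, function(f) {
    mean(e * f >= bbox[["xmin"]] & e * f <= bbox[["xmax"]] &
           n * f >= bbox[["ymin"]] & n * f <= bbox[["ymax"]], na.rm = TRUE)
  })
  names(out) = as.character(factors)
  out
}

# Approximate Greater London extent in EPSG:27700 (an assumption, not exact).
london_bbox = c(xmin = 503000, xmax = 562000, ymin = 155000, ymax = 201000)

# Invented coordinates: two recorded with an extra digit, one recorded normally.
e = c(5300000, 5250000, 530000)
n = c(1800000, 1795000, 181000)

res = share_inside(e, n, london_bbox)
res
```

A factor that lifts the share well above the unscaled baseline for one force/year, but not for others, would be some evidence of a local convention rather than random error.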

Robinlovelace avatar Nov 03 '20 08:11 Robinlovelace

A lot would have been found on an old A-Z, then roughly guessed on a paper Landranger, with only the vaguest idea about eastings and northings. Stats19 has duff fields, at points in time, that’s something people have to just come to terms with.

wengraf avatar Nov 03 '20 09:11 wengraf

Some will have been filled with meaningless numbers, like 0,0, just so it passed the check for a filled in field.
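Placeholder values like these are at least easy to flag before any attempt at correction. A small base R sketch follows; the helper name `flag_placeholder()` is hypothetical, and treating only exact 0,0 pairs and missing values as placeholders is an assumption, since other filler values may exist.

```r
# TRUE where a coordinate pair is missing or looks like a 0,0 filler value.
flag_placeholder = function(e, n) {
  is.na(e) | is.na(n) | (e == 0 & n == 0)
}

# Invented examples: filler, missing, plausible, half-missing.
e = c(0, NA, 198460, 406380)
n = c(0, NA, 894000, NA)

flags = flag_placeholder(e, n)
table(flags)
```

Counting flagged rows per police force and year would show whether filler values cluster in particular places and periods.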

wengraf avatar Nov 03 '20 09:11 wengraf

Good point Ivo.

Robinlovelace avatar Nov 03 '20 09:11 Robinlovelace

Can we also close this? As we cannot offer any useful solutions to the issue. Use of road names etc are all outside the main issue. I say we close it.

layik avatar Dec 03 '20 10:12 layik

I think we can close this. We've raised the issue and even give the user a message telling them to watch out. Good suggestion, thanks @layik.

stats19::get_stats19(year = 1979)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Files identified: Stats19-Data1979-2004.zip
#>    http://data.dft.gov.uk.s3.amazonaws.com/road-accidents-safety-data/Stats19-Data1979-2004.zip
#> Data already exists in data_dir, not downloading
#> Data saved at ~/stats19-data/Stats19-Data1979-2004/Vehicles7904.csv~/stats19-data/Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/stats19-data/Stats19-Data1979-2004/Casualty7904.csv~/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See https://github.com/ropensci/stats19/issues/101 and https://github.com/ropensci/stats19/issues/102
#> Reading in:
#> /home/robin/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
#> # A tibble: 6,224,198 x 33
#>    accident_index location_eastin… location_northi… longitude latitude
#>    <chr>                     <int>            <int>     <dbl>    <dbl>
#>  1 197901A11AD14                NA               NA        NA       NA
#>  2 197901A1BAW34            198460           894000        NA       NA
#>  3 197901A1BFD77            406380           307000        NA       NA
#>  4 197901A1BGC20            281680           440000        NA       NA
#>  5 197901A1BGF95            153960           795000        NA       NA
#>  6 197901A1CBC96            300370           146000        NA       NA
#>  7 197901A1DAK71            143370           951000        NA       NA
#>  8 197901A1DAP95            471960           845000        NA       NA
#>  9 197901A1EAC32            323880           632000        NA       NA
#> 10 197901A1FBK75            136380           245000        NA       NA
#> # … with 6,224,188 more rows, and 28 more variables: police_force <chr>,
#> #   accident_severity <chr>, number_of_vehicles <int>,
#> #   number_of_casualties <int>, date <date>, day_of_week <chr>, time <chr>,
#> #   local_authority_district <chr>, local_authority_highway <chr>,
#> #   first_road_class <chr>, first_road_number <int>, road_type <chr>,
#> #   speed_limit <int>, junction_detail <chr>, junction_control <chr>,
#> #   second_road_class <chr>, second_road_number <int>,
#> #   pedestrian_crossing_human_control <chr>,
#> #   pedestrian_crossing_physical_facilities <chr>, light_conditions <chr>,
#> #   weather_conditions <chr>, road_surface_conditions <chr>,
#> #   special_conditions_at_site <chr>, carriageway_hazards <chr>,
#> #   urban_or_rural_area <chr>,
#> #   did_police_officer_attend_scene_of_accident <int>,
#> #   lsoa_of_accident_location <chr>, datetime <dttm>

Created on 2020-12-03 by the reprex package (v0.3.0)

Robinlovelace avatar Dec 03 '20 11:12 Robinlovelace

I've looked back at my earlier comments, and perhaps age and fatherhood have mellowed me since then... Is there any mileage to be had with reverse geocoding and LA polygon/road/secondary road/junction type etc.? Perhaps it is an interesting undergrad or MSc project?

wengraf avatar Nov 06 '23 11:11 wengraf