bdit_data-sources icon indicating copy to clipboard operation
bdit_data-sources copied to clipboard

Collision control totals

Open cczhu opened this issue 3 years ago • 7 comments

In support of our now year-old comparison of collision totals with TPS's records, we need to produce a set of control totals at the different stages of the ACC ingestion pipeline. We already know that ACC has fewer TPS-numbered collisions relative to TPS's own records, but that this may be due to how TPS records are stored and whether or not collisions on private property are being counted.

Our goal is to have a tally by year and by type of the collision events at all stages in the ACC pipeline.

cczhu avatar Apr 26 '21 12:04 cczhu

To begin, we restrict our analysis to years 2014-2019 inclusive. To obtain a set of events from the XML files, we extract the year and GO number from the XML filenames, removing any duplicates (I assume multiple XMLs of the same collision may be sent to update certain fields, though haven't stumbled across any duplicate name XMLs whose contents aren't the same modulo some XML syntax differences). To obtain all TPS collisions from the same timespan in ACC, I use Jim Millington's explanation of the ACCNB generation process to assign TPS, CRC or DUNNO (representing ACCNBs that don't follow either naming convention). I then join the two sets of events together to get this:

year all_xml all_tps_acc xml_not_in_acc xml_class10_not_in_acc xml_nonclass10_not_in_acc xml_public_not_in_acc acc_tps_not_in_xml acc_dunno_not_in_xml
2014 18539 16729 1836 1341 495 185 12 0
2015 16559 14806 1761 1302 459 112 11 0
2016 13061 11507 1595 1112 483 129 49 24
2017 11342 9923 1472 1012 460 122 54 16
2018 10956 9693 1326 973 353 91 62 12
2019 10361 9154 1261 868 393 133 57 32

where

  • all_xml - all collision events from XML folders
  • all_tps_acc - all collision events that appear to have TPS-derived ACCNBs from collisions.events
  • xml_not_in_acc - all events from XML folders that don't appear in collisions.events
  • xml_class10_not_in_acc - xml_not_in_acc that have ACCLOC = '10'
  • xml_nonclass10_not_in_acc - xml_not_in_acc that have ACCLOC != '10'
  • xml_public_not_in_acc - xml_not_in_acc that have NOT ACCLOC IN ('04', '10', '98', '99), and so should be on public land
  • acc_tps_not_in_xml - TPS events in collisions.events and not in XML folders
  • acc_dunno_not_in_xml - events with unknown origin in collisions.events and not in XML folders

Observations:

  • There are 80818 event reports (assuming that two files with the same filename found in different XML folders are different updates to the same collision event) in the XML folders.
  • There are 71812 events in ACC with 10-digit ACCNBs (which indicates they're derived from TPS MVAs). This leaves 9251 events in the XML folders but not in ACC.
  • Of these 9251, 6608 have ACCLOC = 10, while the others do not. According to Evan's documentation only those with ACCLOC = 10 are moved to the ACC_ARCHIVE table. However, the collision_factors.accloc table suggests that class 4, 98 and 99 could potentially also be on private property. If we remove these, there are still 772 collisions that should be in ACC rather than ACC_ARCHIVE. Jesse mentions this can come from several sources:
    • Some collisions could be outside the bounds of Toronto.
    • GO numbers are created from 911 calls, and if TPS doesn't fully consolidate all these calls there could be duplicate records for the same collision.
  • There are 245 collisions that appear to be from the TPS but aren't in the XML folders.
  • There are 84 collisions that don't follow the ACCNB naming conventions and don't correspond to any file in the XML folders. These only start appearing in 2016.
    • Either of the last two could be due to additional records manually created by the mail clerks to support KSI collisions that have yet to be transferred from TPS's database.

cczhu avatar Apr 26 '21 16:04 cczhu

Performing the same exercise as above, but using the union of ACC and ACC_ARCHIVE (where all collisions deemed inappropriate for ACC are loaded instead) on Oracle, we get this table:

year  all_xml all_tps_ocl xml_not_in_ocl xml_class10_not_in_ocl xml_nonclass10_not_in_ocl xml_public_not_in_ocl ocl_tps_not_in_xml ocl_dunno_not_in_xml
2014 18540 18459 112 11 101 93 15 0
2015 16563 16533 42 8 34 27 14 0
2016 13063 13036 73 12 61 46 53 24
2017 11342 11305 98 26 72 55 60 16
2018 10956 10961 55 14 41 37 63 12
2019 10363 10344 71 16 55 45 57 35

where

  • all_xml - all collision events from XML folders
  • all_tps_ocl - all collision events that appear to have TPS-derived ACCNBs from ACC and ACC_ARCHIVE combined
  • xml_not_in_ocl - all events from XML folders that don't appear in ACC and ACC_ARCHIVE combined
  • xml_class10_not_in_ocl - xml_not_in_ocl that have ACCLOC = '10'
  • xml_nonclass10_not_in_ocl - xml_not_in_ocl that have ACCLOC != '10'
  • xml_public_not_in_ocl - xml_not_in_ocl that have NOT ACCLOC IN ('04', '10', '98', '99), and so should be on public land
  • ocl_tps_not_in_xml - TPS events in ACC and ACC_ARCHIVE combined and not in XML folders
  • ocl_dunno_not_in_xml - events with unknown origin in ACC and ACC_ARCHIVE combined and not in XML folders

Observations :

  • The difference between the XML dump and combined tables from Oracle is much narrower than before - there are 451 event records in the XMLs that aren't in Oracle, and 262 records in the Oracle that aren't in the XMLs (with an additional 87 unknown collisions that don't seem to correspond to anything). This confirms that the large number of ACCLOC = 10 collisions missing from ACC are instead being inserted into ACC_ARCHIVE.
  • Of the collisions exclusive to the XML files, a plurality of them are "Non Intersection", rather than on private property. These could potentially be outside the bounds of the City, but then they should also be in ACC_ARCHIVE.
  • We expect the collisions in ACC_ARCHIVE to be either on private property or beyond the boundaries of Toronto. The investigation above shows that is indeed the case - the majority of collisions in ACC_ARCHIVE are class 4, 10 or 99, all of which are related to private property, but about 1% of them are not - this translates to around 400 collisions on public land, almost all of which have lon-lats. If we check these lon-lats, almost all of the 400 collisions on public land are within the bounds of Toronto, while those beyond are almost all on private property. So clearly there's either more information I'm not considering or there's a failure in whatever process divides collisions between ACC and ACC_ARCHIVE.

cczhu avatar May 05 '21 20:05 cczhu

From Jesse:

Circling back to this, from Jim: Conditions where the collision is moved to archive are accloc=10 or stname2='parking lot'

cczhu avatar May 17 '21 20:05 cczhu

Three caveats:

  • There are cases where the ACC number is repeated after 2013, though there are no instances between 2014-2019 in either ACC or ACC_ARCHIVE with two collisions in the same year but different dates. This needs to be accounted for by merging with the occurrence year alongside the ACC number, or else merged tables have duplicates.
  • There are cases, however, where what looks to be the same collision is repeated in both ACC and ACC_ARCHIVE. The only way to handle these is to double check their event-level data after ACC_ARCHIVE and ACC are merged together.
  • The XML files are named for the year the report was generated, and not when the collision occurred. The generated ACC number also uses the year reported (so a collision that occurred in 2015 and reported in 2016 will have an ACC number that starts with 6 or 16, and an ACCYEAR of 2015). This means that when joining between XML and Oracle, we have to drop all ACCNB duplicates in the XMLs, and when joining between TPS's Versadex data and Oracle, we need to be careful that if we join on year that we join on year of occurrence and not report.

Subsequent analysis will try to correct for these.

cczhu avatar May 18 '21 13:05 cczhu

I've successfully matched a dump of TPS data from Andrew (e-mail 2021-05-12), the archived XML files, and ACC and ACC_ARCHIVE combined on Oracle. Here are the results:

In All          72923.0
Oracle Only       162.0
VDX & Oracle      187.0
VDX & XML         357.0
VDX Only         1677.0
XML & Oracle     7446.0
XML Only           96.0
year In All Oracle Only VDX & Oracle VDX & XML VDX Only XML & Oracle XML Only
2014 16117 15 0 61 181 2324 51
2015 14814 14 0 32 86 1703 10
2016 12080 36 41 62 108 900 12
2017 10426 31 45 87 261 820 10
2018 10057 23 52 50 400 841 5
2019 9429 43 49 65 641 858 8

Here, In All means in all databases. Numbers represent individual records.

Observations (these numbers are +/- 2 due to duplicates generated during table merges):

  • There are 82,848 unique entries across the three databases. Of these, 72,923 exist in all three databases.
  • 7446 events only exist in (both) the XML and Oracle databases. When broken up by year, we see that more of these come from 2014 than 2019. There are three possible explanations I can think of off the top of my head:
    • These are actually from the CRC, but were placed in the TPS folders by mistake. (This assumes the CRC XML format is identical to the TPS one.)
    • These XMLs represent collision events that should have been merged with other events, but weren't done so in Oracle for some reason. Of the 7446 events, 6560 are in ACC, not ACC_ARCHIVE, so they should have been merged by a mail clerk if they represent the same event as some other row in ACC.
    • These collisions weren't included in TPS's data dump to us, for some reason.
  • 1678 events only exist in Versadex. When broken up by year, we see a steady increase from year to year, with the plurality of these events coming from 2019. Possible explanations:
    • These collisions were unsuccessfully transmitted to us. This is pretty likely.
    • These collisions represent duplicates in some way (eg. two 911 calls for the same incident that weren't unified in TPS's database).
    • The XMLs were successfully transferred but to a folder not connected to the ACC loading process.
  • There are 96 XML-only files. These may have been merged with other reports in both TPS and ACC databases.
  • There are 162 Oracle-only files, which may either be miscalssified CRC collisions or manually generated by the mail clerks.
  • There are 357 VDX and XML files that don't show up in either ACC or ACC_ARCHIVE. These may have not been successfully loaded by Jim Millington's process.
  • There are 187 VDX & Oracle-only files. These are weird - I can only surmise some XMLs may have accidentally been deleted?

cczhu avatar May 18 '21 22:05 cczhu

Convo with Jesse:

  • CRC doesn't generate XMLs, so they're all from TPS.
  • Injury type vs categories for data that has Oracle entries.
  • Some way to determine relevant properties of collisions in Oracle but not VDX.

cczhu avatar May 19 '21 16:05 cczhu

Jesse's hunch is correct. Here's a join against entries in our comparison that come from Oracle (so not including VDX and XML only entries) with TRAFFIC.ACC__ALL, which has involved properties. This gives us access to ACCTYPE, or collision_type, an event-level property that separates out collisions that are fatal or result in injuries from property-damage only. We can then create a pivot table:

join_status Fatal Non-Fatal Injury Non-Reportable Other Property Damage Only
In All 117 43521 304 16 28873
Oracle Only 28 27 0 1 106
VDX & Oracle 186 1 0 0 0
XML & Oracle 12 4277 84 0 3044

Looks like entries only in Oracle or in Oracle and VDX are highly disproportionately fatal or injury collisions.

If we use the most seriously injured individual in each event, we get

join_status Fatal Major Minimal Minor No Injury
In All 117 1853 18783 22472 29418
Oracle Only 28 1 13 13 107
VDX & Oracle 186 1 0 0 0
XML & Oracle 12 166 2431 1630 3179

which clearly shows that XML + Oracle spans a broad range of possible collisions.

Someone needs to check that the Oracle Only collisions aren't duplicates of other collisions that can be matched to VDX entries. Would be very bad if we're overcounting fatals historically.

cczhu avatar May 19 '21 23:05 cczhu