bdit_data-sources
bdit_data-sources copied to clipboard
Collision control totals
In support of our now year-old comparison of collision totals with TPS's records, we need to produce a set of control totals at the different stages of the ACC ingestion pipeline. We already know that ACC has fewer TPS-numbered collisions relative to TPS's own records, but that this may be due to how TPS records are stored and whether or not collisions on private property are being counted.
Our goal is to have a tally by year and by type of the collision events at all stages in the ACC pipeline.
To begin, we restrict our analysis to years 2014-2019 inclusive. To obtain a set of events from the XML files, we extract the year and GO number from the XML filenames, removing any duplicates (I assume multiple XMLs of the same collision may be sent to update certain fields, though haven't stumbled across any duplicate name XMLs whose contents aren't the same modulo some XML syntax differences). To obtain all TPS collisions from the same timespan in ACC, I use Jim Millington's explanation of the ACCNB generation process to assign TPS
, CRC
or DUNNO
(representing ACCNBs that don't follow either naming convention). I then join the two sets of events together to get this:
year | all_xml | all_tps_acc | xml_not_in_acc | xml_class10_not_in_acc | xml_nonclass10_not_in_acc | xml_public_not_in_acc | acc_tps_not_in_xml | acc_dunno_not_in_xml |
---|---|---|---|---|---|---|---|---|
2014 | 18539 | 16729 | 1836 | 1341 | 495 | 185 | 12 | 0 |
2015 | 16559 | 14806 | 1761 | 1302 | 459 | 112 | 11 | 0 |
2016 | 13061 | 11507 | 1595 | 1112 | 483 | 129 | 49 | 24 |
2017 | 11342 | 9923 | 1472 | 1012 | 460 | 122 | 54 | 16 |
2018 | 10956 | 9693 | 1326 | 973 | 353 | 91 | 62 | 12 |
2019 | 10361 | 9154 | 1261 | 868 | 393 | 133 | 57 | 32 |
where
-
all_xml
- all collision events from XML folders -
all_tps_acc
- all collision events that appear to have TPS-derived ACCNBs fromcollisions.events
-
xml_not_in_acc
- all events from XML folders that don't appear incollisions.events
-
xml_class10_not_in_acc
-xml_not_in_acc
that haveACCLOC = '10'
-
xml_nonclass10_not_in_acc
-xml_not_in_acc
that haveACCLOC != '10'
-
xml_public_not_in_acc
-xml_not_in_acc
that haveNOT ACCLOC IN ('04', '10', '98', '99)
, and so should be on public land -
acc_tps_not_in_xml
- TPS events incollisions.events
and not in XML folders -
acc_dunno_not_in_xml
- events with unknown origin incollisions.events
and not in XML folders
Observations:
- There are 80818 event reports (assuming that two files with the same filename found in different XML folders are different updates to the same collision event) in the XML folders.
- There are 71812 events in ACC with 10-digit ACCNBs (which indicates they're derived from TPS MVAs). This leaves 9251 events in the XML folders but not in ACC.
- Of these 9251, 6608 have
ACCLOC = 10
, while the others do not. According to Evan's documentation only those withACCLOC = 10
are moved to theACC_ARCHIVE
table. However, thecollision_factors.accloc
table suggests that class 4, 98 and 99 could potentially also be on private property. If we remove these, there are still 772 collisions that should be inACC
rather thanACC_ARCHIVE
. Jesse mentions this can come from several sources:- Some collisions could be outside the bounds of Toronto.
- GO numbers are created from 911 calls, and if TPS doesn't fully consolidate all these calls there could be duplicate records for the same collision.
- There are 245 collisions that appear to be from the TPS but aren't in the XML folders.
- There are 84 collisions that don't follow the ACCNB naming conventions and don't correspond to any file in the XML folders. These only start appearing in 2016.
- Either of the last two could be due to additional records manually created by the mail clerks to support KSI collisions that have yet to be transferred from TPS's database.
Performing the same exercise as above, but using the union of ACC
and ACC_ARCHIVE
(where all collisions deemed inappropriate for ACC
are loaded instead) on Oracle, we get this table:
year | all_xml | all_tps_ocl | xml_not_in_ocl | xml_class10_not_in_ocl | xml_nonclass10_not_in_ocl | xml_public_not_in_ocl | ocl_tps_not_in_xml | ocl_dunno_not_in_xml |
---|---|---|---|---|---|---|---|---|
2014 | 18540 | 18459 | 112 | 11 | 101 | 93 | 15 | 0 |
2015 | 16563 | 16533 | 42 | 8 | 34 | 27 | 14 | 0 |
2016 | 13063 | 13036 | 73 | 12 | 61 | 46 | 53 | 24 |
2017 | 11342 | 11305 | 98 | 26 | 72 | 55 | 60 | 16 |
2018 | 10956 | 10961 | 55 | 14 | 41 | 37 | 63 | 12 |
2019 | 10363 | 10344 | 71 | 16 | 55 | 45 | 57 | 35 |
where
-
all_xml
- all collision events from XML folders -
all_tps_ocl
- all collision events that appear to have TPS-derived ACCNBs fromACC
andACC_ARCHIVE
combined -
xml_not_in_ocl
- all events from XML folders that don't appear inACC
andACC_ARCHIVE
combined -
xml_class10_not_in_ocl
-xml_not_in_ocl
that haveACCLOC = '10'
-
xml_nonclass10_not_in_ocl
-xml_not_in_ocl
that haveACCLOC != '10'
-
xml_public_not_in_ocl
-xml_not_in_ocl
that haveNOT ACCLOC IN ('04', '10', '98', '99)
, and so should be on public land -
ocl_tps_not_in_xml
- TPS events inACC
andACC_ARCHIVE
combined and not in XML folders -
ocl_dunno_not_in_xml
- events with unknown origin inACC
andACC_ARCHIVE
combined and not in XML folders
Observations :
- The difference between the XML dump and combined tables from Oracle is much narrower than before - there are 451 event records in the XMLs that aren't in Oracle, and 262 records in the Oracle that aren't in the XMLs (with an additional 87 unknown collisions that don't seem to correspond to anything). This confirms that the large number of
ACCLOC = 10
collisions missing fromACC
are instead being inserted intoACC_ARCHIVE
. - Of the collisions exclusive to the XML files, a plurality of them are "Non Intersection", rather than on private property. These could potentially be outside the bounds of the City, but then they should also be in
ACC_ARCHIVE
. - We expect the collisions in
ACC_ARCHIVE
to be either on private property or beyond the boundaries of Toronto. The investigation above shows that is indeed the case - the majority of collisions inACC_ARCHIVE
are class 4, 10 or 99, all of which are related to private property, but about 1% of them are not - this translates to around 400 collisions on public land, almost all of which have lon-lats. If we check these lon-lats, almost all of the 400 collisions on public land are within the bounds of Toronto, while those beyond are almost all on private property. So clearly there's either more information I'm not considering or there's a failure in whatever process divides collisions betweenACC
andACC_ARCHIVE
.
From Jesse:
Circling back to this, from Jim: Conditions where the collision is moved to archive are accloc=10 or stname2='parking lot'
Three caveats:
- There are cases where the ACC number is repeated after 2013, though there are no instances between 2014-2019 in either
ACC
orACC_ARCHIVE
with two collisions in the same year but different dates. This needs to be accounted for by merging with the occurrence year alongside the ACC number, or else merged tables have duplicates. - There are cases, however, where what looks to be the same collision is repeated in both
ACC
andACC_ARCHIVE
. The only way to handle these is to double check their event-level data afterACC_ARCHIVE
andACC
are merged together. - The XML files are named for the year the report was generated, and not when the collision occurred. The generated
ACC
number also uses the year reported (so a collision that occurred in 2015 and reported in 2016 will have an ACC number that starts with 6 or 16, and an ACCYEAR of 2015). This means that when joining between XML and Oracle, we have to drop all ACCNB duplicates in the XMLs, and when joining between TPS's Versadex data and Oracle, we need to be careful that if we join on year that we join on year of occurrence and not report.
Subsequent analysis will try to correct for these.
I've successfully matched a dump of TPS data from Andrew (e-mail 2021-05-12), the archived XML files, and ACC
and ACC_ARCHIVE
combined on Oracle. Here are the results:
In All 72923.0
Oracle Only 162.0
VDX & Oracle 187.0
VDX & XML 357.0
VDX Only 1677.0
XML & Oracle 7446.0
XML Only 96.0
year | In All | Oracle Only | VDX & Oracle | VDX & XML | VDX Only | XML & Oracle | XML Only |
---|---|---|---|---|---|---|---|
2014 | 16117 | 15 | 0 | 61 | 181 | 2324 | 51 |
2015 | 14814 | 14 | 0 | 32 | 86 | 1703 | 10 |
2016 | 12080 | 36 | 41 | 62 | 108 | 900 | 12 |
2017 | 10426 | 31 | 45 | 87 | 261 | 820 | 10 |
2018 | 10057 | 23 | 52 | 50 | 400 | 841 | 5 |
2019 | 9429 | 43 | 49 | 65 | 641 | 858 | 8 |
Here, In All means in all databases. Numbers represent individual records.
Observations (these numbers are +/- 2 due to duplicates generated during table merges):
- There are 82,848 unique entries across the three databases. Of these, 72,923 exist in all three databases.
- 7446 events only exist in (both) the XML and Oracle databases. When broken up by year, we see that more of these come from 2014 than 2019. There are three possible explanations I can think of off the top of my head:
- These are actually from the CRC, but were placed in the TPS folders by mistake. (This assumes the CRC XML format is identical to the TPS one.)
- These XMLs represent collision events that should have been merged with other events, but weren't done so in Oracle for some reason. Of the 7446 events, 6560 are in
ACC
, notACC_ARCHIVE
, so they should have been merged by a mail clerk if they represent the same event as some other row inACC
. - These collisions weren't included in TPS's data dump to us, for some reason.
- 1678 events only exist in Versadex. When broken up by year, we see a steady increase from year to year, with the plurality of these events coming from 2019. Possible explanations:
- These collisions were unsuccessfully transmitted to us. This is pretty likely.
- These collisions represent duplicates in some way (eg. two 911 calls for the same incident that weren't unified in TPS's database).
- The XMLs were successfully transferred but to a folder not connected to the ACC loading process.
- There are 96 XML-only files. These may have been merged with other reports in both TPS and ACC databases.
- There are 162 Oracle-only files, which may either be miscalssified CRC collisions or manually generated by the mail clerks.
- There are 357 VDX and XML files that don't show up in either
ACC
orACC_ARCHIVE
. These may have not been successfully loaded by Jim Millington's process. - There are 187 VDX & Oracle-only files. These are weird - I can only surmise some XMLs may have accidentally been deleted?
Convo with Jesse:
- CRC doesn't generate XMLs, so they're all from TPS.
- Injury type vs categories for data that has Oracle entries.
- Some way to determine relevant properties of collisions in Oracle but not VDX.
Jesse's hunch is correct. Here's a join against entries in our comparison that come from Oracle (so not including VDX and XML only entries) with TRAFFIC.ACC__ALL
, which has involved properties. This gives us access to ACCTYPE
, or collision_type
, an event-level property that separates out collisions that are fatal or result in injuries from property-damage only. We can then create a pivot table:
join_status | Fatal | Non-Fatal Injury | Non-Reportable | Other | Property Damage Only |
---|---|---|---|---|---|
In All | 117 | 43521 | 304 | 16 | 28873 |
Oracle Only | 28 | 27 | 0 | 1 | 106 |
VDX & Oracle | 186 | 1 | 0 | 0 | 0 |
XML & Oracle | 12 | 4277 | 84 | 0 | 3044 |
Looks like entries only in Oracle or in Oracle and VDX are highly disproportionately fatal or injury collisions.
If we use the most seriously injured individual in each event, we get
join_status | Fatal | Major | Minimal | Minor | No Injury |
---|---|---|---|---|---|
In All | 117 | 1853 | 18783 | 22472 | 29418 |
Oracle Only | 28 | 1 | 13 | 13 | 107 |
VDX & Oracle | 186 | 1 | 0 | 0 | 0 |
XML & Oracle | 12 | 166 | 2431 | 1630 | 3179 |
which clearly shows that XML + Oracle spans a broad range of possible collisions.
Someone needs to check that the Oracle Only collisions aren't duplicates of other collisions that can be matched to VDX entries. Would be very bad if we're overcounting fatals historically.