bdit_data-sources
bdit_data-sources copied to clipboard
`collision_factors` tables don't join neatly to the favoured `collisions` tables
The collisions tables currently in vogue, collisions.events
and collisions.involved
, have many smallint type columns for the various coded fields. However the corresponding tables in collision_factors
are often(always?) text fields. Sometimes a join will work, or partially work, in a way that can easily fool you into thinking it worked as you expected.
SELECT *
FROM collisions.involved
JOIN collision_factors.invtype ON involved.invtype::text = invtype.invtype
LIMIT 100
This query however doesn't return any of the most important categories which are coded with leading zeros, as shown here in collision_factors.invtype
:
Casting the type in the other direction (from text to int) breaks when it encounters the 1A
and 1B
values.
How should these tables be joined? I think that coming up with my own logic for this on an ad hoc basis (and for every field!) will be very error prone.
@Nate-Wessel can you confirm that the non-integer codes (i.e. 1A
and 1B
) only exist for invtype
?
In the MOVE collision ETL pipeline, we remap those to integer values: https://github.com/CityofToronto/bdit_move_etl/blob/a4a673d7567998a5b1b2e3e6d9b952ddb3619870/scripts/airflow/tasks/crash_geocoding/A3_involved_fields_norm.sql#L16
1A
=> 97
1B
=> 98
We also replaced these values in the MOVE copy of the collision factors tables.
As such, you won't see any collisions with invtype
values 1A
and 1B
if using the new collisions schema.
Here's my suggested approach:
- In the short term, can you either apply a manual conversion for these values, or update
collision_factors.invtype
to replace the non-integer codes like we did in the MOVE database? @radumas do you have any concerns with the second approach (replacing the data)? - In the medium term, we can update the collision factors tables to obfuscate these odd values entirely, by syncing the tables with MOVE upstream, similar to how other data replication is happening.
- In the long term, we can obfuscate the need for collision factor cross-referencing entirely by using descriptive text instead of codes.
What do you think?
I'm ok with updating collision_factors.invtype
to replace
1A => 97
1B => 98
and then these lookup tables should get updated from the MOVE side at some point. @tahaislam can take care of this.
can you confirm that the non-integer codes (i.e. 1A and 1B) only exist for invtype?
Some notes on looking through all the tables in this schema:
-
collision_factors.cyccond
is empty -
collision_factors.imploc
permission denied for table imploc - all the rest of these do seem to be
::int
able text values. Confirming that does make these joins, collectively, a lot easier. Still, I'm always wary of needing to convert a type like that.
Permission granted to imploc
. It contains zero-padded numbers and a literal 'NULL'
I suggest UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL'
and then we can close this issue?
FYI imploc is not being replicated from Move.
I suggest
UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL'
and then we can close this issue? FYI imploc is not being replicated from Move.
Has anything else been done to address this issue? There were some suggestions above. Have any of them been implemented?
I see that the numeric fields in these tables are still text, often zero-padded. I still think that joining smallints to a text field like that is asking for trouble.
I think the suggestions made by @mkewins could be transfered to MOVE where these tables are being replicated from?
In the meantime,
FROM collisions.involved
LEFT JOIN collision_factors.invtype ON involved.invtype = invtype.invtype::int
(and similar for other collision_factors tables) works now that 1A, 1B are replaced.
I suggest
UPDATE collision_factors.imploc SET imploc = NULL WHERE imploc = 'NULL'
and then we can close this issue? FYI imploc is not being replicated from Move.
Closing as in MOVE's backlog
.