triage
triage copied to clipboard
Problems with feature column names exceeding Postgres' max identifier length
Triage can try to create column names longer than Postgres' 63 character max identifier length. When this happens, Postgres truncates the candidate column name.
I've seen this causing problems in the feature generation code where an expected column name doesn't match the actual name in database. For example, when triage applies imputation rules, it first queries a feature table for the list of columns containing nulls. If one of the column names returned was truncated on creation, then triage doesn't have an imputation rule to handle it.
This might also cause problems where triage tries to create duplicate columns, as when creating columns for different aggregations of the same quantity:
'[some_63_char_identifier]_sum'
and
'[some_63_char_identifier]_max'
both get truncated to the same identifier:
'[some_63_char_identifier]'
Thanks for flagging this. The temporary workaround is to shorten the name of your columns in your from_obj
or when you are selecting values for categorical columns (this is where I tend to run into it). For example, for categoricals, I will sometimes use something like left(6)
to produce shortened names.
It's worth thinking about what Triage should do in this case. Just flag it in validation and throw an error as early as possible, or is there a reasonable way to autocorrect/autoshorten for the user?
In either case, it'll probably be easier to fix it if https://github.com/dssg/triage/pull/608 is merged, because in it the full feature names (with intervals and aggregate functions, etc) are available much earlier in the process, before the database starts doing work.
Could we
- implement an option to specify an alias for the source column name (the middle part in the generated name) in the config file
- throw an error and require that the user specify a shorter name in the alias
This has the advantage of being easier than trying to figure out an approach to automatically shortening the name, and allows the user to ensure that the resulting name is meaningful/readable while being unique (compared to something like generating a hash of the source column name)
I think #1 is one of the things @ecsalomon was alluding to, if I read it right. The aliasing you can do now.
If you specify the quantity as a dict instead of a string you can call the output column whatever you want. Dirtyduckling uses this aliasing option: https://github.com/dssg/triage/blob/master/example/config/dirty-duckling.yaml#L32
Instead of
quantity: <sourcecolname>
do
quantity:
<alias>: <sourcecolname>
As far as the validation goes, this ticket exists for that.
https://github.com/dssg/triage/issues/517 https://github.com/dssg/triage/issues/514
A bunch of people have hit this problem, so it would be very nice to fix! As of yet it hasn't been done because it's a bit complicated to catch this before feature creation given how the feature creation code is structured right now (hence my bringing up #608)
To be slightly more useful, here is an example feature group with my categorical workaround:
-
prefix: 'dx'
from_obj: |
(SELECT entity_id,
knowledge_date,
left(diagnosis, 6) AS dx # shortened name
FROM diagnosis_table) AS dx
knowledge_date_column: 'knowledge_date'
categoricals_imputation:
all:
type: 'zero'
categoricals:
- # first 6 letters of diagnosis with at least 150 occurrences
column: dx
choice_query: |
SELECT DISTINCT left(dx, 6) # shorten each category to 6 letters
FROM (
SELECT left(diagnosis, 6) AS dx,
count(*)
FROM diagnosis_table
GROUP BY dx
) AS dx_counts
WHERE count > 150
metrics:
- max
- sum