triage icon indicating copy to clipboard operation
triage copied to clipboard

Problems with feature column names exceeding Postgres' max identifier length

Open adunmore opened this issue 4 years ago • 6 comments

Triage can try to create column names longer than Postgres' 63 character max identifier length. When this happens, Postgres truncates the candidate column name.

I've seen this causing problems in the feature generation code where an expected column name doesn't match the actual name in database. For example, when triage applies imputation rules, it first queries a feature table for the list of columns containing nulls. If one of the column names returned was truncated on creation, then triage doesn't have an imputation rule to handle it.

This might also cause problems where triage tries to create duplicate columns, as when creating columns for different aggregations of the same quantity:

'[some_63_char_identifier]_sum' and '[some_63_char_identifier]_max' both get truncated to the same identifier: '[some_63_char_identifier]'

adunmore avatar May 21 '20 19:05 adunmore

Thanks for flagging this. The temporary workaround is to shorten the name of your columns in your from_obj or when you are selecting values for categorical columns (this is where I tend to run into it). For example, for categoricals, I will sometimes use something like left(6) to produce shortened names.

ecsalomon avatar May 21 '20 20:05 ecsalomon

It's worth thinking about what Triage should do in this case. Just flag it in validation and throw an error as early as possible, or is there a reasonable way to autocorrect/autoshorten for the user?

In either case, it'll probably be easier to fix it if https://github.com/dssg/triage/pull/608 is merged, because in it the full feature names (with intervals and aggregate functions, etc) are available much earlier in the process, before the database starts doing work.

thcrock avatar May 21 '20 23:05 thcrock

Could we

  • implement an option to specify an alias for the source column name (the middle part in the generated name) in the config file
  • throw an error and require that the user specify a shorter name in the alias

This has the advantage of being easier than trying to figure out an approach to automatically shortening the name, and allows the user to ensure that the resulting name is meaningful/readable while being unique (compared to something like generating a hash of the source column name)

adunmore avatar May 22 '20 00:05 adunmore

I think #1 is one of the things @ecsalomon was alluding to, if I read it right. The aliasing you can do now.

If you specify the quantity as a dict instead of a string you can call the output column whatever you want. Dirtyduckling uses this aliasing option: https://github.com/dssg/triage/blob/master/example/config/dirty-duckling.yaml#L32

Instead of quantity: <sourcecolname> do

quantity:
   <alias>: <sourcecolname>

thcrock avatar May 22 '20 00:05 thcrock

As far as the validation goes, this ticket exists for that.

https://github.com/dssg/triage/issues/517 https://github.com/dssg/triage/issues/514

A bunch of people have hit this problem, so it would be very nice to fix! As of yet it hasn't been done because it's a bit complicated to catch this before feature creation given how the feature creation code is structured right now (hence my bringing up #608)

thcrock avatar May 22 '20 00:05 thcrock

To be slightly more useful, here is an example feature group with my categorical workaround:

 -
        prefix: 'dx'
        from_obj: |
            (SELECT entity_id,
                   knowledge_date,
                   left(diagnosis, 6) AS dx                     # shortened name
              FROM diagnosis_table) AS dx
        knowledge_date_column: 'knowledge_date'
        categoricals_imputation:
            all:
                type: 'zero'
        categoricals:
            - # first 6 letters of diagnosis with at least 150 occurrences
                column: dx
                choice_query: |
                    SELECT DISTINCT left(dx, 6)       # shorten each category to 6 letters
                      FROM (
                               SELECT left(diagnosis, 6) AS dx, 
                                      count(*)
                                 FROM diagnosis_table
                                GROUP BY dx
                           ) AS dx_counts
                     WHERE count > 150
                metrics: 
                    - max
                    - sum

ecsalomon avatar May 22 '20 14:05 ecsalomon