triage icon indicating copy to clipboard operation
triage copied to clipboard

Write basic feature stats to tables after feature generation

Open thcrock opened this issue 5 years ago • 1 comments

@saleiro @ecsalomon and I went on a bit of a dive this morning following my exploration of the https://github.com/tdda/tdda repository . The team has talked about "premodeling" before, that could expand in scope to be as big as postmodeling. Right now, however, I think we should figure out whats the simplest possible thing that could be useful as a beachhead for this premodeling. And in our discussion I think we have something.

After feature generation (both pre-imputation and post-imputation), create feature stats in table/s that encompass basic min/max/nulls stats for the created features. These would always reflect the current state of the entire feature table (e.g. we recompute each time), provided that this doesn't take too long.

thcrock avatar Feb 21 '19 16:02 thcrock

data_dict_df = pd.DataFrame()
for index,row in cleaned_cols_df.iterrows():
    #this might take a while, let's print some stuff while it runs
    if row['data_type'] != 'USER-DEFINED':                                          #FIXME - error with user-defined
        print(row['table_name'],'--', row['column_name'])
        data_dict_df = data_dict_df.append(
        sql("""
            select '{table}' as table_name, 
                    '{col}' as column_name, 
                    '{dtype}' as data_type,
                    count(*) as count_rows,
                    count(case when {col} is null then 1 end) as count_nulls,
                    count(distinct {col}) as count_distinct, 
                    min(case when pg_typeof({col}) in ({num_datetime}) then {col} else null end),
                    max(case when pg_typeof({col}) in ({num_datetime}) then {col} else null end),
                    avg(case when pg_typeof({col}) in ({num_datetime}) then {col} else null end),
                    sttdev(case when pg_typeof({col}) in ({num_datetime}) then {col} else null end),
                    mode() within group (order by {col})
            from cleaned.{table}
        """.format(table=row['table_name'], col=row['column_name'], dtype=row['data_type'], num_datetime=','.join(num_datetime_types)))
        ,ignore_index=True)

For feature tables, since all columns are non-categorical there is no need to verify the pg_typeof.

saleiro avatar Feb 21 '19 17:02 saleiro