
Creating a customized column

Open simratbhandari2 opened this issue 4 years ago • 8 comments

I am trying to add a customized column (e.g. driver_license) to the data profiler library so that the final profiled JSON contains that column with all the usual statistics. Would it be possible to add functionality that makes adding a customized column, with all the required statistics, easier?

simratbhandari2 avatar Jun 21 '21 15:06 simratbhandari2

@simratbhandari2 can you add clarification on whether you want the customized column to be predicted on by the model? i.e. this would require training data for the model to transfer learn and update to the new set of labels for prediction

JGSweets avatar Jun 21 '21 15:06 JGSweets

Could you provide a very simple example? We can expand from there.

lettergram avatar Jun 21 '21 18:06 lettergram

Yes. Basically, I am using the data profiler on a CSV, and it gives me the necessary statistics. However, I want to add a customized column, drivers_license, to the CSV and get the same statistics and predictions for it. So I want the model to scan the drivers_license column the way it does the other columns and then return the JSON result with the predictions and statistics.

simratbhandari2 avatar Jun 22 '21 13:06 simratbhandari2

Hi @simratbhandari2 , Just to confirm, you are looking for "drivers_license" to show up as a value to the column data stat data_label, correct?

e.g.

"data_stats": {
    <column name>: {
        .
        .
        .
        "data_label": "drivers_license"
        .
        .
        .
    }
}

or are you wanting a profile that returns stats only specific to driver's licenses?

JGSweets avatar Jun 23 '21 20:06 JGSweets

So, it is both essentially. The drivers_license ideally should appear as a value to the column data stat data_label. However, if I run the profiler on it, I also wish for the profiler to return the stats for drivers_license similar to the rest of the columns.

simratbhandari2 avatar Jun 24 '21 13:06 simratbhandari2

Right, to be able to do this, one has to train a new model that includes the drivers_license label.

An example of training the current model on your data is illustrated in the examples, labeler.ipynb: https://github.com/capitalone/DataProfiler/blob/main/examples/labeler.ipynb

Once that model is trained, the new model can be saved and also used in the profiler. To use the model in the profiler, we can set it as one of the options:

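import dataprofiler as dp

# The retrained labeler that includes the drivers_license label
my_new_data_labeler = ....

# Tell the profiler to use the new labeler for structured (tabular) data
profile_options = dp.ProfilerOptions()
profile_options.set({'structured_options.data_labeler.data_labeler_object': my_new_data_labeler})
profile = dp.Profiler(data, options=profile_options)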

This will add the new label to the profiled results and keep the current stats for the profiles. Changing which stats are returned is a separate discussion, which we can also have.
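To check that the new label actually shows up in the results, something like the following could work. This is only a minimal sketch: it assumes the report is a plain dict keyed by column name, as in the data_stats snippet earlier in this thread (other versions of the library may shape the report differently), and the helper name columns_with_label and the sample column names are hypothetical, not part of the library.

```python
from typing import Any, Dict, List


def columns_with_label(report: Dict[str, Any], label: str) -> List[str]:
    """Return the names of columns whose data_label equals `label`.

    Assumes report["data_stats"] maps column name -> stats dict,
    as in the snippet earlier in this thread (an assumption).
    """
    matches = []
    for column_name, stats in report.get("data_stats", {}).items():
        # Columns without a data_label entry are simply skipped
        if stats.get("data_label") == label:
            matches.append(column_name)
    return matches


# Hypothetical report, hand-written for illustration only
example_report = {
    "data_stats": {
        "license_no": {"data_label": "drivers_license"},
        "full_name": {"data_label": "PERSON"},
    }
}

print(columns_with_label(example_report, "drivers_license"))
```

In practice the dict would come from the profile's report rather than being written by hand.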

JGSweets avatar Jun 24 '21 14:06 JGSweets

You mentioned initially wanting to make this easier. What sort of workflow would you want for adding a new label?

JGSweets avatar Jun 24 '21 14:06 JGSweets

Hi @simratbhandari2, I want to circle back on this and see if there's more discussion to be had on this issue. Thanks!

JGSweets avatar Aug 26 '21 16:08 JGSweets