
Data drift: Usage of all the reference data instead of a mean (?

Open bhenriquezsoto opened this issue 1 year ago • 4 comments

Hi, I hope you're doing well. I'm having trouble with the data drift plots. image

As you can see, the reference data counts all of the rows in the dataframe associated with the reference data, and as a consequence the distribution looks a little odd. Is there any way to work around this?

I was thinking of using the mean, but that doesn't work very well for boolean values.

Thanks in advance!

bhenriquezsoto avatar Jan 30 '24 23:01 bhenriquezsoto

Hi @benja20029, do you refer to the fact that the absolute object count is much higher in reference, which makes the plot harder to interpret? If yes, you can toggle the "perc" button in the top right corner to switch from absolute to percentage view, which will be more convenient. Let me know if you meant something else.
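To illustrate why the percentage view makes samples of very different sizes comparable, here is a minimal sketch in plain Python. It is not Evidently's internal code; the helper `to_percentages` and the sample data are hypothetical and only show the idea of normalizing counts by sample size, which also works for boolean columns.

```python
from collections import Counter


def to_percentages(values):
    """Return each distinct value's share of the sample, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100 * count / total for value, count in counts.items()}


# Hypothetical data: a large reference sample and a much smaller current one.
reference = [True] * 7000 + [False] * 3000
current = [True] * 65 + [False] * 35

ref_pct = to_percentages(reference)
cur_pct = to_percentages(current)

# Absolute counts differ by two orders of magnitude; shares are comparable.
print(ref_pct)
print(cur_pct)
```

With absolute counts the reference bars would dwarf the current ones, while the shares can be compared directly.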

elenasamuylova avatar Jan 31 '24 13:01 elenasamuylova

I encountered an issue while following your suggestion, which led to an unexpected problem. Specifically, there appears to be a discrepancy between the information depicted in the data drift plot and that shown in the data distribution plot. This can be observed in the images provided below:

image image

As illustrated, the first plot suggests the presence of approximately 5 or 6 data points around the 240 M mark. However, the data distribution plot indicates there are 61 data points, which is perplexing and warrants further investigation. Could there be a specific reason for this discrepancy?

Additionally, I am interested in understanding how the data is organized with respect to the indexed bins. Is there a way to sort these bins from the highest to the lowest value on the Y axis?

bhenriquezsoto avatar Jan 31 '24 13:01 bhenriquezsoto

Hi @benja20029, the first plot splits your data into 150 bins - there is currently no way to parametrize it. This aggregation helps reduce the overall size of the Report since users often run it for thousands or millions of data points at once.

You can consider the following.

1/ Include the DateTime

If you have a timestamp (DateTime) for each prediction, I would recommend referring to it in Column Mapping. This way, your plot will not just use the binned numbered index but organize values by time to make it more interpretable.

2/ Turn the aggregation off (use raw data)

If you have a small number of data points, it might make sense to turn the raw data on when generating the Report. This way, you will see individual data points instead of binned index. Docs: https://docs.evidentlyai.com/user-guide/customization/report-data-aggregation

3/ Generate additional visualizations for specific columns outside of the DataDriftTable().

If you want to explore your data after you identified columns of interest (e.g. drifted columns), you can use other metrics to visualize them - e.g. ColumnDriftMetric(), ColumnValuePlot() (with raw data on), or ColumnSummaryMetric() for feature stats.
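The three options above could be combined roughly as follows. This is only a sketch, assuming the Evidently `Report` API as documented around the time of this thread (it may differ in later releases). The helper `build_drift_report` and the column names `timestamp` and `feature` are hypothetical placeholders.

```python
def build_drift_report(reference_df, current_df,
                       datetime_col="timestamp", column="feature"):
    """Build and run a drift report (sketch; column names are placeholders)."""
    # Imported here so the sketch only needs Evidently when it is called.
    from evidently import ColumnMapping
    from evidently.report import Report
    from evidently.metrics import (
        ColumnDriftMetric,
        ColumnSummaryMetric,
        ColumnValuePlot,
        DataDriftTable,
    )

    # 1/ Point Evidently at the timestamp column so plots are ordered by time.
    column_mapping = ColumnMapping(datetime=datetime_col)

    report = Report(
        metrics=[
            DataDriftTable(),
            # 3/ Extra per-column views for a column of interest.
            ColumnDriftMetric(column_name=column),
            ColumnValuePlot(column_name=column),
            ColumnSummaryMetric(column_name=column),
        ],
        # 2/ Turn aggregation off and render raw data points.
        options={"render": {"raw_data": True}},
    )
    report.run(
        reference_data=reference_df,
        current_data=current_df,
        column_mapping=column_mapping,
    )
    return report
```

Given two pandas dataframes, something like `build_drift_report(reference_df, current_df).save_html("drift_report.html")` would write the result to a file. Raw data rendering is best kept for small datasets, since it increases the size of the Report.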

elenasamuylova avatar Jan 31 '24 14:01 elenasamuylova

Thanks for the first two suggestions, they helped me a lot!

I didn't quite understand the third one, though :(. I would really like to change the tooltip shown when I hover over a data point in the data drift plot or, if that's not possible, generate a new visualization, but I don't know whether I should create a new method or whether there's a workaround (I only want to change a small thing). 😢

bhenriquezsoto avatar Jan 31 '24 15:01 bhenriquezsoto