Data drift: Use of all the reference data instead of a mean?
Hi, I hope you're doing well. I'm having trouble with the data drift plots.
As you can see, the reference data counts all of the rows in the dataframe associated with the reference data, and as a consequence the distribution looks a little odd. Is there any way to work around this?
I was thinking of using the mean, but that doesn't work very well for boolean values.
Thanks in advance!
Hi @benja20029, are you referring to the fact that the absolute object count is much higher in reference, which makes the plot harder to interpret? If yes, you can toggle the "perc" button in the top right corner to switch from absolute to percentage view, which will be more convenient. Let me know if you meant something else.
I encountered an issue while following your suggestion, which led to an unexpected problem. Specifically, there appears to be a discrepancy between the information depicted in the data drift plot and that shown in the data distribution plot. This can be observed in the images provided below:
As illustrated, the first plot suggests the presence of approximately 5 or 6 data points around the 240 M mark. However, the data distribution plot indicates there are 61 data points, which is perplexing and warrants further investigation. Could there be a specific reason for this discrepancy?
Additionally, I am interested in understanding how the data is organized with respect to the indexed bins. Is there a way to sort these bins from the highest to the lowest value on the Y axis?
Hi @benja20029, the first plot splits your data into 150 bins - there is currently no way to parametrize it. This aggregation helps reduce the overall size of the Report since users often run it for thousands or millions of data points at once.
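To picture why a binned plot and a raw count can disagree, here is a minimal sketch. It assumes each bin is a chunk of consecutive rows summarized by its mean; that is an assumption made for illustration, not Evidently's actual implementation, and `bin_by_index` is a hypothetical helper.

```python
from statistics import mean


def bin_by_index(values, n_bins=150):
    """Split values into consecutive index chunks and average each chunk.

    Hypothetical helper for illustration only; not part of Evidently.
    """
    n = len(values)
    n_bins = min(n_bins, n)  # never create more bins than rows
    summaries = []
    for b in range(n_bins):
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        summaries.append(mean(values[start:end]))
    return summaries


# Toy data: 3,000 rows, 61 consecutive rows carry a much larger value.
values = [1.0] * 3000
for i in range(1000, 1061):
    values[i] = 240.0

binned = bin_by_index(values)
high_bins = [m for m in binned if m > 1.0]
print(len(binned), len(high_bins))
```

In this toy case the 61 large rows fall into only 4 of the 150 bins, so a binned plot shows a handful of elevated points even though a histogram of the raw values would count all 61.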
You can consider the following.
1/ Include the DateTime
If you have a timestamp (DateTime) for each prediction, I would recommend referring to it in Column Mapping. This way, your plot will not just use the binned numbered index but organize values by time to make it more interpretable.
2/ Turn the aggregation off (use raw data)
If you have a small number of data points, it might make sense to turn the raw data on when generating the Report. This way, you will see individual data points instead of binned index. Docs: https://docs.evidentlyai.com/user-guide/customization/report-data-aggregation
3/ Generate additional visualizations for specific columns outside of the DataDriftTable().
If you want to explore your data after you identified columns of interest (e.g. drifted columns), you can use other metrics to visualize them - e.g. ColumnDriftMetric(), ColumnValuePlot() (with raw data on), or ColumnSummaryMetric() for feature stats.
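The three suggestions above could be combined roughly as follows. This is a sketch, not a definitive recipe: it assumes the pre-0.7 Evidently API (module paths differ in later releases), and `build_drift_report`, `datetime_col`, and `column_of_interest` are hypothetical names chosen for the example.

```python
# Render option that turns aggregation off, as described in the linked docs page.
RAW_DATA_OPTIONS = {"render": {"raw_data": True}}


def build_drift_report(reference, current, datetime_col, column_of_interest):
    """Run a drift report on two pandas DataFrames and return it.

    Hypothetical helper; assumes the pre-0.7 Evidently API.
    """
    # Imported inside the function so the sketch loads without Evidently installed.
    from evidently import ColumnMapping
    from evidently.report import Report
    from evidently.metrics import (
        ColumnDriftMetric,
        ColumnSummaryMetric,
        ColumnValuePlot,
        DataDriftTable,
    )

    # 1/ Point Evidently at the timestamp column so plots are organized by time.
    column_mapping = ColumnMapping(datetime=datetime_col)

    report = Report(
        metrics=[
            DataDriftTable(),
            # 3/ Extra per-column views for a column you want to inspect.
            ColumnDriftMetric(column_name=column_of_interest),
            ColumnValuePlot(column_name=column_of_interest),
            ColumnSummaryMetric(column_name=column_of_interest),
        ],
        # 2/ Show individual data points instead of binned values.
        options=RAW_DATA_OPTIONS,
    )
    report.run(
        reference_data=reference,
        current_data=current,
        column_mapping=column_mapping,
    )
    return report
```

With your own DataFrames, something like build_drift_report(ref_df, cur_df, "timestamp", "amount").save_html("drift_report.html") would write the Report to disk; the column names here are placeholders.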
Thanks for the first two suggestions, they helped me a lot!
I didn't quite understand the third one, though :(. I would really like to change the tooltip shown when I hover over a data point in the data drift plot or, if that's not possible, generate a new visualization. I don't know whether I should create a new method or whether there's a workaround (I only want to change a small thing). 😢