Unable to understand Evidently report plot
Dear Evidently team, I am using Evidently to compare 2 timeseries arrays to check if there is drifting over time. I use the following code snippet:
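from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

data_drift_report = Report(metrics=[
    DataDriftPreset(),
])
# Pass the current and reference dataframes for comparison
data_drift_report.run(current_data=df_july, reference_data=df_jan, column_mapping=None)
data_drift_report  # renders the report in a notebook cell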
Here df_july and df_jan are the current and reference dataframes, each with 7 columns (Sunday - Saturday) to compare. Each column is a time series. I get a nice report where each pair of columns is compared and KS p-values are obtained. Upon clicking on each column comparison, I noticed it gives the Data Drift and Data Distribution plots. In the Data Drift plot, there is the current data plot and a green band below it with a bold line. What do the green band and the bold green line indicate? Or is there documentation of the function where I can look into the details to understand the output plots? Thank you. I am attaching the snapshot for your kind reference:
Hi @ananda-duetto,
It appears you are using an earlier Evidently version - could you upgrade to the latest one? It will include additional explanations on the legend.
You can also check the description of the Data Drift Report in the docs https://docs.evidentlyai.com/presets/data-drift#4.-data-drift-by-feature
- The bold dark green line is the mean, as seen in the reference dataset.
- The green "band" area covers one standard deviation from the mean.
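To make the two bullets above concrete, here is a minimal sketch of how a reference mean line and a one-standard-deviation band could be computed for a single column. It is only an illustration in plain Python: the helper name reference_band and the sample numbers are hypothetical, and the use of the population standard deviation is an assumption, not a statement about what Evidently does internally.

```python
import statistics


def reference_band(reference_values):
    """Return (lower, mean, upper) for a band of mean +/- one standard deviation."""
    mean = statistics.mean(reference_values)
    # Assumption: population standard deviation; the library may use another estimator.
    std = statistics.pstdev(reference_values)
    return mean - std, mean, mean + std


# Hypothetical reference column values
lower, mean, upper = reference_band([10.0, 12.0, 14.0, 16.0, 18.0])
print(f"bold line at {mean:.2f}, band from {lower:.2f} to {upper:.2f}")
```

In this reading, the bold line sits at the reference mean and the band spans the interval between the lower and upper values. Both are computed from the reference data only, which is why they look the same across the whole plot.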
Hello @elenasamuylova,
Thanks very much for letting me know about the new version and for explaining the bold green line and the green band. My team and I would prefer using the old version because in it we can see the plot of the time series (how it progresses). It seems that in the new version the actual curve plot is gone and just the mean is plotted. In our understanding, plotting just the mean of the reference data does not add information; rather, it takes away the information about the nature or behavior of the curve.
I have a question on the old version plot. How can I get rid of the bold green line and green band in the old version? Is there a flag or parameter which I can use to do so? Thank you.
Hi @ananda-duetto,
1. Seeing the value plot with complete data
If you work with reasonably small datasets and want to keep all the raw data on the plot, you can also achieve this in the new version by passing the raw data parameter as an option. In this case, there will be no aggregation (it will look like the "old" plot - but the report will be large in size if you pass a large dataset). https://docs.evidentlyai.com/user-guide/customization/report-data-aggregation
report = Report(
    metrics=[
        DataDriftPreset(),
    ],
    options={"render": {"raw_data": True}}
)
report.run(reference_data=df_ref, current_data=df_cur)
report
2. Getting rid of the green line and green band.
In all Evidently versions, the reference dataset on this plot inside the Data Drift report has been represented by the green line / green band. There is no way to get rid of it, unless you create an entirely custom metric with your own visualization. https://docs.evidentlyai.com/user-guide/customization/add-custom-metric-or-test
Could you share a bit more about what you are trying to achieve? Do you want to see only the current dataset distribution? Evidently has multiple other metrics that include distribution visualization (such as DataQualityPreset, ColumnDistributionMetric, etc.) that you might find useful - that would show only one dataset if you prefer.
Hello @elenasamuylova,
Thanks very much for your prompt reply. Really appreciate it. Please see replies below:
1. That is right, we are not working with a huge dataset and would prefer seeing the curve rather than the overall mean or summary. Thank you for sharing the code snippet. If you scroll up to my first question in this conversation, you will notice that I had the same code snippet and it gave me the nice curve plot. I didn't use options={"render": {"raw_data": True}} because I wanted the line plot instead of the point plot or scatter plot. We liked the resulting plot very much but were curious what the bold green line and green band meant, and you answered that clearly. Thanks.
2. I see, and good to know. So the bold green line and green band stay in the plot. Yes, you are right: in the test we are checking for drift between two curves (reference and current), and in the visualization we wish to see both curves. Instead, we see the current curve, the bold green line and the green band. That is why I wanted to know what the green band and green line mean and whether we can get rid of them. You answered both questions. Thanks again.
3. This leads to my last two questions in this discussion. Can we compare the drift of two time series datasets as is, while providing weights for the points? For example, two time series along with a weight vector passed in the parameters. I checked the "Data drift parameters" section in the documentation but didn't find anything, so I wanted to check with you. Can you please let me know when possible? Thanks.
4. Last question: in the plot in my first question above, I do see the current curve being plotted (which is great). I also see a slight error band, confidence band or variation band along with the current curve. It does not appear throughout the current plot, only in parts of it. Can you please let me know what this band means? Thanks.
Thanks again for your help in understanding the plots and parameters.
Hi @ananda-duetto,
Question 1 and 4. Explaining the plot.
Copying the initial plot to clarify:
On this plot, the data IS aggregated. It shows the mean value of the feature binned into 150 bins. The slightly visible "pink band" shows 1 standard deviation of the value inside a given bin.
Basically, the only difference between the visualization on the screenshot and the default visualization you get with the current Evidently version is that the current one has a legend explaining the plot. The content of the visualization is the same, and it shows the mean value.
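As a rough illustration of the aggregation described above, the sketch below splits a series into consecutive bins and computes the mean and standard deviation inside each bin. The function name aggregate_into_bins is hypothetical, and both the equal-count binning in index order and the use of the population standard deviation are assumptions made for the example. The exact binning Evidently applies may differ.

```python
import statistics


def aggregate_into_bins(values, n_bins=150):
    """Split values (in index order) into at most n_bins consecutive chunks.

    Returns a list of (mean, std) tuples, one per bin.
    """
    n = len(values)
    n_bins = min(n_bins, n)  # never create more bins than data points
    result = []
    for i in range(n_bins):
        start = i * n // n_bins
        end = (i + 1) * n // n_bins
        chunk = values[start:end]
        # Assumption: population standard deviation within each bin
        result.append((statistics.mean(chunk), statistics.pstdev(chunk)))
    return result


# Hypothetical series with 300 points, so each of the 150 bins holds 2 points
values = [float(i) for i in range(300)]
bins = aggregate_into_bins(values)
```

Under these assumptions, the plotted curve would follow the per-bin means, and the faint band around it would be widest where the values inside a bin vary the most. With few points per bin, or nearly constant values, the band would shrink to almost nothing, which could explain why it is visible only in parts of the plot.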
If you want to see the raw data, it can only be achieved through the "raw_data": True option. It will appear as a scatter plot.
(Some backstory: This "raw_data": True version used to be the default in earlier Evidently versions until we added the aggregated visuals. Basically, the screenshot you posted initially refers to the interim version where the default visualization was "new" and aggregated, but the legend was "old" and partially referred to the version that showed the scatter plot.)
Q3. Comparing two time series.
I am not sure I correctly understand the type of visualization you want to add: could you explain how you'd expect the "weight" parameter to work? Maybe you have an example of the plot?
Here is what we have that might be related:
- The following visualization in the DataQualityPreset. It is also available as ColumnSummaryMetric for individual columns: https://docs.evidentlyai.com/presets/data-quality#how-it-looks. It requires the datetime index, and shows the mean value of a numerical feature over time for reference and current.
- The ColumnValuePlot metric (the default aggregated version):
- The ColumnValuePlot metric (with raw_data set as true):
In this case it is pretty hard to make anything of it due to the large number of data points.