evidently PandasDataset mutates DataDefinition

To expand on the title a bit: The init method of PandasDataset will intentionally expand the definition with columns that have not been defined. There is no API to stop this.

To start with, this is very unintuitive. I get auto-generating a definition if I don't pass anything. But if I bothered to create a definition, having it changed on the fly is extremely surprising.

This can cause bugs in production use though. If you're running a report with a reference and current dataset, both of those need to be converted into PandasDataset. The issue here is that now we need to guarantee they have the exact same set of columns. This is actually quite restrictive.

If you're doing a one-time report in a Jupyter notebook, then you're likely to have all your data downloaded in one-shot and can just do a train-test split and filter the columns. But if you're deploying this in production over a time-horizon, you'll have different processes generating that data. The usual "contract" for the data going in is that the features/target/id columns need to be present. But adding columns to the dataset is normal. For a concrete example, I have metadata columns (execution_timestamp) in my current_dataset that are not present in the reference. If I don't add the execution_timestamp to the definition, I'd expect it to be ignored in tests/metrics.

So here's where the issue arises. I create my PandasDatasets with the exact same DataDefinition and pass to a report. But one definition gets mutated and gets an extra column, the report crashes because it tries to run a test on a column that's not in the reference.

Feb 23 '25 11:02 emacha

@emacha Thanks for the issue. I see you've described your use case, but it's always helpful to provide a minimal example (code snippet) to reproduce the bug

Feb 24 '25 18:02 DimaAmega

Hi, @emacha If I understand correctly, you are dealing with 2 different behaviors in library:

If in DataDefinition you omit some of the column types (eg, categorical_columns) code in library will try to fill in missing columns. That is the reason why you see column execution_timestamp in current dataset.
If sets of columns in current and reference dataset does not match (eg, you have a column in current, but not in reference) it can lead to problems, because in some cases we need reference data for column to calculate.

And in your case, it seems, you need to explicitly set full DataDefinition for current and reference, so we wont add any additional columns in sets.

Feb 26 '25 12:02 Liraim

Hey @DimaAmega, I'll try to get a small example that shows the bug. I think it depends on column types on the definition having None and tests which run on all columns, so will need to put some effort.

@Liraim, I don't think we're understanding each other here. I'm saying that my reference and current datasets can potentially have different columns, intentionally. This shouldn't be an issue as long as the DataDefinition columns are present in both.

you need to explicitly set full DataDefinition for current and reference, so we wont add any additional columns in sets.

Adding extra columns is unintuitive. At minimum, it should be a parameter that's off by default. Unless we don't pass a definition at all of course.

Feb 26 '25 18:02 emacha