pandera

Joint uniqueness unsatisfiable for data synthesis

Open jerinv opened this issue 8 months ago • 0 comments

Describe the bug

Enforcing joint uniqueness in a DataFrameSchema is an important feature when validating a dataframe. It is available through the unique keyword of DataFrameSchema. Validating an existing dataframe against a schema with joint uniqueness enforced works without issue.

However, a problem arises when the same DataFrameSchema is used to create synthetic data through schema.example(). The underlying source code seems to do this by making each column independently unique, which is not the same as joint uniqueness: a column can contain repeated values on its own and still be jointly unique with other columns. As the example below shows, if columns are made independently unique, the maximum size that can be requested is the smallest number of available values in any one column. This is more restrictive than the schema requires and is not the expected behavior.
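The difference between the two notions of uniqueness can be illustrated without pandera at all. The sketch below is only an illustration, not pandera's implementation: the helper names `columns_independently_unique` and `rows_jointly_unique` are hypothetical, and the rows are the (Name, Year, Extract) values from the dataframe validated later in this issue.

```python
# Key columns (Name, Year, Extract) of the dataframe validated in this issue.
rows = [
    ("test1", 1992, "A"),
    ("test1", 1993, "B"),
    ("test2", 1993, "B"),
    ("test3", 1993, "C"),
]


def columns_independently_unique(rows):
    """True only if every column, taken alone, has no repeated value."""
    return all(len(set(col)) == len(col) for col in zip(*rows))


def rows_jointly_unique(rows):
    """True if no complete key tuple is repeated."""
    return len(set(rows)) == len(rows)


print(columns_independently_unique(rows))  # False: "test1" and 1993 repeat
print(rows_jointly_unique(rows))           # True: every key tuple is distinct

# Upper bounds on the number of rows, using only Year and Extract.
extract_values = ["A", "B", "C"]
years = range(1947, 3001)  # 1947..3000 inclusive

# Per-column uniqueness caps the row count at the smallest column domain.
max_rows_independent = min(len(extract_values), len(years))
# Joint uniqueness allows at least one row per (Year, Extract) combination.
max_rows_joint_lower_bound = len(extract_values) * len(years)

print(max_rows_independent)        # 3
print(max_rows_joint_lower_bound)  # 3162
```

The cap of 3 under per-column uniqueness matches the largest size for which schema.example() succeeds in this report, which is consistent with the suspected cause.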

An example is shown below.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

The following schema enforces joint uniqueness across the columns Name, Year, and Extract. Extract can take only 3 values.

Generating a dataframe of 3 rows (the number of values Extract can take) works. Requesting more than 3 rows fails with the error Unsatisfiable: Unable to satisfy assumptions of example_generating_inner_function.

import pandas as pd
from pandera import Check, Column, DataFrameSchema

schema = DataFrameSchema(
    {
        "Name": Column(
            str,
            [
                Check.str_matches("[A-Za-z0-9_]+$"),
                Check.str_length(min_value=1, max_value=25),
            ],
        ),
        "Year": Column(
            int,
            Check.in_range(min_value=1947, max_value=3000),
        ),
        "Extract": Column(str, Check.isin(["A", "B", "C"])),
        "Start": Column(int, Check.in_range(1, 1000)),
        "Length": Column(int, Check.in_range(0, 50)),
    },
    unique=["Name", "Year", "Extract"],
)

# This works
schema.example(size=3)

# This fails with Unsatisfiable, as does any size > 3
schema.example(size=4)

# Validation passes, showing that a valid dataframe of size 4 exists
schema.validate(
    pd.DataFrame(
        {
            "Name": ["test1", "test1", "test2", "test3"],
            "Year": [1992, 1993, 1993, 1993],
            "Extract": ["A", "B", "B", "C"],
            "Start": [1, 1, 1, 1],
            "Length": [3, 4, 5, 2],
        }
    )
)

Expected behavior

schema.example() should not raise an error and should produce a synthetic DataFrame of the requested size, as long as enough jointly unique combinations exist, with values that validate under the schema.
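Until this is resolved, one possible workaround is to generate the rows by hand and enforce uniqueness on the whole key tuple instead of on each column. The following is a minimal sketch under stated assumptions, not how pandera or Hypothesis generates data: `jointly_unique_rows`, `EXTRACT_VALUES`, and `NAME_ALPHABET` are hypothetical names, the value ranges are copied from the schema above, and duplicates are avoided by simple rejection sampling.

```python
import random
import string

# Domains copied from the schema in this issue.
EXTRACT_VALUES = ["A", "B", "C"]
NAME_ALPHABET = string.ascii_letters + string.digits + "_"


def jointly_unique_rows(size, seed=0):
    """Return `size` row dicts whose (Name, Year, Extract) keys are jointly unique."""
    rng = random.Random(seed)
    seen = set()
    rows = []
    while len(rows) < size:
        # Name: 1 to 25 characters drawn from [A-Za-z0-9_].
        name = "".join(
            rng.choice(NAME_ALPHABET) for _ in range(rng.randint(1, 25))
        )
        key = (name, rng.randint(1947, 3000), rng.choice(EXTRACT_VALUES))
        if key in seen:
            # Reject only a repeated key tuple; individual values may repeat.
            continue
        seen.add(key)
        rows.append(
            {
                "Name": key[0],
                "Year": key[1],
                "Extract": key[2],
                "Start": rng.randint(1, 1000),
                "Length": rng.randint(0, 50),
            }
        )
    return rows


rows = jointly_unique_rows(10)
print(len(rows))
```

The resulting list can be passed to pd.DataFrame(rows) and then checked with schema.validate(...). This loses the shrinking and reproducibility features of schema.example(), so it is a stopgap and not a replacement.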

Desktop (please complete the following information):

  • OS: Windows
  • Browser: Edge

jerinv • Nov 01 '23 20:11