pandas-streaming

Reproducibility in train_test_apart_stratify()

Open stephengmatthews opened this issue 1 year ago • 0 comments

train_test_apart_stratify() produces different results for the same input data, even when setting random_state=0.

To reproduce this, I've adapted the example from the function's docstring to contain only strings (i.e., the values for a are now str instead of int). Run this several times to see different results.

import pandas
from pandas_streaming.df import train_test_apart_stratify

df = pandas.DataFrame([dict(a="1", b="e"),
                       dict(a="1", b="f"),
                       dict(a="2", b="e"),
                       dict(a="2", b="f")])

train, test = train_test_apart_stratify(
    df, group="a", stratify="b", test_size=0.5, random_state=0)
print(train)
print('-----------')
print(test)

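The cause seems to be that the sets created at connex_split.py#L530 are iterated over at connex_split.py#L543, but a set is an unordered collection, so its iteration order is not guaranteed to be the same from one run to the next. This probably only shows up with str values because Python randomizes string hashes per process (unless PYTHONHASHSEED is set), whereas int hashes are stable. Replacing ids[k] with sorted(ids[k]) on L543 seems to fix this.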
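To illustrate the suspected mechanism without depending on pandas-streaming, here is a minimal sketch. It assumes the behaviour comes from string hash randomization and does not reproduce the library's code: the names CHILD and run_with_hash_seed are made up for this example. It runs the same snippet in fresh interpreters with different PYTHONHASHSEED values and compares the raw set order with the sorted order.

```python
import os
import subprocess
import sys

# Child program (illustrative only): print a set of strings in raw
# iteration order, then in sorted order.
CHILD = 'ids = {"1", "2", "3", "4"}; print(list(ids)); print(sorted(ids))'


def run_with_hash_seed(seed):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    ).stdout
    raw, ordered = out.strip().splitlines()
    return raw, ordered


results = [run_with_hash_seed(seed) for seed in range(5)]
raw_orders = {raw for raw, _ in results}
sorted_orders = {ordered for _, ordered in results}

# The raw order may differ between seeds; the sorted order never does.
print("distinct raw orders:   ", len(raw_orders))
print("distinct sorted orders:", len(sorted_orders))
```

If this is indeed the cause, running the reproduction script with a fixed PYTHONHASHSEED should also give stable output, but sorting before iterating is the more robust fix because it does not rely on the caller's environment.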

stephengmatthews • Aug 16 '24 11:08