Implement groups for train/test split

Open dreadatour opened this issue 11 months ago • 0 comments

In datachain.toolkit we have a train_test_split function for splitting a DataChain into multiple subsets. See docs here.

We need to be able to specify a groups spec and ensure that rows from the same group never end up in different splits. This is very important for real ML scenarios; a plain random split doesn't work in many cases. This might be a separate method or an optional parameter of the train_test_split function.
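To make the requirement concrete, here is a minimal sketch of the intended semantics in plain Python rather than SQL. The name group_split and its parameters (group_key, weights, seed) are hypothetical and not part of datachain.toolkit; the point is only that the random draw happens once per group instead of once per row.

```python
import random


def group_split(rows, group_key, weights, seed=None):
    """Split rows into len(weights) subsets, keeping every group in one subset.

    Hypothetical helper for illustration only, not a datachain API.
    """
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds of each split within [0, 1].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift

    group_to_split = {}
    splits = [[] for _ in weights]
    for row in rows:
        g = group_key(row)
        if g not in group_to_split:
            r = rng.random()  # one draw per group, not per row
            group_to_split[g] = next(i for i, b in enumerate(bounds) if r < b)
        splits[group_to_split[g]].append(row)
    return splits


# Example: 100 rows belonging to 10 patients, split roughly 70/30 by patient.
rows = [{"id": i, "patient": i % 10} for i in range(100)]
train, test = group_split(rows, lambda r: r["patient"], [0.7, 0.3], seed=0)
```

Note that the weights apply to groups, not rows, so the resulting row proportions are only approximate when group sizes differ.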

As an example, take a look at sklearn.model_selection.GroupShuffleSplit (here and here).

The datachain.toolkit.train_test_split method splits a chain into subsets using a raw SQL query. It would be really nice to implement groups the same way, via a SQL query (SQLite for CLI and ClickHouse for SaaS). At first glance, it might be possible to use a WITH statement to create a temporary table with the group_by column and a random value (sys__rand column), and then JOIN it with the dataset table to filter by the random value and the group_by column, all in a single query. But this needs to be checked.
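A rough sketch of the WITH + JOIN idea against SQLite, using Python's built-in sqlite3 module. Everything here is an assumption for illustration: the samples table, the patient_id group column, the 0..999 range for sys__rand, and the helper names are made up and do not reflect how datachain actually stores or computes sys__rand.

```python
import sqlite3

# Assumed schema: samples(id, patient_id); patient_id plays the role of the
# group_by column. The CTE draws one random value per group, and the JOIN
# propagates that value to every row of the group.
GROUP_SPLIT_SQL = """
WITH group_rand AS (
    -- one row per group, each with a single random draw in [0, 999]
    SELECT patient_id, abs(random() % 1000) AS sys__rand
    FROM samples
    GROUP BY patient_id
)
SELECT s.id,
       s.patient_id,
       CASE WHEN g.sys__rand < :train_cutoff THEN 0 ELSE 1 END AS split
FROM samples AS s
JOIN group_rand AS g ON g.patient_id = s.patient_id
ORDER BY s.id
"""


def build_demo_db(n_rows=200, n_groups=20):
    """Create an in-memory table with n_rows rows spread over n_groups groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, patient_id INTEGER)")
    conn.executemany(
        "INSERT INTO samples (id, patient_id) VALUES (?, ?)",
        [(i, i % n_groups) for i in range(n_rows)],
    )
    return conn


def split_by_group(conn, train_fraction=0.7):
    """Return (id, patient_id, split) rows; split is 0 for train, 1 for test."""
    cutoff = int(train_fraction * 1000)
    return conn.execute(GROUP_SPLIT_SQL, {"train_cutoff": cutoff}).fetchall()


conn = build_demo_db()
rows = split_by_group(conn)
```

This labels every row within one query, so the assignment is consistent inside that query. Running it again draws new random values, so producing train and test as separate queries would require persisting the per-group values or deriving them deterministically from the group key. The ClickHouse variant would need its own random function and should be checked separately.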

Feb 03 '25 10:02 dreadatour