
Write out data split information as a separate file, e.g. splits.csv, separate from preprocessed data.

Open justinxzhao opened this issue 2 years ago • 2 comments

At the moment, we don’t write the raw data splits to a separate file, i.e. (row #, split #).

This can be useful when the preprocessed data is too large to write to disk, yet a user still wants to inspect offline which rows of their dataset were used in which data subsets of their modeling run.

One potential location for such metadata would be in the existing training_set_metadata.json file, or perhaps a separate splits.csv file.
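To make the proposed (row #, split #) layout concrete, here is a minimal sketch using only the Python standard library. It is not Ludwig's implementation: the helper names `assign_splits` and `write_splits`, the column names `row` and `split`, the random assignment, and the 0/1/2 encoding for train/validation/test are all assumptions made for illustration.

```python
import csv
import random


def assign_splits(num_rows, probabilities=(0.7, 0.1, 0.2), seed=42):
    """Randomly assign each row a split id (hypothetical helper).

    Assumed encoding: 0 = train, 1 = validation, 2 = test.
    """
    rng = random.Random(seed)
    return [
        rng.choices((0, 1, 2), weights=probabilities)[0]
        for _ in range(num_rows)
    ]


def write_splits(splits, out):
    """Write one (row, split) pair per line to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["row", "split"])  # assumed header names
    for row_index, split in enumerate(splits):
        writer.writerow([row_index, split])


if __name__ == "__main__":
    # Write the split assignment for a 1000-row dataset to splits.csv.
    with open("splits.csv", "w", newline="") as f:
        write_splits(assign_splits(1000), f)
```

A file like this stays small even when the preprocessed data does not fit on disk, since it holds two integers per row rather than the features themselves.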

justinxzhao, Aug 12 '22 15:08

We do actually write this information when skip_saved_processed_inputs=False here. Note that this only applies when we are using a dataset from a file, as opposed to a dataframe. So perhaps it could be extended to support the latter.

tgaddair, Aug 12 '22 15:08

@tgaddair Ah, thanks for the catch! We should be sure to include this in our documentation.

justinxzhao, Aug 12 '22 16:08