ludwig
Write out data split information as a separate file, i.e. splits.csv, separate from preprocessed data.
At the moment, we don't write the raw data splits to a separate file, i.e. as (row #, split #) pairs.
This would be useful when the preprocessed data is too large to write to disk, yet a user still wants to inspect offline which rows of their dataset were used in which data subsets of their modeling run.
One potential location for such metadata would be in the existing training_set_metadata.json file, or perhaps a separate splits.csv file.
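To make the proposed (row #, split #) format concrete, here is a minimal sketch using pandas and NumPy. It is not Ludwig's implementation: the helper names `assign_splits` and `write_splits_csv`, the `row` and `split` column names, and the 0/1/2 encoding for train/validation/test are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def assign_splits(num_rows, probabilities=(0.7, 0.1, 0.2), seed=42):
    """Hypothetical helper: draw a random split index for every row.

    Assumed encoding: 0 = train, 1 = validation, 2 = test.
    """
    rng = np.random.default_rng(seed)
    return rng.choice(len(probabilities), size=num_rows, p=probabilities)


def write_splits_csv(splits, path):
    """Hypothetical helper: write one (row, split) pair per dataset row."""
    frame = pd.DataFrame({"row": np.arange(len(splits)), "split": splits})
    # index=False keeps the file to exactly two columns: row, split
    frame.to_csv(path, index=False)
    return frame


if __name__ == "__main__":
    splits = assign_splits(num_rows=10)
    write_splits_csv(splits, "splits.csv")
```

A file like this stays small even when the preprocessed data does not, since it holds two integers per row regardless of how many features the dataset has.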
We do actually write this information when skip_saved_processed_inputs=False here. Note that this only applies when we are using a dataset from a file, as opposed to a dataframe. So perhaps it could be extended to support the latter.
@tgaddair Ah, thanks for the catch! We should be sure to include this in our documentation.