NeMo-Curator

[FEA] Allow specifying fields when reading files with DocumentDataset.

Open miguelusque opened this issue 1 year ago • 0 comments

In some scenarios, a corpus file may contain columns that are not needed during the data curation step.

We could reduce the memory footprint by letting the user specify which columns should be loaded when invoking DocumentDataset.read_json or other similar methods.

Maybe something similar to the following snippet, where I have added the columns parameter.

# Load the dataset

dataset = DocumentDataset.read_json("./corpus", add_filename=True, input_meta={"file_name":str, "language": str}, columns=["file_name", "language"])
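To illustrate the idea outside of NeMo Curator, here is a minimal sketch in plain Python of what column selection at read time could look like for a JSON Lines file. The helper name `read_jsonl_columns` is hypothetical and is not part of DocumentDataset; it only shows that dropping unneeded keys while parsing keeps the unwanted fields from being retained in memory.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def read_jsonl_columns(path: str, columns: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per JSON line, keeping only the requested keys.

    Hypothetical helper for illustration; not a NeMo Curator API.
    Keys missing from a record are returned with the value None.
    """
    wanted = list(columns)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Skip blank lines instead of failing on them.
                continue
            record = json.loads(line)
            # Only the selected fields survive; the rest of the record
            # becomes garbage as soon as the next line is parsed.
            yield {key: record.get(key) for key in wanted}
```

Calling `list(read_jsonl_columns("corpus.jsonl", ["file_name", "language"]))` would then return records containing just those two fields. A real implementation would presumably push the column list down into whatever reader DocumentDataset uses internally.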

Hope it helps! Miguel

miguelusque, Aug 05 '24 12:08