NeMo-Curator [FEA] Allow specifying fields when reading files with DocumentDataset.

Open miguelusque opened this issue 1 year ago • 0 comments

In some scenarios, a corpus file may contain columns that are not needed during the data curation step.

We could reduce the memory footprint by letting the user specify which columns to load when calling DocumentDataset.read_json or other similar methods.

Maybe something similar to the following snippet, where I have added the columns parameter.

# Load the dataset
dataset = DocumentDataset.read_json("./corpus", add_filename=True, input_meta={"file_name": str, "language": str}, columns=["file_name", "language"])
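
To illustrate the intended behaviour independently of NeMo Curator, here is a minimal sketch in plain Python. The helper name read_jsonl_columns and the sample records are hypothetical and exist only for this example; this is not how DocumentDataset is implemented. It only shows the idea of discarding unneeded fields as each record is parsed, so they never reach the in-memory dataset.

```python
import json
from io import StringIO
from typing import Iterable, Iterator


def read_jsonl_columns(lines: Iterable[str], columns: list[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, keeping only the requested columns.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the requested fields; missing keys become None so
        # every row ends up with the same set of columns.
        yield {name: record.get(name) for name in columns}


# Made-up sample corpus with a bulky field we do not need for curation.
sample = StringIO(
    '{"id": 1, "text": "hello", "language": "en", "raw_html": "<p>hello</p>"}\n'
    '{"id": 2, "text": "hola", "language": "es", "raw_html": "<p>hola</p>"}\n'
)

rows = list(read_jsonl_columns(sample, columns=["text", "language"]))
print(rows)
```

Each line is still parsed in full here, so the saving is in what is retained afterwards, not in parsing cost. A columnar format such as Parquet could skip the unwanted columns at read time as well.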

Hope it helps! Miguel

Aug 05 '24 12:08 miguelusque
