NeMo-Curator
[FEA] Allow specifying fields when reading files with DocumentDataset
In some scenarios, a corpus file may contain columns that are not needed during the data curation step.
We could reduce the memory footprint by letting the user specify which columns to load when calling DocumentDataset.read_json or similar methods.
Maybe something like the following snippet, where I have added a columns parameter.
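from nemo_curator.datasets import DocumentDataset

# Load the dataset, keeping only the listed columns (columns is the proposed parameter)
dataset = DocumentDataset.read_json("./corpus", add_filename=True, input_meta={"file_name": str, "language": str}, columns=["file_name", "language"])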
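To make the intended behaviour concrete, here is a minimal standard-library sketch of what I mean by column selection on a JSON Lines file. It is not how NeMo Curator reads files, and `read_jsonl_columns` and `demo` are hypothetical names I made up for illustration. The only point is that fields outside `columns` are dropped as soon as each record is parsed, so they are never kept in memory.

```python
import json
import os
import tempfile


def read_jsonl_columns(path, columns=None):
    """Read a JSON Lines file, keeping only the requested fields.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if columns is not None:
                # Keep only the requested fields; missing ones become None.
                record = {name: record.get(name) for name in columns}
            records.append(record)
    return records


def demo():
    """Write a tiny two-record corpus and read back two of its three fields."""
    rows = [
        {"id": 1, "text": "a long document body", "language": "en"},
        {"id": 2, "text": "otro documento largo", "language": "es"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        return read_jsonl_columns(path, columns=["id", "language"])
```

Calling `demo()` returns the records without the `text` field, which is usually the largest one. A real implementation would presumably push the selection down into the underlying reader instead of filtering after parsing, but the user-facing semantics would be the same.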
Hope it helps! Miguel