NeMo-Curator
Write to file w/o including `filename` column
Is your feature request related to a problem? Please describe.
I’m working through these Classifier and Heuristic Quality Filtering docs, and I’m looking for an elegant way to write filtered documents back to file. If I follow the example in the docs, I read with `books = DocumentDataset.read_json(files, add_filename=True)` and ultimately write with `long_books.to_json("long_books/", write_to_filename=True)`. This gives me output files with the correct names, but every record now has a `filename` field, which I do not want.
If I instead omit `add_filename=True` and call `.to_json("long_books")`, I get the data I want, but in a file called `0.part`.
Describe the solution you'd like
I'd like to be able to write to a `.jsonl` file directly, either without creating the `filename` column or without including it in the output file.
Describe alternatives you've considered
df = long_books.to_pandas()
df.to_json('output.jsonl', orient='records', lines=True)
...which won't work for larger datasets or multiple files, since it pulls everything into a single in-memory pandas DataFrame and writes one output file.
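A possible interim workaround, sketched below with plain pandas only, is to keep the `filename` column while processing and drop it at write time, one partition at a time. The helper name `write_partition_without_filename` and its parameters are hypothetical and not part of NeMo-Curator. The sketch also assumes that all rows for a given source file live in the same partition; otherwise a later partition would overwrite the file written by an earlier one.

```python
import os

import pandas as pd


def write_partition_without_filename(
    df: pd.DataFrame,
    output_dir: str,
    filename_col: str = "filename",
) -> list:
    """Write one JSONL file per distinct value in `filename_col`, omitting that column.

    Hypothetical helper for illustration; not a NeMo-Curator API.
    Assumes every row of a given source file is in this same DataFrame.
    """
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name, group in df.groupby(filename_col):
        # Reuse only the base name so output lands inside output_dir.
        out_path = os.path.join(output_dir, os.path.basename(str(name)))
        # Drop the bookkeeping column just before writing.
        group.drop(columns=[filename_col]).to_json(
            out_path, orient="records", lines=True
        )
        written.append(out_path)
    return written
```

If the dataset exposes its underlying Dask DataFrame, the helper could be applied per partition, for example by wrapping it with `dask.delayed` over the partitions returned by `to_delayed()`, so that no single machine has to hold the full dataset in memory. Whether that fits NeMo-Curator's internals is an assumption on my part.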