NeMo-Curator

Write to file w/o including `filename` column

Open joshwyatt opened this issue 4 months ago • 0 comments

Is your feature request related to a problem? Please describe.

I'm working through the Classifier and Heuristic Quality Filtering docs and looking for an elegant way to write filtered documents back to file. If I follow the example in the docs, I read with `books = DocumentDataset.read_json(files, add_filename=True)` and ultimately write with `long_books.to_json("long_books/", write_to_filename=True)`. This gives me output files with the correct names, but every record now has a `filename` field, which I do not want.

If I instead omit `add_filename=True` and call `.to_json("long_books")`, I end up with the data I want, but in a file called `0.part`.

Describe the solution you'd like

I'd like to be able to write to a `.jsonl` file directly, either without creating the `filename` column or without including it in the output file.

Describe alternatives you've considered

```python
df = long_books.to_pandas()
df.to_json('output.jsonl', orient='records', lines=True)
```

...which won't work for larger datasets or multiple files.
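To illustrate the output I'm after, here is a minimal plain-Python sketch. It is not NeMo Curator's API: `write_jsonl_without_filename` is a hypothetical helper, and it assumes each record is a dict whose `filename` field holds the source file name. The field is used only to pick the output file and is dropped before the record is serialized.

```python
import json
import os


def write_jsonl_without_filename(records, output_dir, filename_field="filename"):
    """Stream records into per-source .jsonl files, omitting the filename field.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    os.makedirs(output_dir, exist_ok=True)
    handles = {}  # output path -> open file handle
    try:
        for record in records:
            # Keep only the base name so output stays inside output_dir.
            name = os.path.basename(record[filename_field])
            path = os.path.join(output_dir, name)
            if path not in handles:
                handles[path] = open(path, "w", encoding="utf-8")
            # Drop the routing field before serializing the record.
            row = {k: v for k, v in record.items() if k != filename_field}
            handles[path].write(json.dumps(row, ensure_ascii=False) + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Because records are processed one at a time, nothing has to fit in memory at once. The sketch does keep one open handle per output file, so a very large number of source files would need a different strategy.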

joshwyatt avatar Oct 10 '24 18:10 joshwyatt