
`lm_dataformat` is outdated

Open wjeliot opened this issue 3 years ago • 0 comments

Describe the bug

When I ran `tools/preprocess_data.py` to tokenize my dataset, the generated `.bin` and `.idx` files were empty. The cause turned out to be `lm_dataformat`, the library that reads the dataset for the tokenization logic: `requirements.txt` specifies version 0.0.19, which doesn't support uncompressed `.jsonl` files. If you pass in a raw jsonl file, it isn't read, and no error is raised either.

The README doesn't mention that jsonl has to be compressed to jsonl.zst with the zstd library, so this is likely to be a hurdle for anyone with a smaller dataset kept as plain jsonl.

Proposed solution

Upgrade `lm_dataformat` to version 0.0.20, which adds support for uncompressed jsonl. Additionally, it would be nice to raise a helpful error if nothing actually gets tokenized, since that would indicate that reading the dataset has failed.
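A minimal sketch of what such a check could look like, assuming the documents arrive as an iterable of strings. `tokenize_all`, `toy_tokenize`, and the `source` parameter are hypothetical names used for illustration; none of them come from `tools/preprocess_data.py` or `lm_dataformat`.

```python
from typing import Callable, Iterable, List


def tokenize_all(
    docs: Iterable[str],
    tokenize: Callable[[str], List[int]],
    source: str = "<input>",
) -> List[List[int]]:
    """Tokenize every document and fail loudly if the reader yielded nothing."""
    encoded = [tokenize(doc) for doc in docs]
    if not encoded:
        # Zero documents almost certainly means the input was never read,
        # for example because its format is not supported by the reader.
        raise RuntimeError(
            f"No documents were read from {source}. "
            "Check that the installed lm_dataformat version supports this "
            "file format (e.g. .jsonl.zst vs. raw .jsonl)."
        )
    return encoded


def toy_tokenize(text: str) -> List[int]:
    # Stand-in tokenizer (hypothetical): one integer per whitespace-separated word.
    return [len(word) for word in text.split()]
```

With a check like this, an empty reader would surface as an error at tokenization time instead of producing empty `.bin` and `.idx` files.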

wjeliot • Feb 13 '22 13:02