Handling multiple fields of custom input data in preprocess_data.py
Describe the bug
The preprocess_data.py script expects a "text" key in the JSON input regardless of the json-keys passed in the arguments. This is because lmd.Reader(fname).stream_data() hard-codes the "text" key when reading the JSON input.
To Reproduce
Steps to reproduce the behavior: run the script on a custom input file whose fields are named something other than "text".
Expected behavior
The preprocessing should extract the JSON elements named by the json-keys argument.
Proposed solution
Modify lm_dataformat to accept a parameter specifying which JSON key to read (see the sketch after the traceback below).
Error
File "tools/preprocess_data.py", line 193, in <module>
main()
File "tools/preprocess_data.py", line 163, in main
for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
File "tools/preprocess_data.py", line 143, in <genexpr>
encoded_docs = (encoder.encode(doc) for doc in fin)
File "tools/preprocess_data.py", line 120, in yield_from_files
yield from yielder(fname, semaphore)
File "tools/preprocess_data.py", line 113, in yielder
for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
yield from self._stream_data(get_meta)
File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
yield from self.read_jsonl(f, get_meta)
File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
text = ob['text']
KeyError: 'text'
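The failing line is the hard-coded text = ob['text'] in handle_jsonl. A minimal sketch of the proposed change, assuming the surrounding logic matches what the traceback shows (the key parameter name is illustrative, not the library's actual API):

```python
# Sketch of a key-aware handle_jsonl for lm_dataformat; `key` is an
# illustrative new parameter that defaults to the current behavior.
def handle_jsonl(jsonl_reader, get_meta, autojoin_paragraphs, para_joiner, key='text'):
    for ob in jsonl_reader:
        text = ob[key]  # previously: text = ob['text']
        # Join list-valued fields into a single string, as the existing
        # autojoin_paragraphs/para_joiner arguments suggest.
        if autojoin_paragraphs and isinstance(text, list):
            text = para_joiner.join(text)
        if get_meta:
            yield text, ob.get('meta', {})
        else:
            yield text
```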
lm_dataformat==0.0.20 gained the ability to use a specified key other than 'text'. Unfortunately, removing the strict requirement in requirements/requirements.txt will not be enough to solve this issue, as we are also constrained by lm_eval's requirement of lm_dataformat==0.0.19.
There should be nothing preventing GPT-NeoX or the evaluation harness from functioning with a forced install of lm_dataformat>=0.0.20; it just has not been verified for use with the evaluation harness. pytest and pybind11 are also known to be constrained in this fashion.
I propose the following plan of action to resolve the issue:
- [ ] Update the requirements of lm_eval
- [x] Update the requirements of GPT-NeoX (ready to merge)
- [ ] Add an argument to the preprocessing script to let the user specify their desired key (sketched below)
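A rough sketch of that last item, assuming a hypothetical --jsonl-key flag (the script's existing json-keys argument could serve the same purpose) and assuming lm_dataformat is patched as sketched further below:

```python
# Hypothetical change to tools/preprocess_data.py: expose the desired key on
# the command line and thread it into the reader.
import argparse
import lm_dataformat as lmd

parser = argparse.ArgumentParser()
parser.add_argument('--jsonl-key', default='text',
                    help='JSON key holding the text to tokenize')
args = parser.parse_args()

def yield_from_files(fnames, jsonl_key='text'):
    # Simplified from the real yield_from_files, which also throttles
    # iteration with a semaphore.
    for fname in fnames:
        # jsonl_key= assumes the patched stream_data signature shown below.
        yield from filter(lambda x: x, lmd.Reader(fname).stream_data(jsonl_key=jsonl_key))

docs = yield_from_files(['my_data.jsonl'], jsonl_key=args.jsonl_key)
```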
@EricHallahan you are right, lm_dataformat>=0.0.20 has jsonl_key as an argument to _stream_data. But we have to modify the stream_data and _stream_data_threaded functions to accept jsonl_key, since the preprocess_data script calls stream_data. I also have to introduce the operations within the for key in args.json_keys loop to save the *.bin and *.idx files; the next step is to test lm_eval.
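For reference, the passthrough could look roughly like this; the internal signatures are assumed from this thread rather than checked against the lm_dataformat source:

```python
# Inside lm_dataformat's Reader class: forward jsonl_key from the public
# stream_data() to the internals, which already accept it in >=0.0.20.
def stream_data(self, get_meta=False, threaded=False, jsonl_key='text'):
    if not threaded:
        yield from self._stream_data(get_meta, jsonl_key=jsonl_key)
        return
    yield from self._stream_data_threaded(get_meta, jsonl_key=jsonl_key)
```

And the per-key outputs might look something like this, modeled on the Megatron-style indexed_dataset helpers the script already uses (the file naming and make_builder signature are assumptions, not verified against the repo):

```python
from megatron.data import indexed_dataset  # import path assumed

# One .bin/.idx pair per requested key.
output_bin_files, output_idx_files, builders = {}, {}, {}
for key in args.json_keys:
    output_bin_files[key] = f"{args.output_prefix}_{key}_document.bin"
    output_idx_files[key] = f"{args.output_prefix}_{key}_document.idx"
    builders[key] = indexed_dataset.make_builder(output_bin_files[key],
                                                 impl=args.dataset_impl)

# ... tokenize each document, then append its ids under the matching key:
# builders[key].add_item(torch.IntTensor(doc_ids))

for key in args.json_keys:
    builders[key].finalize(output_idx_files[key])
```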
I have opened #456 to integrate the relaxed dependency versions.
I’ve opened a branch of the eval harness to make sure that lm_dataformat>=0.0.20 doesn’t break it.
@SamTube405 can you elaborate a bit on what your data looks like, and why your use case involves multiple keys?
@StellaAthena For example, we have multiple sections (e.g., abstracts, full texts) extracted from scientific articles that we store within the same jsonl file, and we plan to train on them in parallel.
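For instance, a single record might look like this (field names are illustrative):

```
{"abstract": "We propose ...", "full_text": "1. Introduction ..."}
```

Under the plan above, passing both keys would produce a separate *.bin/*.idx pair per section.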
@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.