
Handling multiple fields of custom input data in preprocess_data.py

Open sameeravithana opened this issue 4 years ago • 6 comments

Describe the bug The preprocess_data script expects the JSON input to contain a "text" field regardless of the JSON keys passed in the arguments. This is because lmd.Reader(fname).stream_data() hard-codes the "text" field when reading the JSON input.

To Reproduce Steps to reproduce the behavior: run the script on a custom input file whose fields are named something other than "text".

Expected behavior The preprocessing should extract JSON elements according to the specific JSON keys supplied by the user.

Proposed solution Modify lm_dataformat to accept a parameter specifying which JSON key to read.

Error

File "tools/preprocess_data.py", line 193, in <module>
    main()
  File "tools/preprocess_data.py", line 163, in main
    for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1):
  File "tools/preprocess_data.py", line 143, in <genexpr>
    encoded_docs = (encoder.encode(doc) for doc in fin)
  File "tools/preprocess_data.py", line 120, in yield_from_files
    yield from yielder(fname, semaphore)
  File "tools/preprocess_data.py", line 113, in yielder
    for f in filter(lambda x: x, lmd.Reader(fname).stream_data()):
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 116, in stream_data
    yield from self._stream_data(get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 149, in _stream_data
    yield from self.read_jsonl(f, get_meta)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 207, in read_jsonl
    yield from handle_jsonl(rdr, get_meta, autojoin_paragraphs, para_joiner)
  File "../miniconda3/envs/gpt_neox_grumpycat/lib/python3.8/site-packages/lm_dataformat/__init__.py", line 99, in handle_jsonl
    text = ob['text']
KeyError: 'text'

sameeravithana avatar Nov 05 '21 23:11 sameeravithana

lm_dataformat==0.0.20 gained the ability to use a specified key other than 'text'. Unfortunately, removing the strict pin in requirements/requirements.txt is not enough to solve this issue, as we are also constrained by lm_eval's requirement of lm_dataformat==0.0.19.

Nothing should prevent GPT-NeoX or the evaluation harness from functioning with a forced install of lm_dataformat>=0.0.20; it simply has not been verified for use with the evaluation harness. pytest and pybind11 are known to be constrained in the same fashion.

I propose the following plan of action to resolve the issue:

  • [ ] Update the requirements of lm_eval
  • [x] Update the requirements of GPT-NeoX (ready to merge)
  • [ ] Add an argument to the preprocessing script to let the user specify their desired key
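The last checklist item could be wired up along these lines. This is a minimal sketch, not the real parser in tools/preprocess_data.py (which defines many more arguments); the flag name --json-keys mirrors the args.json_keys discussed in this thread.

```python
import argparse

def build_parser():
    # Toy stand-in for the preprocessing script's argument parser.
    parser = argparse.ArgumentParser(description="toy preprocess_data arguments")
    parser.add_argument("--input", required=True,
                        help="path to the input jsonl file")
    parser.add_argument("--json-keys", nargs="+", default=["text"],
                        help="JSON keys to extract from each record")
    return parser

# Example invocation with two keys; argparse stores --json-keys as json_keys.
args = build_parser().parse_args(
    ["--input", "articles.jsonl", "--json-keys", "abstract", "full_text"])
```

The `nargs="+"` choice lets a user pass one or several keys while defaulting to the current "text" behavior.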

EricHallahan avatar Nov 06 '21 00:11 EricHallahan

@EricHallahan you are right: lm_dataformat>=0.0.20 accepts jsonl_key as an argument to _stream_data. But we also have to modify the stream_data and _stream_data_threaded functions to accept jsonl_key, since the preprocess_data script calls stream_data. I also have to introduce the operations within the for key in args.json_keys loop to save the *.bin and *.idx files; the next step is to test lm_eval.
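The per-key output files mentioned above could be named along these lines. This is an illustrative sketch assuming a Megatron-style "{prefix}_{key}_document" naming scheme; the helper output_prefixes and the base_prefix parameter are hypothetical, not code from the script.

```python
def output_prefixes(base_prefix, json_keys):
    """Derive one (.bin, .idx) pair per json key.

    Sketch only: assumes Megatron-style naming such as
    mydata_abstract_document.bin / .idx for key "abstract".
    """
    names = {}
    for key in json_keys:
        prefix = f"{base_prefix}_{key}_document"
        names[key] = (prefix + ".bin", prefix + ".idx")
    return names
```

Each key then gets its own indexed dataset, so the downstream training code can pick a section (e.g. abstracts only) by file prefix.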

sameeravithana avatar Nov 06 '21 01:11 sameeravithana

I have opened #456 to integrate the relaxed dependency versions.

EricHallahan avatar Nov 06 '21 02:11 EricHallahan

I’ve opened a branch of the eval harness to make sure that lm_dataformat>=0.0.20 doesn’t break it.

> I also have to introduce the operations within the loop of for key in args.json_keys to save the *.bin and *.idx files; next step is to test lm_eval.

@SamTube405 can you elaborate a bit on what your data looks like and why your use case involves multiple keys?

StellaAthena avatar Nov 06 '21 03:11 StellaAthena

@StellaAthena For example, we have multiple sections (e.g., abstract, full text) extracted from scientific articles and stored within the same jsonl file, and we plan to train on them in parallel.
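To make the use case concrete, here is a toy sketch of what such multi-section records might look like and how one corpus per key could be pulled out of them. The record shapes and key names ("abstract", "full_text") are illustrative assumptions, not the commenter's actual data.

```python
import json

# Illustrative records: each jsonl line carries several sections of one article.
records = [
    {"abstract": "Short summary.", "full_text": "Body of article one."},
    {"abstract": "Another summary.", "full_text": "Body of article two."},
]
lines = "\n".join(json.dumps(r) for r in records)

# Extract one corpus per key, as multi-key preprocessing would.
corpora = {
    key: [json.loads(line)[key] for line in lines.splitlines()]
    for key in ("abstract", "full_text")
}
```

Each entry of `corpora` would then feed its own tokenization pass and its own .bin/.idx pair.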

sameeravithana avatar Nov 06 '21 03:11 sameeravithana

@EricHallahan I have confirmed that nothing goes wrong if you use lm_dataformat>=0.0.20 in the Eval Harness, and opened a PR in that repo to update the requirements.

StellaAthena avatar Nov 10 '21 04:11 StellaAthena