lm-evaluation-harness
IFEval fails when multiple GPUs are used (for DDP)
When running IFEval, the library downloads NLTK tokenizers. This is a problem when multiple processes are used (e.g., for DDP inference), because each process performs the download. I think this leads to a race condition and causes the following error:
from lm_eval.tasks.ifeval import instructions_util
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/lm_eval/tasks/ifeval/instructions_util.py", line 47, in <module>
download_nltk_resources()
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/lm_eval/tasks/ifeval/instructions_util.py", line 44, in download_nltk_resources
nltk.download("punkt_tab")
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/nltk/downloader.py", line 774, in download
for msg in self.incr_download(info_or_id, download_dir, force):
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/nltk/downloader.py", line 642, in incr_download
yield from self._download_package(info, download_dir, force)
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/nltk/downloader.py", line 733, in _download_package
for msg in _unzip_iter(filepath, zipdir, verbose=False):
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/site-packages/nltk/downloader.py", line 2250, in _unzip_iter
zf.extractall(root)
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/zipfile.py", line 1642, in extractall
self._extract_member(zipinfo, path, pwd)
File "/opt/pyenv-root/versions/3.9.17/lib/python3.9/zipfile.py", line 1692, in _extract_member
os.mkdir(targetpath)
FileExistsError: [Errno 17] File exists: '/home/flyte/nltk_data/tokenizers/punkt_tab/russian'
I think:
- The NLTK tokenizer should not be downloaded merely because a module is imported.
- The download should be guarded when multiple processes are used (e.g., in a Distributed Data Parallel setting).
I used the main branch to reproduce this issue (commit: 8138fd52).
One workaround is to download the NLTK resources in a safe manner beforehand.
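A minimal sketch of that workaround, run once from a single process before launching the distributed job (the function names here are illustrative, not part of the harness):

```python
import os


def punkt_tab_present(nltk_data_dir):
    """Return True if the punkt_tab tokenizer data is already unpacked."""
    return os.path.isdir(os.path.join(nltk_data_dir, "tokenizers", "punkt_tab"))


def predownload(nltk_data_dir=os.path.expanduser("~/nltk_data")):
    # Run this once, in a single process, before launching the
    # multi-process evaluation, so workers never race on the download.
    if not punkt_tab_present(nltk_data_dir):
        import nltk

        nltk.download("punkt_tab", download_dir=nltk_data_dir)
```

Because the data lands in a shared nltk_data directory, all worker processes started afterwards find the resource already in place and skip the download entirely.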
Hi! Thanks for reporting the issue! The PR should handle this. I thought the simplest way would be to check the LOCAL_RANK environment variable, but I'm open to feedback if you have any alternative suggestions.
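Roughly, the LOCAL_RANK approach could look like this (a sketch of the idea only, not the actual PR code; the function name is illustrative):

```python
import os


def download_nltk_resources():
    # Under torchrun / accelerate launch, every worker process imports the
    # module, so only let the main process (LOCAL_RANK unset or "0")
    # perform the download; other ranks skip it to avoid racing on the
    # shared nltk_data directory.
    if int(os.environ.get("LOCAL_RANK", "0")) != 0:
        return
    import nltk

    nltk.download("punkt_tab")
```

In practice the non-zero ranks would also need to wait (e.g., via a distributed barrier) until rank 0 finishes, otherwise they may look up the resource before it exists.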
Thanks for the PR.
I gave a suggestion in the PR. I am not sure the NLTK tokenizers need to be downloaded at module import time; if possible, that should be refactored.
I also ran into the punkt_tab problem:
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/__main__.py", line 450, in <module>
[rank1]: cli_evaluate()
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/__main__.py", line 369, in cli_evaluate
[rank1]: results = evaluator.simple_evaluate(
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
[rank1]: return fn(*args, **kwargs)
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/evaluator.py", line 277, in simple_evaluate
[rank1]: results = evaluate(
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/utils.py", line 395, in _wrapper
[rank1]: return fn(*args, **kwargs)
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/evaluator.py", line 478, in evaluate
[rank1]: metrics = task.process_results(
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/api/task.py", line 1351, in process_results
[rank1]: return self.config.process_results(doc, results)
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/tasks/leaderboard/ifeval/utils.py", line 120, in process_results
[rank1]: out_strict = test_instruction_following_strict(inp, response)
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/tasks/leaderboard/ifeval/utils.py", line 43, in test_instruction_following_strict
[rank1]: if response.strip() and instruction.check_following(response):
[rank1]: File "/home/litan/leaderboard/lm-evaluation-harness/lm_eval/tasks/ifeval/instructions.py", line 1580, in check_following
[rank1]: words = instructions_util.nltk.word_tokenize(value)
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 142, in word_tokenize
[rank1]: sentences = [text] if preserve_line else sent_tokenize(text, language)
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
[rank1]: tokenizer = _get_punkt_tokenizer(language)
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
[rank1]: return PunktTokenizer(language)
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
[rank1]: self.load_lang(lang)
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
[rank1]: lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
[rank1]: File "/opt/conda/envs/harness/lib/python3.10/site-packages/nltk/data.py", line 579, in find
[rank1]: raise LookupError(resource_not_found)
[rank1]: LookupError:
[rank1]: **********************************************************************
[rank1]: Resource punkt_tab not found.
[rank1]: Please use the NLTK Downloader to obtain the resource:
[rank1]: >>> import nltk
[rank1]: >>> nltk.download('punkt_tab')
[rank1]:
[rank1]: For more information see: https://www.nltk.org/data.html
[rank1]: Attempted to load tokenizers/punkt_tab/english/
[rank1]: Searched in:
[rank1]: - '/home/litan/nltk_data'
[rank1]: - '/opt/conda/envs/harness/nltk_data'
[rank1]: - '/opt/conda/envs/harness/share/nltk_data'
[rank1]: - '/opt/conda/envs/harness/lib/nltk_data'
[rank1]: - '/usr/share/nltk_data'
[rank1]: - '/usr/local/share/nltk_data'
[rank1]: - '/usr/lib/nltk_data'
[rank1]: - '/usr/local/lib/nltk_data'
[rank1]: **********************************************************************
I have the data in my local folder, and I can even load the tokenizer locally:
>>> from nltk import PunktTokenizer
>>> PunktTokenizer("english")
<nltk.tokenize.punkt.PunktTokenizer object at 0x7f6170003a60>
But I got the above error when I ran it with:
accelerate launch -m lm_eval --model_args pretrained=<model>,dtype=bfloat16 --log_samples --output_path eval_results --tasks leaderboard --batch_size 4 --apply_chat_template --fewshot_as_multiturn
Is it related to a race condition?
Can we get an update on this (either merge the existing PR fixing this issue, or create a new one if needed)? Happy to work on it, but this issue is blocking multi-GPU evals for me.
#2267 should fix it. As a workaround, you could run python -c "import nltk; nltk.download('punkt')" in your local environment before running lm_eval, and this should handle the error for the time being.
In my case, it only failed the first time, while the dataset was downloading; it was fine after I reran it a couple of times. This can serve as a workaround.