several breakages due to recent `datasets`

Open stas00 opened this issue 1 year ago • 9 comments

It seems that datasets==2.16.0 and higher breaks evaluate

$ cat test-evaluate.py
import os

import torch.distributed as dist
from evaluate import load

# one process per GPU, launched via torch.distributed.run (see Problem 2 below)
dist.init_process_group("nccl")

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

# every rank loads the same distributed metric under a shared experiment_id
metric = load(
    "accuracy",
    experiment_id="test4",
    num_process=world_size,
    process_id=rank,
)
metric.add_batch(predictions=[], references=[])

Problem 1. umask isn't being respected when creating lock files

Since we are in a shared group setting, we use umask 000,

but this script creates files with missing permissions:

-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

which is wrong, since umask 000 should have led to:

-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

The problem applies to all the other lock files created during such a run as well - there are a few more .lock files in that directory.

This is the same issue that was reported and dealt with multiple times in datasets.

If I downgrade to datasets==2.15.0, the files are created correctly with:

-rw-rw-rw- 
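For reference, a minimal standalone way to see what permissions a freshly created lock file gets under umask 000 (a sketch independent of evaluate; the lock path is made up):

import os
import stat
from filelock import FileLock

os.umask(0o000)  # the setting we use on our shared machines

lock_path = "/tmp/umask-test.lock"  # hypothetical path, just for this check
with FileLock(lock_path):
    # a regular file created under umask 000 would get 0o666 (rw-rw-rw-);
    # print what the lock file actually gets
    print(oct(stat.S_IMODE(os.stat(lock_path).st_mode)))

The -rw-r--r-- entries above correspond to 0o644, so the umask is being overridden somewhere in the lock-creation path.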

Problem 2. Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

The files are there:

-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
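Side note on the error wording: the check presumably probes whether each lock is currently held, not merely whether the file exists. A standalone sketch of that kind of probe (hypothetical names, not evaluate's actual code):

from filelock import FileLock, Timeout

def appears_locked(path: str, probe_timeout: float = 0.05) -> bool:
    """Return True if some other process currently holds the lock at `path`."""
    probe = FileLock(path)
    try:
        probe.acquire(timeout=probe_timeout)
    except Timeout:
        return True   # could not grab it -> another process holds the lock
    else:
        probe.release()
        return False  # we grabbed it ourselves -> nobody was holding it

That would explain how the files can be present on disk while the check still reports them as missing.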

If I downgrade to datasets==2.15.0, the above code starts to work.

In short: datasets<2.16 works, datasets>=2.16 breaks.

Using evaluate==0.4.1
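Until there's a fix, the workaround on our side is to pin the last known-good datasets release from the bisection above:

pip install "evaluate==0.4.1" "datasets==2.15.0"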

Thank you!

@lhoestq

@williamberrios who reported this

stas00 avatar Jan 29 '24 20:01 stas00

@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.

stas00 avatar Jan 29 '24 22:01 stas00

It seems to be an issue with recent versions of filelock? I was able to reproduce it with the latest version, 3.13.1.

Can you try using an older version? E.g. I use 3.9.0, which seems to work fine:

pip install "filelock==3.9.0"

lhoestq avatar Jan 30 '24 11:01 lhoestq

I just opened https://github.com/huggingface/datasets/pull/6631 in datasets to fix this.

Can you try it out? Once I have your green light I can make a new release.
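(If it helps, something like this should install the PR branch for testing - a sketch, assuming pip can fetch GitHub pull-request refs:)

pip install "git+https://github.com/huggingface/datasets@refs/pull/6631/head"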

lhoestq avatar Jan 30 '24 12:01 lhoestq

Thanks a lot, @lhoestq!

@williamberrios - could you please test this ASAP? If everything starts working, they can make a new release - thank you!

stas00 avatar Jan 31 '24 01:01 stas00

Hi @lhoestq, filelock==3.9.0 fixed my issue with distributed evaluation. Thanks a lot ❤️

williamberrios avatar Feb 02 '24 15:02 williamberrios

Thank you for confirming it solved your problem, William!

stas00 avatar Feb 02 '24 18:02 stas00

Problem 2 is affecting me too. Downgrading fixed it, but it frustrates me that I have to downgrade filelock on every machine I want to use multi-node evaluate on; is there another workaround? Can we get this fixed, @stas00?

jxmorris12 avatar Mar 07 '24 21:03 jxmorris12

Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.

stas00 avatar Mar 07 '24 21:03 stas00

sorry :)

jxmorris12 avatar Mar 07 '24 22:03 jxmorris12

@lhoestq, is it possible to make a new release now that this issue has been fixed? Thank you!

stas00 avatar Apr 29 '24 22:04 stas00

just released 0.4.2 :)

lhoestq avatar Apr 30 '24 09:04 lhoestq

Thank you very much, Quentin!

stas00 avatar Apr 30 '24 19:04 stas00

Unfortunately, I'm facing the same error with the latest versions of evaluate (0.4.2), datasets (2.20.2), and filelock (3.15.4). Downgrading datasets/filelock also doesn't seem to fix the issue for me, despite the lock files being present in the cache_dir. Any suggestions for troubleshooting this error?

raghavm1 avatar Jul 09 '24 20:07 raghavm1

Hi, Did it end up working for you? I'm facing this issue now

yaraaa7 avatar Aug 25 '24 16:08 yaraaa7

> Hi, Did it end up working for you? I'm facing this issue now

Unfortunately not. I'm also noticing that evaluate is no longer being actively maintained. I'm not sure, but it might be useful to raise this issue in the accelerate repo to find out what the next steps on this bug could be. For now, I'm sticking to a single-node setup, where this issue doesn't occur.

raghavm1 avatar Aug 28 '24 16:08 raghavm1