Several breakages due to recent `datasets` releases

It seems that `datasets==2.16.0` and higher break `evaluate`. Repro:
```
$ cat test-evaluate.py
from evaluate import load
import os
import torch.distributed as dist

dist.init_process_group("nccl")

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()
metric = load("accuracy",
              experiment_id="test4",
              num_process=world_size,
              process_id=rank)
metric.add_batch(predictions=[], references=[])
```
**Problem 1: `umask` isn't being respected when creating lock files**

As we are in a group setting we use `umask 000`, but this script creates files with missing permissions:

```
-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```

which is invalid, since `umask 000` should have led to:

```
-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock
```

The problem applies to all the other locks created during such a run - there are a few more `.lock` files there. This is the same issue that was reported and dealt with multiple times in `datasets`.

If I downgrade to `datasets==2.15.0` the files are created correctly with `-rw-rw-rw-`.
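(For reference, a minimal sketch of what umask-honoring lock creation could look like, assuming the locks come from the `filelock` package, whose recent versions accept a `mode` argument; this is an illustration of the idea, not the actual upstream fix:)

```python
import os
from filelock import FileLock

# os.umask sets a new mask and returns the previous one,
# so set a throwaway mask and immediately restore the original
umask = os.umask(0)
os.umask(umask)

# derive the lock-file mode from the current umask instead of relying
# on filelock's fixed default; with umask 000 this yields 0o666,
# i.e. -rw-rw-rw-
lock = FileLock("test4-2-rdv.lock", mode=0o666 & ~umask)
with lock:
    pass  # the lock file is created with the umask-derived permissions
```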
**Problem 2: `Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist`**
```
$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000 --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.
```
The files are there:

```
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
```
If I downgrade to `datasets==2.15.0` the above code starts to work. In short: `datasets<2.16` works, `datasets>=2.16` breaks.

Using `evaluate==0.4.1`.
Thank you!
@lhoestq
cc @williamberrios, who reported this
@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.
It seems to be an issue with recent versions of `filelock`; I was able to reproduce it using the latest version, 3.13.1.

Can you try using an older version? E.g. I use 3.9.0, which seems to work fine:

```
pip install "filelock==3.9.0"
```
I just opened https://github.com/huggingface/datasets/pull/6631 in `datasets` to fix this.

Can you try it out? Once I have your green light I can make a new release.
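(For testing, the PR branch can typically be installed directly with `pip install "datasets @ git+https://github.com/huggingface/datasets.git@refs/pull/6631/head"`, assuming a standard pip/git setup.)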
thanks a lot, @lhoestq
@williamberrios - could you please test this ASAP? If everything works, they can make a new release - thank you!
Hi @lhoestq, filelock==3.9.0 fixed my issue with distributed evaluation. Thanks a lot ❤️
Thank you for confirming it solved your problem, William!
Problem 2 is affecting me too. Downgrading fixed it, but it's frustrating that I have to downgrade `filelock` on every machine I want to use multi-node `evaluate` on. Is there another workaround? Can we get this fixed, @stas00?
Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.
sorry :)
@lhoestq, is it possible to make a new release now that this issue has been fixed? Thank you!
just released 0.4.2 :)
Thank you very much, Quentin!
Unfortunately, I'm facing the same error with the latest versions of `evaluate` (0.4.2), `datasets` (2.20.2) and `filelock` (3.15.4). Downgrading `datasets`/`filelock` also doesn't seem to fix the issue for me, in spite of the lock files being present in the `cache_dir`. Any suggestions for troubleshooting this error?
Hi, did it end up working for you? I'm facing this issue now.

> Hi, did it end up working for you? I'm facing this issue now.

Unfortunately not.
I'm also noticing that `evaluate` is no longer actively maintained. I'm not sure, but it might be useful to raise this issue in the `accelerate` repo to learn what the next steps on this bug could be.

For now, I'm sticking to a single-node setup, where this issue doesn't occur.
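(For anyone stuck on the same error, a minimal sketch of one way to sidestep `evaluate`'s file-based multi-process rendezvous entirely: gather predictions and references across ranks with `torch.distributed` and compute the metric in a single process. The helper name `compute_on_rank0` is made up for illustration, and this assumes a process group is already initialized:)

```python
import torch.distributed as dist
from evaluate import load

def compute_on_rank0(predictions, references):
    """Hypothetical helper: avoid evaluate's lock-file rendezvous by
    gathering inputs across ranks and computing the metric on rank 0 only."""
    world_size = dist.get_world_size()

    # all_gather_object works with the gloo and nccl backends
    # (with nccl, each rank must have set its CUDA device)
    gathered_preds = [None] * world_size
    gathered_refs = [None] * world_size
    dist.all_gather_object(gathered_preds, predictions)
    dist.all_gather_object(gathered_refs, references)

    if dist.get_rank() == 0:
        # single-process load(): no num_process/process_id, so no
        # cross-process rendezvous on lock files
        metric = load("accuracy")
        flat_preds = [p for chunk in gathered_preds for p in chunk]
        flat_refs = [r for chunk in gathered_refs for r in chunk]
        return metric.compute(predictions=flat_preds, references=flat_refs)
    return None
```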