several breakages due to recent `datasets`

Open stas00 opened this issue 1 year ago • 9 comments

It seems that datasets==2.16.0 and higher breaks evaluate

$ cat test-evaluate.py
import os

import torch.distributed as dist
from evaluate import load

# one process per GPU, launched via torch.distributed.run (see Problem 2 below)
dist.init_process_group("nccl")

rank = int(os.environ.get("LOCAL_RANK", 0))
world_size = dist.get_world_size()

# every rank loads the same distributed metric under a shared experiment_id
metric = load(
    "accuracy",
    experiment_id="test4",
    num_process=world_size,
    process_id=rank,
)
metric.add_batch(predictions=[], references=[])

Problem 1. umask isn't being respected when creating lock files

Since we are in a shared group setting, we use umask 000,

but this script creates files with missing permissions:

-rw-r--r-- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

which is wrong, since umask 000 should have led to:

-rw-rw-rw- 1 [...]/metrics/accuracy/default/test4-2-rdv.lock

The problem applies to all the other lock files created during such a run as well - there are a few more .lock files in that directory.

This is the same issue that was reported and dealt with multiple times in datasets.

If I downgrade to datasets==2.15.0, the files are created correctly with:

-rw-rw-rw- 
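For reference, a minimal standalone way to see what permissions a freshly created lock file gets under umask 000 (a sketch independent of evaluate; the lock path is made up):

import os
import stat
from filelock import FileLock

os.umask(0o000)  # the setting we use on our shared machines

lock_path = "/tmp/umask-test.lock"  # hypothetical path, just for this check
with FileLock(lock_path):
    # a regular file created under umask 000 would get 0o666 (rw-rw-rw-);
    # print what the lock file actually gets
    print(oct(stat.S_IMODE(os.stat(lock_path).st_mode)))

The -rw-r--r-- entries above correspond to 0o644, so the umask is being overridden somewhere in the lock-creation path.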

Problem 2. Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

$ python -u -m torch.distributed.run --nproc_per_node=2 --rdzv_endpoint localhost:6000  --rdzv_backend c10d test-evaluate.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Using the latest cached version of the module from /data/huggingface/modules/evaluate_modules/metrics/evaluate-metric--accuracy/f887c0aab52c2d38e1f8a215681126379eca617f96c447638f751434e8e65b14 (last modified on Mon Jan 29 18:42:31 2024) since it couldn't be found locally at evaluate-metric--accuracy, or remotely on the Hugging Face Hub.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 656, in _init_writer
    self._check_all_processes_locks()  # wait for everyone to be ready
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 350, in _check_all_processes_locks
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 0 but it doesn't exist.
Traceback (most recent call last):
  File "/home/stas/test/test-evaluate.py", line 14, in <module>
    metric.add_batch(predictions=[], references=[])
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 510, in add_batch
    self._init_writer()
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 659, in _init_writer
    self._check_rendez_vous()  # wait for master to be ready and to let everyone go
  File "/env/lib/conda/evaluate-test/lib/python3.9/site-packages/evaluate/module.py", line 362, in _check_rendez_vous
    raise ValueError(
ValueError: Expected to find locked file /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock from process 1 but it doesn't exist.

The files are there:

-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:15 /data/huggingface/metrics/accuracy/default/test4-2-0.arrow.lock
-rw-rw-rw- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-1.arrow.lock
-rw-r--r-- 1 stas stas 0 Jan 29 22:14 /data/huggingface/metrics/accuracy/default/test4-2-rdv.lock
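Side note on the error wording: the check presumably probes whether each lock is currently held, not merely whether the file exists. A standalone sketch of that kind of probe (hypothetical names, not evaluate's actual code):

from filelock import FileLock, Timeout

def appears_locked(path: str, probe_timeout: float = 0.05) -> bool:
    """Return True if some other process currently holds the lock at `path`."""
    probe = FileLock(path)
    try:
        probe.acquire(timeout=probe_timeout)
    except Timeout:
        return True   # could not grab it -> another process holds the lock
    else:
        probe.release()
        return False  # we grabbed it ourselves -> nobody was holding it

That would explain how the files can be present on disk while the check still reports them as missing.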

If I downgrade to datasets==2.15.0, the above code starts to work.

In short: datasets<2.16 works, datasets>=2.16 breaks.

Using evaluate==0.4.1
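Until there's a fix, the workaround on our side is to pin the last known-good datasets release from the bisection above:

pip install "evaluate==0.4.1" "datasets==2.15.0"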

Thank you!

@lhoestq

@williamberrios who reported this

stas00 avatar Jan 29 '24 20:01 stas00

@lhoestq, I updated the OP and was able to bisect which package and version led to the breakage.

stas00 avatar Jan 29 '24 22:01 stas00

It seems to be an issue with recent versions of filelock? I was able to reproduce it with the latest version, 3.13.1.

Can you try using an older version? E.g. I use 3.9.0, which seems to work fine:

pip install "filelock==3.9.0"

lhoestq avatar Jan 30 '24 11:01 lhoestq

I just opened https://github.com/huggingface/datasets/pull/6631 in datasets to fix this.

Can you try it out? Once I have your green light I can make a new release.
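(If it helps, something like this should install the PR branch for testing - a sketch, assuming pip can fetch GitHub pull-request refs:)

pip install "git+https://github.com/huggingface/datasets@refs/pull/6631/head"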

lhoestq avatar Jan 30 '24 12:01 lhoestq

Thanks a lot, @lhoestq!

@williamberrios - could you please test this ASAP? If everything starts working, they can make a new release - thank you!

stas00 avatar Jan 31 '24 01:01 stas00

Hi @lhoestq, filelock==3.9.0 fixed my issue with distributed evaluation. Thanks a lot ❤️

williamberrios avatar Feb 02 '24 15:02 williamberrios

Thank you for confirming it solved your problem, William!

stas00 avatar Feb 02 '24 18:02 stas00

Problem 2 is affecting me too. Downgrading fixed it, but it frustrates me that I have to downgrade filelock on every machine I want to use multi-node evaluate on; is there another workaround? Can we get this fixed, @stas00?

jxmorris12 avatar Mar 07 '24 21:03 jxmorris12

Not sure why you've tagged me, Jack ;) I have just reported the problem on behalf of my colleague.

stas00 avatar Mar 07 '24 21:03 stas00

sorry :)

jxmorris12 avatar Mar 07 '24 22:03 jxmorris12

@lhoestq, is it possible to make a new release now that this issue has been fixed? Thank you!

stas00 avatar Apr 29 '24 22:04 stas00

just released 0.4.2 :)

lhoestq avatar Apr 30 '24 09:04 lhoestq

Thank you very much, Quentin!

stas00 avatar Apr 30 '24 19:04 stas00

Unfortunately, I'm facing the same error with the latest versions of evaluate (0.4.2), datasets (2.20.2), and filelock (3.15.4). Downgrading datasets/filelock also doesn't seem to fix the issue for me, despite the lock files being present in the cache_dir. Any suggestions for troubleshooting this error?

raghavm1 avatar Jul 09 '24 20:07 raghavm1

Hi, Did it end up working for you? I'm facing this issue now

yaraaa7 avatar Aug 25 '24 16:08 yaraaa7

> Hi, Did it end up working for you? I'm facing this issue now

Unfortunately not. I'm also noticing that evaluate is no longer being actively maintained. I'm not sure, but it might be useful to raise this issue in the accelerate repo to find out what the next steps on this bug could be. For now, I'm sticking to a single-node setup, where this issue doesn't occur.

raghavm1 avatar Aug 28 '24 16:08 raghavm1