[Metrics] ValueError: Expected to find locked file from process x but it doesn't exist.
When calling .compute() in a distributed multi-node setting, I get the following error:
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2750, in _evaluate
[rank1]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3641, in evaluate
[rank1]: output = eval_loop(
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3924, in evaluation_loop
[rank1]: metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 486, in <lambda>
[rank1]: compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer, accelerator),
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 119, in compute_metrics
[rank1]: return metric.compute(predictions=decoded_preds, references=references)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 455, in compute
[rank1]: self.add_batch(**inputs)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 515, in add_batch
[rank1]: self._init_writer()
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 664, in _init_writer
[rank1]: self._check_rendez_vous() # wait for master to be ready and to let everyone go
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 377, in _check_rendez_vous
[rank1]: raise ValueError(f"Couldn't acquire lock on {lock_file_name} from process {self.process_id}.") from None
[rank1]: ValueError: Couldn't acquire lock on /scratch/rm6418/gemt5_cache/sacrebleu/default/gemt5_exp1-12-rdv.lock from process 1.
I've looked at https://github.com/huggingface/evaluate/issues/481 and https://github.com/huggingface/evaluate/issues/542, but the issue still occurs on the latest released versions.
All metrics are loaded with the same experiment_id and with the correct num_process argument. All of the lock files are present in the cache directory.
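For context, each rank loads the metric roughly like this (a minimal sketch, not my exact script; reading the rank and world size from the RANK/WORLD_SIZE environment variables is an assumption about my launcher):

import os
import evaluate

# Every rank uses the same experiment_id and a cache_dir on a shared
# filesystem, and passes its own process_id plus the total num_process so
# the ranks can rendezvous through the lock files.
metric = evaluate.load(
    "sacrebleu",
    experiment_id="gemt5_exp1",
    cache_dir="/scratch/rm6418/gemt5_cache",
    process_id=int(os.environ["RANK"]),        # this rank's index (assumed env var)
    num_process=int(os.environ["WORLD_SIZE"]),  # total number of ranks (assumed env var)
)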
Environment
evaluate - 0.4.2
accelerate - 0.31.0
datasets - 2.20.0
transformers - 4.42.3
Any suggestions appreciated!
I ended up fixing it by computing metrics only on the main process. I used accelerator.gather_for_metrics() and then the following:
if accelerator.is_main_process:
    metric.compute()
This was the only workaround I could find until it gets fixed upstream.
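For anyone hitting the same thing, this is roughly the shape of the workaround (a minimal sketch, not my exact code; it assumes predictions and labels have already been gathered across ranks, either by the Trainer's evaluation loop or by an explicit accelerator.gather_for_metrics() call, and the empty-dict return on non-main ranks is my own choice):

import numpy as np
import evaluate

# Loaded without num_process/process_id, i.e. in plain single-process mode,
# because only rank 0 ever calls .compute() now.
metric = evaluate.load("sacrebleu")

def compute_metrics(eval_pred, tokenizer, accelerator):
    preds, labels = eval_pred
    # Replace the -100 padding used for the loss before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    references = [[label] for label in decoded_labels]

    # Only the main process touches the evaluate module, so the lock-file
    # rendezvous between ranks never happens and the error goes away.
    if accelerator.is_main_process:
        return metric.compute(predictions=decoded_preds, references=references)
    return {}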