[Metrics] ValueError: Expected to find locked file from process x but it doesn't exist.
When calling .compute() in a distributed multi-node setting, I get the following error:
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2750, in _evaluate
[rank1]: metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer_seq2seq.py", line 180, in evaluate
[rank1]: return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3641, in evaluate
[rank1]: output = eval_loop(
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3924, in evaluation_loop
[rank1]: metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels))
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 486, in <lambda>
[rank1]: compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer, accelerator),
[rank1]: File "/home/rm6418/projects/GEMT5/gemt5-mul.py", line 119, in compute_metrics
[rank1]: return metric.compute(predictions=decoded_preds, references=references)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 455, in compute
[rank1]: self.add_batch(**inputs)
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 515, in add_batch
[rank1]: self._init_writer()
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 664, in _init_writer
[rank1]: self._check_rendez_vous() # wait for master to be ready and to let everyone go
[rank1]: File "/ext3/miniconda3/envs/venv/lib/python3.8/site-packages/evaluate/module.py", line 377, in _check_rendez_vous
[rank1]: raise ValueError(f"Couldn't acquire lock on {lock_file_name} from process {self.process_id}.") from None
[rank1]: ValueError: Couldn't acquire lock on /scratch/rm6418/gemt5_cache/sacrebleu/default/gemt5_exp1-12-rdv.lock from process 1.
I've looked at https://github.com/huggingface/evaluate/issues/481 and https://github.com/huggingface/evaluate/issues/542, but the issue still occurs on the latest released versions.
All metrics are loaded with the same experiment_id and with the correct num_process argument. All of the lock files are present in the cache directory.
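For context, each rank loads the metric roughly like this (a minimal sketch, not my exact script; reading the rank and world size from the RANK/WORLD_SIZE environment variables is an assumption about my launcher):

import os
import evaluate

# Every rank uses the same experiment_id and a cache_dir on a shared
# filesystem, and passes its own process_id plus the total num_process so
# the ranks can rendezvous through the lock files.
metric = evaluate.load(
    "sacrebleu",
    experiment_id="gemt5_exp1",
    cache_dir="/scratch/rm6418/gemt5_cache",
    process_id=int(os.environ["RANK"]),        # this rank's index (assumed env var)
    num_process=int(os.environ["WORLD_SIZE"]),  # total number of ranks (assumed env var)
)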
Environment
evaluate - 0.4.2
accelerate - 0.31.0
datasets - 2.20.0
transformers - 4.42.3
Any suggestions appreciated!
I ended up fixing it by computing metrics only on the main process. I used accelerator.gather_for_metrics() and then the following:
if accelerator.is_main_process:
    metric.compute()
This was the only workaround I could find until it gets fixed upstream.
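For anyone hitting the same thing, this is roughly the shape of the workaround (a minimal sketch, not my exact code; it assumes predictions and labels have already been gathered across ranks, either by the Trainer's evaluation loop or by an explicit accelerator.gather_for_metrics() call, and the empty-dict return on non-main ranks is my own choice):

import numpy as np
import evaluate

# Loaded without num_process/process_id, i.e. in plain single-process mode,
# because only rank 0 ever calls .compute() now.
metric = evaluate.load("sacrebleu")

def compute_metrics(eval_pred, tokenizer, accelerator):
    preds, labels = eval_pred
    # Replace the -100 padding used for the loss before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    references = [[label] for label in decoded_labels]

    # Only the main process touches the evaluate module, so the lock-file
    # rendezvous between ranks never happens and the error goes away.
    if accelerator.is_main_process:
        return metric.compute(predictions=decoded_preds, references=references)
    return {}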