rf-detr
multiprocessing.context.AuthenticationError: digest received was wrong
Search before asking
- [x] I have searched the RF-DETR issues and found no similar bug report.
Bug
I am trying to fine-tune the Medium model on my dataset, which contains a single class. This is my training script:
from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="nov11",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-5,
    num_workers=1,
    output_dir="rfdet_nov11",
    resolution=1232,
    device='cuda',
    wandb=True,
    project="ball_det",
    early_stopping=True,
    early_stopping_patience=10
)
Training fails with this error before the first epoch finishes:
...
.0049) loss_giou_2_unscaled: 0.1385 (0.1908) cardinality_error_2_unscaled: 0.7500 (1.7778) loss_ce_enc_unscaled: 0.5104 (0.6461) loss_bbox_enc_unscaled: 0.0036 (0.0056) loss_giou_enc_unscaled: 0.1650 (0.2092) cardinality_error_enc_unscaled: 0.5000 (0.6321) time: 0.6672 data: 0.0072 max mem: 10285
Traceback (most recent call last):
File "/home/quidich/Documents/train_rf.py", line 5, in <module>
model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
self.train_from_config(config, **kwargs)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
self.model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 341, in train
train_stats = train_one_epoch(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 88, in train_one_epoch
for data_iter_step, (samples, targets) in enumerate(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/util/misc.py", line 239, in log_every
for obj in iterable:
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
data = self._next_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
idx, data = self._get_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
success, data = self._try_get_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
fd = df.detach()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 514, in Client
deliver_challenge(c, authkey)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 750, in deliver_challenge
raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong
Initially, num_workers was not set in my script. I set it to 1 after getting this error, but the error persists.
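For context on the message itself: in the traceback above it is raised by `deliver_challenge`, which is part of the authenticated handshake in Python's `multiprocessing.connection`. The sketch below reproduces the same message in isolation, using only the standard library and deliberately mismatched keys. It only illustrates what the check does; it is not a diagnosis of this bug, and `handshake` and the key values are made up for the example.

```python
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener


def handshake(server_key: bytes, client_key: bytes) -> dict:
    """Run one authenticated connect and report what each side saw."""
    outcome = {}

    # No address given: the Listener picks a local one (a Unix socket on Linux).
    with Listener(authkey=server_key) as listener:

        def serve() -> None:
            try:
                # accept() runs the HMAC challenge/response before returning.
                with listener.accept() as conn:
                    conn.send("hello")
            except AuthenticationError as exc:
                outcome["server_error"] = str(exc)

        server = threading.Thread(target=serve)
        server.start()
        try:
            with Client(listener.address, authkey=client_key) as conn:
                outcome["client"] = conn.recv()
        except AuthenticationError as exc:
            outcome["client_error"] = str(exc)
        server.join()

    return outcome


if __name__ == "__main__":
    print(handshake(b"same-key", b"same-key"))   # keys agree
    print(handshake(b"same-key", b"other-key"))  # keys differ
```

With matching keys the client receives the message; with different keys the side that issued the challenge raises `AuthenticationError` with the text seen in this report. In the traceback above that side is the DataLoader consumer in the main process, which connects to a worker's resource sharer to receive a file descriptor.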
Environment
- RF-DETR: 1.3.0
- OS: Ubuntu 24.04.3
- Python: 3.10.0
- PyTorch: 2.9.0
- CUDA/cuDNN: V12.0.140
- GPU: 4090Ti
Minimal Reproducible Example
Just run the training script above. The dataset is in COCO format.
Additional
No response
Are you willing to submit a PR?
- [ ] Yes, I'd like to help by submitting a PR!
I also get the following error:
...
(1.7863) loss_ce_enc_unscaled: 0.4922 (0.5725) loss_bbox_enc_unscaled: 0.0028 (0.0043) loss_giou_enc_unscaled: 0.1738 (0.2304) cardinality_error_enc_unscaled: 0.7500 (0.7780)
Accumulating evaluation results...
Traceback (most recent call last):
File "/home/quidich/Documents/train_rf.py", line 5, in <module>
model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
self.train_from_config(config, **kwargs)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
self.model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 401, in train
ema_test_stats, _ = evaluate(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 329, in evaluate
coco_evaluator.accumulate()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/datasets/coco_eval.py", line 76, in accumulate
coco_eval.accumulate()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/pycocotools/cocoeval.py", line 362, in accumulate
dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
AttributeError: 'range_iterator' object has no attribute 'concatenate'
wandb:
wandb: 🚀 View run unique-feather-155 at:
This happens after the validation step in the first or second epoch.