multiprocessing.context.AuthenticationError: digest received was wrong

Open: pra-dan opened this issue 1 month ago • 1 comment

Search before asking

  • [x] I have searched the RF-DETR issues and found no similar bug report.

Bug

I am trying to fine-tune the Medium model on my dataset, which contains a single class. This is my training script:

from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="nov11",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-5,
    num_workers=1, 
    output_dir="rfdet_nov11",
    resolution=1232,
    device='cuda',
    wandb=True,
    project="ball_det",
    early_stopping=True,
    early_stopping_patience=10
)

It runs into this error before finishing the first epoch:

...
.0049)  loss_giou_2_unscaled: 0.1385 (0.1908)  cardinality_error_2_unscaled: 0.7500 (1.7778)  loss_ce_enc_unscaled: 0.5104 (0.6461)  loss_bbox_enc_unscaled: 0.0036 (0.0056)  loss_giou_enc_unscaled: 0.1650 (0.2092)  cardinality_error_enc_unscaled: 0.5000 (0.6321)  time: 0.6672  data: 0.0072  max mem: 10285
Traceback (most recent call last):
  File "/home/quidich/Documents/train_rf.py", line 5, in <module>
    model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
    self.train_from_config(config, **kwargs)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
    self.model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 341, in train
    train_stats = train_one_epoch(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 88, in train_one_epoch
    for data_iter_step, (samples, targets) in enumerate(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/util/misc.py", line 239, in log_every
    for obj in iterable:
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
    idx, data = self._get_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
    success, data = self._try_get_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 750, in deliver_challenge
    raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong

Initially, num_workers was not set in my script. I set it to 1 after first hitting this error, but the error persists.
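For context on the exception itself: in the traceback above it is raised by deliver_challenge, called from Client in multiprocessing/connection.py, which is the step where one end of a connection checks the digest sent by its peer. The sketch below is not a reproduction of the RF-DETR failure and does not explain why the digest would be wrong during training. It only uses the standard library to show the same handshake failing when the two ends hold different keys. The function name handshake_errors and the key values are hypothetical, chosen for illustration.

```python
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener


def handshake_errors(server_key: bytes, client_key: bytes) -> dict:
    """Run one authenticated handshake and collect the error text from each side."""
    errors = {}
    # No address given: the Listener picks a platform-appropriate local endpoint.
    listener = Listener(authkey=server_key)

    def serve():
        try:
            listener.accept().close()
        except AuthenticationError as exc:
            errors["server"] = str(exc)

    worker = threading.Thread(target=serve, daemon=True)
    worker.start()
    try:
        Client(listener.address, authkey=client_key).close()
    except AuthenticationError as exc:
        errors["client"] = str(exc)
    worker.join()
    listener.close()
    return errors


if __name__ == "__main__":
    print(handshake_errors(b"same-key", b"same-key"))   # matching keys
    print(handshake_errors(b"key-a", b"key-b"))         # mismatched keys
```

With matching keys the returned dict is empty. With different keys both ends report an AuthenticationError, and the dict holds the message each side raised.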

Environment

  • RF-DETR: 1.3.0
  • OS: Ubuntu 24.04.3
  • Python: 3.10.0
  • PyTorch: 2.9.0
  • CUDA/cuDNN: V12.0.140
  • GPU: 4090Ti

Minimal Reproducible Example

Run the training script above. The dataset is in COCO format.

Additional

No response

Are you willing to submit a PR?

  • [ ] Yes, I'd like to help by submitting a PR!

pra-dan, Nov 12 '25 06:11

In my case, I get:

...
(1.7863)  loss_ce_enc_unscaled: 0.4922 (0.5725)  loss_bbox_enc_unscaled: 0.0028 (0.0043)  loss_giou_enc_unscaled: 0.1738 (0.2304)  cardinality_error_enc_unscaled: 0.7500 (0.7780)
Accumulating evaluation results...
Traceback (most recent call last):
  File "/home/quidich/Documents/train_rf.py", line 5, in <module>
    model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
    self.train_from_config(config, **kwargs)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
    self.model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 401, in train
    ema_test_stats, _ = evaluate(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 329, in evaluate
    coco_evaluator.accumulate()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/datasets/coco_eval.py", line 76, in accumulate
    coco_eval.accumulate()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/pycocotools/cocoeval.py", line 362, in accumulate
    dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
AttributeError: 'range_iterator' object has no attribute 'concatenate'
wandb: 
wandb: 🚀 View run unique-feather-155 at: 

This happens after the validation step in the first or second epoch.
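Read literally, the message says that when np.concatenate was evaluated in cocoeval.py, the name np referred to a range_iterator object rather than the NumPy module. The snippet below is a minimal standard-library illustration of that reading. It makes no claim about how np came to be rebound inside pycocotools, and attribute_error_message is a hypothetical helper.

```python
def attribute_error_message(obj, attr: str) -> str:
    """Return the AttributeError text from looking up attr on obj, or '' if it exists."""
    try:
        getattr(obj, attr)
    except AttributeError as exc:
        return str(exc)
    return ""


# Stand-in for a module-level name that no longer points at NumPy.
np = iter(range(3))
message = attribute_error_message(np, "concatenate")
print(message)
```

Checking type(np) immediately before the failing call would confirm whether the name is bound to the NumPy module at that point.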

q-prashant, Nov 14 '25 12:11