Error While Pretraining
Hi,
Thank you for the great work and for providing the pre-trained models. I was trying to run pre-training following the instructions in pretrain.md, but I am getting the attached error. My environment details are listed below. Any help would be appreciated.

- PyTorch: 1.9.0+cu11.1
- TorchVision: 0.10.0
- Transformers: 4.5.1
- Hardware: a single machine with 4x RTX A6000
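As an aside, a quick way to gather the kind of environment report listed above is a small script like the following. This is just a sketch; the `collect_env_info` helper name is my own, and only standard `importlib` plus the libraries themselves are used.

```python
import importlib


def collect_env_info():
    """Gather the library versions and GPU info typically requested in bug reports."""
    info = {}
    # Record the version of each relevant library, or note that it is missing.
    for mod_name in ("torch", "torchvision", "transformers"):
        try:
            mod = importlib.import_module(mod_name)
            info[mod_name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            info[mod_name] = "not installed"
    # GPU details are only available when PyTorch itself is importable.
    try:
        import torch
        info["cuda_available"] = torch.cuda.is_available()
        info["gpu_count"] = torch.cuda.device_count()
    except ImportError:
        info["cuda_available"] = False
        info["gpu_count"] = 0
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print(f"{key}: {value}")
```

Pasting the output of such a script into the issue makes it easier for maintainers to spot version mismatches.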
Hi @mmaaz60 Thanks for your interest in MDETR.
Could you provide the following information to help debug your error?
- Exact command line
- Did you change anything in the dataset?
- Have you tried running it on one GPU first?
- Have you tried running on CPU first?
Hi @alcinos,
Thank you for your reply. Please find the required information below.
- Exact command line
export CUBLAS_WORKSPACE_CONFIG=:4096:8
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5 --batch_size 4 --output-dir ./mdetr/pretrain_batch_4
- Did you change anything in the dataset?
No, I didn't change anything in the dataset.
- Have you tried running it on one GPU first?
I tried the same command with --nproc_per_node=1 on the same machine and got the same error. However, I also tried distributed training across PCs each equipped with a single GPU and connected over LAN, and that training started successfully.
- Have you tried running on CPU first?
No, I didn't try that.