
Error While Pretraining

Open mmaaz60 opened this issue 4 years ago • 2 comments

Hi,

Thank you for the great work and for providing the pre-trained models. I was trying to run pre-training following the instructions in pretrain.md, but I am getting the error shown in the attached screenshot. My environment details are listed below. Any help would be appreciated.

  • PyTorch: 1.9.0+cu111
  • TorchVision: 0.10.0
  • Transformers: 4.5.1
  • Hardware: a single machine with 4× RTX A6000
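For reference, environment details like these can be gathered programmatically rather than by hand. A minimal sketch using only the standard library (the package names below are the PyPI distribution names, which is an assumption about how the packages were installed):

```python
from importlib import metadata

def collect_versions(packages):
    """Return a mapping of package name -> installed version (or 'not installed')."""
    versions = {}
    for name in packages:
        try:
            # Looks up the version from installed distribution metadata.
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

if __name__ == "__main__":
    for pkg, ver in collect_versions(["torch", "torchvision", "transformers"]).items():
        print(f"{pkg}: {ver}")
```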

mmaaz60 avatar Aug 17 '21 00:08 mmaaz60

Hi @mmaaz60 Thanks for your interest in MDETR.

Could you provide the following information to help debug your error?

  • Exact command line
  • Did you change anything in the dataset?
  • Have you tried running it on one GPU first?
  • Have you tried running on CPU first?

alcinos avatar Aug 22 '21 08:08 alcinos


Hi @alcinos,

Thank you for your reply. Please find the required information below.

  • Exact command line
export CUBLAS_WORKSPACE_CONFIG=:4096:8
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --dataset_config configs/pretrain.json --ema --backbone timm_tf_efficientnet_b3_ns --lr_backbone 5e-5 --batch_size 4 --output-dir ./mdetr/pretrain_batch_4
  • Did you change anything in the dataset?

No, I didn't change anything in the dataset.

  • Have you tried running it on one GPU first?

I have tried running the same command with --nproc_per_node=1 on the same machine and got the same error. However, when I ran distributed training across PCs that each have a single GPU and are connected over LAN, training started successfully.

  • Have you tried running on CPU first?

No, I didn't try that.
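As an aside on the command above: CUBLAS_WORKSPACE_CONFIG=:4096:8 is the setting PyTorch's reproducibility notes require for deterministic cuBLAS kernels on CUDA 10.2 and newer, and it must be in place before the first CUDA call. It can also be set from inside the script instead of the shell. A minimal sketch (the torch import is guarded so the snippet runs even without torch installed; whether your training actually enables deterministic algorithms is an assumption here):

```python
import os

# cuBLAS needs this workspace configuration for deterministic GEMM on
# CUDA >= 10.2; it must be set before any CUDA call is made.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

try:
    import torch
    # Opt in to deterministic algorithms; PyTorch will raise an error
    # if an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
except ImportError:
    pass  # torch not available in this environment; the env var is still set
```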

mmaaz60 avatar Aug 22 '21 09:08 mmaaz60