sam-hq

When I tried to train the model, there is a bug

Open Ryanye2000 opened this issue 2 years ago • 6 comments

[Screenshot 2023-11-23 181537] I found a bug when I ran the training code. I only have one GPU, so I set --nproc_per_node to 1, but the error was still triggered. I do not know why.

Ryanye2000 avatar Nov 23 '23 10:11 Ryanye2000

What are your PyTorch and CUDA versions? Does model inference run normally?

lkeab avatar Nov 24 '23 09:11 lkeab

I installed PyTorch with this command: conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

Ryanye2000 avatar Nov 25 '23 08:11 Ryanye2000

> what's your pytorch version and cuda version? Does the model inference normally?

My CUDA version is 11.2, but I have already used this setup to run other code and it succeeded.

Ryanye2000 avatar Nov 25 '23 08:11 Ryanye2000
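
For anyone answering the version question above, a minimal sketch that prints the relevant values from inside the training environment. The helper name report_versions is hypothetical and not part of sam-hq; it only uses standard torch attributes and reports a missing install instead of failing.

```python
import importlib.util


def report_versions():
    """Collect PyTorch/CUDA details, or note that torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        # torch is not importable in this environment
        return {"torch": None}
    import torch

    return {
        "torch": torch.__version__,                    # installed PyTorch version
        "cuda_build": torch.version.cuda,              # CUDA version PyTorch was built with
        "cuda_available": torch.cuda.is_available(),   # whether a usable GPU is visible
        "gpu_count": torch.cuda.device_count(),        # number of visible GPUs
    }


if __name__ == "__main__":
    for key, value in report_versions().items():
        print(f"{key}: {value}")
```

Note that cuda_build is the CUDA version bundled with PyTorch (for example via cudatoolkit), which can differ from the system-wide CUDA toolkit version.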

I got the same error. The demo code works fine and generates the segmented results. Have you found a solution?

mzg0108 avatar Dec 06 '23 01:12 mzg0108

This worked for me: lower the batch size (--batch_size_train) and --nproc_per_node if you have only 1 GPU.


torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_h_4b8939.pth --batch_size_train 16 --model-type vit_h --output work_dirs/hq_sam_h

torchrun --nproc_per_node=2 train.py --checkpoint ./pretrained_checkpoint/sam_vit_l_0b3195.pth --batch_size_train 16 --model-type vit_l --output work_dirs/hq_sam_l

crapthings avatar Dec 06 '23 07:12 crapthings

I solved this problem on Google Colab:

  • After importing the libraries, add the following line: local_rank = int(os.environ["LOCAL_RANK"])
  • Remove this line from train.py: parser.add_argument('--local_rank', type=int, help='local rank for dist')
  • Change the command from python -m torch.distributed.launch train.py TRAIN_ARGS to torchrun train.py TRAIN_ARGS

halqadasi avatar Jan 22 '24 10:01 halqadasi
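
The three steps above can be sketched roughly as follows. This is an illustration, not the actual train.py: the helper name get_local_rank and the fallback to 0 are assumptions added here so the script also starts with a plain python invocation, while the steps above read os.environ["LOCAL_RANK"] directly. torchrun exports LOCAL_RANK for each worker process, so the --local_rank argument is no longer needed.

```python
import os


def get_local_rank(default=0):
    """Return the local rank of this process.

    torchrun sets the LOCAL_RANK environment variable for every worker.
    The fallback value is an assumption for runs started without torchrun.
    """
    return int(os.environ.get("LOCAL_RANK", default))


if __name__ == "__main__":
    local_rank = get_local_rank()
    print(f"local_rank = {local_rank}")
```

Launched with torchrun --nproc_per_node=1, the single worker should see LOCAL_RANK set to 0.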