HiSup

Training the model with another setup

Open minhvu120201dn opened this issue 1 year ago • 10 comments

I have tried training the model with:

  • RTX 2080 Super GPU with 8GB VRAM
  • Backbone: HRNetW48-V2
  • Number of epochs: 30
  • Dataset: AICrowd small

But I only obtained 52.0 AP, while the original paper reports 75.8. Can anyone explain why?

minhvu120201dn avatar Jun 20 '23 17:06 minhvu120201dn

We used all the training data containing 280,741 tiles for the final model. The small version was only utilized for ablation studies.

Best


cherubicXN avatar Jun 20 '23 23:06 cherubicXN

Hi author, I'm training on an RTX 2080 Ti (12 GB). Dataset: 20% of the original CrowdAI training set, about 60,000 images, using the crowdai-small_hrnet48.yaml config. But training still fails with a GPU memory overflow. Could you tell me the reason? Is it because my dataset is too big, or because your network is too big?

zem118 avatar Jul 25 '23 15:07 zem118

I am not sure what "a video memory overflow" is; presumably a GPU out-of-memory error. We never encountered this message in any of our experiments. Please first make sure you can run the demo and get reasonable results. Then I suggest two changes for training: reduce the batch size, which lowers GPU memory usage, or replace HRNet48 with a smaller backbone such as HRNet18.
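To make the first suggestion concrete, here is a rough sketch of the batch-size/memory trade-off: halve the batch size until the estimated footprint fits the card. The helper and all numbers below are purely illustrative (not HiSup code); real per-sample memory depends on tile size, backbone, and optimizer state.

```python
def fit_batch_size(per_sample_mb, budget_mb, start=16):
    """Halve the batch size until the estimated usage fits the GPU budget.

    per_sample_mb: rough GPU memory per training sample (illustrative).
    budget_mb: usable GPU memory after model weights and CUDA overhead.
    """
    bs = start
    while bs > 1 and bs * per_sample_mb > budget_mb:
        bs //= 2
    return bs

# Illustrative numbers only: ~600 MB per tile with HRNet48,
# ~7000 MB usable on an 8 GB card.
print(fit_batch_size(600, 7000))  # -> 8
```

In practice you would just edit the batch-size entry in crowdai-small_hrnet48.yaml and retry; the sketch only shows why halving is a reasonable search strategy.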

SarahwXU avatar Jul 26 '23 03:07 SarahwXU

Ok, thanks for the reply, I've solved the problem! It works successfully on a single GPU. Now my machine has two GPUs (3080s) and I want to train on both. I used multi-train.py from your repo for training, but the run gets stuck: it neither reports an error nor makes progress, and returns no information about the training process. I don't know why. Are there any additional steps you take when training with multiple GPUs?

zem118 avatar Jul 26 '23 12:07 zem118

Hi, is your CUDA compute capability compatible with your current PyTorch version?

XJKunnn avatar Jul 26 '23 12:07 XJKunnn

CUDA is compatible with PyTorch, and it runs successfully on a single GPU. It just doesn't run on dual GPUs; both GPUs sit idle.

zem118 avatar Jul 26 '23 12:07 zem118

Could you please share the log while running the training code?

XJKunnn avatar Jul 26 '23 12:07 XJKunnn

In the terminal, it gets stuck right after "index created! b855a1fa2358c0759a301aa47b3713a".
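A hang right after dataset indexing, with both GPUs idle, often means the distributed process group never finished initializing. One standard way to get more information is to set NCCL's debug environment variables before launching. These are real NCCL variables; setting them from Python as below is just one hedged way to do it (exporting them in the shell works equally well), and whether `multi-train.py` then reveals the cause is not guaranteed.

```python
import os

# Print NCCL initialization and transport logs so the hang point is visible.
os.environ["NCCL_DEBUG"] = "INFO"
# If the logs show a stall during peer-to-peer setup, disabling P2P is a
# common workaround on consumer cards:
os.environ["NCCL_P2P_DISABLE"] = "1"
# Skip the InfiniBand transport on a single desktop machine:
os.environ["NCCL_IB_DISABLE"] = "1"

# ...then launch the repo's multi-GPU training script as usual.
```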

zem118 avatar Jul 26 '23 13:07 zem118

I just ran the multi-GPU code and it runs well. (screenshot) Maybe you should check your environment carefully and follow the steps in the README file.

XJKunnn avatar Jul 26 '23 13:07 XJKunnn

Okay, thanks for the answer.

zem118 avatar Jul 26 '23 13:07 zem118