Slow loss convergence

Open · arielverbin opened this issue 1 year ago · 3 comments

Hello, I'm attempting to perform fine-tuning with your implementation (I'm using commit e8e2ad1 from April 24, since I don't need the feet keypoints). Unfortunately, the loss doesn't seem to converge properly. I also tried running the training from scratch (without fine-tuning): in the first 5 epochs the loss decreased from 0.0168 to 0.0063, but then stayed stuck at 0.0063 for the next 25 epochs.

Do you have any suggestions for how to solve this? I used the same hyperparameters as in your code, except that I changed the layer decay rate from 0.75 to 1 - 1e-4.
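(For context: layer-wise decay typically scales each block's learning rate by decay ** (distance from the head), so with 0.75 the earliest layers train much more slowly, while 1 - 1e-4 keeps every layer at essentially the base LR, i.e. the decay is effectively disabled. A minimal sketch of the idea, assuming the common "blocks.<i>." parameter naming; TinyViT and build_param_groups are illustrative names, not this repo's code:)

```python
import torch
import torch.nn as nn

# Toy ViT-like module whose parameter names follow the usual "blocks.<i>." scheme.
class TinyViT(nn.Module):
    def __init__(self, depth=4, dim=8):
        super().__init__()
        self.patch_embed = nn.Linear(dim, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(depth))
        self.head = nn.Linear(dim, dim)

def build_param_groups(model, base_lr=5e-4, layer_decay=0.75, num_layers=4):
    """Scale each layer's LR by layer_decay ** (distance from the head)."""
    groups = []
    for name, param in model.named_parameters():
        if name.startswith("blocks."):
            depth = int(name.split(".")[1]) + 1   # blocks sit between embed and head
        elif "patch_embed" in name:
            depth = 0                             # earliest layer, smallest LR
        else:
            depth = num_layers + 1                # the head keeps the base LR
        scale = layer_decay ** (num_layers + 1 - depth)
        groups.append({"params": [param], "lr": base_lr * scale})
    return groups

optimizer = torch.optim.AdamW(build_param_groups(TinyViT(), layer_decay=0.75))
# With layer_decay=0.75 the patch embed trains at ~0.24x the base LR;
# with layer_decay=1 - 1e-4 every group ends up at ~1.0x, i.e. no decay.
```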

Thank you for your time and assistance!

arielverbin avatar Mar 05 '24 21:03 arielverbin

I'm sorry to hear that. Can you try to repeat your experiments with the original implementation I started from and see if there's any difference? https://github.com/jaehyunnn/ViTPose_pytorch

JunkyByte avatar Mar 07 '24 14:03 JunkyByte

Same problem :( the loss doesn't seem to go below 0.006-0.007.

[screenshot: training loss curve]

I used the exact code from the repository, except:

  • In config.yaml, changed resume_from to False.
  • In COCO.py, changed np.float to float (it raised an error; np.float was deprecated in NumPy 1.20 and removed in 1.24).
  • In COCO.py, I also added a grayscale-to-RGB conversion when image.ndim == 2 (as you did in this repository); see the sketch after this list.
  • In train.py, changed data_version="train_custom" / "valid_custom" to "train2017" / "val2017", so it matches the directory names in the COCO dataset. Maybe this is the problem? I used the COCO dataset without any preprocessing.
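For concreteness, here is a minimal sketch of the two COCO.py tweaks above (the variables raw_joints and image are placeholders standing in for what the dataset loader produces, not the repo's actual code):

```python
import numpy as np

# Placeholder inputs standing in for what the dataset loader produces.
raw_joints = [[10.0, 20.0, 1.0], [30.0, 40.0, 1.0]]
image = np.zeros((256, 192), dtype=np.uint8)  # a grayscale (2-D) image

# np.float was deprecated in NumPy 1.20 and removed in 1.24;
# the builtin float is the drop-in replacement.
joints = np.array(raw_joints, dtype=float)

# Grayscale images are 2-D (H, W); replicate the channel to (H, W, 3)
# so downstream transforms always see an RGB-shaped array.
if image.ndim == 2:
    image = np.stack([image] * 3, axis=-1)

assert image.ndim == 3 and image.shape[-1] == 3
```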

I might just be impatient, but in the log files of the official repo the loss reached 0.003 within the first epoch.

arielverbin avatar Mar 07 '24 16:03 arielverbin

I’m sorry, it seems to be a problem with the original project. If the fine-tuning part is indeed broken, I will remove it completely from the current state of the repository. I would suggest using the original ViTPose implementation, or checking whether there is any obvious bug in this one. Good luck

JunkyByte avatar Mar 07 '24 16:03 JunkyByte