
Hyperparameter evolve multi GPU with clearml

Open mbenami opened this issue 3 years ago • 9 comments

Search before asking

  • [X] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

Hi, thanks for this great repo! I tried to run a hyperparameter search (evolution) following the code in https://github.com/ultralytics/yolov5/issues/607

# Multi-GPU
for i in 0 1 2 3; do
  sleep $(expr 30 \* $i) &&  # 30-second delay (optional)
  echo 'Starting GPU '$i'...' &&
  nohup python train.py --epochs 10 --data my_data.yaml --weights yolov5s.pt --device $i --evolve > evolve_gpu_$i.log &
done

on an instance with 4 T4 GPUs and ClearML installed.

Three of the four training runs then stop due to this error:

clearml.backend_interface.session.SendError: Action failed <400/801: projects.create/v1.0 (Value combination already exists (unique field already contains this value): name=Hyperparameter, company=*****)> (name=Hyperparameter, description=)

How can I run hyperparameter evolution on multiple GPUs with ClearML installed? Thanks for the help!

Additional

No response

mbenami avatar Sep 15 '22 16:09 mbenami

@thepycoder might be able to help you with this

AyushExel avatar Sep 15 '22 17:09 AyushExel

Hi @mbenami

Thank you for your patience! I was able to reproduce your issue and it seems like you really were on the right track. As you expected, ClearML chokes when running 4 instances at the very same time. Normally this does not happen (e.g. when using multi-GPU training), but when you use an external bash script, ClearML can't regulate its API requests properly because it isn't aware of the other instances at all.

So the issue does seem to come from multiple instances making API requests at the same time, which should be solved by leaving a bit of time between each initialization. It looks like you had the same idea in your script:

sleep $(expr 30 \* $i) &&  # 30-second delay (optional)

Only, you put && after it, which chains the sleep into the command that gets sent to the background by the trailing &, so the loop itself never pauses and in practice the sleep did not stagger the launches at all!

Changing your script to:

# Multi-GPU
for i in 0 1 2 3; do
  sleep $(expr 30 \* $i)  # 30-second delay (optional)  # <--- && removed!
  echo 'Starting GPU '$i'...' &&
  nohup python train.py --epochs 10 --data my_data.yaml --weights yolov5s.pt --device $i --evolve > evolve_gpu_$i.log &
done

Seems to work for me (granted, I have no badass T4 cluster, so the next 3 runs fail with OOM on my single GPU, but the ClearML task creation works without issue!). Can you check if this solves the issue?

Also, off the top of my head: if you run into the issue of all 4 processes reporting to the same task and overriding it, please add reuse_last_task_id=False to the ClearML Task.init call here:

https://github.com/ultralytics/yolov5/blob/fda8aa551d0b732153c2e0848dd6abd887a41cd1/utils/loggers/clearml/clearml_utils.py#L87
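A minimal sketch of what that change could look like, assuming the Task.init call at the linked line takes roughly these arguments (the project and task names below are illustrative, not the actual values used in the repo):

from clearml import Task

task = Task.init(
    project_name='YOLOv5',       # illustrative project name
    task_name='training',        # illustrative task name
    reuse_last_task_id=False,    # force a fresh ClearML task for every process
)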

If you find that this is an issue (it's hard to tell without multi-GPU), please report it here and we will fix that for everyone :)

thepycoder avatar Sep 19 '22 08:09 thepycoder

Hi @thepycoder, thanks for the clarification (I'll check it later when I have a time slot for the GPUs). Just to clarify: with the modification to the bash script and changing reuse_last_task_id=False, I should get only one task running on ClearML, right?

mbenami avatar Sep 19 '22 10:09 mbenami

I would indeed try both changes:

  • The 'new' bash script (remove the first &&) as discussed above
  • reuse_last_task_id=False in the clearml init

With these 2 changes you should get a ClearML task for each of the processes (so that's 4 in total in your case, 1 per GPU). If you want to see them together, you can always compare those 4 and the results will be plotted together :)

FYI: if you don't want to change the code, setting CLEARML_TASK_NO_REUSE=1 as an environment variable should have the same effect as reuse_last_task_id=False.
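For example, one quick way to apply it (just a sketch) is to export the variable in the shell before running the launch loop, so every spawned training process inherits it:

# set once in the shell before the for-loop; inherited by every launched process
export CLEARML_TASK_NO_REUSE=1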

If you can and have the resources for it, it would be nice to see what happens when you don't set reuse_last_task_id=False (you can remove it or set it to True). It should also spawn 4 tasks in ClearML, but I have a suspicion that all 4 processes might try to report to the same task, overriding each other's progress, which would be very bad and would need fixing from our end.

If you can't get the time to do that, we can take a look as well, but it might take a little longer.

thepycoder avatar Sep 19 '22 11:09 thepycoder

@thepycoder ok, but if I want to speed up the hyperparameter search with 4 workers (1 per GPU) that all report to the same experiment, how can I do that? (This is what happens without ClearML: in the end I get a hyp_evolve.yaml file with the best configuration, which I cannot see with ClearML.)

mbenami avatar Sep 19 '22 11:09 mbenami

@mbenami

I'm not completely sure how the evolve functionality actually works; maybe @AyushExel can advise here. So what you're saying is that if you run it without ClearML, you get 1 final evolve result that contains the best parameter combination from all 4 processes?

#9352: this recent issue seems to suggest that instead you get 4 separate folders, each with its own hyp_evolve.yaml, 1 per GPU process.

As far as ClearML is concerned, the bash script launches 4 completely separate python instances, so there is no way for ClearML to know that it should report to 1 single task; you will always get 4 this way. That being said, I'm not even sure ClearML would handle the evolve functionality well, because it seems to retrain the model multiple times with different parameter combinations within the same process/instance, which ClearML was not designed to handle. I'll take a closer look at that when I can.

In the meantime, @AyushExel, can you elaborate on how the evolve functionality works on multi-GPU and how it e.g. works when used in combination with ClearML or wandb?

@mbenami If you really want 1 task and a nice overview in the ClearML interface, maybe take a look at ClearML's native hyperparameter optimization? You would have to set up 4 workers on your 4-GPU node (it will take like 5 minutes), but then you can use ClearML HPO with a nice dashboard and all!
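As a rough sketch of that worker setup (the queue name is illustrative and the exact flags may vary by clearml-agent version), you would start one agent per GPU, all serving the same queue:

# one ClearML agent per GPU, all pulling tasks from the same queue
for i in 0 1 2 3; do
  clearml-agent daemon --detached --gpus $i --queue training  # 'training' is an illustrative queue name
done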

thepycoder avatar Sep 19 '22 11:09 thepycoder

Hi @thepycoder, I can confirm that removing the && solves the error and I now have 4 processes running. Without setting reuse_last_task_id=False, the ClearML logging gets overwritten by the last run, as you suspected, so for now I'm training with reuse_last_task_id=False and I get one log on ClearML per hyperparameter run (so if I evolve for 100 generations I should get 100 experiments on ClearML).

Also, for the hyperparameter search, as I understand from the code, if you set --project and --name to fixed values and pass --resume, all 4 processes will log to the same folder, and the parameters for each run in the evolve will be selected from the shared evolve.csv file.
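A sketch of what that could look like, assuming this reading of the code is right (the --project/--name values are illustrative, and --evolve/--resume behavior may differ between YOLOv5 versions):

# all 4 GPUs write to the same --project/--name folder and therefore share one evolve.csv
for i in 0 1 2 3; do
  sleep $(expr 30 \* $i)  # stagger startup so the ClearML initializations don't collide
  echo 'Starting GPU '$i'...' &&
  nohup python train.py --epochs 10 --data my_data.yaml --weights yolov5s.pt \
    --device $i --evolve --resume --project runs/evolve --name shared \
    > evolve_gpu_$i.log &
done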

mbenami avatar Sep 21 '22 15:09 mbenami

@thepycoder Just seeing this! I missed the previous ping. Glad that it's working as intended now.

AyushExel avatar Sep 21 '22 20:09 AyushExel

@thepycoder Only one thing: currently you can compare only 10 experiments on ClearML, but as I mentioned, I now have 1 experiment per hyperparameter run. Is there a way to compare more?

mbenami avatar Sep 21 '22 20:09 mbenami

@mbenami

Yes, you can indeed only directly compare up to 10 experiments. There's another way to get an overview, though! Direct comparison is meant to be in-depth, at which point more than 10 experiments become quite cluttered when comparing plots etc.

So the other way to go about this is to take the final metrics you think are most interesting (e.g. mAP) and add them to the experiment table as custom metrics. I've added a gif to show you what I mean. It essentially adds any metric you want to the main experiment list and allows you to filter or sort on it. Use shift+click on a column to apply a secondary sort. Effectively this creates a "leaderboard".

[gif: leaderboard]

When you have this leaderboard, you can quickly find the best performing models, select only those for comparison, and then dive in deep without the clutter of the lower performing ones.

thepycoder avatar Sep 27 '22 16:09 thepycoder

@thepycoder Thanks! One thing, if you can change it please: either make the default 10 runs per page (not 15) or increase the compare limit to 15 runs. For now I have to select all, scroll down, and unselect the last 5 to compare, so changing either of these would be nice. Thanks!

mbenami avatar Sep 28 '22 08:09 mbenami

@mbenami I raised your issue with our devs, but most people ask for more tasks in the view, not fewer. On the other hand, to make it at least slightly easier, you can select the first result and then shift-click the last result; everything in between will be selected too.

If it really annoys you, feel free to fix it in the server and open a PR :) we'll be glad to receive it and discuss how to accommodate it.

thepycoder avatar Oct 11 '22 07:10 thepycoder