issue-tracking
Hyperparameters are sometimes not logged when using distributed PyTorch
Describe the Bug
Hello, I am using comet.ml with distributed PyTorch. When the program is executed, the model is initialized on N GPUs and comet.ml starts the corresponding N experiments. However, comet.ml logs the hyperparameters for only some of the GPUs/experiments, not all of them. Why would this happen, and what does it mean?
Expected behavior
I would expect the hyperparameters to be logged in all the experiments.
Where is the issue?
- [ ] Comet Python SDK
- [ ] Comet UI
Screenshots or GIFs

For experiment typical_root_7164, the hyperparameters are not logged.

Whereas for experiment surviving_seasoining_1118, the hyperparameters are logged.

Do they eventually show up after the experiment has finished running?
I always kill the program and restart it. I do this till all the parameters are visible on all the experiments. But I could let it run and get back to you about it. It just seemed strange that it would log for some experiments and not others.
Some things don't log until the end, so it isn't a good idea to kill an experiment. If at all possible, the experiment should run until completion. If you want, you can call experiment.end() manually in your code.
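For illustration, here is a minimal sketch of calling experiment.end() manually so that it runs even if training raises. The helper name `run_with_explicit_end` and the `train_fn` callback are hypothetical, not part of the Comet SDK; the only SDK calls assumed are `log_parameters()` and `end()` on the experiment object, and the usage lines are commented out because they need the SDK and an API key.

```python
def run_with_explicit_end(experiment, hparams, train_fn):
    """Log hyperparameters, run training, and always call experiment.end().

    `experiment` is assumed to be a comet_ml Experiment (or any object with
    log_parameters() and end()); `train_fn` is a hypothetical callback that
    takes the experiment and runs the training loop.
    """
    # Log the hyperparameters up front, as a dict of name -> value.
    experiment.log_parameters(hparams)
    try:
        return train_fn(experiment)
    finally:
        # Ask the SDK to finish sending whatever is still queued, even if
        # train_fn raised an exception.
        experiment.end()


# Hypothetical usage, executed once in each distributed process:
#   from comet_ml import Experiment
#   run_with_explicit_end(Experiment(), {"lr": 1e-3}, train)
```

This only covers processes that exit normally or through a Python exception. A process that is killed from outside will not reach the `finally` block.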
Hi Douglas,
The parameters are not showing up after the experiment has finished running.
appalling_aracde has not logged any parameters.

Some additional questions:
- Did the hyperparameters ever show up (even after a browser refresh)? Sometimes the server takes a few minutes to process everything.
- Did all of these really run for 44 hours, 3 minutes, and 30-some seconds? I don't think I've ever seen such consistency across computers for that long. Are they reporting a lot of data over that time, or all at once? An experiment could get throttled on various limits. See: https://www.comet.ml/docs/python-sdk/warnings-errors/#rate-limits
- Do you have the output captured for the experiment(s) that aren't showing the hyperparameters? I'm wondering if they had connection issues or crashed. Are these links you can share with me in DM? I'm [email protected]
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.