
local rank 0 logging random values of loss when `auto_metric_logging` is set to true?

nsriniva03 opened this issue 3 years ago • 7 comments

Describe the Bug

While logging the loss metric, comet_ml logs random values for local rank 0 but the correct loss values for the other ranks. See the graphs below, as logged by comet_ml.

For rank 0:

[Screenshot: loss curve logged for rank 0]

For rank 1:

[Screenshot: loss curve logged for rank 1]

Loss metric logged on different ranks:

[Screenshot: loss curves overlaid across ranks]

Note that a very large value is logged for rank 0.

Expected behavior

I expect the loss logged on rank 0 to be very similar to that of rank 1.

Where is the issue?

  • [x] Comet Python SDK
  • [x] Comet UI
  • [ ] Third-Party Integrations (Hugging Face, TensorboardX, PyTorch Lightning, etc.)

nsriniva03 avatar Feb 07 '22 14:02 nsriniva03

Hello @nsriniva03. Looking into this. Just to confirm, are you running distributed training with Comet?

Would it be possible to share the code used to run these experiments? It would help our team have more context around what could be happening.

DN6 avatar Feb 07 '22 14:02 DN6

@DN6, I can't share the code, but I could put together a skeletal version that highlights the main steps.

nsriniva03 avatar Feb 07 '22 15:02 nsriniva03

Thank you! That would be very helpful. Additionally, are you using data-parallel training? If so, is your data distributed the same way across your training nodes on every run, or is it shuffled so that batches are randomly assigned to nodes across runs?

DN6 avatar Feb 07 '22 18:02 DN6

Yes, I am using DistributedDataParallel for training, with PyTorch's DistributedSampler handling data distribution across ranks. It loads a subset of the data onto each rank that is exclusive to that rank.
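For context, `DistributedSampler` hands each rank a disjoint, strided slice of the (optionally shuffled) index list, so no two ranks see the same samples within an epoch. A minimal pure-Python sketch of that partitioning scheme (the `partition` helper is illustrative, not part of the PyTorch or Comet API):

```python
def partition(num_samples, rank, world_size):
    """Return the strided, rank-exclusive index slice that a
    DistributedSampler-style split would assign to `rank`."""
    return list(range(rank, num_samples, world_size))

# With 8 samples and 2 ranks, each rank sees a disjoint half:
assert partition(8, 0, 2) == [0, 2, 4, 6]
assert partition(8, 1, 2) == [1, 3, 5, 7]
```

Because the slices are disjoint, per-rank losses can legitimately differ a little, but not by orders of magnitude as in the screenshots above.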

nsriniva03 avatar Feb 07 '22 23:02 nsriniva03

Understood.

Would it be possible to run your experiment again with the data currently being sent to rank 0 sent to another rank instead, to see whether the same behavior occurs on that rank?
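One way to perform that swap, assuming a strided DistributedSampler-style split, is to rotate the rank-to-shard mapping so another rank receives the indices rank 0 normally gets. A sketch (`shard_for` is a hypothetical helper, not a PyTorch API; in a real run you would feed the resulting indices to your sampler or dataset):

```python
def shard_for(num_samples, rank, world_size, offset=1):
    """Rotate the rank-to-shard mapping: with offset=1, each rank
    trains on the strided slice normally assigned to the next rank."""
    shard_rank = (rank + offset) % world_size
    return list(range(shard_rank, num_samples, world_size))

# With 8 samples and 2 ranks, rank 1 now gets rank 0's usual slice:
assert shard_for(8, 1, 2) == [0, 2, 4, 6]
assert shard_for(8, 0, 2) == [1, 3, 5, 7]
```

If the anomalous values then follow the data to the new rank, the data shard is suspect; if they stay on rank 0, the logging path for that rank is suspect.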

DN6 avatar Feb 09 '22 07:02 DN6

@DN6, this will take some time, as I currently have a large experiment running. But I will keep you posted.

nsriniva03 avatar Feb 09 '22 13:02 nsriniva03

Got it. Would it be possible to take a look at the data being sent to rank 0 to see if there is anything unusual about it? Also, if you can share a skeletal version of your code, I can try to replicate the issue.

DN6 avatar Feb 10 '22 12:02 DN6

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.

github-actions[bot] avatar Oct 19 '23 21:10 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Oct 24 '23 21:10 github-actions[bot]