Memory leak when logging a 3D histogram from a PyTorch tensor on the GPU
Describe the Bug
The application runs out of memory and is killed when attempting to call log_histogram_3d with a PyTorch tensor that lives on the GPU.
Expected behavior
Either of the following behaviors would be acceptable:
- Comet automatically converts the tensor to a form it can use (see the sketch after this list)
- Comet raises an informative exception
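For illustration, here is a hypothetical sketch of the first option. This is not the actual comet_ml implementation; the helper name and where it would be called are assumptions, and the raise branch shows the alternative behavior.

import numpy
import torch

def _to_host_values(values):
    # Hypothetical helper: normalize values before histogram flattening.
    # A CUDA tensor cannot be converted by numpy.array(), so either copy
    # it to host memory here or fail with a clear message.
    if isinstance(values, torch.Tensor):
        if values.is_cuda:
            return values.detach().cpu().numpy()
            # Alternative: raise an informative exception instead, e.g.
            # raise TypeError("log_histogram_3d received a CUDA tensor; call .cpu() on it first")
        return values.detach().numpy()
    return numpy.asarray(values)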
Where is the issue?
- [ ] Comet Python SDK
- [ ] Comet UI
- [x] Third Party Integrations (Hugging Face, TensorboardX, PyTorch Lightning, etc.)
To Reproduce
import comet_ml
import torch
assert torch.cuda.is_available()
experiment = comet_ml.Experiment(project_name="test")
device = 'cuda'
# device = 'cpu'
x = torch.rand(100, device=device)
experiment.set_step(0)
experiment.log_histogram_3d(x, "x")
The issue goes away when device is set to 'cpu'.
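A minimal sketch of the workaround, reusing the reproduction script above: copy the tensor to host memory before logging it.

# Moving the tensor to the CPU first avoids the failing numpy conversion
experiment.log_histogram_3d(x.detach().cpu(), "x")
# Passing a NumPy array should work as well
experiment.log_histogram_3d(x.detach().cpu().numpy(), "x")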
Stack Trace
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Stack trace if I interrupt it mid memory leak:
Traceback (most recent call last):
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1537, in fast_flatten
items = numpy.array(items, dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 725, in __array__
return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in fast_flatten
items = numpy.array([numpy.array(item) for item in items], dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in <listcomp>
items = numpy.array([numpy.array(item) for item in items], dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 723, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ian.pegg/projects/shining_software/src/shining_research/map_divergence_detection/debug.py", line 12, in <module>
experiment.log_histogram_3d(x, "x")
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/experiment.py", line 2861, in log_histogram_3d
histogram.add(values)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 956, in add
values = fast_flatten(values)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1550, in fast_flatten
return numpy.array(flatten(items))
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1518, in flatten
return list(lazy_flatten(items))
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1503, in lazy_flatten
new_iterator = iter(value)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 688, in __iter__
if torch._C._get_tracing_state():
KeyboardInterrupt
Link to Comet Project/Experiment
https://www.comet.ml/ianpegg-bc/test
Thanks for catching this @ianpegg-bc. I'll have our engineering team look into this.
@ianpegg-bc Following up here. I've created a ticket for the engineering team to address this. In the meantime, the workaround would be to move the tensor to the CPU before logging it as a histogram, as you have suggested.
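For anyone hitting this before a fix lands, a small hypothetical wrapper (the function name is illustrative, not part of the SDK) that applies the workaround automatically:

import torch

def log_histogram_3d_safe(experiment, values, name, **kwargs):
    # Copy GPU tensors to host memory before handing them to the SDK
    if isinstance(values, torch.Tensor):
        values = values.detach().cpu().numpy()
    experiment.log_histogram_3d(values, name, **kwargs)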