Memory leak when logging a 3D histogram from a PyTorch tensor on the GPU
Describe the Bug
The application runs out of memory and is killed when attempting to call log_histogram_3d with a PyTorch tensor that lives on the GPU.
Expected behavior
Either of the following behaviors would be acceptable:
- Comet automatically converts the tensor to a form it can use (see the sketch after this list)
- Comet raises an informative exception
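For illustration, here is a hypothetical sketch of the first option. This is not the actual comet_ml implementation; the helper name and where it would be called are assumptions, and the raise branch shows the alternative behavior.

import numpy
import torch

def _to_host_values(values):
    # Hypothetical helper: normalize values before histogram flattening.
    # A CUDA tensor cannot be converted by numpy.array(), so either copy
    # it to host memory here or fail with a clear message.
    if isinstance(values, torch.Tensor):
        if values.is_cuda:
            return values.detach().cpu().numpy()
            # Alternative: raise an informative exception instead, e.g.
            # raise TypeError("log_histogram_3d received a CUDA tensor; call .cpu() on it first")
        return values.detach().numpy()
    return numpy.asarray(values)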
Where is the issue?
- [ ] Comet Python SDK
- [ ] Comet UI
- [x] Third Party Integrations (Hugging Face, TensorboardX, PyTorch Lightning, etc.)
To Reproduce
import comet_ml
import torch
assert torch.cuda.is_available()
experiment = comet_ml.Experiment(project_name="test")
device = 'cuda'
# device = 'cpu'
x = torch.rand(100, device=device)
experiment.set_step(0)
experiment.log_histogram_3d(x, "x")
The issue goes away when device is set to 'cpu'.
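A minimal sketch of the workaround, reusing the reproduction script above: copy the tensor to host memory before logging it.

# Moving the tensor to the CPU first avoids the failing numpy conversion
experiment.log_histogram_3d(x.detach().cpu(), "x")
# Passing a NumPy array should work as well
experiment.log_histogram_3d(x.detach().cpu().numpy(), "x")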
Stack Trace
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
Stack trace if I interrupt it mid memory leak:
Traceback (most recent call last):
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1537, in fast_flatten
items = numpy.array(items, dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 725, in __array__
return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in fast_flatten
items = numpy.array([numpy.array(item) for item in items], dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in <listcomp>
items = numpy.array([numpy.array(item) for item in items], dtype=float)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 723, in __array__
return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ian.pegg/projects/shining_software/src/shining_research/map_divergence_detection/debug.py", line 12, in <module>
experiment.log_histogram_3d(x, "x")
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/experiment.py", line 2861, in log_histogram_3d
histogram.add(values)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 956, in add
values = fast_flatten(values)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1550, in fast_flatten
return numpy.array(flatten(items))
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1518, in flatten
return list(lazy_flatten(items))
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1503, in lazy_flatten
new_iterator = iter(value)
File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 688, in __iter__
if torch._C._get_tracing_state():
KeyboardInterrupt
Link to Comet Project/Experiment
https://www.comet.ml/ianpegg-bc/test
Thanks for catching this @ianpegg-bc. I'll have our engineering team look into this.
@ianpegg-bc Following up here. I've created a ticket for the engineering team to address this. In the meantime, the workaround would be to move the tensor to the CPU before logging it as a histogram, as you have suggested.
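For anyone hitting this before a fix lands, a small hypothetical wrapper (the function name is illustrative, not part of the SDK) that applies the workaround automatically:

import torch

def log_histogram_3d_safe(experiment, values, name, **kwargs):
    # Copy GPU tensors to host memory before handing them to the SDK
    if isinstance(values, torch.Tensor):
        values = values.detach().cpu().numpy()
    experiment.log_histogram_3d(values, name, **kwargs)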