PyTorch video_inference_superanimal runs slowly and then hangs; GPU not used despite torch detecting MPS
Is there an existing issue for this?
- [x] I have searched the existing issues
Operating System
macOS Sequoia
DeepLabCut version
3.0.0rc13
What engine are you using?
pytorch
DeepLabCut mode
single animal
Device type
Apple M3 Ultra
Bug description 🐛
When running:

```python
deeplabcut.video_inference_superanimal(
    [video_path],
    "superanimal_topviewmouse",
    "hrnet_w32",
    "fasterrcnn_resnet50_fpn_v2",
    device="mps",
    batch_size=1,
    detector_batch_size=[variable],
    video_adapt=True,
    max_individuals=1,
)
```

the detector stage runs slowly with detector batch sizes from 1 to 1000 and eventually hangs. No GPU usage is observed during this period.
In the deeplabcut env, if I run:

```python
import torch
print(torch.backends.mps.is_available())
```

I get `True`.
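(For a fuller sanity check, the following would also confirm that PyTorch was built with MPS support and that a tensor can actually be allocated on the device; this snippet is illustrative, not from my original session:)

```python
import torch

print(torch.__version__)
print(torch.backends.mps.is_available())  # MPS backend usable at runtime
print(torch.backends.mps.is_built())      # PyTorch was compiled with MPS support
print(torch.ones(1, device="mps"))        # a tensor can be allocated on the Apple GPU
```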
Steps To Reproduce
No response
Relevant log output
Anything else?
No response
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
Hi, I was debugging some things yesterday and found that the CPU is assigned to the detector when using MPS.
In /deeplabcut/pose_estimation_pytorch/apis/utils.py:

```python
# FIXME: Cannot run detectors on MPS
detector_device = device
if device == "mps":
    detector_device = "cpu"
```
I don't know why it shouldn't work, since after that point the code just uses the torch device, so try removing those lines. I cannot test MPS on my side.
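If you want to edit the installed copy directly, it can be located from within the deeplabcut env (just a convenience snippet; with the override removed, `detector_device` simply stays equal to `device`):

```python
# Print the path of the installed utils.py so the MPS-to-CPU override can be commented out
import deeplabcut.pose_estimation_pytorch.apis.utils as dlc_utils
print(dlc_utils.__file__)
```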
Hope that helps, Juan
Thanks Juan, I went through utils.py and commented out all instances of converting the device to cpu when mps is specified. The code above now engages the GPU, but unfortunately it still hangs after several multiples of the detector batch size.
Hi @drhochbaum, unfortunately MPS support in PyTorch is still evolving and not yet on par with CUDA. This holds especially for detectors like Faster R-CNN that rely on complex ops, and it is probably why your inference hangs or runs slowly when you enable device='mps'. The inelegant device assignment in our code mentioned by @juan-cobos was introduced to avoid the case where tensors are continuously copied back to the CPU as a fallback, making detection slower or even failing with MPS enabled. We are currently not testing on MPS, so we are always happy to get feedback.
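If you want to narrow it down, a minimal standalone check (independent of DeepLabCut, and only a sketch, since we do not test on MPS) is to run torchvision's Faster R-CNN directly on MPS. Setting `PYTORCH_ENABLE_MPS_FALLBACK=1` lets unsupported ops fall back to the CPU instead of raising an error:

```python
import os
# Must be set before torch is imported to take effect
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch
import torchvision

# Same detector architecture as the fasterrcnn_resnet50_fpn_v2 used by the SuperAnimal model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn_v2(weights="DEFAULT")
model.eval().to("mps")

# Dummy input: a list of 3xHxW float images with values in [0, 1]
images = [torch.rand(3, 480, 640, device="mps")]

with torch.no_grad():
    outputs = model(images)

print(outputs[0]["boxes"].shape, outputs[0]["scores"][:5])
```

If this already hangs or is very slow, the bottleneck is in the PyTorch/torchvision MPS backend rather than in DeepLabCut.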
Have you tried using a newer version of PyTorch? I suspect your issues will remain, but I am happy to hear otherwise. In case your analysis still hangs on MPS or remains slow, we advise running it on the CPU instead, or getting access to a CUDA-enabled machine (or using a cloud service, like Google Colab).
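For the CPU route, that would be the same call as in your report with the device switched (a sketch mirroring your parameters, with placeholder values):

```python
import deeplabcut

video_path = "/path/to/your/video.mp4"  # placeholder

deeplabcut.video_inference_superanimal(
    [video_path],
    "superanimal_topviewmouse",
    "hrnet_w32",
    "fasterrcnn_resnet50_fpn_v2",
    device="cpu",            # run both the pose model and the detector on the CPU
    batch_size=1,
    detector_batch_size=1,   # or whichever value you were using
    video_adapt=True,
    max_individuals=1,
)
```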
Here are some useful links for reference:
- https://github.com/pytorch/pytorch/issues/141287
- https://discuss.pytorch.org/t/current-state-of-mps/172212/2
- https://pytorch-lightning.readthedocs.io/en/2.4.0/pytorch/accelerators/mps_basic.html