
PyTorch video_inference_superanimal runs slowly and then hangs; torch detects the GPU, but the GPU is not used.

drhochbaum opened this issue 1 month ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Operating System

macOS Sequoia

DeepLabCut version

3.0.0rc13

What engine are you using?

pytorch

DeepLabCut mode

single animal

Device type

Apple M3 Ultra

Bug description 🐛

when running:

deeplabcut.video_inference_superanimal(
    [video_path],
    "superanimal_topviewmouse",
    "hrnet_w32",
    "fasterrcnn_resnet50_fpn_v2",
    device="mps",
    batch_size=1,
    detector_batch_size=[variable],
    video_adapt=True,
    max_individuals=1,
)

this runs slowly during the detector stage with detector batch sizes from 1 to 1000, and eventually hangs. No GPU usage is observed during this period.

In the deeplabcut env, if I run:

import torch
print(torch.backends.mps.is_available())

I get True.
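
For completeness, a slightly fuller check (just a sketch, not DeepLabCut-specific) also verifies that the backend was compiled in and that a trivial op runs on the device:

import torch

# Is the MPS backend compiled into this torch build, and is the Metal device reachable?
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())

# A trivial op on the device; if this works, plain tensor math on MPS is fine
# and the hang is more likely specific to the detector model.
x = torch.ones(4, device="mps")
print((x * 2).cpu())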

Steps To Reproduce

No response

Relevant log output


Anything else?

No response

Code of Conduct

drhochbaum commented Dec 04 '25 17:12

Hi, I was debugging some things yesterday and found that the CPU is assigned when using MPS.

/deeplabcut/pose_estimation_pytorch/apis/utils.py

# FIXME: Cannot run detectors on MPS
detector_device = device
if device == "mps":
    detector_device = "cpu"

I don't see why it shouldn't work, since afterwards the torch device is used anyway, so try removing those lines. I cannot test MPS on my side.
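
If you want to rule out DeepLabCut entirely, a bare torchvision forward pass of the same detector architecture on MPS should tell you whether the hang comes from the detector itself. A sketch, assuming a recent torchvision with the v2 weights available:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

# Same detector architecture as in the superanimal call, pushed to MPS.
model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval().to("mps")

# Detection models take a list of 3xHxW float images in [0, 1].
dummy = [torch.rand(3, 480, 640, device="mps")]

with torch.no_grad():
    out = model(dummy)

# If this completes, the hang is more likely in the surrounding pipeline.
print(out[0]["boxes"].shape, out[0]["scores"].shape)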

Hope that helps, Juan

juan-cobos commented Dec 05 '25 16:12

Thanks Juan, I went through utils.py and commented out every place where the device is converted to cpu when mps is specified. The code above now engages the GPU, but unfortunately it still hangs after several multiples of the detector batch size.

drhochbaum commented Dec 05 '25 18:12

Hi @drhochbaum, unfortunately MPS support in PyTorch is still evolving and not yet on par with CUDA, especially for detectors like Faster R-CNN that rely on complex ops. That is probably why your inference hangs or runs slowly when you set device = 'mps'. The inelegant device assignment in our code that @juan-cobos mentioned was introduced to avoid tensors being continuously copied back to the CPU as a fallback, which makes detection slower or can even make it fail with MPS enabled. We are currently not testing on MPS, so we are always happy to get feedback.

Have you tried the newer versions of PyTorch? I suspect your issues will remain, but I am happy to hear otherwise. If your analysis still hangs or remains slow on MPS, we advise running it on CPU instead, or getting access to a CUDA-enabled machine (or using a cloud service such as Google Colab).

Here are some useful links for reference:
https://github.com/pytorch/pytorch/issues/141287
https://discuss.pytorch.org/t/current-state-of-mps/172212/2
https://pytorch-lightning.readthedocs.io/en/2.4.0/pytorch/accelerators/mps_basic.html
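
If you still want to experiment with MPS, one thing you could try (we do not test this) is PyTorch's CPU-fallback flag, which routes ops that have no MPS kernel to the CPU instead of raising an error. It has to be set before torch is imported, and as mentioned above the extra copies can make things slow:

import os

# Must be set before the first `import torch` anywhere in the process;
# ops without an MPS implementation then fall back to CPU instead of raising.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # now pass device="mps" to video_inference_superanimal as before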

deruyter92 commented Dec 18 '25 09:12