
PyTorch video_inference_superanimal runs slowly and then hangs; torch detects the GPU, but the GPU is not used.

drhochbaum opened this issue 1 month ago

Is there an existing issue for this?

  • [x] I have searched the existing issues

Operating System

macOS Sequoia

DeepLabCut version

3.0.0rc13

What engine are you using?

pytorch

DeepLabCut mode

single animal

Device type

Apple M3 Ultra

Bug description 🐛

when running:

deeplabcut.video_inference_superanimal(
    [video_path],
    "superanimal_topviewmouse",
    "hrnet_w32",
    "fasterrcnn_resnet50_fpn_v2",
    device="mps",
    batch_size=1,
    detector_batch_size=[variable],
    video_adapt=True,
    max_individuals=1,
)

this runs slowly during the detector stage with detector batch sizes from 1 to 1000, and eventually hangs. No GPU usage is observed during this period.

In the deeplabcut env, if I run:

import torch
print(torch.backends.mps.is_available())

I get True.
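
For completeness, a slightly fuller check (just a sketch, not DeepLabCut-specific) also verifies that the backend was compiled in and that a trivial op runs on the device:

import torch

# Is the MPS backend compiled into this torch build, and is the Metal device reachable?
print(torch.backends.mps.is_built())
print(torch.backends.mps.is_available())

# A trivial op on the device; if this works, plain tensor math on MPS is fine
# and the hang is more likely specific to the detector model.
x = torch.ones(4, device="mps")
print((x * 2).cpu())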

Steps To Reproduce

No response

Relevant log output


Anything else?

No response

Code of Conduct

drhochbaum commented Dec 04 '25 17:12

Hi, I was debugging some things yesterday and found that the CPU is assigned when using MPS.

/deeplabcut/pose_estimation_pytorch/apis/utils.py

# FIXME: Cannot run detectors on MPS
detector_device = device
if device == "mps":
    detector_device = "cpu"

I don't see why it shouldn't work, since afterwards the torch device is used anyway, so try removing those lines. I cannot test MPS on my side.
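
If you want to rule out DeepLabCut entirely, a bare torchvision forward pass of the same detector architecture on MPS should tell you whether the hang comes from the detector itself. A sketch, assuming a recent torchvision with the v2 weights available:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

# Same detector architecture as in the superanimal call, pushed to MPS.
model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval().to("mps")

# Detection models take a list of 3xHxW float images in [0, 1].
dummy = [torch.rand(3, 480, 640, device="mps")]

with torch.no_grad():
    out = model(dummy)

# If this completes, the hang is more likely in the surrounding pipeline.
print(out[0]["boxes"].shape, out[0]["scores"].shape)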

Hope that helps, Juan

juan-cobos commented Dec 05 '25 16:12

Thanks Juan, I went through utils.py and commented out every place where the device is converted to cpu when mps is specified. The code above now engages the GPU, but unfortunately it still hangs after several multiples of the detector batch size.

drhochbaum commented Dec 05 '25 18:12

Hi @drhochbaum, unfortunately MPS support in PyTorch is still evolving and not yet on par with CUDA, especially for detectors like Faster R-CNN that rely on complex ops. That is probably why your inference hangs or runs slowly when you set device = 'mps'. The inelegant device assignment in our code that @juan-cobos mentioned was introduced to avoid tensors being continuously copied back to the CPU as a fallback, which makes detection slower or can even make it fail with MPS enabled. We are currently not testing on MPS, so we are always happy to get feedback.

Have you tried the newer versions of PyTorch? I suspect your issues will remain, but I am happy to hear otherwise. If your analysis still hangs or remains slow on MPS, we advise running it on CPU instead, or getting access to a CUDA-enabled machine (or using a cloud service such as Google Colab).

Here are some useful links for reference:
https://github.com/pytorch/pytorch/issues/141287
https://discuss.pytorch.org/t/current-state-of-mps/172212/2
https://pytorch-lightning.readthedocs.io/en/2.4.0/pytorch/accelerators/mps_basic.html
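
If you still want to experiment with MPS, one thing you could try (we do not test this) is PyTorch's CPU-fallback flag, which routes ops that have no MPS kernel to the CPU instead of raising an error. It has to be set before torch is imported, and as mentioned above the extra copies can make things slow:

import os

# Must be set before the first `import torch` anywhere in the process;
# ops without an MPS implementation then fall back to CPU instead of raising.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # now pass device="mps" to video_inference_superanimal as before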

deruyter92 commented Dec 18 '25 09:12