MPS Support
Can we have MPS support for Apple Silicon? There are a bunch of audio/video professionals on the Mac platform who could benefit from this.
We didn’t implement anything MPS-specific, but the code is device-agnostic and should work on Apple Silicon. You can try running it by moving the model and inputs to torch.device("mps").
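For reference, the device-agnostic pattern looks like this (nothing SAM-Audio-specific, just standard PyTorch MPS checks):

```python
import torch

# Prefer MPS on Apple Silicon; fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Model weights and input tensors then just get moved with .to(device).
x = torch.randn(2, 3, device=device)
print(x.device)
```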
Unfortunately, it looks like the Perception Model requires Decord, whose wheel is not available on pip for macOS. Is there a way to use SAM Audio without the perception model, so we can at least use just the audio without the visual?
Did you find any solution?
Right now, there are 3 places in SAM Audio with a PE model dependency: the visual encoder, the span predictor, and the judge. It'd be easier to just remove the decord dependency when installing PE; they don't seem to use decord much in PE's key modules.
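A quick sketch of that idea, assuming the decord pin lives in the repo's top-level `requirements.txt` (the helper name is hypothetical):

```python
from pathlib import Path

def strip_decord(req_path: str) -> None:
    """Remove any decord pin from a requirements file, in place,
    before running `pip install -e .` on the cloned repo."""
    req = Path(req_path)
    kept = [
        line for line in req.read_text().splitlines()
        if not line.strip().startswith("decord")
    ]
    req.write_text("\n".join(kept) + "\n")
```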
Thanks! I cloned the Perception Models repo and removed decord, but I still could not get it to install, so I gave up and asked gemini-cli to give it a shot. It relaxed the numpy requirement and modified the Perception Models `core/transformer.py` and `probe.py` for xformers. Then it was able to install sam-audio. Now my problem is that my request to access the model on Hugging Face is still pending lol. How long does approval usually take? I submitted the request on Dec 16, and my username is chibop on HF. Thanks!
@chigkim could you share your setup?
FYI I also applied on the 16th and got approved within 48 hours
@chigkim could you try canceling your original request and requesting again? There was some issue in the first batch of requests. After canceling and requesting again, you should be able to get access within 30 min.
Awesome, thanks! I canceled it, requested again, and got in! I finally got the model working on MPS using the slightly modified inference code below! It would be really nice if we could cleanly install the Perception Models repo on Apple Silicon. Should I close this, or leave it open in case you guys have a plan to make it work on MPS out of the box?
For other people, the key is to clone and install the Perception Models repo separately with decord removed as a dependency.
According to what Gemini3 did, you also need to relax the numpy version requirement and modify `core/transformer.py` and `probe.py` to accommodate xformers.
FYI, it's really memory hungry! It uses 60GB/64GB on my Mac if I run the 15-second audio from examples/assets/office.mp4 with predict_spans=True, reranking_candidates=8.
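If you want to watch that memory pressure yourself, PyTorch exposes allocator counters for the MPS backend (a rough sketch; these report the PyTorch allocator's view, not the whole process):

```python
import torch

def report_mps_memory(label: str = "") -> None:
    # Allocator counters for the MPS backend; silently does nothing
    # on machines without MPS.
    if not torch.backends.mps.is_available():
        return
    alloc_gb = torch.mps.current_allocated_memory() / 1e9
    driver_gb = torch.mps.driver_allocated_memory() / 1e9
    print(f"{label} allocated={alloc_gb:.2f} GB, driver={driver_gb:.2f} GB")

# e.g. call before/after model.separate(...); torch.mps.empty_cache()
# releases cached blocks between runs.
```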
```python
import torch
import torchaudio

from sam_audio import SAMAudio, SAMAudioProcessor

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = SAMAudio.from_pretrained("facebook/sam-audio-large", map_location=device).to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

audio_file = "office.wav"
description = "A man speaking"
inputs = processor(audios=[audio_file], descriptions=[description]).to(device)

with torch.inference_mode():
    result = model.separate(inputs, predict_spans=True, reranking_candidates=8)

target = result.target[0].unsqueeze(0).cpu()
torchaudio.save("target.wav", target, processor.audio_sampling_rate)

residual = result.residual[0].unsqueeze(0).cpu()
torchaudio.save("residual.wav", residual, processor.audio_sampling_rate)
```
> According to what Gemini3 did, you also need to relax the numpy version requirement and modify `core/transformer.py` and `probe.py` to accommodate xformers.

Any chance you could share any of that?
Sure, here's the result from `git diff`:
```diff
diff --git a/core/probe.py b/core/probe.py
index fb81722..e5818fc 100644
--- a/core/probe.py
+++ b/core/probe.py
@@ -32,7 +32,10 @@ from torch.nn.attention import SDPBackend, sdpa_kernel
 from torch.utils._python_dispatch import TorchDispatchMode
 from torch.utils._pytree import tree_map
 from torch.utils.module_tracker import ModuleTracker
-from xformers.ops import fmha
+try:
+    from xformers.ops import fmha
+except ImportError:
+    fmha = None

 @torch.library.custom_op("torchprobe::log", mutates_args=(), device_types=None)
@@ -482,7 +485,7 @@ class AutoProbeD(TorchDispatchMode):
                 func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True
             )
             _compute_attn_stats_sdpa(self, path, **kwargs)
-        elif func._overloadpacket == fmha.flash.FwOp.OPERATOR:
+        elif fmha is not None and func._overloadpacket == fmha.flash.FwOp.OPERATOR:
             _, kwargs = normalize_function(
                 func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True
             )
diff --git a/core/transformer.py b/core/transformer.py
index 6f76f90..e083f8c 100644
--- a/core/transformer.py
+++ b/core/transformer.py
@@ -10,7 +10,11 @@ from torch import nn
 from torch.nn import functional as F
 from torch.nn.attention.flex_attention import (BlockMask, _mask_mod_signature,
                                                flex_attention)
-from xformers.ops import AttentionBias, fmha
+try:
+    from xformers.ops import AttentionBias, fmha
+except ImportError:
+    class AttentionBias: pass
+    fmha = None

 from core import probe
diff --git a/requirements.txt b/requirements.txt
index 2ac27f8..e94fdf9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,9 +1,8 @@
-numpy==2.1.2
+numpy
 omegaconf==2.3.0
 msgspec==0.19.0
 rouge-score==0.1.2
 sacrebleu==2.5.1
-sentencepiece==0.2.0
 tiktoken==0.9.0
 blobfile==3.0.0
 wandb==0.19.8
@@ -19,12 +18,11 @@ iopath==0.1.10
 torchdata==0.11.0
 torchcodec
 timm==1.0.15
-decord==0.6.0
 opencv-python==4.11.0.86
 pycocoevalcap==1.2
 scikit-learn==1.6.1
 scipy==1.15.2
-sentencepiece==0.2.0
+sentencepiece
 tokenizers==0.21.1
 webdataset==0.2.111
 fsspec
```
Thanks @chigkim!
@chigkim I ported this model to MLX-Audio last night, PR coming later today :)
I also mitigated the speed issues by using chunking and Euler as an ODE solver option. This gives you a 2-3x speedup at the cost of a little accuracy.
Will work on lowering memory usage later.
One more thing: for now it doesn't have the visual component; it's purely audio-to-audio :)
Let me know if you are interested in the visual aspect and I will add it.
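For anyone wondering where the Euler speedup comes from: a midpoint step evaluates the velocity network twice per ODE step, an Euler step only once. A toy sketch (`f` stands in for the flow-matching network; these helper names are hypothetical, not the MLX-Audio API):

```python
import torch

def euler_step(f, x, t, dt):
    # One evaluation of the velocity field per step.
    return x + dt * f(x, t)

def midpoint_step(f, x, t, dt):
    # Two evaluations per step: one at t, one at the midpoint.
    k = f(x, t)
    return x + dt * f(x + 0.5 * dt * k, t + 0.5 * dt)
```

Halving the evaluations per step (plus fewer total steps, 2/4 vs 2/32) is what trades a little accuracy for the 2-3x wall-clock win.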
In case anyone stumbles on this issue, I wrote up the instructions from @chigkim, plus some other steps I followed to make it work on a Mac, in a blog post.
However, I can confirm what @chigkim said: processing anything > 15 secs is not really realistic with this setup. Hopefully @Blaizzy's work using Apple's own MLX framework will prove to be fruitful.
| Clip length, s | Processing time, s | Processing time, mins |
|---|---|---|
| 4.77 | 57.8 | ~ 1 |
| 9.54 | 144.4 | ~ 2.5 |
| 14.32 | 258.76 | ~ 4.5 |
| 23.86 | 704.31 | ~ 12 |
| 38.17 | MPS backend out of memory | - |
Hey @gotofritz Here are my results:
Sam-Audio Large
| Method | Clip length, s | Processing time, s | Processing time, mins | Peak memory | ODE |
|---|---|---|---|---|---|
| Default | 143.2 | ~300 | ~ 3-4 | 20GB (FP32) | 2/32 (midpoint) |
| Default | 143.2 | ~150 | ~ 2 | 15GB (FP16) | 2/32 (midpoint) |
| Chunking | 143.2 | 56.5 | ~ 1 | 9GB (FP16) | 2/4 (Euler) |
| Streaming | 143.2 | ~10 (first chunk) / ~58 (final chunk) | ~ 1 | 15GB (FP16) | 2/4 (Euler) |
Note:
- I have 2-level chunking (input and output chunking).
- Input chunking just splits the input audio into chunks of 10s with 3s overlap by default.
- Output chunking chunks the ODE steps for the target and residual to reduce memory spikes (default value is 50). This keeps the peak memory usage constant (at 9-20GB) even after back-to-back calls.
- Just added streaming, and it's phenomenal: I get the first audio chunk I can listen to in 8-10 seconds, and I can listen to the rest on the fly.
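The input-chunking idea (10s windows, 3s overlap) can be sketched like this; `chunk_audio` is a hypothetical helper, and recombining the separated outputs would additionally need a crossfade over the overlap region:

```python
import torch

def chunk_audio(wav: torch.Tensor, sr: int,
                chunk_s: float = 10.0, overlap_s: float = 3.0):
    """Split a (channels, samples) waveform into overlapping chunks.

    Consecutive chunks start (chunk_s - overlap_s) seconds apart, so each
    chunk shares overlap_s seconds with its neighbor.
    """
    chunk = int(chunk_s * sr)
    hop = chunk - int(overlap_s * sr)
    chunks = []
    for start in range(0, wav.shape[-1], hop):
        chunks.append(wav[..., start:start + chunk])
        if start + chunk >= wav.shape[-1]:
            break
    return chunks
```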
Finally, I need help with the anchors; they don't seem to be working properly, or I don't understand the scale (is it 1:1 with seconds?).
Hey, thanks for sharing that @Blaizzy. It turns out I had reranking_candidates set to 8, which was far too much. With it set to 1 and chunking implemented with 10s windows (no overlap), I get these values:
| Clip length, s | Processing time, s | Processing time, mins |
|---|---|---|
| 132.98 | 501.345 | 8m 21.3s |
| 237.03 | 389.048 | 6m 29.0s |
Only input chunking ofc; I don't have access to the ODE. Very variable (I guess it depends on whatever background tasks the MacBook is busy with at the time), but not a million miles from yours. I'll be playing with your libraries next!
Awesome, it's looking much better!
My implementation is purely audio-to-audio. The reranker, vision, and Perception Encoder are missing at the moment.
On MLX-Audio, performance is constant for the most part, unless you have really low battery or less unified RAM available.
Let me know how it goes :)