MPS Support
Can we have MPS support for Apple Silicon? There are a bunch of audio/video professionals on the Mac platform who could benefit from this.
We didn’t implement anything MPS-specific, but the code is device-agnostic and should work on Apple Silicon. You can try running it by moving the model and inputs to torch.device("mps").
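For reference, the device-agnostic pattern looks like this (nothing SAM-Audio-specific, just standard PyTorch MPS checks):

```python
import torch

# Prefer MPS on Apple Silicon; fall back to CPU otherwise.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")

# Model weights and input tensors then just get moved with .to(device).
x = torch.randn(2, 3, device=device)
print(x.device)
```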
Unfortunately, it looks like the Perception Model requires Decord, whose wheel is not available on pip for macOS. Is there a way to use SAM Audio without the perception model, so we can at least use just the audio without the visual?
Did you find any solution?
Right now, there are 3 places in SAM Audio with a PE model dependency: the visual encoder, the span predictor, and the judge. It'd be easier to just remove the decord dependency when installing PE; they don't seem to use decord much in PE's key modules.
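A quick sketch of that idea, assuming the decord pin lives in the repo's top-level `requirements.txt` (the helper name is hypothetical):

```python
from pathlib import Path

def strip_decord(req_path: str) -> None:
    """Remove any decord pin from a requirements file, in place,
    before running `pip install -e .` on the cloned repo."""
    req = Path(req_path)
    kept = [
        line for line in req.read_text().splitlines()
        if not line.strip().startswith("decord")
    ]
    req.write_text("\n".join(kept) + "\n")
```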
Thanks! I cloned the Perception Models repo and removed decord, but I still could not get it to install, so I gave up and asked gemini-cli to give it a shot. It relaxed the numpy requirement and modified the Perception Models `core/transformer.py` and `probe.py` for xformers. Then it was able to install sam-audio. Now my problem is that my request to access the model on Hugging Face is still pending lol. How long does approval usually take? I submitted the request on Dec 16, and my username is chibop on HF. Thanks!
@chigkim could you share your setup?
FYI I also applied on the 16th and got approved within 48 hours
@chigkim could you try canceling your original request and requesting again? There was some issue in the first batch of requests. After canceling and requesting again, you should be able to get access within 30 min.
Awesome, thanks! I canceled it, requested again, and got in! I finally got the model working on MPS using the slightly modified inference code below! It would be really nice if we could cleanly install the Perception Models repo on Apple Silicon. Should I close this, or leave it open in case you guys have a plan to make it work on MPS out of the box?
For other people, the key is to clone and install the Perception Models repo separately with decord removed as a dependency.
According to what Gemini3 did, you also need to relax the numpy version requirement and modify `core/transformer.py` and `probe.py` to accommodate xformers.
FYI, it's really memory hungry! It uses 60GB/64GB on my Mac if I run the 15-second audio from examples/assets/office.mp4 with predict_spans=True, reranking_candidates=8.
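If you want to watch that memory pressure yourself, PyTorch exposes allocator counters for the MPS backend (a rough sketch; these report the PyTorch allocator's view, not the whole process):

```python
import torch

def report_mps_memory(label: str = "") -> None:
    # Allocator counters for the MPS backend; silently does nothing
    # on machines without MPS.
    if not torch.backends.mps.is_available():
        return
    alloc_gb = torch.mps.current_allocated_memory() / 1e9
    driver_gb = torch.mps.driver_allocated_memory() / 1e9
    print(f"{label} allocated={alloc_gb:.2f} GB, driver={driver_gb:.2f} GB")

# e.g. call before/after model.separate(...); torch.mps.empty_cache()
# releases cached blocks between runs.
```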
```python
import torch
import torchaudio

from sam_audio import SAMAudio, SAMAudioProcessor

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = SAMAudio.from_pretrained("facebook/sam-audio-large", map_location=device).to(device).eval()
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")

audio_file = "office.wav"
description = "A man speaking"
inputs = processor(audios=[audio_file], descriptions=[description]).to(device)

with torch.inference_mode():
    result = model.separate(inputs, predict_spans=True, reranking_candidates=8)

target = result.target[0].unsqueeze(0).cpu()
torchaudio.save("target.wav", target, processor.audio_sampling_rate)

residual = result.residual[0].unsqueeze(0).cpu()
torchaudio.save("residual.wav", residual, processor.audio_sampling_rate)
```
> According to what Gemini3 did, you also need to relax the numpy version requirement and modify `core/transformer.py` and `probe.py` to accommodate xformers.

Any chance you could share any of that?
Sure, here's the result from `git diff`:
```diff
diff --git a/core/probe.py b/core/probe.py
index fb81722..e5818fc 100644
--- a/core/probe.py
+++ b/core/probe.py
@@ -32,7 +32,10 @@ from torch.nn.attention import SDPBackend, sdpa_kernel
 from torch.utils._python_dispatch import TorchDispatchMode
 from torch.utils._pytree import tree_map
 from torch.utils.module_tracker import ModuleTracker
-from xformers.ops import fmha
+try:
+    from xformers.ops import fmha
+except ImportError:
+    fmha = None

 @torch.library.custom_op("torchprobe::log", mutates_args=(), device_types=None)
@@ -482,7 +485,7 @@ class AutoProbeD(TorchDispatchMode):
                 func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True
             )
             _compute_attn_stats_sdpa(self, path, **kwargs)
-        elif func._overloadpacket == fmha.flash.FwOp.OPERATOR:
+        elif fmha is not None and func._overloadpacket == fmha.flash.FwOp.OPERATOR:
             _, kwargs = normalize_function(
                 func, args=args, kwargs=kwargs, normalize_to_only_use_kwargs=True
             )
diff --git a/core/transformer.py b/core/transformer.py
index 6f76f90..e083f8c 100644
--- a/core/transformer.py
+++ b/core/transformer.py
@@ -10,7 +10,11 @@ from torch import nn
 from torch.nn import functional as F
 from torch.nn.attention.flex_attention import (BlockMask, _mask_mod_signature,
                                                flex_attention)
-from xformers.ops import AttentionBias, fmha
+try:
+    from xformers.ops import AttentionBias, fmha
+except ImportError:
+    class AttentionBias: pass
+    fmha = None

 from core import probe
diff --git a/requirements.txt b/requirements.txt
index 2ac27f8..e94fdf9 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,9 +1,8 @@
-numpy==2.1.2
+numpy
 omegaconf==2.3.0
 msgspec==0.19.0
 rouge-score==0.1.2
 sacrebleu==2.5.1
-sentencepiece==0.2.0
 tiktoken==0.9.0
 blobfile==3.0.0
 wandb==0.19.8
@@ -19,12 +18,11 @@ iopath==0.1.10
 torchdata==0.11.0
 torchcodec
 timm==1.0.15
-decord==0.6.0
 opencv-python==4.11.0.86
 pycocoevalcap==1.2
 scikit-learn==1.6.1
 scipy==1.15.2
-sentencepiece==0.2.0
+sentencepiece
 tokenizers==0.21.1
 webdataset==0.2.111
 fsspec
```
Thanks @chigkim!
@chigkim I ported this model to MLX-Audio last night, PR coming later today :)
I also mitigated the speed issues by using chunking and Euler as an ODE solver option. This gives you a 2-3x speedup at the cost of a little accuracy.
Will work on lowering memory usage later.
One more thing: for now it doesn't have the visual component; it's purely audio-to-audio :)
Let me know if you are interested in the visual aspect and I will add it.
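For anyone wondering where the Euler speedup comes from: a midpoint step evaluates the velocity network twice per ODE step, an Euler step only once. A toy sketch (`f` stands in for the flow-matching network; these helper names are hypothetical, not the MLX-Audio API):

```python
import torch

def euler_step(f, x, t, dt):
    # One evaluation of the velocity field per step.
    return x + dt * f(x, t)

def midpoint_step(f, x, t, dt):
    # Two evaluations per step: one at t, one at the midpoint.
    k = f(x, t)
    return x + dt * f(x + 0.5 * dt * k, t + 0.5 * dt)
```

Halving the evaluations per step (plus fewer total steps, 2/4 vs 2/32) is what trades a little accuracy for the 2-3x wall-clock win.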
In case anyone stumbles on this issue, I wrote up the instructions from @chigkim, plus some other steps I followed to make it work on a Mac, in a blog post.
However, I can confirm what @chigkim said: processing anything > 15 secs is not really realistic with this setup. Hopefully @Blaizzy's work using Apple's own MLX framework will prove to be fruitful.
| Clip length, s | Processing time, s | Processing time, mins |
|---|---|---|
| 4.77 | 57.8 | ~ 1 |
| 9.54 | 144.4 | ~ 2.5 |
| 14.32 | 258.76 | ~ 4.5 |
| 23.86 | 704.31 | ~ 12 |
| 38.17 | MPS backend out of memory | - |
Hey @gotofritz Here are my results:
Sam-Audio Large
| Method | Clip length, s | Processing time, s | Processing time, mins | Peak memory | ODE |
|---|---|---|---|---|---|
| Default | 143.2 | ~300 | ~ 3-4 | 20GB (FP32) | 2/32 (midpoint) |
| Default | 143.2 | ~150 | ~ 2 | 15GB (FP16) | 2/32 (midpoint) |
| Chunking | 143.2 | 56.5 | ~ 1 | 9GB (FP16) | 2/4 (Euler) |
| Streaming | 143.2 | ~10 (first chunk) / ~58 (final chunk) | ~ 1 | 15GB (FP16) | 2/4 (Euler) |
Note:
- I have 2-level chunking (input and output chunking).
- Input chunking just splits the input audio into chunks of 10s with 3s overlap by default.
- Output chunking chunks the ODE steps for the target and residual to reduce memory spikes (default value is 50). This keeps the peak memory usage constant (at 9-20GB) even after back-to-back calls.
- Just added streaming, and it's phenomenal: I get the first audio chunk I can listen to in 8-10 seconds, and I can listen to the rest on the fly.
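The input-chunking idea (10s windows, 3s overlap) can be sketched like this; `chunk_audio` is a hypothetical helper, and recombining the separated outputs would additionally need a crossfade over the overlap region:

```python
import torch

def chunk_audio(wav: torch.Tensor, sr: int,
                chunk_s: float = 10.0, overlap_s: float = 3.0):
    """Split a (channels, samples) waveform into overlapping chunks.

    Consecutive chunks start (chunk_s - overlap_s) seconds apart, so each
    chunk shares overlap_s seconds with its neighbor.
    """
    chunk = int(chunk_s * sr)
    hop = chunk - int(overlap_s * sr)
    chunks = []
    for start in range(0, wav.shape[-1], hop):
        chunks.append(wav[..., start:start + chunk])
        if start + chunk >= wav.shape[-1]:
            break
    return chunks
```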
Finally, I need help with the anchors; they don't seem to be working properly, or I don't understand the scale (is it 1:1 with seconds?).
Hey, thanks for sharing that @Blaizzy. It turns out I had reranking_candidates set to 8, which was far too much. With it set to 1 and chunking implemented with 10s windows (no overlap), I get these values:
| Clip length, s | Processing time, s | Processing time, mins |
|---|---|---|
| 132.98 | 501.345 | 8m 21.3s |
| 237.03 | 389.048 | 6m 29.0s |
Only input chunking ofc; I don't have access to the ODE. Very variable (I guess it depends on whatever background tasks the MacBook is busy with at the time), but not a million miles from yours. I'll be playing with your libraries next!
Awesome, it's looking much better!
My implementation is purely audio-to-audio. The reranker, vision, and Perception Encoder are missing at the moment.
On MLX-Audio, performance is constant for the most part, unless you have really low battery or less unified RAM available.
Let me know how it goes :)