Segment Anything Model for Audio [Blog] [Paper] [Demo]
SAM-Audio is a foundation model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.
SAM-Audio and the Judge model crucially rely on Perception-Encoder Audio-Visual (PE-AV), which you can read more about here.
Setup
Requirements:
- Python >= 3.11
- CUDA-compatible GPU (recommended)
Install dependencies:
pip install .
Usage
⚠️ Before using SAM Audio, please request access to the checkpoints on the SAM Audio Hugging Face repo. Once accepted, you need to be authenticated to download the checkpoints, e.g. by running hf auth login after generating an access token.
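If you prefer to authenticate from Python instead of the CLI, the huggingface_hub login helper does the same thing. A minimal sketch, assuming huggingface_hub is installed and you have already generated an access token:

from huggingface_hub import login

# Prompts for (or accepts) your Hugging Face access token and caches it locally,
# so subsequent from_pretrained calls can download the gated checkpoints.
login()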
Basic Text Prompting
from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio
import torch
model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()
file = "<audio file>" # audio file path or torch tensor
description = "<description>"
batch = processor(
    audios=[file],
    descriptions=[description],
).to("cuda")

with torch.inference_mode():
    # NOTE: `predict_spans` and `reranking_candidates` have a large impact on performance.
    # Setting `predict_spans=True` and `reranking_candidates=8` will give you better results at the cost of
    # latency and memory. See the "Span Prediction" section below for more details.
    result = model.separate(batch, predict_spans=False, reranking_candidates=1)
# Save separated audio
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate) # The isolated sound
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate) # Everything else
Prompting Methods
SAM-Audio supports three types of prompts:
- Text Prompting: Describe the sound you want to isolate using natural language. To match training, please use lowercase noun-phrase/verb-phrase (NP/VP) format for text (for example, instead of "Thunder can be heard in the background" use "thunder").
  processor(audios=[audio], descriptions=["man speaking"])
- Visual Prompting: Use video frames and masks to isolate sounds associated with visual objects.
  processor(audios=[video], descriptions=[""], masked_videos=processor.mask_videos([frames], [mask]))
- Span Prompting: Specify time ranges where the target sound occurs (a fuller sketch follows this list).
  processor(audios=[audio], descriptions=["car honking"], anchors=[[["+", 6.3, 7.0]]])
See the examples directory for more detailed examples.
Span Prediction (Optional for Text Prompting)
We also provide support for automatically predicting the spans based on the text description, which is especially helpful for separating non-ambience sound events. You can enable this by adding predict_spans=True in your call to separate:

with torch.inference_mode():
    outputs = model.separate(batch, predict_spans=True)

# To further improve performance (at the expense of latency), you can add candidate re-ranking
with torch.inference_mode():
    outputs = model.separate(batch, predict_spans=True, reranking_candidates=8)
Re-Ranking
We provide the following models to assess the quality of the separated audio:
- CLAP: measures the similarity between the target audio and text description
- Judge: measures the overall separation quality across 3 axes: precision, recall, and faithfulness (see the model card for more details)
- ImageBind: for visual prompting, we measure the ImageBind embedding similarity between the separated audio and the masked input video
We provide support for generating multiple candidates (by setting reranking_candidates=<k> in your call to separate), which will generate k audios and choose the best one based on the ranking models mentioned above.
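Conceptually, re-ranking just scores each candidate and keeps the best one. The sketch below is illustrative only, not the library's internal API; rerank and score_fn are hypothetical names:

import torch

def rerank(candidates: list[torch.Tensor], score_fn) -> torch.Tensor:
    # score_fn maps a candidate waveform to a scalar quality score, e.g. a
    # CLAP-style text-audio similarity for text prompts or an ImageBind
    # audio-visual similarity for visual prompts.
    scores = torch.tensor([score_fn(c) for c in candidates])
    return candidates[int(scores.argmax())]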
Models
Below is a table of each of the models we released along with their overall subjective evaluation scores.
| Model | General SFX | Speech | Speaker | Music | Instr(wild) | Instr(pro) |
|---|---|---|---|---|---|---|
| sam-audio-small | 3.62 | 3.99 | 3.12 | 4.11 | 3.56 | 4.24 |
| sam-audio-base | 3.28 | 4.25 | 3.57 | 3.87 | 3.66 | 4.27 |
| sam-audio-large | 3.50 | 4.03 | 3.60 | 4.22 | 3.66 | 4.49 |
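To load a different size, swap the checkpoint name passed to from_pretrained. Only facebook/sam-audio-large appears explicitly above, so the smaller repo ID below is assumed by analogy:

model = SAMAudio.from_pretrained("facebook/sam-audio-small")  # assumed repo ID, by analogy with -large
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-small")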
We additionally release another variant (in each size) that is better specifically on correctness of the target sound as well as on visual prompting.
Evaluation
See the eval directory for instructions and scripts to reproduce results from the paper.
Contributing
See contributing and code of conduct for more information.
License
This project is licensed under the SAM License - see the LICENSE file for details.
Citing SAM Audio
If you use SAM Audio in your research, please use the following BibTeX entry:
@article{shi2025samaudio,
  title={SAM Audio: Segment Anything in Audio},
  author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Doll{\'a}r and Wei-Ning Hsu and Ann Lee},
  year={2025},
  url={https://arxiv.org/abs/2512.18099}
}
