api-inference-community

Audio-to-regions widget and community API for pyannote.audio

Open hbredin opened this issue 3 years ago • 8 comments

Opening an issue as per @osanseviero's suggestion on Twitter. Issue imported from https://github.com/pyannote/pyannote-audio/issues/835


pyannote.audio 2.0 will bring a unified pipeline API:

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
output = pipeline("audio.wav")   # or pipeline({"waveform": np.ndarray, "sample_rate": int})

where output is a pyannote.core.Annotation instance.
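For context, a widget would need that `Annotation` flattened into something JSON-serializable. The sketch below is plain Python (not pyannote itself); the `(start, end, label)` tuples mimicking `Annotation.itertracks(yield_label=True)` and the output field names are assumptions, not a confirmed API:

```python
# Hypothetical diarization output: (start, end, speaker) tuples,
# shaped like what Annotation.itertracks(yield_label=True) yields.
tracks = [
    (0.0, 3.2, "SPEAKER_00"),
    (3.2, 7.5, "SPEAKER_01"),
    (7.5, 9.0, "SPEAKER_00"),
]

def to_segments(tracks):
    """Flatten (start, end, label) tuples into widget-friendly dicts."""
    return [
        {"class": label, "start": start, "end": end}
        for start, end, label in tracks
    ]

segments = to_segments(tracks)
print(segments[0])  # {'class': 'SPEAKER_00', 'start': 0.0, 'end': 3.2}
```

A widget could then render each dict as a colored region on the waveform.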

I just created a Space that allows testing a bunch of pipelines shared on the Hugging Face Hub, but it would be nice if those were testable directly in their own model card.

My understanding is that two things need to happen

hbredin avatar Feb 01 '22 14:02 hbredin

This is a cool proposal! On Twitter I did mention we could model this task as audio-to-audio, and it would already work by outputting multiple audios. But having a nice custom widget, more specific to the task, would be very cool!

cc @mishig25 @julien-c WDYT?

osanseviero avatar Feb 01 '22 14:02 osanseviero

This is very cool!

Definitely a good target for audio-to-audio as a starter (no widget needed). audio-segmentation seems like a good fit for what you're trying to do (it does not exist yet, but it should cover multiple use cases).

Narsil avatar Feb 01 '22 16:02 Narsil

audio-token-classification? 😱 audio-to-structured?

Not sure what the best new task type is to keep some generality.

But yeah, could be cool to have it.

julien-c avatar Feb 01 '22 17:02 julien-c

audio-token-classification? 😱

You're actually pretty spot on IMO, since token-classification is really text-segmentation, I think. It's also aligned with image-segmentation.

Which basically should be a list of "objects" found in the text/audio/image, plus some descriptor of "where" those objects are in the original input. Audio and text are 1D and almost never contain non-contiguous objects, so start + stop are enough, IMO. Images are 2D, so a full mask is basically required even for contiguous objects (boxes are also a simplification).
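The 1D-vs-2D distinction above can be sketched with two tiny schemas; the type and field names here are illustrative assumptions, not any existing Hub schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span1D:
    """Audio/text segment: objects are contiguous, so start + stop suffice."""
    label: str
    start: float
    stop: float

@dataclass
class Mask2D:
    """Image segment: possibly non-contiguous, so a full boolean pixel mask
    is needed (a bounding box would be a simplification)."""
    label: str
    mask: List[List[bool]]

speech = Span1D(label="SPEAKER_00", start=0.0, stop=3.2)
cat = Mask2D(label="cat", mask=[[False, True], [True, True]])
```

The same "label + where" shape covers all three modalities; only the "where" descriptor changes with dimensionality.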

Narsil avatar Feb 02 '22 09:02 Narsil

Btw, audio-segmentation (speech-segmentation) existed and we deprecated it in favor of audio-to-audio, no @Narsil?

osanseviero avatar Feb 08 '22 16:02 osanseviero

speech-segmentation was never deprecated, but it also never had widget support afaik.

Its output is not audio, so I don't see how audio-to-audio could be used:

https://github.com/huggingface/huggingface_hub/blob/main/api-inference-community/docker_images/superb/app/pipelines/speech_segmentation.py

Narsil avatar Feb 14 '22 09:02 Narsil

Nice!

Would you recommend we update this PR to speech-segmentation then?

hbredin avatar Feb 14 '22 11:02 hbredin

I think we can keep the PR as is, merge it when ready, so things are functional (even though less than perfect).

And when support for audio-segmentation is ready (or even before), we can simply create a new PR.

Narsil avatar Feb 15 '22 16:02 Narsil