api-inference-community

Audio-to-regions widget and community API for pyannote.audio

Open hbredin opened this issue 3 years ago • 8 comments

Opening an issue as per @osanseviero's suggestion on Twitter. Issue imported from https://github.com/pyannote/pyannote-audio/issues/835


pyannote.audio 2.0 will bring a unified pipeline API:

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
output = pipeline("audio.wav")   # or pipeline({"waveform": np.ndarray, "sample_rate": int})

where output is a pyannote.core.Annotation instance.
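For context, a widget would need that `Annotation` flattened into something JSON-serializable. The sketch below is plain Python (not pyannote itself); the `(start, end, label)` tuples mimicking `Annotation.itertracks(yield_label=True)` and the output field names are assumptions, not a confirmed API:

```python
# Hypothetical diarization output: (start, end, speaker) tuples,
# shaped like what Annotation.itertracks(yield_label=True) yields.
tracks = [
    (0.0, 3.2, "SPEAKER_00"),
    (3.2, 7.5, "SPEAKER_01"),
    (7.5, 9.0, "SPEAKER_00"),
]

def to_segments(tracks):
    """Flatten (start, end, label) tuples into widget-friendly dicts."""
    return [
        {"class": label, "start": start, "end": end}
        for start, end, label in tracks
    ]

segments = to_segments(tracks)
print(segments[0])  # {'class': 'SPEAKER_00', 'start': 0.0, 'end': 3.2}
```

A widget could then render each dict as a colored region on the waveform.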

I just created a Space that allows testing a bunch of pipelines shared on the Hugging Face Hub, but it would be nice if those were testable directly in their own model card.

My understanding is that two things need to happen

hbredin avatar Feb 01 '22 14:02 hbredin

This is a cool proposal! On Twitter I did mention we could model this task as audio-to-audio, and it would already work by outputting multiple audios. But having a nice custom widget, more specific to the task, would be very cool!

cc @mishig25 @julien-c WDYT?

osanseviero avatar Feb 01 '22 14:02 osanseviero

This is very cool!

Definitely a good target for audio-to-audio as a starter (no widget needed). audio-segmentation seems like a good fit for what you're trying to do (it does not exist yet, but it should cover multiple use cases).

Narsil avatar Feb 01 '22 16:02 Narsil

audio-token-classification? 😱 audio-to-structured?

Not sure what the best new task type is to keep some generality.

But yeah, could be cool to have it.

julien-c avatar Feb 01 '22 17:02 julien-c

audio-token-classification? 😱

You're actually pretty spot on IMO, since token-classification is really text-segmentation, I think. It's also aligned with image-segmentation.

Which basically should be a list of "objects" found in the text/audio/image, plus some descriptor of "where" those objects are in the original input. Audio and text are 1D and almost never contain non-contiguous objects, so start + stop are enough, IMO. Images are 2D, so a full mask is basically required even for contiguous objects (boxes are also a simplification).
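The 1D-vs-2D distinction above can be sketched with two tiny schemas; the type and field names here are illustrative assumptions, not any existing Hub schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span1D:
    """Audio/text segment: objects are contiguous, so start + stop suffice."""
    label: str
    start: float
    stop: float

@dataclass
class Mask2D:
    """Image segment: possibly non-contiguous, so a full boolean pixel mask
    is needed (a bounding box would be a simplification)."""
    label: str
    mask: List[List[bool]]

speech = Span1D(label="SPEAKER_00", start=0.0, stop=3.2)
cat = Mask2D(label="cat", mask=[[False, True], [True, True]])
```

The same "label + where" shape covers all three modalities; only the "where" descriptor changes with dimensionality.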

Narsil avatar Feb 02 '22 09:02 Narsil

Btw, audio-segmentation (speech-segmentation) existed and we deprecated it in favor of audio-to-audio, no @Narsil?

osanseviero avatar Feb 08 '22 16:02 osanseviero

speech-segmentation was never deprecated, but it also never had widget support afaik.

Its output is not audio, so I don't see how audio-to-audio could be used:

https://github.com/huggingface/huggingface_hub/blob/main/api-inference-community/docker_images/superb/app/pipelines/speech_segmentation.py

Narsil avatar Feb 14 '22 09:02 Narsil

Nice!

Would you recommend we update this PR to speech-segmentation then?

hbredin avatar Feb 14 '22 11:02 hbredin

I think we can keep the PR as is, merge it when ready, so things are functional (even though less than perfect).

And when support for audio-segmentation is ready (or even before), we can simply create a new PR.

Narsil avatar Feb 15 '22 16:02 Narsil