
Support for `audio` data based projects

Open imflash217 opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? Please describe.

I primarily work with audio data, and it is particularly challenging to visualize the different stages of audio data, like waveforms or spectrograms. It becomes even more challenging if the data is multi-channel or very long audio. Currently I have to use a Jupyter notebook to display and play my audio, and the context switching is very tiring. It is also hard to exactly relate the audio waveform at a particular timestamp to its corresponding spectrogram. This becomes worse if we are working on multimodal models like Automatic Speech Recognition (ASR) systems, which require text visualization alongside the corresponding audio.

Describe the solution you'd like

I am very impressed with the video support provided by the rerun API. I would like to see similar first-class support for audio-based projects too, with the following features:

  1. [important] play my audio as time-series data
  2. [important] plot and visualize the changing spectrograms as the audio plays, to precisely pinpoint a timestamp and its corresponding extracted features. Support for various power-spectrums like MFCC would be extremely helpful.
  3. [important] ability to play individual channels separately or multiple channels combined. This is essential for tasks such as source separation and denoising.
  4. [important] for various tasks like Automatic Speech Recognition (ASR), we would want to see the correlation between a timestamp window and the respective text produced by the ASR model. This should work across waveforms, power-spectrums, and ASR text output so we can comprehend everything at once.
  5. [nice-to-have] ability to apply various types of windows (e.g. Hanning, Hamming) and filters (e.g. low-pass, high-pass, band-pass) to an audio clip or a batch, to experiment quickly on-the-fly.
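As a concrete illustration of item 4, the ASR alignment problem mostly reduces to converting word-level timestamps into sample index ranges so text can be lined up with the waveform. A minimal sketch (not rerun API; names and the sample rate are illustrative):

```python
# Sketch: map hypothetical word-level ASR timestamps (seconds) to sample
# index ranges, so each word can be overlaid on the waveform it came from.
SAMPLE_RATE = 16_000  # assumed sample rate of the audio


def words_to_sample_ranges(words, sample_rate=SAMPLE_RATE):
    """words: list of (text, start_sec, end_sec) tuples from an ASR model."""
    return [
        (text, int(start * sample_rate), int(end * sample_rate))
        for text, start, end in words
    ]


ranges = words_to_sample_ranges([("hello", 0.0, 0.5), ("world", 0.5, 1.0)])
```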

Describe alternatives you've considered

As far as I know, there is no comprehensive tool that supports these features yet. I have to use a Jupyter notebook and librosa for most of my experimentation, and the biggest challenge is making sure that a timestamp in the audio is exactly the same as in the power-spectrums.
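For reference, the waveform/spectrogram timestamp alignment comes down to the STFT framing parameters. A small numpy sketch, assuming no centre padding (i.e. `center=False` framing; librosa's `frames_to_time` handles the centred case):

```python
import numpy as np

# Sketch: derive the timestamp of each spectrogram frame from the STFT
# parameters, so the spectrogram time axis matches the waveform exactly.
sr = 22_050        # sample rate shared by waveform and spectrogram
n_fft = 2048       # analysis window length in samples
hop_length = 512   # frame step in samples

n_frames = 100     # e.g. spectrogram.shape[1]
# centre of each analysis window, in seconds (assuming center=False)
frame_times = (np.arange(n_frames) * hop_length + n_fft / 2) / sr
```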

Additional context

imflash217 avatar Jul 28 '23 02:07 imflash217

One fundamental thing we need to implement before we start working on this is log events with a duration. Currently each log event is associated with a single point in time (a video is just a set of frames, each logged individually). This won't work for audio: you'd like to log e.g. a two-second sound in one log call. We will also need this functionality when implementing proper video codecs.
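To make the idea concrete, a hypothetical event-with-duration could look like the sketch below. This is not rerun's actual data model, just an illustration of what "a log event that covers a time range" might carry:

```python
from dataclasses import dataclass


# Hypothetical sketch (not rerun's real event structure): a log event
# spanning a time range instead of a single instant, as would be needed
# to log e.g. a two-second audio clip in one call.
@dataclass
class SpanningEvent:
    start_ns: int     # timeline position where the clip begins
    duration_ns: int  # how much of the timeline the clip covers
    samples: bytes    # e.g. raw PCM audio for the whole span

    @property
    def end_ns(self) -> int:
        return self.start_ns + self.duration_ns


clip = SpanningEvent(start_ns=0, duration_ns=2_000_000_000, samples=b"")
```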

emilk avatar Aug 08 '23 13:08 emilk

I'm very interested in logging and labeling realtime audio when tracing Talon with Rerun!

I'll note that Talon's audio is realtime/continuous/infinite, but it might make more sense efficiency-wise to log it in larger chunks than in, say, 30ms intervals. If we did that, I would want an easy way to backdate a longer chunk of streamed audio to the actual timestep/frame in which it originated during logging.
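The backdating arithmetic itself is trivial; the point is that the SDK would need to accept it. A sketch with illustrative names:

```python
# Sketch: "backdating" a buffered chunk of streamed audio. If a chunk of
# n_samples is logged at wall-clock time logged_at_s, its true start time
# is logged_at_s minus the chunk's duration. Names are illustrative only.
def chunk_start_time(logged_at_s: float, n_samples: int, sample_rate: int) -> float:
    return logged_at_s - n_samples / sample_rate


# a 300 ms chunk (4800 samples at 16 kHz) logged at t = 10.0 s
start = chunk_start_time(10.0, 4800, 16_000)
```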

I think plotting and visualizing audio features is very useful, but I don't want Rerun to calculate the features (spectrogram, windowing, filters, etc) for me. Those are labels / data processing I can ship with the audio signal and they're in my domain of expertise to make sure the data I'm sending you to render is exactly what I want.

Audio Timeline Space

I think I want a kind of "audio timeline" space, which looks sort of like an Audacity track, maybe supports several audio channels (vertically stacked), and maybe supports other views of the same audio like spectrograms (which I'm happy to embed in the trace myself).

  • It would have a scrubber synchronized with the global rerun timeline.
  • You can single-click in the audio track to change the global timestep.
  • You can click+drag or click+shift-click to select a region of audio in the audio timeline.
    • If you "play" the audio during this time, playback stops when you get to the end of the selected region (can have a clickable option to loop the region as well).
    • It would be nice to be able to export the selected audio to a file for further debugging (e.g. wav or flac).
  • You can mute individual audio tracks. I'd probably want audio to be muted by default so I'm not blasting audio in public, and because I might have a number of audio streams that would sound terrible if you played them overlapping. (You could add a blueprint setting to unmute individual tracks if users wanted to change this default?)
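The region-selection and loop-playback behaviour above boils down to mapping a time region to a sample slice. A purely illustrative sketch (plain Python lists; a real implementation would use arrays):

```python
# Sketch: convert a selected time region into a sample slice, and loop it.
# Illustrates the select/play/loop behaviour described in the bullets above.
def region_samples(start_s: float, end_s: float, sample_rate: int):
    return int(start_s * sample_rate), int(end_s * sample_rate)


def looped(audio, start_s, end_s, sample_rate, repeats=2):
    a, b = region_samples(start_s, end_s, sample_rate)
    return audio[a:b] * repeats  # list repetition; use np.tile for arrays


lo, hi = region_samples(0.5, 1.0, 8000)
```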

Annotations

  • You can attach labels to either a zero-width section of the audio timeline, or to a span of the audio timeline.
  • In some cases, annotations could feasibly be their own space or their own track within the audio space. Timeline annotations might make enough sense outside of audio to consider the general use case for them.
  • Some annotations are audio track specific, and some may be more global.

Here's an extreme example of what duration annotations might look like in Audacity: (screenshot)

Spatial audio

I think about spatial audio as well, e.g. several audio tracks with distinct 3d positions that can change over time. I wouldn't worry about playing the audio back spatially at first, but being able to select an audio track and see it highlighted + move around in the 3d scene might be really useful.

lunixbochs avatar Oct 06 '23 20:10 lunixbochs

This looks like a nice, simple audio library for rust:

  • https://github.com/mrDIMAS/tinyaudio

emilk avatar Oct 19 '23 11:10 emilk

Very interested in audio support as well. I would also love to be able to visualize, alongside the audio, 2D matrices where each row covers a fixed time window (it may be a probability vector over an alphabet, a spectrogram entry, or similar).

+1 to text as well.

CatalinVoss avatar Dec 18 '23 07:12 CatalinVoss

+1 for spectrogram. I'm hoping to use a spectrogram to visualize streaming (unbounded / realtime) brain signals, not audio, but I think the solution will work equally well for either.

I don't think rerun should be responsible for doing the spectral transformation. This is too personal and domain specific. (Pre-Filtering? Windowing? Log-transform? FFT or Wavelets? Multi-taper? Frequency resolution? Window duration? Window step size?). It should be up to the user to do their spectral transformation then log their spectrum / spectra.

  • The SDK client app logs a tensor (e.g., "channels" x "frequencies") for a single frame, or a batch of frames ("times" x "channels" x "frequencies").
  • Somewhere under the hood, the history of tensors get concatenated along the "time" axis.
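The shape convention in these two bullets can be sketched in numpy. The shapes follow the comment above; everything else is illustrative:

```python
import numpy as np

# Sketch: the client logs either a single (channels, frequencies) frame
# or a (times, channels, frequencies) batch; the viewer-side history is
# the concatenation of everything along the time axis.
channels, freqs = 4, 128
frame = np.random.rand(channels, freqs)        # one frame
batch = np.random.rand(10, channels, freqs)    # a batch of 10 frames

# promote the single frame to a length-1 batch, then concatenate history
history = np.concatenate([frame[np.newaxis, ...], batch], axis=0)
```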

The Space view should be something like a mix of the Tensor view and TimeSeries view:

  • representation is an image, like the generic tensor view
  • x-dimension is always "time" and can be scrolled and scrubbed over, just like a TimeSeries view.
  • user can choose whether the y-axis is "channel" or "frequency" and the index into each of the non-rendered dimensions, just like the generic tensor view
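The view described in these bullets is essentially a slicing choice over the accumulated tensor. A numpy sketch of the two y-axis options, with illustrative indices:

```python
import numpy as np

# Sketch: render the (times, channels, frequencies) history as a 2D image
# with time on the x-axis and a user-chosen y-axis, per the bullets above.
history = np.random.rand(11, 4, 128)  # (times, channels, frequencies)

# y-axis = frequency, with the non-rendered channel dimension fixed at 2:
img_freq = history[:, 2, :].T   # shape (frequencies, times)

# y-axis = channel, with the non-rendered frequency bin fixed at 40:
img_chan = history[:, :, 40].T  # shape (channels, times)
```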

Until something like this is implemented, I might try plotting a scalar for every time x frequency, for only a single channel, and then coloring each scalar independently, probably with a SeriesPoint and square markers.
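For that interim workaround, each scalar needs a colour derived from its power. A minimal sketch of the mapping (the grayscale colormap here is purely illustrative; any colormap would do):

```python
import numpy as np


# Sketch: colour one scalar per frequency bin by its power, so a stack of
# independently coloured series points approximates a spectrogram column.
def power_to_rgb(power: float, p_min: float, p_max: float):
    t = np.clip((power - p_min) / (p_max - p_min), 0.0, 1.0)
    v = int(round(t * 255))
    return (v, v, v)  # grayscale; swap in a real colormap as desired
```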

cboulay avatar May 02 '24 18:05 cboulay

For decoding audio (that is not simple PCM), we should be able to use ffmpeg via the CLI, like we do for video (see https://github.com/rerun-io/rerun/pull/7962/)
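For reference, such an invocation could decode any compressed audio file to raw float PCM on stdout. A sketch that only builds the command line (the flags are standard ffmpeg; the function name and defaults are illustrative):

```python
# Sketch: build an ffmpeg CLI invocation that decodes a compressed audio
# file to raw 32-bit float little-endian PCM on stdout, similar in spirit
# to how rerun shells out to ffmpeg for video decoding.
def ffmpeg_decode_cmd(path: str, sample_rate: int = 48_000, channels: int = 1):
    return [
        "ffmpeg",
        "-i", path,           # input file (any format ffmpeg can read)
        "-f", "f32le",        # output raw little-endian 32-bit floats
        "-ac", str(channels), # downmix/upmix to this channel count
        "-ar", str(sample_rate),
        "-",                  # write decoded samples to stdout
    ]


cmd = ffmpeg_decode_cmd("clip.ogg")
```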

emilk avatar Oct 31 '24 18:10 emilk