audiobookshelf WIP: Adding Transcription/Subtitle Viewing Support to the Web Player (VTT)

I have begun work on adding transcription support to the Web Player. I've used Whisper to generate transcriptions for some audiobooks and podcasts. Many tools based on Whisper support exports in VTT and SRT formats. For this pull request, I'm only supporting VTT as it is natively supported by browsers. Support for SRT can be added in a future pull request.

How does it work?

A new endpoint, api/items/:id/file/:fileid/transcript, has been created on the backend. This endpoint attempts to return a transcription for each audio track. For instance, if there's an audio file named adventuresherlockholmes_01_doyle_64kb.mp3, this endpoint will attempt to return the file adventuresherlockholmes_01_doyle_64kb.vtt.
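The filename mapping described above can be sketched as a small helper. This is only an illustration, not the code in this PR: the function name `transcriptFilenameFor` is hypothetical, and it assumes it receives a bare filename rather than a full path.

```javascript
// Hypothetical helper (not from the PR): derive the sidecar transcript
// filename by swapping the audio file's extension for .vtt.
function transcriptFilenameFor(audioFilename) {
  const dot = audioFilename.lastIndexOf('.')
  // No extension (or a leading-dot name): keep the whole name and append .vtt
  const stem = dot > 0 ? audioFilename.slice(0, dot) : audioFilename
  return `${stem}.vtt`
}
```

For the example above, `transcriptFilenameFor('adventuresherlockholmes_01_doyle_64kb.mp3')` yields `adventuresherlockholmes_01_doyle_64kb.vtt`.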

On the frontend, when an audio file is set as the src of the <audio> element, a <track> element is created and attached to that <audio>. The src of the <track> points to the endpoint described above.
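A minimal sketch of that wiring, under stated assumptions: the function names, the leading slash in the URL, and the choice of `kind = 'captions'` are illustrative and may differ from what the PR actually does. The `doc` parameter exists only so the function can be exercised outside a browser.

```javascript
// Assumed URL shape, built from the endpoint pattern described above.
function transcriptUrl(libraryItemId, fileId) {
  return `/api/items/${encodeURIComponent(libraryItemId)}/file/${encodeURIComponent(fileId)}/transcript`
}

// Hypothetical helper: replace any existing <track> children of the <audio>
// element with a single track pointing at the transcript endpoint.
function attachTranscriptTrack(audioEl, libraryItemId, fileId, doc = globalThis.document) {
  // Drop tracks left over from the previously loaded audio file
  for (const old of Array.from(audioEl.querySelectorAll('track'))) old.remove()

  const track = doc.createElement('track')
  track.kind = 'captions'
  track.src = transcriptUrl(libraryItemId, fileId)
  track.default = true
  audioEl.appendChild(track)
  return track
}
```

Removing stale tracks before adding the new one is one way to keep the transcription in step with the current audio file when the player moves to the next track.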

What does this PR support?

Show/Hide transcription block
Highlighting the current transcription line
Clicking on a line to seek the player to that time
Changing transcriptions when the audio file changes (supports audiobooks and podcasts)
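
As a rough illustration of the highlighting and click-to-seek behaviours listed above, the active line can be looked up from the player's current time. This is a sketch, not the PR's implementation: `findActiveCueIndex` and `seekToCue` are hypothetical names, and the cues are assumed to be sorted by start time and to expose `startTime` and `endTime` in seconds, as browser VTTCue objects do.

```javascript
// Return the index of the cue that contains currentTime, or -1 if none does.
// Assumes cues are sorted by startTime and do not overlap.
function findActiveCueIndex(cues, currentTime) {
  let lo = 0
  let hi = cues.length - 1
  let candidate = -1
  // Binary search for the last cue that starts at or before currentTime
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (cues[mid].startTime <= currentTime) {
      candidate = mid
      lo = mid + 1
    } else {
      hi = mid - 1
    }
  }
  // The candidate is only active while currentTime is before its end
  if (candidate !== -1 && currentTime < cues[candidate].endTime) return candidate
  return -1
}

// Clicking a transcription line seeks the player to the start of that cue.
function seekToCue(audioEl, cue) {
  audioEl.currentTime = cue.startTime
}
```

Because the lookup depends only on the current time, it can be run at any moment, for example right after a new transcription has loaded.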

Demo

https://github.com/advplyr/audiobookshelf/assets/814828/3bd43148-6adc-48b7-8417-bc068be14c7b

What is missing for the scope of this PR

Hiding the "Show transcription" button when the transcription is not available for the audio file

Known issues

When playing an audio file with a transcription, if you close the web player and reopen it, the transcription block is not displayed, even though the transcription is still available. You have to click the "Show transcription" button to display the block again. I think this is related to the MediaPlayerContainer.vue component not reloading the TranscriptionUi component.

https://github.com/advplyr/audiobookshelf/assets/814828/c1aaef74-2f70-45a3-b0ce-b04053bf3bc5

When playing an audio file with transcription, if you change the audio file, the active transcription line for the new audio file focuses on the first line. The focus shifts to the correct line only when the next line change occurs.

I also don't like the placement. I was thinking of putting it in a side panel or a floating, movable modal. However, a side panel raises concerns about taking up too much space, especially if the user has a narrow display, and a floating, movable modal adds more complexity to the JavaScript and CSS. I will test both approaches and try to provide updates here.

Sidebar like Apple Music:

Floating transcription window:

May 28 '24 20:05 mfcar

Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in these videos: https://www.youtube.com/watch?v=jBi-OId37Uw https://www.youtube.com/watch?v=jzPekGpC4uw

Jun 02 '24 17:06 ashwinm4friends

Snipd uses word-level timestamps. While such subs are easy to generate, the only sane format afaik is SSA/ASS (SRT can blow up into megabytes, which is insane), and that is not natively supported by browsers. No idea if it's possible with VTT (I'm guessing that Snipd just uses raw JSON or something, since the subs can be generated on the fly).

Jun 02 '24 20:06 barolo

Take a look at WebVTT, which supports something similar to "karaoke style" using the :past and :future pseudo-classes. However, the VTT files need to be adapted for this as well, and I think it's uncommon to get a VTT file with this information. I was using Whisper to generate transcriptions, but I'm not sure if it can generate word-by-word transcriptions.
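
For illustration, WebVTT expresses karaoke-style timing with inline timestamp tags inside the cue text, for example `Hello <00:00:01.000>word <00:00:02.500>level`. A transcription block that renders its own lines would need to split such a payload itself. The sketch below shows one way to do that; `splitTimedWords` is a hypothetical name, and it assumes the payload contains no other markup.

```javascript
// Matches WebVTT inline timestamp tags: <hh:mm:ss.ttt> or <mm:ss.ttt>
const TIMESTAMP_TAG = /<(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})>/g

// Split a cue payload into segments, each with the time at which it becomes active.
// Text before the first tag starts at the cue's own start time.
function splitTimedWords(payload, cueStartTime) {
  const segments = []
  let time = cueStartTime
  let lastIndex = 0
  for (const m of payload.matchAll(TIMESTAMP_TAG)) {
    const text = payload.slice(lastIndex, m.index).trim()
    if (text) segments.push({ startTime: time, text })
    // The hours group is optional and defaults to 0 when absent
    const [, h = '0', min, s, ms] = m
    time = Number(h) * 3600 + Number(min) * 60 + Number(s) + Number(ms) / 1000
    lastIndex = m.index + m[0].length
  }
  const tail = payload.slice(lastIndex).trim()
  if (tail) segments.push({ startTime: time, text: tail })
  return segments
}
```

Each returned segment could then be highlighted once the player's current time passes its `startTime`.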

As for SSA/ASS and SRT support, I was checking what the best approach is. I was considering converting them to VTT to keep the implementation consistent with how we show the transcriptions, but I'm not sure yet if this is the best way.
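
For the SRT case, a conversion to VTT can be quite small, since the two formats mainly differ in the `WEBVTT` header and the decimal separator in timestamps. The following is a minimal sketch under those assumptions, with a hypothetical function name; it does not handle SSA/ASS, styling, or malformed input.

```javascript
// Convert SRT text to WebVTT: normalise line endings, switch the millisecond
// separator in timing lines from ',' to '.', and prepend the WEBVTT header.
function srtToVtt(srt) {
  const body = srt
    .replace(/^\uFEFF/, '') // strip a leading byte order mark, if any
    .replace(/\r\n?/g, '\n')
    .replace(
      /^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})/gm,
      '$1.$2 --> $3.$4'
    )
    .trim()
  return `WEBVTT\n\n${body}\n`
}
```

The numeric SRT counters are left in place, since WebVTT accepts them as cue identifiers.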

Jun 02 '24 22:06 mfcar

@mfcar I've used https://github.com/jianfch/stable-ts to generate ASS/SSA karaoke-style captions with a custom style for my podcasts/books. I don't remember if VTT is one of the options. Whisper.cpp can spit out word-level output too, but you have to process it with a script to get a valid subs file.

Jun 02 '24 22:06 barolo

In the past, I have used stable-ts to create VTT files. I generated word-level timestamps with Whisper’s base.en model.

Jun 03 '24 02:06 ashwinm4friends
