WIP: Adding Transcription/Subtitle Viewing Support to the Web Player (VTT)

Open mfcar opened this issue 9 months ago • 7 comments

I have begun work on adding transcription support to the Web Player. I've used Whisper to generate transcriptions for some audiobooks and podcasts. Many tools based on Whisper support exports in VTT and SRT formats. For this pull request, I'm only supporting VTT as it is natively supported by browsers. Support for SRT can be added in a future pull request.

How does it work?

A new endpoint, api/items/:id/file/:fileid/transcript, has been created on the backend. This endpoint attempts to return a transcription for each audio track. For instance, if there's an audio file named adventuresherlockholmes_01_doyle_64kb.mp3, this endpoint will attempt to return the file adventuresherlockholmes_01_doyle_64kb.vtt.
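The filename mapping described above can be sketched as follows. This is only an illustration, not the code from this PR: the helper name `transcriptFilenameFor` is hypothetical, and it assumes the transcript sits next to the audio file with the same base name and a `.vtt` extension.

```typescript
// Hypothetical helper (not taken from the PR): derive the sidecar transcript
// filename by swapping the audio file's extension for ".vtt".
function transcriptFilenameFor(audioFilename: string): string {
  const dot = audioFilename.lastIndexOf(".");
  // dot <= 0 means "no extension" or a dotfile such as ".hidden": append instead.
  const stem = dot > 0 ? audioFilename.slice(0, dot) : audioFilename;
  return stem + ".vtt";
}
```

With the example from the description, `transcriptFilenameFor("adventuresherlockholmes_01_doyle_64kb.mp3")` gives `adventuresherlockholmes_01_doyle_64kb.vtt`.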

On the frontend, when an audio file is set as the source (src) of the <audio> element, a <track> element is created and attached to that <audio>. The src of the <track> points to the endpoint described above.
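A rough sketch of that wiring, under stated assumptions: the function and interface names are hypothetical, the leading slash and the optional `baseUrl` are guesses about how the app resolves API paths, and `kind = "metadata"` with `mode = "hidden"` is one possible choice for a custom transcript UI rather than what this PR necessarily does. The small interfaces stand in for the DOM types so that the real `document` and `<audio>` element can be passed in.

```typescript
// Minimal structural stand-ins for the DOM objects used below.
interface TrackLike { kind: string; src: string; default: boolean; track: { mode: string }; }
interface AudioLike { appendChild(child: any): unknown; }
interface DocLike { createElement(tag: "track"): TrackLike; }

// Hypothetical helper: URL of the transcript endpoint for one audio file.
function transcriptUrl(itemId: string, fileId: string, baseUrl: string = ""): string {
  return baseUrl + "/api/items/" + encodeURIComponent(itemId) +
    "/file/" + encodeURIComponent(fileId) + "/transcript";
}

// Create a <track> pointing at the endpoint and attach it to the <audio> element.
// The caller is expected to remove the track of the previous audio file first.
function attachTranscriptTrack(doc: DocLike, audio: AudioLike, itemId: string, fileId: string): TrackLike {
  const track = doc.createElement("track");
  track.kind = "metadata";          // assumption: cues are rendered by a custom component
  track.src = transcriptUrl(itemId, fileId);
  track.default = true;
  audio.appendChild(track);
  track.track.mode = "hidden";      // cues load and fire events, but the browser draws nothing
  return track;
}
```

In the browser this would be called as `attachTranscriptTrack(document, audioElement, itemId, fileId)` each time the audio source changes.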

What does this PR support?

  • Show/Hide transcription block
  • Highlighting the current transcription line
  • Clicking on a line to seek the player to that time
  • Changing transcriptions when the audio file changes (supports audiobooks and podcasts)
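
Highlighting and click-to-seek both come down to mapping between playback time and cue times. The sketch below is illustrative only: `CueLike`, `activeCueIndex` and `seekToCue` are hypothetical names, and it assumes the cues are sorted by start time and do not overlap. In the browser, the active cue is also available from `TextTrack.activeCues` whenever the `cuechange` event fires.

```typescript
// Shape shared by VTTCue objects and plain test data.
interface CueLike { startTime: number; endTime: number; text: string; }

// Return the index of the cue that contains `time`, or -1 if none does.
// Assumes cues are sorted by startTime and non-overlapping.
function activeCueIndex(cues: readonly CueLike[], time: number): number {
  let lo = 0;
  let hi = cues.length - 1;
  let found = -1;
  // Binary search for the last cue that starts at or before `time`.
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cues[mid].startTime <= time) {
      found = mid;
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  // That cue only counts as active if `time` has not passed its end.
  return found >= 0 && time < cues[found].endTime ? found : -1;
}

// Clicking a transcript line: move the player to the start of that cue.
function seekToCue(player: { currentTime: number }, cue: CueLike): void {
  player.currentTime = cue.startTime;
}
```

Calling `activeCueIndex` once right after a new track loads, instead of waiting for the next `cuechange`, is one possible way to pick the correct line immediately.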

Demo

https://github.com/advplyr/audiobookshelf/assets/814828/3bd43148-6adc-48b7-8417-bc068be14c7b

What is missing for the scope of this PR

  • Hiding the "Show transcription" button when the transcription is not available for the audio file
  • Fixing the known issues listed below

Known issues

  • When playing an audio file with transcription, if you close the web player and reopen it, the transcription block is not displayed, even though the transcription is still available. Clicking the "Show transcription" button displays the block again. I think this is related to the MediaPlayerContainer.vue component not reloading the TranscriptionUi component.

https://github.com/advplyr/audiobookshelf/assets/814828/c1aaef74-2f70-45a3-b0ce-b04053bf3bc5

  • When playing an audio file with transcription, if you change the audio file, the active transcription line for the new audio file focuses on the first line. The focus shifts to the correct line only when the next line change occurs.

Related

  • #1723 - This PR can help implement Whisper support in the Web Player

mfcar avatar May 04 '24 20:05 mfcar

The placement irks me for some reason. I think that this feature demands something like a "Now Playing" screen. But even in this form I really, really want this feature in. My sister is hearing impaired and this would really help her.

barolo avatar May 28 '24 00:05 barolo

I also don't like the placement. I was thinking of putting it in a side panel or a floating, movable modal. However, the side panel raises concerns about taking up too much space, especially if the user has a narrow display, and the floating, movable modal adds more complexity to the JavaScript and CSS. I will run some tests with both behaviours and try to provide updates here.

Sidebar like Apple Music:

image


Floating transcription window:

image

mfcar avatar May 28 '24 20:05 mfcar

Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in these videos: https://www.youtube.com/watch?v=jBi-OId37Uw https://www.youtube.com/watch?v=jzPekGpC4uw

ashwinm4friends avatar Jun 02 '24 17:06 ashwinm4friends

Snipd uses word-level timestamps. While such subs are easy to generate, the only sane format afaik is SSA/ASS (SRT can blow up into megabytes, which is insane), and that is not natively supported by browsers. No idea if it's possible with VTT (I'm guessing that Snipd just uses raw JSON or something, since the subs can be generated on the fly).

barolo avatar Jun 02 '24 20:06 barolo

Take a look at WebVTT, which supports something similar to "karaoke style" using the :past and :future pseudo-classes. However, the VTT files need to be adapted for this as well, and I don't think it's common to get a VTT file with this information. I was using Whisper to generate transcriptions, but I'm not sure if we can generate word-by-word transcriptions.
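
For reference, a karaoke-style WebVTT cue carries inline timestamp tags such as `<00:00:00.500>` between words, and the `::cue(:past)` and `::cue(:future)` selectors style the text before and after the current timestamp. Below is a minimal sketch of building such a cue from word-level timings. The `Word` shape and the function names are hypothetical, and it assumes the words are in chronological order with times in seconds; it is not tied to the output format of any particular Whisper tool.

```typescript
// Hypothetical word-level timing record (times in seconds).
interface Word { text: string; start: number; end: number; }

// Left-pad a non-negative integer with zeros.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as a WebVTT timestamp, hh:mm:ss.ttt.
function formatVttTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const totalSec = Math.floor(totalMs / 1000);
  const h = Math.floor(totalSec / 3600);
  const m = Math.floor(totalSec / 60) % 60;
  const s = totalSec % 60;
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(totalMs % 1000, 3);
}

// Escape characters that have special meaning in WebVTT cue text.
function escapeCueText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build one cue: a timing line, then the words with a timestamp tag before
// every word except the first (which starts with the cue itself).
function karaokeCue(words: readonly Word[]): string {
  if (words.length === 0) return "";
  const timing = formatVttTime(words[0].start) + " --> " + formatVttTime(words[words.length - 1].end);
  const payload = words
    .map((w, i) => (i === 0 ? "" : "<" + formatVttTime(w.start) + ">") + escapeCueText(w.text))
    .join(" ");
  return timing + "\n" + payload;
}
```

For two words, "Hello" (0 to 0.5 s) and "world" (0.5 to 1.2 s), this produces the timing line `00:00:00.000 --> 00:00:01.200` followed by `Hello <00:00:00.500>world`.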

As for SSA/ASS and SRT support, I'm still checking what the best approach is. I was considering converting them to VTT to keep the implementation consistent with how we show the transcriptions, but I'm not sure yet if this is the best way.
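
For SRT, the conversion to VTT is mostly mechanical: add the `WEBVTT` header and switch the millisecond separator on timing lines from a comma to a period. A minimal sketch, assuming well-formed SRT input; the name `srtToVtt` is hypothetical, and SRT-specific markup such as `<font>` tags or positioning codes is left untouched. SSA/ASS would need a real parser and is not covered here.

```typescript
// Convert SubRip (SRT) text to WebVTT. Only timing lines are rewritten;
// the numeric SRT counters are kept, since they are valid VTT cue identifiers.
function srtToVtt(srt: string): string {
  const body = srt
    .replace(/^\uFEFF/, "")      // drop a leading byte-order mark, if any
    .replace(/\r\n?/g, "\n")     // normalise line endings
    .trim()
    // 00:00:01,000 --> 00:00:02,500  becomes  00:00:01.000 --> 00:00:02.500
    .replace(
      /^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})/gm,
      "$1.$2 --> $3.$4"
    );
  return "WEBVTT\n\n" + body + "\n";
}
```

Because the pattern is anchored to timing lines, commas inside the cue text itself are not changed.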

image

mfcar avatar Jun 02 '24 22:06 mfcar

@mfcar I've used https://github.com/jianfch/stable-ts to generate ASS/SSA karaoke-style captions with a custom style for my podcasts/books. I don't remember if VTT is one of the options. Whisper.cpp can spit out word-level output too, but you have to process it with a script to get a valid subtitle file.

barolo avatar Jun 02 '24 22:06 barolo

In the past, I have used stable-ts to create VTT files. I generated word-level timestamps with Whisper’s base.en model.

ashwinm4friends avatar Jun 03 '24 02:06 ashwinm4friends