audiobookshelf
WIP: Adding Transcription/Subtitle Viewing Support to the Web Player (VTT)
I have begun work on adding transcription support to the Web Player. I've used Whisper to generate transcriptions for some audiobooks and podcasts. Many tools based on Whisper support exports in VTT and SRT formats. For this pull request, I'm only supporting VTT as it is natively supported by browsers. Support for SRT can be added in a future pull request.
How does it work?
A new endpoint, api/items/:id/file/:fileid/transcript, has been created on the backend. This endpoint attempts to return a transcription for each audio track. For instance, if there's an audio file named adventuresherlockholmes_01_doyle_64kb.mp3, this endpoint will attempt to return the file adventuresherlockholmes_01_doyle_64kb.vtt.
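The filename mapping described above could be sketched roughly as follows. This is only an illustration, not the code from this PR: `transcriptFilenameFor` is a hypothetical helper, and it assumes the transcript sits next to the audio file and differs only in its extension.

```typescript
// Hypothetical helper (not from the PR): derive the transcript filename
// by swapping the audio file's extension for ".vtt".
function transcriptFilenameFor(audioFilename: string): string {
  const lastSlash = Math.max(
    audioFilename.lastIndexOf("/"),
    audioFilename.lastIndexOf("\\")
  );
  const lastDot = audioFilename.lastIndexOf(".");
  // Only treat the dot as an extension separator if it belongs to the
  // final path segment and is not that segment's first character.
  const hasExtension = lastDot > lastSlash + 1;
  const stem = hasExtension ? audioFilename.slice(0, lastDot) : audioFilename;
  return `${stem}.vtt`;
}
```

For the example above, `transcriptFilenameFor("adventuresherlockholmes_01_doyle_64kb.mp3")` yields `adventuresherlockholmes_01_doyle_64kb.vtt`. Whether the file actually exists still has to be checked by the endpoint.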
On the frontend, when an audio file is set as the source property of the <audio> HTML tag, a <track> is created and linked to that <audio>. The source property for the <track> HTML tag is populated with the link to the aforementioned endpoint.
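A minimal sketch of how the `<track>` attributes might be built from the endpoint pattern above. The names `buildTranscriptTrack` and `TranscriptTrackAttributes`, the leading slash in the URL, and the choice of `kind: "metadata"` are assumptions made for illustration; the PR may do this differently.

```typescript
// Hypothetical shape of the attributes a transcript <track> would carry.
interface TranscriptTrackAttributes {
  kind: "subtitles" | "captions" | "metadata";
  src: string;
  default: boolean;
}

// Builds the attributes for a <track> pointing at the transcript endpoint.
// Assumption: the API is served from the site root, hence the leading slash.
function buildTranscriptTrack(itemId: string, fileId: string): TranscriptTrackAttributes {
  const src =
    `/api/items/${encodeURIComponent(itemId)}` +
    `/file/${encodeURIComponent(fileId)}/transcript`;
  // "metadata" is an assumed choice: the cues are rendered by the page's
  // own UI rather than by the browser's built-in subtitle overlay.
  return { kind: "metadata", src, default: true };
}

// In a browser, the attributes could be applied like this:
//   const track = document.createElement("track");
//   Object.assign(track, buildTranscriptTrack(itemId, fileId));
//   audioElement.appendChild(track);
```

Keeping the URL construction in a plain function makes it easy to rebuild the `<track>` whenever the audio source changes.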
What does this PR support?
- Show/Hide transcription block
- Highlighting the current transcription line
- Clicking on a line to seek the player to that time
- Changing transcriptions when the audio file changes (supports audiobooks and podcasts)
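Highlighting the current line and click-to-seek both come down to mapping between playback time and cues. The sketch below shows one way to do that lookup; the `Cue` interface and both function names are hypothetical, and it assumes cues are sorted by start time and do not overlap.

```typescript
// Minimal cue shape, mirroring the startTime/endTime/text fields of a VTTCue.
interface Cue {
  startTime: number; // seconds
  endTime: number; // seconds
  text: string;
}

// Returns the index of the cue that contains currentTime, or -1 if playback
// is currently between cues. Assumes cues are sorted and non-overlapping.
function findActiveCueIndex(cues: Cue[], currentTime: number): number {
  let lo = 0;
  let hi = cues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = cues[mid];
    if (currentTime < cue.startTime) {
      hi = mid - 1;
    } else if (currentTime >= cue.endTime) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}

// Returns the time to seek to when a transcription line is clicked,
// or null if the index is out of range.
function seekTimeForCue(cues: Cue[], index: number): number | null {
  if (index < 0 || index >= cues.length) return null;
  return cues[index].startTime;
}
```

In a player, `findActiveCueIndex` could be called with the audio element's `currentTime` on each time update, and the result of `seekTimeForCue` assigned back to `currentTime` when a line is clicked.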
Demo
https://github.com/advplyr/audiobookshelf/assets/814828/3bd43148-6adc-48b7-8417-bc068be14c7b
What is missing for the scope of this PR
- Hiding the "Show transcription" button when the transcription is not available for the audio file
- Fixing the known issues listed below
Known issues
- When playing an audio file with a transcription, if you close the web player and reopen it, the transcription block is not displayed, even though the transcription is still available. Clicking the "Show transcription" button displays the block again. I think this is related to the MediaPlayerContainer.vue component not reloading the TranscriptionUi component.
https://github.com/advplyr/audiobookshelf/assets/814828/c1aaef74-2f70-45a3-b0ce-b04053bf3bc5
- When playing an audio file with transcription, if you change the audio file, the active transcription line for the new audio file focuses on the first line. The focus shifts to the correct line only when the next line change occurs.
Related
- #1723 - This PR can help implement Whisper support in the Web Player
The placement irks me for some reason. I think that this feature demands something like a "Now Playing" screen. But even in this form I really, really want this feature in. My sister is hearing impaired and this would really help her.
I also don't like the placement. I was thinking of putting it in a side panel or a floating, movable modal. However, the side panel raises concerns about taking up too much space, especially if the user has a narrow display, and the floating, movable modal adds more complexity to the JavaScript and CSS. I will run some tests with both behaviours and try to provide updates here.
Sidebar like Apple Music:
Floating transcription window:
Great job on the project! For UX improvement, please consider looking into word highlighting in Snipd, as shown in these videos: https://www.youtube.com/watch?v=jBi-OId37Uw https://www.youtube.com/watch?v=jzPekGpC4uw
Snipd uses word-level timestamps. While such subs are easy to generate, the only sane format is SSA/ASS, afaik (SRT can blow up into megabytes, which is insane), and it is not natively supported by browsers. No idea if it's possible with VTT (and I'm guessing that Snipd just uses raw JSON or something, since the subs can be generated on the fly).
Take a look at WebVTT, which supports something similar to "karaoke style" using the :past and :future pseudo-classes. However, the VTT files need to be adapted for this as well, and I think it's not common to get a VTT file with this information. I was using Whisper to generate transcriptions, but I'm not sure if it can generate word-by-word transcriptions.
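For reference, WebVTT marks word timing with inline timestamp tags inside the cue text, and the :past and :future pseudo-classes match the text before and after the current timestamp. Below is a minimal sketch of generating such a cue from word-level timings. The `TimedWord` shape and both function names are hypothetical; the input is assumed to come from whatever tool produced the word timestamps.

```typescript
// Hypothetical input shape: one entry per word with its start time in seconds.
interface TimedWord {
  text: string;
  start: number;
}

// Formats seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatVttTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSeconds = Math.floor(totalMs / 1000);
  const s = totalSeconds % 60;
  const m = Math.floor(totalSeconds / 60) % 60;
  const h = Math.floor(totalSeconds / 3600);
  const pad = (n: number, width: number): string => {
    let out = String(n);
    while (out.length < width) out = "0" + out;
    return out;
  };
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Builds a single cue whose text carries an inline timestamp tag before
// every word except the first (the first word starts with the cue itself).
function buildKaraokeCue(words: TimedWord[], cueEnd: number): string {
  if (words.length === 0) return "";
  const cueStart = words[0].start;
  const payload = words
    .map((w, i) => (i === 0 ? w.text : `<${formatVttTimestamp(w.start)}>${w.text}`))
    .join(" ");
  return `${formatVttTimestamp(cueStart)} --> ${formatVttTimestamp(cueEnd)}\n${payload}`;
}
```

Since the transcription block in this PR is rendered by the page rather than by the browser's native caption overlay, the UI would probably have to parse these inline tags itself to highlight individual words.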
As for SSA/ASS and SRT support, I'm still checking what the best approach is. I was considering converting them to VTT to keep the implementation consistent with how we show the transcriptions, but I'm not sure yet if this is the best way.
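For SRT, the conversion idea can be sketched in a few lines, since the two formats differ mainly in the file header and the decimal separator in timestamps. The `srtToVtt` function below is a hypothetical illustration, not part of this PR, and it does not handle SRT-specific markup such as font tags.

```typescript
// Hypothetical sketch: convert simple SRT text to WebVTT.
// The numeric SRT counters are left in place; WebVTT accepts them as cue identifiers.
function srtToVtt(srt: string): string {
  const normalized = srt
    .replace(/^\uFEFF/, "") // drop a leading byte order mark, if any
    .replace(/\r\n?/g, "\n") // normalize line endings
    .trim();
  // Only timing lines are rewritten, so commas in the cue text are untouched.
  const timingLine = /^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})(.*)$/gm;
  const body = normalized.replace(timingLine, "$1.$2 --> $3.$4$5");
  return `WEBVTT\n\n${body}\n`;
}
```

Doing the conversion on the server would let the frontend keep consuming VTT only. SSA/ASS is a different matter, since its styling and positioning have no direct VTT equivalent.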
@mfcar I've used https://github.com/jianfch/stable-ts to generate ASS/SSA karaoke-style captions with a custom style for my podcasts/books. I don't remember if VTT is one of the options. Whisper.cpp can spit out word-level output too, but you have to process it with a script to get a valid subs file.
In the past, I have used stable-ts to create VTT files. I generated word-level timestamps with Whisper’s base.en model.