Initial transcription support

Open FlakM opened this issue 2 years ago • 11 comments

Hi! This is the initial MR to get the ball rolling on incorporating the transcriptions created for issue #301. The idea is that each transcription is a plain JSON file, and transcripts are displayed only on pages where the relevant transcription is already present.
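To make the "plain JSON file, shown only where present" idea concrete, here is a minimal sketch. The field names (`episodeId`, `segments`, `start`, `end`, `text`) and the `parseTranscript` helper are assumptions for illustration, not the format actually used in this MR:

```typescript
// Hypothetical shape of one transcript file; every field name here is an
// assumption made for illustration, not the format used in this MR.
interface TranscriptSegment {
  start: number; // seconds from the beginning of the episode
  end: number; // seconds from the beginning of the episode
  text: string;
}

interface Transcript {
  episodeId: string;
  segments: TranscriptSegment[];
}

// Parse raw JSON text. Returns null when the file is missing or malformed,
// so the page can simply skip the transcript section for that episode.
function parseTranscript(raw: string | undefined): Transcript | null {
  if (!raw) return null;
  try {
    const data = JSON.parse(raw);
    if (typeof data?.episodeId !== "string" || !Array.isArray(data?.segments)) {
      return null;
    }
    // Keep only well-formed segments instead of failing the whole file.
    const segments = data.segments.filter(
      (s: any) =>
        typeof s?.start === "number" &&
        typeof s?.end === "number" &&
        typeof s?.text === "string"
    );
    return { episodeId: data.episodeId, segments };
  } catch {
    return null;
  }
}
```

A page would render the transcript block only when `parseTranscript` returns a non-null value.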

This is in no way finished code, just an initial setup; maybe someone will have an easier time picking it up now :+1:

Features I'd like to see:

  • timestamp links that set playback of the local web player to the given time
  • a link to the corresponding transcription JSON in the GitHub repo to enable easy edits
  • some nice formatting of the text

Unfortunately, whisper AI is currently cutting the sentences strangely; this should be fixed sometime in the future. I'd be happy to rerun the transcriptions then and backport modifications.

FlakM avatar Jan 15 '23 21:01 FlakM

Sounds as though this PR should be marked WIP?

and... very exciting!!!!

gerbrent avatar Jan 17 '23 18:01 gerbrent

I've added the WIP flag, but it is misleading since I'm currently not able to work on it much. With the limited time I get, I'd rather focus on improving transcriptions and maybe preparing a POC with search, which was my initial goal.

The current code requires some love to improve the looks (a little HTML, some CSS, and maybe JavaScript to set the correct time in a web player).

It seems like a perfect opportunity for someone to pick up a nice task. I'd be more than happy to "mentor" as much as I can.

FlakM avatar Jan 17 '23 19:01 FlakM

Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs, "show notes" and "transcript".

ChanceM avatar Jan 17 '23 19:01 ChanceM

Currently, it is embedded inside the episode site:

[Image: transcripts]

Whisper is sadly cutting them strangely for some files.

FlakM avatar Jan 17 '23 20:01 FlakM

Since it has not received much attention, I've picked it up. For now, setting playback time based on a timestamp is not possible, but the folks at podverse will soon add it: https://github.com/podverse/podverse-web/discussions/1071#discussioncomment-4748501 💪

FlakM avatar Jan 22 '23 10:01 FlakM

Over the weekend I gave it a run and uploaded fresh transcripts (90ish) for different episodes. Here are the screens:

[Screenshots: 01, 02]

As mentioned above, it is currently impossible to set the current time in the podverse online player (well, unless we proxy the podverse player on the same domain, but this would open a can of worms). Transcripts are imperfect but are easily editable by users, even using an online GitHub client :+1: For the newest ones I'll definitely want to run the large.en model.

@gerbrent @ChanceM @pagdot @noblepayne (people mentioned in other issues) please provide feedback :smile:

FlakM avatar Jan 23 '23 21:01 FlakM

Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

pagdot avatar Jan 24 '23 08:01 pagdot

> Can you also upload the code to run the transcriptions? Could imagine it also being in another repository, but it would allow others to also work on it or just be inspired :)

The sources are available in my repository, Jupiter search. I still have a lot of work to do there, but on an x86 machine it should be as simple as downloading a model and running inference using the docker image.

FlakM avatar Jan 24 '23 16:01 FlakM

I definitely agree with @ChanceM (src):

> Just a thought: do we want this to be on a separate page, or should it ideally be embedded into the episode page? That would also be better for the JS interaction. I was thinking maybe tabs, "show notes" and "transcript".

For a final solution we should have some type of separate page or tabbed area. For now though, I think that it just being inline is fine for an initial PoC.

@FlakM, once this is merged (or even before then) could you collect a list of enhancements that we could do for transcriptions? Maybe this one we'll consider as closing #301 and we have another one for enhancing the transcription experience. Then we can reference the old issue in the new one, so anyone that wants to make that leap (from PoC -> enhanced) has that link. We can eventually break each of those tasks out in their own GH issues (to allow individuals to work on them separately), but for now I think just doing a single issue would be nice (till we do some spring cleaning on some of these issues :sweat_smile: )

elreydetoda avatar Feb 19 '23 13:02 elreydetoda

Hello @elreydetoda, as I've mentioned in the matrix (hehe), this work has been taken over by the JB crew, and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

If there is an actual decision to host transcripts on s3, not a GitHub repo, then this MR should probably get closed and a new one should be created to ingest data from s3. As for further improvements, here is the list of my personal acceptance criteria that I would add:

  • transcripts should be open for edits: mistakes are bound to happen, and some might be offensive
  • there should be only a single source of truth, so edits are automatically shown in all places (RSS feed, web site, search index, etc.)
  • the format should be open for future extension. For instance, whisper currently does not support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work in the whisper.cpp project: https://github.com/ggerganov/whisper.cpp/issues/64#issuecomment-1349732665. There will definitely be progress in the quality of the tooling, and it would be awesome to be able to update the transcripts then
  • on the web version, clicking on a timestamp should set the current playback time and append the time to the URL. Hugo should also honour URLs with timestamps to enable sharing a particular moment
  • there should be public documentation on how to run the whole s2t pipeline, and hopefully it should be automated
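The timestamp criterion above could be sketched roughly as follows. This assumes a local player that exposes a writable `currentTime` (as the standard HTML audio element does) and a `t` query parameter holding seconds; the parameter name and all function names are made up for illustration:

```typescript
// Format a position in seconds as HH:MM:SS for display next to a transcript line.
function formatTimestamp(totalSeconds: number): string {
  const s = Math.max(0, Math.floor(totalSeconds));
  const pad = (n: number): string => (n < 10 ? "0" : "") + n;
  return `${pad(Math.floor(s / 3600))}:${pad(Math.floor((s % 3600) / 60))}:${pad(s % 60)}`;
}

// Build a shareable URL that carries the playback position.
// The `t` parameter name is an assumption, not an existing convention of the site.
function withTimestamp(pageUrl: string, seconds: number): string {
  const url = new URL(pageUrl);
  url.searchParams.set("t", String(Math.max(0, Math.floor(seconds))));
  return url.toString();
}

// Read the position back from a URL; returns null when it is absent or invalid.
function readTimestamp(pageUrl: string): number | null {
  const raw = new URL(pageUrl).searchParams.get("t");
  if (raw === null || raw === "") return null;
  const n = Number(raw);
  return isFinite(n) && n >= 0 ? n : null;
}

// Move a player to the given position. Any object with a writable
// `currentTime`, such as an HTMLAudioElement, satisfies this parameter type.
function seekTo(player: { currentTime: number }, seconds: number): void {
  player.currentTime = seconds;
}
```

On page load, the site could call `readTimestamp` on the current URL and pass the result to `seekTo`; a click handler on each timestamp would call `seekTo` and then update the address bar with `withTimestamp`. This only works for a player on the same page, not for the embedded podverse player discussed earlier.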

FlakM avatar Feb 19 '23 19:02 FlakM

> Hello @elreydetoda, as I've mentioned in the matrix (hehe), this work has been taken over by the JB crew, and if I'm not mistaken they have different ideas about how the transcripts are to be generated.

Hello @FlakM 😁

Yep, no problem, I remember seeing it (thank you for the reminder 🙃). I just wanted to get my feedback about the longer-term goal added to the PR/issue about this feature.

> If there is an actual decision to host transcripts on s3, not a GitHub repo, then this MR should probably get closed and a new one should be created to ingest data from s3. As for further improvements, here is the list of my personal acceptance criteria that I would add:
>
>   • transcripts should be open for edits: mistakes are bound to happen, and some might be offensive
>   • there should be only a single source of truth, so edits are automatically shown in all places (RSS feed, web site, search index, etc.)
>   • the format should be open for future extension. For instance, whisper currently does not support speaker diarization, but it is entirely possible that some framework will add support for it in the future. AFAIK there might be some work in the whisper.cpp project: https://github.com/ggerganov/whisper.cpp/issues/64#issuecomment-1349732665. There will definitely be progress in the quality of the tooling, and it would be awesome to be able to update the transcripts then
>   • on the web version, clicking on a timestamp should set the current playback time and append the time to the URL. Hugo should also honour URLs with timestamps to enable sharing a particular moment
>   • there should be public documentation on how to run the whole s2t pipeline, and hopefully it should be automated

I completely agree with all of these criteria/features! Whatever I can do to convey these points, I'll definitely try to get them all included (if I'm asked/consulted about this feature). I can't guarantee it'll happen (since in the end it's up to the JB team), but IMO 1, 2, & 5 (of the points you listed above) should be considered critical (even for an MVP) to ensure the transcript doesn't offend someone and reflect badly on JB. If it did, anyone in the community could quickly fix it, and that would fix it for everything/one.

Honestly (just thinking out loud here), do you think a good alternative to an s3 bucket could be something like GH pages to host just the raw text of the transcripts (in whatever format they need to be in)? That way the transcripts could just be hosted in a repo (probably a separate one to simplify separation of concerns) and then published via a GH action workflow.

Plus, IIRC GH pages already has some type of CDN in front of it. If it doesn't, since it's just text, we could just put Cloudflare in front of it too.

elreydetoda avatar Feb 21 '23 11:02 elreydetoda