
[Question] word precision

Open irux opened this issue 3 years ago • 21 comments

Hello, when you watch a video and turn on the autogenerated subtitles, the video and the subtitles seem perfectly synchronized. It gives the impression that they have the exact timestamp at which each word is going to be spoken. Is it possible to get this precision?

Thank you

irux avatar Mar 28 '21 21:03 irux

Hi @irux, getting the timestamps per word is currently not supported, and the endpoint we are currently using does not provide this information. However, there is probably some way to access that information which I don't know about yet (since it is used by the YouTube web-client). So if anyone wants to look into how that works, I would be happy to merge such a feature. I am a bit short on time currently, so I don't see myself implementing this in the near future.

Do you mind sharing what your use-case is for this? Just so I could get a better idea of how important this feature is for this module in general.

jdepoix avatar Mar 29 '21 06:03 jdepoix

@jdepoix my use case is for a video editing tool. I am doing ASR at the moment with AWS, but with this feature I would be able to avoid more api calls to other services and it would be free.

Btw, I already found the endpoint. I will need to test more things but I think I would be able to make a pull request in the following days.

irux avatar Mar 29 '21 10:03 irux

Nice! 👍 Let me know if you need any help on integrating this feature. Maybe you wanna share some details on how you plan to implement it once you have the endpoint figured out, so we could discuss what's the best way to integrate it into the existing API. Having that discussion beforehand could save some time in the PR and avoid having to do more iterations than needed! 😊

jdepoix avatar Mar 29 '21 11:03 jdepoix

It is actually quite simple. You only need to append &fmt=json3 at the end of the track link. What I don't know is how it behaves with non-ASR subtitles. The idea would be to have the same response as at the moment, but with more precision. example

irux avatar Mar 29 '21 11:03 irux

@jdepoix I am seeing that it behaves like it should. It would actually save some work, because the format is already JSON and you don't need to convert it from XML.

irux avatar Mar 29 '21 11:03 irux

Sweet, that looks great! I agree that this could replace the XML request and the parsing that goes along with it. However, implementing this will unfortunately not be as simple as it may seem. We can't just change the format of what get_transcript returns, as that would be a massively breaking change. We could introduce a new param like timestamp_mode to the Transcript.fetch() method, which could be PER_SEGMENT or PER_WORD. But I am afraid returning different formats from fetch() will break compatibility with the recently added formatters module.

This might be the time to make Transcript.fetch() return a more sophisticated object instead of a dict (something I wanted to do for a while, but there wasn't really a reason to change it). This object could contain all relevant information and provide a to_dict(timestamp_mode) method. The formatters would have to be changed to use this new Transcript object instead of the dicts they currently work on. What do you think about this @crhowell ? You think this could be integrated into stuff like the WebVTTFormatter?
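To make that idea a bit more concrete, here is a rough sketch of what such an object could look like. All names here (FetchedTranscript, FetchedSegment, TimestampMode, etc.) are hypothetical and only illustrate the proposal, not an actual API of this module:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class TimestampMode(Enum):
    PER_SEGMENT = "per_segment"
    PER_WORD = "per_word"


@dataclass
class FetchedWord:
    text: str
    start: float  # seconds


@dataclass
class FetchedSegment:
    text: str
    start: float     # seconds
    duration: float  # seconds
    words: List[FetchedWord]


class FetchedTranscript:
    """Hypothetical return type of Transcript.fetch()."""

    def __init__(self, segments: List[FetchedSegment]):
        self.segments = segments

    def to_dict(self, timestamp_mode: TimestampMode = TimestampMode.PER_SEGMENT):
        if timestamp_mode is TimestampMode.PER_SEGMENT:
            # matches the current output format of get_transcript
            return [
                {"text": s.text, "start": s.start, "duration": s.duration}
                for s in self.segments
            ]
        return [
            {"text": w.text, "start": w.start}
            for s in self.segments
            for w in s.words
        ]
```

Formatters could then accept a FetchedTranscript and call to_dict() with whichever mode they support.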

jdepoix avatar Mar 29 '21 12:03 jdepoix

We would probably also have to rename a few classes for the naming to make sense in that case:

TranscriptListFetcher -> AvailableTranscriptsFetcher
TranscriptList -> AvailableTranscripts
Transcript -> AvailableTranscript

This would "give room" for a new Transcript object representing a fetched transcript. I am open to suggestions on the naming; I'm not quite satisfied yet 😄

jdepoix avatar Mar 29 '21 12:03 jdepoix

@irux you'll also have to test this very extensively, as changing this could potentially completely break this module. Looking at the json returned by the endpoint a few questions come to mind, as the structure doesn't seem to be consistent.

  • is the content of the transcript always in "utf8" or are there different encodings available?
  • any ideas on what "acAsrConf" is?
  • there are dicts containing "aAppend" and it seems that they always simply contain \n. However, is this consistent? Can we simply filter out everything with the key "aAppend"?
  • is there any other information in there which could be useful to us? 🤔
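For what it's worth, here is a minimal sketch of how the json3 events could be parsed into per-word timestamps, based on the structure observed so far. The handling of "aAppend" and "utf8" simply encodes the assumptions behind the open questions above, so treat it as a starting point rather than confirmed behavior:

```python
def parse_json3_events(json3):
    """Extract per-word timestamps from a json3 payload.

    Assumes "aAppend" events only carry rendering newlines and that
    segment text is always under the "utf8" key (both unconfirmed).
    """
    words = []
    for event in json3.get("events", []):
        if event.get("aAppend"):  # rendering-only newline events, skipped
            continue
        start_ms = event.get("tStartMs", 0)
        for seg in event.get("segs", []):
            text = seg.get("utf8", "").strip()
            if not text:
                continue
            words.append({
                "text": text,
                # a word's tOffsetMs is relative to the event's tStartMs
                "start": (start_ms + seg.get("tOffsetMs", 0)) / 1000.0,
            })
    return words
```

This also sidesteps "acAsrConf" entirely; if it turns out to be a per-word confidence score, it could simply be passed through alongside "text" and "start".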

jdepoix avatar Mar 29 '21 12:03 jdepoix

I was doing some research and found something new that can help with the WebVTT format problem.

If you use fmt=vtt, you get the transcript as WebVTT.

EDIT:

You actually have:

  • json3
  • srv1
  • srv2
  • srv3
  • ttml
  • vtt

as possibilities (that's what I could find to date)
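Assuming all of these are selected via the fmt query parameter on the same track URL (which is what the json3 and vtt experiments above suggest), building the candidate URLs is a one-liner. The base URL here is whatever baseUrl the caption track metadata provides, and is assumed to already carry query parameters:

```python
# Values of the `fmt` query parameter found so far on the timedtext track URL.
KNOWN_FORMATS = ("json3", "srv1", "srv2", "srv3", "ttml", "vtt")


def format_urls(base_url):
    """Build one track URL per known format, given a track's baseUrl
    (assumed to already contain query parameters, hence the '&')."""
    return {fmt: f"{base_url}&fmt={fmt}" for fmt in KNOWN_FORMATS}
```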

irux avatar Mar 29 '21 18:03 irux

@jdepoix I haven't fully dug into everything y'all said yet, but I've been tinkering with the URL provided above (no longer working). This seems to be some kind of undocumented API, which does require us to explore options to see what does and doesn't work. There are some StackOverflow posts of folks talking about this undocumented API. It actually looks like it does quite a bit of work for us.

Two useful example URLs that respond defaulting to XML.

A Single Transcript

Transcript list of supported Langs

Remaining Thoughts

  • These URLs definitely simplify the TranscriptListFetcher work that needs to be done with manually parsing HTML into JSON.
  • It seems to be an undocumented API that could disappear anytime or stop responding entirely to these types of requests. The non-API manual parsing that TranscriptListFetcher performs, on the other hand, will likely always be able to make its requests. It is only fragile if they decide to switch up the element the transcript is placed in within the HTML DOM response that is fetched. That could break the code, but we could at least update the repo to adapt.

So with that being said, I feel like it would almost be better to keep the current API as it is as some kind of fallback in case this undocumented API disappears. So you don't create a heavy reliance on it. Maybe we could introduce a "contrib"-like subpackage within youtube_transcript_api that has its own API implementation that could be leveraged from other code?

I am just throwing initial thoughts here I am not entirely certain at the moment. @jdepoix Whatever you feel is the best way to go I will try to help contribute to help make that happen. 🙂

crhowell avatar Mar 30 '21 01:03 crhowell

@crhowell it's normal that the link expires after a certain time, and the concerns you raised about it being undocumented are also true of the endpoint we are currently using. In fact, it is the same endpoint, only with the addition of the fmt=json3 param. So I would assume this would make the module just as reliable/unreliable as it currently is.

The problem with the simple timedtext API you stumbled across on SO is that it doesn't support automatically created subtitles. That was actually the reason I created this module 😄

With the API providing so many formatting options I am starting to think that we should maybe rework how the formatters work. I will have to think that through a bit more when I have a bit more time. But feel free to keep the ideas going on how to integrate this.

jdepoix avatar Mar 30 '21 07:03 jdepoix

@jdepoix Thanks for touching on that. I think I totally made an assumption there based on what I was seeing when trying to follow the logic in the code. After directly inspecting a Transcript object's _url attribute, I see that you are in fact already hitting that timedtext route, so this is not a new find.

I guess I assumed that WATCH_URL = 'https://www.youtube.com/watch?v={video_id}' was always the base URL, and that you were always assembling your route to pattern-match elements off that page and eventually grab the contents out of it like you would with a scraper. But it seems you did some clever assembling of the timedtext route in there.

Disregard my previous post, my thoughts were primarily based on that assumption. 😅

crhowell avatar Apr 02 '21 16:04 crhowell

@jdepoix To adapt the formatters, seeing formatters as still a "post-processing" feature of a Transcript my immediate thought is exactly what you previously stated.

The formatters would have to be changed to use this new Transcript object instead of the dicts they currently work on.

We would probably want to pass a Transcript object, and assuming the Transcript objects remain about the same as they are now, we could maybe have the formatter alter the _url prior to the fetch. Then that formatter can make assumptions about the data it is going to get back, or generate a proper response if the request fails or is empty.

I will keep thinking about this as well, I will likely do a feature branch and experiment with some ideas.

crhowell avatar Apr 02 '21 16:04 crhowell

Can you add the ability to grab the json3 and not do any post-processing/formatting for now?

nikitalita avatar Aug 25 '21 05:08 nikitalita

Hi @nikitalita, could you maybe share your use case so that I can get a better idea of why this would be useful to you (what information does the json3 version have which the current output doesn't)? I'll have to do some refactoring on the formatters to integrate this feature into this module nicely, which I currently don't have a lot of time for. Therefore, I will unfortunately not be able to implement this in the near future, unless there is a use case which makes this very urgent.

However, you can call some private methods to get the raw json output which is currently being processed:

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher

video_id = '<video-id>'
fetcher = TranscriptListFetcher(requests.Session())
print(fetcher._extract_captions_json(fetcher._fetch_video_html(video_id), video_id))

Does this help in any way? 😊

jdepoix avatar Aug 25 '21 07:08 jdepoix

If you take a look at this issue, it explains it: using timestamps per word gives better accuracy in determining where sentence breaks are.

https://github.com/shashank2123/Punctuation-Restoration-For-Youtube-Transcript/issues/1

Does this help in any way? 😊

That does help, thank you :)

nikitalita avatar Aug 25 '21 07:08 nikitalita

For those of you playing at home, here's how you get the json3 url using the above (substitute languageCode where appropriate):

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher

video_id = '<video_id>'
fetcher = TranscriptListFetcher(requests.Session())
captions_json = fetcher._extract_captions_json(fetcher._fetch_video_html(video_id), video_id)

transcript_track_url = ''
for track in captions_json['captionTracks']:
    # use .get(), since manually created tracks have no 'kind' key
    if track.get('kind') == 'asr' and track['languageCode'] == 'en':
        transcript_track_url = track['baseUrl'] + '&fmt=json3'

print(transcript_track_url)

nikitalita avatar Aug 25 '21 08:08 nikitalita

@nikitalita actually, now that I come to think about it, this can be done more easily, since you just have to add a param to the URL of the transcript. Basically, you can use the full API of the TranscriptListFetcher and simply access the private _url property of the returned transcripts to add the URL params you need:

transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
transcript = transcript_list.find_generated_transcript(['en'])
print(transcript._url + '&fmt=json3')

jdepoix avatar Aug 25 '21 09:08 jdepoix

Hi, many thanks for the information. I was trying to figure out one thing. Consider the video lbykL9VjRvM. If you fetch the subtitles with the TranscriptAPI as it is, you end up with these first two segments:

start_sec duration_sec text
0.03	5.46	hey guys welcome back hey guys brand new
3.21	3.48	podcast wait a minute it's my podcast oh

where the intervals overlap. Looking at the json3 format, I think this is because the webpage renders long subtitles by partitioning them into two sub-segments. When the second part kicks in in the second line (3.21+0.03), the first part still has to remain on screen (until 5.49). It is possible to get the (sort of) correct boundaries through the word-annotated json3 format, but in its current form these timestamps are actually wrong. Am I missing something?

I think this only occurs on segments that are split into two to ease overlaying on the video. Any thoughts @nikitalita?

ozancaglayan avatar May 10 '22 22:05 ozancaglayan

No idea, I've stopped trying to mess with youtube's subtitles

nikitalita avatar May 10 '22 23:05 nikitalita

@ozancaglayan

Before I took an extended break, I was looking at this overlapping time issue as I noticed it too while I was trying to write up the basic WebVTT formatter.

From my understanding, the start time is when the transcript text is meant to show on-screen, and the duration isn't necessarily the amount of time spent saying a given set of words. It actually seems to be the duration of time to leave the text on the screen.

You can't really even use duration to calculate what I would call the end time of a given line (or end of set of word segments). But we could calculate the end of a line/segments by looking at the start of the next line/segments.

Is your goal to try to have something like this:

start_sec duration_sec text
0.03	5.46	hey guys welcome back hey guys brand new
3.21	3.48	podcast wait a minute it's my podcast oh
5.49    1.2     sorry sorry I forgot
6.69    .22     enjoy are you ready to do a podcast I'm
8.91    .74     ready you're ready yeah

become this instead?

start_sec duration_sec text
0.03	3.18	hey guys welcome back hey guys brand new 
3.21	2.28	podcast wait a minute it's my podcast oh
5.49	1.2 	sorry sorry I forgot
6.69	2.22	enjoy are you ready to do a podcast I'm
8.91	1.74	ready you're ready yeah

Take note of the start + duration which should add up to the start time of the next line.
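The end-of-line calculation described above can be sketched as follows: each line's end becomes the next line's start, so its duration is recomputed as the gap to the next segment. The final segment has no successor to anchor its end, so it is left unchanged here (a heuristic, not something the API provides):

```python
def recompute_durations(segments):
    """Recompute each segment's duration so it ends exactly where the
    next segment starts. Segments are dicts with 'start', 'duration'
    and 'text' keys, as returned by get_transcript. The last segment
    is left as-is, since there is no following segment to anchor it."""
    fixed = []
    for current, nxt in zip(segments, segments[1:]):
        fixed.append({**current, "duration": round(nxt["start"] - current["start"], 2)})
    fixed.append(segments[-1])
    return fixed
```

Run against the first table above, this reproduces the corrected durations for every line except the last one.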

crhowell avatar May 29 '22 02:05 crhowell