youtube-transcript-api

[Feature] Default language on Transcript class

Open arturoalcibia opened this issue 2 years ago • 13 comments

Hello! It'd be great to have the default language of a video available as an attribute on the TranscriptList class.

I've been able to get this by accessing the list of subtitles from this URL:

  • https://video.google.com/timedtext?v={videoId}&type=list

Ex:

  • https://video.google.com/timedtext?v=omGF6Ps9Nog&type=list

If more than one subtitle is available, the XML will contain a "default_lang" key, which is what the user chose as the language of the video when uploading a file.

I have a M.R. ready but wanted to submit it as an issue in case someone was already working on something similar or had a better approach.

arturoalcibia avatar Nov 11 '21 02:11 arturoalcibia

Hi @arturoalcibia, sorry for the late reply, somehow I must've missed this issue...

I am not really sure what functionality you are asking for exactly. You are currently able to retrieve transcripts in different languages using

YouTubeTranscriptApi.get_transcript(video_id, languages=['de', 'en'])

or

YouTubeTranscriptApi.list_transcripts(video_id).find_transcript(['de', 'en']).fetch()

What use case do you have which is not covered by these methods?

jdepoix avatar Dec 16 '21 08:12 jdepoix

Hi @jdepoix,

no worries.

This would give us access to the caption track the user intended to be played by default, which is usually in the language of the video.

As an example, this video contains multiple manually created tracks: https://www.youtube.com/watch?v=UOgvbS4GkF0, but English is the one the user set as the default.

You can find which transcript track is set as the default by looking for the key "defaultCaptionTrackIndex" in the returned HTML.

In this case, "defaultCaptionTrackIndex" is 3, which corresponds to the English track.

Here's a quick and dirty snippet to get the index (which refers to the English track).

import requests
from youtube_transcript_api._transcripts import TranscriptListFetcher

videoId = 'UOgvbS4GkF0'

with requests.Session() as http_client:
    tListFetcher = TranscriptListFetcher(http_client)
    # Fetch the watch page once and reuse it (these are private helpers and may change).
    htmlContent = tListFetcher._fetch_video_html(videoId)
    captions_json = tListFetcher._extract_captions_json(htmlContent, videoId)
    # Fall back to index 0 if no default caption track is set.
    defaultCaptionIndex = captions_json['audioTracks'][0].get('defaultCaptionTrackIndex', 0)
    print(defaultCaptionIndex)
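
To turn that index into a language code, something like the following might work. It is only a sketch: it assumes captions_json also carries a "captionTracks" list whose entries have a "languageCode" field, which the snippet above does not show. The helper name default_caption_language is hypothetical and the sample data is made up for illustration.

```python
from typing import Optional


def default_caption_language(captions_json: dict) -> Optional[str]:
    """Return the language code of the default caption track, or None.

    Assumes (not confirmed in this thread) that captions_json has a
    "captionTracks" list with a "languageCode" per entry.
    """
    tracks = captions_json.get("captionTracks", [])
    audio_tracks = captions_json.get("audioTracks", [])
    if not tracks or not audio_tracks:
        return None
    # Same lookup as the snippet above; falls back to index 0 when the key is absent.
    index = audio_tracks[0].get("defaultCaptionTrackIndex", 0)
    if not 0 <= index < len(tracks):
        return None
    return tracks[index].get("languageCode")


# Made-up sample shaped like the assumed structure (not real output from YouTube).
sample = {
    "captionTracks": [
        {"languageCode": "de"},
        {"languageCode": "es"},
        {"languageCode": "fr"},
        {"languageCode": "en"},
    ],
    "audioTracks": [{"defaultCaptionTrackIndex": 3}],
}
print(default_caption_language(sample))
```

Returning None instead of raising keeps the helper usable for videos without any caption tracks.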

I'd be happy to contribute with a proper M.R. on this.

arturoalcibia avatar Jan 04 '22 06:01 arturoalcibia

Hi @arturoalcibia,

okay, that makes sense. In that case the default language would have to be added as a param to the TranscriptList constructor and the TranscriptList.build method would have to determine the default language and set it. The language_codes params on find_manually_created_transcript, find_generated_transcript and find_transcript would have to become optional and if they are not set the default language is used.
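
A rough sketch of that idea, using a simplified stand-in class rather than the library's actual TranscriptList. The class name TranscriptListSketch, its constructor arguments, and the use of plain strings as transcripts are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


class TranscriptListSketch:
    """Hypothetical, simplified stand-in for TranscriptList (not the library's class)."""

    def __init__(
        self,
        video_id: str,
        transcripts: Dict[str, str],
        default_language: Optional[str] = None,
    ):
        self.video_id = video_id
        # Maps language code -> transcript (plain strings here, for brevity).
        self._transcripts = transcripts
        self.default_language = default_language

    def find_transcript(self, language_codes: Optional[Iterable[str]] = None) -> str:
        # When no codes are given, fall back to the uploader's default language.
        if language_codes is None:
            if self.default_language is None:
                raise LookupError("no language codes given and no default language known")
            codes = [self.default_language]
        else:
            codes = list(language_codes)
        # Return the first requested language that is available.
        for code in codes:
            if code in self._transcripts:
                return self._transcripts[code]
        raise LookupError(f"no transcript found for {codes}")


transcript_list = TranscriptListSketch(
    "UOgvbS4GkF0", {"de": "Hallo", "en": "Hello"}, default_language="en"
)
print(transcript_list.find_transcript())        # falls back to the default language
print(transcript_list.find_transcript(["de"]))  # explicit codes are still honoured
```

Making language_codes optional keeps existing calls that pass explicit codes working; only calls that omit them would see the new default.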

Of course any contributions on this are very much welcome! 😊

My only concern is that this would change the default behaviour of this module and could break people's code if they expect English subtitles (since that's what they've been getting by simply calling get_transcript). However, using the default language provided by the uploader seems like a more fitting default for this module, so maybe we should accept this breaking change. Any thoughts on this?

jdepoix avatar Jan 04 '22 08:01 jdepoix

Hi @jdepoix,

Sounds good. I agree that the breaking change seems worth it; adding an extra function or argument to return the default language seems like overkill and would get confusing. I also think having English as the default language feels arbitrary. Returning the default language provided by the user looks cleaner.

arturoalcibia avatar Jan 05 '22 06:01 arturoalcibia

Hi @jdepoix,

I think I have a working version with this feature, would it be possible to be added as a contributor to submit a M.R.?

arturoalcibia avatar Jan 05 '22 17:01 arturoalcibia

Hi @arturoalcibia, you don't need to be a contributor to submit a PR. You can simply submit a PR from your fork. Read this to find out more!

jdepoix avatar Jan 05 '22 17:01 jdepoix

Hi @arturoalcibia, as this topic just came up in #177, is this something you are still working on? Is there anything I can help you with?

jdepoix avatar Dec 06 '22 14:12 jdepoix

Hi @jdepoix, my bad! I worked on it but never submitted the PR. If that's okay, I will submit it this weekend for review.

arturoalcibia avatar Dec 07 '22 03:12 arturoalcibia

@arturoalcibia no worries, I always appreciate contributions of any kind 😊

jdepoix avatar Dec 07 '22 10:12 jdepoix

Any progress on this? I'm using the CLI and it'd be great to have a flag that just returns the default language of the video.

dcsilver avatar Mar 17 '23 21:03 dcsilver

@dcsilver I haven't done any active development on this. Apparently, @arturoalcibia has been working on a PR, but hasn't turned it in so far. Any news on this @arturoalcibia?

jdepoix avatar Mar 20 '23 09:03 jdepoix

Any update about defaultAudioLanguage?

KhaledLela avatar May 29 '23 20:05 KhaledLela

@KhaledLela sorry, I haven't done any development on this and @arturoalcibia has unfortunately never turned in that PR.

jdepoix avatar Jun 12 '23 08:06 jdepoix