youtubetranscript.com cc selection option

Open pasdesinfos opened this issue 2 years ago • 13 comments

Is your feature request related to a problem? Please describe.
:( Unknown error: Could not retrieve a transcript for the video http://www.youtube.com/watch?v=oBfDbucxPU4! This is most likely caused by: No transcripts were found for any of the requested language codes: ('en',)
For this video (oBfDbucxPU4) transcripts are available in the following languages:
(MANUALLY CREATED) None
(GENERATED) - es ("Spanish (auto-generated)")[TRANSLATABLE]
(TRANSLATION LANGUAGES) - af ("Afrikaans") - ak ("Akan") - sq ("Albanian") - am ("Amharic") - ar ("Arabic") - hy ("Armenian") - as ("Assamese") - ay ("Aymara") - az ("Azerbaijani") - bn ("Bangla") - eu ("Basque") - be ("Belarusian") - bho ("Bhojpuri") - bs ("Bosnian") - bg ("Bulgarian") - my ("Burmese") - ca ("Catalan") - ceb ("Cebuano") - zh-Hans ("Chinese (Simplified)") - zh-Hant ("Chinese (Traditional)") - co ("Corsican") - hr ("Croatian") - cs ("Czech") - da ("Danish") - dv ("Divehi") - nl ("Dutch") - en ("English") - eo ("Esperanto") - et ("Estonian") - ee ("Ewe") - fil ("Filipino") - fi ("Finnish") - fr ("French") - gl ("Galician") - lg ("Ganda") - ka ("Georgian") - de ("German") - el ("Greek") - gn ("Guarani") - gu ("Gujarati") - ht ("Haitian Creole") - ha ("Hausa") - haw ("Hawaiian") - iw ("Hebrew") - hi ("Hindi") - hmn ("Hmong") - hu ("Hungarian") - is ("Icelandic") - ig ("Igbo") - id ("Indonesian") - ga ("Irish") - it ("Italian") - ja ("Japanese") - jv ("Javanese") - kn ("Kannada") - kk ("Kazakh") - km ("Khmer") - rw ("Kinyarwanda") - ko ("Korean") - kri ("Krio") - ku ("Kurdish") - ky ("Kyrgyz") - lo ("Lao") - la ("Latin") - lv ("Latvian") - ln ("Lingala") - lt ("Lithuanian") - lb ("Luxembourgish") - mk ("Macedonian") - mg ("Malagasy") - ms ("Malay") - ml ("Malayalam") - mt ("Maltese") - mi ("Māori") - mr ("Marathi") - mn ("Mongolian") - ne ("Nepali") - nso ("Northern Sotho") - no ("Norwegian") - ny ("Nyanja") - or ("Odia") - om ("Oromo") - ps ("Pashto") - fa ("Persian") - pl ("Polish") - pt ("Portuguese") - pa ("Punjabi") - qu ("Quechua") - ro ("Romanian") - ru ("Russian") - sm ("Samoan") - sa ("Sanskrit") - gd ("Scottish Gaelic") - sr ("Serbian") - sn ("Shona") - sd ("Sindhi") - si ("Sinhala") - sk ("Slovak") - sl ("Slovenian") - so ("Somali") - st ("Southern Sotho") - es ("Spanish") - su ("Sundanese") - sw ("Swahili") - sv ("Swedish") - tg ("Tajik") - ta ("Tamil") - tt ("Tatar") - te ("Telugu") - th ("Thai") - ti ("Tigrinya") - ts ("Tsonga") - tr ("Turkish") - tk ("Turkmen") - uk ("Ukrainian") - und ("Unknown Language") - ur ("Urdu") - ug ("Uyghur") - uz ("Uzbek") - vi ("Vietnamese") - cy ("Welsh") - fy ("Western Frisian") - xh ("Xhosa") - yi ("Yiddish") - yo ("Yoruba") - zu ("Zulu")
If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Describe the solution you'd like When auto-generated subtitles are available, translate them to en and return that transcript by default.

Describe alternatives you've considered cc selection option

Additional context n/a

pasdesinfos avatar Dec 22 '22 06:12 pasdesinfos

Same error here. Maybe adding an option to select language solves the problem :)

erseco avatar Dec 22 '22 09:12 erseco

yeah same here, option to select would be good.

ghost avatar Dec 22 '22 12:12 ghost

Hi @pasdesinfos, I definitely see the use case for a feature where transcripts are auto-translated if they are not available in the requested language. However, this should not be the default. As this module is commonly used to train/validate Machine Learning models, translating the transcripts will introduce another variable into the data quality, which the user should always be aware of (by opting into it).

I actually thought about introducing this as an optional feature before, but there is an implementation detail that stopped me from doing so: if we want to automatically translate to the user-requested language, which transcript do we choose to translate from (if there are multiple)? Depending on the transcript we are translating from, the quality of the output will vary. A few things to consider:

  • I generally expect manually generated transcripts to be of higher quality than ASR transcripts. However, there is no data on the average quality of manually generated transcripts on YouTube, so I can not verify this. With how good modern ASR models have become, I could also imagine ASR transcripts being more reliable (on average) for high-resource languages like English, while being less reliable for low-resource languages.
  • Translating from high-resource languages (English, German, French, etc.) will most likely yield the best quality results. So they should probably be prioritised. However, this could conflict with prioritising manually generated transcripts.

So which heuristic for choosing the transcript to translate from is most likely to yield the highest quality transcript? Any thoughts on this?

jdepoix avatar Jan 02 '23 11:01 jdepoix
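
The trade-off jdepoix describes can be made concrete with a small ranking function. This is only a sketch under stated assumptions, not code from youtube-transcript-api: `Candidate`, `rank_sources`, and the `HIGH_RESOURCE` tuple are hypothetical names, and which languages count as high-resource is a guess for illustration. The `prefer_manual` flag decides which of the two criteria wins when they conflict.

```python
from dataclasses import dataclass

# Assumption: an illustrative guess at "high-resource" languages.
HIGH_RESOURCE = ("en", "de", "fr")


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one transcript a video offers."""
    language_code: str
    is_generated: bool  # True for ASR, False for manually created


def rank_sources(candidates, prefer_manual=True, high_resource=HIGH_RESOURCE):
    """Order candidates from most to least preferred translation source."""
    def key(candidate):
        generated = 1 if candidate.is_generated else 0
        primary = candidate.language_code.split("-")[0]
        low_resource = 0 if primary in high_resource else 1
        # Tuple order settles the conflict between the two criteria.
        if prefer_manual:
            return (generated, low_resource)
        return (low_resource, generated)

    return sorted(candidates, key=key)
```

With a manual Turkish transcript and an auto-generated English one, `prefer_manual=True` ranks Turkish first and `prefer_manual=False` ranks English first, which is exactly the conflict raised in the second bullet.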

@jdepoix First of all, I don't know what it means to translate transcripts, but the ASRs created in Turkish were understandable, if not completely accurate.

ghost avatar Jan 04 '23 10:01 ghost

Hi, IMHO the problem occurs when the main language of the video is not English. @toprak, @pasdesinfos and I are talking about adding an option (or doing it automatically) to get the original generated subtitles of the source video, not about translating them. If you try any Spanish video like this one: https://youtubetranscript.com/?v=Dby0_0vdr30 you will see the error. In the CLI tool you have to set the language to Spanish to get the correct transcript.

Hope this explains the use case, best regards!

erseco avatar Jan 05 '23 10:01 erseco

Hi @toprak and @erseco, I think what you are asking for is something different and it already is documented as a feature request in #133. To my understanding, @pasdesinfos is asking about a feature where the transcripts are automatically translated to the requested language if no transcripts are available in that language. Could you maybe clarify @pasdesinfos to make sure we are on the same page here?

jdepoix avatar Jan 05 '23 10:01 jdepoix

Hi @jdepoix @toprak @erseco,

I trust everything is well.

That's right, @jdepoix. For instance, for the video https://youtu.be/BOKqyl0VT7A, the output at https://youtubetranscript.com/?v=BOKqyl0VT7A indicates "No transcripts were found for any of the requested language codes: ('en',)", although it also shows that "transcripts are available in the following languages: (MANUALLY CREATED) None (GENERATED) - fr ("French (auto-generated)")[TRANSLATABLE]".

Could the heuristic be to obtain, by default, the auto-translated English version whenever a GENERATED transcript exists and is TRANSLATABLE? The output ":( Unknown error" would then appear only when no transcripts exist at all.

Kind regards to everyone!

pasdesinfos avatar Jan 08 '23 05:01 pasdesinfos
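
The fallback pasdesinfos proposes could look roughly like the sketch below. It is not part of the library: `fetch_with_translation_fallback` and its `allow_translation` flag are hypothetical, and the flag defaults to off to follow jdepoix's point that translation should be opt-in. The function only assumes that each transcript object exposes `language_code`, `is_generated`, `is_translatable`, `translate()` and `fetch()`, which is how the library's transcript objects behaved around the time of this thread.

```python
def fetch_with_translation_fallback(transcripts, languages=("en",),
                                    allow_translation=False):
    """Fetch a transcript in a requested language, optionally via translation.

    `transcripts` is any iterable of objects with `language_code`,
    `is_generated`, `is_translatable`, `translate(code)` and `fetch()`.
    """
    transcripts = list(transcripts)

    # 1. Prefer a transcript that already exists in a requested language.
    for code in languages:
        for transcript in transcripts:
            if transcript.language_code == code:
                return transcript.fetch()

    # 2. Opt-in: translate a generated, translatable transcript instead.
    if allow_translation and languages:
        target = languages[0]
        for transcript in transcripts:
            if transcript.is_generated and transcript.is_translatable:
                return transcript.translate(target).fetch()

    raise LookupError(f"no transcript available for {tuple(languages)}")


# Assumed wiring with the static API from the period of this thread:
#   from youtube_transcript_api import YouTubeTranscriptApi
#   transcripts = YouTubeTranscriptApi.list_transcripts("BOKqyl0VT7A")
#   fetch_with_translation_fallback(transcripts, allow_translation=True)
```

For the French-only video above, the call would raise `LookupError` by default and return the English translation of the French ASR transcript once `allow_translation=True` is passed.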

Hi all. Same here, only if the YT source isn't in EN. As mentioned, just a selector can handle it.

p-toni avatar Jun 27 '23 23:06 p-toni

Hi @jdepoix, @toprak, @erseco, @toniseldr,

I wanted to take a moment to express my heartfelt gratitude to each of you for your invaluable contributions, unwavering dedication, commitment, and hard work. Your efforts have truly made a significant impact in making lives more wonderful. 🙏🎉

I mean, let's be honest here, without your brilliance, I'd probably be lost in a sea of confusion and chaos. 🌊😅

With self-deprecating humor and sincere appreciation, @pasdesinfos 😄🙌

pasdesinfos avatar Jun 28 '23 12:06 pasdesinfos

Hi @pasdesinfos,

thank you very much for the kind words! 😊

However, this hasn't been implemented so I think it is okay for the ticket to stay open. Although I am not actively working on this, it might be something that someone wants to contribute to!

jdepoix avatar Jun 28 '23 14:06 jdepoix

Hi, I'm getting the same error with the following video: https://www.youtube.com/watch?v=EtpRcefOD6M even if I specify the correct language 'de' in the languages parameter:

from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

loader = YoutubeTranscriptReader()
documents = loader.load_data(
    ytlinks=['https://www.youtube.com/watch?v=EtpRcefOD6M'],
    languages=["de", "en"],
)

Do you have any idea how this can be solved?

MarouaneZhani avatar Aug 11 '24 22:08 MarouaneZhani

Hi @MarouaneZhani, what is the exact error message you are getting?

jdepoix avatar Aug 12 '24 08:08 jdepoix

Hi @jdepoix,
Sorry, I already got it running by using "de-DE" in languages. The error I was getting was: Could not retrieve a transcript for the video https://www.youtube.com/watch?v=EtpRcefOD6M This is most likely caused by: No transcripts were found for any of the requested language codes.

I saw somewhere in the error that the available language code was something like "de-DE", and it worked after I tried it!

Thanks Marouane

MarouaneZhani avatar Aug 12 '24 09:08 MarouaneZhani
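
Marouane's fix suggests that requested codes are compared literally, so "de" does not match a transcript listed as "de-DE". A caller could work around this by expanding base codes against the codes a video actually offers before making the request. The helper below is a hypothetical sketch (`resolve_language_codes` is not a library function), and it assumes the available codes have already been read from the error message or from the transcript list.

```python
def resolve_language_codes(requested, available):
    """Map requested codes onto the codes a video actually offers.

    An exact match wins; otherwise any available code that shares the
    primary subtag (the part before the first "-") is used.
    """
    resolved = []
    for code in requested:
        if code in available:
            matches = [code]
        else:
            primary = code.split("-")[0].lower()
            matches = [a for a in available
                       if a.split("-")[0].lower() == primary]
        for match in matches:
            if match not in resolved:  # keep request order, drop repeats
                resolved.append(match)
    return resolved
```

For the video above, `resolve_language_codes(["de", "en"], ["de-DE"])` yields `["de-DE"]`, which can then be passed as the `languages` argument.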