youtube-transcript-api
get_transcript not working
Hi, I'm in a Linux environment and have verified the following before raising this issue:
- The version in use is youtube-transcript-api==0.4.1
- From my local Ubuntu 18.04.5 LTS, I'm able to get desired response for https://www.youtube.com/watch?v=FStcqEIH9G0
- Hitting the API from the Ubuntu servers set up for the dev and prod environments does not return any transcripts. (https://github.com/jdepoix/youtube-transcript-api/issues/74)
- I tried ping youtube.com from my dev and prod envs and both of them seem to default to an IPv6 address as can be seen in the screenshot:
@jdepoix Any idea what could be the reason?
Thanks!
Hi @salonygupta76, could you maybe try and run curl 'https://www.youtube.com/watch?v=FStcqEIH9G0' from your server, upload the resulting HTML somewhere and post the link here?
Also, since you posted that specific video, is the error only on that video or every video you try?
@jdepoix No, I'm getting the issue with every video with an available transcript that I try. Here's the result of running the above command: link
@salonygupta76 are you sure you pulled that html from the same host the module was failing on? The html you uploaded seems just fine and I have no problem extracting transcripts from it.
I've exposed this service as an API in dev environment and I'm trying to query the same using Postman from my local system. The response html is shared after ssh'ing into dev and running curl command that was shared by you.
Just so that I am understanding you correctly: curling YouTube from your local machine returns the same html as curling YouTube from your server, yet this module works on your machine, but not on your server, right? That seems really odd! What python version are you using btw? And could you please post the exact error message which is returned by this module.
@jdepoix
Hello, I don't know if it's the right place to post, if no, I am very sorry, maybe I should have created a separate issue. But I have problems with your tool using it from the EU zone countries when I do it from my command line. When I do with EU countries VPN on the sites like replit or pythonanywhere, it works fine (it sends requests from their IPs I suppose). When I use VPN for out-of-EU country, it works fine. When I do from the EU (and my friends), it doesn't work. Maybe they've applied some law about it because I have troubles with YouTube tool downloading live chats as well.
@vanyamlb could you please explain your infrastructure a bit more, I am not sure if I am understanding you correctly. Also, what do you mean by live chats? This module only supports transcripts. And what version of this module are you using?
- Windows 10, 21H1, 64 bit. Python 3.9.1.
- I meant that I have another YouTube tool for live chats from another developer (not connected with this topic), but it also stops working in the same conditions.
- I live in Ukraine (so out of the EU sadly). Your tool (last version) and another tool for live chats from another developer work fine together, but once I turn EU country's VPN on - they stop working. When I change VPN to, for example, Israel - tools start working again.
- The same happens with my friends who actually live in that countries (for example, Germany and the UK).
- Thus, I came to the conclusion that there is some law about personal data maybe in the EU or something like that which changed the tools' behavior.
Date when the tools stopped working: the beginning of this year's April.
@jdepoix
Just checked everything again.
1 - under Italy's VPN or any other EU
2 - under Israel's VPN or just my own Ukrainian IP (or any other out-of-EU)
Well, I live in Germany and I don't have a problem using this, so I'm pretty sure it's not a problem with EU law 😄 Sounds like it's a problem with your VPN to me. What happens if you just open youtube.com from one of those VPNs where it's not working?
@jdepoix the problem is that I am not alone, my friend from Germany has this issue too and friends from other EU countries :/ (without a VPN) If it was only with VPN, then sure I would think about that... Maybe it's IPS blocking it or youtube...? I don't know the explanation... I checked YouTube with my VPN and it worked fine... But once more, when I use the same VPN app for non-EU country - i get the tool working! :/
just wondering... could you please check one thing for me? my friend developed his own tool based on yours (it's called yxd and is installed through pip as well). with it, you can scan an entire channel to get transcripts using YouTube API key for video listing + your tool (by the way there was an issue about scanning the entire channel, you can tell that person she can use it). you just enter yxd, then enter your API, then enter yxd -c linktothechannel --first=10 and it starts downloading. Just interesting if it works for you living in Germany. Thanks in advance!
(if it says transcript unavailable while there is one, then it doesn't work, but if your tool works then it should lol)
I'm sorry @vanyamlb but I can't provide support for other modules. However, I might be able to help you if you upload the HTML you receive when accessing any given video (with subtitles) on youtube.com through curl or a browser.
@jdepoix with curl just got too many requests errors (429) with VPN :/ wondering why did they block the IP and how to avoid that..
@vanyamlb probably you're sharing an IP address with other users of that VPN and that IP has been blocked because of too many requests. Is there any way to change the IP address?
@jdepoix ok I realize that with VPN that's possible... but people who used this for themselves with their IPS and got the block (while I used it so so much too and didn't get it :/)... Maybe they have static IPs while I have dynamic... do you know any way to get your IP unlocked without changing IP address?
yes, when I changed, it started working, but idk how to change it within the country to check if that will help (but looks like it should)
Unfortunately, there is no way I know of to get around the block without changing the IP or simply waiting until the block gets removed. So there's not really anything you can do here.
I guess this issue lost track a bit. @salonygupta76 any news on your end?
@salonygupta76 any news? Otherwise I will close this issue.
Hey @jdepoix Sorry about the delay in getting back. Unfortunately, at this point of time, I'm unsure what the root cause could be. For a video like this: url (where a transcript exists), sometimes the API simply throws this error instead of retrieving it:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=-em-_gFlDfQ! This is most likely caused by:
The video is no longer available
If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem! - -em-_gFlDfQ
Note that this happens only in case of trying to access the code available on one of dev/prod environments (through an API) and not while testing on my local environment.
Also, possibly important: sometimes simply rebuilding the project from Jenkins resolves the problem.
@salonygupta76 where did you deploy your application? Maybe you are running into a problem similar to what @vanyamlb is describing?
@jdepoix I'm deploying the code in Linux environments as a Flask application. I've been capturing the error tracebacks for a while now and some of them do look like the one @vanyamlb shared above. I'm even using proxies.
@salonygupta76 but what infrastructure are you hosting your application on? If it is a cloud provider like GCP, AWS etc., it is likely that you are sharing a public IP with other users and are therefore being blocked by YouTube. What happens when you curl https://www.youtube.com/watch?v=-em-_gFlDfQ from that environment while it is throwing that error? Do you get a 429 as well? Otherwise, could you maybe upload the returned html so that I can have a look at it?
@jdepoix Infra is AWS. Right now, the block has been lifted and I'm able to get results. Can share Curl response when it reverts to throwing errors.
Hey @jdepoix , facing the issue yet again and error is mostly "Video is not available..." one when trying your code.
When I hit curl -L https://www.youtube.com/watch?v=-em-_gFlDfQ, I get the following as response:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.youtube.com/watch?v=-em-_gFlDfQ</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="rnH4HjXBSULi9Z3KDh5b_oQ9pgFANNtapJKcTK-CjTXBg8Hqc9N8hByEEhbopeLD7xbVzfe7oU7OpTu2BP-qMb83fsobbLndnTRr7AeMtdfr4xMa_to3VWg8EcfI33aWd52OwNaJVeDnOCdveOlL-WN5BgA8hH-srYfpjrhxv10PbtXDkvAFHkspxsQ40iQm5wnjZjtABLJaV6Pulwc3FGYsbviqJYwUyBaobFE"></div>
<input type='hidden' name='q' value='EgQNOG6qGLnCsIcGIhBk7JGbIK-s219AkHLO2dTFMgFy'><input type="hidden" name="continue" value="https://www.youtube.com/watch?v=-em-_gFlDfQ">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">
<div style="font-size:13px;">
<b>About this page</b><br><br>
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>
<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>
IP address: DEV_IP_ADDRESS}<br>Time: 2021-07-12T11:02:18Z<br>URL: https://www.youtube.com/watch?v=-em-_gFlDfQ<br>
</div>
</div>
</body>
</html>
Is this the curl response you're looking for? There seems to be some request limit, which when surpassed throws this error.
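A response like the one above can be recognised programmatically before any transcript parsing is attempted. The following is only a minimal sketch: the function name is hypothetical, and the two markers are simply taken from the HTML posted above, so they may change whenever Google changes that page.

```python
def looks_like_captcha_block(html: str) -> bool:
    """Heuristic check for the "unusual traffic" CAPTCHA page.

    The markers are copied from the curl response in this thread; they are
    an assumption about what that page looks like, not a stable contract.
    """
    markers = ('class="g-recaptcha"', 'id="captcha-form"')
    return any(marker in html for marker in markers)
```

Feeding it the body returned by curl (or by any HTTP client) from the affected host should tell a block apart from a video that is genuinely unavailable.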
Hi @salonygupta76, thank you very much for the detailed information. That is exactly what I was looking for. Unfortunately, this confirms my assumption that you are being blocked by YouTube. The only way to work around this is to
- Manually solve the captcha in a browser, then export the cookie and use it for future requests
- Use a different IP address
- Wait until the ban on your IP has been lifted
I am aware that none of these solutions are great, but it's all we can do unfortunately (at least afaik). While this doesn't directly solve your problem, I could at least use this html to make sure a more suitable error is raised in this case. Maybe you could catch this error and implement a sleep, while hoping that the ban will be lifted in the meantime (unfortunately I haven't been able to figure out how long you'll have to wait until bans get lifted and I feel like it's not really consistent). If you want to implement a more sophisticated solution you could catch the error and trigger a change of IP address, which definitely will be the most reliable solution, but also the most expensive one.
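The "catch the error and sleep" idea could look roughly like the sketch below. It is not part of this module: `fetch_with_backoff` and its parameters are hypothetical, the delays are arbitrary guesses (nobody in this thread knows how long a ban lasts), and catching a bare `Exception` is a simplification that should be narrowed to the errors you actually see.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    initial_delay: float = 60.0,   # seconds; an arbitrary starting point
    factor: float = 2.0,           # each wait is `factor` times the previous one
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds, sleeping longer after every failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            # Out of attempts: let the caller see the last error.
            if attempt == max_attempts:
                raise
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")
```

With this module installed, `fetch` could be something like `lambda: YouTubeTranscriptApi.get_transcript(video_id)`. The `sleep` parameter exists only so the waiting can be replaced in tests.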
@jdepoix Yes, I've thought of these solutions, IP rotation in particular.
Inducing a sleep could work if we know which quota limit is specifically in play here, that is, requests per min or requests per day or any other. Do you have any idea which of these leads to our issue?
I tried making requests via a Proxy (in the methods exposed where proxies is None by default) but in vain. My assumption is that they're not even being used as they're supposed to be because my specific IP is being captured and blocked.
Is there any way I could see what request payload is being sent to YouTube using your package? (Like we do with requests by printing out their request headers and data attributes)
@salonygupta76 Unfortunately, I don't know how long the "sleeping interval" would have to be to be sufficient. You'd have to play around with that. But if you do so, I would greatly appreciate if you could share your findings!
If you want to look deeper into the requests which are being sent, you'll have to check out the code and add some logs or run it in a debugger. More specifically, youtube_transcript_api._transcripts.TranscriptListFetcher._fetch_html is the method which does the actual request, so if you want to log the response, that's where you'll have to look.
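A less invasive option than editing the package might be to turn on the standard library's HTTP debug output before calling it. This is a sketch under the assumption that the requests go through `requests` (and therefore `urllib3` and `http.client`); the helper name is hypothetical.

```python
import http.client
import logging


def enable_http_debug_logging() -> None:
    """Print outgoing request lines and headers for http.client-based clients."""
    # http.client writes the raw request and response headers to stdout
    # when debuglevel is greater than zero.
    http.client.HTTPConnection.debuglevel = 1
    # urllib3 reports connection details (host, proxy, status) via logging.
    logging.basicConfig(level=logging.DEBUG)
    logging.getLogger("urllib3").setLevel(logging.DEBUG)
```

Calling `enable_http_debug_logging()` once at startup should show which host each request is actually sent to, which may help to verify whether a configured proxy is really being used.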
I am using the statement transcripts=YouTubeTranscriptApi.get_transcripts((video_id)); in a for loop; however, there are a few videos which have their id disabled. Is there a success or failure call for YouTubeTranscriptApi.get_transcripts((video_id))?
@AaditBhatia what do you mean by success/failure call? You can simply wrap your call in a try/except and ignore the exception.
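A minimal sketch of that try/except loop, assuming you want to keep track of which videos failed instead of silently dropping them. `fetch_all` is a hypothetical helper, and `fetch` stands for whatever call retrieves a single transcript.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def fetch_all(
    video_ids: Iterable[str],
    fetch: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Fetch every video, collecting results and errors separately."""
    transcripts: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = fetch(video_id)
        except Exception as error:
            # Remember the failure and carry on with the next video.
            failures[video_id] = error
    return transcripts, failures
```

With this module, `fetch` could be `YouTubeTranscriptApi.get_transcript`; the `failures` dictionary then plays the role of the "failure call", and everything in `transcripts` succeeded.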
I will close this issue now, as there isn't really much we can do here and the discussion went off the rails a bit.