Youtube-dl is broken

Open hook321 opened this issue 7 years ago • 5 comments

hook321 avatar Oct 01 '17 18:10 hook321

I don't mean to be a stickler, but are you sure it is broken and not merely out of date? The youtube-dl developers release new versions far more often than anyone updates pipelines, so we should probably add a cron job that updates it. They release so often because YouTube keeps changing how videos have to be downloaded, which makes old versions brittle.
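
As a rough illustration of the "out of date" check such a cron job could run, here is a minimal Python sketch. It relies on youtube-dl's date-style version strings (for example `2017.10.01`); the function names and the 14-day threshold are hypothetical choices, not anything ArchiveBot ships.

```python
from datetime import date


def release_date(version: str) -> date:
    """Parse a date-style version string such as '2017.10.01'.

    Any extra components after the day (e.g. '2017.10.01.1') are ignored.
    """
    year, month, day = (int(part) for part in version.split(".")[:3])
    return date(year, month, day)


def is_stale(version: str, today: date, max_age_days: int = 14) -> bool:
    """Return True if the release is older than max_age_days (assumed threshold)."""
    return (today - release_date(version)).days > max_age_days


if __name__ == "__main__":
    # Hypothetical installed version, checked against the date this issue was opened.
    print(is_stale("2017.06.01", date(2017, 10, 1)))
```

A pipeline could feed this the output of `youtube-dl --version` and trigger an upgrade only when the check returns true.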

falconkirtaran avatar Oct 01 '17 19:10 falconkirtaran

@falconkirtaran Yes, it is broken at least on your pipelines. This is due to falconkirtaran/wpull@56138077, where you removed support for CONNECT. As a result, tunnelling of youtube-dl doesn't work and results in this error in the wpull log (from job aizndzerpuqfaxa5350rx0jno):

2017-06-06 05:17:24,967 - wpull.processor.coprocessor.youtubedl - WARNING - ERROR: Unable to download webpage: <urlopen error Tunnel connection failed: 501 CONNECT is intentionally not supported> (caused by URLError(OSError('Tunnel connection failed: 501 CONNECT is intentionally not supported',),))
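
To check other jobs for the same failure, something like the following Python sketch could scan a wpull log. The line layout in the regular expression is inferred only from the single log line quoted above, and the function name is hypothetical, so treat it as a starting point rather than a parser for every wpull log format.

```python
import re

# Layout assumed from the quoted line: "<timestamp> - <logger> - <level> - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>\w+) - (?P<msg>.*)$"
)
TUNNEL_FAILURE = re.compile(r"Tunnel connection failed: (?P<status>\d{3})")


def find_tunnel_failures(lines):
    """Yield (timestamp, status) for youtube-dl coprocessor tunnel failures."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if not match or not match.group("logger").endswith("youtubedl"):
            continue
        failure = TUNNEL_FAILURE.search(match.group("msg"))
        if failure:
            yield match.group("ts"), int(failure.group("status"))


if __name__ == "__main__":
    sample = (
        "2017-06-06 05:17:24,967 - wpull.processor.coprocessor.youtubedl - "
        "WARNING - ERROR: Unable to download webpage: <urlopen error Tunnel "
        "connection failed: 501 CONNECT is intentionally not supported>"
    )
    print(list(find_tunnel_failures([sample])))
```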

I don't think it works on other pipelines either, potentially due to the issue you tried to fix with that commit (I don't think anyone else is using your wpull fork on their pipeline), but I haven't checked that systematically. Admittedly, I don't know what a successful youtube-dl invocation would look like in the archives exactly.

On a somewhat related note, PhantomJS scrolling appears to be broken on your pipelines as well. For example, Twitter jobs for a user (i.e. !a https://twitter.com/username --phantomjs-scroll 50000) never retrieve the second page and therefore only archive a handful of tweets instead of the entire account. @Asparagirl and I confirmed that it works correctly on her pipelines (though it might still not grab all tweets due to an unrelated issue); I didn't test other pipelines.

JustAnotherArchivist avatar Oct 06 '17 12:10 JustAnotherArchivist

I had no idea that it requires the CONNECT verb. Perhaps this is a compound fault; it does require regular updates to work too. I think a tcpdump of the first bit of a working youtube-dl session is in order.
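
Short of a full tcpdump, a small probe can show how a proxy answers the CONNECT request that an HTTPS client sends first. The Python sketch below is only an illustration: `probe_connect`, its default target, and any proxy address passed to it are hypothetical, and it does not reproduce what youtube-dl itself sends beyond the CONNECT request line.

```python
import socket


def probe_connect(proxy_host, proxy_port, target="www.youtube.com:443", timeout=5.0):
    """Send a bare CONNECT request to a proxy and return (status_code, reason)."""
    request = (
        f"CONNECT {target} HTTP/1.1\r\n"
        f"Host: {target}\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.create_connection((proxy_host, proxy_port), timeout=timeout) as sock:
        sock.sendall(request)
        # Only the status line is needed to tell 200 (tunnel opened) from 501.
        with sock.makefile("rb") as reader:
            status_line = reader.readline().decode("iso-8859-1").strip()
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or not parts[1].isdigit():
        raise ConnectionError(f"unexpected proxy response: {status_line!r}")
    reason = parts[2] if len(parts) > 2 else ""
    return int(parts[1]), reason
```

Pointed at a proxy that refuses tunnelling, this should return a 501 with the same reason text as in the log above; a proxy that supports CONNECT would normally answer with a 2xx status instead.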

I don't see how this could be related to issues with phantomjs. Also, I believe most or all the pipelines are using my wpull fork because wpull upstream is essentially abandoned and the latest upstream commit is so crashy no pipeline would stay running for more than a few minutes.

falconkirtaran avatar Oct 06 '17 19:10 falconkirtaran

Yes, #217 is definitely still very relevant.

I don't see how this could be related to issues with phantomjs.

Not in the technical sense. I just wanted to say "youtube-dl isn't the only coprocessor of wpull which is currently not working as intended".

Also, I believe most or all the pipelines are using my wpull fork

Ah. I see now that the requirements.txt here references your fork directly. I thought it just listed "wpull", which would grab 2.0.1 from PyPI. I'm pretty sure, though, that I saw youtube-dl error messages other than the "CONNECT not supported" one in some wpull logs; with your fork, I'd only expect that message... But it's been a while since I looked into this, so I don't remember the details.

Yeah, 2.0.1 is basically unusable. I'm still using 1.2.3 for my own stuff because of that. My URL prioritisation implementation is based on 2.0.1 though; I hope to finish that soonish.

JustAnotherArchivist avatar Oct 06 '17 21:10 JustAnotherArchivist

Upstream issue: ArchiveTeam/wpull#392

JustAnotherArchivist avatar Oct 10 '18 15:10 JustAnotherArchivist