[Invidious] Add new extractor

Open OverShifted opened this issue 1 year ago • 14 comments

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • [x] I am the original author of this code and I am willing to release it under Unlicense
  • [ ] I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • [ ] Bug fix
  • [ ] Improvement
  • [x] New extractor
  • [ ] New feature

Description of your pull request and other information

Add a new extractor which is able to download from Invidious instances, since the Youtube extractor isn't able to download from Invidious correctly.

OverShifted avatar Dec 15 '22 16:12 OverShifted

Thanks, but I doubt that this is a good solution.

The existing YT extractor knows about a whole lot of Invidious instances. I believe that your problem is just that the list of instances in extractor/youtube.py doesn't include the ones you want. Creating a separate extractor with another unmaintainable list will just make it worse. Or is there some way in which the extraction in the YT module is unsatisfactory for the currently supported IV sites?

See also #29885. The discussion there is now really of historical interest, though (and also the linked PR) because yt-dlp has now implemented a page-based extraction system in the generic extractor to handle these cases (Invidious, PeerTube, etc). yt-dl will eventually pull this in instead of the original PR, so as to maximise commonality and avoid incompatible reinvention.

dirkf avatar Dec 17 '22 00:12 dirkf

It seems like the youtube extractor sends at least one request to youtube. I've added r'(?:www\.)?yt\.artemislena\.eu' to _INVIDIOUS_SITES. And also added print("self._downloader.urlopen called with", url_or_request) before this line. And here is the output:

$ python -m youtube_dl --verbose -F https://yt.artemislena.eu/watch\?v\=BaW_jenozKc
[debug] System config: []
[debug] User config: []
[debug] Custom config: []
[debug] Command-line args: ['--verbose', '-F', 'https://yt.artemislena.eu/watch?v=BaW_jenozKc']
[debug] Encodings: locale UTF-8, fs utf-8, out utf-8, pref UTF-8
[debug] youtube-dl version 2021.12.17
[debug] Git HEAD: a784be739
[debug] Python version 3.10.8 (CPython) - Linux-6.0.12-arch1-1-x86_64-with-glibc2.36
[debug] exe versions: ffmpeg 5.1.2, ffprobe 5.1.2, rtmpdump 2.4
[debug] Proxy map: {}
[youtube] BaW_jenozKc: Downloading webpage
self._downloader.urlopen called with https://www.youtube.com/watch?v=BaW_jenozKc&bpctr=9999999999&has_verified=1

This behavior can be problematic in environments with limited access to youtube itself.
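For illustration only, here is a rough sketch of how a host pattern like the one added above can be checked against a URL. The names INVIDIOUS_SITES and is_known_instance are hypothetical and this is not how youtube.py is actually wired up (there the patterns feed into the extractor's URL regex); it only shows that matching the host says nothing about which server is contacted afterwards.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for a list of instance host patterns;
# the single entry is the pattern quoted in the comment above.
INVIDIOUS_SITES = (
    r'(?:www\.)?yt\.artemislena\.eu',
)


def is_known_instance(url):
    """Return True if the URL's host matches one of the listed patterns."""
    host = urlsplit(url).hostname or ''
    return any(re.fullmatch(pattern, host) for pattern in INVIDIOUS_SITES)
```

Recognising the host only decides which extractor handles the URL; in the log above the request still goes to www.youtube.com.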

OverShifted avatar Dec 17 '22 07:12 OverShifted

But then won't the download (googlevideo.com) URLs also be inaccessible?

dirkf avatar Dec 17 '22 07:12 dirkf

In my case, no. That's why I implemented _patch_url. But even if googlevideo.com was accessible, I wouldn't be able to extract the download link in the first place.

OverShifted avatar Dec 17 '22 07:12 OverShifted

Normally a benefit of yt-dl vs using the YT web interface is to avoid the odious bloat of the latter while being able to capture a lot of the detailed metadata that comes with it.

If a user who has YT access wants yt-dl to process an Invidious page, going to YT directly can give a better result because (AFAIK) less rich metadata is available on the IV page. The API is another matter, but reproducing the details of the YT extractor using the IV API would be a massive task.

But if YT is blocked for the user, it would plainly be better to use the IV page instead. The problem is how to combine these tactics. For one-off uses, the IV page may have a download function, but yt-dl users are going to want a batchable solution.

And all these considerations apply equally for other YT front-ends, which seem to be proliferating.

dirkf avatar Dec 17 '22 13:12 dirkf

IMHO, when a user gives yt-dl an invidious link, he/she probably wants to download from invidious servers. Because otherwise, he/she could just "convert" that to a youtube link. (just replace the host with youtube.com)
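A minimal sketch of that "conversion", assuming the Invidious URL uses the same path and query layout as YouTube (as the watch URL in the log above does). The helper name to_youtube_url is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def to_youtube_url(url):
    """Swap the host of a watch URL for www.youtube.com, keeping path and query."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(scheme='https', netloc='www.youtube.com'))


print(to_youtube_url('https://yt.artemislena.eu/watch?v=BaW_jenozKc'))
# prints https://www.youtube.com/watch?v=BaW_jenozKc
```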

OverShifted avatar Dec 17 '22 15:12 OverShifted

IMHO, when a user gives yt-dl an invidious link, he/she probably wants to download from invidious servers. Because otherwise, he/she could just "convert" that to a youtube link. (just replace the host with youtube.com)

If this were a completely new feature, I would agree. But we have been auto-translating Invidious links to YouTube for a long time. This means many users would be expecting to get all the metadata YouTube provides even with an Invidious URL. Having the new extractor return less data is a regression. Perhaps an invidious: prefix could be supported, similar to teachable:

pukkandan avatar Dec 21 '22 13:12 pukkandan

Maybe, as there are other front-end sites for which the same issue could arise, we should introduce an option like --[no-]extract-page-only with no- being the default (surely not the best option name). Then an IV extractor could check this and by default punt to self.url_result('https://www.youtube.com/watch?v=' + video_id, ie='Youtube'); or if --extract-page-only it could go ahead and extract the IV page without touching YT.
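Roughly, the dispatch could look like the sketch below. This is standalone code, not a real extractor: url_result here is a stand-in that only mimics the shape of the result returned by the extractor helper of the same name, handle_invidious and the extract_page_only parameter are hypothetical, and the page-extraction branch is just a placeholder.

```python
def url_result(url, ie=None):
    # Stand-in for the extractor helper: a "go and extract this URL
    # with that extractor" result.
    return {'_type': 'url', 'url': url, 'ie_key': ie}


def handle_invidious(video_id, extract_page_only=False):
    """Decide whether to punt to the YouTube extractor or stay on the IV page."""
    if not extract_page_only:
        # Default: hand the video over to the YouTube extractor.
        return url_result(
            'https://www.youtube.com/watch?v=' + video_id, ie='Youtube')
    # Placeholder: a real implementation would parse the Invidious page here
    # and never contact YouTube.
    return {'_type': 'video', 'id': video_id, 'extracted_from': 'invidious-page'}
```

With the default, handle_invidious('BaW_jenozKc') yields a redirect-style result pointing at youtube.com; with extract_page_only=True nothing refers to YouTube at all.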

This might also apply where a site has links and metadata in the page but could also use some API URL(s) for more metadata and formats, whether to avoid blocked URLs or increase extraction speed.

dirkf avatar Dec 22 '22 03:12 dirkf

This means many users would be expecting to get all the metadata youtube provides even with a invidious URL.

If that were the case, why would they use an invidious url?

Having the new extractor return less data is a regression.

I struggle to see how performing the expected behaviour is a regression. Invidious is always going to be worse than YouTube, but that doesn't mean people who pass Invidious URLs expect their URLs to be silently converted to YouTube URLs.

we should introduce an option like --[no-]extract-page-only with no- being the default

That seems reasonable, although I think there should be a warning if someone passes an invidious url with neither option, and people can silence that warning by explicitly using --no-extract-page-only (I don't know if that's actually possible to implement)

This might also apply where a site has links and metadata in the page but could also use some API URL(s) for more metadata and formats, whether to avoid blocked URLs or increase extraction speed.

Wanting to avoid Google feels like a completely different use case from not wanting to download from the API of the website you're using.

gamer191 avatar Mar 11 '23 10:03 gamer191

Having the new extractor return less data is a regression.

If this is an issue (and imo it's not) perhaps the new invidious extractor should be limited to new instances (that aren't in youtube.py)

gamer191 avatar May 07 '23 10:05 gamer191

Having the new extractor return less data is a regression.

Recently I wanted to download a video from one of the Invidious servers. I was very surprised when it redirected to YouTube. :)

This behavior can be problematic in environments with limited access to youtube itself.

krasnh avatar Dec 17 '23 10:12 krasnh

Having the new extractor return less data is a regression.

If this is an issue (and imo it's not) perhaps the new invidious extractor should be limited to new instances (that aren't in youtube.py)

@gamer191 Invidious will always return less data than YouTube, regardless of which version of Invidious you use. It also doesn't support things like multiple audio tracks and subtitle translation (they have to use the Innertube transcript API endpoint, which doesn't support translating, and convert the response to WebVTT, as the publicly listed instances get ratelimited on YouTube's subtitle endpoint). The format list -F would also be useless if you built it based on the Invidious API, as it returns hardcoded dimensions based on the itag (most noticeable for vertical videos, because the dimensions will be horizontal).

absidue avatar Jan 08 '24 05:01 absidue

Consider the two use cases:

  1. I want to access YouTube content through Invidious without ever directly interacting with YT servers, because they are unreachable for me, or because I dislike them, or whatever.
  2. I want to access YouTube content through Invidious because I got a link to Invidious and actually had no idea that it was anything to do with YouTube.

Since the second case was trivial, if tiresome, to support, that's what happened.

Arguably the first case should have been given priority, since it would have supported users who need (or want) Invidious to act as a proxy, and so are content with whatever limitations that implies.

dirkf avatar Jan 08 '24 11:01 dirkf

I do agree that passing an Invidious URL should download from Invidious; I just wanted to point out that there is significantly less usable metadata than you might have thought at first. So you'll have to decide either to show the incorrect metadata that Invidious returns or not to show it at all, and in either case you are likely to get user complaints.

My point is that while the change does seem like a good idea, it will be a breaking change, which you'll want to mention clearly in the changelog and potentially even log a warning message for a while.

Think of it from a user's perspective: if you upgrade youtube-dl and suddenly your format filter/selector no longer works because height, width and fps are unavailable or completely incorrect, you would want to be clearly informed during downloading why that is happening.
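To make that concrete, here is a simplified model of a height filter (not youtube-dl's real selector code). It follows the documented rule that formats whose value is unknown are excluded unless the filter is written with a question mark, e.g. [height<=?720]. The function name and the sample format list are invented for the example.

```python
def filter_by_max_height(formats, max_height, keep_unknown=False):
    """Keep formats whose height is at most max_height.

    keep_unknown models the '?' variant of the filter: formats with no
    height are kept only when it is True.
    """
    kept = []
    for fmt in formats:
        height = fmt.get('height')
        if height is None:
            if keep_unknown:
                kept.append(fmt)
            continue
        if height <= max_height:
            kept.append(fmt)
    return kept


# Invented sample data: format 'b' has no height information.
formats = [
    {'format_id': 'a', 'height': 720},
    {'format_id': 'b', 'height': None},
    {'format_id': 'c', 'height': 1080},
]

print([f['format_id'] for f in filter_by_max_height(formats, 720)])
# prints ['a']
print([f['format_id'] for f in filter_by_max_height(formats, 720, keep_unknown=True)])
# prints ['a', 'b']
```

If every format came back without a height, the strict filter would match nothing, which is the kind of silent breakage a warning should explain.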

absidue avatar Jan 20 '24 08:01 absidue