youtube-dl
[TikTok] Support Sigi-type pages, etc
Please follow the guide below
Before submitting a pull request make sure you have:
- [ ] Searched the bugtracker for similar pull requests
- [x] Read adding new extractor tutorial
- [x] Read youtube-dl coding conventions and adjusted the code to meet them
- [x] Covered the code with tests (note that PRs without tests will be REJECTED)
- [x] Checked the code with flake8
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
- [x] I am the original author of this code and I am willing to release it under Unlicense. Except: this PR subsumes PR #30224, whose author also affirmed this.
- [ ] I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)
What is the purpose of your pull request?
- [x] Bug fix
- [x] Improvement
- [ ] New extractor
- [ ] New feature
Description of your pull request and other information
TT switched (possibly partially) its framework from NextJS to Sigi, and the persisted state JSON sent in the page changed as a result. Instead of a `<script>` element with id `__NEXT_DATA__`, we get one with id `sigi_persisted_state` and JSON with a slightly different structure.
This PR deals with both types of page format, based on PR #30224 and this patch which gets more metadata.
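As a rough illustration of handling both page types, the sketch below looks for either script element and parses its body as JSON. It is not the PR's code: the function name `extract_page_state` is hypothetical, and it assumes that both elements carry the JSON directly as the script body.

```python
import json
import re

# Script ids named in the description; the exact attribute layout of the
# real pages is an assumption.
STATE_SCRIPT_IDS = ('__NEXT_DATA__', 'sigi_persisted_state')


def extract_page_state(webpage):
    """Return (script_id, parsed_json) for the first known state script.

    Returns (None, None) if neither script is present or parseable.
    """
    for script_id in STATE_SCRIPT_IDS:
        m = re.search(
            r'(?s)<script\b[^>]*\bid\s*=\s*["\']%s["\'][^>]*>(.*?)</script>'
            % re.escape(script_id),
            webpage)
        if not m:
            continue
        try:
            return script_id, json.loads(m.group(1))
        except ValueError:
            # Body was not valid JSON; try the next known id.
            continue
    return None, None
```

Returning the id along with the data lets the caller choose the right traversal for the "slightly different structure" of each format.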
Also, extraction could fail with a timeout (Error 60 in Windows, `SSLError('The read operation timed out',)` in Linux) or a connection reset (Error 54 in Windows) due to some weird blocking by whatever fronts TikTok's pages (Akamai, apparently). To download the page for parsing, a cookie has to be sent, and one way to get it is to make a prior request to the site. The extractor fetched https://www.tiktok.com/ before doing anything else. In yt-dlp, the code fetches the webpage itself twice, commenting that you get 403 otherwise. This PR copies that tactic, but instead of fetching the whole page (`GET` request) it just sends a `HEAD` request; if a page is actually returned, rather than an error with a `Set-Cookie` header, it doesn't actually have to be downloaded.
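The cookie-priming tactic described above can be sketched with the standard library alone. This is an illustration under assumptions, not the extractor's implementation: the helper names `make_opener` and `fetch_with_head_priming` are hypothetical, and the sketch assumes the cookie arrives in a `Set-Cookie` header on the first response, whatever its status.

```python
import http.cookiejar
import urllib.error
import urllib.request


def make_opener():
    """Build an opener whose cookie jar persists across requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with_head_priming(opener, url, timeout=20):
    """HEAD the page first to pick up cookies, then GET it for parsing."""
    head = urllib.request.Request(url, method='HEAD')
    try:
        # A HEAD response carries no body, so nothing is downloaded
        # even if the server answers with a real page.
        opener.open(head, timeout=timeout).close()
    except urllib.error.HTTPError:
        # An error status is tolerated here; any Set-Cookie header on it
        # should already have been stored by the cookie processor.
        pass
    resp = opener.open(urllib.request.Request(url), timeout=timeout)
    try:
        return resp.read()
    finally:
        resp.close()
```

Both requests go through the same opener, so whatever cookies the first response sets are sent with the second.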
Probably resolves #28741. Resolves #30251, #30432, #30439, #30445, #30454, #30470.
Finally, the non-working `TikTokUserIE` has been resurrected for accessing all the videos of a specific user.
Resolves #30174.
Patching hints, depending on your installation type (substitute PR number 30479 and file `youtube_dl/extractor/tiktok.py` as appropriate):
- https://github.com/ytdl-org/youtube-dl/pull/30184#issuecomment-990859585
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecomment-965418428
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecomment-966349844
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecomment-972929975
- https://github.com/ytdl-org/youtube-dl/issues/29326#issuecomment-981108888
Hi! After your patch has worked for several days, I am now encountering new problems (with the "vanilla" youtube-dl as well): #30538
Patrick
When will this be merged?
As observed in https://github.com/yt-dlp/yt-dlp/issues/3776#issuecomment-1155586954 the user pages are currently redirecting to a captcha more or less whatever we do wrt cookies and UAs.
In a browser with JS disabled and UA set to `Mozilla/5.0`, after clearing cookies for TT, a request to a user page gets the captcha page, and then reloading with the provided cookies opens the desired page. This doesn't happen with the extractor even with a delay between the two fetches.
Looks like every issue is about this, when will this get merged?
Do we think this will see the light of day? :D Was hoping to be able to use it for a little fun project!
Thanks
I think this is also outdated now. There is no `sigi_persisted_state` in the returned HTML.