youtube-dl
youtube-dl copied to clipboard
[twitter] Shorten title attribute for #29912
Please follow the guide below
- You will be asked some questions, please read them carefully and answer honestly
- Put an
x
into all the boxes [ ] relevant to your pull request (like that [x]) - Use Preview tab to see how your pull request will actually look like
Before submitting a pull request make sure you have:
- [x] Searched the bugtracker for similar pull requests
- [x] Read adding new extractor tutorial
- [x] Read youtube-dl coding conventions and adjusted the code to meet them
- [x] Covered the code with tests (note that PRs without tests will be REJECTED)
- [x] Checked the code with flake8
In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:
- [x] I am the original author of this code and I am willing to release it under Unlicense
- [ ] I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)
What is the purpose of your pull request?
- [x] Bug fix
- [ ] Improvement
- [ ] New extractor
- [ ] New feature
This request trims the standard Twitter ID field to workaround the bug reported in #29912
@dirkf, I believe I have achieved your request with a little less code. The uploader name is pretty strictly limited by Twitter, so I don't think we need to slice it. That means we can use a more straightforward slice like the one just pushed. Give it a look and let me know what you think.
@dirkf, those changes are in.
The flake errors in the test run appear to be in other files.
Apparently the default flake8 settings that the project uses changed in some way so that a keyword must be separated from the next token by a space, which is basically good but irritating. Otherwise GH somehow removed some spaces from the code, which seems unlikely and would be much worse.
If you update from the current master the 2 places where this caused problems are fixed.
I think I've done that.
imo, this should not be implemented.
- Would all extractors be expected to do this?
- title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. Output template should be used to control filename length instead. This is a poorly implemented bandaid.
As I see, changing the default outtmpl to something like %(title).200s-%(id)s.%(ext)s
would be much better than this solution.
Or, if the ...
is important, move this code to core
This extractor is unusual in that there is no title in the plain webpage, although a title and an og:title
are set by JS. The full tweet text is extracted in the description
. So:
- Would all extractors be expected to do this?
No. But Twitter was the source of some of the original issues complaining about filename length errors with the default output template, so this is an opportunity to avoid those in future, without expecting users to check for a FAQ that tells them to use %(title).100s
or whatever, and without changing the core code.
Instead of fixing the title, we could generate a warning if the title is "long", or possibly include the FAQ text in the error message if we could reliably detect an exception caused by an over-long filename (a generic "solution" in place of #29989).
Changing the default output template could be an alternative but I would be concerned that setting a sufficiently low value to avoid filename length issues would affect filenames on filesystems with no effective limit, attracting a different set of complaints. The extractor could manipulate the output template by inserting a length field into any %(title)s
but that would weirdly affect all URLs in the download session.
- title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. ...
I believe that the default output template was chosen on the assumption that title
would be a short text field, for some definition of "short". Implicitly, yt-dl tries to cast all sites in the mould of YouTube, where the title is (a) required (b) not longer than 100 characters, although SEO fiends recommend no more than 70.
Apparently the page title for a tweet set by Twitter has the form (in terms of the extracted fields) {og_title}: "{clean_html(description)}"
, where og_title
, the og:title
meta field, is {uploader_id} on Twitter
. We emulate that but remove on Twitter
, presumably because, where else would it be, given the context? A profile page has {uploader_id} (@{uploader}) / Twitter
, but we don't have an extractor for those [^note].
[^note]: We do have an extractor for twitter:card
but it redirects to an invalid URL and all the test URLs but the single only_matching
URL give a "nothing to see here" page; even that gets 403 for the API call to list video details, as does the only similar URL known to G.
Nitter tries to improve Twitter by making a tweet title thus {profile.fullname} (@{profile.username}): "{stripHtml(tweet.text)}"
, combining the two methods used by Twitter. Perhaps we should also do that, as {uploader_id} (@{uploader}): "{clean_html(description)}"
(and stripping unwanted Twitter links from the description
, as now).
In the browser title bar I see the Twitter title presented with ...
replacing as much of the middle of the text as is needed to make the text fit the title bar. Something like that still seems appropriate.
Output template should be used to control filename length instead. This is a poorly implemented bandaid.
After all, yt-dl itself is just a bandaid in some sense. What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?
What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?
I gave one suggestion above. As you pointed out, that has it's own issues too. (I still think this is a better solution though). If a proper solution was easy, I doubt this issue would have existed for so long
From a user perspective, I do agree this is better than nothing. But I also think that it'll become a headache to maintain any kind of consistency b/w extractors over time.
And I disagree that the twitter is special in this regard. There are plenty of other extractors that face this issue. This search is a quick way to get some examples
That said, since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.
And for the record, while an FAQ entry might help some users, I don't think it will reduce the number of issues opened. The kind of people who actually read the FAQ are the same people who'd look for duplicate issues.
If you really want to get existential about it, I think the core problem is requiring a title for all sites. As detailed above, there's not a tidy title for a tweet analogous to a YouTube title. So there will never be tidy answer to this problem.
That said, making the title field optional seems like a can of worms that isn't worth opening either. Some kind of compromise is necessary and, as documented in the issue that prompted this pull request, the current solution is bad for users. So, IMHO, some alternative is needed. Something along the lines of what's proposed here does seem like the simplest solution. If universality is what you're after, I think enforcing a global trim on all filenames seems like the most natural extension.
The rationale for title
is presumably what I said above, trying to make all sites fit the YT metadata schema. yt-dlp has allowed a None
title while continuing to require the title
field without the sky falling.
... If a proper solution was easy, I doubt this issue would have existed for so long
Amen to that. If we can get something like #29989 to work, that would fix the problem for all sites. Meanwhile:
... since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.
Exactly. The Twitter title field is up for grabs: its extraction shouldn't routinely crash the program. I guess that extractor is used more than all the other possibly problematic extractors together, which are also a small proportion of the ~800 (crikey) we have.
And
...while an FAQ entry might help some users, I don't think it will reduce the number of issues opened
Absolutely, why I mentioned putting some FAQ text in error message appears, if only we knew reliably what that would be.