youtube-dl icon indicating copy to clipboard operation
youtube-dl copied to clipboard

[twitter] Shorten title attribute for #29912

Open palewire opened this issue 1 year ago • 10 comments

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • [x] I am the original author of this code and I am willing to release it under Unlicense
  • [ ] I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • [x] Bug fix
  • [ ] Improvement
  • [ ] New extractor
  • [ ] New feature

This request trims the standard Twitter ID field to workaround the bug reported in #29912

palewire avatar Aug 09 '22 02:08 palewire

@dirkf, I believe I have achieved your request with a little less code. The uploader name is pretty strictly limited by Twitter, so I don't think we need to slice it. That means we can use a more straightforward slice like the one just pushed. Give it a look and let me know what you think.

palewire avatar Aug 09 '22 19:08 palewire

@dirkf, those changes are in.

palewire avatar Aug 10 '22 12:08 palewire

The flake errors in the test run appear to be in other files.

palewire avatar Aug 10 '22 14:08 palewire

Apparently the default flake8 settings that the project uses changed in some way so that a keyword must be separated from the next token by a space, which is basically good but irritating. Otherwise GH somehow removed some spaces from the code, which seems unlikely and would be much worse.

If you update from the current master the 2 places where this caused problems are fixed.

dirkf avatar Aug 10 '22 14:08 dirkf

I think I've done that.

palewire avatar Aug 10 '22 20:08 palewire

imo, this should not be implemented.

  • Would all extractors be expected to do this?
  • title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. Output template should be used to control filename length instead. This is a poorly implemented bandaid.

As I see, changing the default outtmpl to something like %(title).200s-%(id)s.%(ext)s would be much better than this solution.

Or, if the ... is important, move this code to core

pukkandan avatar Aug 11 '22 02:08 pukkandan

This extractor is unusual in that there is no title in the plain webpage, although a title and an og:title are set by JS. The full tweet text is extracted in the description. So:

  • Would all extractors be expected to do this?

No. But Twitter was the source of some of the original issues complaining about filename length errors with the default output template, so this is an opportunity to avoid those in future, without expecting users to check for a FAQ that tells them to use %(title).100s or whatever, and without changing the core code.

Instead of fixing the title, we could generate a warning if the title is "long", or possibly include the FAQ text in the error message if we could reliably detect an exception caused by an over-long filename (a generic "solution" in place of #29989).

Changing the default output template could be an alternative but I would be concerned that setting a sufficiently low value to avoid filename length issues would affect filenames on filesystems with no effective limit, attracting a different set of complaints. The extractor could manipulate the output template by inserting a length field into any %(title)s but that would weirdly affect all URLs in the download session.

  • title != filename. We shouldn't extract wrong metadata just because it is convenient for filenames. ...

I believe that the default output template was chosen on the assumption that title would be a short text field, for some definition of "short". Implicitly, yt-dl tries to cast all sites in the mould of YouTube, where the title is (a) required (b) not longer than 100 characters, although SEO fiends recommend no more than 70.

Apparently the page title for a tweet set by Twitter has the form (in terms of the extracted fields) {og_title}: "{clean_html(description)}", where og_title, the og:title meta field, is {uploader_id} on Twitter. We emulate that but remove on Twitter, presumably because, where else would it be, given the context? A profile page has {uploader_id} (@{uploader}) / Twitter, but we don't have an extractor for those [^note].

[^note]: We do have an extractor for twitter:card but it redirects to an invalid URL and all the test URLs but the single only_matching URL give a "nothing to see here" page; even that gets 403 for the API call to list video details, as does the only similar URL known to G.

Nitter tries to improve Twitter by making a tweet title thus {profile.fullname} (@{profile.username}): "{stripHtml(tweet.text)}", combining the two methods used by Twitter. Perhaps we should also do that, as {uploader_id} (@{uploader}): "{clean_html(description)}" (and stripping unwanted Twitter links from the description, as now).

In the browser title bar I see the Twitter title presented with ... replacing as much of the middle of the text as is needed to make the text fit the title bar. Something like that still seems appropriate.

Output template should be used to control filename length instead. This is a poorly implemented bandaid.

After all, yt-dl itself is just a bandaid in some sense. What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

dirkf avatar Aug 11 '22 06:08 dirkf

What would make for a better implementation? Surely even a harsh critic would call the current implementation at least "adequate"?

I gave one suggestion above. As you pointed out, that has it's own issues too. (I still think this is a better solution though). If a proper solution was easy, I doubt this issue would have existed for so long

From a user perspective, I do agree this is better than nothing. But I also think that it'll become a headache to maintain any kind of consistency b/w extractors over time.

And I disagree that the twitter is special in this regard. There are plenty of other extractors that face this issue. This search is a quick way to get some examples

That said, since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

And for the record, while an FAQ entry might help some users, I don't think it will reduce the number of issues opened. The kind of people who actually read the FAQ are the same people who'd look for duplicate issues.

pukkandan avatar Aug 11 '22 11:08 pukkandan

If you really want to get existential about it, I think the core problem is requiring a title for all sites. As detailed above, there's not a tidy title for a tweet analogous to a YouTube title. So there will never be tidy answer to this problem.

That said, making the title field optional seems like a can of worms that isn't worth opening either. Some kind of compromise is necessary and, as documented in the issue that prompted this pull request, the current solution is bad for users. So, IMHO, some alternative is needed. Something along the lines of what's proposed here does seem like the simplest solution. If universality is what you're after, I think enforcing a global trim on all filenames seems like the most natural extension.

palewire avatar Aug 11 '22 12:08 palewire

The rationale for title is presumably what I said above, trying to make all sites fit the YT metadata schema. yt-dlp has allowed a None title while continuing to require the title field without the sky falling.

... If a proper solution was easy, I doubt this issue would have existed for so long

Amen to that. If we can get something like #29989 to work, that would fix the problem for all sites. Meanwhile:

... since twitter is so popular, I can understand why you'd consider this technical debt to be acceptable.

Exactly. The Twitter title field is up for grabs: its extraction shouldn't routinely crash the program. I guess that extractor is used more than all the other possibly problematic extractors together, which are also a small proportion of the ~800 (crikey) we have.

And

...while an FAQ entry might help some users, I don't think it will reduce the number of issues opened

Absolutely, why I mentioned putting some FAQ text in error message appears, if only we knew reliably what that would be.

dirkf avatar Aug 11 '22 13:08 dirkf