e621ng icon indicating copy to clipboard operation
e621ng copied to clipboard

Source alternate handlers & cleanup: FA, DA, IB, Twitter, Weasyl

Open Deer-Spangle opened this issue 3 years ago • 35 comments

As discussed in this thread here: https://e621.net/forum_topics/34730

I've improved the FurAffinity source handler, and added source handlers for Deviantart, Twitter, Inkbunny, and Weasyl. I've added to the Source::Alternate handlers, and added new ones. I've add unit tests for them, and added a fix script to update a large amount of broken sources.

Full list of the checks added:

  • Base: Convert known secure domains to https [also in fix script]
  • Deviantart: Add gallery link via submission link
  • Deviantart: Convert old format links (artist.deviantart.com) to new format (deviantart.com/artist) [also in fix script]
  • Furaffinity: Convert old/broken cdn links (d.facdn.net, d2.facdn.net) to new domain (d.furaffinity.net) [also in fix script]
  • Furaffinity: Convert /full/ links to /view/ links [also in fix script]
  • Furaffinity: Remove ?upload-successful queries from URLs [also in fix script]
  • Furaffinity: Remove comment anchors #cid: from URLs [also in fix script]
  • Inkbunny: Remove page anchors #pictop from URLs [also in fix script]
  • Twitter: Convert mobile links (mobile.twitter.com) to main domain (twitter.com) [also in fix script]
  • Twitter: Convert FixTweet links (fxtwitter.com, ayytwitter.com, vxtwitter.com, etc) to main domain (twitter.com) [also in fix script]
  • Twitter: Remove tracking data from links [also in fix script]
  • Twitter: Convert old direct links (file.jpg:orig) to new format (file?format=jpg&name=orig) [also in fix script]
  • Weasyl: Add gallery links from submission links

And other changes:

  • Moved force_https check after parsing URL, to allow base handler to force https on other domains without a specific handler
  • Moved source length truncation to post.rb, so all source links get truncated, not just unhandled ones
  • Updated direct image link parser in FurAffinity source handler
  • Stopped FurAffinity source handler from duplicating direct link
  • Fixed adding direct_url to list of sources
  • Moved null source strategy test into tests/unit/sources/strategies so that alternates tests can go in a sibling directory
  • Converting some post tests to use https scheme. Fixing one of them

I've tested this with the included unit tests, and deployed a local instance with these changes and ensured new sources get modified as planned. Also tried spinning up a local instance on the master branch and running the fix script, which behaved as designed. Running the full suite of tests, there are a bunch of failures, but all look unrelated to this. 1319 runs, 2322 assertions, 32 failures, 10 errors, 0 skips I checked a couple that looked like they could be relevant, but they were not. Fixed one of them anyway.

Deer-Spangle avatar Aug 06 '22 14:08 Deer-Spangle

Made those changes, tests still passing ^_^ There's still a couple unless in there, for the optional named arguments in alternate_should_work() because false.present? is falsy, so not a true inversion of .nil?.

And yeah, not sure whether to move that source_test_helper.rb to the tests/helpers directory, or keep it where it is used. Not sure it'll be useful elsewhere

Deer-Spangle avatar Aug 07 '22 01:08 Deer-Spangle

Thank you, it looks much better already. I'm going to take a more in-depth look some other time, and try to run it through the database export so see if anything unexpected happens.

Earlopain avatar Aug 07 '22 06:08 Earlopain

As per #427: What is the actual usage of adding a gallery link from the submission url? The gallery link is already easily visible, and it just duplicates the information. And like you said, the username doesn't even have to be correct. I knew that it works like that at twitter, but I didn't know that that's also the case for deviantart/wesyl.

Earlopain avatar Aug 07 '22 15:08 Earlopain

I had much the same opinion, yeah. But I've had numerous folks say it was helpful, and I had never witnessed anyone doing that to Weasyl and DA links (and had not tested it until I mentioned in the other ticket!) But yeah, people say gallery links are useful when galleries are nuked, and the upload form on the site also emphasizes this:

You should include: A link to the artists page where this was obtained, and a link to the submission page where this image was obtained. No available source should ONLY be used if the content has never been posted online anywhere else.

Deer-Spangle avatar Aug 07 '22 17:08 Deer-Spangle

I brought this up in the staff chat, automatically adding the gallery link when you can infer it from the submission url is not something that we want to do

Earlopain avatar Aug 07 '22 19:08 Earlopain

Okay! I only implemented it in the weasyl handler, because it is already present in the FA handler on master. Shall I remove it from all of them?

Deer-Spangle avatar Aug 07 '22 23:08 Deer-Spangle

For the whole "alternate twitter" bit, I've seen another one floating around: nitter

Main url I've seen is nitter.moomoo.me, and it's mostly from one user.

They've got a list of instances if you want to go that far

DonovanDMC avatar Aug 08 '22 02:08 DonovanDMC

Just the one you added is fine

Earlopain avatar Aug 08 '22 05:08 Earlopain

Alright, removed gallery parsing from deviantart, weasyl, and the note from the twitter handler. Removed the weasyl handler entirely, as gallery parsing was all it did. (Added it to base handler's list of known https domains)

For the whole "alternate twitter" bit, I've seen another one floating around: nitter

Oh yeah! Nitter was in my original cleanup script too, can't believe I forgot to port it over. Though, adding the full list of nitter domains feels like overkill... And there is an issue now, that the handler-finder only checks domains, not subdomains, so for that one, I would need to add a check for "moomoo.me", which might be imprecise. I'll do some extra full-domain checks, maybe.

Ranking of nitter domains in use:

00 = {tuple: 2} ('nitter.net', 401) 01 = {tuple: 2} ('nitter.moomoo.me', 55) 02 = {tuple: 2} ('nitter.kavin.rocks', 29) 03 = {tuple: 2} ('nitter.it', 14) 04 = {tuple: 2} ('nitter.domain.glass', 7) 05 = {tuple: 2} ('nitter.ca', 5) 06 = {tuple: 2} ('nitter.fdn.fr', 4) 08 = {tuple: 2} ('nitter.ir', 1) 09 = {tuple: 2} ('nitter.cc', 1) 10 = {tuple: 2} ('nitter.snopyta.org', 1) 11 = {tuple: 2} ('nitter.tedomum.net', 1) 12 = {tuple: 2} ('nitter.eu', 1) 13 = {tuple: 2} ('twnitter.com', 1)

I'll add the ones with more than one use, and manually fix the others now

Deer-Spangle avatar Aug 08 '22 08:08 Deer-Spangle

Ah, darn, I need to handle nitter media links ( e.g. https://nitter.net/pic/media%2FCTNngvZW4AAHvGM.jpg%3Fname%3Dsmall ) a bit better than that! Will fix this evening, as my train just arrived at work.

It just needs to strip the "/pic" from the start of the path, switch host to pbs.twimg.com, and urldecode the rest. How is best to do a URI decode on that last bit? Google suggests URI.decode_www_form_component(encoded_uri), but is that the preferred way here, or is there another magic Rails solution perhaps? I can't find any other examples in the repo

Deer-Spangle avatar Aug 08 '22 09:08 Deer-Spangle

Please don't forget to update your tests, like when you added NITTER_HOSTS. You also left a reference to the weasyl class in alternates.rb, the weasyl tests are also still there.

Earlopain avatar Aug 08 '22 13:08 Earlopain

I can't seem to get things working with moving the test helper. For some reason, imports are failing. Will investigate further when I can

Deer-Spangle avatar Aug 08 '22 18:08 Deer-Spangle

Use extend instead of include, that should work. I looked at how danbooru does it.

Earlopain avatar Aug 08 '22 18:08 Earlopain

Use extend instead of include, that should work. I looked at how danbooru does it.

Hmmm, was having some odd behaviour when doing that too (4 arguments instead of 3?) I was trying to mirror what was going on with the test/helpers/upload_test_helper.rb, putting it alongside, adding it to test/test_helper.rb as an include, or an extend, right next to it.

My laptop just ran out of power though, so I'll need to give it another few attempts once I get home (train delay today gave me a bit of extra time to explore at least!) ^_^

Deer-Spangle avatar Aug 08 '22 19:08 Deer-Spangle

It seems to work fine for me with extend. Maybe there were some issues when you removed the gallery url parameter?

You will need to remove the setup block when moving it over there, and the instance variable too. Otherwise the test class will have multiple setup blocks, and every class will use the @site of the last defined test, resulting in basically all tests being broken. At least that was the behaviour I was observing, might be something different.

Also, don't feel like you need to hurry. Kira will do a deploy this week, and I'm planning for this to be part of the one after that, or even one more after that. Time between deploys is typically about 2-4 weeks, depending on how active (or not) the repo is. I try to not put too many changes in at once.

Earlopain avatar Aug 08 '22 20:08 Earlopain

Ah, I saw the "checking against the last site" thing too! Okay. I guess I should just do the URL parsing in each check then, fair enough. There's only 2 checks left in there anyway!

Yeah, I figured it might take a while to land, I'm not worried about that, just trying to keep my momentum before I dig into other stuff and forget where this was ^_^ (Probably the next thing for me will be throwing some script together to go hunt those gallery-only links and fixing them up through the API?)

Deer-Spangle avatar Aug 08 '22 21:08 Deer-Spangle

alternate_should_work is just a normal function being evaluated top-to-bottom, which means that you can put plain local-scoped variables in there. Just move the instance variable out of the current setup block and loose the @, that should work. No need to parse the url twice.

If you mean a seperate project to fix it up, sure. I don't think that that would be something that should be part of the main repo. One thing that comes to mind is doing md5 checks. That would yield a 0% error rate, while missing higher/lower quality submissions though. Someone did this already, I remember seeing post edit reasons mentioning exact fa md5 matches dating back a few years now I believe.

Earlopain avatar Aug 08 '22 21:08 Earlopain

Ohhh, of course! I get a bit muddled thanks to Ruby's lack of parentheses around arguments, and the blocks. It has a magical way of making functions look like keywords to me. That works! Just need to tweak some things with the nitter image parsing, now.

And yes, definitely meant a separate project! Definitely would not be something for the main repo. Would need to be querying some other stuff. I've got a database of all the perceptual image hashes across most the big furry sites now, along with a lot of other metadata, so should be able to whip up a python script using that.

Deer-Spangle avatar Aug 08 '22 21:08 Deer-Spangle

Awesome, there we go

spangle@kobold:~/e621ng$ docker-compose run tests test/unit/sources/
[+] Running 4/0
 ⠿ Container e621ng-redis-1      Running                                                                                                                                                           0.0s
 ⠿ Container e621ng-elastic-1    Running                                                                                                                                                           0.0s
 ⠿ Container e621ng-memcached-1  Running                                                                                                                                                           0.0s
 ⠿ Container e621ng-postgres-1   Running                                                                                                                                                           0.0s
Waiting for elastic to come up
== Seeding database with sample content ==
Run options: --seed 47641

# Running:

..........................................................

Finished in 0.413580s, 140.2390 runs/s, 145.0748 assertions/s.
58 runs, 60 assertions, 0 failures, 0 errors, 0 skips

Deer-Spangle avatar Aug 08 '22 21:08 Deer-Spangle

A removal of the /photo/X part from Twitter links could also be done It does nothing for the actual link, Twitter opens the tweet all the same with or without it

In fact, it's removed when you open the url So it's useless outside of twitter

DonovanDMC avatar Aug 23 '22 00:08 DonovanDMC

Hmmm, I don't know about that actually. It's a designator for the specific image of the tweet, and is used when doing previews on various platforms, I believe? (Though, maybe that's just fxtwitter, actually) I think it makes some sense to keep it there, as a pointer to the precise image the post is sourced from, but on the other hand, it allows duplicates where something could link both the tweet, and the individual photo. And as you say, going to the link just strips it out. (Which does seem odd.)

I could be convinced either way on that, I think

Deer-Spangle avatar Aug 23 '22 23:08 Deer-Spangle

Doesn't change the display on Discord, Telegram, Guilded

(in fact, guilded shows a reply as the image) Screenshot from 2022-08-23 20-11-27 image

The only place it seems to make a difference is on slack, and that's iffy at best

~~the tweet, if you're curious~~

DonovanDMC avatar Aug 24 '22 01:08 DonovanDMC

Fair enough! I originally thought those links actually went to the pic in the tweet, was surprised to discover they just redirect to the tweet, and they don't seem to do anything on major platforms either. Added a check to remove those too, and a test

Deer-Spangle avatar Aug 26 '22 12:08 Deer-Spangle

Did you commit the right thing? Only the test is present, and something else unrelated in an ancient fixer script seems to have slipped in

Earlopain avatar Aug 26 '22 13:08 Earlopain

... I did not commit the right thing! Sorry, I was rushing toward the end of my lunch break there, and ticked the wrong files in RubyMine. I'll get that corrected in the evening! (The todo was just a note from real early, so I knew an example fix script to check when writing my one. Should have cleared that on my local by now maybe)

Deer-Spangle avatar Aug 26 '22 14:08 Deer-Spangle

Could you also do some rebasing? The commit history is pretty messy right now. If you don't want to bother, that's fine. I can just squash merge instead.

Earlopain avatar Aug 26 '22 16:08 Earlopain

Sure! I've reorganised a local branch to have a cleaner history, but I see most commits have a prefix, what are the recommended prefixes for this? [SourceAlternate] and [Tests] or something like that?

Deer-Spangle avatar Aug 27 '22 02:08 Deer-Spangle

Hm, I would just go with [Posts] in this case I think. It's not an exact science, but I try to relate my commits to a rails model. Since this improves a posts sources I would go with that.

Though I also sometimes just use [Tests] when just modifying tests for example. Or [Cleanup] when I think the model doesn't matter much.

Just do what you think is best, it doesn't really matter much in the end. Though I would go with [Alternates]. It's shorter and pluralized.

Earlopain avatar Aug 27 '22 07:08 Earlopain

That sounds good! Alright, lets see how that force push goes..

Deer-Spangle avatar Aug 27 '22 12:08 Deer-Spangle

How's that? ^_^

Deer-Spangle avatar Aug 27 '22 12:08 Deer-Spangle