[Bug] NetworkJob::Gallery Tab Network jobs stall in pending or working state after relaunch/crash

Open bbappserver opened this issue 1 year ago • 4 comments

Hydrus version

598

Qt major version

Qt 6

Operating system

Linux (specify distro and version in comments)

Install method

Running from source

Install and OS comments

Ubuntu 23 x86

Bug description and reproduction

Gallery importer network jobs become stuck in the pending or working state but never make any progress after hydrus is relaunched.

My best guess for why this happens is that after the tab is deserialized the jobs have become orphaned instead of being reinserted into the job queue.
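
To make the guess above concrete, here is a minimal sketch of what "reinserting into the job queue" after deserialization could look like. It is not hydrus code: `Job`, `Page`, and `requeue_unfinished` are hypothetical names invented for illustration, and the real scheduler is certainly more involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical stand-in for a gallery/file import job.
    url: str
    status: str = "pending"  # "pending", "working", or "done"


@dataclass
class Page:
    # Hypothetical stand-in for a deserialized downloader page.
    jobs: list = field(default_factory=list)


def requeue_unfinished(page, queue):
    """Reset every unfinished job to 'pending' and put it back on the queue.

    A job saved as 'working' has no live worker after a relaunch, so it is
    treated the same as a pending one. Returns the number of jobs requeued.
    """
    requeued = 0
    for job in page.jobs:
        if job.status != "done":
            job.status = "pending"
            queue.append(job)
            requeued += 1
    return requeued
```

If a step like this were skipped on load, the page would still display its jobs as pending or working, but nothing would ever pick them up, which matches the symptoms below.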

Symptoms

  • The gallery url which initiated the job in the search log reads as successful, or there is a gallery url for the next page which has a blank status.
  • File/Post urls in the file log which have not yet been processed have a blank status but are never processed.

Reproduction Steps

  1. Start several gallery downloaders.
  2. Reboot and kill(-9) hydrus at various stages of the network job. In my case a crash caused the issue.

Workaround

Copy the query from the faulted job and run it again. Delete the faulted job from the tab.

Log output

[There is no specific log or dump] The program is running without crash or error, however these specific jobs seem hung at the model level.

bbappserver avatar Nov 18 '24 22:11 bbappserver

Thank you for this report. This is pretty odd--obviously the program is supposed to resume if it boots with a non-complete downloader, and even if there is a crash it should normally be fine--since the downloader page is basically rewound in time a bit, it'll usually just blit through the first few results as it realises they are 'already in db' since they were imported after the crash (and so saved to the database, which did sync, but not the GUI session, which was pre-empted by the crash), and then it'll continue as always.

I agree that this is probably some odd scheduling thing. I did change some file log stuff recently, which could screw with some duplicate URLs or paths in file logs, but it doesn't sound like you have this unless you have some very exotic URL class rules.

Unfortunately I cannot reproduce this--if I force a crash and restart, things get back to normal as I would expect. There are some forced limits in the number of downloaders that can run at once--it is something like 5 gallery downloaders and 10 file downloaders--so sometimes the pending/working situation can stall when things are busy, for instance right after boot, but you will see things move forward unless the client is really suffering under hundreds of competing watchers or whatever.
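
As an illustration of how a cap like that can leave jobs sitting in 'pending', here is a small sketch using a semaphore as a slot limiter. The class name `SlotLimiter` is hypothetical, the limit of 5 is only the approximate figure quoted above, and this is not how hydrus is known to implement it.

```python
import threading


class SlotLimiter:
    """Hypothetical cap on how many downloaders may be 'working' at once."""

    def __init__(self, max_slots):
        self._sem = threading.BoundedSemaphore(max_slots)

    def try_start(self):
        # Non-blocking: returns False when all slots are taken,
        # in which case the caller stays 'pending' and retries later.
        return self._sem.acquire(blocking=False)

    def finish(self):
        # Must be called when a job ends, otherwise the slot leaks
        # and pending jobs wait forever.
        self._sem.release()


# Assumed value, taken from the "something like 5 gallery downloaders" remark.
gallery_slots = SlotLimiter(5)
```

With a design like this, a job stays pending only until some other job calls `finish()`. A slot that is never released would produce a permanent stall, which is one thing worth ruling out.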

Can we gather a bit more information?

  • First off, if your client closes cleanly, no crash or kill(-9), and then you boot up again, does a downloader reload the exit session state and get back to work ok in that case? Is it only crashes that throw it off, or is it any 'boot on incomplete downloader'?
  • If you get this infinite pending/waiting jobs state, hit network->data->review network jobs. Do you see the respective downloads in there? It should be the 'working' guys, but not the pending. Do they have any interesting or useful info? Is anything else working there, and if you hit 'refresh snapshot', is it moving along or all stalling? If you do your workaround and copy the query, do you see the network stuff working now, and if/how is it different?
  • Advanced, and only if you have the time to do it: Hit network->pause->always boot with paused network traffic, and then restart the program cleanly. Now open a new downloader and set up our crash test. Unpause the network traffic for a bit, and then initiate the crash. Now boot up, and all the network traffic should be paused again. Hit up help->debug->report modes->file import report mode and network report mode. Now unpause network traffic again and our guys should do their thing and you should get a whole bunch of popup spam, typically like forty popups per successful file import. Is there anything interesting in there? Does any of the network stuff get going, or is there actually basically nothing going on? Does it always stop after x step?

hydrusnetwork avatar Nov 19 '24 23:11 hydrusnetwork

@hydrusnetwork

  • I have paused all but the stalled jobs; on the stalled jobs, search and files are not paused
  • Subscriptions are paused

Paused all new network traffic, network report mode on, no messages from the network engine. Reviewing network jobs shows an empty table, no jobs waiting for bandwidth.

On closer inspection another symptom is that search is not ◼️ on the offending jobs, but is ◼️ on files. However the search never seems to be proceeding to populate more files.

bbappserver avatar Nov 21 '24 00:11 bbappserver

Thanks. I think you are correct, as you said on discord, that somehow these old (1 year+) import jobs have some busted variable and serialisation will help us debug. I did recently update some of the 'how do we identify ourselves' vs 'how do we do our job' variables inside file import jobs, and I think gallery search jobs, so perhaps this is what has happened here.

I will figure out some 'export this to json' job for the file and search logs and we'll examine what URLs etc. they think they really have.

hydrusnetwork avatar Nov 23 '24 18:11 hydrusnetwork

Ok, slightly stupid location for the menu item, but both the file log and search log will in v600 support JSON export of the current selection to clipboard. Please DM me some examples here, and maybe the equivalent on a newly created downloader that we know works, and we'll see what the differences are.
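
For anyone comparing two such exports by hand, a quick way to spot the differences is a key-by-key diff. The sketch below assumes each export is a single JSON object with top-level keys; the actual v600 export format is not shown in this thread and may be shaped differently, and `diff_records` is a hypothetical helper name.

```python
import json


def diff_records(old_json, new_json):
    """Return {key: (old_value, new_value)} for every top-level key that differs.

    Assumes both inputs are JSON objects. A key missing on one side is
    reported with None for that side, so it cannot be told apart from an
    explicit null.
    """
    old = json.loads(old_json)
    new = json.loads(new_json)
    diffs = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            diffs[key] = (old.get(key), new.get(key))
    return diffs
```

Pasting the export from a stalled entry and from a freshly created, working entry into this would list only the fields that differ between them.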

hydrusnetwork avatar Nov 27 '24 03:11 hydrusnetwork

@hydrusnetwork This only seems to be occurring with very unstable webservers, so I believe it is related to the network job not being robust enough to resume at a seek point, so it just keeps restarting.

So it is not so much that the jobs stall as that the connection is reset and they never complete or, as is the case with e.g. kemono.su, the file bandwidth is quickly consumed, so the job stalls waiting on it, and in the interim the server closes the connection, so the job resets to 0 percent.
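
For reference, "resuming at a seek point" over HTTP is normally done with a `Range` request header, which the server answers with `206 Partial Content` and a `Content-Range` header. The sketch below shows only the header logic. The function names are hypothetical, and whether hydrus's network job does or could do this is an assumption on my part.

```python
def resume_headers(bytes_received):
    """Build request headers asking the server to continue from our offset."""
    if bytes_received <= 0:
        return {}  # nothing saved yet, so do a normal full request
    return {"Range": f"bytes={bytes_received}-"}


def can_append(status_code, content_range, bytes_received):
    """True only if the response really continues from bytes_received.

    A plain 200 means the server ignored the range and is sending the
    whole file again, so the partial data must be discarded instead.
    """
    if status_code != 206 or content_range is None:
        return False
    # Expected shape: "bytes 1000-4999/5000"
    unit, _, rest = content_range.partition(" ")
    start = rest.split("-", 1)[0]
    return unit == "bytes" and start.isdigit() and int(start) == bytes_received
```

This only helps when the server supports range requests; on servers that do not, a reset connection still means starting from 0 percent.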

bbappserver avatar Apr 23 '25 12:04 bbappserver

Possibly the same as https://github.com/hydrusnetwork/hydrus/issues/1094

Sorry I've been slow getting traces. I switched over to a completely different session for like half a year of a big scrape job.

bbappserver avatar Sep 22 '25 15:09 bbappserver

Possibly similar to https://github.com/hydrusnetwork/hydrus/issues/971

bbappserver avatar Sep 22 '25 15:09 bbappserver