[Bug] NetworkJob::Gallery Tab Network jobs stall in pending or working state after relaunch/crash
Hydrus version
598
Qt major version
Qt 6
Operating system
Linux (specify distro and version in comments)
Install method
Running from source
Install and OS comments
Ubuntu 23 x86
Bug description and reproduction
Gallery importer network jobs become stuck in the pending or working state but never make any progress after hydrus is relaunched.
My best guess for why this happens is that after the tab is deserialized the jobs have become orphaned instead of being reinserted into the job queue.
Symptoms
- The gallery url which initiated the job in the search log reads as successful, or there is a gallery url for the next page which has a blank status.
- File/Post urls in the file log which have not yet been processed have a blank status but are never processed.
Reproduction Steps
- Start several gallery downloaders.
- Reboot and kill(-9) hydrus at various stages of the network job. In my case a crash caused the issue.
Workaround
Copy the query from the faulted job and run it again. Delete the faulted job from the tab.
Log output
[There is no specific log or dump] The program is running without crash or error, however these specific jobs seem hung at the model level.
Thank you for this report. This is pretty odd--obviously the program is supposed to resume if it boots with a non-complete downloader, and even if there is a crash it should normally be fine--since the downloader page is basically rewound in time a bit, it'll usually just blit through the first few results as it realises they are 'already in db' since they were imported after the crash (and so saved to the database, which did sync, but not the GUI session, which was pre-empted by the crash), and then it'll continue as always.
I agree that this is probably some odd scheduling thing. I did change some file log stuff recently, which could screw with some duplicate URLs or paths in file logs, but it doesn't sound like you have this unless you have some very exotic URL class rules.
Unfortunately I cannot reproduce this--if I force a crash and restart, things get back to normal as I would expect. There are some forced limits in the number of downloaders that can run at once--it is something like 5 gallery downloaders and 10 file downloaders--so sometimes the pending/working situation can stall when things are busy, for instance right after boot, but you will see things move forward unless the client is really suffering under hundreds of competing watchers or whatever.
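To illustrate the slot limit described above, here is a minimal sketch of how a fixed number of working slots leaves the remaining jobs in 'pending' until a slot frees up. This is not hydrus's actual scheduler; the function name `promote_pending` and the job names are made up for the example, and the limit of 5 is only the "something like 5 gallery downloaders" estimate from the comment.

```python
from collections import deque

# Assumed limit, based on the "something like 5 gallery downloaders"
# estimate above; the real value in hydrus may differ.
MAX_GALLERY_SLOTS = 5


def promote_pending(pending, working, max_slots):
    """Move jobs from pending to working while free slots remain."""
    while pending and len(working) < max_slots:
        working.append(pending.popleft())


# Eight hypothetical gallery jobs compete for five slots.
pending = deque(f"gallery-{i}" for i in range(8))
working = []

promote_pending(pending, working, MAX_GALLERY_SLOTS)
# Five jobs are now working; three wait in pending.

working.pop(0)  # one working job finishes and frees its slot
promote_pending(pending, working, MAX_GALLERY_SLOTS)
# The next pending job is promoted; two are still pending.
```

In a healthy client the pending list keeps draining like this. A job that stays pending while the working list is empty, as in this report, would point to something other than the slot limit.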
Can we gather a bit more information?
- First off, if your client closes cleanly, no crash or kill(-9), and then you boot up again, does a downloader reload the exit session state and get back to work ok in that case? Is it only crashes that throw it off, or is it any 'boot on incomplete downloader'?
- If you get this infinite pending/waiting jobs state, hit network->data->review network jobs. Do you see the respective downloads in there? It should be the 'working' guys, but not the pending. Do they have any interesting or useful info? Is anything else working there, and if you hit 'refresh snapshot', is it moving along or all stalling? If you do your workaround and copy the query, do you see the network stuff working now, and if/how is it different?
- Advanced, and only if you have the time to do it: Hit network->pause->always boot with paused network traffic, and then restart the program cleanly. Now open a new downloader and set up our crash test. Unpause the network traffic for a bit, and then initiate the crash. Now boot up, and all the network traffic should be paused again. Hit up help->debug->report modes->file import report mode and network report mode. Now unpause network traffic again and our guys should do their thing and you should get a whole bunch of popup spam, typically like forty popups per successful file import. Is there anything interesting in there? Does any of the network stuff get going, or is there actually basically nothing going on? Does it always stop after x step?
@hydrusnetwork
- I have paused all but the stalled jobs; on the stalled jobs, search and files are not paused
- Subscriptions are paused
Paused all new network traffic, network report mode on, no messages from the network engine. Reviewing network jobs shows empty table, no jobs waiting for bandwidth.
On closer inspection another symptom is that search is not ◼️ on the offending jobs, but is ◼️ on files. However the search never seems to be proceeding to populate more files.
Thanks. I think you are correct, as you said on discord, that somehow these old (1 year+) import jobs have some busted variable and serialisation will help us debug. I did recently update some of the 'how do we identify ourselves' vs 'how do we do our job' variables inside file import jobs, and I think gallery search jobs, so perhaps this is what has happened here.
I will figure out some 'export this to json' job for the file and search logs and we'll examine what URLs etc. they think they really have.
Ok, slightly stupid location for the menu item, but both the file log and search log will in v600 support JSON export of the current selection to clipboard. Please DM me some examples here, and maybe the equivalent on a newly created downloader that we know work, and we'll see what the differences are.
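For comparing a stalled entry against one from a known-good downloader, something like the following sketch may help. It assumes each exported entry is a flat JSON object. The field names in the example (`url`, `status`, `note`, `referral_url`) are invented placeholders, since the real export format is not shown in this thread.

```python
import json

MISSING = "<missing>"


def diff_entries(stalled: dict, healthy: dict) -> dict:
    """Return {key: (stalled_value, healthy_value)} for each top-level key that differs."""
    differences = {}
    for key in sorted(set(stalled) | set(healthy)):
        stalled_value = stalled.get(key, MISSING)
        healthy_value = healthy.get(key, MISSING)
        if stalled_value != healthy_value:
            differences[key] = (stalled_value, healthy_value)
    return differences


# Hypothetical clipboard contents; the real exports will have other fields.
stalled_entry = json.loads(
    '{"url": "https://example.com/post/1", "status": "", "note": ""}'
)
healthy_entry = json.loads(
    '{"url": "https://example.com/post/1", "status": "successful",'
    ' "note": "", "referral_url": "https://example.com/gallery"}'
)

for key, (old, new) in diff_entries(stalled_entry, healthy_entry).items():
    print(f"{key}: stalled={old!r} healthy={new!r}")
```

Only top-level keys are compared, so nested structures show up as one differing value rather than a field-by-field breakdown.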
@hydrusnetwork This only seems to be occurring with very unstable webservers, so I believe it is related to network jobs not being robust enough to resume at a seek point, so they just keep restarting.
So it is not so much that the jobs stall as that the connection is reset and they never complete. Or, as is the case with e.g. kemono.su, the file bandwidth is quickly consumed, so the job stalls waiting on it, and in the interim the server closes the connection, so the job resets to 0 percent.
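As a rough illustration of what resuming at a seek point could look like, here is a sketch that uses the standard HTTP `Range: bytes=N-` request header to continue from the bytes already received instead of restarting at zero. It is not how hydrus's network jobs work today. `FlakyServer` is a made-up stand-in for an unstable server, and a real server would also have to support range requests (answering with 206 Partial Content) for this to apply.

```python
def range_header(offset: int) -> dict:
    """Build request headers that ask the server to continue from `offset`."""
    return {"Range": f"bytes={offset}-"} if offset > 0 else {}


class FlakyServer:
    """Hypothetical unstable server: sends at most `chunk` bytes per request."""

    def __init__(self, payload: bytes, chunk: int):
        self.payload = payload
        self.chunk = chunk
        self.requests = 0

    def get(self, headers: dict) -> bytes:
        self.requests += 1
        start = 0
        if "Range" in headers:
            # Parse the open-ended form "bytes=<start>-" only.
            start = int(headers["Range"][len("bytes="):-1])
        # Simulate the connection being dropped after `chunk` bytes.
        return self.payload[start:start + self.chunk]


def download_with_resume(server: FlakyServer, total_size: int, max_attempts: int = 10) -> bytes:
    """Keep the bytes received so far and re-request only the remainder."""
    received = bytearray()
    for _ in range(max_attempts):
        if len(received) >= total_size:
            break
        received.extend(server.get(range_header(len(received))))
    return bytes(received)


payload = b"0123456789"
server = FlakyServer(payload, chunk=4)
data = download_with_resume(server, total_size=len(payload))
```

With a 10-byte payload and 4 bytes per connection, the download completes in three requests. Without the offset, every attempt would restart from byte 0 and never get past the first 4 bytes.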
Possibly the same as https://github.com/hydrusnetwork/hydrus/issues/1094
Sorry I've been slow getting traces. I switched over to a completely different session for like half a year of a big scrape job.
Possibly similar to https://github.com/hydrusnetwork/hydrus/issues/971