invidious
Limit jobs to 5 and wrap in exception handler to release
@unixfox I've narrowed down the edits to this. If I revert these changes I see the errors again.
#4362
Can you please explain what this code does?
@SamantazFox I've added inline comments.
Oooh, okay, I see what you did here!
For your information, jobs in Invidious aren't scheduled in any way. Each job runs in its own "fiber" (coroutine) and is basically a `while true` loop with a delay (`sleep`) on each iteration. This sleep allows the internal Crystal scheduler to jump between fibers. The role of `job.begin` is to spawn that fiber, then exit right away.
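The pattern described above can be sketched roughly like this (names and the one-minute interval are assumptions for illustration, not Invidious's actual job classes):

```crystal
# Minimal sketch of an Invidious-style job:
# `begin` spawns a fiber and returns immediately; the fiber
# loops forever, and each `sleep` yields control back to the
# Crystal scheduler so other fibers can run.
class ExampleJob
  def begin
    spawn do
      loop do
        # ... do the periodic work here ...
        sleep 1.minute # cooperative yield point between iterations
      end
    end
  end
end

ExampleJob.new.begin # returns right away; the loop runs in its fiber
```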
In essence, your code doesn't prevent jobs from running, and if it did, half of invidious would break.
Please be aware that a more advanced job management library already exists, but we don't have the time (and no urgent need) to integrate it.
So I started with this purely to help in troubleshooting the issue, but somehow after implementing this change, I could not and still have not reproduced the issue. If I revert the change, the issue comes back though, so this is reproducible.
Btw, I'm not interested in the bounty at all (assuming that's even on the table).
The details in the issue I created show that Invidious was only creating a few connections to the DB, but that, for reasons still unknown, something similar to jobs was locking up (race condition?) until Invidious got to the point where it just sat there, not talking to the DB at all. It definitely wasn't using the default 100 connections that the DB was configured for, and the DB wasn't overwhelmed with requests -- it had answers waiting for Invidious to pick up that Invidious just wasn't collecting.
So I figured I'd limit the number of concurrent jobs, and handle any exceptions that might occur inside a job so that the slots couldn't become "zombied".
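The idea above could be sketched with a buffered Channel acting as a counting semaphore, with an `ensure` block releasing the slot even when the job raises (the names `JOB_SLOTS` and `run_limited`, and the cap of 5, are hypothetical illustrations, not the actual PR code):

```crystal
# Hypothetical sketch: cap concurrent jobs at 5 using a buffered
# Channel as a counting semaphore. `send` blocks once all 5 slots
# are taken; `ensure` guarantees the slot is released even if the
# job raises, so a failed job can't leave a slot "zombied".
JOB_SLOTS = Channel(Nil).new(5)

def run_limited(&work : ->)
  spawn do
    JOB_SLOTS.send(nil) # acquire a slot (blocks when 5 are in use)
    begin
      work.call
    rescue ex
      # Swallow and log the error so the fiber doesn't die holding state
      puts "job failed: #{ex.message}"
    ensure
      JOB_SLOTS.receive # release the slot no matter what happened
    end
  end
end
```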
> So I started with this purely to help in troubleshooting the issue, but somehow after implementing this change, I could not and still have not reproduced the issue. If I revert the change, the issue comes back though, so this is reproducible.
I think that it might have to do with the error handling rather than the semaphore part. Can you test different variations (no semaphore, no rescuing, fewer/more channels)?
This pull request has been automatically marked as stale and will be closed in 30 days because it has not had recent activity and is most likely abandoned or outdated. If you think this pull request is still relevant and applicable, you just have to post a comment and it will be unmarked.