
Add support for multiple instances & one job queue

stijn-uva opened this pull request on Dec 03, 2021 · 6 comments

For now, this requires all instances to connect to the same Postgres database and to share the same result data location (e.g. via a network share).

Later we can make this more robust/flexible :-)

Fixes #64.

stijn-uva · Dec 03 '21 10:12

Hmmmm. Well, one thing we need to figure out is the API. Does it need to run on each instance of the backend (I think no)? How do we ensure the job runs on the correct instance (or each instance, if I'm wrong about that)?

I tried running this commit with two backends and things... went haywire. The API failed to start and now, I think, the API job won't be claimed again even after switching back to one backend. I haven't found what changed there to cause that. I'll completely remove my 4CAT installation and install just this version with one backend to make sure it works at least.

dale-wahl · Dec 06 '21 12:12

The API does not seem to get picked up by either instance when two are running, but I was able to run some queries successfully.

I set Docker logs to be stored on a volume a while ago, and both backends are writing to the same log, so I'm not yet sure whether both backends are actually doing work or just one of them.

dale-wahl · Dec 06 '21 13:12

Thanks a lot for testing this! I hadn't thought about the API. It should be run by all instances, since it reports statistics about that instance and is used to cancel jobs on that specific instance via the front-end. I'll fix it to do that properly in this scenario - might have to do with constraints on the jobs table.

The way this is set up now, all jobs can by default be run by any instance. The jobs that should run on each instance are queued explicitly for each instance in manager.py. The idea is that processors can be run by any instance in the 'network'. So I queue a Twitter dataset on instance 1, but instance 2 runs the query (if it can) and then stores the results on instance 1. Meanwhile, instance 1 can claim another Twitter dataset and run that. So you can run multiple datasets of the same type in parallel - which could already be done via the processor's max_workers property, but for API calls it's bad karma to run multiple in parallel; you run into rate limits etc.
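To make the shared-queue idea concrete, here is a minimal sketch of how any instance could atomically claim the next available job from the shared Postgres database. This is illustration only, not the actual 4CAT code: the claimed_by column and the ordering are assumptions.

```python
import psycopg2  # usage sketch: conn = psycopg2.connect("dbname=fourcat")

def claim_next_job(connection, instance_id):
    """
    Atomically claim one unclaimed job from the shared jobs table.
    FOR UPDATE SKIP LOCKED lets several backends poll the same table
    without ever claiming the same row twice.
    """
    with connection:  # commit (or roll back) the transaction on exit
        with connection.cursor() as cursor:
            cursor.execute("""
                SELECT id, jobtype, remote_id
                  FROM jobs
                 WHERE claimed_by IS NULL
                 ORDER BY id ASC
                 LIMIT 1
                   FOR UPDATE SKIP LOCKED
            """)
            job = cursor.fetchone()
            if job is None:
                return None  # queue is empty, or all rows are locked
            cursor.execute(
                "UPDATE jobs SET claimed_by = %s WHERE id = %s",
                (instance_id, job[0]),
            )
            return job
```

The FOR UPDATE SKIP LOCKED clause is what makes concurrent polling safe: a row locked by one instance is simply skipped by the others.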

stijn-uva · Dec 07 '21 16:12

Interesting. The frontend (or whatever happens to be calling the API) is going to need to be able to determine which backend to reach out to (for example with cancel-job), or perhaps reach out to all of them and have each handle the request accordingly (which makes sense for possibly everything else). That means it needs to know the host:port for each backend. We could do that in the config, but that becomes harder to maintain in a cloud setting, so the database would be far better for that. I'm thinking out loud, so perhaps we should just chat!
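To make that concrete (the instances table below is purely hypothetical; nothing like it exists in 4CAT), the frontend could read backend endpoints from the shared database rather than from its own config:

```python
def get_backend_endpoints(connection):
    """
    Sketch: look up all registered backend instances from a shared,
    hypothetical 'instances' table, so the frontend knows which
    host:port combinations it can reach without hardcoding them.
    `connection` is a psycopg2 connection to the shared database.
    """
    with connection.cursor() as cursor:
        cursor.execute("SELECT instance_id, host, port FROM instances")
        return {row[0]: "%s:%s" % (row[1], row[2]) for row in cursor.fetchall()}
```

Each backend registering itself in such a table on startup would keep the list current as instances come and go in a cloud setting.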

But some other thoughts before I go... 😄

We'll probably want both a max_workers and a max_workers_per_backend variable, to handle cases where we want to restrict something like an API key versus not wanting something processor- or memory-hungry running multiple times on the same machine.

I'm not super sure I get remote_id in instances where there is no dataset (it's either localhost or 0). From what I can tell, that's the reason the API jobs are not recreated. We could add instance to the constraints in order to allow multiple API jobs, but that might allow multiple instances to create other identical jobs. Oh, we could feed the instance_id into remote_id. I may have another issue with Docker, since platform.uname().node is not static under... well, some yet-to-be-determined conditions.

dale-wahl · Dec 08 '21 10:12

These were all great suggestions. I've done the following:

  • Changed the way worker interrupts are requested. Now the scheduler checks the database on each loop to see if there are workers that do not have a corresponding job in the database; if so, they are interrupted. So you can simply remove a job from the database and the relevant instance will handle the interruption and cancellation (a sketch of this check follows this list).
  • Consequently the API now requests the deletion of the job from the database, instead of directly requesting worker interruption. This avoids the issue of not knowing which instance's API should be interrupting the worker (since they all have access to the database).
  • remote_id is now set to the instance ID for 'housekeeping' workers. The unique key for the jobs table is now a compound key of (jobtype, remote_id, instance), so the same job can exist on multiple instances.
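A minimal sketch of that interrupt check as it might sit in the scheduler loop; the worker bookkeeping and the request_interrupt() method are assumptions standing in for 4CAT's actual manager code:

```python
def check_interrupts(connection, running_workers):
    """
    Interrupt any local worker whose job row has disappeared from the
    shared jobs table. `running_workers` maps (jobtype, remote_id) to
    a worker object; both the mapping and the worker interface are
    illustrative assumptions.
    """
    with connection.cursor() as cursor:
        cursor.execute("SELECT jobtype, remote_id FROM jobs")
        live_jobs = set(cursor.fetchall())

    for key, worker in list(running_workers.items()):
        if key not in live_jobs:
            # the job was deleted (e.g. cancelled via any instance's
            # API), so ask the local worker to stop
            worker.request_interrupt()
```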

What remains to be done:

  • Docker & INSTANCE_ID. The solution might be to have the setup script set it to docker-4cat-[random number] or something...
  • max_workers_per_backend. This is surprisingly tricky, because the one scenario I can think of (someone queueing multiple jobs for the same API key, which would run into rate limiting if run simultaneously) would require deeper checking to see if the same API key is being used simultaneously at multiple instances - if it is, then the jobs should not run in parallel, but if it isn't, that's sort of the scenario this is made for and they should...
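For the Docker point above, a hypothetical sketch of what the setup script could do: generate an ID once, persist it, and reuse it on every start, so the ID stays stable even when platform.uname().node is not. The file location and naming scheme are made up for illustration:

```python
import secrets
from pathlib import Path

def get_instance_id(id_file=Path("config/.instance_id")):
    """
    Return a stable instance ID, generating and persisting one on
    first use. The 'docker-4cat-' prefix follows the naming idea
    above; the storage location is an assumption.
    """
    if id_file.exists():
        return id_file.read_text().strip()
    instance_id = "docker-4cat-%s" % secrets.token_hex(4)
    id_file.parent.mkdir(parents=True, exist_ok=True)
    id_file.write_text(instance_id)
    return instance_id
```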

Anyway, for now this is running on three of our own instances, seemingly without issues, but I'm sure some will come up.

stijn-uva · Dec 13 '21 09:12

Based on the experience so far, I think this needs a better way to share result files between instances. Relying on network mounts is not so great; you run into issues with write permissions and so on. The shared database seems the simplest way to share the actual queue, but sharing files might be done better via e.g. the HTTP API.
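As a rough illustration of the HTTP idea (the /api/result/ endpoint and the token auth below are hypothetical; no such endpoint exists yet), one instance could stream a result file from the instance that holds it instead of reading it from a shared mount:

```python
import requests

def fetch_result_file(instance_host, dataset_key, destination, api_token):
    """
    Download a dataset's result file from another 4CAT instance over
    a hypothetical HTTP API endpoint, streaming it to a local path.
    """
    response = requests.get(
        "https://%s/api/result/%s" % (instance_host, dataset_key),
        headers={"Authorization": "Bearer %s" % api_token},
        stream=True,
        timeout=60,
    )
    response.raise_for_status()
    with open(destination, "wb") as outfile:
        for chunk in response.iter_content(chunk_size=64 * 1024):
            outfile.write(chunk)
```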

stijn-uva · Dec 13 '21 16:12

This remains a useful feature to have at some point, but since the solution used here is not optimal and the code in this branch is so far out of sync, I'm closing this PR for now.

stijn-uva · Apr 12 '24 10:04