
Updating scheduler/database to improve the time a machine is locked for

Open cccs-kevin opened this issue 2 years ago • 0 comments

Something that I've noticed in the following environment:

  • you have ~100 clients that submit tasks and poll until analysis completion,
  • you have 100 machines that are eligible for all tasks
  • each analysis runs for ~2 minutes

The machine_lock logic in scheduler.py prevents all 100 machines from being assigned a task at once. Instead, the system settles into an equilibrium of ~50 tasks in the pending queue with only ~30-50 machines in use. Here is why:

  • It takes ~2 seconds for a task to be assigned to a machine, and the majority of this time is spent waiting to acquire the machine_lock: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/scheduler.py#L184
    • Therefore, in any ten-second window, only ~5 tasks are assigned to machines
  • Analyses complete periodically, depending on the file type
    • Every ten seconds, >5 tasks are completed and machines are freed up
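
The arithmetic behind this equilibrium can be sketched as a rough steady-state estimate. This is only a back-of-the-envelope model that assumes task assignment is fully serialized and uses the approximate figures quoted above; the function names are made up for illustration. Under these assumptions the ceiling is ~60 busy machines, and the observed ~30-50 sits below it.

```python
# Back-of-the-envelope model of the scheduler's steady state.
# All numbers are the approximate figures quoted in this issue.

ASSIGN_SECONDS = 2.0      # time to assign one task (mostly waiting on machine_lock)
ANALYSIS_SECONDS = 120.0  # ~2 minutes per analysis
TOTAL_MACHINES = 100


def assignments_per_window(window_seconds: float) -> float:
    """Tasks that can be assigned in a window when assignment is serialized."""
    return window_seconds / ASSIGN_SECONDS


def steady_state_busy_machines() -> float:
    """Little's law: busy machines = assignment rate * time a machine stays busy."""
    assign_rate = 1.0 / ASSIGN_SECONDS
    return min(TOTAL_MACHINES, assign_rate * ANALYSIS_SECONDS)


def assign_seconds_to_saturate() -> float:
    """Largest per-task assignment time that could still keep every machine busy."""
    return ANALYSIS_SECONDS / TOTAL_MACHINES


print(assignments_per_window(10))    # 5.0
print(steady_state_busy_machines())  # 60.0
print(assign_seconds_to_saturate())  # 1.2
```

In this simplified model, assignment would need to take roughly 1.2 seconds or less per task for all 100 machines to stay busy.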

My proposition in this PR:

  • Reduce the time the machine_lock is held by moving anything that doesn't need the lock out of the locked section.
    • The route_network method does not need to be in this section and takes a bit of time, so moving it out cuts the machine_lock blocking time.
    • Merge the guest_start and set_task_vm methods into one, since both commit & refresh the DB for a Task row. This speeds up the blocking time a bit.
  • Remove the sleep between scheduler loops. This results in more DB calls, but removes an arbitrary delay from the loop. Is there a smarter way to avoid sleeping here? Any input is welcome.
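
The first idea, keeping only the machine selection inside the lock, might look roughly like the sketch below. This is not the CAPEv2 code: `free_machines`, `slow_network_setup`, and `assign_task` are hypothetical stand-ins, and only the pattern (reserve under the lock, do slow work after releasing it) is the point.

```python
import threading

machine_lock = threading.Lock()
free_machines = ["vm1", "vm2", "vm3"]  # hypothetical pool of idle machines
lock_state_during_setup = []           # records whether the lock was held during setup


def slow_network_setup(machine: str) -> None:
    """Hypothetical stand-in for per-task setup that does not need the lock."""
    lock_state_during_setup.append(machine_lock.locked())


def assign_task(task_id: int):
    # Critical section: only pick and reserve a machine.
    with machine_lock:
        if not free_machines:
            return None
        machine = free_machines.pop()
    # Everything else runs after the lock has been released.
    slow_network_setup(machine)
    return task_id, machine


assignments = [assign_task(t) for t in range(4)]
print(assignments)
print(lock_state_during_setup)
```

Because the slow step runs outside the `with` block, another thread could acquire the lock and reserve its own machine while this one is still setting up.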

The availables() method is called on every iteration of the scheduler loop, and the <query>.count() method in SQLAlchemy is known to be slow. Here is a gist describing a way to speed it up: https://gist.github.com/hest/8798884
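
To illustrate the difference in query shape, here is a small sketch using only the standard-library sqlite3 module. SQLAlchemy's `Query.count()` wraps the filtered SELECT in a subquery, whereas a direct count (in SQLAlchemy, typically something like `session.query(func.count(Model.id)).filter(...).scalar()`) counts rows in a single SELECT. The `machines` table and `locked` column here are assumptions for illustration, not the actual CAPEv2 schema, and whether the direct form is measurably faster depends on the database backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE machines (id INTEGER PRIMARY KEY, locked INTEGER NOT NULL)")
# 100 hypothetical machines, every fourth one locked.
conn.executemany(
    "INSERT INTO machines (locked) VALUES (?)",
    [(int(i % 4 == 0),) for i in range(100)],
)

# Shape emitted by Query.count(): the filtered SELECT becomes a subquery.
WRAPPED = """
SELECT count(*) FROM (
    SELECT machines.id, machines.locked FROM machines WHERE machines.locked = 0
) AS anon_1
"""

# Shape of a direct count: one SELECT, no subquery.
DIRECT = "SELECT count(machines.id) FROM machines WHERE machines.locked = 0"

wrapped_count = conn.execute(WRAPPED).fetchone()[0]
direct_count = conn.execute(DIRECT).fetchone()[0]
print(wrapped_count, direct_count)  # 75 75
```

Both queries return the same number; only the work the database has to do to get there differs.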

These modifications help a bit: sometimes 12 tasks are assigned within a 10-second period, which is great. BUT I think there is a way to improve how the machine_lock is used. Maybe with semaphores? I'm not sure...

Maybe something smart like a bounded semaphore that can be acquired X times where X is the number of relevant available machines?
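
A minimal sketch of that bounded-semaphore idea, using Python's `threading.BoundedSemaphore`, could look like the following. It is an illustration under assumptions rather than a drop-in change for scheduler.py: the machine names, `pool_lock`, and `run_task` are hypothetical, and the real scheduler would also need to handle machines that only suit some tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

AVAILABLE = ["vm1", "vm2", "vm3"]  # hypothetical pool of relevant machines
slots = threading.BoundedSemaphore(len(AVAILABLE))  # X permits, X = available machines
pool_lock = threading.Lock()       # guards only the short list operations below
free = list(AVAILABLE)


def run_task(task_id: int):
    # Blocks only when all X machines are taken,
    # not while another task is still being set up.
    with slots:
        with pool_lock:
            machine = free.pop()  # holding a permit guarantees the list is non-empty
        try:
            # Placeholder for starting and running the analysis on `machine`.
            return task_id, machine
        finally:
            # Return the machine before the permit is released.
            with pool_lock:
                free.append(machine)


with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_task, range(10)))

print(len(results), sorted(free))
```

The semaphore limits how many tasks hold a machine at once, while the inner lock is held only for the list pop or append, so up to X tasks can be in setup at the same time.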

cccs-kevin · Aug 23 '22 20:08