CAPEv2
Updating scheduler/database to improve the time a machine is locked for
Something that I've noticed in the following environment:
- you have ~100 clients that submit tasks and poll until analysis completion
- you have 100 machines that are eligible for all tasks
- each analysis runs for ~2 minutes
The logic in `scheduler.py` around the `machine_lock` prevents all 100 machines from being assigned a task, and the system settles into an equilibrium of ~50 tasks in the pending queue with ~30-50 machines in use. Here is why:
- It takes ~2 seconds for a task to be assigned to a machine, and the majority of this time is spent waiting to acquire the `machine_lock`: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/scheduler.py#L184
- Therefore, every ten seconds, ~5 tasks are assigned to machines
- Analyses complete periodically, depending on the file type
- Every ten seconds, >5 tasks are completed and machines are freed up
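As a rough sanity check on these numbers, here is a back-of-envelope estimate. It assumes assignments are fully serialized at ~2 seconds each and that every analysis takes exactly 2 minutes; both are simplifications, so treat the result as an estimate rather than a measurement.

```python
# Back-of-envelope steady-state estimate using the figures quoted above.
ASSIGN_SECONDS = 2.0      # observed time to assign one task (mostly lock wait)
ANALYSIS_SECONDS = 120.0  # ~2 minutes per analysis
MACHINES = 100

# Tasks that can be started per second if assignments are serialized.
assign_rate = 1.0 / ASSIGN_SECONDS
assigned_per_ten_seconds = assign_rate * 10

# Little's law: average busy machines = start rate * time each stays busy.
busy_ceiling = min(MACHINES, assign_rate * ANALYSIS_SECONDS)

# Longest assignment time that would still keep every machine busy.
max_assign_seconds = ANALYSIS_SECONDS / MACHINES

print(f"assigned per 10 s: {assigned_per_ten_seconds:.1f}")
print(f"ceiling on busy machines: {busy_ceiling:.0f}")
print(f"assignment must take <= {max_assign_seconds:.1f} s to fill the pool")
```

Under these assumptions the pool cannot exceed about 60 busy machines, which fits the observed ~30-50, and an assignment would need to take about 1.2 seconds or less for all 100 machines to stay in use.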
My proposal in this PR:
- Reduce the time the `machine_lock` is held by moving work that doesn't require it out of the locked section.
  - The `route_network` method does not need to be in this section and takes a bit of time, so moving it out cuts the `machine_lock` blocking time.
  - Merge the `guest_start` and `set_task_vm` methods into one, since both methods commit & refresh the DB for a Task row. This cuts the blocking time a bit.
- Remove the sleep between scheduler loops. This results in more DB calls, but gets rid of an arbitrary delay in the loop. There may be a smarter way to avoid sleeping here? Any input is welcome.
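To illustrate the idea of shrinking the locked section, here is a minimal sketch. It is not the actual scheduler code: `pick_machine`, `mark_task_started`, the in-memory `task_rows` dict and the body of `route_network` are hypothetical stand-ins, and it assumes routing only needs to know which machine was chosen.

```python
import threading
import time

machine_lock = threading.Lock()
task_rows = {}                  # hypothetical stand-in for the Task table
lock_held_during_routing = []   # records whether the lock was held while routing


def pick_machine(task_id):
    # Hypothetical stand-in for selecting and reserving a machine.
    return f"vm-{task_id}"


def mark_task_started(task_id, machine):
    # Hypothetical stand-in for a merged guest_start + set_task_vm:
    # one write (one commit) instead of two separate ones.
    task_rows[task_id] = {"machine": machine, "status": "running"}


def route_network(task_id, machine):
    # Stand-in for the slower routing step; note whether the lock is held.
    lock_held_during_routing.append(machine_lock.locked())
    time.sleep(0.01)


def assign(task_id):
    # Only the machine choice and the single DB update stay under the lock.
    with machine_lock:
        machine = pick_machine(task_id)
        mark_task_started(task_id, machine)
    # Routing happens after the lock is released, so other tasks can be
    # assigned while this one is still setting up its network.
    route_network(task_id, machine)
    return machine
```

Calling `assign(1)` from a single thread returns `"vm-1"` and records `False` in `lock_held_during_routing`, showing that the slow step no longer runs inside the critical section.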
The `availables()` method is called on every iteration of the scheduler loop, and the `<query>.count()` method it uses in SQLAlchemy is known to be slow. Here is a gist discussing a way to speed it up: https://gist.github.com/hest/8798884
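For context on why `<query>.count()` can be slow: SQLAlchemy's `Query.count()` wraps the original query in a subquery and counts its rows, whereas a direct `COUNT` avoids the wrapper. The sketch below shows the two SQL shapes using the standard-library `sqlite3` module so it runs without a database server. The `machines` table and its columns are made up for illustration and are not the CAPEv2 schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE machines (id INTEGER PRIMARY KEY, name TEXT, locked INTEGER)"
)
conn.executemany(
    "INSERT INTO machines (name, locked) VALUES (?, ?)",
    [(f"vm-{i}", i % 2) for i in range(100)],
)

# Shape produced by Query.count(): count the rows of a wrapped subquery
# that still selects every mapped column.
wrapped = conn.execute(
    "SELECT count(*) FROM "
    "(SELECT id, name, locked FROM machines WHERE locked = 0) AS anon_1"
).fetchone()[0]

# Direct shape: count straight from the table with the same filter.
direct = conn.execute(
    "SELECT count(*) FROM machines WHERE locked = 0"
).fetchone()[0]

print(wrapped, direct)
```

Both statements return the same number; the difference is only in how much work the database does. In SQLAlchemy the direct form is usually written with `func.count()`, for example `session.query(func.count(Machine.id)).filter(...).scalar()`, where `Machine` is assumed to be the mapped model. How much this helps depends on the database backend, so it is worth measuring before relying on it.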
These modifications help a bit: sometimes 12 tasks are assigned within a 10-second period, which is great, BUT I think there is a way to improve how the `machine_lock` is used. Maybe with semaphores? I'm not sure...
Maybe something smart like a bounded semaphore that can be acquired X times where X is the number of relevant available machines?
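To make the semaphore idea concrete, here is a rough sketch of what it could look like. `MachinePool` and `run_demo` are hypothetical names, the pool size is fixed at construction time, and every machine is assumed to be eligible for every task, so this is a starting point for discussion rather than a drop-in change.

```python
import threading
import time
from collections import deque


class MachinePool:
    """Hypothetical pool with one semaphore slot per available machine."""

    def __init__(self, machine_names):
        self._free = deque(machine_names)
        # X = number of machines; acquire() blocks once all X are in use.
        self._slots = threading.BoundedSemaphore(len(self._free))
        # Short-lived lock that protects only the deque of free machines.
        self._guard = threading.Lock()

    def acquire(self, timeout=None):
        """Reserve a machine, or return None if none frees up in time."""
        if not self._slots.acquire(timeout=timeout):
            return None
        with self._guard:
            return self._free.popleft()

    def release(self, name):
        """Return a machine to the pool once its analysis has finished."""
        with self._guard:
            self._free.append(name)
        self._slots.release()


def run_demo(n_machines=3, n_tasks=10):
    """Run n_tasks fake analyses and report (peak machines in use, tasks done)."""
    pool = MachinePool([f"vm-{i}" for i in range(n_machines)])
    stats_lock = threading.Lock()
    in_use, peak, done = set(), [0], [0]

    def analyse(task_id):
        machine = pool.acquire()
        with stats_lock:
            assert machine not in in_use  # a machine is never handed out twice
            in_use.add(machine)
            peak[0] = max(peak[0], len(in_use))
        time.sleep(0.01)  # stand-in for the analysis itself
        with stats_lock:
            in_use.discard(machine)
            done[0] += 1
        pool.release(machine)

    threads = [threading.Thread(target=analyse, args=(i,)) for i in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak[0], done[0]
```

With this shape, up to X tasks can hold a machine at once, and the only shared lock is held just long enough to pop or push a name. A `BoundedSemaphore` also raises `ValueError` if it is released more times than it was acquired, which would surface bookkeeping bugs early. In the real scheduler X changes as machines come and go, and tasks may only match some machines, so that part would still need designing.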