CAPEv2
Updating scheduler/database to improve the time a machine is locked for
Something that I've noticed in the following environment:
- you have ~100 clients that submit tasks and poll until analysis completion,
- you have 100 machines that are eligible for all tasks
- each analysis runs for ~2 minutes
The machine_lock logic in scheduler.py prevents all 100 machines from being assigned tasks, and the system reaches an equilibrium of ~50 tasks in the pending queue with ~30-50 machines in use. Here is why:
- It takes ~2 seconds for a task to be assigned to a machine, and the majority of this time is spent waiting to acquire the machine_lock: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/scheduler.py#L184
- Therefore every ten seconds, ~5 tasks are always assigned to a machine
- Analyses complete periodically, depending on the file type
- Every ten seconds, >5 tasks are completed and machines are freed up
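As a rough back-of-envelope check of this equilibrium, Little's law says that the average number of machines in use is about the assignment rate multiplied by the analysis duration. The sketch below plugs in the approximate figures from the list above; the function name is made up for illustration and the model ignores everything except the serialized assignment step.

```python
def max_busy_machines(assign_seconds: float, analysis_seconds: float, machines: int) -> float:
    """Rough ceiling on machines in use when assignments are serialized by one lock."""
    assignments_per_second = 1.0 / assign_seconds
    # Little's law: average number in service = arrival rate * time in service.
    busy = assignments_per_second * analysis_seconds
    # The pool size is a hard cap regardless of how fast tasks are assigned.
    return min(busy, machines)


# ~2 s per assignment, ~2 minute analyses, 100 machines
print(max_busy_machines(2.0, 120.0, 100))  # 60.0
```

Under these assumptions the scheduler cannot keep more than about 60 of the 100 machines busy, and analyses that finish in under two minutes would pull that figure lower, which would be consistent with the 30-50 machines observed.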
My proposition in this PR:
- Reduce the time that the machine_lock is held by moving work that doesn't require the lock outside of the locked section.
  - Moving the route_network method: it does not need to be in this section and takes a bit of time, so moving it out cuts the machine_lock blocking time.
  - Merging the guest_start and set_task_vm methods into one, since both methods commit & refresh the DB for a Task row. This reduces the blocking time a bit.
- Remove the sleep between scheduler loops. This results in more DB calls, but removes an arbitrary sleep in the loop. There may be a smarter way to avoid sleeping here? Any input is welcome.
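The idea of shrinking the locked section can be sketched roughly as follows. This is not the actual scheduler.py code: `free_machines`, `assign_and_start`, `launch`, and the `log` argument are hypothetical stand-ins, `route_network` here is a dummy with the same name as the real method, and the DB commit is simulated by appending to a list.

```python
import threading

machine_lock = threading.Lock()
free_machines = ["vm1", "vm2"]  # stand-in for the machines table


def route_network(task_id: int, machine: str, log: list) -> None:
    # Stand-in for the slow per-task network setup.
    # Records whether machine_lock is currently held by anyone.
    log.append(("route_network", task_id, machine_lock.locked()))


def assign_and_start(task_id: int, machine: str, log: list) -> None:
    # Stand-in for a merged guest_start + set_task_vm:
    # one commit/refresh of the Task row instead of two.
    log.append(("commit", task_id, machine))


def launch(task_id: int, log: list) -> str:
    # Only machine selection and the single commit happen under the lock.
    with machine_lock:
        machine = free_machines.pop()
        assign_and_start(task_id, machine, log)
    # Work that does not touch shared machine state runs after release,
    # so other analysis threads can acquire the lock in the meantime.
    route_network(task_id, machine, log)
    return machine
```

Calling `launch(1, log)` with an empty `log` records the commit first and then a `route_network` entry whose last field is `False`, showing that the lock had already been released when the network setup ran.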
The availables() method is called on every iteration of the scheduler loop, and the <query>.count() method in SQLAlchemy is known to be slow. Here is a thread discussing a way to speed it up: https://gist.github.com/hest/8798884
These modifications help a bit: sometimes 12 tasks are assigned within a 10-second period, which is great, BUT I think there is a way to improve how the machine_lock is used. Maybe with semaphores? I'm not sure...
Maybe something smart like a bounded semaphore that can be acquired X times where X is the number of relevant available machines?
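To make the bounded-semaphore idea concrete, here is a minimal sketch using only the standard library. `MachinePool` and `run_tasks` are hypothetical names, not part of CAPEv2, and the sketch assumes every machine is eligible for every task; handling "relevant" machines (tags, platform) would need something more, for example one pool per machine group.

```python
import threading
from collections import deque


class MachinePool:
    """Hypothetical pool: the semaphore counts free machines, the lock only guards the deque."""

    def __init__(self, machines):
        self._free = deque(machines)
        self._slots = threading.BoundedSemaphore(len(machines))
        self._lock = threading.Lock()

    def acquire(self, timeout=None):
        # Blocks (no polling sleep) until a machine is free; returns None on timeout.
        if not self._slots.acquire(timeout=timeout):
            return None
        with self._lock:  # held only for the pop
            return self._free.popleft()

    def release(self, machine):
        with self._lock:
            self._free.append(machine)
        # BoundedSemaphore raises ValueError if released more often than acquired.
        self._slots.release()


def run_tasks(pool, task_count, analysis):
    """Run each task on its own thread; return the peak number of machines in use."""
    state = {"busy": 0, "peak": 0}
    state_lock = threading.Lock()

    def worker(task_id):
        machine = pool.acquire()
        with state_lock:
            state["busy"] += 1
            state["peak"] = max(state["peak"], state["busy"])
        try:
            analysis(task_id, machine)
        finally:
            with state_lock:
                state["busy"] -= 1
            pool.release(machine)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(task_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With this shape, up to X tasks can claim machines concurrently, and a task waiting for a free machine blocks on the semaphore rather than sleeping and re-polling. Whether this fits the existing scheduler loop and its DB state is an open question.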