Enable multiple worker execution

Open TimDiekmann opened this issue 3 years ago • 0 comments

🌟 What is the purpose of this PR?

Currently, we are setting the number of workers to 1. In order to achieve good scaling, the number of workers should be adjustable so tasks can run on different threads (and processes in the case of Python) in parallel.

⚠️ Known issues

  • Simulations that remove agents do not yet reliably pass. This is probably because groups are removed by index rather than by ID: when a batch that is not at the last index is removed, the indices of the other batches change. For worker indices, the second bullet in "What does this change?" provides a fix.
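
To illustrate the suspected cause, here is a minimal sketch of how removing by position invalidates previously recorded indices, while removing by a stable ID does not. `Batch`, `remove_by_index`, and `remove_by_id` are hypothetical names made up for this example; they are not the engine's actual types or functions.

```rust
/// Hypothetical stand-in for a batch (group) of agents; not the engine's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub id: u64,
    pub agents: usize,
}

/// Removes a batch by position. Every batch stored after `index` shifts down
/// by one, so any index recorded earlier for those batches becomes stale.
pub fn remove_by_index(batches: &mut Vec<Batch>, index: usize) -> Batch {
    batches.remove(index)
}

/// Removes a batch by its stable ID, so callers never depend on positions.
/// Returns `None` if no batch with that ID exists.
pub fn remove_by_id(batches: &mut Vec<Batch>, id: u64) -> Option<Batch> {
    let position = batches.iter().position(|batch| batch.id == id)?;
    Some(batches.remove(position))
}
```

With three batches, removing the one at index 0 moves the batch that used to be at index 2 to index 1, so a caller still holding the old index would either hit the wrong batch or go out of bounds. Looking the batch up by ID avoids that, at the cost of a linear search.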

🚫 Blockers

🔍 What does this change?

  • Remove the hardcoded setting of number of workers to 1
  • Fix the removal of active workers in DistributionController. Previously, removing a worker with a lower index (e.g. 0) changed the indices of the remaining workers. To avoid this, I replaced the Vec in the active_workers field with a Set. Alternatively, we could have searched the vector for the worker ID. Done in cca9b2b04859167cb10da15af28e63c8d9d5132a at packages/engine/src/workerpool/pending.rs. This enables running simulations with agents distributed for all workers (i.e. no workers have "empty" tasks). This leads to:
  • Only create behavior tasks if the task would have agents associated with: a2b1b94f2c49b760093445bc07889e078ae4616a
  • Add an #[instrument] and debug! output to track the current worker_index: 560d61977184e75c9299b69a7d155b168083c818
  • When multiple workers are enabled, some workers may have no batches assigned. c1fffd891ab343f5788fb0af7473206fa76138d1 handles empty batches correctly when new agents are added while running context packages. It also sets the worker_index on batches correctly so that it is picked up when distributing new agents.
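
The two ideas above (tracking active workers in a set, and only creating tasks for workers that have agents) can be sketched as follows. This is an illustration under stated assumptions, not the code from the commits: `ActiveWorkers`, `finish`, `all_done`, and `workers_with_agents` are hypothetical names, and `HashSet` is assumed as the set type.

```rust
use std::collections::HashSet;

/// Hypothetical bookkeeping for workers that still have a pending task.
/// Keyed by worker index, so removing one worker never renumbers the others.
pub struct ActiveWorkers {
    workers: HashSet<usize>,
}

impl ActiveWorkers {
    pub fn new(workers: impl IntoIterator<Item = usize>) -> Self {
        Self {
            workers: workers.into_iter().collect(),
        }
    }

    /// Marks a worker as finished. Returns `true` if it was still active.
    pub fn finish(&mut self, worker_index: usize) -> bool {
        self.workers.remove(&worker_index)
    }

    /// Returns `true` once every active worker has finished.
    pub fn all_done(&self) -> bool {
        self.workers.is_empty()
    }
}

/// Returns the indices of workers that have at least one agent assigned,
/// i.e. the only workers for which a task would be created.
pub fn workers_with_agents(agents_per_worker: &[usize]) -> Vec<usize> {
    agents_per_worker
        .iter()
        .enumerate()
        .filter_map(|(index, &count)| (count > 0).then_some(index))
        .collect()
}
```

For example, with agent counts `[4, 0, 7]`, only workers 0 and 2 would receive a task, and finishing worker 0 first leaves worker 2 addressable by the same index as before.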

📜 Does this require a change to the docs?

No

🔗 Related links

🛡 What tests cover this?

The integration tests should cover this well enough

❓ How to test this?

Run different simulations with multiple workers enabled. Simulations that add or remove agents are especially important to test.

TimDiekmann avatar Mar 10 '22 13:03 TimDiekmann