Parallelize Synapse worker template render
I noticed there are a few areas in the playbook that could benefit from running tasks in parallel rather than serially, and I've seen significant time savings from doing so when running the roles. I started by updating the Synapse install role to render worker configuration and service files in parallel rather than serially, and I wanted to share my changes for feedback and/or merging into the playbook if this is of interest.
Some quick benchmarks for running the `install-synapse` tag with workers enabled:
| Controller CPU + approx. total workers | Original duration (mm:ss) | New duration (mm:ss) | % Reduction |
|---|---|---|---|
| Intel Xeon E7-8880, ~30+ workers | ~5:45 | ~5:15 | 8.7% |
| Ryzen 5900X, ~30+ workers | ~5:00 | ~2:10 | 56% |
| Ryzen 5900X, ~130+ workers | ~19:35 | ~4:15 | 78% |
Some things to note regarding the implementation:
- `ansible.builtin.template` unfortunately does not support running the task asynchronously directly in the playbook, so the logic for launching the templating tasks has to shell out to ad-hoc `ansible` commands, passing in the required environment variables (sketched below).
- The prerequisites in the docs indicate that Ansible can be, but is not required to be, installed on the server. Therefore, the templating tasks are delegated to run on the controller.
- I ran into SSH connection-reset errors because all the template tasks running in parallel exhausted the available sessions on my server. If there is interest in merging this PR, it may be helpful to add a note to the documentation about increasing the `MaxSessions` sshd configuration option on your server if you run into such errors.
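To make the pattern more concrete, here is a minimal sketch of the fire-and-forget approach - not the exact tasks from my branch, and with hypothetical template paths and variable names:

```yaml
# Minimal sketch (hypothetical names/paths). Each iteration shells out to an
# ad-hoc `ansible` call on the controller, which renders one worker's config
# file on the server, without waiting for it to finish.
- name: Render Synapse worker configuration files in parallel
  ansible.builtin.command:
    argv:
      - ansible
      - matrix_servers  # the host pattern the ad-hoc task runs against
      - -i
      - inventory/hosts
      - -m
      - ansible.builtin.template
      - -a
      - "src=synapse/worker.yaml.j2 dest={{ matrix_synapse_config_path }}/worker.{{ item.name }}.yaml"
  delegate_to: 127.0.0.1  # Ansible is only guaranteed to exist on the controller
  async: 300  # allow each job up to 5 minutes
  poll: 0     # fire-and-forget; results are collected below
  register: matrix_synapse_worker_template_jobs
  loop: "{{ matrix_synapse_workers_enabled_list }}"

# Collect the results, so the play still fails loudly if any render failed.
- name: Wait for the worker templating jobs to finish
  ansible.builtin.async_status:
    jid: "{{ item.ansible_job_id }}"
  delegate_to: 127.0.0.1
  loop: "{{ matrix_synapse_worker_template_jobs.results }}"
  register: matrix_synapse_worker_template_job_status_result_list
  until: matrix_synapse_worker_template_job_status_result_list.finished
  retries: 60
  delay: 2
```

Each of those ad-hoc calls opens its own SSH session to the server, which is what exhausts sshd's `MaxSessions` (the default is 10).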
If there is interest in parallelizing more roles in the playbook, let me know - I think there are other roles whose tasks could benefit from running in parallel, and I'd be happy to share further contributions.
This is quite interesting and it seems nicely done! I'm just worried about the additional complication it introduces.
Seems like this is mostly helpful for very large deployments, and possibly also when there is high latency between the Ansible controller and the remote server.
Benchmarks
I'll add my own benchmark as well to serve as extra data points.
My Ansible controller is my mini PC, which has an AMD Ryzen 9 6900HX CPU. I'm testing against 2 different servers - doing the tests one by one.
Command: `ansible-playbook -i inventory/hosts setup.yml --tags=install-synapse -l MY_MATRIX_HOSTNAME`
This is like just `install-service -l ..`, but skips the service-restarting part, which is slow and somewhat unrelated.
Measuring systemd service restarts would probably hide some of the benefit: the total time grows, so faster templating of worker files matters proportionally less - see Amdahl's law. Then again, including service restarts would have made the benchmarks more representative of real-world usage, because most people would definitely be restarting services after a playbook run.
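For reference, Amdahl's law: if a fraction $p$ of the total run time is sped up by a factor $s$, the overall speedup is

$$S = \frac{1}{(1 - p) + \frac{p}{s}}$$

So if templating were, say, half of an `install-synapse` run and became 5x faster, the run as a whole would only get $1 / (0.5 + 0.5/5) \approx 1.67$x faster - and a fixed service-restart phase on top shrinks $p$ further.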
| Host | Workers | Server Latency | Before (synchronous) | After (async, your branch as-is) | After (async, ansible target host fixes) |
|---|---|---|---|---|---|
| A | 13 (one-of-each) | 35ms | 61 sec. (baseline) | 59 sec. (4% improvement) | 52 sec. (15% improvement) |
| B | 13 (one-of-each) | 4ms | 31 sec. (baseline) | not measured | 30 sec. (4% improvement) |
The *After (async, ansible target host fixes)* column refers to me replacing `matrix_servers` with `{{ inventory_hostname }}` - see the important bugfix below.
It seems like when the total number of workers is low and the server latency is tiny, doing things asynchronously does not help much. I expect that this PR has some complexity cost which won't help most users of the playbook, but may certainly help some large-scale deployments.
As your benchmarks indicate, when targeting a server with 130 workers, it makes a big difference. My benchmarks do not target such a large deployment, but I trust your results. It's probably a good idea to confirm whether your benchmarks used `ansible matrix_servers` or `ansible {{ inventory_hostname }}` though, as that may be skewing the results.
Areas for improvement
Some areas for improvement for this PR might be:
- (important bugfix) not hardcoding `matrix_servers` in the `ansible` command call, as that runs the task against all servers (with the data for a single host even, causing incorrect results for the other hosts). You don't know how the original `ansible-playbook` command has been invoked and how many hosts it targets. You can see in my example that I'm using `ansible-playbook .. -l MY_MATRIX_HOSTNAME` to limit execution to a single host. I suppose we can do `ansible {{ inventory_hostname }} -i ...` or something - each host spawning tasks only for itself (see the sketch after this list).
- (portability improvement) not hardcoding `inventory/hosts` and trying to use some Ansible variable for it instead. Some people may have their inventory elsewhere, or their current working directory may not be the Ansible playbook itself (e.g. `ansible-playbook -i /playbook/inventory/hosts /playbook/setup.yml ...` or `ansible-playbook -i /inventory/hosts /playbook/setup.yml ...`).
- keeping the old code (`ansible.builtin.template`) in a comment next to the `ansible.builtin.command` call. Well, one could figure it out from the arguments passed to `ansible` (`-a`, etc.), but it's probably more obvious/approachable if it's there to see.
- complicating things further, by introducing some variable which limits how many async tasks would be spawned (e.g. `matrix_synapse_ansible_async_tasks_limit: 50`). After having spawned a certain number of tasks (`matrix_synapse_worker_template_job_status_result_list | length`), it may fall back to doing things synchronously (perhaps by changing `poll: 0` to `poll: 60` or something, to make tasks non-adhoc past a certain point). This may also help if one runs the playbook against many hosts, each of which may wish to spawn an infinite number of async tasks, multiplying the problem.
- (debatable if it should be done) complicating things further by allowing for both async- and sync- ways of working. Maintaining 2 code paths is probably not a good idea though. It's best if we all run the same code, or the non-default way of doing things will fall out of date at some point. Perhaps setting `matrix_synapse_ansible_async_tasks_limit` to `0` is one way to make the async code less parallel, by forcing `poll: 60` to be used for all tasks.
- extracting this `poll: 60` value (used when the async limit `matrix_synapse_ansible_async_tasks_limit` has been hit) into yet another variable (e.g. `matrix_synapse_ansible_async_poll_duration: 60`) to make it configurable. `matrix_synapse_ansible_async_poll_duration` may also serve a double purpose - it can be used for checking on the status (`ansible.builtin.async_status`).
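Roughly, a fixed-up task combining the first two points and the async limit might be shaped like this. It's an untested sketch, and I'm not sure the `poll` keyword templates cleanly per loop item, so treat that line as pseudocode for the fallback idea:

```yaml
# Untested sketch; the matrix_synapse_ansible_* variable names are the
# proposed ones from the list above.
- name: Render Synapse worker configuration files in parallel
  # The old synchronous code (ansible.builtin.template) would be kept in a comment here.
  ansible.builtin.command:
    argv:
      - ansible
      # each host spawns tasks only for itself, instead of a hardcoded `matrix_servers`
      - "{{ inventory_hostname }}"
      - -i
      # avoid hardcoding `inventory/hosts`; assumes a single inventory source was passed
      - "{{ ansible_inventory_sources | first }}"
      - -m
      - ansible.builtin.template
      - -a
      - "src=synapse/worker.yaml.j2 dest={{ matrix_synapse_config_path }}/worker.{{ item.name }}.yaml"
  delegate_to: 127.0.0.1
  async: 300
  # fire-and-forget below the limit; past it, fall back to synchronous polling
  poll: "{{ 0 if item_index < matrix_synapse_ansible_async_tasks_limit else matrix_synapse_ansible_async_poll_duration }}"
  loop: "{{ matrix_synapse_workers_enabled_list }}"
  loop_control:
    index_var: item_index
```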
Thanks Slavi, and I appreciate the detailed feedback! Yeah, most of the benefit comes from large deployments that have a lot of generic workers or event stream workers. I think smaller deployments might also benefit from having a couple more generic workers than the one-of-each preset provides, but that is a topic to revisit in the future, after collecting more load-testing data and observing CPU load across all workers under an average-case chat server load.
I re-ran one of the benchmarks with the suggested bugfix, and here is what I got:
| Controller CPU | Workers | Server Latency | Before (synchronous) | After (async, your branch as-is) | After (async, ansible target host fixes) |
|---|---|---|---|---|---|
| Ryzen 5900X | 123 total: 11 (one-of-each) + 96 generic + 16 events stream | 24ms | ~19:35 (baseline) | ~4:15 (78% improvement) | 3:31 (82% improvement) |
In regards to the areas for improvement, I agree with your suggestions and will follow up later to add them, along with updates to the relevant documentation articles. Thanks again for the feedback!
Here's a different crazy idea for parallelizing Synapse workers.
I've seen some Ansible resources suggest creating multiple entries in your inventory for a single machine. For example, maybe you want to run two different instances of a given service, with two totally different configurations. Apparently one way to do it is to create two inventory entries and pretend that those are two totally separate hosts. (I don't know enough about Ansible to know whether this is an accepted best practice or not.)
So, along those lines: Might it be possible to create multiple fake "hosts" under `matrix_servers` and configure each of them to run some subset of the Synapse workers?
In the limit, you could create one inventory entry for each Synapse worker, and then your parallelism would be limited only by the number of SSH connections that Ansible is willing to run. In reality that's probably crazy. But for ~128ish workers, it might be feasible to have 16 "virtual" hosts running 8 workers each.
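If I'm reading the Ansible docs right, the mechanics would look something like this in a YAML inventory - all names hypothetical, with `ansible_host` pointing every alias at the same machine:

```yaml
# Hypothetical sketch: three "virtual" hosts that are all the same machine.
# `synapse_worker_shard` is an invented variable that a role could use to
# decide which subset of the workers each alias is responsible for.
matrix_servers:
  hosts:
    matrix-workers-1:
      ansible_host: matrix.example.com
      synapse_worker_shard: 1
    matrix-workers-2:
      ansible_host: matrix.example.com
      synapse_worker_shard: 2
    matrix-workers-3:
      ansible_host: matrix.example.com
      synapse_worker_shard: 3
```

Ansible would then run the play against the aliases in parallel, up to its `forks` setting (default 5), each alias with its own SSH connections.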
Splitting one host into multiple "virtual" hosts has the following problems:
- there are a lot of tasks that run in each role, which may be duplicated and run on every "virtual" host. Rare race conditions may be possible (generally unlikely, but there's a chance). You'd probably like to run some common tasks in some places and then split some others across hosts. This will be tricky
- right now, when workers are configured, the flow is something like this:
  1. create a list of workers that we'll be setting up for this installation. Make a plan
  2. go through any existing workers you can find on the machine. Stop them and delete their configuration
  3. set up the new workers
In step 2 of setting up workers, we delete anything we don't recognize as a valid worker for the current configuration. If there are multiple Ansible instances operating on the machine, some will stop & delete worker configuration done by other instances.
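To make the step-2 danger concrete, the cleanup logic is roughly shaped like this - a paraphrase, not the playbook's actual tasks, with illustrative paths and variable names:

```yaml
# Paraphrased sketch of the cleanup phase (step 2). If several "virtual" hosts
# run this concurrently, each one considers the others' workers unrecognized
# and removes them.
- name: Find existing Synapse worker systemd services
  ansible.builtin.find:
    paths: /etc/systemd/system
    patterns: 'matrix-synapse-worker-*.service'
  register: found_worker_services

- name: Remove worker services that are not part of this host's plan
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ found_worker_services.files }}"
  # `planned_worker_service_names` is an invented stand-in for the list built in step 1
  when: (item.path | basename) not in planned_worker_service_names
```

With one real host per machine, "this host's plan" and "everything on the machine" agree; with many virtual hosts per machine, they do not.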