Evaluate further multithreading improvements

Open hoffie opened this issue 2 years ago • 7 comments

Has this feature been discussed and generally agreed?

No.

Describe the solution you'd like

I think there are multiple places where the multithreading behavior could be improved further. These are things to investigate and benchmark first, so there is no plan to modify anything yet.

  • [ ] The multithreading code currently hardcodes the number of threads to the number of visible processor cores. While this might be good for maximum throughput, it might be bad for predictability. In other words: There might be other tasks running on the same machine, e.g. kernel threads which require low latency (network I/O), other Jamulus instances or even other non-Jamulus server processes.
    • [ ] We should check if this default makes sense or if $NUM_CORES-1 would be better. Previous discussion: https://github.com/jamulussoftware/jamulus/pull/960#pullrequestreview-585007016 https://github.com/jamulussoftware/jamulus/pull/960#issuecomment-839836327 (If we are unsure whether touching the default is a good idea, we should leave it as-is to avoid causing problems for users)
    • [ ] We should look into making this configurable or document how to do it outside of Jamulus (cpuset?). At the very least we should log (on startup) how many threads will be used. Previous discussion: https://github.com/jamulussoftware/jamulus/pull/960#issuecomment-782758759 https://github.com/jamulussoftware/jamulus/pull/960#discussion_r632814416
  • [ ] The amount of work which is put into the thread pool is pre-calculated both for decoding and encoding. In other words, on a 5 core machine with 21 clients, 5 threads will be used and blocks of work are generated as 5, 5, 5, 5, 1 (AFAIU).
    • [ ] This calculation is sub-optimal. Having one job which does far less work than the others doesn't seem logical (of course, depending on client count and number of threads, a perfectly balanced distribution isn't always possible). For 21 clients and 5 threads, 5, 4, 4, 4, 4 might be better (see the sketch after this list).
    • [ ] This upfront calculation assumes that Jamulus has full insight into core performance behavior, while in practice it might not have this insight. Cores can be slower due to individual frequency scaling or due to parallel workload (network IO, system software). We don't have insight into the OS scheduler either. So unless the overhead of pThreadPool->enqueue() is huge (it requires a lock, so it may be relevant), I'd like to try using smaller blocks of work or dropping the upfront block planning altogether.
  • [ ] Find out why a Jamulus server under high load stops responding to protocol messages. Is network I/O the next bottleneck? https://github.com/jamulussoftware/jamulus/blob/master/src/server.cpp#L1476-L1511 looks interesting.
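For illustration, here is a minimal, hypothetical sketch (made-up function names, not the actual Jamulus code) of how a rounded-up block size produces the 5, 5, 5, 5, 1 split for 21 clients on 5 threads, and how spreading the remainder would produce 5, 4, 4, 4, 4 instead:

#include <algorithm>
#include <iostream>
#include <vector>

// Upfront split with a rounded-up block size: the remainder ends up in one
// small trailing block.
std::vector<int> CeilBlocks ( int iNumClients, int iNumThreads )
{
    const int        iBlockSize = ( iNumClients + iNumThreads - 1 ) / iNumThreads; // 21 clients, 5 threads -> 5
    std::vector<int> vecBlocks;
    int              iRemaining = iNumClients;

    while ( iRemaining > 0 )
    {
        vecBlocks.push_back ( std::min ( iBlockSize, iRemaining ) );
        iRemaining -= vecBlocks.back();
    }
    return vecBlocks; // 21 clients, 5 threads -> 5, 5, 5, 5, 1
}

// Balanced split: spread the remainder one client at a time over the blocks.
std::vector<int> BalancedBlocks ( int iNumClients, int iNumThreads )
{
    std::vector<int> vecBlocks ( iNumThreads, iNumClients / iNumThreads );

    for ( int i = 0; i < iNumClients % iNumThreads; i++ )
    {
        vecBlocks[i]++;
    }
    return vecBlocks; // 21 clients, 5 threads -> 5, 4, 4, 4, 4
}

int main()
{
    for ( int i : CeilBlocks ( 21, 5 ) )
        std::cout << i << ' ';
    std::cout << '\n';

    for ( int i : BalancedBlocks ( 21, 5 ) )
        std::cout << i << ' ';
    std::cout << '\n';
}

(On the logging point: Qt exposes the visible core count via QThread::idealThreadCount(), so printing the chosen thread count at startup should be cheap.)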

cc'ing @menzels who last touched these parts and might have had good reasons for not going down these paths. :) Also cc'ing @atsampson, @kraney and @softins who gave valuable input in #960.

Any insights to how benchmarking/profiling was previously done would also be helpful. I'll mainly focus on measuring decoding/mixing duration and jitter using this: https://github.com/jamulussoftware/jamulus/discussions/2440
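As a rough illustration of measuring per-frame duration and jitter, a minimal standalone sketch (MixBlock() is just a stand-in for the decode/mix work, not a real Jamulus function):

#include <chrono>
#include <cstdio>
#include <thread>

// Stand-in for the decode/mix work being measured.
static void MixBlock()
{
    std::this_thread::sleep_for ( std::chrono::microseconds ( 500 ) );
}

int main()
{
    using Clock = std::chrono::steady_clock;

    Clock::time_point lastStart = Clock::now();

    for ( int iFrame = 0; iFrame < 10; iFrame++ )
    {
        const Clock::time_point start = Clock::now();

        MixBlock();

        const auto durUs    = std::chrono::duration_cast<std::chrono::microseconds> ( Clock::now() - start ).count();
        const auto periodUs = std::chrono::duration_cast<std::chrono::microseconds> ( start - lastStart ).count();
        lastStart           = start;

        // duration = time spent in the mix; the variation of the frame-to-frame
        // period is the jitter of interest
        std::printf ( "frame %d: mix %lld us, period %lld us\n", iFrame, (long long) durUs, (long long) periodUs );
    }
}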

hoffie avatar Mar 02 '22 09:03 hoffie

I'd suggest you split this into separate issues... Looks a bit large and some parts may be more easily achieved than others.

We should look into making this configurable or document how to do it outside of Jamulus (cpuset?)

At the very least, there should be a command line option (hopefully supported in the UI...) to set the core count, in case Jamulus gets it stupidly wrong, for whatever reason. (And if -j isn't yet used, it's allocated to whatever the long option will be :) .)

are generated as 5, 5, 5, 5, 1 (AFAIU).

That could be a bug - X axis vs Y axis kind of thing. It should drop a task into each bucket in a round robin fashion, which should give a flat distribution - your "5, 4, 4, 4, 4" result.
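As an illustration of that round-robin idea (hypothetical code, not taken from Jamulus):

#include <iostream>
#include <vector>

int main()
{
    const int iNumClients = 21;
    const int iNumThreads = 5;

    std::vector<int> vecBucketSizes ( iNumThreads, 0 );

    for ( int iClient = 0; iClient < iNumClients; iClient++ )
    {
        vecBucketSizes[iClient % iNumThreads]++; // drop each client into the next bucket in turn
    }

    for ( int iSize : vecBucketSizes )
    {
        std::cout << iSize << ' '; // prints: 5 4 4 4 4
    }
    std::cout << '\n';
}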

This upfront calculation assumes that Jamulus has full insight into core performance behavior

Hence needing at least a command line option telling it the core count - leave scheduling the tasks to the underlying scheduler (tasks pass through Qt to the OS), based on the task declaration: it should know best where to run it.

Ideally, if we're using the Qt scheduler, we wouldn't be telling it (a) how many cores or threads to use or (b) which core/thread to use for a particular task - except by some task type information... I don't think it's that clever, though.

pljones avatar Mar 02 '22 17:03 pljones

(And if -j isn't yet used, it's allocated to whatever the long option will be :) .)

I'll disagree with me on that one.

First, I think we should switch to having --multithreading enabled by default. I'd actually get rid of -M / --multithreading (i.e. initially just ignore them totally, remove them from --help).

Next, add an option to limit the use of cores. (We need to avoid the term "thread" as, even without --multithreading, the server kicks off multiple "threads" (main, CSocketThread, a UI thread if not headless; the recorder is also a separate thread).) This could be used with a detailed syntax, if needed - e.g. if I had eight cores and wanted to do weighting across them, "--cpuset 4;2;0;1;0;1;0;1", both restricting certain cores from use and preferring others; where a single weight was given, it could be interpreted as "use any core but no more than N in total". It could then be noted that this would only apply to the mix processing tasks.
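As a rough sketch of how such a hypothetical --cpuset weight list could be parsed (this option does not exist in Jamulus; the code is purely illustrative):

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    const std::string strCpuSet = "4;2;0;1;0;1;0;1"; // example from the comment above

    std::vector<int>  vecWeights;
    std::stringstream ss ( strCpuSet );
    std::string       strToken;

    while ( std::getline ( ss, strToken, ';' ) )
    {
        vecWeights.push_back ( std::stoi ( strToken ) );
    }

    if ( vecWeights.size() == 1 )
    {
        std::cout << "use any core, but no more than " << vecWeights[0] << " in total\n";
    }
    else
    {
        // weight 0 = do not use this core; higher weight = prefer this core
        for ( size_t iCore = 0; iCore < vecWeights.size(); iCore++ )
        {
            std::cout << "core " << iCore << ": weight " << vecWeights[iCore] << '\n';
        }
    }
}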

pljones avatar Mar 02 '22 19:03 pljones

Regarding 1 thread per core: I have noticed that Intel appears to specify 2 threads per core (from single-core days, Core 2 at 2 threads, up to modern 4-core CPUs at 8 threads).

ghost avatar Mar 02 '22 19:03 ghost

Next, add an option to limit the use of cores.

We don't need to redo the work of taskset. That works just fine and as I would expect. I limit my Jamulus server to the 4 high-power cores, and the 2 low-power cores do the rest.

dcorson-ticino-com avatar Mar 06 '22 15:03 dcorson-ticino-com

We don't need to redo the work of taskset.

It's probably adequate for 80%+ of servers. Possibly not for 80% of server admins. But then any solution would need deep technical knowledge.

However, the default for Jamulus should not require a user to use an OS-specific command. If we know that it's not easy to predict the server behaviour but we can provide guidance from within Jamulus, then that should be the default.

pljones avatar Mar 06 '22 18:03 pljones

... in fact...

Jamulus     1355       1  0 17:58 ?        00:00:00 /opt/Jamulus/bin/Jamulus-private -n -s ....
$ taskset -pa 1355
pid 1355's current affinity mask: 3f
pid 1401's current affinity mask: 3f
pid 1402's current affinity mask: 3f
pid 1403's current affinity mask: 3f
pid 1404's current affinity mask: 3f
pid 1405's current affinity mask: 3f
pid 1406's current affinity mask: 3f
pid 1407's current affinity mask: 3f

I'd really like more control over the individual threads: "Run the recorder and main app thread only on this core; run the audio and network threads on any of these cores; (if enabled, run the UI on this core)".

Doing that with taskset <mask> <command> at launch isn't going to work. You'd need to wait until the threads had PIDs and then set each one - which isn't really something that can be automated from systemd.

However, it all really depends on having greater insight into what is eating CPU and blocking on I/O to make the right choices about (a) what should be in a thread on its own and (b) how best to schedule threads.
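For comparison, this is the kind of in-process control that would avoid chasing thread PIDs from outside - a minimal, Linux-only sketch using pthread_setaffinity_np() (hypothetical, not Jamulus code):

#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin the calling thread to a single core (Linux-specific, non-portable).
static bool PinCurrentThreadToCore ( int iCore )
{
    cpu_set_t cpuSet;
    CPU_ZERO ( &cpuSet );
    CPU_SET ( iCore, &cpuSet );
    return pthread_setaffinity_np ( pthread_self(), sizeof ( cpuSet ), &cpuSet ) == 0;
}

int main()
{
    std::thread audioThread ( []()
    {
        // e.g. keep this worker on core 2; other threads (main, recorder, UI)
        // could pin themselves elsewhere the same way
        if ( !PinCurrentThreadToCore ( 2 ) )
        {
            std::fprintf ( stderr, "failed to set thread affinity\n" );
        }
        // ... do the thread's work ...
    } );

    audioThread.join();
}

Compiling needs -pthread; the core number is just an example.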

pljones avatar Mar 06 '22 19:03 pljones

Ok. Re-Opened

ann0see avatar Apr 23 '22 20:04 ann0see