Tsung fails to use all system resources
Hi,
I have the following setup: 6 nodes with 8 CPUs / 8 GB RAM each. On the main node, where the scripts are run, I define a weight of 25% to equalize the load with the other nodes. Tsung generates load perfectly until about 50% of the CPU is used; then it fails to generate more load. All machines are tuned with the following settings:

```
"profiles::sysctl::_config": {
  "net.ipv4.tcp_tw_reuse":        { "value": "1",          "permanent": true },
  "net.ipv4.tcp_tw_recycle":      { "value": "1",          "permanent": true },
  "net.ipv4.ip_local_port_range": { "value": "1024 65535", "permanent": true },
  "fs.file-max":                  { "value": "500000",     "permanent": true },
  "net.ipv4.tcp_fin_timeout":     { "value": "10",         "permanent": true }
}
```
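For reference, the node weighting is configured per `<client>` in the `<clients>` section of the Tsung scenario; AFAIK the weights are relative values rather than percentages. A minimal sketch with hypothetical host names and maxusers values (not the actual test script):

```xml
<!-- Hypothetical clients section: weights are relative, so here the main
     node gets 25% of the weight of each generator-only node. -->
<clients>
  <client host="loadgen-master" weight="1" maxusers="30000"/>
  <client host="loadgen-01"     weight="4" maxusers="30000"/>
  <client host="loadgen-02"     weight="4" maxusers="30000"/>
  <client host="loadgen-03"     weight="4" maxusers="30000"/>
  <client host="loadgen-04"     weight="4" maxusers="30000"/>
  <client host="loadgen-05"     weight="4" maxusers="30000"/>
</clients>
```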
Network capacity is much greater than required. Other monitoring tools also confirm that CPU usage is only at 50%. There are about 280 unknown errors, which I assume is an acceptable number for such a setup? The log level is set to error only.
Attaching screenshots of generated load and resources used.
Any ideas what might be wrong? Tsung version is 1.6, Erlang 16
Could it be due to CPU load reaching 8 (i.e. 1 for each core)?
Same test and configuration, except run on a single machine:
It's 3 times more effective! Am I doing something wrong, or is Tsung very inefficient in a distributed configuration?
I did some benchmarking, and the result is that load generation capacity drops significantly with each machine added to the distributed configuration. Network latency between the Tsung nodes also has a very big impact. In the end, a single machine can generate about 60% of the load that 6 machines with the same CPU/RAM produce in the distributed setup. This makes Tsung's distributed load generation a bit useless (at least for some scenarios).
I suspect this might be due to the batch size of users sent from the controller to the slave nodes (I observed cases where the batch size was only 1k users, which causes very frequent communication between the nodes and thus decreased performance?). I understand that an increased batch size would probably make the load distribution less even/accurate.
Maybe the development team could consider a parameter to control the controller's instruction batch size? Or some other improvement to reduce the amount of communication between the controller and the slave nodes and increase distributed load performance?
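For context, the user arrivals that the controller coordinates are defined in the scenario's `<load>` section; the controller splits the configured arrival rate across the client nodes in batches. A minimal sketch with assumed phase durations and arrival rates (not the values from the actual test):

```xml
<!-- Hypothetical load definition: the controller distributes these arrivals
     across the client nodes according to their weights. -->
<load>
  <arrivalphase phase="1" duration="2" unit="minute">
    <users arrivalrate="500" unit="second"/>
  </arrivalphase>
  <arrivalphase phase="2" duration="10" unit="minute">
    <users arrivalrate="2000" unit="second"/>
  </arrivalphase>
</load>
```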
Hey @manukoshe.
Can you share your tsung configuration?
AFAIK there are some "well-known" issues, where there is more coordination required between controllers and generators than probably necessary. #237 is one step to solve one of those issues (decentralised file_server access).
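To illustrate one of those issues: with the current centralised file_server, every read from a file registered via the file_server option is served by a single process, so all generator nodes keep querying it throughout the run. A minimal, hypothetical example (file path and variable names are placeholders, not taken from any real test):

```xml
<!-- Hypothetical file_server-backed user list; reads via setdynvars
     currently go through the central file_server process. -->
<options>
  <option name="file_server" id="userdb" value="/tmp/users.csv"/>
</options>

<sessions>
  <session name="login" probability="100" type="ts_http">
    <setdynvars sourcetype="file" fileid="userdb" delimiter=";" order="iter">
      <var name="username"/>
      <var name="password"/>
    </setdynvars>
    <request subst="true">
      <http url="/login?user=%%_username%%&amp;pass=%%_password%%" method="GET"/>
    </request>
  </session>
</sessions>
```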
Also, AFAIK, user arrivals are in part coordinated by the controller.
Depending on how you model your test case, we've had quite good experience so far in running really large tests (~1M connected users, >300k req/s, using 100+ generator nodes).
Thanks for the comment, I will send you the setup privately.
I observed about a 5-10% drop in maximum generated requests comparing 1 standalone machine vs. a 1 controller/1 slave setup in the same data center.
I observed about a 20% drop in maximum generated requests comparing 1 standalone machine vs. a 1 controller/1 slave setup in different geographical locations.
It would be nice if anyone else could benchmark and share results for their scenarios.
Additional findings: no matter how many machines I was using, I hit a limit of ~120k req/s per cluster configuration. I got the same maximum of 120k req/s with either 1 master/3 slaves or 1 master/11 slaves. There seems to be some kind of limitation in the Tsung controller, in the metrics reporting, or somewhere else.
I was able to generate much more load by separately starting 3 Tsung clusters, each consisting of 1 master/3 slaves (12 VMs in total). However, this is not very convenient because:
- test scripts have to be replicated to 3 machines
- test runs have to be started manually on 3 machines instead of 1
- test reports are split (not a big issue in my case, since I use Grafana for monitoring)
Hopefully these issues will be addressed in later Tsung versions.
Hey @manukoshe. Again, it would be very helpful if you could share your test configuration. There are issues regarding scalability, but many of them can currently be mitigated. It would be nice to know what the potential problem is that you are currently running into.
Hi, I can't share the test script in public. Please respond to my messages on LinkedIn or Twitter :)
I took a look at what @manukoshe sent me. I'm pretty sure that the problem is the currently centralized nature of the file_server. An optimised and distributed local_file_server is WIP and there is a PR for this: #237.
If you have the possibility, @manukoshe, you're very welcome to check out https://github.com/processone/tsung/pull/237, compile Tsung and give it a try. The change to your configuration should be rather simple.
Hi @tisba,
I have applied the patches to the Tsung directories on my local machines successfully. When I start Tsung, I get the following error:
```
Starting Tsung
Log directory is: /root/.tsung/log/20200925-2042
Config Error, aborting !
{{case_clause,"local_file"},
 [{ts_config,parse,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,1051}]},
  {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
  {ts_config,handle_read,3,
      [{file,"src/tsung_controller/ts_config.erl"},{line,85}]},
  {ts_config,read,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,70}]},
  {ts_config_server,handle_call,3,
      [{file,"src/tsung_controller/ts_config_server.erl"},{line,209}]},
  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},
  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
```
Can you please tell me where I should apply the patches?