
Tsung fails to use all system resources

manukoshe opened this issue 6 years ago · 10 comments

Hi,

I have the following setup: 6 nodes with 8 CPU/8 GB RAM. On the main node, where the scripts are run, I define a weight of 25% to equalize the load with the other nodes. Tsung generates load perfectly until about 50% of CPU is used, then it fails to generate more load. All machines are tuned with the following settings:

```
"profiles::sysctl::_config": {
  "net.ipv4.tcp_tw_reuse":        { "value": "1",          "permanent": true },
  "net.ipv4.tcp_tw_recycle":      { "value": "1",          "permanent": true },
  "net.ipv4.ip_local_port_range": { "value": "1024 65535", "permanent": true },
  "fs.file-max":                  { "value": "500000",     "permanent": true },
  "net.ipv4.tcp_fin_timeout":     { "value": "10",         "permanent": true }
}
```

Network capacity is much greater than required, and other monitoring tools also confirm that CPU is only at 50%. There are about 280 unknown errors, which I assume is an acceptable number for such a setup? Loglevel is set to error only.
Attaching screenshots of generated load and resources used (Tsung graph reports, 2018-03-08).

Any ideas what might be wrong? Tsung version is 1.6, Erlang 16

manukoshe · Mar 08 '18
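For reference, both the client weights and the log level mentioned above are set in the Tsung XML scenario. A minimal sketch of the relevant parts; the host names, user limits, and weight split here are placeholders, not the reporter's actual values:

```xml
<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd">
<tsung loglevel="error" version="1.0">
  <clients>
    <!-- weights are relative numbers, not percentages: here the controller
         node receives 1/(1+4) of the users assigned to these two clients -->
    <client host="controller-node" weight="1" maxusers="30000"/>
    <client host="worker-1" weight="4" maxusers="30000"/>
    <!-- remaining worker nodes omitted -->
  </clients>
  <!-- servers, load phases and sessions omitted -->
</tsung>
```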

Could it be due to CPU load reaching 8 (i.e. 1 for each core)?

manukoshe · Mar 08 '18

Same test and configuration, except on a single machine (screenshots attached).

It's 3 times more effective! Am I doing something wrong, or is Tsung very inefficient in a distributed configuration?

manukoshe · Mar 09 '18

Did some benchmarking, and the result is that load-generation capacity drops significantly with each machine added to the distributed configuration. Network latency between the Tsung nodes also has a very big impact. In the end, a single machine can generate about 60% of the load that 6 machines of the same CPU/RAM generate in a distributed setup. This makes Tsung's distributed load generation a bit useless (at least for some scenarios).

I suspect this might be due to the batch size of users sent from the controller to the slave nodes (I observed cases where the batch size is only 1k users, which causes very frequent communication between nodes and thus decreased performance?). I understand that an increased batch size would probably make the load distribution less even/accurate.

Maybe the development team could consider a parameter to control the controller's instruction batch size? Or some other improvement to reduce the amount of communication between the controller and slave nodes and increase distributed load performance?

manukoshe · Mar 28 '18

Hey @manukoshe.

Can you share your tsung configuration?

AFAIK there are some "well-known" issues where more coordination happens between controllers and generators than is probably necessary. #237 is one step toward solving one of those issues (decentralised file_server access).

Also, AFAIK, user arrivals are in part coordinated by the controller.

Depending on how you model your test case, we've had quite good experience so far in running really large tests (~1M connected users, >300k req/s, using 100+ generator nodes).

tisba · Mar 28 '18
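For context on the file_server point above: in mainline Tsung the file is declared once in the scenario options and served by a single process on the controller, so in a distributed run every read is a cross-node call. A minimal fragment of that centralized setup, with hypothetical file path, id, and variable names:

```xml
<options>
  <!-- a single file_server process on the controller serves all generators -->
  <option name="file_server" id="userdb" value="/tmp/userlist.csv"/>
</options>

<!-- inside a session: each setdynvars fetches the next line from the controller -->
<setdynvars sourcetype="file" fileid="userdb" delimiter=";" order="iter">
  <var name="username"/>
  <var name="password"/>
</setdynvars>
```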

Thanks for the comment, I will send you the setup privately.

I observed about a 5–10% drop in max generated requests for 1 machine vs. a 1 controller/1 slave setup in the same data center.

I observed about a 20% drop in max generated requests for 1 machine vs. a 1 controller/1 slave setup across different geographical locations.

It would be nice if anyone else could benchmark and share results for their scenarios.

manukoshe · Mar 28 '18

Additional findings: no matter how many machines I used, I hit a ~120k req/s limit per cluster configuration. I got the same 120k req/s max with either 1 master/3 slaves or 1 master/11 slaves. There seems to be some kind of limitation in the Tsung controller, in metrics reporting, or somewhere else.

I was able to generate much more load by separately starting 3 Tsung clusters of 1 master/3 slaves each (12 VMs in total). However, this is not very convenient because:

  • replication of test scripts to 3 machines
  • manual login and test run on 3 machines instead of 1
  • test reports are split (though this is not a big issue in my case, since I use Grafana for monitoring)

Hopefully these issues will be addressed in later Tsung versions.

manukoshe · May 03 '18
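A rough sketch of how the three-cluster workaround above could be scripted rather than run by hand; the host names and paths are made up for illustration:

```sh
#!/bin/sh
# Copy the same scenario to each independent controller and start all runs.
for master in tsung-master-1 tsung-master-2 tsung-master-3; do
  scp scenario.xml "$master:/root/scenario.xml"
  ssh "$master" "tsung -f /root/scenario.xml start" &
done
wait   # return once all three controllers have been started
```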

Hey @manukoshe. Again, it would be very helpful if you could share your test configuration. There are known issues regarding scalability, but many of them can currently be mitigated. It would be nice to know which potential problem you are actually running into.

tisba · May 03 '18

Hi, I can't share the test script in public. Please respond to my messages on LinkedIn or Twitter :)

manukoshe · May 03 '18

I took a look at what @manukoshe sent me. I'm pretty sure the problem is the currently centralized nature of the file_server. An optimised, distributed local_file_server is WIP, and there is a PR for it: #237.

If you have the possibility, @manukoshe, you're very welcome to check out https://github.com/processone/tsung/pull/237, compile tsung, and give it a try. The change to your configuration should be rather simple.

tisba · May 30 '18
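A sketch of one way to try the PR, assuming a standard source build of Tsung (the local branch name is arbitrary):

```sh
git clone https://github.com/processone/tsung.git
cd tsung
git fetch origin pull/237/head:local-file-server   # GitHub's read-only ref for PR #237
git checkout local-file-server
./configure && make && sudo make install
```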

Hi @tisba,

I have applied the patches in the tsung directories on my local machines successfully. When I start tsung, I get the following error:

```
Starting Tsung
Log directory is: /root/.tsung/log/20200925-2042
Config Error, aborting !
{{case_clause,"local_file"},
 [{ts_config,parse,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,1051}]},
  {lists,foldl,3,[{file,"lists.erl"},{line,1263}]},
  {ts_config,handle_read,3,
      [{file,"src/tsung_controller/ts_config.erl"},{line,85}]},
  {ts_config,read,2,
      [{file,"src/tsung_controller/ts_config.erl"},{line,70}]},
  {ts_config_server,handle_call,3,
      [{file,"src/tsung_controller/ts_config_server.erl"},{line,209}]},
  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,661}]},
  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,690}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
```

Can you please tell me where I should apply the patches?

MoorthiRaj · Sep 25 '20
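The `{case_clause,"local_file"}` above is raised by ts_config:parse on the controller, which suggests the controller is still running an unpatched ts_config.erl rather than the PR #237 code. A quick sanity check, assuming a source checkout as sketched earlier:

```sh
cd tsung
git log --oneline -1        # confirm the PR branch is actually checked out
grep -n "local_file" src/tsung_controller/ts_config.erl   # the clause should exist in the patched source
make && sudo make install   # reinstall so the controller loads the rebuilt beams
```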