metrics: report pod communication latency percentiles in pdf
- Add R routine to load and plot percentiles from JSON.
- Support any number of percentiles 1..n.
- Store percentile configuration in JSON.
- Fix handling percentile value 100.
Signed-off-by: Antti Kervinen [email protected]
Request for comments:
- The default latency percentiles are now 5-25-50-75-95.
- This means we visualize
- median (50)
- the range including 50 % of the measurements (25, 75)
- the boundaries of the fastest and slowest 5 % (5, 95)
- If we'd like to drop only 1 % as outliers at both ends (instead of 5 %), we could use 1-25-50-75-99.
- If we'd also like to include the fastest and the slowest values, we could use 0-1-25-50-75-99-100.
- With the default value I'm throwing out 10 % of the results as outliers. That means less noise and more stable results, but with the risk of discarding something that could have been spotted with (1, 99) in the percentiles. (A sketch of how a percentile configuration maps onto measured values follows this list.)
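To make the mapping concrete, here is a minimal sketch (in Python, purely for illustration - the actual computation and plotting in the PR are done in R) of how a percentile configuration turns into values computed from raw latency samples; the latency numbers below are made up:

```python
# Illustration only: how a percentile configuration (here the proposed default
# 5-25-50-75-95) maps onto values computed from raw latency samples.
# The latency numbers are made up for the example.
import numpy as np

percentiles = [5, 25, 50, 75, 95]                      # configurable, 1..n entries; 0 and 100 are allowed
latencies_us = np.array([4375, 4401, 4428, 4465, 4502,
                         4950, 5470, 6023, 11124])     # example round-trip times in microseconds

for p, v in zip(percentiles, np.percentile(latencies_us, percentiles)):
    print("p%g: %.0f us" % (p, v))
```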
I've done one R metric latency graph before, over in the kata fio metrics. I was only plotting two percentiles, but you can see the example graphs over on the issue at https://github.com/kata-containers/tests/issues/885
Note I used colours/opacity to try and group and separate the tests and the values. Yeah, I do think we need a legend on the graph to show which colour/lines mean what :-)
OOI @askervin - I see you are taking the time measurements for the nc in, I think, nanoseconds, and then scaling them to ms - I know we do that for the launch times etc. (which take place in the range of 'seconds' iirc) - is this ms granularity working nicely on your test cluster? On my local cluster I pretty much see results of only 4 and 5 ms reported. I'm wondering if we should measure microseconds....
I tweaked the script to dump out the sorted raw data, and I see data like this (the tail end of the data from a single node 20 pod test run)...
4375 4377 4382 4401 4405 4413 4414 4418 4421 4421 4423 4424 4424 4427 4428 4428 4429 4429 4430
4430 4430 4432 4433 4434 4435 4436 4436 4438 4438 4443 4443 4444 4446 4447 4447 4448 4448 4449
4449 4449 4449 4450 4452 4452 4454 4454 4455 4456 4456 4458 4458 4458 4458 4458 4460 4461 4461
4461 4461 4462 4462 4462 4463 4463 4463 4464 4464 4464 4464 4465 4465 4465 4465 4465 4466 4466
4467 4467 4468 4469 4469 4469 4470 4470 4471 4471 4471 4472 4473 4473 4473 4473 4473 4474 4474
4474 4475 4475 4477 4477 4477 4478 4478 4478 4478 4479 4479 4480 4480 4480 4480 4481 4481 4482
4482 4483 4483 4483 4484 4484 4484 4484 4485 4485 4485 4485 4485 4486 4486 4486 4486 4486 4486
4487 4487 4487 4487 4487 4487 4487 4488 4488 4489 4489 4490 4490 4490 4490 4490 4490 4491 4491
4491 4491 4491 4492 4492 4492 4492 4493 4493 4493 4493 4493 4493 4494 4494 4494 4494 4494 4494
4494 4494 4495 4495 4495 4495 4496 4496 4497 4497 4497 4497 4497 4497 4497 4498 4498 4499 4499
4499 4499 4499 4500 4500 4500 4500 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501
4502 4502 4502 4502 4502 4503 4503 4503 4503 4503 4503 4503 4504 4504 4504 4504 4504 4504 4504
4505 4505 4505 4505 4505 4505 4506 4506 4507 4507 4507 4507 4507 4508 4508 4508 4509 4509 4509
4509 4509 4509 4509 4509 4510 4510 4510
.... a bunch skipped for brevity.....
4901 4902 4902 4903 4903 4905 4906 4909 4910 4910 4910 4910 4912 4912 4913 4913 4914 4914 4915
4915 4916 4916 4916 4917 4917 4917 4918 4918 4918 4918 4919 4919 4919 4920 4921 4922 4922 4923
4924 4924 4924 4925 4925 4925 4926 4928 4928 4930 4930 4931 4931 4931 4932 4933 4933 4933 4935
4935 4937 4937 4938 4938 4940 4941 4941 4942 4942 4942 4943 4943 4943 4944 4944 4944 4946 4946
4947 4948 4948 4949 4949 4950 4950 4951 4951 4951 4951 4951 4953 4953 4954 4954 4955 4956 4956
4956 4956 4957 4959 4961 4962 4962 4963 4963 4967 4971 4972 4973 4974 4974 4975 4976 4977 4978
4979 4980 4981 4982 4983 4984 4985 4985 4986 4986 4990 4991 4993 4994 4995 4999 5001 5002 5004
5004 5007 5008 5011 5011 5013 5017 5017 5019 5020 5020 5022 5024 5024 5024 5027 5027 5043 5045
5046 5051 5053 5056 5057 5058 5058 5061 5066 5074 5077 5088 5092 5098 5099 5101 5106 5115 5117
5121 5124 5125 5125 5125 5137 5142 5150 5160 5166 5171 5174 5175 5176 5177 5182 5182 5184 5185
5196 5198 5206 5214 5220 5232 5240 5254 5287 5305 5317 5320 5325 5431 5436 5439 5447 5457 5470
5556 5561 5570 5588 5634 5671 5677 5688 5732 5773 5888 5901 5911 5922 6000 6023 6034 6088 6094
6625 6769 7660 8599 8816 11069 11124
What is interesting for me right now is we don't really get to see that spread across the 4-5ms range, and we don't get to see those tail end outliers that ramp from 6ms up to 11ms. This is the sort of reason we should probably capture the raw data in the JSON and find some way to show that data in the graphs - I think it can give us more insight than the current rather step-function result we are seeing.
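As a sketch of what I mean (illustrative only, in Python - the actual report plots are generated in R, and the input file name here is hypothetical), an empirical CDF of the raw samples would keep both the 4-5ms spread and the 6-11ms tail visible:

```python
# Illustration only: plot an empirical CDF of raw round-trip latencies so the
# spread and the tail outliers stay visible instead of being reduced to a few
# percentiles. "raw_latencies.txt" is a hypothetical dump of the raw data.
import numpy as np
import matplotlib.pyplot as plt

with open("raw_latencies.txt") as f:
    latencies_us = np.array([int(v) for v in f.read().split()])

x = np.sort(latencies_us)
y = np.arange(1, len(x) + 1) / len(x)     # fraction of requests at or below x

plt.step(x, y, where="post")
plt.xlabel("round-trip latency (us)")
plt.ylabel("fraction of requests")
plt.savefig("latency_cdf.png")
```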
for ref, as I have one to hand, here is a report generated diagram of a 100 pod run at ms resolution, and then a 20 pod run at us resolution - both from the same node/cluster. shows promise I think:

@grahamwhaley , I agree that millisecond granularity is rather coarse for reporting network latency. I get 12 ms on my VM cluster, but 3-4 ms on real hardware. However, I think we should make sure that we'd really benefit from finer granularity instead of just adding noise.
For the current measurement loop, I compared the granularity of date, $EPOCHREALTIME and Python time.time() by timing back-to-back calls as fast as possible.
#!/bin/bash

measure_accuracy() {
    n_limit=$1 # number of accuracy measurements
    n=0
    while [[ $n -lt $n_limit ]]; do
        n=$(( $n + 1 ))
        local start_time=$(date +%s%N)
        local end_time=$(date +%s%N)
        local date_timediff_ns=$(( end_time - start_time ))
        local er_start_time=$EPOCHREALTIME
        local er_end_time=$EPOCHREALTIME
        local er_timediff_us=$(bc <<< "(($er_end_time-$er_start_time) * 1000000)/1")
        local py_timediff_us=$(python3 -c 'import time; start_t=time.time(); end_t=time.time(); print("%d" % ((end_t-start_t)*1000000,))')
        echo "date_ms $(( $date_timediff_ns / 1000000 ))"
        echo "date_us $(( $date_timediff_ns / 1000 ))"
        echo "date_ns $date_timediff_ns"
        echo "EPOCHREALTIME_ms $(( $er_timediff_us / 1000 ))"
        echo "EPOCHREALTIME_us $er_timediff_us"
        echo "python3_ms $(( $py_timediff_us / 1000 ))"
        echo "python3_us $py_timediff_us"
        echo "---"
    done
}

measure_accuracy $1
On my high-performance desktop measuring the time of "do nothing":
- `date_us` varies between 900 - 2000 us
- `EPOCHREALTIME_us` varies between 12 - 37 us
- `python3_us` varies between 4-5 us
Another thing is the latency server implementation. Trying out different server+client combos on localhost:
- `nc ... -e /bin/cat` server + `nc` client:
  - `date_us` varies between 2700 - 8700 us
  - `EPOCHREALTIME_us` varies between 1400 - 6400 us
- `nc ... -e /bin/cat` server + python client doing `time.time()` and `socket.sendall()`...`socket.recv()`:
  - python timediff varies between 540 - 1200 us
- python socket server + python socket client:
  - python timediff varies between 418 - 732 us
I would conclude that if we want to get to the microsecond level, we should not do the time measurement with date and an nc client anymore. We should use more efficient time measurement and socket send/recv. The current solution gives roughly a +-3000 us error, a Python client roughly +-300 us, and a Python server+client roughly +-150 us.
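For example, a minimal sketch of the kind of lower-overhead client I mean (timing a socket send/recv round trip with a monotonic clock instead of forking date and nc per request; the host, port and message size are placeholders, not the test's values):

```python
# Sketch of a lower-overhead latency measurement: time one socket send/recv
# round trip with a monotonic clock. SERVER_HOST, SERVER_PORT and the message
# length are placeholders, not the values used by the test.
import socket
import time

SERVER_HOST = "10.0.0.1"   # placeholder pod IP
SERVER_PORT = 5001         # placeholder port
MESSAGE = b"x" * 64        # placeholder request

with socket.create_connection((SERVER_HOST, SERVER_PORT)) as s:
    start_ns = time.monotonic_ns()
    s.sendall(MESSAGE)
    reply = s.recv(len(MESSAGE))   # assumes the echoed reply arrives in one recv
    end_ns = time.monotonic_ns()

print("round trip: %d us" % ((end_ns - start_ns) // 1000))
```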
Good data and valid point on the noise/overheads of the tools contributing significantly to the end result ;-) That fits/correlates with what I was inferring - that we are trimming off too many significant digits right now I think.
In which case, I agree, we probably want to be using better tools to measure the latency. First, we should probably define exactly what we are trying to measure (TCP or just network response time from a pod for instance) - and then we can consider other specific network tools like ping (for an ICMP bounce, if the pod network supports that), or iperf3 or similar, where we can get back good raw data. iirc, iperf3 can return results in JSON format https://github.com/kata-containers/tests/blob/master/metrics/network/network-metrics-iperf3.sh#L63 ;-).
one of the problems with network tools is - there are just soooo many to choose from...
@grahamwhaley , good point. I changed the time in the result JSON to microsecond accuracy. We can handle the measurement method as another topic. In any case, the current round-trip time accuracy is enough for spotting a reproducible anomaly after scaling to 400+ pods.
I also changed the number of percentiles from 5 to 9, so that we can see where the previously hidden 5 % of latency times land. (Added min, max, 1 % and 99 % percentiles.)
Saving the original measurements with all necessary details (like the pod IPs associated with each latency time, and the timestamp) helps with debugging anomalies. It also addresses the concern about having only "cooked" data available after a test run.
Whilst I remember, something I spotted that is not actually modified in this PR....
in the JSON stats capture, I think we are storing a bunch of global data, that does not change, inside each iteration's results. Probably better to save that out once during the init() phase at a global test level. In particular, I was thinking about:
"Pod_command": "${pod_command//\"/\\\"}",
"Request_length": "${nc_req_msg_len}",
"Requests_per_pod": "${nc_reqs_per_pod}",
"Sender": "serial",
"Percentiles": [$(IFS=, ; echo "${latency_percentiles[*]}")],
Rebased the whole series. Now this is based on the rapid test. Quite a few changes, see the last commit message. I'll run a few test rounds...
hi @askervin - thanks for the update. I tried to give it a quick spin - initial feedback...
- Probably consider a rebase/merge of the patches
  - as they are getting quite long now, and I think that last one is a replace/rewrite of a bunch of the previous ones?
  - the branch looks like it needs a rebase before we can merge anyway (conflicted)
- I think you need to set `SMF_USE_COLLECTD=true` somewhere. Right now I think the collectd daemonset gets run due to a bug in my framework code, where `if [ -n "$SMF_USE_COLLECTD" ]` does not do the 'not set' check like I thought! If you'd like to fix that whilst you are there.... (change it to an `==` string test for `true`?)
- I failed to make the busybox deploy without setting `use_api=no` - my suspicion is that the `ncat` command line substitution maybe does not work in the JSON. I dunno.
- nitpick: I think the hotlink in the Rmd points to the old nc test still (not the _rapid version)
- I ran with the default 20 pods. The lat table shows results for n_pods 1,5,10 - I was thinking it might show 1,10,20 - so I peeked at the code - it's not obvious to me how it is working out those rows - I'll leave that for you to contemplate, and fix/document/clarify as necessary.
(oh, I got a run by hacking out my collectd check bug and setting no_api btw)
@grahamwhaley , Thanks for your feedback!
- Patches squashed into one.
- `SMF_USE_COLLECTD=true` added and the string test changed to `true`.
- Rmd script link fixed.
Issue:
- When I try to `use_api`, I'm getting an HTTP/HTTPS complaint... like this:
$ strace -e trace=network -s 1000 curl -s http://127.0.0.1:6443/apis/apps/v1/namespaces/default/deployments -XPOST -H 'Content-Type: application/json' -d@/root/metrics/scaling/rapidnc.json
...
sendto(3<TCP:[847285]>, "POST /apis/apps/v1/namespaces/default/deployments HTTP/1.1\r\nHost: 127.0.0.1:6443\r\nUser-Agent: curl/7.66.0...")
...
recvfrom(3<TCP:[847285]>, "HTTP/1.0 400 Bad Request\r\n\r\nClient sent an HTTP request to an HTTPS server.\n", 102400, 0, NULL, NULL) = 76
But curl gets stuck if I just s/http:/https:/. Need to figure out how to make the apiserver accept HTTP requests. Do you have any tips for this?
Todo:
- check `n_pods` in the table with a single testname; it should show the first, mid and last.
thanks @askervin - I'll try to get to review..... @dklyle is the curl API expert ;-)
I may know what the use_api issue I ran into is, and it's probably not this patchset - it happens with the existing rapid test as well... I ran into this yesterday.
When using the API, the tests start up the kubeproxy, but that only works if you happen to be running the tests on the master node. This works for me when I'm running a single node local clr-k8s-examples stack, but when I'm running a 3-node kind cluster, the nodes run under docker, and I run the tests on the bare metal host.... and then I get the busybox launch failure...
I'm slightly confused though, as I was sure I'd been running API based tests on the kind cluster. So, I'm going to have to dig a little more I think, but there is an implicit requirement I think in our tests that if you want to use the (default) API method, you need to run on the master node....
/cc @dklyle (and thanks for confirming my suspicions on this yesterday)
(oops, accidental mis-click closed.... re-opened now....)
OK, take that kubeproxy item with a little pinch of salt - it turned out on my test machine I still had a kube proxy running (probably left over from the clr-k8s-examples stackup) when I tried to run and bind to the kind cluster. I still got some 'nc TIMEOUT' errors, but wrt the use of use_api, this PR is probably fine, but...
- I know we have some cleanup/exit issues, as often when a test fails or I quit it I end up with the daemonsets still present (collectd), and probably the proxy as well
- we could/should detect if a proxy is already running and probably error out, maybe.