openshift-etcd-suite

Etcd Compaction script is sorting only the worst values

Open nstamatelopoulos opened this issue 3 years ago • 3 comments

Hello

I was checking out this tool and it is very helpful. Well done!

I have a small improvement to suggest, as I am not sure the results from the compaction script are representative. According to the logs, etcd compaction happens every 5 minutes, and its duration can be high or low depending on performance at that moment. High values can occur from time to time, but that doesn't mean they reflect average performance. The problem I see is that the filter sorts these values from low to high and then prints only the last 10. In a sample of 300+ lines the user sees just the worst values, which might be misleading. Maybe we can improve this by also computing the average of all the values.

I ran some tests: if we pipe the following after the "sort" command instead of "tail -10", we get the average value, like below.

${CLIENT} logs pod/$1 -n ${NS} -c etcd | grep "compaction" | grep -E "[0-9]+(.[0-9]+)*" | cut -d " " -f13 | sort | awk '{s+=$1} END {print "ave:", s/NR}'

Please note that the "ms" and the ")" that you remove with cut -d ')' -f 1 do not need to be stripped, as awk uses only the leading numeric part of each field and ignores everything after it when calculating the average.
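A minimal sketch of that calculation, assuming every duration is in milliseconds and arrives one per line in the form left by the cut step (for example 12.345ms followed by a closing parenthesis). The function name compaction_stats and the sample values are made up for illustration; the maximum is printed too, so the worst case stays visible next to the average.

```shell
# Hypothetical helper: reads one duration per line on stdin and prints the
# count, average and maximum. Assumes every value is in milliseconds.
compaction_stats() {
  awk '
    {
      v = $1 + 0                        # numeric coercion keeps the leading number, drops "ms)"
      sum += v
      if (NR == 1 || v > max) max = v   # track the worst value seen so far
    }
    END {
      # Guard against empty input to avoid dividing by zero.
      if (NR > 0) printf "count=%d avg=%.3f max=%.3f\n", NR, sum / NR, max
    }
  '
}

# Example with made-up values:
printf '%s\n' '10.5ms)' '20.5ms)' '150.0ms)' | compaction_stats
# prints: count=3 avg=60.333 max=150.000
```

No sort is needed before this step, since neither the average nor the maximum depends on input order.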

I'm talking about the etcd_compaction function on line 36 of the etcd.sh script.

If you have any other ideas, please share.

I find this tool very handy, also because it can be used with must-gather reports. Great job!

nstamatelopoulos avatar Mar 02 '22 17:03 nstamatelopoulos

Hi and thanks. There are a few issues with this approach.

  1. The average value won't tell you anything about the worst values (which is what we're looking for), but I will probably add it there just for info.
  2. Don't forget that compaction values can also be in seconds or minutes, so this approach needs a more sophisticated calculator (as you have different units).
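One way to handle the mixed units in point 2 is to convert every value to milliseconds before sorting or averaging. The sketch below assumes the durations look like Go-style strings such as 85.2ms, 1.579384709s or 1m2.5s, and that only the units h, m, s and ms occur (smaller units are not handled). The function name to_ms is hypothetical and not part of etcd.sh.

```shell
# Hypothetical helper: converts one duration string per line on stdin to
# milliseconds. Assumes Go-style durations limited to the units h, m, s, ms.
to_ms() {
  awk '
    function ms(d,    total, num) {
      total = 0
      # Consume "<number><unit>" pairs from the left, e.g. "1m" then "2.5s".
      while (match(d, /^[0-9]+(\.[0-9]+)?/)) {
        num = substr(d, 1, RLENGTH) + 0
        d = substr(d, RLENGTH + 1)
        if (d ~ /^ms/)      { total += num;           d = substr(d, 3) }  # check "ms" before "m"
        else if (d ~ /^s/)  { total += num * 1000;    d = substr(d, 2) }
        else if (d ~ /^m/)  { total += num * 60000;   d = substr(d, 2) }
        else if (d ~ /^h/)  { total += num * 3600000; d = substr(d, 2) }
        else break          # unknown unit: stop parsing this value
      }
      return total
    }
    { printf "%.3f\n", ms($1) }
  '
}

# Example with made-up values; the output can then go to `sort -n | tail -10`.
printf '%s\n' '85.2ms' '1.579384709s' '1m2.5s' | to_ms
# prints:
# 85.200
# 1579.385
# 62500.000
```

With every value in the same unit, a numeric sort gives a meaningful "worst 10" and the average no longer mixes seconds with milliseconds.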

peterducai avatar Mar 03 '22 14:03 peterducai

Hello

  1. OK, so you mean that even if only 10 values are over 100ms and the rest are normal (<100ms), we should still take this as a sign of poor etcd performance. Right?
  2. If you mean that a sample of lines can have mixed units, like seconds in one line and ms in another, then yes, that is something that needs improvement, as the average is calculated from the number alone and ignores its unit. Also, I didn't know that these logs can display anything other than milliseconds. But if this idea does not provide any value, then why bother.
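To put the question in point 1 into numbers, one could count how many compactions exceed a threshold instead of looking only at the top 10 or only at the average. This is just a sketch: it assumes the input is already plain millisecond values, one per line, and the name count_slow and the sample values are invented for illustration. The 100ms default is the figure mentioned above, not an official limit.

```shell
# Hypothetical helper: reads millisecond values (one per line) on stdin and
# reports how many exceed a threshold (first argument, default 100).
count_slow() {
  awk -v limit="${1:-100}" '
    $1 + 0 > limit { slow++ }          # count values strictly above the threshold
    END { printf "%d of %d over %sms\n", slow, NR, limit }
  '
}

# Example with made-up values:
printf '%s\n' 12.1 250.4 99.9 101.0 | count_slow 100
# prints: 2 of 4 over 100ms
```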

nstamatelopoulos avatar Mar 03 '22 16:03 nstamatelopoulos

The current implementation has seconds and milliseconds:

[highest seconds] 1.579384709s 1.582356124s

[highest ms]

peterducai avatar Mar 31 '22 11:03 peterducai

Perfect! I'm closing this issue.

nstamatelopoulos avatar Nov 25 '22 13:11 nstamatelopoulos