
Issues when combining cdef, sum and negative: either incorrect or error

Open sjvrijn opened this issue 3 years ago • 0 comments

Description

I'm monitoring some multi-GPU machines and want to make a combined CPU/GPU utilization graph with GPU as positive and CPU as negative.

I can create such a graph just fine for a single GPU against 100 - (cpu.idle / #cores), but run into issues when trying to use the mean GPU utilization values, as calculated using sum and cdef.
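As a rough illustration of the values the combined graph is meant to show, here is a minimal Python sketch. It assumes, as the formula above implies, that cpu.idle is reported as a percentage summed over all cores; the function names and sample numbers are hypothetical and not part of Munin.

```python
def cpu_utilization(idle_total, cores):
    """Overall CPU utilization in percent, given idle percent summed over all cores."""
    return 100.0 - idle_total / cores


def mean_gpu_utilization(per_gpu):
    """Mean utilization in percent across all GPUs."""
    return sum(per_gpu) / len(per_gpu)


def graph_point(idle_total, cores, per_gpu):
    """Return (positive, negative): GPU mean above the axis, CPU mirrored below it."""
    return mean_gpu_utilization(per_gpu), -cpu_utilization(idle_total, cores)


if __name__ == "__main__":
    # Hypothetical sample: 48 cores with 3600% total idle, two GPUs at 40% and 80%.
    print(graph_point(3600.0, 48, [40.0, 80.0]))
```
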

Examples

Below are four situations to illustrate the issues for a machine with two GPUs. Config and output are shown below a short description:

  1. Baseline. I can plot the CPU and individual GPU values without problem.
# Shows the individual values without problem
test0.graph_title Test 0: baseline values
test0.graph_args --base 1000 -l -100 -u 100 -r
test0.graph_vlabel CPU / GPU
test0.graph_category system
test0.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        gpu2=multigpu.example.com:nvidia_gpu_utilization.utilization1
test0.cpu.cdef 100,cpu,48,/,-

[Graph: test 0: baseline values]

  2. I can also create my intended GPU-positive-CPU-negative plot without problem for an individual GPU's utilization combined with the cdef'd CPU value.
# Correctly shows GPU0 values as positive, CPU values as negative
test1.graph_title Test 1: direct
test1.graph_args --base 1000 -l -100 -u 100 -r
test1.graph_vlabel CPU / GPU
test1.graph_category system
test1.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu1=multigpu.example.com:nvidia_gpu_utilization.utilization0
test1.cpu.cdef 100,cpu,48,/,-
test1.cpu.graph no
test1.gpu1.negative cpu

[Graph: test 1: successful gpu1 with cpu as negative]

  3. If I simply plot the CPU and the mean of the 2 GPUs on the same graph, the CPU values are no longer correct; they seem to be the sum of the GPU mean and the CPU values. No idea what is happening here...
# CPU values show up incorrectly here
test2.graph_title Test 2: mean
test2.graph_args --base 1000 -l -100 -u 100 -r
test2.graph_vlabel CPU / GPU
test2.graph_category system
test2.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu
test2.cpu.cdef 100,cpu,48,/,-
test2.gpu.label gpu mean
test2.gpu.sum \
        multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        multigpu.example.com:nvidia_gpu_utilization.utilization1
test2.gpu.cdef gpu,2,/

[Graph: test 2: defining a mean gpu changes cpu values]

  4. If I try to combine them into a positive/negative graph, rendering fails with Not a valid vname errors in munin-graph.log, for names such as ccpu and ccdefcpu (where 'cpu' is my variable name).
test3.graph_title Test 3: up/down
test3.graph_args --base 1000 -l -100 -u 100 -r
test3.graph_vlabel CPU / GPU
test3.graph_category system
test3.graph_order \
        cpu=multigpu.example.com:cpu.idle \
        gpu
test3.cpu.cdef 100,cpu,48,/,-
test3.gpu.label gpu mean
test3.gpu.sum \
        multigpu.example.com:nvidia_gpu_utilization.utilization0 \
        multigpu.example.com:nvidia_gpu_utilization.utilization1
test3.gpu.cdef gpu,2,/
test3.cpu.graph no
test3.gpu.negative cpu
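
The cdef values in the configs above are RPN expressions. The following minimal Python sketch evaluates the two used here, `100,cpu,48,/,-` and `gpu,2,/`, to show what each is expected to compute on its own. It supports only numbers, variable names and the four arithmetic operators, not rrdtool's full RPN language, and the sample input values are hypothetical.

```python
import operator

# Only the arithmetic subset needed for the cdef expressions in this report.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr, variables):
    """Evaluate a comma-separated RPN expression against a dict of variable values."""
    stack = []
    for token in expr.split(","):
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))  # numeric literal
    if len(stack) != 1:
        raise ValueError(f"malformed RPN expression: {expr!r}")
    return stack[0]


if __name__ == "__main__":
    # Hypothetical inputs: 2400% total idle over 48 cores, GPU sum of 120%.
    print(eval_rpn("100,cpu,48,/,-", {"cpu": 2400}))
    print(eval_rpn("gpu,2,/", {"gpu": 120}))
```

Evaluated in isolation, the CPU expression depends only on `cpu`, which is why a CPU line that changes once the GPU mean is added looks wrong.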

Log

munin-graph.log:

2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-day.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
2021/06/25 16:21:28 [RRD ERROR] rrdtool 'graph' 'test3-day.png' \
        '--title' \
        'Test 3: up/down - by day' \
        '--start' \
        '-2000m' \
        '--base' \
        '1000' \
        '-l' \
        '-100' \
        '-u' \
        '100' \
        '-r' \
        '--vertical-label' \
        'CPU / GPU' \
        '--slope-mode' \
        '--height' \
        '175' \
        '--width' \
        '400' \
        '--imgformat' \
        'PNG' \
        '--lazy' \
        '--font' \
        'DEFAULT:0:DejaVuSans,DejaVu Sans,DejaVu LGC Sans,Bitstream Vera Sans' \
        '--font' \
        'LEGEND:7:DejaVuSansMono,DejaVu Sans Mono,DejaVu LGC Sans Mono,Bitstream Vera Sans Mono,monospace' \
        '--color' \
        'BACK#F0F0F0' \
        '--color' \
        'FRAME#F0F0F0' \
        '--color' \
        'CANVAS#FFFFFF' \
        '--color' \
        'FONT#666666' \
        '--color' \
        'AXIS#CFD6F8' \
        '--color' \
        'ARROW#CFD6F8' \
        '--border' \
        '0' \
        '-W' \
        'Munin 2.0.66' \
        'DEF:acpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MAX' \
        'DEF:icpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:MIN' \
        'DEF:gcpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:AVERAGE' \
        'DEF:az2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MAX' \
        'DEF:iz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:MIN' \
        'DEF:gz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:AVERAGE' \
        'DEF:az2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MAX' \
        'DEF:iz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:MIN' \
        'DEF:gz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:AVERAGE' \
        'CDEF:acdefz2_0=az2_0,UN,0,az2_0,IF' \
        'CDEF:icdefz2_0=iz2_0,UN,0,iz2_0,IF' \
        'CDEF:gcdefz2_0=gz2_0,UN,0,gz2_0,IF' \
        'CDEF:ccdefz2_0=gcdefz2_0' \
        'CDEF:acdefz2_1=az2_1,UN,0,az2_1,IF,acdefz2_0,ADDNAN,2,/' \
        'CDEF:icdefz2_1=iz2_1,UN,0,iz2_1,IF,icdefz2_0,ADDNAN,2,/' \
        'CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/' \
        'CDEF:ccdefz2_1=gcdefz2_1' \
        'COMMENT:        ' \
        'COMMENT:Cur (-/+)' \
        'COMMENT:Min (-/+)' \
        'COMMENT:Avg (-/+)' \
        'COMMENT:Max (-/+) \j' \
        'LINE1:gcdefz2_1#00CC00:gpu mean ' \
        'GPRINT:ccdefcpu:LAST:%6.2lf%s/\g' \
        'GPRINT:ccdefz2_1:LAST:%6.2lf%s' \
        'GPRINT:icdefcpu:MIN:%6.2lf%s/\g' \
        'GPRINT:icdefz2_1:MIN:%6.2lf%s' \
        'GPRINT:gcdefcpu:AVERAGE:%6.2lf%s/\g' \
        'GPRINT:gcdefz2_1:AVERAGE:%6.2lf%s' \
        'GPRINT:acdefcpu:MAX:%6.2lf%s/\g' \
        'GPRINT:acdefz2_1:MAX:%6.2lf%s\j' \
        'CDEF:acdefcpu=100,acpu,48,/,-' \
        'CDEF:icdefcpu=100,icpu,48,/,-' \
        'CDEF:gcdefcpu=100,gcpu,48,/,-' \
        'CDEF:ccdefcpu=gcdefcpu' \
        'CDEF:re_zero=gcdefcpu,UN,0,0,IF' \
        'CDEF:ngcdefcpu=gcdefcpu,-1,*' \
        'LINE1:ngcdefcpu#00CC00' \
        'LINE1:re_zero#000000' \
        'VRULE:1624630818#999999' \
        'COMMENT:Last update\: Fri Jun 25 16\:20\:18 2021\r' \
        '--end' \
        '1624630500'
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-week.png : Not a valid vname: ccpu in line GPRINT:ccpu:LAST:%6.2lf%s/\g
[... repeated details omitted for brevity ...]
2021/06/25 16:21:28 [RRD ERROR] Unable to graph test3-month.png : Not a valid vname: ccdefcpu in line GPRINT:ccdefcpu:LAST:%6.2lf%s/\g
[...]
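
One thing visible in the command above is that the GPRINT for ccdefcpu appears before the CDEF that defines ccdefcpu. The Python sketch below scans an argument list for that pattern. It assumes rrdtool needs a vname to be defined before the argument that references it, and it checks only GPRINT, LINE, AREA and STACK references. The function name is hypothetical; the arguments are a subset copied from the log above.

```python
import re

DEF_RE = re.compile(r"^(?:DEF|CDEF|VDEF):([A-Za-z0-9_-]+)=")
USE_RE = re.compile(r"^(?:GPRINT|LINE[0-9.]*|AREA|STACK):([A-Za-z0-9_-]+)")

# Subset of the rrdtool arguments from munin-graph.log, in their original order.
LOG_ARGS = [
    "DEF:gcpu=/var/lib/munin/multigpu.example.com-cpu-idle-d.rrd:42:AVERAGE",
    "DEF:gz2_1=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization1-g.rrd:42:AVERAGE",
    "DEF:gz2_0=/var/lib/munin/multigpu.example.com-nvidia_gpu_utilization-utilization0-g.rrd:42:AVERAGE",
    "CDEF:gcdefz2_0=gz2_0,UN,0,gz2_0,IF",
    "CDEF:gcdefz2_1=gz2_1,UN,0,gz2_1,IF,gcdefz2_0,ADDNAN,2,/",
    "CDEF:ccdefz2_1=gcdefz2_1",
    "LINE1:gcdefz2_1#00CC00:gpu mean ",
    r"GPRINT:ccdefcpu:LAST:%6.2lf%s/\g",
    "GPRINT:ccdefz2_1:LAST:%6.2lf%s",
    "CDEF:gcdefcpu=100,gcpu,48,/,-",
    "CDEF:ccdefcpu=gcdefcpu",
    "CDEF:ngcdefcpu=gcdefcpu,-1,*",
    "LINE1:ngcdefcpu#00CC00",
]


def find_undefined_uses(args):
    """Return (vname, argument) pairs where a vname is referenced before it is defined."""
    defined = set()
    problems = []
    for arg in args:
        match = DEF_RE.match(arg)
        if match:
            defined.add(match.group(1))
            continue
        match = USE_RE.match(arg)
        if match and match.group(1) not in defined:
            problems.append((match.group(1), arg))
    return problems


if __name__ == "__main__":
    for vname, arg in find_undefined_uses(LOG_ARGS):
        print(f"{vname} used before definition in: {arg}")
```

On this subset the only flagged name is ccdefcpu, which matches the vname in the first error message. Whether the argument ordering is the actual cause inside Munin is not established here.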

Additional information

Desktop

  • CentOS 7.9
  • Munin Version 2.0.66

Additional context

Issue #1373 seems related?

Also posted as a question on Server Fault.

sjvrijn, Nov 22 '21 15:11