
Performance and Scaling

Open bjornskovlund opened this issue 5 years ago • 16 comments

Hi,

Thanks for making this available!

I have a bit of an odd use-case (possibly), but I'm planning to use smokeping_prober to monitor a very large number of endpoints. So I did some testing with scaling yesterday. First, I've changed the hosts to be a list read from a file instead of a command line argument.
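The change itself is roughly the following (a rough sketch assuming a plain newline-separated hosts file, not the exact patch):

package main

import (
    "bufio"
    "os"
    "strings"
)

// readHosts loads one target per line from path, skipping blank lines and
// #-comments. Sketch only; upstream smokeping_prober takes hosts as arguments.
func readHosts(path string) ([]string, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    var hosts []string
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        line := strings.TrimSpace(sc.Text())
        if line != "" && !strings.HasPrefix(line, "#") {
            hosts = append(hosts, line)
        }
    }
    return hosts, sc.Err()
}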

My test server is a VM with 8 vCPUs and 4 GB of RAM. Memory usage was negligible throughout my tests, so I'm not bothered with that. I kicked off with about 1000 endpoints at a 1s interval with 20 buckets. After application start-up, CPU usage was around 400-500%, but the results showed both an excessive number of drops and RTTs spread very widely across the buckets, indicating that the application was too slow to record the results. Having previously had success with a smokeping instance monitoring roughly 100 endpoints, I tried splitting the list into 10 files of 100 endpoints each and kicking off 10 smokeping_prober instances, still on the same server. This worked wonders: CPU usage was again around 400-500%, but the individual measurements are now very clean, with any jitter introduced by the application being less than 2ms from what I can see.

My programming skills aren't very good, so there's not much I can do to optimise the performance here, but I'll happily make a PR for documentation if the above is "as expected".

bjornskovlund avatar Jun 25 '19 08:06 bjornskovlund

That's a good question. I have not benchmarked things at this large a scale.

A lot of the resource use is likely related to the upstream go-ping library that is used. I recently updated the code to switch from a fork back to the real upstream go-ping source; there have been some recent performance improvements there, but I haven't cut a new release yet.

There are also a couple of known issues in go-ping regarding memory use over time. We've been talking about how to mitigate this upstream.

The first thing to do is to enable the Go pprof endpoint and gather some CPU profiles.
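For reference, enabling the endpoint in a Go binary is just a matter of importing net/http/pprof and serving HTTP; a minimal sketch, not the exact smokeping_prober wiring (the :9374 port simply matches the default used in the profile commands below):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // The profiling handlers can share the same listener as the /metrics
    // endpoint; a bare DefaultServeMux listener is enough for this sketch.
    log.Fatal(http.ListenAndServe(":9374", nil))
}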

SuperQ avatar Jun 25 '19 09:06 SuperQ

Once we have a release with pprof, you can use this to grab a CPU profile.

I ran a quick test of 100 targets against localhost. It seems to use about 30-40% of a core on my laptop (i7-8550U). Most of the time seems to be spent in the Go runtime, handling syscalls, allocations, etc. It's possible that some of this could be improved by reducing allocations and such in the go-ping library.

$ go tool pprof http://localhost:9374/debug/pprof/profile
Fetching profile over HTTP from http://localhost:9374/debug/pprof/profile
Saved profile in /home/ben/pprof/pprof.smokeping_prober.samples.cpu.004.pb.gz
File: smokeping_prober
Type: cpu
Time: Jun 25, 2019 at 11:50am (CEST)
Duration: 30.12s, Total samples = 6.25s (20.75%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 50
Showing nodes accounting for 5.10s, 81.60% of 6.25s total
Dropped 124 nodes (cum <= 0.03s)
Showing top 50 nodes out of 147
      flat  flat%   sum%        cum   cum%
     0.96s 15.36% 15.36%      1.04s 16.64%  syscall.Syscall
     0.94s 15.04% 30.40%      0.94s 15.04%  runtime.futex
     0.45s  7.20% 37.60%      0.45s  7.20%  syscall.Syscall6
     0.23s  3.68% 41.28%      0.46s  7.36%  runtime.scanobject
     0.20s  3.20% 44.48%      0.20s  3.20%  runtime.procyield
     0.20s  3.20% 47.68%      0.20s  3.20%  runtime.usleep
     0.16s  2.56% 50.24%      0.20s  3.20%  runtime.heapBitsSetType
     0.12s  1.92% 52.16%      0.12s  1.92%  runtime.epollwait
     0.10s  1.60% 53.76%      0.13s  2.08%  runtime.findObject
     0.10s  1.60% 55.36%      1.38s 22.08%  runtime.mallocgc
     0.10s  1.60% 56.96%      0.10s  1.60%  runtime.pageIndexOf
     0.10s  1.60% 58.56%      0.10s  1.60%  time.now
     0.09s  1.44% 60.00%      0.26s  4.16%  runtime.lock
     0.09s  1.44% 61.44%      0.09s  1.44%  runtime.nextFreeFast

SuperQ avatar Jun 25 '19 09:06 SuperQ

I published v0.3.0, please give it a spin.

https://github.com/SuperQ/smokeping_prober/releases/tag/v0.3.0

SuperQ avatar Jun 25 '19 14:06 SuperQ

Hi,

Just gave it a quick spin and tried the profiler. I have no idea what it means ;) I will have a look at the metrics coming out tomorrow.

# go tool pprof http://localhost:19999/debug/pprof/profile
Fetching profile from http://localhost:19999/debug/pprof/profile
Please wait... (30s)
Saved profile in /root/pprof/pprof.smokeping_prober.localhost:19999.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top 50
67.13s of 83.52s total (80.38%)
Dropped 331 nodes (cum <= 0.42s)
Showing top 50 nodes out of 115 (cum >= 4.07s)
      flat  flat%   sum%        cum   cum%
    14.66s 17.55% 17.55%     15.53s 18.59%  syscall.Syscall
     5.41s  6.48% 24.03%     11.38s 13.63%  runtime.scanobject
     3.63s  4.35% 28.38%      4.59s  5.50%  runtime.findObject
     3.06s  3.66% 32.04%      3.06s  3.66%  runtime.epollwait
     2.90s  3.47% 35.51%     25.86s 30.96%  runtime.mallocgc
     2.63s  3.15% 38.66%     13.46s 16.12%  runtime.gentraceback
     2.39s  2.86% 41.52%      2.39s  2.86%  runtime.markBits.isMarked
     2.39s  2.86% 44.38%      2.78s  3.33%  runtime.step
     2.16s  2.59% 46.97%      2.16s  2.59%  runtime.pageIndexOf
     1.86s  2.23% 49.20%      4.82s  5.77%  runtime.pcvalue
     1.63s  1.95% 51.15%      2.02s  2.42%  runtime.heapBitsSetType
     1.46s  1.75% 52.90%      4.72s  5.65%  runtime.scanblock
     1.37s  1.64% 54.54%      1.71s  2.05%  runtime.lock
     1.37s  1.64% 56.18%      1.55s  1.86%  runtime.unlock
     1.27s  1.52% 57.70%      1.90s  2.27%  runtime.gcmarknewobject
     1.25s  1.50% 59.20%      1.33s  1.59%  time.now
     1.16s  1.39% 60.58%      1.16s  1.39%  runtime.memclrNoHeapPointers
     1.15s  1.38% 61.96%      1.15s  1.38%  runtime.nextFreeFast
     1.13s  1.35% 63.31%      3.45s  4.13%  runtime.selectgo
     1.07s  1.28% 64.60%      1.07s  1.28%  runtime.futex
     0.82s  0.98% 65.58%      1.01s  1.21%  runtime.spanOf
     0.80s  0.96% 66.53%      2.61s  3.12%  runtime.sweepone
     0.79s  0.95% 67.48%      5.35s  6.41%  runtime.greyobject
     0.78s  0.93% 68.41%     17.55s 21.01%  runtime.scang
     0.77s  0.92% 69.34%      3.09s  3.70%  runtime.gcWriteBarrier
     0.74s  0.89% 70.22%      0.78s  0.93%  runtime.casgstatus
     0.74s  0.89% 71.11%      0.74s  0.89%  runtime.procyield
     0.67s   0.8% 71.91%      0.71s  0.85%  runtime.findfunc
     0.62s  0.74% 72.65%     11.77s 14.09%  runtime.gcDrainN
     0.58s  0.69% 73.35%      2.89s  3.46%  runtime.wbBufFlush1
     0.56s  0.67% 74.02%      4.02s  4.81%  runtime.getStackMap
     0.52s  0.62% 74.64%     26.41s 31.62%  golang.org/x/net/internal/socket.(*Conn).recvMsg
     0.46s  0.55% 75.19%      1.78s  2.13%  internal/poll.runtime_pollSetDeadline
     0.45s  0.54% 75.73%      0.45s  0.54%  runtime.memmove
     0.43s  0.51% 76.25%      0.43s  0.51%  runtime.(*lfstack).pop
     0.41s  0.49% 76.74%      0.45s  0.54%  runtime.heapBitsForAddr
     0.28s  0.34% 77.07%     40.76s 48.80%  github.com/sparrc/go-ping.(*Pinger).recvICMP
     0.28s  0.34% 77.41%      0.49s  0.59%  runtime.(*gcBits).bitp
     0.28s  0.34% 77.74%      8.79s 10.52%  runtime.scanframeworker
     0.27s  0.32% 78.07%     12.76s 15.28%  runtime.newobject
     0.25s   0.3% 78.36%     32.75s 39.21%  golang.org/x/net/ipv4.(*payloadHandler).ReadFrom
     0.24s  0.29% 78.65%     17.54s 21.00%  runtime.markroot
     0.23s  0.28% 78.93%      1.67s  2.00%  runtime.(*mspan).sweep
     0.20s  0.24% 79.17%      0.63s  0.75%  runtime.sellock
     0.19s  0.23% 79.39%      0.78s  0.93%  runtime.deferreturn
     0.18s  0.22% 79.61%      3.52s  4.21%  runtime.(*mcentral).cacheSpan
     0.17s   0.2% 79.81%      0.44s  0.53%  runtime.newdefer
     0.16s  0.19% 80.00%      3.18s  3.81%  runtime.pcdatavalue
     0.16s  0.19% 80.20%     15.79s 18.91%  runtime.scanstack
     0.15s  0.18% 80.38%      4.07s  4.87%  golang.org/x/net/internal/socket.parseInetAddr

bjornskovlund avatar Jun 25 '19 22:06 bjornskovlund

I can confirm that running 945 targets (sorry, not 1000 exactly) in a single smokeping_prober instance (which is what produced the above pprof) still gives very bad results, with the application obviously being slow at recording them.

bjornskovlund avatar Jun 26 '19 09:06 bjornskovlund

I'm not surprised that this many targets would start to cause issues. The code is far from optimal at this point. Do you have a graph of rate(process_cpu_seconds_total[2m])?

In addition, each target has 25 metrics. With 1k targets, that's 25k metrics every scrape, which is quite a lot. I would double check / graph the scrape_duration_seconds values for the exporter to see how long it takes.

I don't have a lot of other example systems to compare this against to really say if this is good or terrible performance. :smiley:

SuperQ avatar Jun 26 '19 10:06 SuperQ

Hi,

rate(process_cpu_seconds_total[2m]): [screenshot: 2019-06-26 at 12:05:37]
scrape_duration_seconds: [screenshot: 2019-06-26 at 12:07:23]

Yes, I am aware that 1,000 targets is a lot, also in terms of the number of time series for Prometheus. To make it worse, I expect the target list to grow towards 10,000 within the year ;) It seems to scale quite well as long as I only have 100 targets per process, so I will just scale it horizontally instead of vertically.

bjornskovlund avatar Jun 26 '19 10:06 bjornskovlund

There doesn't seem to be any particular scaling issue. It's using exactly 10x the amount of CPU that I saw for 100 targets.

I'm sure there are efficiency gains that can be made, likely mostly around making sure we reduce the number of memory allocations per packet handled.

It will take some time to profile this and fix things.
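For illustration, one such change would be reusing receive buffers instead of allocating a fresh slice for every reply; a rough sketch of the technique (not the actual go-ping code):

package main

import (
    "log"
    "net"
    "sync"

    "golang.org/x/net/icmp"
)

const protocolICMP = 1 // IANA protocol number for ICMP over IPv4

// bufPool hands out reusable receive buffers so each incoming reply does not
// allocate a new 1500-byte slice on the heap.
var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 1500) },
}

// recvLoop reads ICMP messages and passes them to handle, reusing a pooled
// buffer for every read. Purely illustrative; the real receive path in
// go-ping is more involved.
func recvLoop(conn *icmp.PacketConn, handle func(*icmp.Message, net.Addr)) {
    for {
        buf := bufPool.Get().([]byte)
        n, peer, err := conn.ReadFrom(buf)
        if err != nil {
            bufPool.Put(buf)
            log.Println("read:", err)
            return
        }
        msg, perr := icmp.ParseMessage(protocolICMP, buf[:n])
        bufPool.Put(buf) // ParseMessage copies what it needs, so the buffer can go back
        if perr == nil {
            handle(msg, peer)
        }
    }
}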

SuperQ avatar Jun 26 '19 10:06 SuperQ

But the difference is: If I run 1 instance of 1000 targets on a VM, I get very blurry results due to what I expect to be slow registration of replies: [screenshot: 2019-06-26 at 12:48:06]

If I run 10 instances of 100 targets on the same VM, I get better results: [screenshot: 2019-06-26 at 12:49:42]

Both scenarios consume roughly the same CPU (400-500% of a 2.4GHz Skylake).

bjornskovlund avatar Jun 26 '19 10:06 bjornskovlund

Interesting. That is definitely related to the way Go is handling packets.

I know there has been some work in the statsd exporter to improve things. But this is a bit out of my skillset as well.

SuperQ avatar Jun 26 '19 11:06 SuperQ

I know there has been some work in the statsd exporter to improve things

Do you have a link or more information on what you are referring to?

There are also a couple of known issues in go-ping regarding memory use over time. We've been talking about how to mitigate this upstream.

We've been running probes for 6+ weeks now and also see quite an increase in memory usage over time. Anything I can do to assist here?

patrickjahns avatar Apr 23 '20 08:04 patrickjahns

@patrickjahns I don't have any specific links for the statsd_exporter. I haven't had the time to look into that.

As for memory, there are several tracking structures in the go-ping library. Sadly, upstream has not maintained it since June 2019. I've been thinking about making a breaking fork to continue improving it.
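To illustrate the kind of growth involved (a sketch of the general pattern, not the actual go-ping code): if every reply's RTT is appended to a slice for later statistics, a prober that runs for weeks accumulates samples without bound, whereas a fixed-size ring of recent samples caps the memory.

package main

import "time"

// unboundedTracker grows forever on a long-running pinger.
type unboundedTracker struct {
    rtts []time.Duration
}

func (t *unboundedTracker) observe(rtt time.Duration) {
    t.rtts = append(t.rtts, rtt) // never pruned
}

// ringTracker keeps only the most recent len(rtts) samples.
type ringTracker struct {
    rtts [1024]time.Duration
    next int
    n    int
}

func (r *ringTracker) observe(rtt time.Duration) {
    r.rtts[r.next] = rtt
    r.next = (r.next + 1) % len(r.rtts)
    if r.n < len(r.rtts) {
        r.n++
    }
}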

SuperQ avatar Apr 23 '20 08:04 SuperQ

Thanks for the quick response - I'll take a look at statsd and see what I can find out.

As for memory, there are several tracking structures in the go-ping library. Sadly, upstream has not maintained it since June 2019. I've been thinking about making a breaking fork to continue improving it.

I am still learning Go, but I'm happy to help out here as much as possible. Would it be possible to ask upstream to migrate the repository to an organisation so that a community can improve it?

Investigating one of the probes, I saw this in the logs:

fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x99b6d1, 0x16)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0x7ffd8bd0e7f0 sp=0x7ffd8bd0e7c0 pc=0x42d8c2
runtime.sysMap(0xc000000000, 0x4000000, 0xe30378)
	/usr/local/go/src/runtime/mem_linux.go:170 +0xc7 fp=0x7ffd8bd0e830 sp=0x7ffd8bd0e7f0 pc=0x417fc7
runtime.(*mheap).sysAlloc(0xe18000, 0x2000, 0xe18010, 0x1)
	/usr/local/go/src/runtime/malloc.go:633 +0x1cd fp=0x7ffd8bd0e8d8 sp=0x7ffd8bd0e830 pc=0x40af4d
runtime.(*mheap).grow(0xe18000, 0x1, 0x0)
	/usr/local/go/src/runtime/mheap.go:1222 +0x42 fp=0x7ffd8bd0e930 sp=0x7ffd8bd0e8d8 pc=0x4252e2
runtime.(*mheap).allocSpanLocked(0xe18000, 0x1, 0xe30388, 0x0)
	/usr/local/go/src/runtime/mheap.go:1150 +0x37f fp=0x7ffd8bd0e968 sp=0x7ffd8bd0e930 pc=0x4251cf
runtime.(*mheap).alloc_m(0xe18000, 0x1, 0x2a, 0x0)
	/usr/local/go/src/runtime/mheap.go:977 +0xc2 fp=0x7ffd8bd0e9b8 sp=0x7ffd8bd0e968 pc=0x424822
runtime.(*mheap).alloc.func1()
	/usr/local/go/src/runtime/mheap.go:1048 +0x4c fp=0x7ffd8bd0e9f0 sp=0x7ffd8bd0e9b8 pc=0x457ebc
runtime.(*mheap).alloc(0xe18000, 0x1, 0x1002a, 0x0)
	/usr/local/go/src/runtime/mheap.go:1047 +0x8a fp=0x7ffd8bd0ea40 sp=0x7ffd8bd0e9f0 pc=0x424afa
runtime.(*mcentral).grow(0xe18e00, 0x0)
	/usr/local/go/src/runtime/mcentral.go:256 +0x95 fp=0x7ffd8bd0ea88 sp=0x7ffd8bd0ea40 pc=0x417a45
runtime.(*mcentral).cacheSpan(0xe18e00, 0x7f631c9cc000)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2ff fp=0x7ffd8bd0eae8 sp=0x7ffd8bd0ea88 pc=0x41754f
runtime.(*mcache).refill(0x7f631c9cc008, 0x2a)
	/usr/local/go/src/runtime/mcache.go:135 +0x86 fp=0x7ffd8bd0eb08 sp=0x7ffd8bd0eae8 pc=0x416fe6
runtime.(*mcache).nextFree(0x7f631c9cc008, 0xe1172a, 0x7f631c9cc008, 0x7f631c9cc000, 0x8)
	/usr/local/go/src/runtime/malloc.go:786 +0x88 fp=0x7ffd8bd0eb40 sp=0x7ffd8bd0eb08 pc=0x40b788
runtime.mallocgc(0x180, 0x98b860, 0x1, 0xe303e0)
	/usr/local/go/src/runtime/malloc.go:939 +0x76e fp=0x7ffd8bd0ebe0 sp=0x7ffd8bd0eb40 pc=0x40c09e
runtime.newobject(0x98b860, 0x4000)
	/usr/local/go/src/runtime/malloc.go:1068 +0x38 fp=0x7ffd8bd0ec10 sp=0x7ffd8bd0ebe0 pc=0x40c4a8
runtime.malg(0x25a68400008000, 0xe1a670)
	/usr/local/go/src/runtime/proc.go:3220 +0x31 fp=0x7ffd8bd0ec50 sp=0x7ffd8bd0ec10 pc=0x4369e1
runtime.mpreinit(...)
	/usr/local/go/src/runtime/os_linux.go:311
runtime.mcommoninit(0xe12ac0)
	/usr/local/go/src/runtime/proc.go:618 +0xc2 fp=0x7ffd8bd0ec88 sp=0x7ffd8bd0ec50 pc=0x430662
runtime.schedinit()
	/usr/local/go/src/runtime/proc.go:540 +0x74 fp=0x7ffd8bd0ece0 sp=0x7ffd8bd0ec88 pc=0x4302f4
runtime.rt0_go(0x7ffd8bd0ed18, 0x3, 0x7ffd8bd0ed18, 0x0, 0x0, 0x3, 0x7ffd8bd0ffa4, 0x7ffd8bd0ffc4, 0x7ffd8bd0ffd7, 0x0, ...)
	/usr/local/go/src/runtime/asm_amd64.s:195 +0x11a fp=0x7ffd8bd0ece8 sp=0x7ffd8bd0ece0 pc=0x459d6a
fatal error: runtime: out of memory

runtime stack:
runtime.throw(0x99b6d1, 0x16)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0x7ffd62426660 sp=0x7ffd62426630 pc=0x42d8c2
runtime.sysMap(0xc000000000, 0x4000000, 0xe30378)
	/usr/local/go/src/runtime/mem_linux.go:170 +0xc7 fp=0x7ffd624266a0 sp=0x7ffd62426660 pc=0x417fc7
runtime.(*mheap).sysAlloc(0xe18000, 0x2000, 0xe18010, 0x1)
	/usr/local/go/src/runtime/malloc.go:633 +0x1cd fp=0x7ffd62426748 sp=0x7ffd624266a0 pc=0x40af4d
runtime.(*mheap).grow(0xe18000, 0x1, 0x0)
	/usr/local/go/src/runtime/mheap.go:1222 +0x42 fp=0x7ffd624267a0 sp=0x7ffd62426748 pc=0x4252e2
runtime.(*mheap).allocSpanLocked(0xe18000, 0x1, 0xe30388, 0x0)
	/usr/local/go/src/runtime/mheap.go:1150 +0x37f fp=0x7ffd624267d8 sp=0x7ffd624267a0 pc=0x4251cf
runtime.(*mheap).alloc_m(0xe18000, 0x1, 0x2a, 0x0)
	/usr/local/go/src/runtime/mheap.go:977 +0xc2 fp=0x7ffd62426828 sp=0x7ffd624267d8 pc=0x424822
runtime.(*mheap).alloc.func1()
	/usr/local/go/src/runtime/mheap.go:1048 +0x4c fp=0x7ffd62426860 sp=0x7ffd62426828 pc=0x457ebc
runtime.(*mheap).alloc(0xe18000, 0x1, 0x1002a, 0x0)
	/usr/local/go/src/runtime/mheap.go:1047 +0x8a fp=0x7ffd624268b0 sp=0x7ffd62426860 pc=0x424afa
runtime.(*mcentral).grow(0xe18e00, 0x0)
	/usr/local/go/src/runtime/mcentral.go:256 +0x95 fp=0x7ffd624268f8 sp=0x7ffd624268b0 pc=0x417a45
runtime.(*mcentral).cacheSpan(0xe18e00, 0x7f34fa007000)
	/usr/local/go/src/runtime/mcentral.go:106 +0x2ff fp=0x7ffd62426958 sp=0x7ffd624268f8 pc=0x41754f
runtime.(*mcache).refill(0x7f34fa007008, 0x2a)
	/usr/local/go/src/runtime/mcache.go:135 +0x86 fp=0x7ffd62426978 sp=0x7ffd62426958 pc=0x416fe6
runtime.(*mcache).nextFree(0x7f34fa007008, 0xe1172a, 0x7f34fa007008, 0x7f34fa007000, 0x8)
	/usr/local/go/src/runtime/malloc.go:786 +0x88 fp=0x7ffd624269b0 sp=0x7ffd62426978 pc=0x40b788
runtime.mallocgc(0x180, 0x98b860, 0x1, 0xe303e0)
	/usr/local/go/src/runtime/malloc.go:939 +0x76e fp=0x7ffd62426a50 sp=0x7ffd624269b0 pc=0x40c09e
runtime.newobject(0x98b860, 0x4000)
	/usr/local/go/src/runtime/malloc.go:1068 +0x38 fp=0x7ffd62426a80 sp=0x7ffd62426a50 pc=0x40c4a8
runtime.malg(0x25a68500008000, 0xe1a670)
	/usr/local/go/src/runtime/proc.go:3220 +0x31 fp=0x7ffd62426ac0 sp=0x7ffd62426a80 pc=0x4369e1
runtime.mpreinit(...)
	/usr/local/go/src/runtime/os_linux.go:311
runtime.mcommoninit(0xe12ac0)
	/usr/local/go/src/runtime/proc.go:618 +0xc2 fp=0x7ffd62426af8 sp=0x7ffd62426ac0 pc=0x430662
runtime.schedinit()
	/usr/local/go/src/runtime/proc.go:540 +0x74 fp=0x7ffd62426b50 sp=0x7ffd62426af8 pc=0x4302f4
runtime.rt0_go(0x7ffd62426b88, 0x3, 0x7ffd62426b88, 0x0, 0x0, 0x3, 0x7ffd62427fa4, 0x7ffd62427fc4, 0x7ffd62427fd7, 0x0, ...)
	/usr/local/go/src/runtime/asm_amd64.s:195 +0x11a fp=0x7ffd62426b58 sp=0x7ffd62426b50 pc=0x459d6a
time="2020-04-22T14:56:30Z" level=info msg="Starting smokeping_prober (version=0.3.0, branch=HEAD, revision=594bd985ddfac52c473dcd4d290e9a8798406a10)" source="main.go:100"
time="2020-04-22T14:56:30Z" level=info msg="Build context (go=go1.12.6, user=root@100e204324f4, date=20190625-13:48:41)" source="main.go:101"
time="2020-04-22T14:56:30Z" level=info msg="Starting prober for <redacted>" source="main.go:122"
time="2020-04-22T14:56:30Z" level=info msg="Listening on :9374" source="main.go:140"

patrickjahns avatar Apr 25 '20 13:04 patrickjahns

One issue is that this may be related to an upstream performance/deadlock problem in the ping library.

See: https://github.com/go-ping/ping/issues/125

SuperQ avatar Dec 14 '20 16:12 SuperQ

I've updated the ping library; there have been a number of improvements that affect scaling, deadlocks, and ping concurrency.

I have a couple more changes going in shortly that will help even more.

See: https://github.com/SuperQ/smokeping_prober/pull/49

If you could test with the latest code, that would help.

SuperQ avatar Feb 02 '21 18:02 SuperQ

Has anyone been able to test this with 1000+ devices again with the most recent changes?

cchance27 avatar Jun 16 '21 05:06 cchance27

I think we've solved most of the issues with the upstream ping library. If anyone has additional information, please open a new issue.

SuperQ avatar Jan 26 '23 07:01 SuperQ