goflow2 icon indicating copy to clipboard operation
goflow2 copied to clipboard

Help with performance: trying to figure out what is wrong with my setup, or a possible GoFlow2 memory leak

Open luciano2683 opened this issue 8 months ago • 6 comments

Hello all,

I'm working for a big company where I'm pushing the usage of GoFlow2... I've been fighting with this for the last 2 months, but I always end up back at the starting point. Finally, last week I discovered that the issue is goflow2 itself.

In my scenario I'm using a server with 24 vCPU, 64 GB RAM and all-flash NVMe storage (a VM, not Docker). Inside this machine I have GoFlow2, Kafka and ClickHouse (so it is a standalone node).

What I noticed is that the traffic suddenly drops. I was blaming ClickHouse all the time, but in the end it was goflow2, and it's related to the number of workers.

For example, this is my configuration (debug level never worked for me):

```
[Unit]
Description=GoFlow2 NetFlow Collector
After=network.target kafka.service
Requires=kafka.service

[Service]
Environment="GOMEMLIMIT=4000000000"
ExecStart=/data/goflow2/goflow2-2.2.2-linux-x86_64 -listen "netflow://xxx.xxx.xxx.xxx:9995/?count=8&workers=16&queue_size=1000000" -transport kafka -transport.kafka.brokers "127.0.0.1:9092" -transport.kafka.topic "flows" -format bin -addr "xxx.xxx.xxx.xxx:8080" -mapping /data/goflow2/mapping.yaml -loglevel="debug" >> /var/log/goflow2.log 2>&1
Restart=always
RestartSec=10
User=root
TimeoutStartSec=300
StandardOutput=append:/var/log/goflow2.log
StandardError=append:/var/log/goflow2.log
WorkingDirectory=/data/goflow2

[Install]
WantedBy=multi-user.target
```

If I increase the number of workers, I see that I'm able to capture more data. However, there is a point in time where the CPU utilization is so high that goflow2 slows down processing incoming NetFlow packets (IPFIX in my case, from Aruba/SilverPeak devices), yet the CPU utilization keeps going up!

If I use a low number of workers (let's say count=4 and workers=4) the issue does not happen, but I miss a lot of traffic.
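
For reference, that "low" setting corresponds to a listen URI like the one below (address redacted as in my unit file). As I understand the options, `count` is the number of UDP sockets opened and `workers` the number of decoding workers draining them:

```
-listen "netflow://xxx.xxx.xxx.xxx:9995/?count=4&workers=4&queue_size=1000000"
```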

This is my current mapping.yaml:

```yaml
formatter:
  fields:
    - type
    - time_received_ns
    - sequence_num
    - sampling_rate
    - flow_direction
    - bi_flow_direction
    - sampler_address
    - time_flow_start_ns
    - time_flow_end_ns
    - bytes
    - packets
    - src_addr
    - src_net
    - dst_addr
    - dst_net
    - src_prefix # Add
    - dst_prefix # Add
    - etype
    - proto
    - src_port
    - dst_port
    - in_if
    - out_if
    - src_mac
    - dst_mac
    - icmp_name
    - templateId
    - ip_tos
    - tcp_flags
    - src_as
    - dst_as
    - edge_app_name
    - edge_app_cat
    - edge_overlay_name
    - edge_direction
    - edge_from_zone
    - edge_to_zone
    - edge_host_name
    - vrf_name
    - ingress_vrf_id
    - egress_vrf_id
    - edge_post_nat_src_addr
  protobuf:
    - name: flow_direction
      index: 42
      type: varint
    - name: bi_flow_direction
      index: 41
      type: varint
    - name: ingress_vrf_id
      index: 39
      type: varint
    - name: egress_vrf_id
      index: 40
      type: varint
    - name: templateId
      index: 999
      type: varint
    - name: src_prefix
      index: 44
      type: bytes
      length: 16
    - name: dst_prefix
      index: 45
      type: bytes
      length: 16
    - name: edge_app_name
      index: 2014
      type: string
      #array: true
    - name: edge_app_cat
      index: 2027
      type: string
      #array: true
    - name: edge_overlay_name
      index: 2025
      type: string
      #array: true
    - name: edge_direction
      index: 2028
      type: string
      #array: true
    - name: edge_from_zone
      index: 2022
      type: string
      #array: true
    - name: edge_to_zone
      index: 2023
      type: string
      #array: true
    - name: edge_host_name
      index: 2031
      type: string
      #array: true
    - name: vrf_name
      index: 2030
      type: string
      #array: true
    - name: edge_post_nat_src_addr
      index: 2032
      type: bytes
      length: 16
  render:
    time_received_ns: datetimenano
    edge_post_nat_src_addr: ip
ipfix:
  mapping:
    - field: 44
      destination: src_prefix
    - field: 45
      destination: dst_prefix
    - field: 61
      destination: flow_direction
    - field: 239
      destination: bi_flow_direction
    - field: 234
      destination: ingress_vrf_id
    - field: 235
      destination: egress_vrf_id
    - field: 256
      destination: templateId
    - field: 252
      destination: in_if
    - field: 253
      destination: out_if
    - field: 96
      destination: edge_app_name
      array: true
      encoding: utf8
    - field: 27
      penprovided: true
      pen: 23867
      destination: edge_app_cat
      array: true
      encoding: utf8
    - field: 26
      penprovided: true
      pen: 23867
      destination: edge_direction
      array: true
      encoding: utf8
    - field: 25
      penprovided: true
      pen: 23867
      destination: edge_overlay_name
      array: true
      encoding: utf8
    - field: 22
      penprovided: true
      pen: 23867
      destination: edge_from_zone
      array: true
      encoding: utf8
    - field: 23
      penprovided: true
      pen: 23867
      destination: edge_to_zone
      array: true
      encoding: utf8
    - field: 8
      penprovided: true
      pen: 23867
      destination: edge_host_name
      array: true
      encoding: utf8
    - field: 236
      destination: vrf_name
      array: true
      encoding: utf8
    - field: 225
      destination: edge_post_nat_src_addr
netflowv9:
  mapping:
    - field: 34
      destination: sampling_rate
      endian: little
    - field: 61
      destination: flow_direction
sflow:
  mapping:
    - layer: "udp"
      offset: 48
      length: 16
      destination: csum
    - layer: "tcp"
      offset: 128
      length: 16
      destination: csum
```

I also did some fine tuning of the Linux host. Despite all of this, the issue appears during high volumes of traffic.

Image

After a service restart everything goes back to normal... So it looks like a memory leak or something similar inside goflow2.

I'm seriously thinking of throwing this away and going for pmacct, which in theory should consume less memory due to the language difference.

I'm hoping that someone can help me with this. Ty!

luciano2683 avatar Jun 16 '25 16:06 luciano2683

I was checking the differences here: https://github.com/netsampler/goflow2/compare/v2.2.2...v2.2.3 and decided to download and use v2.2.3. Let's see how it behaves during Europe business hours.

luciano2683 avatar Jun 17 '25 01:06 luciano2683

@luciano2683 it is unlikely v2.2.3 improves the situation, as there were no changes related to performance.

I am lacking a lot of important information to be able to help you. It would be helpful to know how many packets per second you receive, and to collect the Prometheus metrics and the host metrics (e.g. with tools like dropwatch).

> i did some fine tuning to the Linux also.

What kind of tuning?

Is it possible you are reaching the limits of a single machine? GoFlow2 is engineered to be horizontally scalable. Kafka and ClickHouse being on the same node could be causing issues. You also seem to be extracting a lot of custom fields, which has an added compute cost. I would recommend setting up a load-balanced system. This architecture would also be more resilient against data loss.

Regarding debug mode, it may need a cleanup since the move to slog.

For the workers/queues I would recommend having a look at https://github.com/netsampler/goflow2/blob/main/docs/performance.md

What is likely happening is that the queue gets filled because too many flows are coming in, and the kernel starts dropping UDP packets without GoFlow2 being informed of it.

You can try increasing the sampling rate on your devices. Using htop should also show the spread over your CPUs. If there is an imbalance, you could try creating 24 sockets and 24 workers.
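
A quick way to confirm kernel-side UDP drops is to sample the kernel counters twice and compare (a sketch; field positions vary slightly between kernel versions):

```shell
# The Udp line of /proc/net/snmp includes RcvbufErrors: datagrams
# discarded because the socket receive buffer was full, i.e. the
# collector was not draining it fast enough. A growing value between
# the two samples means the kernel is dropping before GoFlow2 sees them.
grep '^Udp:' /proc/net/snmp
sleep 5
grep '^Udp:' /proc/net/snmp
```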

GoFlow2 was written in Go for flexibility, extensibility and maintainability, with a tradeoff in raw performance because it's a garbage-collected language. So it depends on what you are looking for.

lspgn avatar Jun 17 '25 05:06 lspgn

Hello louis! @lspgn

Thanks for your answer. Yes, and I'm sorry that I provided almost no data for troubleshooting; I'm super limited in what I can share. Having said this, and as you assumed, the new version didn't help. I did not set up Prometheus, but I'm attaching the /metrics output at the moment of the issue. If I counted correctly, I'm seeing 14,164 flows/second.

goflow2-10.txt

This is one of the tunings I made:

```
net.core.rmem_max = 33554432
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 10000
```
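
For reference, these values can be persisted across reboots with a sysctl drop-in (the filename below is just an example) and loaded with `sysctl --system`:

```
# /etc/sysctl.d/99-netflow.conf (example filename)
net.core.rmem_max = 33554432
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 10000
```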

Image

Image

Also, here is dropwatch; I ran it for 20 seconds (under full load, as you can see in the picture above):

```
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
2 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
8 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
36 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
40 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
2 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
18 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
19 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
8 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
1540 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
18 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
19 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
25 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
25 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
7 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
24 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at tcp_rcv_established+218 (0xffffffffadecf2d8) [software]
23 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
3773 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
183 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
8010 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
1 drops at tcp_validate_incoming+135 (0xffffffffadecdea5) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
15 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
24840 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
11 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
30 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4624 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
6 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]
9 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
29 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
15 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at unix_release_sock+205 (0xffffffffadf589a5) [software]
2 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at sk_stream_kill_queues+58 (0xffffffffaddea558) [software]
12 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
35 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at tcp_rcv_established+218 (0xffffffffadecf2d8) [software]
3 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
2 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
33 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
7 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at tcp_rcv_established+218 (0xffffffffadecf2d8) [software]
38 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
5 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
38 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
6 drops at skb_queue_purge_reason+d6 (0xffffffffadde4126) [software]
13 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
1 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
4 drops at __init_scratch_end+fbdc203 (0xffffffffc05dc203) [software]
```

I would expect to start losing packets when I reach the maximum capacity of the software. But the problem here is the sudden drop...

Answering your questions and comments about what I'm looking for: all of my devices are Aruba SD-WAN/SilverPeak devices, and they don't have any option for sampling. The active flow timeout is 5 minutes, and the templates are sent every 10 minutes.

In the past we had CA NFA (now Broadcom NFA); a single instance with far fewer resources was able to handle all of this traffic. The old software supported Cisco NBAR, but knew nothing about the Aruba fields I'm collecting.

I had 3 servers: one for the Americas region, one for EU and one for APAC. Here I'm trying to do the same, but the problem is the EU region, which is struggling to handle 140+ routers.

I guess what I can do now is monitor fewer interfaces. Waiting for your comments. Ty!

luciano2683 avatar Jun 17 '25 10:06 luciano2683

> 3773 drops at udp_queue_rcv_one_skb+389 (0xffffffffadeeea09) [software]

These are the concerning ones.

10 GB of RAM is also quite high for 14k flows/s. Would you be able to find out how many packets per second?

Could you try running without a mapping file?

lspgn avatar Jun 18 '25 07:06 lspgn

@lspgn ty again for the answer. I made 3 changes and now it is super stable.

The first one was to remove fields from the mapping.yaml. Explanation: I only left the fields that I'm collecting in ClickHouse; there were fields taken from here: https://github.com/netsampler/goflow2/blob/main/cmd/goflow2/mapping.yaml

But I was not using them, so I removed them. Even so, this made no change in CPU utilization on the goflow2 side.

The second one was to use realtime mode, not buffered. Explanation:

At the beginning I saw that memory consumption was pretty high, which is why I set Environment="GOMEMLIMIT=4000000000" to limit GoFlow2... But then I saw that CPU utilization was at 100% while the number of packets being pushed to Kafka was low. Then I removed Environment="GOMEMLIMIT=4000000000" and the memory consumption jumped to 44 GB of RAM... Again, CPU utilization was super high, but at least messages were being pushed to Kafka. However, I could not keep this scenario, as I would suffer again the issue that I showed in my two previous posts.

I switched to count=24, workers=16 and blocking=true... The software did survive the "business hours". So now I have pushed it a little bit, and count and workers are both 24.
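
Concretely, the listen URI changed to something like this (other flags unchanged from my first post; as I understand it, blocking=true makes the receiver apply back-pressure instead of buffering in the queue):

```
-listen "netflow://xxx.xxx.xxx.xxx:9995/?count=24&workers=16&blocking=true"
```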

The third thing I did (although I did not like it at all) was to remove the LAN interfaces from the NetFlow export; now I'm exporting only the WAN interfaces, basically a 50% reduction.

Image

In the picture above you can see the drop (when blocking=false) starting around 12:30, until I switched to blocking=true around 14:40. Since then, the software has remained stable; there is a small drop around 6:50, but that was me increasing the workers.

So far it's working OK. Of course, I need to perform more tests, but it seems more stable.

Unfortunately, I cannot give you the pps I had yesterday because my raw table has a 1-day TTL.

However this was today at 9 AM CEST:

```
Query id: d6bf00a3-5941-41ef-a6b4-06742205b656

    ┌─total_packets─┬───────time_interval─┬─packets_per_second─┐
 1. │     272427142 │ 2025-06-18 09:04:00 │  4540452.366666666 │
 2. │     454939854 │ 2025-06-18 09:03:00 │          7582330.9 │
 3. │     450186045 │ 2025-06-18 09:02:00 │         7503100.75 │
 4. │     381875687 │ 2025-06-18 09:01:00 │  6364594.783333333 │
 5. │     294265977 │ 2025-06-18 09:00:00 │         4904432.95 │
 6. │     253189326 │ 2025-06-18 08:59:00 │          4219822.1 │
 7. │     422295042 │ 2025-06-18 08:58:00 │          7038250.7 │
 8. │     428407763 │ 2025-06-18 08:57:00 │  7140129.383333334 │
 9. │     383163691 │ 2025-06-18 08:56:00 │  6386061.516666667 │
10. │     274902038 │ 2025-06-18 08:55:00 │  4581700.633333334 │
    └───────────────┴─────────────────────┴────────────────────┘

10 rows in set. Elapsed: 0.379 sec. Processed 62.12 million rows, 993.95 MB (163.71 million rows/s., 2.62 GB/s.)
Peak memory usage: 1.78 MiB.
```
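
For completeness, this is a hypothetical reconstruction of the query shape that produces output like the above (the table and column names here are assumptions, not my real schema). Note that packets_per_second is simply the per-minute sum divided by 60 (272427142 / 60 = 4540452.366...):

```shell
# Write the sketch query to a file; it would be run with:
#   clickhouse-client < /tmp/pps.sql
# "flows_raw" and the column names are illustrative assumptions.
cat > /tmp/pps.sql <<'EOF'
SELECT
    sum(packets) AS total_packets,
    toStartOfMinute(time_received_ns) AS time_interval,
    sum(packets) / 60 AS packets_per_second
FROM flows_raw
GROUP BY time_interval
ORDER BY time_interval DESC
LIMIT 10
EOF
cat /tmp/pps.sql
```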

From my point of view, there is something wrong with the memory handling under high traffic. This issue happened only on my EU server, where I have the most traffic; it is not happening on the Americas or Asia-Pacific servers, where the mapping.yaml remains the same as in my first post.

Ty!

luciano2683 avatar Jun 18 '25 13:06 luciano2683

Hello,

Just an update here:

Image

(Just highlighting when I increased the number of workers.) The software "survived" the EU business hours without any issue and with much better resource usage. Currently I'm averaging 64K flows/sec (IPFIX with my custom fields) with a peak of 96K flows/sec. Now, just for testing, I doubled the number of workers, pushing it to 48, so currently I have: count=24&workers=48

Additional custom configs:

```
vm.swappiness = 1
net.core.rmem_max = 33554432
net.core.rmem_default = 16777216
net.core.wmem_max = 8388608
net.core.wmem_default = 4194304
net.core.netdev_max_backlog = 50000
```

Realtime mode is much more stable and less resource-hungry, and it confirms my theory of memory handling issues when buffered mode is enabled and traffic is high. (I'm always talking about a single instance.)

I will monitor it for the next few days and will be back on Monday or Tuesday.

Ty!

Just more information for others, I have this VM (not Docker): Rocky Linux, 24 vCPU, 64 GB RAM, all-flash NVMe storage.

In ClickHouse I have more than 64 tables, all of them using AggregatingMergeTree, plus refreshable materialized views with APPEND at the end.

It works perfectly for now!

luciano2683 avatar Jun 19 '25 13:06 luciano2683

Image

Hello @lspgn, all. Finally, I can confirm that in my scenario realtime mode does not suffer from the issue that I showed in the initial posts. No sudden drops, no critical issues. My single instance is able to handle up to 203,340 flows/second (peak), 112,035 (avg) without any issues. Memory and CPU are also stable and under control, with expected peaks due to user activity or materialized views updating.

Image

Sorry to insist, but this experience tells me that the software (goflow2 at current version v2.2.3) has issues handling memory when buffered mode is enabled. The same configuration (mapping.yaml, flow.proto, etc.) does not show the same behavior when realtime mode is enabled.

Ty!!!

luciano2683 avatar Jun 24 '25 10:06 luciano2683