
IPFIX Templates flush from memory quickly

Open bmarlow opened this issue 1 year ago • 4 comments

Describe the bug When running the collector, it constantly emits the following error messages for the first few minutes: WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

WARN[0954] template error blocking=false count=1 error="receiver: message from [::ffff:10.89.1.16]:38705 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

This is what I would expect, since at startup it hasn't received the templates yet. After a few minutes, once the template packets are sent, the error messages stop, everything decodes fine, and the expected data gets sent off to Kafka.

After about a minute, all of a sudden I start getting the same errors again. About 5 minutes later (the template interval), the whole process repeats itself.

To Reproduce Steps to reproduce the behavior:

  1. I run goflow2 as such: podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared -v ./templates/:/templates/:z localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -err.cnt 10000 -err.int "1s" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"
  2. Wait for templates to get sent
  3. Wait for goflow2 to mysteriously lose the templates

Expected behavior Once the templates are learned they shouldn't be forgotten.

Captures Attached is a Wireshark capture of the behavior. It is worth noting that Wireshark decodes properly: once it has seen the template packets, it applies the template to all of the captured packets.

Sampler device:

  • Linux Host
  • Red Hat Core OS
  • Version: 9.2
  • Estimated flow traffic: very low, maybe 100/second

GoFlow2:

  • Version: the output of ./goflow2 -v is just GoFlow2; however, the container was built from the v2.1.3 code
  • Environment: Running under Podman
  • OS: RHEL

Additional context Not sure what else to say other than after acquiring the template it randomly loses it. The IPFIX source reliably sends the template packets every 5 minutes, then shortly thereafter goflow2 loses it. Screenshot from 2024-06-07 18-37-46

bmarlow avatar Jun 08 '24 01:06 bmarlow

Hi @bmarlow,

Thank you for raising this issue.

A few things:

  • Can you confirm neither the sampler nor GoFlow2 restart?
  • Are you running multiple GoFlow2 on the same port?
  • If you reduce the interval for templates, does the problem disappear?
  • Are the domain ID, template ID and router (host+port) remaining the same on the error messages?
  • Could you test using the latest main?
  • In Wireshark, do the affected packets always reference the same template?

lspgn avatar Jun 08 '24 04:06 lspgn

Hi @lspgn Thanks for getting back to me so quickly!

First, a point of correction. I previously stated the templates were being sent every 5 minutes, that is not correct, they are being sent every 10 minutes.

Now, in response to your questions:

  • I've had goflow2 running for over 15 hours at this point (without crashing), still exhibiting the same behavior (this command run from within the container running goflow2):
~ $ ps -o pid -o etime -o comm -o args
PID   ELAPSED COMMAND          COMMAND
    1 15h14   goflow2          ./goflow2 -transport kafka -transport.kafka.topic ipfix-topic -transport.kafka.brokers 10.0.101.33:9092 -err.cnt 10000 -err.int 1s -format json -transport.kafka.flushbytes 10000 -templ
   10 15h14   sh               /bin/sh
  • I only have one instance of goflow2 running (it is running in podman)
[bmarlow@podman-host ~]$ podman ps
CONTAINER ID  IMAGE                         COMMAND               CREATED       STATUS       PORTS                                           NAMES
c93109b6ed82  docker.io/apache/kafka:3.7.0  /etc/kafka/docker...  16 hours ago  Up 16 hours  0.0.0.0:9092->9092/tcp                          kafka
6135e40e0cca  localhost/goflow2:latest      -transport kafka ...  15 hours ago  Up 15 hours  0.0.0.0:8080->8080/tcp, 0.0.0.0:2055->2055/udp  goflow2
  • Unfortunately the system sending the flow data isn't configurable with regard to the template interval. That being said, the templates are sent every 10 minutes, as the RFC suggests (RFC 5153, section 6.2)

  • The domain ID and the source IP/port are always the same; the template IDs are 280, 281, 284, and 285 (there may be a couple of others, but 280 and 281 account for 99% of the errors)

  • This container was built off the current main branch.

[bmarlow@podman-host goflow2]$ git rev-parse --verify HEAD
7265159473ecb80dae984d375682a132f963f270
[bmarlow@podman-host goflow2]$ git rev-parse --short HEAD
7265159

It was built using the following command from the top level inside the goflow2 directory

[bmarlow@podman-host goflow2]$ buildah bud -t goflow:latest .

And was run with this command:

[bmarlow@podman-host goflow2]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared -v ./templates/:/templates/:z localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "10.0.101.33:9092" -err.cnt 10000 -err.int "1s" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"

Here is a zipped packet capture of all the netflow data sent to the goflow2 collector over the span of about an hour. The most commonly used templates are 280 and 281 (which aligns with the error messages we see). You can also see that the template packets are sent like clockwork, every 10 minutes starting at packet 11929, and the template data for 280 and 281 (and others) is sent in packet 11939 (and then every 10 minutes thereafter). ipfix.zip

bmarlow avatar Jun 08 '24 14:06 bmarlow

Really strange. Thank you for providing the packet capture.

I dived into the code:

https://github.com/netsampler/goflow2/blob/7265159473ecb80dae984d375682a132f963f270/decoders/netflow/templates.go#L60C32-L60C46

^ this function is not called, so I'm leaning towards a possible race condition

One small thing:

-templates.path "/templates"

This is an artifact of a previous CLI argument; it does not do anything.

It could be an issue with the "error aggregator" (controlled with err.int and err.cnt). Do you see missing samples in Kafka/database? Since there's one flow per set per packet, can you check with the Sequence Number?
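That sequence-number check can be scripted against the JSON output; a rough sketch (the `sequence_num` and `sampler_address` field names match the JSON samples later in this thread, but the one-increment-per-message assumption only holds when each packet carries a single flow set, as noted above):

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type flowMsg struct {
	SamplerAddress string `json:"sampler_address"`
	SequenceNum    uint64 `json:"sequence_num"`
}

// gapCount returns how many sequence numbers were skipped between
// consecutive messages from the same sampler.
func gapCount(lines []string) (gaps uint64) {
	last := map[string]uint64{}
	for _, l := range lines {
		var m flowMsg
		if err := json.Unmarshal([]byte(l), &m); err != nil {
			continue // not a flow message
		}
		if prev, ok := last[m.SamplerAddress]; ok && m.SequenceNum > prev+1 {
			gaps += m.SequenceNum - prev - 1
		}
		last[m.SamplerAddress] = m.SequenceNum
	}
	return gaps
}

func main() {
	// Pipe goflow2's JSON output into stdin: goflow2 -format json | ./seqcheck
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1<<20), 1<<20) // flow records can be long lines
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	fmt.Println("skipped sequence numbers:", gapCount(lines))
}
```

A non-zero gap count during the error windows would confirm that samples are being dropped rather than merely unlogged.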

Could you try without error logging?

goflow2 -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "x.x.x.x:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"

Could you try without Kafka?

goflow2 -err.cnt 10000 -err.int "1s" -format "json" > /dev/null

lspgn avatar Jun 08 '24 20:06 lspgn

Once again, thank you for the speedy response, it is greatly appreciated.

Ahh, yes, apologies the template flag was an artifact of my flailing around trying to understand what was going on.

I had previously tried this without the err.cnt and err.int flags, but was constantly getting muted messages in the logs, and it was not clear to me whether the muting could cause it to miss template packets, or whether it was just log suppression.

RE: the DeleteTemplate function: good to know. I saw that function but wasn't sure if it was being called from somewhere I couldn't find (admittedly I'm not much of a Go person).

Test 1

[bmarlow@podman-host goflow2]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp --network shared localhost/goflow2:latest -transport "kafka" -transport.kafka.topic ipfix-topic -transport.kafka.brokers "10.0.101.33:9092" -format "json" -transport.kafka.flushbytes 10000 -templates.path "/templates" -transport.kafka.version "3.7.0"

And the behavior was the same. I verified that my kafka consumer was properly receiving messages by using another machine to connect to Kafka as a producer and send a message on that topic using the standard Kafka tools:

[bmarlow@aworkstation bin]$ ./kafka-console-producer.sh --bootstrap-server 10.0.101.33:9092 --topic ipfix-topic
>test
[bmarlow@bmarlow-anotherworkstation]$ python connector.py 
Streams initiated...
test

I have also attached a pcap during this time from the host running the goflow2 container. This PCAP captures both the IPFIX as well as the Kafka traffic. I'm not sure if the program is set to batch messages to kafka, but I see the following behavior:

  1. Template packets received
  2. Several messages sent to Kafka
  3. Suddenly, within about a minute or so, messages to Kafka stop (hard to correlate this with exact log messages since the default log output suppresses repeats)
  4. Goflow2 is still receiving lots of IPFIX, but not sending anything to Kafka (at this point I would expect it to be sending about 1 message a second to Kafka, unless there is batching that I'm unaware of)
  5. Repeat 1-4 every 10 minutes, as template packets are received.

One interesting difference I see (maybe just because the logs aren't scrolling as fast) is that I now see a template error every once in a while in the logs:

WARN[1292] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[1293] too many receiver messages, muting           
WARN[1302] skipped 191 receiver messages                
WARN[1302] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.89.1.18]:47961 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

Looking at the PCAP:

  • I see a handshake and keep-alives with Kafka from the beginning
  • The first templates sent at packet #8963
  • Then actual data packets get sent to Kafka for a bit
  • Then at packet #10279 I see a significantly smaller data packet sent to Kafka: 3698 bytes instead of the 10000+ bytes of all the previous ones (maybe the buffer storing data to be forwarded has drained?)
  • IPFIX packets continue to be received
  • Kafka reverts to just sending keep-alives
  • The process starts over again: templates sent starting at #21997, last Kafka message at #23807, then it reverts to just keep-alives
  • Process repeats

ipfix_and_kafka.zip

Test 2

I started up goflow2 using the following command:

[bmarlow@podman-host]$ podman run --name goflow2 -dt -p 8080:8080/tcp -p 2055:2055/udp localhost/goflow2:latest -err.cnt 10000 -err.int "1s" -format "json" > /dev/null

I immediately started getting errors (as I would expect since I hadn't received the template yet):


WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0136] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0137] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:37907 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

After a while I received the template packets and things started properly recording:

{"type":"IPFIX","time_received_ns":1717953981897508819,"sequence_num":5383103,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953981899000000,"time_flow_end_ns":1717953981899000000,"bytes":49200,"packets":400,"src_addr":"10.99.0.51","dst_addr":"10.128.1.140","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":56196,"in_if":7,"out_if":0,"src_mac":"aa:ae:30:ca:10:32","dst_mac":"0a:58:0a:80:01:8c","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}
{"type":"IPFIX","time_received_ns":1717953982062822738,"sequence_num":5383104,"sampling_rate":0,"sampler_address":"10.0.2.100","time_flow_start_ns":1717953982064000000,"time_flow_end_ns":1717953982064000000,"bytes":33200,"packets":400,"src_addr":"10.128.0.52","dst_addr":"10.128.0.77","etype":"IPv4","proto":"TCP","src_port":6443,"dst_port":49144,"in_if":472,"out_if":0,"src_mac":"0a:58:0a:80:00:34","dst_mac":"0a:58:0a:80:00:4d","src_vlan":0,"dst_vlan":0,"vlan_id":0,"ip_tos":0,"forwarding_status":0,"ip_ttl":0,"ip_flags":0,"tcp_flags":0,"icmp_type":0,"icmp_code":0,"ipv6_flow_label":0,"fragment_id":0,"fragment_offset":0,"src_as":0,"dst_as":0,"next_hop":"","next_hop_as":0,"src_net":"0.0.0.0/0","dst_net":"0.0.0.0/0","bgp_next_hop":"","bgp_communities":[],"as_path":[],"mpls_ttl":[],"mpls_label":[],"mpls_ip":[],"observation_domain_id":0,"observation_point_id":0}

Then, less than a minute later, I start getting the error messages again and no IPFIX data is logged (just errors):

WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:281] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2
WARN[0325] template error                                blocking=false count=1 error="receiver: message from [::ffff:10.0.2.100]:60372 IPFIX [version:10 type:Decode obsDomainId:0: templateId:280] Error template not found" hostname= port=2055 queue_size=1000000 scheme=netflow workers=2

Then the pattern repeats:

  • Get template packets
  • Flow data prints to the screen correctly for less than a minute
  • Logging starts printing error messages again
  • Previous steps repeat

Notes

  • Given that we are seeing the same behavior for both transport mechanisms (stdout and kafka) this seems to point to something central in how goflow2 is storing/managing the templates after they are received.
  • The goflow2 process never crashed or restarted in either of these processes
    • This can be observed in the first test: there is only one continuous TCP session to Kafka, which starts after the first template packets are received and stays alive through a couple of cycles of the pattern we are observing here
    • This can be observed in the second test: after we successfully receive the template packets, once the errors resume, the error counters continue to increment to higher numbers than before the successful messages (the code restarts the error numbering when the program restarts)

bmarlow avatar Jun 09 '24 17:06 bmarlow

@bmarlow my apologies for the delay this time :(

Thank you for all the detailed tests. I'll have a better look soon and try to reproduce and replay the pcap.

Given that we are seeing the same behavior for both transport mechanisms (stdout and kafka) this seems to point to something central in how goflow2 is storing/managing the templates after they are received.

AFAIK there is no expiration. Maybe a packet is corrupting the template.

Did you also collect the pcap for Test 2?

There's one more test that could help:

  • Run two GoFlow2 instances on the same stream (e.g. UDP duplication), but one with -produce=raw. This could potentially help isolate a bad packet.

lspgn avatar Jul 16 '24 06:07 lspgn

One thing I just noticed: in the logs, the source port is different (60372 vs 37907). I couldn't reproduce with the pcap.

The code that computes the key to find the template set uses IP+Port:

https://github.com/netsampler/goflow2/blob/f363efffcd870f132ce157e72ec758f06b03739a/utils/pipe.go#L151

https://github.com/netsampler/goflow2/blob/f363efffcd870f132ce157e72ec758f06b03739a/utils/udp.go#L29

If you have a way to test manually:

key := pkt.Src.Addr().String()

should make the key ignore the port.

Let me know if that works for you.

lspgn avatar Jul 17 '24 06:07 lspgn

@bmarlow let me know if you're still experiencing the issue :)

lspgn avatar Jul 31 '24 04:07 lspgn

@lspgn

Sorry for the delay in response, I've been wrapped up in other projects.

Unfortunately I haven't been able to recreate this.

I've tried replaying those packet captures, even against previous branches, and I can't recreate it.

I've since set up another cluster and had it send flow data against previous configs that were failing and it seems to work just fine now as well.

Possibly it was the previous OpenShift cluster version sending a bad packet, as previously hypothesized.

I'm confused as to why I can't recreate it, but I guess I'm glad it's working correctly now.

I'll re-open with appropriate data if it re-occurs. I appreciate the time spent on this.

bmarlow avatar Jul 31 '24 16:07 bmarlow