
Packets dropped during rotation

Open TheMysteriousX opened this issue 8 months ago • 8 comments

This is a brand new setup, so it's entirely possible I'm doing something wrong. Currently running 1.7.3 as this is the version packaged for RHEL.

When nfcapd is rotating files, I'm seeing that some traffic is dropped until the rotation finishes - around 10 seconds - visible in /proc/net/udp.

This is a somewhat scaled environment, with the host receiving about 0.5 trillion flows per day and 300-400 gigabytes of capture files per hour (the majority, if not all, IPFIX).

  • currently running 6 nfcapd processes - can add more but when files are not being rotated the setup drops nothing
  • backing disk is a compressed ramdisk (zram) so compression is offloaded
  • rotation interval reduced to 1 minute, which did help
  • udp buffer set to 256mb per process
  • NIC configured for high throughput
  • 'top' shows that the processes are more IO/kernel bound than anything else, actual cpu usage is around 10-20%

Are drops during file rotation expected while under heavy load, or is there some bottleneck that I've overlooked?

TheMysteriousX avatar Jun 10 '25 16:06 TheMysteriousX

Hmm .. I would need to check the code for 1.7.3, but speaking for 1.7.6 and later, file rotation should not add much delay, as flushing and closing on a fast FS happens almost instantaneously. If you could try it with 1.7.6, I'd be happy to help if the issue still exists.

phaag avatar Jun 11 '25 09:06 phaag

Thanks - looks like Red Hat have actually gone and updated their spec file to 1.7.6 in the last few days so shouldn't be too much trouble to build it myself.

TheMysteriousX avatar Jun 11 '25 18:06 TheMysteriousX

Flow collection and compressing/storing data are done in different threads. They need to synchronise during the rotation cycle - meaning the collector thread gets suspended until the queue of data records is fully flushed to disk, so the new file can be opened. This can be sped up by using more worker threads (-W) in 1.7.6. However, it should be balanced against the I/O system: more than 3 to 4 workers, depending on the compression method, won't improve throughput. If most of the time is spent in I/O, more workers won't significantly affect throughput.

Furthermore, I assume a compressed ramdisk (zram) is very fast - I have no experience with it. However, if the CPU is underutilised, you may play with the collector compression: -z=lzo is very fast, while -z=lz4 reduces the size better. Either way, you have less data to push through the I/O system. You may try -z=lzo -W4.

I see options to outsource the rotation cycle (flushing the data queue and rotating the file) into a separate thread, which would unblock data collection and let records keep flowing into the memory queue during the rotation cycle. This would need some architectural changes.

There is always something to improve, but I am happy to see that nfcapd still does a decent job, even under heavy load :)

Feel free to share other ideas here or by email - See the AUTHORS file.

phaag avatar Jun 11 '25 19:06 phaag

need to synchronise during the rotation cycle - meaning the collector thread gets suspended until the queue of data records is fully flushed

Ah - this could be the cause. zram is very fast, but it's constrained by a fixed pool of write threads in the kernel. We're rotating at 1-minute intervals, but that still creates files of around 1000 MB at peak times. If the collector thread is suspended and can't grab a write thread immediately, our larger exporters will fill the buffer within a few seconds.

I've patched the code to remove the calls to fsync - all the files on the ramdisk are lost if the host loses power anyway, so data integrity isn't an issue for us. I'll check at peak time tomorrow and see whether there are any drops.

I am happy to see that nfcapd still does a decent job, even under heavy load :)

I've been impressed by how few processes I've needed to cope with the load. We have an old SiLK setup (it predates my involvement, so I've no idea whether it's set up correctly), but it's not able to process anywhere near the volume we've got running through nfcapd/nfdump.

Thanks for the info on how the thread synchronisation works, you've probably saved me a good few hours trying to remember how to use GDB.

Feel free to share other ideas here or by email

If you want ideas... :)

  • The biggest challenge with our ingestion is that the exporters are bursty. It wasn't too awful on a gigabit network, as that provided some rate limiting, but after migrating our netflow processing onto 10 gigabit+, I've seen some of the exporters hit the nfcapd host with microbursts of up to 5 Gb/s - needing huge UDP buffers in spite of the overall CPU load being pretty manageable. I don't think there's anything to do here, other than to run with a comically large recv buffer that can absorb the bursts.

  • One of the pain points with deployment is running multiple processes. It's not awful on the nfcapd side, but we have to communicate to the team running the exporters which port to use on which device, and occasionally ask them to change configuration to balance load. We fixed it in Samplicator by crudely patching it to use SO_REUSEPORT. For UDP traffic, the kernel hashes the source/destination address and port to pick a receiving socket - the source ports on our exporters are stable, so this gave us almost perfect load balancing across processes. We then just started as many Samplicator processes as needed, all on the same port, so every single exporter is configured to point at the one address and port. We could do the same thing with nfcapd if it had SO_REUSEPORT enabled. The option is specific to Linux, though, so it wouldn't be a portable change (there is SO_REUSEPORT_LB for *BSD, but I don't know how it compares to Linux).

  • In the longer term, we're looking to export the data into a "big data" store for additional analysis. The current plan is to use a script published by another group (https://codeberg.org/poorting/nfdump2clickhouse), which uses nfdump to export to CSV, then converts the CSV to Parquet to ingest into the database. It might be possible to do this natively with the json/nljson output plugin - I haven't looked in detail yet - but adding an export type for integration with such databases would definitely see some use. Having had a quick look, though, nfdump's export plugins are very complex, so it would be a very substantial bit of work. Parquet is well supported, though there is no native C library - just a GLib binding. Kafka fits the same role too, though it would export remotely rather than to file, so it's a very different model.

  • nfcapd does not cope well with the disk filling up - not an issue now that I've tuned how much flow data we store, but during testing, when the disk would occasionally fill up, nfcapd wouldn't start writing again until it was manually restarted.

  • It doesn't appear that nfexpire and dynamic sources work together - nfexpire didn't remove anything unless run explicitly in each source directory, and nfcapd's auto-expire didn't appear to work in any configuration with dynamic sources. Adding a script to implement expiry wasn't too much of a bother, though.

  • nfcapd 1.7.6 doesn't appear to reap processes invoked with -x on completion. They do get cleaned up on the next rotation, but there's always 1 <defunct> process per source. It doesn't seem to cause any issues - it's just cosmetic.

TheMysteriousX avatar Jun 12 '25 21:06 TheMysteriousX

Feel free to share other ideas here or by email

If you want ideas... :)

  • The biggest challenge with our ingestion is that the exporters are bursty. It wasn't too awful on a gigabit network, as that provided some rate limiting, but after migrating our netflow processing onto 10 gigabit+, I've seen some of the exporters hit the nfcapd host with microbursts of up to 5 Gb/s - needing huge UDP buffers in spite of the overall CPU load being pretty manageable. I don't think there's anything to do here, other than to run with a comically large recv buffer that can absorb the bursts.

Large bursts can probably only be handled by large buffers. The buffers should be placed where they are most efficient - that was my thinking for the UDP socket buffer: to buffer packets while rotating files. Usually this fits well. In the end, it's a balance between CPU, parallelism and I/O throughput, and these parameters are different on every system. If the socket buffer fills up too fast and the I/O cannot cope with storing away the data, you will run into problems. Buffers should be able to balance the average UDP stream against the average I/O throughput; if a sustained UDP stream outperforms the backing I/O, buffers don't help. Anyway - I will try to optimise the file-rotation cycle in order to prevent data loss.

  • One of the pain points with deployment is running multiple processes. It's not awful on the nfcapd side, but we have to communicate to the team running the exporters which port to use on which device, and occasionally ask them to change configuration to balance load. We fixed it in Samplicator by crudely patching it to use SO_REUSEPORT. For UDP traffic, the kernel hashes the source/destination address and port to pick a receiving socket - the source ports on our exporters are stable, so this gave us almost perfect load balancing across processes. We then just started as many Samplicator processes as needed, all on the same port, so every single exporter is configured to point at the one address and port. We could do the same thing with nfcapd if it had SO_REUSEPORT enabled. The option is specific to Linux, though, so it wouldn't be a portable change (there is SO_REUSEPORT_LB for *BSD, but I don't know how it compares to Linux).

If the current fix with Samplicator does the job, I would recommend leaving it that way. The advantage of having multiple collectors is also to get more CPU assigned by the scheduler - at least from what I know. Running several collectors in parallel is the simpler solution. However, just for the sake of interest, I like the idea of hashing traffic and assigning it to different collectors. If you could share that code, I'd be interested to study it - maybe I'll get some ideas to optimise the collector.

  • In the longer term, we're looking to export the data into a "big data" store for additional analysis. The current plan is to use a script published by another group (https://codeberg.org/poorting/nfdump2clickhouse), which uses nfdump to export to CSV, then converts the CSV to Parquet to ingest into the database. It might be possible to do this natively with the json/nljson output plugin - I haven't looked in detail yet - but adding an export type for integration with such databases would definitely see some use. Having had a quick look, though, nfdump's export plugins are very complex, so it would be a very substantial bit of work. Parquet is well supported, though there is no native C library - just a GLib binding. Kafka fits the same role too, though it would export remotely rather than to file, so it's a very different model.

I am well aware of the big data issue. I had a quick look myself at which backend would fit netflow data best, but did not yet come to a conclusion. I also had ClickHouse on the shortlist for a direct exporter, but have not (yet) had enough time to dig deeper. I would love to export/import the data in a binary way, as I am not a big fan of converting everything to ASCII and re-importing it. Yes, it's the most compatible approach, but not the most efficient ... I will re-schedule that task on my todo list. Maybe we could exchange some thoughts on that topic as well.

  • nfcapd does not cope well with the disk filling up - not an issue now that I've tuned how much flow data we store, but during testing, when the disk would occasionally fill up, nfcapd wouldn't start writing again until it was manually restarted.

True - I agree. I will try to fix it so that nfcapd is more tolerant of that situation.

  • It doesn't appear that nfexpire and dynamic sources work together - nfexpire didn't remove anything unless run explicitly in each source directory, and nfcapd's auto-expire didn't appear to work in any configuration with dynamic sources. Adding a script to implement expiry wasn't too much of a bother, though.

Ahh - did not know that! I need to check.

  • nfcapd 1.7.6 doesn't appear to reap processes invoked with -x on completion. They do get cleaned up on the next rotation, but there's always 1 <defunct> process per source. It doesn't seem to cause any issues - it's just cosmetic.

Thanks! I will fix that.

phaag avatar Jun 13 '25 09:06 phaag

Disabling fsync has made a difference - packet loss during rotation is now negligible. Perhaps it'd work as a command line option, with the caveat that enabling it can cost you data on a crash or power loss.

If you could share back that code, I'd be interested to study it

The patch I put together for Samplicator is at the end of this message. It's not production quality, as it doesn't consider portability, and in most circumstances it should be gated behind a command line switch: once the option is enabled, any process launched by the same uid can bind to the port and start receiving a share of the traffic. It needs to be used together with SO_REUSEADDR, otherwise additional processes will still fail to bind due to the address reuse checks. The preprocessor symbol is defined in socket.h, so I was able to avoid any autoconf changes too.

The hash-based load balancing for UDP is free - the kernel takes care of it (for TCP, incoming connections are distributed across the listening sockets instead). The best docs for the setting are in the socket man page: https://man7.org/linux/man-pages/man7/socket.7.html

I would love to export/import the data in a binary way, as I am not a big fan to convert everything into ascii and re-import it.

Definitely - nfdump is pretty fast and efficient at exporting to CSV, but shuffling the data around in text form afterwards is a bottleneck.

Maybe we could exchange some thoughts also on that topic.

Sure, I don't claim to be an expert on the subject but I'm happy to share the bits I know and test things out.

Thanks again for your help with pinning down the source of the loss.


From 249d95d831b027c8e4a84ac94615fff3616d3aae Mon Sep 17 00:00:00 2001
From: AdamB <[email protected]>
Date: Tue, 29 Apr 2025 12:06:54 +0000
Subject: [PATCH] Allow samplicate to share its port

---
 samplicate.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/samplicate.c b/samplicate.c
index 409ce97..3e2fb9a 100644
--- a/samplicate.c
+++ b/samplicate.c
@@ -184,6 +184,22 @@ make_recv_socket (ctx)
 	  fprintf (stderr, "socket(): %s\n", strerror (errno));
 	  break;
 	}
+#ifdef SO_REUSEPORT
+    int on = 1;
+
+      if (setsockopt(ctx->fsockfd, SOL_SOCKET, SO_REUSEADDR,
+              &on, sizeof on) < 0)
+    {
+      fprintf(stderr, "Warning: setsockopt(SO_REUSEADDR) failed: %s\n",
+          strerror (errno));
+    } 
+      if (setsockopt(ctx->fsockfd, SOL_SOCKET, SO_REUSEPORT,
+              &on, sizeof on) < 0)
+    {
+      fprintf(stderr, "Warning: setsockopt(SO_REUSEPORT) failed: %s\n",
+          strerror (errno));
+    }
+#endif
       if (setsockopt (ctx->fsockfd, SOL_SOCKET, SO_RCVBUF,
 		      (char *) &ctx->sockbuflen, sizeof ctx->sockbuflen) == -1)
 	{
-- 
2.49.0

TheMysteriousX avatar Jun 16 '25 13:06 TheMysteriousX

Thanks! Just one question - which fsync did you remove? All of them, or a specific one?

phaag avatar Jun 17 '25 13:06 phaag

It was the two in src/libnffile/nffile.c - I didn't do any testing to see whether removing just one was sufficient, so there may be a better minimal change.

TheMysteriousX avatar Jun 18 '25 19:06 TheMysteriousX