logstash-codec-netflow
logstash-codec-netflow copied to clipboard
Ingesting IPFIX Flows From YAF Oddly Slow
I've been benchmarking the netflow codec for a tool that stitches IPFIX/ netflow data together into bi-directional sessions. The codec seems to run well when ingesting data from SonicWall and Cisco devices. However, the codec runs oddly slow when ingesting flows produced by YAF.
Test Machine Specifications:
- DigitalOcean CPU Optimized Instance
- 32GB of RAM
- 16 vCPUs
- 16 udp workers
- 64MB UDP buffer (rmem_max)
- LS_JAVA_OPTS=
-Xms8g -Xmx8g
- Null logstash output
- logstash-codec-netflow version
4.1.0
- logstash version
6.2.4
- XPack Monitoring
Known benchmarks:
- netflow_bench_cisco_asr.py: 18200 flows/ second
- netflow_bench_cisco_asa.py: 11500 flows/ second
First, I ran yaf with the following command: yaf --uniflow --in test.pcap --out my.logstash.ip --ipfix-port 2055 --ipfix udp
and saw that I was only processing 3000 flows/ second.
I did some packet captures and saw that about one fifth of the data was templates. In order to make a better comparison to known benchmarks, I created a similar script (had to change file extension to txt).
With the script, the codec manages 4000 flows/ second.
Running yaf with --silk
doesn't seem to dramatically change the performance characteristics. I didn't create a test script for silk style flows, but when running the tool, adding --silk
resulted in 200 less flows/ second on average.
I know it is not IPFIX specific since I created a similar test script (can be provided) for flows based on SonicWall's IPFIX implementation. The SonicWall IPFIX benchmark managed 17500 flows/ second.
I know it is not template-processing related since the benchmark only sends the template once.
The YAF flows encode timestamps as flowStartMilliseconds which need a simple conversion to iso8601, while the sonicwall flows only include systemInit relative timestamps, so they don't need a conversion. I thought this may be the cause, but netflow v9 also requires the timestamp conversion on the cisco data sets. The cisco data sets perform fine.
The YAF flows do have a subTemplateList
but that should be efficiently handled since the field is labeled :skip
in the definitions file.
I thought the data size of each flow may affect the speed at which the flows are processed. However, the YAF benchmark bundles 25 flows per packet with an overall packet size of 1415 bytes, while the cisco asr benchmark bundles 21 flows per packet with an overall packet size of 1392 bytes. The cisco asr actually sends larger flows (~66 bytes) to logstash than YAF does (~56 bytes). Yet, YAF is processed much slower.
The YAF flows do include an enterprise field, but the field flowAttributes
is well accounted for in the definitions file.
I've looked through the code and I can't work out what is causing the slow down. However, I'm not too good at ruby to be honest.
The script also includes the wireshark descriptions of the template and data packets.
After more benchmarking, it seems as if processing YAF flows doesn't scale vertically.
The following was captured using Digital Ocean CPU-Optimized Droplets while running MongoDB, Logstash, Elasticsearch, and Kibana. Elasticsearch and Kibana were needed in order to collect the data.
The system was benchmarked with the scripts here as well as with two yaf commands run directly on a sample pcap file.
The system caps out at around 13000 flows/ second due to the MongoDB output plugin and database in the pipeline.
codec version | RAM | -xmx -xms GB | udp buffer MB | vCPU | Workers | CPU Optimized | Mongo | YAF flows /sec | sonicWALL IPFIX flows/ sec | sonicWALL v9 flows/ sec | Cisco ASA flows/ sec | Cisco ASR flows/ sec | yaf sample.pcap --no-tombstone --no-stats --silk flows/ sec | yaf sample.pcap --no-tombstone --no-stats flows/ sec |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4.1.0 | 32 | 8 | 64 | 16 | 16 | 1 | 1 | 4000 | 13250 | 11500 | 11500 | 13500 | 2850 | 3000 |
4.1.0 | 16 | 6 | 64 | 8 | 8 | 1 | 1 | 3600 | 10000 | 6750 | 11100 | 11750 | 2500 | 2800 |
4.1.0 | 16 | 6 | 64 | 6* | 6 | 1 | 1 | 3500 | 8700 | 5600 | 10000 | 12900 | 2500 | 2650 |
4.1.0 | 16 | 6 | 64 | 4* | 4 | 1 | 1 | 3100 | 6200 | 3750 | 6800 | 9200 | 2100 | 2350 |
4.1.0 | 16 | 6 | 64 | 2* | 2 | 1 | 1 | 2350 | 3200 | 1900 | 3600 | 4750 | 1550 | 1650 |
* CPU limited via docker cpu scheduler on 8 core machine
EDIT: Added more details for each benchmark
YAF flows /sec | sonicWALL IPFIX flows/ sec | sonicWALL v9 flows/ sec | Cisco ASA flows/ sec | Cisco ASR flows/ sec | |
---|---|---|---|---|---|
Flows/ packet | 25 | 5 | 5 | 14 | 21 |
Packet Size | 1415 | 285 | 289 | 1452 | 1392 |
Rough Avg Flow Size | 56.6 | 57 | 57.8 | 103.7142857 | 66.28571429 |
The YAF flows use flowStartMilliseconds (152) and flowEndMilliseconds (153) fields (the Cisco flows don't use them) so take a slightly other path in the code: https://github.com/logstash-plugins/logstash-codec-netflow/blob/f131bcf0a2202b7ddba22897dfa52dc014bdf0d8/lib/logstash/codecs/netflow.rb#L382-L397
Thanks for looking at this!
A very similar code path is taken for the Cisco v9 however. https://github.com/logstash-plugins/logstash-codec-netflow/blob/f131bcf0a2202b7ddba22897dfa52dc014bdf0d8/lib/logstash/codecs/netflow.rb#L297-L302
In the YAF IPFIX case, we have a division and a timestamp conversion (line 388). In the netflow case, we have more arithmetic operations and a timestamp conversion. Nonetheless, the netflow data scales well.
I've created a new benchmark script that compares the two. https://github.com/logstash-plugins/logstash-codec-netflow/blob/master/spec/codecs/benchmarks/flowStartMilliseconds.rb
From that benchmark it seems v.snapshot is solely responsible for the difference in performance.
Other differences like .to_f
, or the regex matches on the field names don't seem to contribute to the difference.
It runs each above (simplified) snippet 2 million times,
implementation | result |
---|---|
IPFIX timestamp snippet with v.snapshot |
4.7 sec |
IPFIX timestamp snippet without v.snapshot |
2.4 sec |
Netflow9 timestamp snippet original | 2.7 sec |
Giving it some more thought, the gains are only 2 seconds for 1 million flows. Like you already suggested this can't be causing of the performance difference.
Circling back to this issue... perhaps the performance discrepancy can be explained by the huge number of templates YAF sends? Every time a tpl comes in, we use a mutex to synchronise access to it. Which also applies to data packets because they need access to the template. This makes it a non-parallellizable operation, perhaps explaining your flat scaling curve.
I don't think the number of templates can be the issue. When we run the benchmark script, we see similarly poor performance as we do when running YAF outright. Yet, the benchmark script only sends the template once.
Hmm yes good point!