rust-syslog-rfc5424
SD parsing slower than expected
I'll start with a small complaint: when parsing SD-heavy log lines, I've noticed a significant slowdown. That was not totally unexpected, since we're basically using SD elements as a replacement for JSON, but a flamegraph shows we spend half of the time parsing the SD itself and the other half inserting it into the BTreeMap.

Some context:
- Average log line is 4kB
- Average SD element set size is 2.5kB: 5 to 6 SD elements, with the number of parameters ranging from 5 to roughly 60
- The above test is run on a single gzipped log file (~300MB) with 1M log entries
- It takes about 3.5 seconds to parse the input log file (line by line); we're using the very fast flate2 'zlib-ng-compat' decoder (see the sketch after this list).
- It takes 13 seconds for wc to count the lines (zcat test.log.gz | wc -l)
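For reference, the read loop looks roughly like this; a minimal sketch, assuming a hypothetical test.log.gz path and the crate's parse_message entry point (the 'zlib-ng-compat' backend is selected as a flate2 feature in Cargo.toml, not in code):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;

fn main() -> std::io::Result<()> {
    // Hypothetical input path; decompression is streamed through flate2.
    let file = File::open("test.log.gz")?;
    let reader = BufReader::new(GzDecoder::new(file));

    let mut parsed = 0usize;
    for line in reader.lines() {
        let line = line?;
        // Assumed entry point of this crate; count only lines that parse cleanly.
        if syslog_rfc5424::parse_message(&line).is_ok() {
            parsed += 1;
        }
    }
    println!("parsed {} messages", parsed);
    Ok(())
}
```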
So, I've tried to find some low-hanging fruit. Replacing BTreeMap with a plain HashMap makes it just a little bit slower, but we gain up to 10% by providing a custom hasher for the HashMap (such as FxHash).
Of course, key ordering can be important, so I've also tested IndexMap, with good results.
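For illustration, here is a minimal sketch of the IndexMap-plus-FxHash variant I benchmarked; the FxIndexMap alias and the collect_params helper are made up for the example, not part of the crate:

```rust
// Hypothetical sketch: an insertion-ordered map with a fast non-cryptographic
// hasher, as a drop-in replacement for the BTreeMap currently used for SD params.
use std::hash::BuildHasherDefault;

use indexmap::IndexMap;
use rustc_hash::FxHasher;

// Insertion-ordered map keyed with FxHash instead of SipHash.
type FxIndexMap<K, V> = IndexMap<K, V, BuildHasherDefault<FxHasher>>;

fn collect_params(pairs: Vec<(String, String)>) -> FxIndexMap<String, String> {
    let mut map = FxIndexMap::default();
    for (k, v) in pairs {
        map.insert(k, v);
    }
    map
}
```

Note that IndexMap preserves insertion order rather than sorted key order, so iteration order differs from BTreeMap.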
And now, the TL;DR part:
Would it be acceptable to replace [1] the BTreeMap with IndexMap, with a provision for the end user to override the hasher so we don't pull in too many dependencies? If the answer is yes, I'll write some tests and open a PR.
[1] Alternatively, instead of replacing it outright, maybe use a feature flag?
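To illustrate [1], something like the following could switch the SD map type behind a crate feature; the "indexmap-sd" feature name and the StructuredDataMap alias are hypothetical, only meant to show the idea:

```rust
// Hypothetical feature-flag sketch: pick the SD map type at compile time.
#[cfg(feature = "indexmap-sd")]
pub type StructuredDataMap<K, V> =
    indexmap::IndexMap<K, V, std::hash::BuildHasherDefault<rustc_hash::FxHasher>>;

#[cfg(not(feature = "indexmap-sd"))]
pub type StructuredDataMap<K, V> = std::collections::BTreeMap<K, V>;
```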
That's all for now :-)
This is a spectacularly well-produced and researched issue; thank you.
Let me look at it; I remember doing a good bit of microbenchmarking when I wrote this and finding BTreeMap to be the best choice (although I was accepting hostile data, so I can't use a non-cryptographic hash function).
In some initial testing on synthetic messages with 10 SD items, each with 10 params, IndexMap (even with FxHash) is about 15% slower than BTreeMap (~16,000 messages/s versus ~18,000 messages/s). This holds regardless of whether the keys are unsorted in the input, which was my original guess for what might make it faster.
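Roughly, the synthetic input has this shape; this is a sketch under the assumption that parse_message is the crate's top-level entry point, not the actual benchmark harness:

```rust
// Sketch of the synthetic benchmark shape: 10 SD elements with 10 params each.
// Timings will differ from the real harness; treat this as illustrative only.
use std::time::Instant;

fn synthetic_message() -> String {
    let mut sd = String::new();
    for e in 0..10 {
        sd.push_str(&format!("[elem{}@32473", e));
        for p in 0..10 {
            sd.push_str(&format!(" p{}=\"value-{}-{}\"", p, e, p));
        }
        sd.push(']');
    }
    format!("<165>1 2023-01-01T00:00:00Z host app 1234 MSGID {} hello", sd)
}

fn main() {
    let msg = synthetic_message();
    let n = 100_000;
    let start = Instant::now();
    for _ in 0..n {
        syslog_rfc5424::parse_message(&msg).unwrap();
    }
    println!("{:.0} messages/s", n as f64 / start.elapsed().as_secs_f64());
}
```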
Anyhow, #19 implements this; let me know what you think.
Hmm, I'd prefer that using HashMap or IndexMap didn't introduce such a penalty in the generic case :-(
I'm using the 'jemallocator' crate and 'lto = true'; the first provides a general speed boost (as noted before, we're spending some time in allocations), and the latter helps too, especially with IndexMap. However, please note that I use 'rustc_hash' for FxHasher; it has less code than the 'fxhash' crate, and with 'lto' it is faster. Maybe you can redo your tests with that? For reference: with a single thread and IndexMap, I'm able to parse 1M messages in 24 seconds (41.6k msg/sec) on a laptop with an i5-7440HQ, nothing too fancy.
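For reference, the allocator swap is just a #[global_allocator] declaration in the consuming binary ('lto = true' goes in that binary's Cargo.toml under [profile.release] and is not shown here):

```rust
// Route all heap allocations in the binary through jemalloc.
use jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
```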
The 'indexmap' branch helps a lot with testing, but I think we should pursue this even further; there must be more performance to be gained here. I've researched the following areas:
- I could swear that just by mechanically converting SD to JSON and using one of the available parsers we could improve 2x (only partially kidding; the numbers are unbelievable: https://github.com/serde-rs/json-benchmark).
- I've looked into another popular syslog parsing crate, syslog-loose; it is significantly faster, but it stores SD as tuples in a Vec, so it's not really useful if you want to access the keys (see the sketch after this list).
- Incidentally, syslog-loose uses 'nom'; its parsing macros (actually functions now) look very similar to your own code, but might be more optimized.
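For comparison, here is a rough sketch of that Vec-of-tuples representation (the types are made up, not the actual syslog-loose API): building it avoids map insertions entirely, but keyed access becomes a linear scan:

```rust
// Rough sketch of a Vec-backed SD element (not the real syslog-loose types).
pub struct SdElement<'a> {
    pub id: &'a str,
    pub params: Vec<(&'a str, &'a str)>,
}

impl<'a> SdElement<'a> {
    // O(n) lookup by key, versus O(log n) for BTreeMap or ~O(1) for a hash map.
    pub fn get(&self, key: &str) -> Option<&'a str> {
        self.params.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }
}
```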
P.S. Please forgive me if the above is too much. SD is basically my justification for RFC 5424, and it is important to make it as efficient as possible; otherwise we'd just switch to JSON.
I'm mildly curious what your use case is where you would accept the DoS risk of using an insecure hash map, but you need more than 40k messages/thread/s? Like, if you're accepting anything from untrusted sources, it seems way too risky to consider rustc_hash (or even FxHasher at all).
FWIW, the attached branch does about 80k messages/s on an Apple M1 (arm64). Just gotta get yourself a fleet of Mac Minis. :-)
I'm using nom for some internal parsing projects and might play with it for this repo, but I haven't found nom to be very fast compared to manual recursive descent; it's just easier and cleaner to write and debug.
Thanks for keeping this open. I've returned to this issue again and tried some other ideas, but no improvements yet.
BTW, I've just found out that the SVG files produced by the Rust flamegraph tool are clickable/browsable, so you can properly see all the function names (but you need the original SVG and you must open it in a browser).