hilti
hilti copied to clipboard
Use __hlt_bytes_find_bytes to translate '&until' spicy attribute ?
Hello !
I am currently testing CPU performance of parsers and i think found an improvement for &unti
spicy attribute.
Here a representative parser of my test :
module Foo;
export type Test = unit {
data : bytes &until("\x00");
};
I notice the processing time is quite sensitive (nearly x2) to the payload size because of &until("\x00")
when "\x00" delimiter is not present.
I see that the &until
attribute is translated into unpack (data, cur) Hilti::Packed::BytesDelim "\x00"
which is transformed into a llvm loop composed of llvm basic block.
https://github.com/rsmmr/hilti/blob/master/hilti/codegen/unpacker.cc#L134-L210
At each loop step tree functions are called:
- hlt_iterator_bytes_eq
- hlt_bytes_match_at
- hlt_iterator_bytes_incr
The pseudo code is :
while 1 {
if ( hlt_iterator_bytes_eq(cur, END_BYTE))
// => raise Hilti::WouldBloc
break;
else{
if hlt_bytes_match_at(cur, delim)
// => Find !
break;
else
hlt_iterator_bytes_incr(cur) # cur = cur+1
}
}
Even through these functions are "inlines", the loop assembly is quite heavy to just search a string pattern in a bytes.
I saw that __hlt_bytes_find_byte_from [EDIT : __hlt_bytes_find_bytes
] exists to search a string pattern, it is written in C and use memchr
of libc to search a char. memch
seems to be optimized for performance (better than a simple loop on each char)
https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.c#L1019-L1029
http://code.metager.de/source/xref/gnu/glibc/string/memchr.c
Before trying to use __hlt_bytes_find_bytes
for &until
i would like to know if there is maybe a reason that it is not used in the unpacker.cc ?
Yeah, quite heavy for sure in this case. Iirc, the main reason for this is supporting searching for more than a single byte. __hlt_bytes_find_byte_from looks only for a single byte. If the code generator can guarantee that the delimiter is indeed just exactly one byte, switching over should be fine (i.e., it needs to be a byte constant of length 1).
More generally, there are plenty cases in the code generator right now where it picks the most generic/easy path but could optimize further for special cases where it knows more. There's also a bit handwaving that LLVM's optimization might take care of some stuff which at first appears inefficient (might be interesting to look at the actual LLVM bit code generated here after optimization).
One more note: are you confident it's indeed the searching that's expensive? The data : bytes &until("\x00");
construct is expensive is a different way too: it needs to buffer al the data for later access. If's the matching input is long, that can be significant. A way to avoid that is using &transient
, which doesn't store the whole thing but still gives access to chunks as they come in.
I did a mistake in my previous post, in fact it is hlt_bytes_find_bytes
(or hlt_bytes_find_bytes_at_iter
to use with a byte iterator) and not __hlt_bytes_find_byte_from
. It seems they can be used with bytes pattern.
https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.h#L192-L213
Effectively for LLVM optimization, I did not check how it optimizes it. In my case I just did a callgrind analysis and compared the difference of instruction fetches between a small payload and a big one (A simple TFTP request with no option (delimited by \x00) and a big one with an option but without \x00).
I will try the &transient
to see the difference and also check if it is really the searching which is expensive.
Today I have started to implement a quick patch to use hlt_bytes_find_bytes
, for the moment it works with a hlt script but I did not success yet to test it with a spicy analyzer (need to convert a hilti string into byte before calling the function). I am not very confortable with llvm code generation ^^
I will give you more information about results and perform test comparisons.
Hello Robin,
I have a first version patch using hlt_bytes_find_bytes :) ! Here my perfomance results:
For this test I wrote a simple analyzer:
module FOO;
export type Test = unit {
data: bytes &until(b"\x00");
};
"A" | "A"*1300 | Diff perf % | IR "A" | IR "A"*1300 | Diff IR % | |
---|---|---|---|---|---|---|
master | 11135.73869 | 98344.34831 | 783.1416673 | 34766 | 311453 | 795.8551458 |
master &transient | 10924.07524 | 91367.08665 | 736.3828029 | 32490 | 309177 | 851.6066482 |
hlt_bytes_find_bytes | 10365.35155 | 10746.25974 | 3.674821712 | 37305 | 37690 | 1.03203324 |
hlt_bytes_find_bytes &transient | 9675.547945 | 9716.534161 | 0.4236061515 | 35033 | 35418 | 1.098963834 |
I perfomed 4 differents tests:
- master : The current code
- master &transient : The current code + add
&transient
to the Test.data field - hlt_bytes_find_bytes : _bytesUnpackDelim patched
- hlt_bytes_find_bytes &transient : _bytesUnpackDelim patched + add
&transient
to the Test.data field
Time measure
The two first rows contain the time in nanosecond. It is an average on about 1000 packets.
I measured it by getting the time before and after the parsing call function:
void* pboj =(*endp->parser->parse_func)(endp->data, &endp->cookie, &except, ctx)
Instruction fetches
The "IR" rows contain the number of instruction fetches regarding the ".hlt.foo_parse_Test_internal" that executes protocol decoding. It given by kcachegrind with a valgrind record. When I measure "IR" I use only one packet because the result is identical for a same packet (same execution path)
Payloads I sent were just "A" string without the delimiter "\x00" : a small packet just the character "A" and a big packet "AAAAA..." 1300 times.
Interpretation
I notice that time measures can change between two same measures (the cpu load of the system is not constant) but the ratio "Diff perf %" is the same order of magnitude (about +700%) It seems the time increase between "A" and "A"*1300 rows depends of the difference of instruction fetch which is logical.
I also checked a delimiter with several bytes and it works correctly.
patch
static void _bytesUnpackDelim(CodeGen* cg, const UnpackArgs& args, const UnpackResult& result, bool skip)
{
auto iter_type = builder::iterator::typeBytes();
auto bytes_type = builder::reference::type(builder::bytes::type());
auto delim = builder::codegen::create(bytes_type, args.arg);
auto begin = builder::codegen::create(iter_type, args.begin);
auto end = builder::codegen::create(iter_type, cg->llvmIterBytesEnd());
auto block_body = cg->newBuilder("unpack-loop-body");
auto block_exit = cg->newBuilder("unpack-loop-exit");
auto block_insufficient = cg->newBuilder("unpack-loop-insufficient");
// Copy the start iterator.
cg->llvmCreateStore(args.begin, result.iter_ptr);
cg->llvmCreateBr(block_body);
cg->pushBuilder(block_body);
auto cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
CodeGen::expr_list params = { cur, delim };
auto find_result = cg->llvmCall("hlt::bytes_find_bytes_at_iter", params);
auto next = cg->llvmTupleElement(iter_type, find_result, 1, false);
cg->llvmCreateStore(next, result.iter_ptr);
//auto success = cg->llvmTupleElement(??????,find_result, 0, false);
cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
params = {cur, end};
auto done = cg->llvmCall("hlt::iterator_bytes_eq", params);
cg->llvmCreateCondBr(done, block_insufficient, block_exit);
cg->popBuilder();
// Not found.
cg->pushBuilder(block_insufficient);
cg->llvmRaiseException("Hilti::WouldBlock", args.location);
cg->popBuilder();
// Loop exit.
cg->pushBuilder(block_exit);
if ( ! skip ) {
// Get the string up to here.
cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
params = { begin, cur };
auto match = cg->llvmCall("hlt::bytes_sub", params);
cg->llvmCreateStore(match, result.value_ptr);
}
else {
auto init_val = cg->llvmInitVal(bytes_type);
cg->llvmCreateStore(init_val, result.value_ptr);
}
// Move beyond delimiter.
params = { delim };
auto l = cg->llvmCall("hlt::bytes_len", params);
auto len = builder::codegen::create(builder::integer::type(64), l);
cur = builder::codegen::create(iter_type, cg->builder()->CreateLoad(result.iter_ptr));
params = { cur, len };
next = cg->llvmCall("hlt::iterator_bytes_incr_by", params);
cg->llvmCreateStore(next, result.iter_ptr);
// Leave builder on stack.
}
It could be improved and be cleaned up:
- remove the llvm block_body if possible
- Get if succeed directly with
cg->llvmTupleElement
instead of of call hlt::iterator_bytes_eq which is useless because hlt::bytes_find_bytes_at_iter returns a struct containing if the pattern was found. I did not found the hilti::Type to use in llvmTupleElement. https://github.com/rsmmr/hilti/blob/master/libhilti/bytes.h#L65-L68
Nice, thanks for measuring! Feel free to file a merge request. I'm a bit behind with with Spicy/HILTI stuff due to time constraints but will aim to take a closer look then.