fluent-bit
out_es: fix es bulk buffer overrun
Fixes #5680.
Enter [N/A] in the box if an item is not applicable to your change.
Testing
Before we can approve your change, please submit the following in a comment:
- [ ] Example configuration file for the change
- [x] Debug log output from testing the change *
- [ ] Attached Valgrind output that shows no leaks or memory corruption was found
If this is a change to packaging of containers or native binaries then please confirm it works for all targets.
- [ ] Attached local packaging test output showing all targets (including any new ones) build.
Documentation
- [ ] Documentation required for this feature
Backporting
- [ ] Backport to latest stable release.
Note: All testing was on Windows Server 2016. See https://github.com/fluent/fluent-bit/issues/5680#issuecomment-1194723285 for logs leading up to the crash before these changes.
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
PTAL; I've been testing this in production for the last week. The output looks good and the daily crashes have stopped.
I fixed the commit message to comply with the contribution guidelines.
I'm not sure how to fix the build issues with `max` though, which obviously builds and links for Windows. 😕
I've tried a tweak to fix the `max` issue. ~Just need the workflow approved to run again.~
The macOS builds are still failing with:
```
[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_conf.c.o
[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_bulk.c.o
[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es.c.o
/Users/runner/work/fluent-bit/fluent-bit/plugins/out_es/es_bulk.c:88:27: error: implicit declaration of function 'max' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
            append_size = max(append_size, remaining_size);
                          ^
/Users/runner/work/fluent-bit/fluent-bit/plugins/out_es/es_bulk.c:88:27: note: did you mean 'fmax'?
/Applications/Xcode_13.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/math.h:532:15: note: 'fmax' declared here
extern double fmax(double, double);
              ^
1 error generated.
make[2]: *** [plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_bulk.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/all] Error 2
```
@edsiper @PettitWesley Any ideas how to fix this? What am I missing?
@PettitWesley I'm hoping to get this fix into the next 1.9 release. Any ideas how to fix the non-windows builds?
Rebased with recent changes.
Also added the buffer read overrun fixes since I've been testing with those too (with es).
@gitfool you need signoff on all commits. Also it should be one commit per component for a small change like this IMO, so I think that's 2 commits, one for each plugin. Thank you for remembering opensearch!
Okay, I'll squash and signoff.
@gitfool Actually sorry just realized you didn't make these same changes for opensearch... would you mind doing that? the code is basically exactly the same so it should be copy paste
@PettitWesley I only fixed the buffer read overrun in the opensearch plugin since it doesn't seem to do bulk operations like the es plugin. (Not sure if that's a good thing but I'll find out when we eventually switch from es to opensearch on the backend.)
If this gets approved then I presume I should submit another pull request to master with the same fixes?
@gitfool Yup
@PettitWesley is there anything else I can do to help get this reviewed and merged? 🙏
@nokute78 @edsiper CI has passed on this one, I think it's ready for review
@gitfool This is all we can do sorry. Have to wait for other maintainers to review.
Missed the 1.9.8 release too. 😭
Rebased with recent changes. @nokute78 @edsiper PTAL.
This seemed correct to me at the time, and I think it makes no sense to keep stalling it.
Could we do a quick recap so we can give it closure?
I'm not familiar at all with this, but the way I understood it, there are two issues this fixes:
- When nothing has been converted yet, `i_len` is not taken into account when calculating `required`.
This seems fairly straightforward considering there is a `memcpy` on line 101 that places `i_len` bytes from `index` into the bulk structure buffer.
After reviewing the rest of the code I noticed that the previous statement could not be true, so please correct me if I'm wrong, but the code seems to use `ES_BULK_HEADER` when calculating `required`, which according to `es.c` is the maximum value `i_len` could have (it's used to allocate the buffer we receive there in `index`).
- When calculating (this is speculation, right?) how much space would be required for the remaining msgpack bytes to be converted to JSON, the original code was implicitly rounding down the result of the division `bulk->size / converted_size`, which is used to establish a conversion ratio that is then multiplied by the remaining byte count to get a ballpark of how much space we think we will need.
Unless I completely misunderstood it, I'm not entirely convinced by the original approach to calculating the buffer size requirement, because of how msgpack is internally represented, and I see no checks to validate that the new buffer size is large enough to fit `i_len + j_len` bytes.
I wonder if I'm missing the point or if this could be simplified this way with no noticeable side effects:
```c
#define BULK_BUFFER_GROWTH_SIZE 4096

required = i_len + j_len;
required = fmaxl(required, BULK_BUFFER_GROWTH_SIZE);
available = bulk->size - bulk->len;

if (available < required) {
    ptr = flb_realloc(bulk->ptr, bulk->size + required);
}
```
Which could be considered wasteful if you're flushing before you fill the 4 KiB, or suboptimal if what you are appending usually causes the raw required size to be larger than 4 KiB (i.e. reallocating on each append), but that could be improved either by making it user-configurable or by having the plugin start with a sensible value and adjust based on the average append size.
I want to give closure to this, and considering that your fix is working for you, I'm inclined to approve it because the way I see it, the only side effect it could have would be using a little bit more memory. I'm only mentioning these other aspects because it's what I understood after trying to make sense of the original code, and I thought it could help.
Please don't hesitate to correct me if there is anything I missed or misunderstood so we can promptly get these improvements merged.
Note on msgpack vs JSON: an array comprised of 10 `false` values, msgpack encoded, would be 11 bytes long; however, the same array JSON encoded would take 70 bytes.
@leonardo-albertovich thank you very much for the review!
- I agree that using `ES_BULK_HEADER` to calculate `required` seems redundant given the 3 subsequent writes to `bulk->ptr` (`i_len` and `j_len` bytes, then 1 byte for the trailing newline).
I left this in place because I did not know this code at all and wanted to merely tweak it to work as I determined was intended; but it would make more sense if the reference to `ES_BULK_HEADER` was removed, as it seems to be irrelevant to `es_bulk_append`.
- As I noted in the issue, this calculation to predict future buffer requirements seems overly complex, which led to bugs.
I'm sure it could and probably should be simplified, but I would firmly but politely ask for that to be done in a separate pull request after this one is merged to 1.9 and master (in https://github.com/fluent/fluent-bit/pull/6396) since I know the current changes work after extensive "testing" in production, going on for 4 months now.
Also, as I commented above in this pull request, the opensearch plugin doesn't seem to have an equivalent bulk append, unless I'm missing something, and I'd expect its implementations to be similar, so whatever is eventually applied here should presumably be applied there too, which again should be done in a separate pull request.
Good news everyone!
In order to be confident enough to back your PR, I wrote a quick PoC to demonstrate the value expansion issue I mentioned in my previous message, and realized that two changes you made sealed the deal:
- Moving the assignment of `append_size`, which actually takes into account the length of what we are trying to append, to line 82
- Adding the `fmax` line to ensure that the calculated size is large enough to fit the data, or fall back to `append_size`
Without those changes I could crash fluent-bit while running with this configuration:
```ini
[SERVICE]
    flush       10
    log_level   trace

[INPUT]
    Name    tcp
    listen  127.0.0.1
    port    9556

[OUTPUT]
    Name         es
    Match        *
    host         127.0.0.1
    port         9555
    retry_limit  no_retries
    workers      0
    tls          off
```
Just launching fluent-bit and then running these two lines before the first flush cycle ran:
```shell
python gen_good_json.py | nc 127.0.0.1 9556
python gen_bad_json.py | nc 127.0.0.1 9556
```
```python
# gen_good_json.py
payload = "a" * 4000
generated = '{"a": "%s"}' % (payload)
print(generated)
```
```python
# gen_bad_json.py
payload = [False] * 1100
payload = (str(e).lower() for e in payload)
payload = ", ".join(payload)
generated = '{"a": [%s]}' % (payload)
print(generated)
```
The idea there is to prime the bulk structure with a first message where each input character roughly translates to one output character, sending enough characters that the relationship between whole and converted size gives us a small multiplier (in fact, if you send a 10 byte payload followed by a 7 KB payload, the result of the calculation is 285 KB), and then send a large JSON payload that shrinks when converted to msgpack.
Anyway, the result is that I proved your patch is good enough, not because of the correction in how floating point numbers are handled, but because it properly handles the corner case I was rightfully worried about.
@edsiper please, let's get this PR and #6396 merged ASAP.
And once again, thanks a lot for finding the issue, fixing it, and sticking with it, @gitfool, we really appreciate it.