fluent-bit icon indicating copy to clipboard operation
fluent-bit copied to clipboard

out_es: fix es bulk buffer overrun

Open gitfool opened this issue 3 years ago • 17 comments

Fixes #5680.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing Before we can approve your change; please submit the following in a comment:

  • [ ] Example configuration file for the change
  • [x] Debug log output from testing the change *
  • [ ] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [ ] Attached local packaging test output showing all targets (including any new ones) build.

Documentation

  • [ ] Documentation required for this feature

Backporting

  • [ ] Backport to latest stable release.

Note: All testing was on Windows Server 2016. See https://github.com/fluent/fluent-bit/issues/5680#issuecomment-1194723285 for logs leading up to the crash before these changes.


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

gitfool avatar Jul 22 '22 03:07 gitfool

PTAL; I've been testing this in production for the last week. The output looks good and the daily crashes have stopped.

gitfool avatar Jul 28 '22 19:07 gitfool

I fixed the commit message to comply with the contribution guidelines.

I'm not sure how to fix the linker issues with max though, which obviously builds and links for windows. 😕

gitfool avatar Aug 01 '22 21:08 gitfool

I've tried a tweak to fix the max issue. ~Just need the workflow approved to run again.~

gitfool avatar Aug 01 '22 22:08 gitfool

The macos builds are still failing with:

[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_conf.c.o
[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_bulk.c.o
[ 67%] Building C object plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es.c.o
/Users/runner/work/fluent-bit/fluent-bit/plugins/out_es/es_bulk.c:88:27: error: implicit declaration of function 'max' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
            append_size = max(append_size, remaining_size);
                          ^
/Users/runner/work/fluent-bit/fluent-bit/plugins/out_es/es_bulk.c:88:27: note: did you mean 'fmax'?
/Applications/Xcode_13.2.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX12.1.sdk/usr/include/math.h:532:15: note: 'fmax' declared here
extern double fmax(double, double);
              ^
1 error generated.
make[2]: *** [plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/es_bulk.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [plugins/out_es/CMakeFiles/flb-plugin-out_es.dir/all] Error 2

@edsiper @PettitWesley Any ideas how to fix this? What am I missing?

gitfool avatar Aug 02 '22 00:08 gitfool

@PettitWesley I'm hoping to get this fix into the next 1.9 release. Any ideas how to fix the non-windows builds?

gitfool avatar Aug 08 '22 20:08 gitfool

Rebased with recent changes.

gitfool avatar Aug 10 '22 20:08 gitfool

Also added the buffer read overrun fixes since I've been testing with those too (with es).

gitfool avatar Aug 10 '22 21:08 gitfool

@gitfool you need signoff on all commits. Also it should be one commit per component for a smalll change like this IMO so I think that's 2 commits, one for each plugin. Thank you for remembering opensearch!

PettitWesley avatar Aug 10 '22 21:08 PettitWesley

Okay, I'll squash and signoff.

gitfool avatar Aug 10 '22 21:08 gitfool

@gitfool Actually sorry just realized you didn't make these same changes for opensearch... would you mind doing that? the code is basically exactly the same so it should be copy paste

PettitWesley avatar Aug 10 '22 21:08 PettitWesley

@PettitWesley I only fixed the buffer read overrun in the opensearch plugin since it doesn't seem to do bulk operations like the es plugin. (Not sure if that's a good thing but I'll find out when we eventually switch from es to opensearch on the backend.)

gitfool avatar Aug 10 '22 21:08 gitfool

If this gets approved then I presume I should submit another pull request to master with the same fixes?

gitfool avatar Aug 10 '22 21:08 gitfool

@gitfool Yup

If this gets approved then I presume I should submit another pull request to master with the same fixes?

PettitWesley avatar Aug 10 '22 22:08 PettitWesley

@PettitWesley is there anything else I can do to help get this reviewed and merged? 🙏

gitfool avatar Aug 19 '22 22:08 gitfool

@nokute78 @edsiper CI has passed on this one, I think its ready for review

@gitfool This is all we can do sorry. Have to wait for other maintainers to review.

PettitWesley avatar Aug 22 '22 21:08 PettitWesley

Missed the 1.9.8 release too. 😭

gitfool avatar Sep 07 '22 21:09 gitfool

Rebased with recent changes. @nokute78 @edsiper PTAL.

gitfool avatar Sep 20 '22 21:09 gitfool

This seemed correct to me at the time and think it makes no sense to keep stalling it.

Could we do a quick recap so we can give it closure?

I'm not familiar at all with this but the way I understood is that there are two issues this fixes :

  1. When nothing has been converted yet i_len is not taken in account when calculating required

This seems fairly straightforward considering there is a memcpy in line 101 that places i_len bytes from index into the bulk structure buffer.

After reviewing the rest of the code I noticed that the previous statement could not be true so please correct me if I'm wrong but the code seems to use ES_BULK_HEADER when calculating required which according to es.c is the maximum value i_len could have (it's used to allocate the buffer we receive there in index).

  1. When calculating (this is speculation right?) how much space would be required for the remaining msgpack bytes to be converted to json the original code was implcitly rounding down the result of the division of bulk->size / converted_size which is used to establish a conversion ratio which is then multiplied by the remainder byte count to get a ballpark of how much space we think we will need for that.

Unless I completely misunderstood it I'm not entirely convinced of the original approach to buffer size requirement because due to how msgpack is internally represented and I see no checks to validate that the new buffer size is large enough to fit i_len + j_len bytes.

I wonder if I'm missing the point or if this could be simplified this way with no noticeable side effects :

#define BULK_BUFFER_GROWTH_SIZE 4096

required  = i_len + j_len;
required  = fmaxl(required, BULK_BUFFER_GROWTH_SIZE);
available = bulk->size - bulk->len;

if (available < required) {
    ptr = flb_realloc(bulk->ptr, bulk->size + required);
}

Which could be considered wasteful if you're flushing before you fill the 4kb or sub optimal if what you are appending to it usually causes the raw required size to be larger than 4kb (ie. reallocating on each append) but that could be improved either by turning it into something the user can set or having the plugin start with a sensible value and adjust based on the average append size.

I want to give closure to this and considering that your fix is working for you I'm inclined to approve it because the way I see it, the only side effect it could have would be using a little bit more memory. I'm only mentioning these other aspects because it's what I understood after trying to make sense of the original code and thought it could help make.

Please don't hesitate to correct me if there is anything I missed or misunderstood so we can promptly get these improvements merged.

Note on msgpack vs json : an array comprised of 10 False values msgpack encoded would be 11 bytes long, however the same array json encoded would take 70 bytes.

leonardo-albertovich avatar Nov 19 '22 11:11 leonardo-albertovich

@leonardo-albertovich thank you very much for the review!

  1. I agree that using ES_BULK_HEADER to calculate required seems redundant given the 3 subsequent writes to bulk->ptr (i_len and j_len bytes then 1 byte for the trailing newline).

I left this in place because I did not know this code at all and wanted to merely tweak it to work as I determined was intended; but it would make more sense if reference to ES_BULK_HEADER was removed as it seems to be irrelevant to es_bulk_append.

  1. As I noted in the issue, this calculation to predict future buffer requirements seems overly complex, which lead to bugs.

I'm sure it could and probably should be simplified, but I would firmly but politely ask for that to be done in a separate pull request after this one is merged to 1.9 and master (in https://github.com/fluent/fluent-bit/pull/6396) since I know the current changes work after extensive "testing" in production, going on for 4 months now.

Also, as I commented above in this pull request, the opensearch plugin doesn't seem to have an equivalent bulk append, unless I'm missing something, and I'd expect its implementations to be similar, so whatever is eventually applied here should presumably be applied there too, which again should be done in a separate pull request.

gitfool avatar Nov 19 '22 18:11 gitfool

Good news everyone!

In order to be confident enough to back your PR I wrote a quick PoC to demonstrate the value expansion issue I commented in my previous message and realized that two changes you made sealed the deal :

  1. Moving the assignment of append_size which actually takes in account the length of what we are trying to append to line 82
  2. Adding the fmax line to ensure that the calculated size is large enough to fit data or fall back to append_size

Without those changes I could crash fluent-bit while running with this configuration :

[SERVICE]
  flush 10
  log_level trace

[INPUT]
  Name tcp
  listen 127.0.0.1
  port 9556

[OUTPUT]
  Name es
  Match *
  host 127.0.0.1
  port 9555
  retry_limit no_retries
  workers 0
  tls off

Just launching fluent-bit and then running these two lines before the first flush cycle ran :

python  gen_good_json.py  | nc 127.0.0.1 9556
python  gen_bad_json.py  | nc 127.0.0.1 9556
# gen_good_json.py

payload = "a" * 4000
generated = '{"a": "%s"}' % (payload)
print(generated)
# gen_bad_json.py

payload = [False] * 1100
payload = (str(e).lower() for e in payload)
payload = ", ".join(payload)
generated = '{"a": [%s]}' % (payload)
print(generated)

The idea there is to prime the bulk structure with the first message where each input character roughly translates to one output character (sending enough characters that the relationship between whole and converted gives us a small multiplier (in fact, if you send a 10 byte payload followed by a 7kb payload the result of the calculation is 285kb) and then send a large json payload that shrinks up when converted to msgpack.

Anyway, the result is I proved that your patch is good enough, not because of the correction in how floating point numbers are handled but because it properly handles that corner case I was rightfully worried about.

@edsiper please, let's get this PR and #6396 merged ASAP.

And once again, thanks a lot for finding the issue, fixing it and sticking through @gitfool, we really appreciate it.

leonardo-albertovich avatar Nov 20 '22 13:11 leonardo-albertovich