materialize icon indicating copy to clipboard operation
materialize copied to clipboard

deps: upgrade to unpatched librdkafka

Open benesch opened this issue 1 year ago • 15 comments

To work around an unexplained memory error involving OpenSSL, we had to revert a librdkafka commit that changes how Kafka clients load CA certificates (f8830a2).

However, after extensive review, neither @petrosagg nor I can spot a memory error in that commit. That suggests that there might be a preexisting memory error in either OpenSSL or librdkafka that https://github.com/benesch/librdkafka/commit/f8830a28652532009e3f16854cb9d5004d9de06b exacerbates. That's scary, because there might be a way to trigger that memory error even with https://github.com/benesch/librdkafka/commit/f8830a28652532009e3f16854cb9d5004d9de06b reverted.

This commit upgrades us to mainline librdkafka v2.4.0--i.e., a version of librdkafka with https://github.com/benesch/librdkafka/commit/f8830a28652532009e3f16854cb9d5004d9de06b unreverted--to make the memory error reproducible. That will allow us to debug and find the root cause of the memory error.

FOR DEBUGGING PURPOSES ONLY. Do NOT merge this commit in its current state.

Motivation

  • This PR will help us to debug a librdkafka issue.

Checklist

  • [x] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • [x] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [x] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [x] This PR includes the following user-facing behavior changes:
    • n/a

benesch avatar Jul 01 '24 01:07 benesch

@def- I could use your expertise in trying to find this memory error. Is it easy to run ASan and/or MSan against this branch? I spent a while variously staring at the code and adding printfs to librdkafka without any luck. I think it's a pretty subtle bug, which sanitizers are particularly well suited to find.

Don't spend too much time or energy on this, though. Since discovering that reverting https://github.com/confluentinc/librdkafka/commit/f8830a2 allowed us to upgrade to librdkafka v2.4.0, there's no longer a pressing product need to investigate this issue. I'm just nervous from an engineering perspective about any unexplained memory issue.

If the sanitizer fails, the next step will be to add printfs to OpenSSL. I'm quite confident I'll be able to track it down like that, but it would be a substantial time investment.

benesch avatar Jul 01 '24 01:07 benesch

@def- I could use your expertise in trying to find this memory error. Is it easy to run ASan and/or MSan against this branch? I spent a while variously staring at the code and adding printfs to librdkafka without any luck. I think it's a pretty subtle bug, which sanitizers are particularly well suited to find.

@benesch: Do you want to do this while running the tests? If so, you can select a sanitizer in "Trigger CI Builds".

nrainer-materialize avatar Jul 01 '24 05:07 nrainer-materialize

Yes, exactly—thank you! Didn't realize that functionality had been incorporated into Trigger CI. A few builds:

  • Undefined behavior: https://buildkite.com/materialize/nightly/builds/8296
  • Thread: https://buildkite.com/materialize/nightly/builds/8297
  • HW address: https://buildkite.com/materialize/nightly/builds/8298
  • Address: https://buildkite.com/materialize/nightly/builds/8299

benesch avatar Jul 01 '24 05:07 benesch

Unfortunately ASan is currently broken because of https://github.com/MaterializeInc/database-issues/issues/8126, I'll try to find a workaround for that.

Also, triggering multiple sanitizer runs is a bit awkward since it stops the previous one when you have the same branch name selected in Buildkite. Maybe I should add a suffix by default to indicate the sanitizer? (Edit: Done)

def- avatar Jul 01 '24 08:07 def-

Unfortunately ASan is currently broken because of https://github.com/MaterializeInc/database-issues/issues/8126, I'll try to find a workaround for that.

Ah, bummer. That's probably my bad. If I find some time I'll look into it.

benesch avatar Jul 01 '24 18:07 benesch

My ASan run didn't find anything new: https://github.com/MaterializeInc/materialize/pull/27990

def- avatar Jul 02 '24 21:07 def-

Ah, bummer. Thanks anyway for trying.

benesch avatar Jul 03 '24 06:07 benesch

I checked and the C/C++ -fsanitize=address flags seem to be passed through correctly to both openssl and librdkafka.

Have we tried upgrading our OpenSSL version or would that be more involved?

def- avatar Jul 03 '24 22:07 def-

Aren’t we running a very recent OSSL at this point? IIRC Parker just updated it.

On Wed, Jul 3, 2024 at 6:59 PM Dennis Felsing @.***> wrote:

I checked and the C/C++ -fsanitize=address flags seem to be passed through correctly to both openssl and librdkafka.

Have we tried upgrading our OpenSSL version or would that be more involved?

— Reply to this email directly, view it on GitHub https://github.com/MaterializeInc/materialize/pull/27963#issuecomment-2207441641, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXSIDBWLN7XLUIEPAZEJLZKR66VAVCNFSM6AAAAABKES43B2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBXGQ2DCNRUGE . You are receiving this because you were mentioned.Message ID: @.***>

benesch avatar Jul 04 '24 05:07 benesch

Indeed, then I'm out of quick ideas. Might still be some buffer reuse inside of OpenSSL.

def- avatar Jul 04 '24 07:07 def-

I'm wondering if https://github.com/sfackler/rust-openssl/commit/142deef717bad843fc04c5afb925bfd9e7dc4305 could have been our issue. Pushed up a rebase that incorporates both unpatched librdkafka and patched rust-openssl (0.10.66).

benesch avatar Aug 04 '24 17:08 benesch

Nightly run: https://buildkite.com/materialize/nightly/builds/8949

benesch avatar Aug 04 '24 17:08 benesch

No luck. Same corruption error as before.

benesch avatar Aug 04 '24 22:08 benesch

Trying again with an unpatched librdkafka v2.5: https://buildkite.com/materialize/nightly/builds/9350

I'm not terribly hopeful, but the release notes for v2.5 do mention a buffer overflow that got fixed.

benesch avatar Aug 31 '24 02:08 benesch

Still reliably causing PostgreSQL connection issues.

benesch avatar Aug 31 '24 03:08 benesch

Abandoning.

benesch avatar Dec 25 '24 20:12 benesch