jaeger icon indicating copy to clipboard operation
jaeger copied to clipboard

[Bug]: race condition when reloading TLS certificates

Open yurishkuro opened this issue 1 year ago • 6 comments

What happened?

This is a continuation of this discussion: https://github.com/jaegertracing/jaeger/pull/4260#discussion_r1141124965

Steps to reproduce

We can probably reproduce it by creating a unit test (which will run with -race) that does the following:

  • updates the certificates a couple of times
  • meanwhile, makes a couple of client/server calls with TLS using the same tls config

Expected behavior

We expect that changes to certificates in a server take effect on future client connections. Instead, the test described above should detect a race condition as we try to update the same certPools that are already being used by the server (and therefore will be read on new client connections).

Relevant log output

No response

Screenshot

No response

Additional context

The rough proposal would be as follows:

  • make the certWatcher the owner of certPools internally, accessible via getter (instead of passing them from options.go)
  • store those pools as atomic.Value
  • when reloading the certs and adding the to the pools, create new pools via copy first, and store them in atomic.Value
  • and most importantly, define tls.Config.GetConfigForClient function by returning a new config with certPools retrieved from certWatcher (similar to how we already define cfg.Get{Client}Certificate functions). This way the server will be using immutable tls configs and cert pools

NB: this probably would not address the issue of reloading root CA for the server or the client, perhaps we should not be even supporting it. GetConfigForClient would only allow supporting reloading of ClientCA on the server, per @tsaarni 's comment.

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Stogage backend

No response

Operating system

No response

Deployment model

No response

Deployment configs

No response

yurishkuro avatar Mar 19 '23 17:03 yurishkuro

@yurishkuro I am trying to reproduce the error by writing a test case.

Is this the right approach to identify the problem you are referring in the issue?

  1. Can i create self signed server certificates and keys and store them as temp files inside func generateNewCertificateFiles()
  2. Setup tls config and spin up a server
  3. In another goroutine ,in a loop, use generateNewCertificateFiles() to create new certs and reload certs for local server.
  4. In another loop try to make get calls to local server.

ChillOrb avatar Apr 13 '23 06:04 ChillOrb

Yes, and you can do this all in a unit test, such that if run with -race flag Go should reliably detect race condition.

For step 1 you don't need to generate new certs, we already have working pairs as text fixtures in the tlscfg package.

yurishkuro avatar Apr 14 '23 03:04 yurishkuro

@yurishkuro sorry got busy. I am having trouble in reproducing the race condition 1. this is the test function that i came up with to reproduce the error

  1. I am trying to create option with existing test certs in tlscfg package link
  2. Here i start tls server in background with option above link
  3. In another goroutine , generating new certificates using gen-certs.sh and generate new server config by calling cfg, err := options.Config(zap.NewNop()) then i update the server config server.TLSConfig = cfg this causes another race condition.
  4. In another go routine , crating client with options and getting new config , also making Get calls to background

Not able to reproduce race conditions at

WARNING: DATA RACE
Read at 0x00c0000105e8 by goroutine 22:
  crypto/x509.(*CertPool).addCertFunc()
      /opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:189 +0x6b8
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:227 +0x4f1
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.addCertToPool()
      /home/runner/work/jaeger/jaeger/pkg/config/tlscfg/options.go:142 +0x1d7

as mentioned in comment I want to know if this approach to test is in the right direction ?

I will try to spend some more time on this

ChillOrb avatar May 01 '23 04:05 ChillOrb

Hey @yurishkuro ! I was trying to reproduce this issue. So I came up with two tests, first test TestConcurrentCertPoolAccessForDataRace link directly provoking a data race by concurrently appending certificates to an x509.CertPool while the second test TestConcurrentConfigAccess link concurrently invokes a method Options.Config that loads and update certificate pools indirectly.

shashank-iitbhu avatar Feb 11 '24 14:02 shashank-iitbhu

Both the test are provoking Data Race conditions. Data Race provoked by the first test TestConcurrentCertPoolAccessForDataRace link is similar to the actual Data Race:

==================
WARNING: DATA RACE
Read at 0x00c00012f7a0 by goroutine 8:
  runtime.makeBucketArray()
      /usr/local/go/src/runtime/map.go:346 +0x21c
  crypto/x509.(*CertPool).addCertFunc()
      /usr/local/go/src/crypto/x509/cert_pool.go:189 +0x55c
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44

Previous write at 0x00c00012f7a0 by goroutine 16:
  runtime.mapaccessK()
      /usr/local/go/src/runtime/map.go:519 +0x1ec
  crypto/x509.(*CertPool).addCertFunc()
      /usr/local/go/src/crypto/x509/cert_pool.go:193 +0x594
  crypto/x509.(*CertPool).AppendCertsFromPEM()
      /usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
  github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
      /Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44
==================```

shashank-iitbhu avatar Feb 11 '24 14:02 shashank-iitbhu

Does it mean your tests reliably reproduce data race? If so, great job! Can you create a PR adding those tests? And since you're deep into this, do you see a way to fix this?

yurishkuro avatar Feb 11 '24 20:02 yurishkuro