jaeger
jaeger copied to clipboard
[Bug]: race condition when reloading TLS certificates
What happened?
This is a continuation of this discussion: https://github.com/jaegertracing/jaeger/pull/4260#discussion_r1141124965
Steps to reproduce
We can probably reproduce it by creating a unit test (which will run with -race
) that does the following:
- updates the certificates a couple of times
- meanwhile, makes a couple of client/server calls with TLS using the same tls config
Expected behavior
We expect that changes to certificates in a server take effect on future client connections. Instead, the test described above should detect a race condition as we try to update the same certPools that are already being used by the server (and therefore will be read on new client connections).
Relevant log output
No response
Screenshot
No response
Additional context
The rough proposal would be as follows:
- make the certWatcher the owner of certPools internally, accessible via getter (instead of passing them from options.go)
- store those pools as atomic.Value
- when reloading the certs and adding the to the pools, create new pools via copy first, and store them in atomic.Value
- and most importantly, define
tls.Config.GetConfigForClient
function by returning a new config with certPools retrieved from certWatcher (similar to how we already define cfg.Get{Client}Certificate functions). This way the server will be using immutable tls configs and cert pools
NB: this probably would not address the issue of reloading root CA for the server or the client, perhaps we should not be even supporting it. GetConfigForClient would only allow supporting reloading of ClientCA on the server, per @tsaarni 's comment.
Jaeger backend version
No response
SDK
No response
Pipeline
No response
Stogage backend
No response
Operating system
No response
Deployment model
No response
Deployment configs
No response
@yurishkuro I am trying to reproduce the error by writing a test case.
Is this the right approach to identify the problem you are referring in the issue?
- Can i create self signed server certificates and keys and store them as temp files inside func generateNewCertificateFiles()
- Setup tls config and spin up a server
- In another goroutine ,in a loop, use generateNewCertificateFiles() to create new certs and reload certs for local server.
- In another loop try to make get calls to local server.
Yes, and you can do this all in a unit test, such that if run with -race
flag Go should reliably detect race condition.
For step 1 you don't need to generate new certs, we already have working pairs as text fixtures in the tlscfg package.
@yurishkuro sorry got busy. I am having trouble in reproducing the race condition 1. this is the test function that i came up with to reproduce the error
- I am trying to create option with existing test certs in tlscfg package link
- Here i start tls server in background with option above link
- In another goroutine , generating new certificates using gen-certs.sh and generate new server config by calling
cfg, err := options.Config(zap.NewNop())
then i update the server configserver.TLSConfig = cfg
this causes another race condition. - In another go routine , crating client with options and getting new config , also making Get calls to background
Not able to reproduce race conditions at
WARNING: DATA RACE
Read at 0x00c0000105e8 by goroutine 22:
crypto/x509.(*CertPool).addCertFunc()
/opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:189 +0x6b8
crypto/x509.(*CertPool).AppendCertsFromPEM()
/opt/hostedtoolcache/go/1.20.2/x64/src/crypto/x509/cert_pool.go:227 +0x4f1
github.com/jaegertracing/jaeger/pkg/config/tlscfg.addCertToPool()
/home/runner/work/jaeger/jaeger/pkg/config/tlscfg/options.go:142 +0x1d7
as mentioned in comment I want to know if this approach to test is in the right direction ?
I will try to spend some more time on this
Hey @yurishkuro ! I was trying to reproduce this issue. So I came up with two tests, first test TestConcurrentCertPoolAccessForDataRace
link directly provoking a data race by concurrently appending certificates to an x509.CertPool
while the second test TestConcurrentConfigAccess
link concurrently invokes a method Options.Config
that loads and update certificate pools indirectly.
Both the test are provoking Data Race conditions.
Data Race provoked by the first test TestConcurrentCertPoolAccessForDataRace
link is similar to the actual Data Race:
==================
WARNING: DATA RACE
Read at 0x00c00012f7a0 by goroutine 8:
runtime.makeBucketArray()
/usr/local/go/src/runtime/map.go:346 +0x21c
crypto/x509.(*CertPool).addCertFunc()
/usr/local/go/src/crypto/x509/cert_pool.go:189 +0x55c
crypto/x509.(*CertPool).AppendCertsFromPEM()
/usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
/Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
/Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44
Previous write at 0x00c00012f7a0 by goroutine 16:
runtime.mapaccessK()
/usr/local/go/src/runtime/map.go:519 +0x1ec
crypto/x509.(*CertPool).addCertFunc()
/usr/local/go/src/crypto/x509/cert_pool.go:193 +0x594
crypto/x509.(*CertPool).AppendCertsFromPEM()
/usr/local/go/src/crypto/x509/cert_pool.go:227 +0x404
github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func1()
/Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:204 +0xa0
github.com/jaegertracing/jaeger/pkg/config/tlscfg.TestConcurrentCertPoolAccessForDataRace.func2()
/Users/shashankmittal/Documents/Developer/jaeger/pkg/config/tlscfg/options_test.go:212 +0x44
==================```
Does it mean your tests reliably reproduce data race? If so, great job! Can you create a PR adding those tests? And since you're deep into this, do you see a way to fix this?