[FLINK-37504] Handle TLS Certificate Renewal
https://issues.apache.org/jira/browse/FLINK-37504
What is the purpose of the change
Implements TLS certificate renewal for SSL-enabled Flink components.
Adds a mechanism that detects when SSL keys change on the Flink container and notifies the affected networking components, each of which then reloads the new SSL keystore/truststore. The functionality is covered by unit tests and integration tests.
More details are in the design doc: https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal
Brief change log
- Adds a new configuration option to enable reloading of SSL certificates
- Adds a watch service that watches given directories and notifies subscribers when those directories change
- Netty, Pekko and BlobServer components subscribe to the new watch service and reload their SSL context when needed
- BlobServer recreates its socket on certificate reload; we rely on BlobClient retries to handle temporary connectivity issues
- Adds tests for the functionality
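The watch-and-notify mechanism described in the change log can be sketched with the JDK's `java.nio.file.WatchService`. This is a minimal illustration and not the code in this PR: the class name `DirectoryWatcherSketch`, the `Listener` interface, and the `pollOnce` method are hypothetical, and the actual implementation may detect changes differently.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a directory watcher with subscribers; not the class added by this PR. */
public class DirectoryWatcherSketch implements AutoCloseable {

    /** Hypothetical callback for components that need to reload their SSL context. */
    public interface Listener {
        void onDirectoryChanged(Path directory);
    }

    private final Path directory;
    private final WatchService watchService;
    private final List<Listener> listeners = new CopyOnWriteArrayList<>();

    public DirectoryWatcherSketch(Path directory) throws IOException {
        this.directory = directory;
        this.watchService = directory.getFileSystem().newWatchService();
        // Watch for files being created, modified, or deleted in the directory.
        directory.register(
                watchService,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
    }

    public void subscribe(Listener listener) {
        listeners.add(listener);
    }

    /**
     * Waits up to the given timeout for a change. All events in one batch are
     * collapsed into a single notification per listener.
     *
     * @return true if listeners were notified
     */
    public boolean pollOnce(long timeout, TimeUnit unit) throws InterruptedException {
        WatchKey key = watchService.poll(timeout, unit);
        if (key == null) {
            return false; // nothing changed within the timeout
        }
        boolean changed = !key.pollEvents().isEmpty();
        key.reset(); // re-arm the key so later changes are reported
        if (changed) {
            for (Listener listener : listeners) {
                listener.onDirectoryChanged(directory);
            }
        }
        return changed;
    }

    @Override
    public void close() throws IOException {
        watchService.close();
    }
}
```

In this sketch a subscriber such as the Netty or BlobServer component would register a `Listener` and rebuild its SSL context inside the callback. Collapsing a batch of events into one notification avoids reloading once per file when a keystore and truststore are replaced together.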
Verifying this change
This change added tests and can be verified as follows:
- Added integration tests for end-to-end deployments, which ensure that certificates are reloaded, not reloaded, or not used, according to the provided SSL options
- Added unit tests for the watch service behaviour with multiple writers, writes, and readers. They ensure that the proposed mechanism with the Dirty state machine works correctly. Executed the tests 100 times locally to rule out flakiness
- If needed, it is easy to experiment with the given test manually. Running it with 100 threads and changes seems too slow to be executed regularly
- Covered the BlobServer reload mechanism in particular. Ensured that the certificate is reloaded if changed. Also ran these tests multiple times locally to rule out flakiness
- Deployed the server in a local environment and triggered a certificate change
- Ran the implementation in a staging environment for several months. Note: only application mode is used in the staging environment
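The "Dirty state machine" mentioned above is not spelled out in this description. One plausible reading, shown here purely as an assumption, is a two-state flag (clean/dirty) that writers set on every change and that a reader clears atomically before reloading. The class `DirtyFlagSketch` and its methods are hypothetical names and may not match the PR's design.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical two-state (clean/dirty) flag; an assumed reading of the mechanism, not the PR's code. */
public class DirtyFlagSketch {

    private final AtomicBoolean dirty = new AtomicBoolean(false);

    /** Called by any writer that observed a change; repeated calls coalesce. */
    public void markDirty() {
        dirty.set(true);
    }

    /**
     * Runs the reload action only if the state is dirty. The flag is cleared
     * before the reload starts, so a change that arrives during the reload
     * marks the state dirty again and triggers another reload later.
     *
     * @return true if the reload action was run
     */
    public boolean reloadIfDirty(Runnable reload) {
        if (dirty.compareAndSet(true, false)) {
            reload.run();
            return true;
        }
        return false;
    }
}
```

With this shape, many concurrent writers result in at most one pending reload, and the compare-and-set guarantees that only one reader wins the transition from dirty to clean.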
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? New configurations are added
CI report:
- ccf6043e3f4ce96b8022b2f86fa06a4158e70927 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build