clickhouse-operator
Add support for setting secure communication between clickhouse instances
- [x] All commits in the PR are squashed.
- [x] The PR is made into the dedicated `next-release` branch, not into the `master` branch.
- [x] The PR is signed.
Fixes #668
Verified this works when configuring clickhouse to use TLS and with no unencrypted ports.
A snippet of my CHI:
templates:
  hostTemplates:
    - name: host-template
      spec:
        {{- if .Values.clickhouse.tls.enabled }}
        tcpPort: 9440
        secure: true
        httpPort: 8443
        interserverHTTPPort: 9010
        {{- else }}
        tcpPort: 9000
        httpPort: 8123
        interserverHTTPPort: 9009
        {{- end }}
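The template above switches between two sets of ports. As a rough illustration of that branching only (this is not the operator's code; `host_ports` is a hypothetical helper, and the dictionary keys simply mirror the field names in the snippet), the same selection could be sketched in Python as:

```python
def host_ports(tls_enabled: bool) -> dict:
    """Return host-template settings mirroring the CHI snippet above."""
    if tls_enabled:
        # Values from the TLS branch of the template, including the
        # `secure: true` flag this PR adds support for.
        return {
            "tcpPort": 9440,
            "secure": True,
            "httpPort": 8443,
            "interserverHTTPPort": 9010,
        }
    # Values from the non-TLS branch of the template.
    return {
        "tcpPort": 9000,
        "httpPort": 8123,
        "interserverHTTPPort": 9009,
    }
```

Calling `host_ports(True)` yields the secure branch; `host_ports(False)` omits the `secure` key entirely, just as the `{{- else }}` branch does.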
It would be ideal to add a test for that, but even an example manifest that shows the use of this feature would help.
Sure, I can provide an example, and see if I can write a test. I didn't see much regarding unit tests so I wasn't sure where the proper place to test this would be.
Some extra automation might be useful as well, e.g. automatically changing the default ports to the secure ones when the secure flag is used.
That would be nice, but it seems out of scope for this PR and more like a separate feature. I'm currently satisfied with just being able to control the configuration directly, with less magic. The reason is that similar functionality already exists for setting the ports (in chop-generated-ports.xml), and it currently conflicts a bit with trying to disable the non-_secure ports (though disabling these ports still works, thankfully).
Here's an example based on what we're using to deploy clickhouse:
https://gist.github.com/chancez/3da4b1df4f9942d2a260360a6a762912
For testing: It shouldn't be terribly difficult to test using TLS, but it requires a bit of effort. I'd need some guidance on how you would expect certificates to be generated for testing.
In our project, we deploy clickhouse with TLS by using cert-manager in the cluster to issue certs. Alternatively, we could use openssl, though I'm not familiar with the python crypto APIs at all, so I'd prefer to just exec openssl to generate certificates for tests.
Once certs are generated, you just need to create a secret for them and then deploy using the example CHI resource I provided.
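A minimal sketch of the "exec openssl, then create a secret" approach described above. The helper names, the secret name `clickhouse-certs`, the namespace `test`, and the file paths are all hypothetical, and `kubectl create secret tls` stores the files under the keys `tls.crt` and `tls.key`, which may not match what a given CHI expects. The functions only build the command lines so they can be inspected before being run:

```python
import shlex


def openssl_self_signed_cmd(cn: str, key_path: str, cert_path: str, days: int = 1) -> list:
    """Build an `openssl req` command for a throwaway self-signed certificate."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
        "-keyout", key_path, "-out", cert_path,
        "-days", str(days), "-subj", f"/CN={cn}",
    ]


def kubectl_tls_secret_cmd(name: str, namespace: str, key_path: str, cert_path: str) -> list:
    """Build a `kubectl create secret tls` command for the generated files."""
    return [
        "kubectl", "create", "secret", "tls", name,
        f"--cert={cert_path}", f"--key={key_path}",
        "--namespace", namespace,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; all names are placeholders.
    print(shlex.join(openssl_self_signed_cmd("clickhouse.test", "server.key", "server.crt")))
    print(shlex.join(kubectl_tls_secret_cmd("clickhouse-certs", "test", "server.key", "server.crt")))
```

In a test, each list could be passed to `subprocess.run(cmd, check=True)` once the names match the CHI under test.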
Currently working on the tests. I have a minimal working example that I'm porting to tests atm.
I tried to write some tests, but I'm having a hard time getting the test image to build via ./tests/image/build_docker.sh. It seems to fail in a different place each time, sometimes while pulling from quay, but most recently on:
Package docker-ce is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
docker-ce-cli:amd64
Also, it doesn't seem like the test image is multi-arch, so I might have issues regardless, as I'm on an m1 Mac.
I've pushed my attempt at some tests, but unfortunately it doesn't look like CI runs tests automatically, so I can't iterate that way either.
Ah I just realized the reason the test image doesn't build is because it's using amd64 repos (and binaries) but I'm pulling the arm64 image. I guess I just have to update it to work for both.
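To illustrate the mismatch: a build that hard-codes amd64 repos needs some way to pick the right architecture name on an arm64 host. The sketch below is only an assumption about how that mapping might look (`docker_arch` is a hypothetical helper, not part of the repository's build script), translating the machine names reported by the OS into the names used by Docker and Debian-style package repos:

```python
import platform

# Machine names as reported by the OS, mapped to Docker/Debian-style names.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def docker_arch(machine=None) -> str:
    """Return 'amd64' or 'arm64' for the given (or current) machine type."""
    machine = (machine or platform.machine()).lower()
    try:
        return _ARCH_ALIASES[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}")
```

An M1 Mac reports `arm64` and a Linux arm64 container reports `aarch64`; both map to `arm64` here.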
You can develop tests without building the image on the arm64 platform.
Just set up minikube, python3, python3-pip, and `python3-venv`,
then run:
minikube start
python3 -m venv ~/operator-venv/
~/operator-venv/bin/pip3 install -U -r ./tests/image/requirements.txt
~/operator-venv/bin/python3 ./tests/regression.py --only="/regression/e2e.test_operator/your_test*" --native
--native mode means the tests don't use docker-compose and instead use the installed minikube with the default KUBECONFIG.
Gotcha. I'm still having issues running tests:
(.venv) (⎈ |minikube:default) ~/p/w/clickhouse-operator ❮❮❮ python3 ./tests/regression.py --only "/regression/e2e.test_operator/test_001*" --native
May 23,2022 10:57:19 ⟥ Suite regression
ClickHouse Operator test regression suite.
Attributes
native
True
keeper_type
zookeeper
Specifications
QA-SRS026 ClickHouse Operator
May 23,2022 10:57:19 ⟥ Feature e2e.test_operator
Requirements
RQ.SRS-026.ClickHouseOperator.CustomResource.APIVersion
version 1.0
May 23,2022 10:57:19 ⟥ Given Clean namespace test, flags:MANDATORY
13ms [bash]
13ms [bash] The default interactive shell is now zsh.
13ms [bash] To update your account to use zsh, please run `chsh -s /bin/zsh`.
13ms [bash] For more details, please visit https://support.apple.com/kb/HT208050.
13ms [bash] bash: kube_ps1: command not found
It just hangs there.
Here's the stack trace when I cancel it: it seems to be stuck on getting crds:
May 23,2022 11:02:50 ⟥ Given Clean namespace test, flags:MANDATORY
11ms [bash]
11ms [bash] The default interactive shell is now zsh.
11ms [bash] To update your account to use zsh, please run `chsh -s /bin/zsh`.
11ms [bash] For more details, please visit https://support.apple.com/kb/HT208050.
11ms [bash] bash: kube_ps1: command not found
^C 10s 382ms ⟥ Exception: Traceback (most recent call last):
File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 62, in <module>
regression()
File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 55, in regression
run_features()
File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 50, in run_features
Feature(run=load(feature_name, "test"))
File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/test_operator.py", line 2210, in test
util.clean_namespace(delete_chi=True)
File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/util.py", line 165, in clean_namespace
kubectl.delete_all_chi(settings.test_namespace)
File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/kubectl.py", line 68, in delete_all_chi
crds = launch("get crds -o=custom-columns=name:.metadata.name", ns=ns).splitlines()
File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/kubectl.py", line 40, in launch
cmd = shell(cmd, timeout=timeout)
File "/Users/chancezibolski/.asdf/installs/python/3.9.13/lib/python3.9/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/Users/chancezibolski/.asdf/installs/python/3.9.13/lib/python3.9/threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
Running kubectl get crds -o=custom-columns=name:.metadata.name on a fresh env results in just one line of output:
(⎈ |minikube:default) ~/p/w/clickhouse-operator ❯❯❯ kubectl get crds -o=custom-columns=name:.metadata.name secure_shard_communications ✭ ✱
name
Ah, I thought the
13ms [bash] bash: kube_ps1: command not found
wasn't a big deal, but after running bash directly I think it may be causing issues. The error only comes up when I execute bash after sourcing my virtualenv, so the virtualenv seems to be persisting my PS1 from zsh even when I execute bash later.
Well, I unset my PS1 before running the tests and that error is gone, but the tests are still hanging, so perhaps that wasn't the problem. I'm really unsure what to do here. It looks like TestFlows isn't very actively developed or used either, so I'm not sure how likely it is that this is a bug.
OK, I root-caused it. The test shell expects the prompt to match a particular form, which is why things are busted. I was able to tweak some things to get it past the spot where it was hanging, which was probably caused by the PS1 issue I mentioned before.
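A simplified sketch of why a prompt mismatch shows up as a hang rather than an error. This is not TestFlows' actual implementation; `read_until_prompt` and the prompt pattern are hypothetical. The idea is that a wrapper around an interactive shell decides a command has finished only when it sees a line matching the prompt it expects:

```python
import re


def read_until_prompt(lines, prompt_pattern):
    """Collect output lines until one matches the expected prompt.

    Returns the lines seen before the prompt, or None if the prompt never
    appears. A live reader would keep waiting at that point, i.e. hang.
    """
    prompt = re.compile(prompt_pattern)
    output = []
    for line in lines:
        if prompt.search(line):
            return output
        output.append(line)
    return None
```

With an expected pattern such as `^bash# $`, output ending in `bash# ` is returned normally, while output ending in a customized prompt (for example one inherited from a zsh PS1) never matches, so the caller keeps waiting.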
It's still getting stuck, just somewhere different: this time it's stuck on either deleting the namespace or creating it. @Slach has there been any discussion on your team about having a CI setup so people could test their changes in an environment that's known to work? At the moment I'm not very confident I can get TestFlows working properly locally.
Atm I'm setting up the python env in a VM since it seems like it really dislikes my Mac shell environment.
Alrighty, I got a local test setup working (built a custom docker image to run tests with) and got tests working. Please take a look.
OK, it looks like your default shell is zsh; we usually run tests under bash in --native mode. @vzakaznikov just FYI, as the TestFlows author.
@Slach Yeah, it's the default on Mac, it seemed to be correctly opening bash, but I got it all sorted out. Running it in docker worked well.
Let me know if there's anything else this PR needs. I'm really going to need this so we can roll out ClickHouse with TLS in our environments.
@Slach I also want to be able to configure the cluster secret, or an inter-node user/password, as a way to authenticate inter-node communication, e.g.: https://github.com/ClickHouse/ClickHouse/blob/b29e877f269e84ae452c446e70b406a695863470/tests/integration/test_distributed_inter_server_secret/configs/remote_servers_n1.xml#L4
Should that be a separate PR?