Add support for setting secure communication between clickhouse instances

Open chancez opened this issue 3 years ago • 17 comments

  • [x] All commits in the PR are squashed. More info
  • [x] The PR is made into dedicated next-release branch, not into master branch. More info
  • [x] The PR is signed. More info

Fixes #668

Verified this works when configuring clickhouse to use TLS and with no unencrypted ports.

A snippet of my CHI:

  templates:
    hostTemplates:
    - name: host-template
      spec:
        {{- if .Values.clickhouse.tls.enabled }}
        tcpPort: 9440
        secure: true
        httpPort: 8443
        interserverHTTPPort: 9010
        {{- else }}
        tcpPort: 9000
        httpPort: 8123
        interserverHTTPPort: 9009
        {{- end }}

chancez avatar May 19 '22 23:05 chancez

It would be ideal to add a test for that, but even an example manifest that shows the use of this feature would help.

Sure, I can provide an example, and see if I can write a test. I didn't see much regarding unit tests so I wasn't sure where the proper place to test this would be.

Maybe some extra automation may be useful as well, e.g. automatically change default ports to secure when secure flag is used.

That would be nice, but it seems out of scope for this PR and more like a separate feature. I'm currently satisfied with being able to control the configuration directly, with less magic. The reason is that similar functionality for setting the ports already exists (in chop-generated-ports.xml), and it currently conflicts a bit with trying to disable the non _secure ports (though disabling these ports still works, thankfully).
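
For reference, the automation suggested above could look roughly like the sketch below: choose default ports from the secure flag, then let explicit settings win. It is written in Python purely as an illustration and is not operator code. The function name `resolve_ports` and the override behaviour are assumptions; the port numbers are the ones from the CHI snippet in this PR.

```python
from typing import Dict, Optional

# Port numbers taken from the CHI snippet earlier in this PR.
PLAIN_PORTS = {"tcpPort": 9000, "httpPort": 8123, "interserverHTTPPort": 9009}
SECURE_PORTS = {"tcpPort": 9440, "httpPort": 8443, "interserverHTTPPort": 9010}


def resolve_ports(secure: bool, overrides: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Hypothetical helper: pick defaults from the secure flag, then apply overrides."""
    ports = dict(SECURE_PORTS if secure else PLAIN_PORTS)
    for name, value in (overrides or {}).items():
        if name not in ports:
            raise KeyError(f"unknown port field: {name}")
        ports[name] = value  # an explicitly configured port always wins
    return ports
```

Keeping the overrides explicit preserves the "less magic" behaviour: `resolve_ports(True, {"httpPort": 9443})` only changes the HTTP port and leaves the other secure defaults alone.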

chancez avatar May 20 '22 15:05 chancez

Here's an example based on what we're using to deploy clickhouse:

https://gist.github.com/chancez/3da4b1df4f9942d2a260360a6a762912

chancez avatar May 20 '22 15:05 chancez

For testing: It shouldn't be terribly difficult to test using TLS, but it requires a bit of effort. I'd need some guidance on how you would expect certificates to be generated for testing.

In our project, we deploy clickhouse with TLS by using cert-manager in the cluster to issue certs. Alternatively, we could use openssl, though I'm not familiar with the python crypto APIs at all, so I'd prefer to just exec openssl to generate certificates for tests.

Once certs are generated, you just need to create a secret for them and then deploy using the example CHI resource I provided.
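
A minimal sketch of the "exec openssl, then create a secret" flow described above, assuming a self-signed certificate is good enough for tests. The helper names, file names, and the use of a `kubernetes.io/tls` secret are assumptions for illustration; the example CHI in the gist may expect a different secret layout.

```python
import subprocess
from pathlib import Path
from typing import List


def openssl_selfsigned_argv(cn: str, key: Path, cert: Path, days: int = 1) -> List[str]:
    """Build an `openssl req` command that writes a self-signed cert and its key."""
    return [
        "openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
        "-keyout", str(key), "-out", str(cert),
        "-days", str(days), "-subj", f"/CN={cn}",
    ]


def kubectl_tls_secret_argv(name: str, namespace: str, key: Path, cert: Path) -> List[str]:
    """Build a `kubectl create secret tls` command for the generated files."""
    return [
        "kubectl", "create", "secret", "tls", name,
        "--namespace", namespace, f"--cert={cert}", f"--key={key}",
    ]


def generate_and_store(cn: str, secret_name: str, namespace: str, workdir: Path) -> None:
    """Run both commands; needs openssl and kubectl on PATH and a reachable cluster."""
    key, cert = workdir / "tls.key", workdir / "tls.crt"
    subprocess.run(openssl_selfsigned_argv(cn, key, cert), check=True)
    subprocess.run(kubectl_tls_secret_argv(secret_name, namespace, key, cert), check=True)
```

Building the argument lists separately from running them keeps the command construction checkable without a cluster.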

chancez avatar May 20 '22 15:05 chancez

Currently working on the tests. I have a minimal working example that I'm porting to tests atm.

chancez avatar May 20 '22 16:05 chancez

I "tried" to write some tests, but I'm having a hard time getting the test image to build via ./tests/image/build_docker.sh. It seems to fail in a different place each time, sometimes while pulling from quay, but most recently on:

Package docker-ce is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
However the following packages replace it:
  docker-ce-cli:amd64

Also, it doesn't seem like the test image is multi-arch, so I might have issues regardless, as I'm on an m1 Mac.

I've pushed my attempt at some tests, but unfortunately it doesn't look like CI runs tests automatically, so I can't iterate that way either.

chancez avatar May 20 '22 19:05 chancez

Ah I just realized the reason the test image doesn't build is because it's using amd64 repos (and binaries) but I'm pulling the arm64 image. I guess I just have to update it to work for both.
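
As an illustration of what "work for both" involves, the sketch below maps the machine name reported by the host to the architecture names used by package repositories (`amd64`, `arm64`). The helper `deb_arch` is hypothetical and is not part of the test image build; the real fix would live in the image build files, whose contents are not shown in this thread.

```python
import platform

# Common machine names and the package-repository architecture they correspond to.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def deb_arch(machine: str = "") -> str:
    """Hypothetical helper: normalize a machine name (default: this host's)."""
    machine = (machine or platform.machine()).lower()
    try:
        return _ARCH_ALIASES[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None
```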

chancez avatar May 20 '22 22:05 chancez

You can develop tests on an arm64 platform without building the image.

Just set up minikube, python3, python3-pip, and python3-venv, then run:

minikube start
python3 -m venv ~/operator-venv/
~/operator-venv/bin/pip3 install -U -r ./tests/image/requirements.txt
~/operator-venv/bin/python3 ./tests/regression.py --only="/regression/e2e.test_operator/your_test*" --native

--native mode means docker-compose is not used; the tests run against the installed minikube through the default KUBECONFIG.

Slach avatar May 21 '22 02:05 Slach

Gotcha. I'm still having issues running tests:

(.venv) (⎈ |minikube:default) ~/p/w/clickhouse-operator ❮❮❮ python3 ./tests/regression.py --only "/regression/e2e.test_operator/test_001*" --native
May 23,2022 10:57:19   ⟥  Suite regression
                            ClickHouse Operator test regression suite.
                            Attributes
                              native
                                True
                              keeper_type
                                zookeeper
                            Specifications
                              QA-SRS026 ClickHouse Operator
May 23,2022 10:57:19     ⟥  Feature e2e.test_operator
                              Requirements
                                RQ.SRS-026.ClickHouseOperator.CustomResource.APIVersion
                                  version 1.0
May 23,2022 10:57:19       ⟥  Given Clean namespace test, flags:MANDATORY
                13ms            [bash] 
                13ms            [bash] The default interactive shell is now zsh.
                13ms            [bash] To update your account to use zsh, please run `chsh -s /bin/zsh`.
                13ms            [bash] For more details, please visit https://support.apple.com/kb/HT208050.
                13ms            [bash] bash: kube_ps1: command not found

It just hangs there.

chancez avatar May 23 '22 17:05 chancez

Here's the stack trace when I cancel it: it seems to be stuck on getting crds:

May 23,2022 11:02:50       ⟥  Given Clean namespace test, flags:MANDATORY
                11ms            [bash] 
                11ms            [bash] The default interactive shell is now zsh.
                11ms            [bash] To update your account to use zsh, please run `chsh -s /bin/zsh`.
                11ms            [bash] For more details, please visit https://support.apple.com/kb/HT208050.
                11ms            [bash] bash: kube_ps1: command not found
^C           10s 382ms       ⟥    Exception: Traceback (most recent call last):
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 62, in <module>
                                      regression()
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 55, in regression
                                      run_features()
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/./tests/regression.py", line 50, in run_features
                                      Feature(run=load(feature_name, "test"))
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/test_operator.py", line 2210, in test
                                      util.clean_namespace(delete_chi=True)
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/util.py", line 165, in clean_namespace
                                      kubectl.delete_all_chi(settings.test_namespace)
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/kubectl.py", line 68, in delete_all_chi
                                      crds = launch("get crds -o=custom-columns=name:.metadata.name", ns=ns).splitlines()
                                    File "/Users/chancezibolski/projects/work/clickhouse-operator/tests/e2e/kubectl.py", line 40, in launch
                                      cmd = shell(cmd, timeout=timeout)
                                    File "/Users/chancezibolski/.asdf/installs/python/3.9.13/lib/python3.9/queue.py", line 180, in get
                                      self.not_empty.wait(remaining)
                                    File "/Users/chancezibolski/.asdf/installs/python/3.9.13/lib/python3.9/threading.py", line 316, in wait
                                      gotit = waiter.acquire(True, timeout)
                                  KeyboardInterrupt

Running kubectl get crds -o=custom-columns=name:.metadata.name on a fresh env results in just one line of output:

(⎈ |minikube:default) ~/p/w/clickhouse-operator ❯❯❯ kubectl get crds -o=custom-columns=name:.metadata.name                                secure_shard_communications ✭ ✱
name
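
When debugging a hang like this, it can help to run the same command outside the test harness with a hard timeout, so a stuck call becomes an error instead of a silent wait. The sketch below uses plain `subprocess` and is not how the test suite's `launch` helper works; `run_with_timeout` is a hypothetical name.

```python
import subprocess
import sys
from typing import List


def run_with_timeout(argv: List[str], timeout_s: float) -> str:
    """Run a command and return its stdout; raise TimeoutError instead of hanging."""
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired as exc:
        raise TimeoutError(f"{argv[0]} did not finish within {timeout_s}s") from exc
    return done.stdout
```

For example, `run_with_timeout(["kubectl", "get", "crds", "-o=custom-columns=name:.metadata.name"], 30)` would show whether kubectl itself is slow or whether the hang is in the harness around it.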

chancez avatar May 23 '22 18:05 chancez

Ah, I thought the

                13ms            [bash] bash: kube_ps1: command not found

wasn't a big deal, but after running bash directly I think it may be causing issues. That error only comes up in bash, when I execute bash after sourcing my virtualenv, so the virtualenv seems to be persisting my PS1 from zsh even when I execute bash later.

chancez avatar May 23 '22 18:05 chancez

Well, I unset my PS1 before running tests and that error is gone, but the tests are still hanging, so perhaps that wasn't the problem. I'm really unsure what to do here. It looks like TestFlows isn't super actively developed or used either, so I'm not sure how likely it is that this is a bug in it.

chancez avatar May 23 '22 18:05 chancez

Ok, I root-caused it. It expects the shell prompt to match, which is why things are busted, probably because of the PS1 issue I mentioned before. I was able to tweak some stuff to get it past the spot where it was hanging.
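
One way to take the user's shell configuration out of the picture is to start bash with a fixed prompt and without startup files. The sketch below only prepares the environment and command line; whether TestFlows honours an inherited PS1 is an assumption, and the helper names and the prompt string are made up for illustration.

```python
import os
from typing import Dict, List, Optional


def clean_bash_env(base: Optional[Dict[str, str]] = None, prompt: str = "bash# ") -> Dict[str, str]:
    """Copy the environment, drop prompt-related variables, and pin PS1."""
    env = dict(os.environ if base is None else base)
    # These can make bash run extra commands or source extra files at startup.
    for name in ("PROMPT_COMMAND", "BASH_ENV", "ENV"):
        env.pop(name, None)
    env["PS1"] = prompt  # replaces e.g. a zsh prompt that calls kube_ps1
    return env


def clean_bash_argv() -> List[str]:
    """bash invocation that skips profile and rc files."""
    return ["bash", "--noprofile", "--norc"]
```

A caller would pass both to the process launcher, for example `subprocess.Popen(clean_bash_argv(), env=clean_bash_env())`.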

chancez avatar May 23 '22 18:05 chancez

It's still getting stuck, just somewhere different: this time it's stuck on either deleting the namespace or creating it. @Slach has there been any discussion on your team about having a CI setup so people could test their changes with a setup that's known to work? At the moment I'm not super confident in getting testflows to work properly for me locally.

Atm I'm setting up the python env in a VM since it seems like it really dislikes my Mac shell environment.

chancez avatar May 23 '22 19:05 chancez

Alrighty, I got a local test setup working (built a custom docker image to run tests with) and got tests working. Please take a look.

chancez avatar May 23 '22 22:05 chancez

OK. It looks like your default shell is zsh; we usually run tests under bash in --native mode. @vzakaznikov JSFYI, as the TestFlows author.

Slach avatar May 24 '22 01:05 Slach

@Slach Yeah, it's the default on Mac, it seemed to be correctly opening bash, but I got it all sorted out. Running it in docker worked well.

Let me know if there's anything else this PR needs. I'm really going to need this so we can roll out Clickhouse with TLS in our environments.

chancez avatar May 25 '22 16:05 chancez

@Slach I want to be able to configure the cluster secret or an inter-node user/password as a way to authenticate inter-node communication, e.g.: https://github.com/ClickHouse/ClickHouse/blob/b29e877f269e84ae452c446e70b406a695863470/tests/integration/test_distributed_inter_server_secret/configs/remote_servers_n1.xml#L4

Should that be a separate PR?
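
For context, a rough sketch of the kind of `remote_servers` fragment this would need to produce, with a `<secret>` element inside the cluster definition. The cluster name, host names, and the function itself are placeholders, the `<clickhouse>` root is an assumption (older releases use `<yandex>`), and this is not how the operator generates its configuration.

```python
import xml.etree.ElementTree as ET
from typing import List


def remote_servers_xml(cluster: str, secret: str, hosts: List[str], port: int = 9440) -> str:
    """Hypothetical helper: one shard, one replica per host, shared cluster secret."""
    root = ET.Element("clickhouse")
    servers = ET.SubElement(root, "remote_servers")
    cluster_el = ET.SubElement(servers, cluster)
    ET.SubElement(cluster_el, "secret").text = secret
    shard = ET.SubElement(cluster_el, "shard")
    for host in hosts:
        replica = ET.SubElement(shard, "replica")
        ET.SubElement(replica, "host").text = host
        ET.SubElement(replica, "port").text = str(port)
        ET.SubElement(replica, "secure").text = "1"  # use the TLS native port
    return ET.tostring(root, encoding="unicode")
```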

chancez avatar May 25 '22 21:05 chancez