
CHK resource fails to come up on AWS EKS due to wrong FS permissions on /var/lib/clickhouse-keeper

Open hodgesrm opened this issue 1 year ago • 9 comments

What's wrong:

The CHK example in 02-extended-1-node.yaml fails if you use the Altinity 23.8.8.21 Docker image on EKS. I thought this was an Altinity Stable bug, but it also happens with clickhouse/clickhouse-keeper:23.8.10.43-alpine.

I'm using the 0.23.3 operator installed using helm. I am using Kubernetes 1.26 on AWS EKS installed with our EKS blueprint.

How to Reproduce: Run kubectl apply -f on this resource.

apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: chk-1-node-reduced
spec:
  configuration:
    clusters:
      - name: "reduced-1"
        layout:
          replicasCount: 1
  templates:
    podTemplates:
      - name: default
        spec:
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
    volumeClaimTemplates:
      - name: default
        metadata:
          name: both-paths
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi

What happens:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e1a23b in /usr/bin/clickhouse-keeper
1. DB::ErrnoException::ErrnoException(String const&, int, int, std::optional<String> const&) @ 0x0000000000a2a0b4 in /usr/bin/clickhouse-keeper
2. DB::throwFromErrnoWithPath(String const&, String const&, int, int) @ 0x0000000000e1b7e7 in /usr/bin/clickhouse-keeper
3. DB::WriteBufferFromFile::WriteBufferFromFile(String const&, unsigned long, int, std::shared_ptr<DB::Throttler>, unsigned int, char*, unsigned long) @ 0x0000000000eb220b in /usr/bin/clickhouse-keeper
4. DB::DiskLocal::writeFile(String const&, unsigned long, DB::WriteMode, DB::WriteSettings const&) @ 0x00000000009a82f3 in /usr/bin/clickhouse-keeper
5. DB::ChangelogWriter::setFile(std::shared_ptr<DB::ChangelogFileDescription>, DB::WriteMode) @ 0x000000000079ddca in /usr/bin/clickhouse-keeper
6. DB::ChangelogWriter::rotate(unsigned long) @ 0x000000000079c287 in /usr/bin/clickhouse-keeper
7. DB::Changelog::readChangelogAndInitWriter(unsigned long, unsigned long) @ 0x0000000000799625 in /usr/bin/clickhouse-keeper
8. DB::KeeperLogStore::init(unsigned long, unsigned long) @ 0x00000000007f0817 in /usr/bin/clickhouse-keeper
9. DB::KeeperServer::startup(Poco::Util::AbstractConfiguration const&, bool) @ 0x00000000007f7718 in /usr/bin/clickhouse-keeper
10. DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool, bool, std::shared_ptr<DB::Macros const> const&) @ 0x00000000007d9438 in /usr/bin/clickhouse-keeper
11. DB::Context::initializeKeeperDispatcher(bool) const @ 0x0000000000a48f00 in /usr/bin/clickhouse-keeper
12. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b58fdb in /usr/bin/clickhouse-keeper
13. Poco::Util::Application::run() @ 0x0000000000fbf406 in /usr/bin/clickhouse-keeper
14. DB::Keeper::run() @ 0x0000000000b55e5e in /usr/bin/clickhouse-keeper
15. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000000fd6519 in /usr/bin/clickhouse-keeper
16. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b54dd8 in /usr/bin/clickhouse-keeper
17. main @ 0x0000000000b63bb9 in /usr/bin/clickhouse-keeper
 (version 23.8.10.43 (official build))
2024.03.09 21:01:39.116891 [ 1 ] {} <Debug> KeeperDispatcher: Shutting down storage dispatcher
2024.03.09 21:01:39.116956 [ 1 ] {} <Information> KeeperServer: RAFT doesn't start, shutdown not required
2024.03.09 21:01:39.117825 [ 1 ] {} <Error> void DB::KeeperDispatcher::shutdown(): Code: 49. DB::Exception: Changelog must be initialized before flushing records. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e1a23b in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<char const (&) [54]>(int, char const (&) [54]) @ 0x00000000007a6c80 in /usr/bin/clickhouse-keeper
2. DB::Changelog::flushAsync() @ 0x00000000007a6a48 in /usr/bin/clickhouse-keeper
3. DB::Changelog::flush() @ 0x00000000007a64b6 in /usr/bin/clickhouse-keeper
4. DB::KeeperLogStore::flushChangelogAndShutdown() @ 0x00000000007f1587 in /usr/bin/clickhouse-keeper
5. DB::KeeperDispatcher::shutdown() @ 0x00000000007dcabd in /usr/bin/clickhouse-keeper
6. DB::KeeperDispatcher::~KeeperDispatcher() @ 0x00000000007dda70 in /usr/bin/clickhouse-keeper
7. DB::ContextSharedPart::~ContextSharedPart() @ 0x0000000000a4f18f in /usr/bin/clickhouse-keeper
8. DB::SharedContextHolder::~SharedContextHolder() @ 0x0000000000a46a58 in /usr/bin/clickhouse-keeper
9. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b5bff2 in /usr/bin/clickhouse-keeper
10. Poco::Util::Application::run() @ 0x0000000000fbf406 in /usr/bin/clickhouse-keeper
11. DB::Keeper::run() @ 0x0000000000b55e5e in /usr/bin/clickhouse-keeper
12. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000000fd6519 in /usr/bin/clickhouse-keeper
13. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b54dd8 in /usr/bin/clickhouse-keeper
14. main @ 0x0000000000b63bb9 in /usr/bin/clickhouse-keeper
 (version 23.8.10.43 (official build))
2024.03.09 21:01:39.117849 [ 1 ] {} <Debug> KeeperDispatcher: Dispatcher shut down
2024.03.09 21:01:39.118624 [ 26 ] {} <Trace> BaseDaemon: Received signal 6
2024.03.09 21:01:39.118719 [ 37 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2024.03.09 21:01:39.118762 [ 37 ] {} <Fatal> BaseDaemon: (version 23.8.10.43 (official build), build id: 4642563B164611A8691A973CA4983D140F1A7C08, git hash: a278225bba98c092a9b8101e6c02836bbc4d030b) (from thread 1) Received signal 6
2024.03.09 21:01:39.118787 [ 37 ] {} <Fatal> BaseDaemon: Signal description: Aborted
2024.03.09 21:01:39.118799 [ 37 ] {} <Fatal> BaseDaemon:
2024.03.09 21:01:39.118814 [ 37 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a374b0
2024.03.09 21:01:39.118823 [ 37 ] {} <Fatal> BaseDaemon: ########################################
2024.03.09 21:01:39.118860 [ 37 ] {} <Fatal> BaseDaemon: (version 23.8.10.43 (official build), build id: 4642563B164611A8691A973CA4983D140F1A7C08, git hash: a278225bba98c092a9b8101e6c02836bbc4d030b) (from thread 1) (no query) Received signal Aborted (6)
2024.03.09 21:01:39.118869 [ 37 ] {} <Fatal> BaseDaemon:
2024.03.09 21:01:39.118883 [ 37 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a374b0
2024.03.09 21:01:39.118916 [ 37 ] {} <Fatal> BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x0000000000a374b0 in /usr/bin/clickhouse-keeper
2024.03.09 21:01:39.118929 [ 37 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2024.03.09 21:01:39.118937 [ 37 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues

Possible Root Cause: It appears that the paths under /var/lib/clickhouse-keeper come up with root ownership. I confirmed this by hacking the liveness probe so that I could bring up the pod and check permissions.

Mitigations:

  1. If you can run chown -R clickhouse:clickhouse /var/lib/clickhouse-keeper and delete the pod to make it restart, Keeper comes up fine.

**Notes:** This also fails with clickhouse/clickhouse-keeper:23.8.10.43-alpine.

hodgesrm avatar Mar 09 '24 21:03 hodgesrm

@sunsingerus @alex-zaitsev it looks like the latest ClickHouse versions backported changes to the default user, so we need to change the default security context.

Look at entrypoint.sh in https://github.com/ClickHouse/ClickHouse/tree/master/docker/keeper/

@hodgesrm try adding

  templates:
    podTemplates:
      - name: default
        spec:
          securityContext:
            fsGroup: 101 
            runAsUser: 101 
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable" 
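
For reference, merged into the reproduction manifest from the original report, the suggestion above would look roughly like the sketch below. Every value comes from this thread. The one assumption is that UID/GID 101 matches the clickhouse user inside the Keeper image, so check that against the image you actually run.

```yaml
# Sketch: the reproduction manifest with the suggested securityContext merged in.
# Assumption: UID/GID 101 is the clickhouse user/group inside the Keeper image.
apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: chk-1-node-reduced
spec:
  configuration:
    clusters:
      - name: "reduced-1"
        layout:
          replicasCount: 1
  templates:
    podTemplates:
      - name: default
        spec:
          # Pod-level security context: run as UID 101 and apply GID 101
          # to mounted volumes that support fsGroup ownership changes.
          securityContext:
            fsGroup: 101
            runAsUser: 101
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
    volumeClaimTemplates:
      - name: default
        metadata:
          name: both-paths
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi
```

With fsGroup set, Kubernetes normally adjusts group ownership of the mounted volume when the pod starts, which should let Keeper write under /var/lib/clickhouse-keeper without a manual chown.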

Slach avatar Mar 10 '24 06:03 Slach

Planned for 0.23.4

alex-zaitsev avatar Mar 10 '24 07:03 alex-zaitsev

@Slach your fix works and will hold me until 0.23.4 is available. Thank you both for the quick turnaround.

hodgesrm avatar Mar 10 '24 17:03 hodgesrm

@Slach, why do we not need a securityContext for ClickHouse but need one for ClickHouseKeeper?

alex-zaitsev avatar Apr 17 '24 09:04 alex-zaitsev

@alex-zaitsev I don't know. Maybe the entrypoint.sh in the keeper image is different, or maybe the clickhouse-keeper binary behaves differently during startup.

Slach avatar Apr 17 '24 09:04 Slach

Any plans here?

orloffv avatar Apr 23 '24 05:04 orloffv

@orloffv the workaround is here: https://github.com/Altinity/clickhouse-operator/issues/1370#issuecomment-1987106345

Slach avatar Apr 23 '24 08:04 Slach

Interesting, but it doesn't help me. Only this part does: "If you can run chown -R clickhouse:clickhouse /var/lib/clickhouse-keeper and delete the pod to make it restart, Keeper comes up fine."

orloffv avatar Apr 23 '24 09:04 orloffv

A security context is needed for both CHI and CHK, so the behavior is consistent now. Maybe we need a separate task to add a default security context to images, and also to make sure it is correctly merged with a security context provided explicitly by the user.

alex-zaitsev avatar May 07 '24 10:05 alex-zaitsev