hetzner-k3s icon indicating copy to clipboard operation
hetzner-k3s copied to clipboard

status keeps looping forever??

Open codeagencybe opened this issue 1 year ago • 32 comments

I just spin up a small cluster (3 master + 3 workers), all servers created in Hetzner project, all good. But in terminal it keeps rolling like below for nearly 1 hour now. Inside Hetzner itself, it all looks ok and servers are marked "green".

Where can I see why terminal keeps waiting forever or not seeing the real status? Side note: I used the latest stable k3s version (if that matters?) k3s_version: v1.24.4+k3s1

Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master3 to be up...

image

codeagencybe avatar Aug 31 '22 14:08 codeagencybe

Update:

I tried stopping the process and run it again, now it does this:

Server test-cpx21-pool-workerpool-worker3 already exists, skipping.
Server test-cpx21-master1 already exists, skipping.
Server test-cpx21-pool-workerpool-worker2 already exists, skipping.
Server test-cpx21-master2 already exists, skipping.
Server test-cpx21-master3 already exists, skipping.
Server test-cpx21-pool-workerpool-worker1 already exists, skipping.

Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-master2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-master1 to be up...
Waiting for server test-cpx21-master3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker3 to be up...
Waiting for server test-cpx21-pool-workerpool-worker2 to be up...
Waiting for server test-cpx21-pool-workerpool-worker1 to be up...
Waiting for server test-cpx21-master2 to be up...

codeagencybe avatar Aug 31 '22 14:08 codeagencybe

Hi, another user was having the same problem and from an initial investigation it looked like some kind of a problem with SSH connections, probably due to some keys or something like that. I haven't been able to reproduce the issue myself unfortunately and that person decided not to use Kubernetes so they can't help, but maybe you can.

Can you give me more details on how you installed the tool, which OS and architecture and what kind of SSH key you are using?

vitobotta avatar Aug 31 '22 15:08 vitobotta

BTW creating a cluster only takes a few minutes normally :)

vitobotta avatar Aug 31 '22 15:08 vitobotta

@vitobotta Ha yes, I know from previous journeys with k3d tool etc... that it spins up crazy fast but yeah I was like let it have its time, maybe it does got something going on at hetzner etc... causing it to be slower but 1 hour is way too long. Its just from my end to be sure it's not something by accident.

I installed your tool with Brew on a new Macbook pro M1, running macOS 12.5.1 (21G83) I created a new ssh key from Mac terminal with the usual ssh-keygen command and saved it as k3stest.pub inside my home folder.

This is the excerpt from my yaml

cluster_name: test
kubeconfig_path: "./kubeconfig"
k3s_version: v1.24.4+k3s1
public_ssh_key_path: "/Users/codeagency/k3stest.pub"
private_ssh_key_path: "/Users/codeagency/k3stest"
ssh_allowed_networks:
  - 0.0.0.0/0
api_allowed_networks:
  - 0.0.0.0/0

codeagencybe avatar Aug 31 '22 16:08 codeagencybe

I'm having the exact same problem too.

nicolasvac avatar Aug 31 '22 23:08 nicolasvac

Same here on M1

lucianogreiner avatar Sep 01 '22 02:09 lucianogreiner

@nicolasvac Are you also on a Mac with M1 (or M2)? Perhaps there is a relation to this for the issue, so if everybody confirm what kind of chip, Vito might discover something.

codeagencybe avatar Sep 01 '22 08:09 codeagencybe

@nicolasvac Are you also on a Mac with M1 (or M2)? Perhaps there is a relation to this for the issue, so if everybody confirm what kind of chip, Vito might discover something.

Yes, but this also happened on WSL with Windows.

nicolasvac avatar Sep 01 '22 08:09 nicolasvac

@nicolasvac OK dan we cross that theory of Mac M1 🤣

codeagencybe avatar Sep 01 '22 08:09 codeagencybe

Did you guys check (SSH'ing to one of the nodes) if anything from the stuff in the yaml (additional packages) and / or if k3s gets installed?

Privatecoder avatar Sep 01 '22 08:09 Privatecoder

I just deployed a single master, single worker test-cluster using 0.6.3 on macOS 12.3.1 on a M1 machine successfully.

However I did have to add my SSH private key to the keychain before using ssh-add --apple-use-keychain ~/.ssh/[your-private-key].

Without adding it, it did not loop forever but throw an exception:

Waiting for server test-cpx21-master1 to be up...
#<Thread:0x00007fec2c3e3238 /__enclose_io_memfs__/local/lib/hetzner/k3s/cluster.rb:593 run> terminated with exception (report_on_exception is true):
Traceback (most recent call last):
	28: from /__enclose_io_memfs__/local/lib/hetzner/k3s/cluster.rb:593:in `block (2 levels) in wait_for_servers'
	27: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:57:in `wait_for_ssh'
	26: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:110:in `timeout'
	25: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `catch'
	24: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `catch'
	23: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `block in catch'
	22: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:95:in `block in timeout'
	21: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:62:in `block in wait_for_ssh'
	20: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:62:in `loop'
	19: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:63:in `block (2 levels) in wait_for_ssh'
	18: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:84:in `ssh'
	17: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh.rb:254:in `start'
	16: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:71:in `authenticate'
	15: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:71:in `each'
	14: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:87:in `block in authenticate'
	13: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:18:in `authenticate'
	12: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:128:in `each_identity'
	11: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:128:in `each'
	10: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:131:in `block in each_identity'
	 9: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:19:in `block in authenticate'
	 8: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:97:in `authenticate_with'
	 7: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:97:in `each'
	 6: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:100:in `block in authenticate_with'
	 5: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:60:in `authenticate_with_alg'
	 4: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:166:in `sign'
	 3: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:41:in `load_private_key'
	 2: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:50:in `load_data_private_key'
	 1: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:199:in `classify_key'
/__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/ed25519_loader.rb:19:in `raiseUnlessLoaded': OpenSSH keys only supported if ED25519 is available (NotImplementedError)
net-ssh requires the following gems for ed25519 support:
 * ed25519 (>= 1.2, < 2.0)
 * bcrypt_pbkdf (>= 1.0, < 2.0)
See https://github.com/net-ssh/net-ssh/issues/565 for more information
LoadError : "cannot load such file -- ed25519_ref10"
Traceback (most recent call last):
	28: from /__enclose_io_memfs__/local/lib/hetzner/k3s/cluster.rb:593:in `block (2 levels) in wait_for_servers'
	27: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:57:in `wait_for_ssh'
	26: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:110:in `timeout'
	25: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `catch'
	24: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `catch'
	23: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:33:in `block in catch'
	22: from /__enclose_io_memfs__/lib/ruby/2.7.0/timeout.rb:95:in `block in timeout'
	21: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:62:in `block in wait_for_ssh'
	20: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:62:in `loop'
	19: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:63:in `block (2 levels) in wait_for_ssh'
	18: from /__enclose_io_memfs__/local/lib/hetzner/utils.rb:84:in `ssh'
	17: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh.rb:254:in `start'
	16: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:71:in `authenticate'
	15: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:71:in `each'
	14: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/session.rb:87:in `block in authenticate'
	13: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:18:in `authenticate'
	12: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:128:in `each_identity'
	11: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:128:in `each'
	10: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:131:in `block in each_identity'
	 9: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:19:in `block in authenticate'
	 8: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:97:in `authenticate_with'
	 7: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:97:in `each'
	 6: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:100:in `block in authenticate_with'
	 5: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/methods/publickey.rb:60:in `authenticate_with_alg'
	 4: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/key_manager.rb:166:in `sign'
	 3: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:41:in `load_private_key'
	 2: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:50:in `load_data_private_key'
	 1: from /__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/key_factory.rb:199:in `classify_key'
/__enclose_io_memfs__/lib/ruby/gems/2.7.0/gems/net-ssh-7.0.1/lib/net/ssh/authentication/ed25519_loader.rb:19:in `raiseUnlessLoaded': OpenSSH keys only supported if ED25519 is available (NotImplementedError)
net-ssh requires the following gems for ed25519 support:
 * ed25519 (>= 1.2, < 2.0)
 * bcrypt_pbkdf (>= 1.0, < 2.0)
See https://github.com/net-ssh/net-ssh/issues/565 for more information
LoadError : "cannot load such file -- ed25519_ref10"

Privatecoder avatar Sep 01 '22 09:09 Privatecoder

@Privatecoder Seems like its an issue with the type of SSH keygen. As the error message refers to OpenSSH keys only supported if ED25519 is available (NotImplementedError)

codeagencybe avatar Sep 01 '22 14:09 codeagencybe

Well I did just test the tool on my end to see if I could help here, but it creates the cluster flawlessly.

However I encountered the issue above when running the tool for the first time, which did go away after adding the SSH-keys to the macOS keychain.

Not sure why not adding my key to the macOS keychain did throw that particular exception, but it did and I wanted to share this here in case somebody encounters the same issue.

Privatecoder avatar Sep 01 '22 14:09 Privatecoder

Weird, I have never added my SSH key to keychain and I have never been able to reproduce these issues. So annoying....

Tomorrow (time permitting) I will try to produce prerelease binaries with some more debug logging, hoping it might help to find the cause of these issues.

In the meantime, since the standalone binaries are a pretty new thing and might still have issues, do you have a chance to test creating clusters with the regular Ruby gem using a normal Ruby 2.7.1 environment instead of the standalone binaries? Or with Docker if you cannot set up Ruby. Just to see if the problem occurs only with the standalone binaries.

I wish I were able to reproduce the issue here, it would be much easier to investigate.

vitobotta avatar Sep 01 '22 20:09 vitobotta

I actually did a release already, 0.6.4, which allows you to enable debug logging for SSH connections by setting the environment variable to true

export SSH_DEBUG=true

Run this before creating the cluster and it should produce quite a lot of output which may help find the cause of the issue.

The gem, the binaries, the Homebrew formula and the Docker image have been updated already.

Thanks in advance for checking it out.

vitobotta avatar Sep 01 '22 20:09 vitobotta

@vitobotta

I just upgraded to 0.6.4 (Homebrew), ran the export command and then the create cluster command but I still get same error, but I also don't get any extra debug output.

Am I missing something or doing something bluntly wrong? Or does it require an extra param in the create cluster? I did check the documentation/readme but nothing specific mentioned about such param.

╰─ export SSH_DEBUG=true ─╯ ╰─ hetzner-k3s create-cluster --config-file cluster_config.yaml ─╯

codeagencybe avatar Sep 02 '22 03:09 codeagencybe

@vitobotta Please ignore my last message, seems like I screwed up. Homebrew did not properly upgrade.

I'm getting extra output now, but holy snike it is awefully lots of verbose output. But I'm seeing that it's repeating constantly this message:

SSH CONNECTION DEBUG: Connection refused - connect(2) for xxx.xxx.xxx.xxx:22 So there is either something wrong the SSH or there is something wrong with firewall.

codeagencybe avatar Sep 02 '22 03:09 codeagencybe

update: it's definetely something related to the SSH key. When I try to manually ssh into one of the nodes, it's also connection refused.

In Hetzner > project > secret, the ssh key is created though, so it does exist there. I only don't know if this problem is now related to specific type of ssh key? or is there something else happening that is not creating the key inside each node so basically it turns impossible to ssh into any node.

I wanted to test with an existing key from my Termius account which I use by default from my Ubuntu PC and Windows PC, but I haven't found yet how I can get it on my Mac itself from the Termius "cloud". I can't find any path either where Termius is storing it's ssh keys for syncing.

I'll try to find some more info and see if I can get it working.

codeagencybe avatar Sep 02 '22 03:09 codeagencybe

@Privatecoder @codeagencybe are you using keys with a passphrase?

vitobotta avatar Sep 02 '22 16:09 vitobotta

I am not however adding it to the keychain fixes the error

# add keys to keychain
# ssh-add --apple-use-keychain ~/.ssh/[your-private-key] (>=12.x)
# ssh-add -K ~/.ssh/[your-private-key] (<12.x)

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

@vitobotta is it possible that the Ubuntu image used by Hetzner to setup the machine has now changed to use ed25519 and therefore also your Docker images etc are affected when setting up new Clusters (because old ones wouldn't be affected because they're still using RSA)?

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

I don't know, but the weird thing is that if I use the Ruby gem instead of the standalone binary, I don't get the error about dependencies.

vitobotta avatar Sep 02 '22 16:09 vitobotta

Can you please try with the Ruby gem instead of the binaries, and without that workaround you found? I want to see if you get the same problem with the gem directly.

vitobotta avatar Sep 02 '22 16:09 vitobotta

I will. Let me restart and see if the error still occours and then try the gem. One sec..

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

Restarting and removing commenting out my ssh-config brings back the issue:

#Host *
#  UseKeychain yes
#  AddKeysToAgent yes
#  IdentityFile ~/.ssh/id_rsa

Let's see what the gem does..

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

Thanks, looking forward to an update!

vitobotta avatar Sep 02 '22 16:09 vitobotta

@vitobotta removed hetzner-k3s via brew uninstall however it is stll working. Do you know where the bin installed by brew is located?

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

found it

Privatecoder avatar Sep 02 '22 16:09 Privatecoder

On M1 it's in /opt/homebrew/bin/hetzner-k3s - on Intel it's somewhere in /usr/local/

vitobotta avatar Sep 02 '22 16:09 vitobotta

what a pain to set-up ruby 2.7 on macOS and adjusting all paths.. "unfortunately" the gem works fine (i.e. doesn't throw the error about ed25519) @vitobotta

Privatecoder avatar Sep 02 '22 17:09 Privatecoder