The `login` action is stuck when called separately
When the login step is called in a separate command after the guest has been provisioned, the connection seems to be stuck. No input or output is shown and it's not possible to disconnect from the guest. Steps to reproduce:
```shell
tmt run provision -h virtual
tmt run --last login
```
Seems the problem was introduced in d549bd3aa3e2eb5e3e8ba984c1e293f97409253e. @happz, any idea what might be the cause here?
Hm, the difference in the master socket paths is this:
/run/user/12559/tmt-default-0
/run/user/12559/tmt
I guess there might be still collisions across multiple runs if we do not include any run-specific string in the path, right?
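One way to rule out such collisions would be deriving the socket path from a run-specific string. A minimal sketch of the idea, assuming hypothetical names (the helper and its hashing scheme are not tmt's actual API):

```python
import hashlib
from pathlib import Path

def master_socket_path(runtime_dir: Path, run_id: str,
                       host: str, port: int, user: str) -> Path:
    """Derive an SSH master socket path that cannot collide across runs.

    Hypothetical helper: mixing a run-specific string (here the run id)
    into the path means two concurrent runs targeting the same guest
    never share a socket. A short hash also keeps the path under the
    Unix socket length limit (~104 bytes on many platforms).
    """
    digest = hashlib.sha1(f"{run_id}-{host}-{port}-{user}".encode()).hexdigest()[:16]
    return runtime_dir / f"tmt-{digest}.socket"

# Two different runs against the same guest get distinct sockets:
a = master_socket_path(Path("/run/user/1000"), "run-001", "127.0.0.1", 10022, "root")
b = master_socket_path(Path("/run/user/1000"), "run-002", "127.0.0.1", 10022, "root")
assert a != b
```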
These paths should no longer appear, current tmt uses plan’s workdir.
Right, so the path should not be the problem. Here's an example ssh command:

```shell
ssh -vvv -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10022 -i /var/tmp/tmt/run-128/default/plan/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-128/ssh-sockets/127.0.0.1-10022-root.socket -t root@127.0.0.1
```
Confirmed to be stuck when run from the same terminal, but works fine if run from another terminal. When dropping the -S option the connection works without a problem.
It might be some race condition, the socket being removed or something similar. provision does remove the sockets when leaving, and I'm not sure how that interacts with --last.
The same problem appears when using -i instead of --last, e.g.

```shell
tmt run -i run-017 login
```

A second run of `tmt run -i run-017 login`, while the first one is still running, works as expected.
~~I'm afraid it works for me:~~
Nevermind, it does not work, and I'm able to reproduce the behavior:
(dev) [pts-6:0]: happz@multivac [main] ~/git/tmt $ tmt run provision -h virtual plan -n /plans/features/core
/var/tmp/tmt/run-175
/plans/features/core
provision
queued provision.provision task #1: default-0
provision.provision task #1: default-0
how: virtual
memory: 2048 MB
disk: 40 GB
progress: booting...
multihost name: default-0
arch: x86_64
distro: Fedora Linux 41 (Cloud Edition)
summary: 1 guest provisioned
(dev) [pts-6:0]: happz@multivac [main] ~/git/tmt $ tmt run --last login
/var/tmp/tmt/run-175
/plans/features/core
provision
status: done
summary: 1 guest provisioned
login: Starting interactive shell
[root@default-0 tree]#
16:27:25 Possible SSH master socket path '/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket' (trivial method).
16:27:25 SSH master socket path will be '/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket' (trivial method).
16:27:25 Spawning the SSH master process: ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10100 -i /var/tmp/tmt/run-175/plans/features/core/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10100 -i /var/tmp/tmt/run-175/plans/features/core/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket -MNnT root@127.0.0.1
...
16:28:00 Possible SSH master socket path '/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket' (trivial method).
16:28:00 SSH master socket path will be '/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket' (trivial method).
16:28:00 Spawning the SSH master process: ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10100 -i /var/tmp/tmt/run-175/plans/features/core/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10100 -i /var/tmp/tmt/run-175/plans/features/core/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket -MNnT root@127.0.0.1
...
16:28:00 Run command: ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -p10100 -i /var/tmp/tmt/run-175/plans/features/core/provision/default-0/id_ecdsa -oPasswordAuthentication=no -S/var/tmp/tmt/run-175/ssh-sockets/127.0.0.1-10100-root.socket -t root@127.0.0.1 'export TMT_PLAN_DATA=/var/tmp/tmt/run-175/plans/features/core/data; export TMT_PLAN_ENVIRONMENT_FILE=/var/tmp/tmt/run-175/plans/features/core/data/variables.env; export TMT_TREE=/var/tmp/tmt/run-175/plans/features/core/tree; export TMT_VERSION=1.41.0.dev3+g01d38b80.d20241231; cd /var/tmp/tmt/run-175/plans/features/core/tree; bash'
Can you check and share these SSH-related lines from your logs?
I believe it's a "leaked" SSH master process. The first tmt invocation leaves the guest running, and with that, the SSH master process is also up and running. The second tmt invocation spawns its own SSH master process, using the same socket path (because why wouldn't it, the path is more or less deterministic, so both invocations infer the same one), and I guess that's the problem. The SSH process spawning the remote shell is asked to use the socket path, and we have two master processes "owning" it...
If I kill the first master process, things work as expected.
So, is tmt expected to kill the master process even when the guest is kept running? Or do we want follow-up tmt invocations to reuse the master process spawned by the first tmt invocation? I vote for the former, I would kill the master process, and let the follow-up tmt invocations spawn - and own and clean up eventually - their own. The master process PID is lost anyway, no follow-up process will ever clean it up, because it's not stored anywhere.
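For what it's worth, the master can be told to exit without knowing its PID: OpenSSH's control commands (`ssh -O check`, `ssh -O exit`, see ssh(1)) address the master through the control socket itself, so a follow-up tmt invocation could clean up a leaked master. A hedged sketch, with hypothetical function names (not tmt code):

```python
import subprocess

def master_control_command(socket_path: str, host: str, action: str) -> list[str]:
    """Build an OpenSSH control command ('check', 'exit', ...) for a master socket."""
    return ["ssh", "-S", socket_path, "-O", action, host]

def close_ssh_master(socket_path: str, host: str) -> bool:
    """Terminate a (possibly leaked) SSH master process owning the socket.

    `ssh -O check` reports whether a live master owns the socket;
    `ssh -O exit` asks that master to terminate and remove the socket
    file. No PID is needed, only the socket path. Returns True if a
    master was found and asked to exit.
    """
    check = subprocess.run(master_control_command(socket_path, host, "check"),
                           capture_output=True)
    if check.returncode != 0:
        return False  # no master owns this socket
    subprocess.run(master_control_command(socket_path, host, "exit"),
                   capture_output=True)
    return True
```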
I'm thinking of adding some kind of "soft cleanup" method to our steps, something that would be called when leaving the step, but it would not be expected to clean provisioned guests or whatever else is there to remove after a said step. A method that would be aware that the run may be revisited later, and in case of provision, it would tear down the SSH master socket but not the guest.
(And we're back to adding an extra cleanup step besides finish).
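A minimal sketch of what such a hook split could look like; all class and method names here are hypothetical placeholders, not tmt's actual API:

```python
class Step:
    """Hypothetical base class illustrating the proposed hook split."""

    def finish(self) -> None:
        """Full teardown: remove everything the step created."""

    def soft_cleanup(self) -> None:
        """Release per-invocation resources only; the run may be revisited.

        Default: do nothing. Steps override this when they hold resources
        tied to the current tmt process rather than to the run itself.
        """

class Provision(Step):
    def __init__(self) -> None:
        self.guest_running = True
        self.master_socket_open = True

    def finish(self) -> None:
        self.soft_cleanup()
        self.guest_running = False       # destroy the guest as well

    def soft_cleanup(self) -> None:
        self.master_socket_open = False  # kill only the SSH master

# Leaving the step while keeping the guest for a later --last invocation:
provision = Provision()
provision.soft_cleanup()
assert provision.guest_running and not provision.master_socket_open
```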
I've been playing with ssh on and off for quite some time now and would love to learn your thoughts if any of these are worth pursuing:

- paramiko - it used to be a pain to install on all arches due to the dependency on the cryptography package, but it became the de-facto standard and is available everywhere (RIP ssh2-python rpms). Somehow I don't think it's a good fit for tmt, but what do I know.
- ansible-pylibssh - libssh bindings by Ansible for Ansible. Doesn't tmt basically have the same ssh workflow/environment? Nice API, probably the fastest? I believe Ansible uses this package with a fallback to paramiko when libssh is not available. Could we use the established 'session' in rsync via `-e`?
- Current subprocess ssh exec - a few things I wanted to ask about:
  - How about using `ControlMaster=auto` for opportunistic multiplexing? `ControlPath` could be used in a similar/same way as the current socket path, and `ControlPersist` would define whether to keep the master connection running in the background or not. I'm guessing we could rely on ssh to handle the additional logic/management/cleanup? See `man ssh_config`.
  - Using `SetEnv` instead of running multiple `export key=val` commands on the guest - with tmt using env vars a lot, wouldn't it be nice to use a dedicated ssh option for setting them? I assume the only tangible benefit would be cleaner command strings, but still. Apparently this would require making sure the remote guest's sshd is configured to accept these (same as paramiko's `environment`).

Oh and we should do some performance benchmarking when touching things like ssh, wdyt?
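To make the `ControlMaster=auto` idea concrete, here is a sketch of the option set it would imply. `ControlMaster`, `ControlPath`, and `ControlPersist` are real ssh_config options; the helper itself is hypothetical:

```python
def multiplexing_options(socket_path: str, persist: str = "60s") -> list[str]:
    """SSH options for opportunistic multiplexing (see ssh_config(5)).

    With ControlMaster=auto the first connection becomes the master and
    later ones reuse its socket, so no explicit `ssh -MNnT` spawning step
    is needed. ControlPersist keeps the master alive in the background
    for the given time after the last session closes.
    """
    return [
        "-oControlMaster=auto",
        f"-oControlPath={socket_path}",
        f"-oControlPersist={persist}",
    ]

# ssh expands %C in ControlPath to a hash of local host, remote host,
# port, and user, which keeps the path unique per target:
opts = multiplexing_options("/var/tmp/tmt/run-1/ssh-sockets/%C.socket")
assert "-oControlMaster=auto" in opts
```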
> I've been playing with ssh on and off for quite some time now and would love to learn your thoughts if any of these are worth pursuing:
I don't know what's bothering you about SSH, what would be the benefit. I'm seeing bottlenecks elsewhere, but I might be wrong (it did happen once or twice in the past...). So I plan to play lazy devil's advocate :)
I suppose the question is, why? What will be better if we switch from calling SSH commands to calling methods from paramiko/ansible-pylibssh/etc. What will change if we swap SSH client for another one? tmt can optimize its SSH use, e.g. we discussed turning fact gathering into a single script to save N SSH calls, but that kind of task does not seem like it would fully benefit from switching to a different client, as it would still be tmt running N distinct remote commands one by one.
> - paramiko - it used to be a pain to install on all arches due to the dependency on the cryptography package, but it became the de-facto standard and is available everywhere (RIP ssh2-python rpms). Somehow I don't think it's a good fit for tmt, but what do I know.
> - ansible-pylibssh - libssh bindings by Ansible for Ansible. Doesn't tmt basically have the same ssh workflow/environment? Nice API, probably the fastest? I believe Ansible uses this package with a fallback to paramiko when libssh is not available. Could we use the established 'session' in rsync via `-e`?
> - Current subprocess ssh exec - a few things I wanted to ask about:
>   - How about using `ControlMaster=auto` for opportunistic multiplexing? `ControlPath` could be used in a similar/same way as the current socket path, and `ControlPersist` would define whether to keep the master connection running in the background or not. I'm guessing we could rely on ssh to handle the additional logic/management/cleanup? See `man ssh_config`.
IIUIC, auto opens a shared connection if it does not exist already, which means we could maybe drop the explicit SSH call opening it. But how do we close it when we are done with the guest? It might simplify how tmt establishes the shared connection.
> - Using `SetEnv` instead of running multiple `export key=val` commands on the guest - with tmt using env vars a lot, wouldn't it be nice to use a dedicated ssh option for setting them? I assume the only tangible benefit would be cleaner command strings, but still. Apparently this would require making sure the remote guest's sshd is configured to accept these (same as paramiko's `environment`).
This might be tricky. Cleaner commands, yes, on the other hand, the dirty ones show exactly what has been sent to the guest :) We had some proposals on moving them into test shell wrappers, but that would not help with rsync calls.
The tricky part might be the required refresh of these envvars: essentially clearing all of them and setting new ones, because prepare and execute consume envvars produced by earlier scripts, so the set sent via SetEnv would have to match the desired state every time. In the current setup it's easy, we just dump the various environment mappings into the command. A prepare script may remove envvars from TMT_PLAN_ENVIRONMENT_FILE, and we need to make sure they are truly gone.
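To illustrate why the current approach handles removal for free: the export prefix is rebuilt from the environment mapping for every remote command, so a dropped variable simply stops appearing. A small sketch with a hypothetical helper (not tmt's actual code):

```python
import shlex

def export_prefix(environment: dict[str, str]) -> str:
    """Render an environment mapping as a chain of `export` commands.

    Rebuilding this prefix for every remote command means variables
    removed from the mapping simply vanish from the next command, with
    no explicit unsetting needed. A SetEnv-based approach would have to
    replicate this property when the desired set of variables changes.
    """
    return "".join(
        f"export {name}={shlex.quote(value)}; "
        for name, value in environment.items()
    )

before = export_prefix({"TMT_TREE": "/var/tmp/tmt/run-1/tree", "FOO": "bar"})
after = export_prefix({"TMT_TREE": "/var/tmp/tmt/run-1/tree"})  # FOO dropped by a script
assert "FOO" in before and "FOO" not in after
```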
> Oh and we should do some performance benchmarking when touching things like ssh, wdyt?
Sure, hard data is better than guessing. tmt can already emit some timing info, with --log-topic command-events you should see the duration of commands executed by tmt, including ssh. But that's a very specialized one, and there are plenty of profilers out there we can use.
> I don't know what's bothering you about SSH, what would be the benefit. I'm seeing bottlenecks elsewhere, but I might be wrong (it did happen once or twice in the past...). So I plan to play lazy devil's advocate :)
>
> I suppose the question is, why? What will be better if we switch from calling SSH commands to calling methods from paramiko/ansible-pylibssh/etc. What will change if we swap SSH client for another one?
From my point of view, simply not having to maintain that code. I don't know enough low-level ssh stuff to, for example, debug an issue like this one.
> tmt can optimize its SSH use, e.g. we discussed turning fact gathering into a single script to save N SSH calls, but that kind of task does not seem like it would fully benefit from switching to a different client, as it would still be tmt running N distinct remote commands one by one.
I don't know if pylibssh or other modules would be faster/slower, that's why I wanted to bounce this topic off experienced devs like yourself.
The thinking is:

- What does tmt do differently than Ansible?
- ansible-core being a dependency.
- Your comment above suggesting a move to a session-based approach instead of a socket-based one (iiuic).
- We can benchmark it.
- If I ask about it and receive context for why/why not, I can stop thinking about it and try to familiarize myself with the current implementation.