[#40] Allow SSHKit.run executing on hosts in parallel

Open seungjin opened this issue 6 years ago • 7 comments

Description

SSHKit.run can execute commands either sequentially or in parallel. SSHKit.run(context, cmd, :sequential, 1000) runs cmd on the hosts one after another, with a timeout of 1000 ms. SSHKit.run(context, cmd, :parallel, 1000) runs cmd on the hosts in parallel, with a timeout of 1000 ms for each command.

Motivation and Context

Please refer to https://github.com/bitcrowd/sshkit.ex/issues/40

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • [ ] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes (with unit and/or functional tests).
  • [ ] I have added a note to CHANGELOG.md if necessary (in the ## master section).

seungjin avatar Aug 30 '18 14:08 seungjin

  • [x] I have added tests to cover my changes (with unit and/or functional tests).

functional test added: https://github.com/bitcrowd/sshkit.ex/pull/131/commits/0d239f68217d6807782cd02fd34c44791994ce62 https://github.com/bitcrowd/sshkit.ex/pull/131/commits/7c84642a0e7a3c95c873de6baa279b09f4f7abe3

The test runs 'sleep 2' over 2 SSH sessions and checks that both return in less than 4 seconds.

seungjin avatar Sep 03 '18 17:09 seungjin

Shouldn't this kind of parallelization be a client concern?

lessless avatar Apr 19 '19 11:04 lessless

Hi @lessless, thanks a lot for bringing some life to this PR again 🌱💧

I think this is a very valid question that we have also asked ourselves. On the one hand, it's of course rather simple to achieve parallel execution even if the library does not provide it. On the other hand, it looks like such a common use case that it seems nice if we can avoid everyone having to re-roll their own copy of parallel task execution.
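For illustration, the client-side approach could look roughly like the sketch below. It is only a sketch under stated assumptions: run_on_host/2 is a hypothetical stand-in for whatever runs a command on a single host (it is not part of SSHKit), and the result shape is made up for the example. The concurrency itself uses Task.async_stream from the Elixir standard library.

```elixir
defmodule ParallelSketch do
  # Hypothetical stand-in for running `cmd` on one host over SSH.
  # A real client would call into its SSH library here; this stub
  # only simulates some latency and returns a made-up result shape.
  def run_on_host(host, cmd) do
    Process.sleep(100)
    {host, {:ok, "ran: " <> cmd}}
  end

  # Runs `cmd` on every host concurrently. `timeout` (in ms) applies to
  # each host separately. Results are returned in the order of `hosts`.
  def run_parallel(hosts, cmd, timeout \\ 1_000) do
    hosts
    |> Task.async_stream(&run_on_host(&1, cmd),
      max_concurrency: max(length(hosts), 1),
      timeout: timeout,
      on_timeout: :kill_task
    )
    |> Enum.zip(hosts)
    |> Enum.map(fn
      # The task finished in time: pass its result through.
      {{:ok, result}, _host} -> result
      # The task was killed (e.g. on timeout): report it per host.
      {{:exit, reason}, host} -> {host, {:error, reason}}
    end)
  end
end

results = ParallelSketch.run_parallel(["host-a", "host-b"], "uptime")
```

Every caller that needs this ends up writing some variation of the same dozen lines, which is the duplication the paragraph above refers to.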

We ourselves usually deploy to at least two machines for load-balancing and I assume there's really a lot of people out there who'll have similar needs. I personally think this (optimizing for the common use case) is a good reason for having it in the library.

That said, we're super happy to hear and consider different opinions anytime though so please don't hesitate to share your thoughts! 💚

pmeinhardt avatar Apr 21 '19 19:04 pmeinhardt

At the same time, sorry @seungjin for the long silence 🙇

I had another look at how we could verify that connections are established in parallel. Here's a thought:

  1. Start 2 different sshd Docker containers via the @boot tag
  2. Run something like sleep 5 on both of these hosts via SSH using :parallel mode (the command just needs to keep the connection open long enough, but we'll kill it as soon as we detect success)
  3. In parallel, run docker top [CONTAINER] for both hosts until we see a connection on both of them at the same time
  4. Once we do see a session on both hosts, kill the containers, (prematurely) terminating the sleep so we're not wasting any time.

Below is a rough outline visualizing what this looks like:

[Image: session]

The left pane is running the SSH server. The top-right pane just sets up a test user. Both of these steps are taken care of by the @boot tag. The pane in the middle was used to open an SSH connection. The bottom-right pane shows the docker top output, first with the session active, then without the SSH connection. We'd basically be looking for something like the first output on two containers "at the same time" to have a successful test.

Are you by any chance still up for working on this?

Again, sorry for the long radio silence and thanks a lot for your contribution ❤️

pmeinhardt avatar Apr 21 '19 19:04 pmeinhardt

In case you decide to move forward with this, one concern is properly propagating exit codes and the associated output back to the caller in case of an error. I’m not sure whether I’m late with this, because I haven’t looked at the code yet; we are just exploring our options with SSHKit for now. One of the drawbacks of doing things in parallel is more involved retry and recovery logic. Could you please shed some light on how you are dealing with that, or on what options will be on the table after merging this PR?

Many thanks!

lessless avatar Apr 22 '19 07:04 lessless

Hey @lessless, there's currently no plan to integrate retry into the package – it seems to me like this would be highly application/use-case dependent. Any SSHKit.run and SSHKit.upload/download calls return a list of statuses and stdout/stderr aggregates, one per host. So to retry/recover after an error, these are the places to look 🙂

If you have additional questions, maybe we should move these into a separate issue. I'd definitely be happy to hear more about your use case. We're still working on improving SSHKit's interface and design so this kind of feedback is incredibly helpful ✌️

pmeinhardt avatar Apr 25 '19 06:04 pmeinhardt

@pmeinhardt it would be amazing if we could bounce a few ideas around, but I'm not sure whether opening a new issue for it is the right thing to do.

My use case in a nutshell is a parallel try/rescue:

  • run commands in parallel on multiple nodes
  • catch exit statuses
  • revert changes if needed
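
The three steps above could be sketched as follows. This is a rough illustration, not SSHKit's API: apply_fun and revert_fun are hypothetical caller-supplied functions that take a host, and apply_fun is assumed to return an integer exit status, with 0 meaning success. Reverting every host on any failure is just one possible policy.

```elixir
defmodule RollbackSketch do
  # Runs `apply_fun` on all hosts in parallel and collects one status
  # per host. If any host fails or times out, `revert_fun` is called
  # for every host. Both functions are hypothetical placeholders for
  # whatever actually talks to the remote machines.
  def run_with_rollback(hosts, apply_fun, revert_fun, timeout \\ 5_000) do
    statuses =
      hosts
      |> Task.async_stream(apply_fun,
        max_concurrency: max(length(hosts), 1),
        timeout: timeout,
        on_timeout: :kill_task
      )
      |> Enum.zip(hosts)
      |> Enum.map(fn
        {{:ok, status}, host} -> {host, status}
        {{:exit, reason}, host} -> {host, {:exit, reason}}
      end)

    if Enum.all?(statuses, fn {_host, status} -> status == 0 end) do
      {:ok, statuses}
    else
      # Assumed policy: revert everywhere so all hosts end up in the same state.
      Enum.each(hosts, revert_fun)
      {:error, statuses}
    end
  end
end
```

The caller gets the per-host statuses back in both cases, so it can still decide on retries or log which node failed.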

Here is a more detailed description of the idea: https://elixirforum.com/t/call-for-feedback-multi-node-deployment-tool-edeliver-2-0-design/21747. It would be great if you could take a look at it and let us know how we can use SSHKit to implement that design, and which pitfalls and obstacles to avoid.

I'm also @lessless on Elixir Slack, in case that's easier.

Cheers!

lessless avatar Apr 28 '19 10:04 lessless