otp icon indicating copy to clipboard operation
otp copied to clipboard

`ssh`: Fix stealing of `'EXIT'` messages if a client is trapping exits

Open Maria-12648430 opened this issue 1 year ago • 6 comments

Fixes #8223.

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them :sweat_smile:

Maria-12648430 avatar Mar 05 '24 14:03 Maria-12648430

CT Test Results

    2 files     29 suites   17m 19s :stopwatch:   460 tests   454 :white_check_mark:  5 :zzz: 1 :x: 1 665 runs  1 639 :white_check_mark: 25 :zzz: 1 :x:

For more details on these failures, see this check.

Results for commit a6482bd4.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

github-actions[bot] avatar Mar 05 '24 14:03 github-actions[bot]

I haven't written any tests for that yet. TBH, I don't know in which of the 20 suites I should put them 😅

I'm glad not to be the only one 😅

  • in case of doubts, I would pick ssh_connection_SUITE as it seems a safe choice as everything in ssh is based on some connections ;-)

u3s avatar Mar 05 '24 15:03 u3s

Added two tests (one for ssh:connect and another for ssh:daemon) in the last commit.

Maria-12648430 avatar Mar 06 '24 13:03 Maria-12648430

@Maria-12648430 I add some thoughts on how I think it should work.

IngelaAndin avatar Mar 15 '24 09:03 IngelaAndin

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

IngelaAndin avatar Mar 15 '24 09:03 IngelaAndin

@Maria-12648430 I add some thoughts on how I think it should work.

Thanks @IngelaAndin, I will take a closer look next week, I'm currently on a short vacation 🙂

Maria-12648430 avatar Mar 16 '24 08:03 Maria-12648430

@IngelaAndin your suggestions make sense to me and I think we should go that way. However, this makes it a larger refactoring (vs a rather simple bugfix) which will take some time to work out.

Maria-12648430 avatar Mar 20 '24 07:03 Maria-12648430

@Maria-12648430 no problem Maria, I think it is important to fix things the "tm correct way" because otherwise it is mainly a question of time until some new problem pops up. I believe in fixing the problem not only shutting the symptoms up. We very much apricate you taking the time, as we alas not always have enough resources to fix everything as swiftly as we may desire. It was an unfortunate mistake I think this design, that probably slipped through as we might have missed to review some code, but code reviewing activities are also a tradeoff between other things that needs doing, so we just have to deal with it.

IngelaAndin avatar Mar 20 '24 09:03 IngelaAndin

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false?

u3s avatar Mar 20 '24 15:03 u3s

I agree with suggestions from Ingela. When browsing code I also found code line which is probably redundant.

user_pid option is added twice - probably safe to remove (tm) ;-) if turns out related to this work. ssh_acceptor:handle_connection/4 invokes ?PUT_INTERNAL_OPT([{user_pid, which is then repeated in ssh_system_sup:start_subsystem/4

I think instruction in ssh_acceptor could be removed.

u3s avatar Mar 21 '24 10:03 u3s

This is pretty hard stuff :cold_sweat: Everything happens all over the place and gets pushed around to everywhere else, or so it seems 🤪 This will take a while to sort out and not break anything.

Maria-12648430 avatar Mar 21 '24 16:03 Maria-12648430

Also I think it should work to have several accepting processes, first one will win and the others might get the next connection attempt.

Do you mean to have that in scenario with parallel_login == true or parallel_login == false? I guess it makes little sense for parallel_login false! I think this option makes little sense at all. I think you could always have parallel login and put the limit to one to get false behavior.

IngelaAndin avatar Mar 22 '24 16:03 IngelaAndin

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

We put another simple_one_for_one supervisor ssh_acceptor_subsup between ssh_acceptor_sup and ssh_acceptor. It sets up the listening socket for a daemon (this was formerly in ssh_acceptor). The actual acceptor processes are children of this supervisor.

We also introduced a new value for the parallel_login option in addition to true and false which can be a positive integer.

The overall behavior is as follows, depending on the value given for parallel_login:

  • If parallel_login is a number N (false is synonymous to N=1), that number of ssh_acceptor children are started initially. Each of those acceptors runs in a loop that waits for a connection, handles it, then wraps around to waiting for the next connection. This way, up to N login attempts can be handled simultaneously; if there are more than N simultaneous login attempts, they have to wait.
  • If parallel_login is true (in the sense of an unlimited number of simultaneous login attempts), only 1 ssh_acceptor child is started initially. As soon as this acceptor gets a connection, it tells the supervisor to start another acceptor before handling the connection and finally exiting (instead of looping around to wait for another connection). The freshly started acceptor will behave in the same way.

The outlined behavior introduces no backwards incompatibilities. parallel_login values false and true remain with the same outward behavior, and the newly introduced value N provides a good "middle ground" between sequential accepting and unlimited all-out parallelism.

Moreover, the approach solves the issue (#8223) that gave raise to this PR, and as far as we can tell incorporates the suggestions made before.

There is one failing test, though: ssh_protocol_SUITE:client_close_after_hello/1 (there may be more, some are being skipped on my machine, don't know about those). I don't really grasp what this one is supposed to be testing, and if it still applies with the changes made. I would be grateful if you could have a look at it.

Maria-12648430 avatar Mar 25 '24 11:03 Maria-12648430

@IngelaAndin @u3s Ok, with help from @juhlig, this is what we came up with. Let us know what you think.

Thanks a lot for work. We planned code review and will get back.

u3s avatar Apr 03 '24 07:04 u3s

I think the test looks like a white-box test, and probably it trying to test that there is no process leak if the client gives up the connection after sending its hello message. Probably it should be checked differently as you have changed the supervisor tree.

I like the way that you describe in the documentation what should happen. I have not full understood the code yet. I think I need to look at it as whole the diff is not enough.

IngelaAndin avatar Apr 09 '24 14:04 IngelaAndin

@IngelaAndin looking forward to your review :slightly_smiling_face: I'll see what I can do about that test next week.

Btw, I got a notification by email about a comment from you regarding proc_lib:spawn_link vs proc_lib:start_link+proc_lib:init_ack in ssh_acceptor, but strangely I can't find it here in Github :thinking:

Maria-12648430 avatar Apr 10 '24 06:04 Maria-12648430

@Maria-12648430 yes it was because I thought maybe my comment was premature, so I deleted it for the moment. It is true that processes under a simple_one_for_one supervisor would benefit from using proc_lib:spawn_link and then use behaviour:enter_loop instead of proc_lib:start_link as they are independent of each other and do not need to wait for each others init, so we can skip send and receiving of the ack message. Question is, if I had put this comment in the right place. So now I just made a general comment instead.

IngelaAndin avatar Apr 10 '24 07:04 IngelaAndin

We are still stuck on this one test. It just doesn't add up.

Before the structural changes in this PR, when parallel login was enabled, the acceptor would spawn_link a new process to do the authentication and handshake, which would trap exits. If the acceptor died during this, it would notice the exit via the link and stop the connection handler process again. This is what it looks like the test covers: it sends an exit to the acceptor (which is not trapping exits), then checks that there are no processes in handshake and no connection handler processes.

Notably, when parallel login was disabled, the implementation would not stop a connection handler process hanging in handshake. The acceptor process was doing the authentication and handshake by itself (ie, not in a spawned process), and the death of the acceptor would just stop the process. The handshake might time out, that would be kinda ok, but if it succeeded later there would be an orphaned connection handler process.

Notably again, this can also happen when a user starts a client via connect. If the client dies when the connection is in handshake, the connection handler process will not be stopped.

With the changes in this PR, there are now multiple acceptors handling authorization and handshake when parallel login is enabled, each handling authentication and handshake by itself like the single acceptor in the old implementation with parallel login disabled. When the acceptor receives an exit and dies, the connection handler is orphaned the same way.

The acceptor process could be changed to trap exits and shut down connections stuck in handshake, but as the selective receive in the handshake does not have a clause, this could only be done after the handshake eventually succeeded or timed out. Adding another clause is also not a good option since the same function is also used in the user process (which gave raise to the initial issue). So it looks like this cannot be solved without more structural changes, that is, another layer of separate processes independent from the acceptor and user processes. But as I already said above, this gets out of hand and turns into a bigger refactoring which requires better planning. What to do? :cold_sweat: (@IngelaAndin @u3s)

Maria-12648430 avatar Apr 15 '24 14:04 Maria-12648430

What to do?

Maybe we could go with a "good enough bug fix" and then make a refactor later in new master after OTP-27 (as it is really soon)

I guess the only "good enough bugfix" that we can reasonably aim for now, without going into large-scale restructuring, would be what we had in the beginning: Revert the structural changes we have now (bring them back later), instead keep the spawn_linking of a process by the acceptor, but nail the receive of the 'EXIT' in handshake down to the acceptor pid :man_shrugging:

juhlig avatar Apr 15 '24 14:04 juhlig

@juhlig @Maria-12648430 We discussed it at meeting today and we feel that we will try avoiding making a good enough bug-fix as it might introduce new strangeness. We rather like to continue working on a rewrite in the OTP-27 track that possible could be backported later when it has proven itself.

IngelaAndin avatar Apr 16 '24 13:04 IngelaAndin

In general I think all process should be under a supervisor unless they are really temporary and simple so we know they will terminate. An example of that is start_connection_tree in tls_gen_connection. This is needed for the dynaminc TLS supervisor tree with a significant child to work correctly as otherwise it could become inconsistent if the user process dies in the middle of starting.

If we need to synchronize something with users process and it is not trough a behavior call it should be tagged with the pid of the process we are synchronizing with.

I think the transport accept call should be done by an acceptor process and the handshake shall be done by a new connection handler process.

Timeouts should be handled on the server side of things.

All non temporary process should be implemented as behaviors.

And as discussed before I think we can avoid user process opening listen sockets.

IngelaAndin avatar Apr 17 '24 07:04 IngelaAndin

Ok @IngelaAndin, I agree :slightly_smiling_face: Shall we close this PR then?

Maria-12648430 avatar Apr 17 '24 14:04 Maria-12648430

@Maria-12648430 sure we can close it and we are hoping you replace it with a new and even better one :) We really apricate your help with this. We are kind of swamped right now in the security protocol area so to speak.

IngelaAndin avatar Apr 17 '24 15:04 IngelaAndin

Ok then, I'll dig into the code to better understand the status quo, see what I can do, and come back when I have something to show :smiley:

Maria-12648430 avatar Apr 18 '24 13:04 Maria-12648430