gpdb
gpssh: Retry with TERM env variable set during failures
Currently gpssh always clears the TERM environment variable before making SSH connections. This causes problems when users rely on a specific terminal (such as tmux) for SSH connections, because an empty TERM causes some terminals to fail.
To fix this, a retry is added to the host login. The first login attempt is made as before, with the TERM environment variable cleared. If it fails, the login is retried with TERM restored.
gpssh now also prints the exception on errors, so the actual reason for an SSH failure is visible, which aids debugging.
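The retry described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this change: `term_cleared`, `login_with_retry`, and the injected `connect` callable are hypothetical names standing in for the real login path, and the printed messages only mimic the format shown in the sample output.

```python
import os
from contextlib import contextmanager


@contextmanager
def term_cleared():
    """Temporarily remove TERM from the environment and restore it on exit."""
    saved = os.environ.pop("TERM", None)
    try:
        yield
    finally:
        if saved is not None:
            os.environ["TERM"] = saved


def login_with_retry(connect, host):
    """Attempt a login with TERM cleared, then retry once with TERM restored.

    `connect` is a hypothetical callable that performs the actual SSH login
    and raises an exception on failure.
    """
    try:
        # First attempt: current behaviour, TERM is not passed to the session.
        with term_cleared():
            return connect(host)
    except Exception:
        pass  # fall through to the retry below

    try:
        # Second attempt: TERM has been restored by the context manager.
        return connect(host)
    except Exception as error:
        # Print the exception so the underlying SSH failure reason is visible.
        print("[ERROR] unable to login to %s" % host)
        print(error)
        raise
```

Because the retry happens inside the login helper, a caller sees either a working session or a single error report, which matches the goal of keeping the behaviour unchanged from the user's point of view.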
Output During Failure
Before:
[gpadmin@cdw ~]$ gpssh -h cdw
[ERROR] unable to login to cdw
Could not acquire connection.
After:
[gpadmin@cdw ~]$ gpssh -h cdw
[ERROR] unable to login to cdw
Could not acquire connection.
End Of File (EOF). Exception style platform.
<gpssh_modules.gppxssh_wrapper.PxsshWrapper object at 0x7ffa144b7860>
version: 3.3
command: /usr/bin/ssh
args: ['/usr/bin/ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'BatchMode=yes', '-q', '-l', 'gpadmin', 'cdw']
searcher: <pexpect.searcher_re object at 0x7ffa144b7c50>
buffer (last 100 chars): b''
before (last 100 chars): b' 9 07:34:06 2024 from 10.0.34.166\r\r\nopen terminal failed: missing or unsuitable terminal: unknown\r\n'
after: <class 'pexpect.EOF'>
match: None
match_index: None
exitstatus: None
flag_eof: True
pid: 52832
child_fd: 3
closed: False
timeout: 30
delimiter: <class 'pexpect.EOF'>
logfile: None
logfile_read: None
logfile_send: None
maxread: 2000
ignorecase: False
searchwindowsize: None
delaybeforesend: 0.05
delayafterclose: 0.1
delayafterterminate: 0.1
What exact aspect of the error reporting will give users a clue to use this new flag? (It would be good to call that out in the documentation for this flag.)
Also, it would be good to know the other way around: what happens if I always run the utility with this flag?
What exact aspect of the error reporting will give users a clue to use this new flag? (It would be good to call that out in the documentation for this flag.)
The console output from the exception message gives the reason for the SSH failure. For example, in the above exception, the line before (last 100 chars): b' 9 07:34:06 2024 from 10.0.34.166\r\r\nopen terminal failed: missing or unsuitable terminal: unknown\r\n' indicates the reason.
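As a rough illustration of reading that field, the sketch below pulls the last non-empty line out of the `before` buffer that pexpect records on the session object. The helper name `failure_reason` is hypothetical and not part of gpssh; the sample bytes are copied from the output above.

```python
def failure_reason(before):
    """Return the last non-empty line of a pexpect `before` buffer, or None."""
    text = before.decode("utf-8", errors="replace")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else None


# Sample buffer taken from the exception output shown earlier.
before = (b" 9 07:34:06 2024 from 10.0.34.166\r\r\n"
          b"open terminal failed: missing or unsuitable terminal: unknown\r\n")
print(failure_reason(before))
```

For the sample buffer this prints the `open terminal failed: missing or unsuitable terminal: unknown` line, which is the part that points at the TERM setting.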
Also, it would be good to know the other way around: what happens if I always run the utility with this flag?
Not entirely sure about this, but the TERM variable was originally emptied because of a customer issue back in 2009. Their setup (SunOS 5.10, TERM=vt102) caused inconsistent gpssh output when they used vt102 terminals.
Frankly, I find it really hard to draw the connection between that error and the newly provided option. As a user of this tool, I would have to do quite a bit of Google searching to figure out that this new option will solve the problem for me and that I should try running with it.
- Can't we do better: catch the exception and provide a direct message to run with --preserve-term? Or
- Better: try the current default without $TERM and, if it fails for such reasons, restore $TERM and run with it instead of asking the user to try that option?
The option provided seems tuned for very advanced users to me, and does not make life easier in general. We might still continue to get queries about gpssh failures.
- Better: try the current default without $TERM and, if it fails for such reasons, restore $TERM and run with it instead of asking the user to try that option?
@ashwinstar Based on your input, we will first try to make the SSH connection the default way and, if it fails for some reason, retry with the $TERM variable set. Let me know if you have any other input on this.
Thanks for implementing the same.
I should have clarified that I wasn't opposed to providing the flag. I feel advanced users would still find the flag handy, instead of paying the penalty of failing and then succeeding on retry, if they know they wish to retain $TERM. So I was proposing to do both.
@ashwinstar Discussed this with @nimish350. We feel it is okay not to provide the flag since, from the user's point of view, there is no indication that we are retrying the SSH connection. The user will experience the same behavior as the current gpssh, and internally we will perform a retry every time an error occurs.