Installer gets stuck

Open kstenerud opened this issue 3 years ago • 16 comments

Describe the bug

When I ran the algo script, it asked me a few questions, and now it's hung.

To Reproduce

Steps to reproduce the behavior:

  1. Set up an ubuntu 20.04 vps
  2. download and extract https://github.com/trailofbits/algo/archive/master.zip
  3. sudo apt install -y --no-install-recommends python3-virtualenv
  4. Install dependencies (step 4)
  5. Configure (step 5). I just changed the user list
  6. run ./algo

Expected behavior

The script completes

Additional context

Add any other context about the problem here.

Full log

# ./algo
[WARNING]: Could not match supplied host pattern, ignoring: vpn-host

PLAY [localhost] *************************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************
ok: [localhost]

TASK [Playbook dir stat] *****************************************************************************************************************
ok: [localhost]

TASK [Ensure Ansible is not being run in a world writable directory] *********************************************************************
ok: [localhost] => {
    "changed": false,
    "msg": "All assertions passed"
}

TASK [Ensure the requirements installed] *************************************************************************************************
ok: [localhost]

TASK [Set required ansible version as a fact] ********************************************************************************************
ok: [localhost] => (item=ansible==2.9.7)

TASK [Verify Python meets Algo VPN requirements] *****************************************************************************************
ok: [localhost] => {
    "changed": false,
    "msg": "All assertions passed"
}

TASK [Verify Ansible meets Algo VPN requirements] ****************************************************************************************
ok: [localhost] => {
    "changed": false,
    "msg": "All assertions passed"
}
[WARNING]: Found variable using reserved name: no_log

PLAY [Ask user for the input] ************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************
ok: [localhost]
[Cloud prompt]
What provider would you like to use?
    1. DigitalOcean
    2. Amazon Lightsail
    3. Amazon EC2
    4. Microsoft Azure
    5. Google Compute Engine
    6. Hetzner Cloud
    7. Vultr
    8. Scaleway
    9. OpenStack (DreamCompute optimised)
    10. CloudStack (Exoscale optimised)
    11. Linode
    12. Install to existing Ubuntu 18.04 or 20.04 server (for more advanced users)
  
Enter the number of your desired provider
:
12^M
TASK [Cloud prompt] **********************************************************************************************************************
ok: [localhost]

TASK [Set facts based on the input] ******************************************************************************************************
ok: [localhost]
[Cellular On Demand prompt]
Do you want macOS/iOS clients to enable "Connect On Demand" when connected to cellular networks?
[y/N]
:
y^M
TASK [Cellular On Demand prompt] *********************************************************************************************************
ok: [localhost]
[Wi-Fi On Demand prompt]
Do you want macOS/iOS clients to enable "Connect On Demand" when connected to Wi-Fi?
[y/N]
:
y^M
TASK [Wi-Fi On Demand prompt] ************************************************************************************************************
ok: [localhost]
[Trusted Wi-Fi networks prompt]
List the names of any trusted Wi-Fi networks where macOS/iOS clients should not use "Connect On Demand"
(e.g., your home network. Comma-separated value, e.g., HomeNet,OfficeWifi,AlgoWiFi)
:
^M
TASK [Trusted Wi-Fi networks prompt] *****************************************************************************************************
ok: [localhost]
[Retain the PKI prompt]
Do you want to retain the keys (PKI)? (required to add users in the future, but less secure)
[y/N]
:
y^M
TASK [Retain the PKI prompt] *************************************************************************************************************
ok: [localhost]
[DNS adblocking prompt]
Do you want to enable DNS ad blocking on this VPN server?
[y/N]
:
^M
TASK [DNS adblocking prompt] *************************************************************************************************************
ok: [localhost]
[SSH tunneling prompt]
Do you want each user to have their own account for SSH tunneling?
[y/N]
:
^M
TASK [SSH tunneling prompt] **************************************************************************************************************
ok: [localhost]

TASK [Set facts based on the input] ******************************************************************************************************
ok: [localhost]

PLAY [Provision the server] **************************************************************************************************************

TASK [Gathering Facts] *******************************************************************************************************************
ok: [localhost]

--> Please include the following block of text when reporting issues:

Algo running on: Ubuntu 20.04.1 LTS (Virtualized: kvm)
ZIP file created: 2020-12-11 09:57:27.000000000 +0000
Python 3.8.5
Runtime variables:
    algo_provider "local"
    algo_ondemand_cellular "True"
    algo_ondemand_wifi "True"
    algo_ondemand_wifi_exclude "X251bGw="
    algo_dns_adblocking "False"
    algo_ssh_tunneling "False"
    wireguard_enabled "True"
    dns_encryption "True"

TASK [Display the invocation environment] ************************************************************************************************
changed: [localhost -> localhost]

TASK [Install the requirements] **********************************************************************************************************
changed: [localhost -> localhost]
[local : pause]
Enter the IP address of your server: (or use localhost for local installation):
[localhost]
:
51.15.124.87^M
TASK [local : pause] *********************************************************************************************************************
ok: [localhost]

TASK [local : Set the facts] *************************************************************************************************************
ok: [localhost]
[local : pause]
What user should we use to login on the server? (note: passwordless login required, or ignore if you're deploying to localhost)
[root]
:
^M
TASK [local : pause] *********************************************************************************************************************
ok: [localhost]

TASK [local : Set the facts] *************************************************************************************************************
ok: [localhost]
[local : pause]
Enter the public IP address or domain name of your server: (IMPORTANT! This is used to verify the certificate)
[51.15.124.87]
:
^M
TASK [local : pause] *********************************************************************************************************************
ok: [localhost]

TASK [local : Set the facts] *************************************************************************************************************
ok: [localhost]

TASK [Set subjectAltName as a fact] ******************************************************************************************************
ok: [localhost]

TASK [Add the server to an inventory group] **********************************************************************************************
changed: [localhost]

TASK [Wait until SSH becomes ready...] ***************************************************************************************************
ok: [localhost]

TASK [debug] *****************************************************************************************************************************
ok: [localhost] => {
    "IP_subject_alt_name": "51.15.124.87"
}
	

kstenerud avatar Dec 18 '20 07:12 kstenerud

If you wish to turn the system on which you're running ./algo into an AlgoVPN, enter localhost at this prompt:

Enter the IP address of your server: (or use localhost for local installation):

davidemyers avatar Dec 18 '20 13:12 davidemyers

I've successfully installed algo on a remote (over passwordless ssh) ubuntu before.

I can confirm the issue being described is both reproducible and a departure from the expected behavior.

tamsky avatar Dec 21 '20 19:12 tamsky

@tamsky Since @kstenerud mentioned setting up a new VPS first it sounded like maybe it wasn't his intention to install remotely over SSH.

I'm still able to install to a remote server over SSH. Could something have changed on your end?

davidemyers avatar Dec 21 '20 20:12 davidemyers

I tried a few times just now using Lightsail, their OS-only images of Ubuntu 18 and Ubuntu 20.

Both get stuck installing the remote server over SSH at the same point as the original issue description above.

After switching to a localhost-install on either of those Lightsail OS versions, I can assume that ansible is pausing within the following task:

TASK [Wait 600 seconds for target connection to become reachable/usable] 

because that is the task output that immediately follows the IP_subject_alt_name debug output.

tamsky avatar Dec 21 '20 20:12 tamsky

So, knowing that it's waiting 600 seconds, I left ./algo alone at the stalled step for more than 10 minutes, after which it spits out the following error:

TASK [Wait 600 seconds for target connection to become reachable/usable] *******************************************************************************
failed: [localhost -> <elided>] (item=<elided>) => {"ansible_loop_var": "item", "changed": false, "elapsed": 807, "item": "<elided>", "msg": "timed out waiting for ping module test success: Failed to connect to the host via ssh: Warning: Permanently added '<elided>' (ECDSA) to the list of known hosts.\r\nubuntu@<elided>: Permission denied (publickey)."}

<elided> is an ipv4 address.

I can assert that the subject IPv4 is reachable via passwordless (via ssh-agent) ssh ubuntu@<elided> and the host key was manually accepted before invoking ./algo.

Are there manual debug steps for the ping module?
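
One way to exercise the ping module by hand is an ad-hoc ansible run against the single host. The sketch below only prints the command so it can be reviewed before running. ping_cmd is a hypothetical helper, and 203.0.113.10 is a placeholder address, not the elided one from this thread. It assumes the printed command is then run from the Algo directory (so that its ansible.cfg, including ssh_args, is picked up) with the same Python environment that ./algo uses.

```shell
#!/bin/sh
# ping_cmd: hypothetical helper, not part of Algo.
# Prints (does not run) an ad-hoc Ansible ping against one host.
#   $1 = target address (placeholder below)
#   $2 = login user
ping_cmd() {
    # The trailing comma after the address makes Ansible treat the
    # -i argument as an inline host list instead of an inventory file.
    printf "ansible all -i '%s,' -u '%s' -m ping -vvv\n" "$1" "$2"
}

ping_cmd 203.0.113.10 ubuntu
```

With -vvv the output should include the full SSH: EXEC line, which shows the ssh options Ansible actually passes.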

tamsky avatar Dec 21 '20 21:12 tamsky

My previous test was with Vultr, but Lightsail works for me as well. Here's how I'm testing:

  1. Create a new Ubuntu Server 20.04 VPS
  2. SSH to the VPS (as root on Vultr or ubuntu on Lightsail) and upgrade all packages
  3. Reboot the VPS since the kernel was upgraded
  4. Use a fresh git clone of Algo on a local Ubuntu Server 20.04 system to configure the VPS via SSH

davidemyers avatar Dec 21 '20 21:12 davidemyers

Can you still SSH into the system after Algo has failed?

davidemyers avatar Dec 21 '20 21:12 davidemyers

> Can you still SSH into the system after Algo has failed?

Yes.

tamsky avatar Dec 21 '20 22:12 tamsky

This is looking more and more like the ssh client is not using my ssh-agent to perform authentication.

Permission denied (publickey)

relevant snippet from ./algo -vvv:

<<ipv4_elided>> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null 
-o ConnectTimeout=6 -o ConnectionAttempts=30 -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o Port=22 
-o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey 
-o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=60 -o ControlPath=/Users/<user_elided>/.ansible/cp/317d98769d 
<ipv4_elided> '/bin/sh -c '"'"'echo ~ubuntu && sleep 0'"'"''

<<ipv4_elided>> (255, b'', b"Warning: Permanently added '<ipv4_elided>' (ECDSA) to the list of known hosts.\r\nubuntu@<ipv4_elided>: Permission denied (publickey).\r\n")

<<ipv4_elided>> ssh_retry: attempt: 4, ssh return code is 255. cmd ([b'ssh', b'-o', b'ControlMaster=auto', b'-o', b'ControlPersist=60s', b'-o', b'UserKnownHostsFile=/dev/null', b'-o', b'ConnectTimeout=6', b'-o', b'ConnectionAttempts=30', b'-o', b'IdentitiesOnly=yes', b'-o', b'StrictHostKeyChecking=no', b'-o', b'Port=22', b'-o', b'KbdInteractiveAuthentication=no', b'-o', b'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', b'-o', b'PasswordAuthentication=no', b'-o', b'User="ubuntu"', b'-o', b'ConnectTimeout=60', b'-o', b'ControlPath=/Users/<user_elided>/.ansible/cp/317d98769d', b'<ipv4_elided>', b"/bin/sh -c 'echo ~ubuntu && sleep 0'"]...), pausing for 7 seconds

tamsky avatar Dec 21 '20 22:12 tamsky

The following diff fixes the problematic behavior by removing -o IdentitiesOnly=yes:

diff -r 04aedbe6bfe0 ansible.cfg
--- a/ansible.cfg       Fri Dec 11 12:57:27 2020 +0300
+++ b/ansible.cfg       Mon Dec 21 15:18:21 2020 -0800
@@ -12,6 +12,6 @@
 record_host_keys = False
 
 [ssh_connection]
-ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o ConnectTimeout=6 -o ConnectionAttempts=30 -o IdentitiesOnly=yes
+ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o ConnectTimeout=6 -o ConnectionAttempts=30
 scp_if_ssh = True
 retries = 30

tamsky avatar Dec 21 '20 23:12 tamsky

Very interesting. Quoting the man page:

IdentitiesOnly Specifies that ssh(1) should only use the configured authentication identity and certificate files (either the default files, or those explicitly configured in the ssh_config files or passed on the ssh(1) command-line), even if ssh-agent(1) or a PKCS11Provider or SecurityKeyProvider offers more identities. The argument to this keyword must be yes or no (the default). This option is intended for situations where ssh-agent offers many different identities.

In my testing I'm using default identity files (in my case ~/.ssh/id_ed25519 with Vultr and ~/.ssh/id_rsa with Lightsail) and these work with ssh-agent.

So are you using a non-default identity file?

I wonder if we can safely remove this option from ansible.cfg

Edited to add: IdentitiesOnly was added here, probably for a good reason.

davidemyers avatar Dec 22 '20 12:12 davidemyers

Please keep in mind that the use case here is for option 12: "Install to existing Ubuntu 18.04 or 20.04 server (for more advanced users)".

> So are you using a non-default identity file?

This definitely depends on what your expectations are surrounding the word "default".

I'm definitely using the default name for the downloaded key/certfile after they have been generated & downloaded within the Lightsail console, namely: ~/.ssh/LightsailDefaultKey-us-east-1.pem, and then added to ssh-agent via ssh-add ~/.ssh/LightsailDefaultKey-us-east-1.pem.

> I wonder if we can safely remove this option from ansible.cfg

I wonder if we can add support for dynamically removing IdentitiesOnly from the ssh_args value when Option 12 is in use, and:

  • when ssh-agent env var is detected
  • perhaps also implementing some kind of early sanity check
  • when the sanity check fails, an ssh_key pathname prompt can be issued
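
A minimal sketch of the first bullet, assuming that a non-empty SSH_AUTH_SOCK is a good enough signal that an agent is in use. choose_ssh_args is a hypothetical helper, not something Algo provides. The base options are copied from the ssh_args line in ansible.cfg.

```shell
#!/bin/sh
# Options from the ssh_args line in ansible.cfg, minus IdentitiesOnly.
BASE_ARGS="-o ControlMaster=auto -o ControlPersist=60s -o UserKnownHostsFile=/dev/null -o ConnectTimeout=6 -o ConnectionAttempts=30"

# choose_ssh_args: hypothetical helper, not part of Algo.
# Prints the ssh options to use, dropping IdentitiesOnly=yes when an
# ssh-agent socket is advertised in the environment.
choose_ssh_args() {
    if [ -n "${SSH_AUTH_SOCK:-}" ]; then
        # Agent detected: let ssh offer every identity the agent holds.
        printf '%s\n' "$BASE_ARGS"
    else
        # No agent: keep the current behavior.
        printf '%s -o IdentitiesOnly=yes\n' "$BASE_ARGS"
    fi
}
```

It could be tried without editing ansible.cfg via ANSIBLE_SSH_ARGS="$(choose_ssh_args)" ./algo, on the assumption that Ansible's ANSIBLE_SSH_ARGS environment variable takes precedence over the ssh_args setting in the config file. This is untested against Algo.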

> IdentitiesOnly was added here, probably for a good reason.

I don't see a good reason, or a "probably", anywhere in that commit, or any of the issues connected to the commit (#152 #151 #112).

tamsky avatar Dec 27 '20 04:12 tamsky

> This definitely depends on what your expectations are surrounding the word "default".

My expectations are irrelevant, we're talking about OpenSSH. From the man page for ssh on macOS:

-i identity_file Selects a file from which the identity (private key) for public key authentication is read. The default is ~/.ssh/id_dsa, ~/.ssh/id_ecdsa, ~/.ssh/id_ed25519 and ~/.ssh/id_rsa.

This explains why I wasn't able to reproduce your issue. I was using a default identity file so it didn't get excluded by IdentitiesOnly.


So @jackivanov here is the issue, I think:

  • ssh_args defined in ansible.cfg includes the option IdentitiesOnly=yes.
  • When performing an install to an existing instance via SSH, IdentitiesOnly=yes causes SSH to ignore identities offered by ssh-agent unless they correspond to the default (or explicitly configured) identity files.
  • So if a user sets up passwordless SSH to an instance with ssh-agent but uses a non-default name for their identity file, Algo will hang because that identity in ssh-agent will be ignored.
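
The check implied by these bullets can be sketched as follows. is_default_identity is a hypothetical helper, and its list of names is taken from the macOS man page excerpt quoted above. Other OpenSSH releases may use a different list.

```shell
#!/bin/sh
# is_default_identity: hypothetical helper, not part of Algo or OpenSSH.
# Succeeds when the given private key path is one of the default
# identity files named in the man page excerpt quoted in this thread.
is_default_identity() {
    for name in id_dsa id_ecdsa id_ed25519 id_rsa; do
        [ "$1" = "$HOME/.ssh/$name" ] && return 0
    done
    return 1
}

if is_default_identity "$HOME/.ssh/LightsailDefaultKey-us-east-1.pem"; then
    echo "default identity file"
else
    echo "non-default identity file: ignored when IdentitiesOnly=yes"
fi
```

For a key like the Lightsail one the function fails, which matches the Permission denied (publickey) result reported earlier. A workaround that keeps IdentitiesOnly=yes should be to name the key explicitly, for example with an IdentityFile entry for the host in ~/.ssh/config, since explicitly configured identity files are still used.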

Is IdentitiesOnly=yes still needed?

davidemyers avatar Dec 27 '20 14:12 davidemyers

> If you wish to turn the system on which you're running ./algo into an AlgoVPN, enter localhost at this prompt:
>
> Enter the IP address of your server: (or use localhost for local installation):

This works for me. I was trying to install it locally, not remotely.

jiayanguo avatar May 17 '21 04:05 jiayanguo

Removing IdentitiesOnly=yes solved the issue for me. I had exactly the situation @davidemyers described with a non-default passwordless SSH identity file.

letalumil avatar Aug 04 '21 08:08 letalumil

This might help someone using a GCP compute instance.

This was happening to me because I was installing on a gcloud compute instance and had been logging in with the gcloud compute ssh command rather than plain ssh.

After I directly did ssh user@ip I was able to install algo.

luvpreetsingh avatar Feb 22 '22 12:02 luvpreetsingh