SSH server dies when creating lots of connections

Open NefixEstrada opened this issue 1 year ago • 3 comments

Hello,

We've been using WarpGate happily, but we're facing an issue:

  1. We use Ansible
  2. We have an inventory with 14 hosts
  3. When running a (fairly long) playbook, at some point the SSH server dies. The web server still works, but the SSH server stops listening on its port (running ss -tulpn doesn't show it)
  4. Connections that are already open stay alive, but no new connections can be opened
  5. After a restart of the server, it works again

It seems to be related to the number of connections, or to a high rate of opening and closing connections.
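
To narrow down when the listener goes away, a small probe along these lines could record a timestamp each time the SSH port changes state, which can then be compared against the WarpGate logs. This is only a sketch: the host and port passed to watch are placeholders, and a successful TCP connect is assumed to be a good enough stand-in for what ss -tulpn shows.

```python
import socket
import time
from datetime import datetime, timezone


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def watch(host: str, port: int, interval: float = 5.0, max_checks=None) -> None:
    """Print a timestamped line whenever the listener state changes.

    max_checks=None keeps polling until interrupted.
    """
    last_state = None
    checks = 0
    while max_checks is None or checks < max_checks:
        state = port_is_open(host, port)
        if state != last_state:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            label = "accepting" if state else "NOT accepting"
            print(f"{stamp} {host}:{port} is {label} connections")
            last_state = state
        checks += 1
        if max_checks is None or checks < max_checks:
            time.sleep(interval)


# Example (placeholder address, runs until interrupted):
# watch("127.0.0.1", 2222)
```

Running it alongside the playbook should show roughly when the port stops accepting connections.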

How can we further debug this?

Thanks!!

Néfix Estrada IsardVDI

======

Looking further, this error shows up many times:

ERROR warpgate_core::logging::database: Failed to store log entry error=Exec(SqlxError(Database(SqliteError { code: 14, message: "unable to open database file" })))

NefixEstrada, Nov 12 '24 13:11

What OS are you running (e.g., RHEL with SELinux enabled)?

ERROR warpgate_core::logging::database: Failed to store log entry error=Exec(SqlxError(Database(SqliteError { code: 14, message: "unable to open database file" })))

Are you seeing this error after you restart the warpgate server?

codyro, Nov 12 '24 17:11

@codyro We're running Ubuntu 22.04

This error appears only after a while, when the server is under load. Restarting the server fixes the issue temporarily, but it comes back as soon as there is load again.

NefixEstrada, Nov 12 '24 17:11

I was able to test this (albeit on RHEL9) but couldn't replicate it. I did run into another issue that may be unrelated; I need to debug it a bit more and will submit a separate issue if necessary.

Here is what I did:

Steps to reproduce

  1. Spin up 20 VMs
  2. Set up Ansible with some debug tasks/roles and an inventory for the test instances
  3. Run ansible-playbook with various fork counts (1, 5, 10, 20) using the default strategy [1]
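
Step 3 could be scripted roughly as follows. This is a sketch rather than the exact commands used in the test: the playbook and inventory file names are hypothetical, and it relies only on the standard -i and --forks options of ansible-playbook.

```python
import subprocess
import time

# Fork counts from step 3 above.
FORK_COUNTS = [1, 5, 10, 20]


def build_command(playbook: str, inventory: str, forks: int) -> list:
    """Build the ansible-playbook argument list for one run."""
    return ["ansible-playbook", "-i", inventory, "--forks", str(forks), playbook]


def run_sweep(playbook: str, inventory: str, fork_counts=FORK_COUNTS) -> list:
    """Run the playbook once per fork count.

    Returns a list of (forks, exit_code, seconds) tuples.
    """
    results = []
    for forks in fork_counts:
        cmd = build_command(playbook, inventory, forks)
        start = time.monotonic()
        proc = subprocess.run(cmd)  # output goes straight to the terminal
        results.append((forks, proc.returncode, time.monotonic() - start))
    return results


# Example (hypothetical file names):
# for forks, code, seconds in run_sweep("test-warpgate.yml", "inventory.ini"):
#     print(f"forks={forks} exit={code} took={seconds:.1f}s")
```

A non-zero exit code at the higher fork counts would be the first thing to look for.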

Playbook(s) used to test

I ran a small playbook to see if everything ran cleanly and quickly. That worked without issues. Since you noted that your playbooks are relatively long, I added another role. This was when I ran into various issues; however, none of them were what you experienced. warpgate continued to listen normally and accepted new connections without a problem.

---
- name: Test warpgate
  hosts: all
  gather_facts: true
  vars:
    ansible_user: "codyr:{{ inventory_hostname }}"
    ansible_host: warpgate.host.com
    ansible_port: 2222
  tasks:
    - name: Print hostname
      ansible.builtin.debug:
        var: ansible_hostname

    - name: Add sshkey to root user
      ansible.posix.authorized_key:
        user: root
        state: present
        key: 'ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGM0TQe1trzJ4VEsKRhURyJJ7wsr/9UAY2JJdUhaPZfA  cody@test'
  ## Make the playbook run longer
  # roles:
  #   - role: geerlingguy.mysql

What version of ansible & warpgate are you running? I was also using the sqlite backend and did not see the errors you were receiving.

codyr@wp ~ [2]> podman exec -it systemd-warpgate warpgate --version
warpgate 0.11.0
(hawkansible-venv) codyr@Portia ~/D/g/t/s/w/ansible-warpgate (warpgate-testing)> ansible --version
ansible [core 2.17.5]

codyro, Nov 20 '24 15:11