
PHP times out and causes 500 errors

s4ke opened this issue on Feb 26 '21 · 13 comments

We are using 3.0.1-ce from Docker Hub.

Our database, running on the host, is MariaDB: mariadb Ver 15.1 Distrib 10.1.47-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2

This is on an instance we upgraded from 2.13.5 to 3.0.1:

2021/02/26 09:21:57 [error] 154#154: *62 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.1, server: _, request: "GET /settings.json?api-version=v2 HTTP/1.0", upstream: "fastcgi://unix:/run/php/php7.3-fpm.sock", host: "localhost:8444"
2021/02/26 09:21:57 [error] 154#154: *63 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.1, server: _, request: "GET /auth/is-authenticated.json HTTP/1.0", upstream: "fastcgi://unix:/run/php/php7.3-fpm.sock", host: "localhost:8444"

Also, we get these errors from the Apache inside the Docker container:

2021-02-26 09:22:53,357 INFO reaped unknown pid 216
2021-02-26 09:22:53,364 INFO reaped unknown pid 218
2021-02-26 09:22:53,377 INFO reaped unknown pid 220
2021-02-26 09:22:53,390 INFO reaped unknown pid 222
2021-02-26 09:22:53,396 INFO reaped unknown pid 224
2021-02-26 09:22:53,403 INFO reaped unknown pid 226
2021-02-26 09:22:53,426 INFO reaped unknown pid 228
2021-02-26 09:22:53,451 INFO reaped unknown pid 231
2021-02-26 09:22:53,458 INFO reaped unknown pid 233
2021-02-26 09:22:53,473 INFO reaped unknown pid 235
2021-02-26 09:22:53,486 INFO reaped unknown pid 237
2021-02-26 09:22:53,508 INFO reaped unknown pid 239
2021-02-26 09:22:53,589 INFO reaped unknown pid 242

We configure everything via env vars and use the following mounts:

    volumes:
      - "/data/ansible/passbolt/data/gpg:/etc/passbolt/gpg"
      - "/data/ansible/passbolt/data/images:/usr/share/php/passbolt/webroot/img/public"
      - "/data/ansible/certs/{{ passbolt_cert_dir }}:/certs"
      - "/data/ansible/passbolt/mariadb/mariadb.cnf:/etc/mysql/mariadb.conf.d/mariadb.cnf"
    env:
      DATASOURCES_DEFAULT_HOST: "{{ database_server_str }}"
      DATASOURCES_DEFAULT_USERNAME: "{{ passbolt_db_user }}"
      DATASOURCES_DEFAULT_PASSWORD: "{{ passbolt_db_passwd }}"
      DATASOURCES_DEFAULT_DATABASE: "{{ passbolt_db }}"
      APP_FULL_BASE_URL: "https://{{ passbolt_domain }}"
      DATASOURCES_DEFAULT_SSL_KEY: "/certs/server.key"
      DATASOURCES_DEFAULT_SSL_CERT: "/certs/server.crt"
      DATASOURCES_DEFAULT_SSL_CA: "/certs/rootCA.crt"
      SECURITY_SALT: "{{ passbolt_salt }}"
      PASSBOLT_KEY_LENGTH: "{{ passbolt_key_length }}"
      PASSBOLT_SUBKEY_LENGTH: "{{ passbolt_subkey_length }}"
      PASSBOLT_KEY_NAME: "{{ passbolt_default_user }}"
      PASSBOLT_KEY_EMAIL: "{{ passbolt_default_user_mail }}"
      PASSBOLT_KEY_EXPIRATION: "{{ passbolt_key_expiration }}"
      EMAIL_TRANSPORT_DEFAULT_CLASS_NAME: "{{ passbolt_email_transport_class_name }}"
      EMAIL_DEFAULT_FROM: "{{ passbolt_email_from }}"
      EMAIL_DEFAULT_TRANSPORT: "default"
      EMAIL_TRANSPORT_DEFAULT_HOST: "{{ passbolt_email_host }}"
      EMAIL_TRANSPORT_DEFAULT_PORT: "{{ passbolt_email_port }}"
      EMAIL_TRANSPORT_DEFAULT_TIMEOUT: "{{ passbolt_email_timeout }}"
      EMAIL_TRANSPORT_DEFAULT_USERNAME: "{{ passbolt_email_user }}"
      EMAIL_TRANSPORT_DEFAULT_PASSWORD: "{{ passbolt_email_password }}"
      EMAIL_TRANSPORT_DEFAULT_TLS: "{{ passbolt_email_use_tls }}"
      PASSBOLT_EMAIL_SHOW_COMMENT: "false"
      PASSBOLT_EMAIL_SHOW_DESCRIPTION: "false"
      PASSBOLT_EMAIL_SHOW_SECRET: "false"
      PASSBOLT_EMAIL_SHOW_URI: "false"
      PASSBOLT_EMAIL_SHOW_USERNAME: "false"
    restart_policy: unless-stopped

s4ke · Feb 26 '21

During the upgrade I simply changed the volumes that mounted to the original paths so that they map to the new paths, /etc/passbolt/gpg and /usr/share/php/passbolt/webroot/img/public.
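
In the playbook, that change looked roughly like this (I am quoting the old 2.x container paths from memory, so treat them as approximate):

    # before (2.13.5)
    - "/data/ansible/passbolt/data/gpg:/var/www/passbolt/config/gpg"
    - "/data/ansible/passbolt/data/images:/var/www/passbolt/webroot/img/public"
    # after (3.0.1)
    - "/data/ansible/passbolt/data/gpg:/etc/passbolt/gpg"
    - "/data/ansible/passbolt/data/images:/usr/share/php/passbolt/webroot/img/public"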

s4ke · Feb 26 '21

Hi @s4ke,

Thanks for your report. Is there any specific action that triggers these timeouts, or do they happen randomly?

dlen · Feb 26 '21

Hey, we are getting these errors seemingly randomly.

s4ke · Feb 26 '21

If it helps, this is the nginx location config we use on the host:

    location / {
        proxy_pass          https://localhost:8444/;
        proxy_set_header    X-Real-IP          $remote_addr;
        proxy_set_header    X-Forwarded-For    $proxy_add_x_forwarded_for;
        proxy_set_header    X-Forwarded-Host   $host;
        proxy_set_header    X-Forwarded-Server $host;
        proxy_set_header    X-Forwarded-Port   $server_port;
        proxy_set_header    X-Forwarded-Proto  $scheme;
    }

s4ke · Feb 26 '21

We are investigating the issue, but so far we have been unable to reproduce the timeouts.

While we continue trying to reproduce your issue, could you confirm whether your database data is clean? You can check by running the cleanup task:

su -s /bin/bash -c "/usr/share/php/passbolt/bin/cake passbolt cleanup" www-data
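
If you first want to see what it would fix without changing anything, the cleanup task also accepts a dry-run flag (assuming your version supports it):

    su -s /bin/bash -c "/usr/share/php/passbolt/bin/cake passbolt cleanup --dry-run" www-data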

dlen · Feb 26 '21

@s4ke You could try adding proxy_read_timeout {seconds}; to the location block of your reverse proxy settings and see if that helps at all. Those errors show up at the same time, so maybe this will help (NGINX ref: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout).
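
For example, on top of your posted location block (90s is just an example value, not a recommendation):

    location / {
        proxy_pass          https://localhost:8444/;
        proxy_read_timeout  90s;  # example value; the NGINX default is 60s
        # ...your existing proxy_set_header lines stay unchanged
    }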

If this doesn't help we could target another setting on the NGINX in the container itself.

garrettboone · Feb 26 '21

@garrettboone The error messages are from the internal Apache failing.

@dlen I executed this already and it is clean.

s4ke · Feb 26 '21

@s4ke The internal webserver is NGINX, not Apache. I see the trailing slash on the proxy_pass; there should be no slash (NGINX ref: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_pass). There are other NGINX settings that could be tweaked inside the container webserver, but we would not typically expect that to be needed.
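
Applied to your posted config, that line would become:

    proxy_pass https://localhost:8444;  # no trailing slash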

Can you share specs on the machine environment? Is there another service that might cause a resource issue?

garrettboone · Feb 26 '21

@garrettboone about the internal nginx: understood

The trailing slash shouldn't be the cause of this issue, though, right? I can try changing it anyway.

We are running this on the same VM we used before; the CPU load is fine.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel Xeon Processor (Skylake, IBRS)
stepping        : 4
microcode       : 0x1
cpu MHz         : 2099.998
cache size      : 16384 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

(processor 1 reports identical values, apart from core id, apicid and initial apicid, which are all 1)

Linux host: Linux nf01 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 22 20:32:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Docker: Docker version 20.10.3, build 48d30b5

MariaDB is only used by passbolt.

s4ke · Feb 26 '21

If it helps, we use our playbooks at:

https://github.com/neuroforgede/passbolt_ansible/blob/master/roles/passbolt/tasks/main.yml

We have ~300-400 passwords saved and 5 users on this instance (I am estimating here; I have 240 passwords visible on my own account). The timeouts happen even shortly after a restart of the Docker container.

If you want, we can investigate this live on my machine. I'd be available over the weekend or sometime next week.

s4ke · Feb 26 '21

Is anyone connecting by VPN? Does this occur for all users? There is an NGINX setting related to this kind of timeout, but I would not use it as a permanent fix: http://nginx.org/en/docs/http/ngx_http_fastcgi_module.html#fastcgi_read_timeout

If you apply it internally to the NGINX in the container and it helps, then maybe it's a network issue or some other periodic event, like a signal drop on a VPN. Just guessing, but you could try it.
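
If you want to experiment inside the container, the sketch below shows where that directive would go; the exact file layout in the image may differ, so treat it as an illustration only:

    # inside the container's NGINX site config, in the location that handles PHP
    location ~ \.php$ {
        fastcgi_pass         unix:/run/php/php7.3-fpm.sock;  # the socket from your error log
        include              fastcgi_params;
        fastcgi_read_timeout 120s;  # example value; the NGINX default is 60s
    }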

garrettboone · Feb 26 '21

No, no one is connecting by VPN. The database is on the same host.

This occurs sporadically. The CPU usage looks fine, though.

I will try fastcgi_read_timeout, though.

s4ke · Mar 01 '21

With the recent changes in the browser plugin, it looks like our usability issues have subsided. To be honest, I haven't looked at the API server's logs in a while, so it could be that these errors were there from the get-go.

s4ke · Mar 07 '21

Feel free to re-open if this issue persists. Thanks for your feedback!

dlen · Sep 14 '22