proftpd icon indicating copy to clipboard operation
proftpd copied to clipboard

unkillable connections to ftp port 21

Open gojun077 opened this issue 3 years ago • 19 comments

What I Did

I have been using proftpd 1.2.x in a production environment for several years and recently upgraded to the latest stable 1.3.7c and run proftpd inside a Docker container. Since the upgrade, security scanners like Qualys and random (malicious?) users create hanging connections stuck on status "Authenticating" that are unkillable with ftpdctl and that ignore /etc/proftpd.conf settings "TimeoutNoTransfer", "TimeoutStalled", "TimeoutIdle". I am able to kill the connections by using the connection PID from ftptop and executing kill -9 <PID>. These hanging connections are causing CPU 100%.

Here's what I see in the docker logs for proftpd for one such malicious connection that was made on Jan 13 and remained in COMMAND (authenticating) for 270h!:

{"log":"2022-01-13 23:35:22,627 ce99838480c9 proftpd[17641] 172.31.0.2 (193.187.91.226[193.187.91.226]): mod_delay/0.7: no DelayOnEvent rules configured with \"DelayTable none\" in effect, disabling module\n","stream":"stderr","time":"2022-01-13T23:35:22.627867719Z"}
{"log":"2022-01-13 23:35:22,628 ce99838480c9 proftpd[17641] 172.31.0.2 (193.187.91.226[193.187.91.226]): FTP session opened.\n","stream":"stderr","time":"2022-01-13T23:35:22.628442412Z"}
{"log":"2022-01-13 23:35:22,706 ce99838480c9 proftpd[17641] 172.31.0.2 (193.187.91.226[193.187.91.226]): USER admin: no such user found from 193.187.91.226 [193.187.91.226] to 172.31.0.2:21\n","stream":"stderr","time":"2022-01-13T23:35:22.707114467Z"}
{"log":"2022-01-13 23:35:22,786 ce99838480c9 proftpd[17641] 172.31.0.2 (193.187.91.226[193.187.91.226]): USER admin: no such user found from 193.187.91.226 [193.187.91.226] to 172.31.0.2:21\n","stream":"stderr","time":"2022-01-13T23:35:22.786513963Z"}

The attacker tries to login with an invalid user admin but this connection is never terminated until I discover it 1.5 weeks after Jan 13 and manually kill the PID.

What I Expected/Wanted

I expect that connections should be dropped after failing user/passwd auth and that the proftpd.conf settings for "TimeoutNoTransfer", "TimeoutStalled", "TimeoutIdle" should apply to all connections. I also expected that ftpdctl kick <IP> should work on stalled connections. Note that ftpdctl kick works fine on non-zombie/malicious ftp sessions.

ProFTPD Version and Configuration

Please help us reproduce the problem/issue you are encountering. To do this, we need to know which version of ProFTPD you are using, how it was built, etc. The following command is an easy way to get all of this information:

$ proftpd -V
Compile-time Settings:
  Version: 1.3.7c (maint)
  Platform: LINUX [Linux 5.4.0-1057-gcp x86_64]
  Built: Sun Sep 5 2021 00:00:00 UTC
  Built With:
    configure  '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-depend
ency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--da
tadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--localstatedir=/var'
'--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--libexecdir=/usr/libexec/proftpd' '
--localstatedir=/run/proftpd' '--disable-strip' '--enable-ctrls' '--enable-dso' '--enable-facl' '--enable-ipv6' '--enab
le-memcache' '--enable-nls' '--enable-openssl' '--disable-pcre' '--disable-redis' '--enable-shadow' '--enable-tests=non
etwork' '--with-libraries=/usr/lib64/mariadb' '--with-includes=/usr/include/mysql' '--with-modules=mod_readme:mod_auth_
pam:mod_tls' '--with-shared=mod_sql:mod_sql_passwd:mod_sql_mysql:mod_sql_postgres:mod_sql_sqlite:mod_quotatab:mod_quota
tab_file:mod_quotatab_ldap:mod_quotatab_radius:mod_quotatab_sql:mod_ldap:mod_ban:mod_ctrls_admin:mod_facl:mod_load:mod_
vroot:mod_radius:mod_ratio:mod_rewrite:mod_site_misc:mod_exec:mod_shaper:mod_wrap2:mod_wrap2_file:mod_wrap2_sql:mod_cop
y:mod_deflate:mod_ifversion:mod_qos:mod_sftp:mod_sftp_pam:mod_sftp_sql:mod_tls_shmcache:mod_tls_memcache:mod_unique_id:
mod_ifsession' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CC=gcc' 'CFLAGS=-O2 -flto=au
to -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
-Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/r
edhat/redhat-annobin-cc1  -m64  -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection' '
LDFLAGS=-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redh
at/redhat-annobin-cc1 ' 'CXX=g++' 'CXXFLAGS=-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pip
e -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-har
dened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64  -mtune=generic -fasynchronous-u
nwind-tables -fstack-clash-protection -fcf-protection'

  CFLAGS: -g2 -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-securit
y -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-st
rong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64  -mtune=generic -fasynchronous-unwind-tables -fstack-clash-pro
tection -fcf-protection -Wall -fno-omit-frame-pointer -fno-strict-aliasing -Werror=implicit-function-declaration
  LDFLAGS: -L$(top_srcdir)/lib -L$(top_builddir)/lib -Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redha
t/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -rdynamic -L/usr/lib64/mariadb -L/usr/lib64/ -L/usr
/lib64
  LIBS: -lacl  -lssl -lcrypto -lcap  -lssl -lcrypto  -lpam -lsupp -lattr -lresolv -lresolv -lcrypt -lmemcachedutil -lme
mcached

  Files:
    Configuration File:
      /etc/proftpd.conf
    Pid File:
      /run/proftpd/proftpd.pid
    Scoreboard File:
      /run/proftpd/proftpd.scoreboard
    Header Directory:
      /usr/include/proftpd
    Shared Module Directory:
      /usr/libexec/proftpd

  Info:
    + Max supported UID: 4294967295
    + Max supported GID: 4294967295

  Features:
    - Autoshadow support
    + Controls support
    + curses support
    - Developer support
    + DSO support
    + IPv6 support
    + Largefile support
    - Lastlog support
    + Memcache support
    + ncursesw support
    + NLS support
    + OpenSSL support (OpenSSL 1.1.1k  FIPS 25 Mar 2021, FIPS enabled)
    - PCRE support
    + POSIX ACL support
    - Redis support
    + Sendfile support
    + Shadow file support
    - Sodium support
    + Trace support
    + xattr support

  Tunable Options:
    PR_TUNABLE_BUFFER_SIZE = 1024
    PR_TUNABLE_DEFAULT_RCVBUFSZ = 8192
    PR_TUNABLE_DEFAULT_SNDBUFSZ = 8192
    PR_TUNABLE_ENV_MAX = 2048
    PR_TUNABLE_GLOBBING_MAX_MATCHES = 100000
    PR_TUNABLE_GLOBBING_MAX_RECURSION = 8
    PR_TUNABLE_HASH_TABLE_SIZE = 40
    PR_TUNABLE_LOGIN_MAX = 256
    PR_TUNABLE_NEW_POOL_SIZE = 512
    PR_TUNABLE_PATH_MAX = 4096
    PR_TUNABLE_SCOREBOARD_BUFFER_SIZE = 80
    PR_TUNABLE_SCOREBOARD_SCRUB_TIMER = 30
    PR_TUNABLE_SELECT_TIMEOUT = 30
    PR_TUNABLE_TIMEOUTIDENT = 10
    PR_TUNABLE_TIMEOUTIDLE = 600
    PR_TUNABLE_TIMEOUTLINGER = 10
    PR_TUNABLE_TIMEOUTLOGIN = 300
    PR_TUNABLE_TIMEOUTNOXFER = 300
    PR_TUNABLE_TIMEOUTSTALLED = 3600
    PR_TUNABLE_XFER_SCOREBOARD_UPDATES = 10

I am using the proftpd v1.3.7c package provided by Fedora 35.

In addition, we need to see all of the ProFTPD configuration files you are using (minus any sensitive information like passwords, of course). Armed with the version and configuration data, then, we can set up ProFTPD locally using the same configuration, and see what happens.

proftpd conf files:

proftpd-bug-rpt.tar.gz

ftptop screen shots of zombie connections

Screen Shot 2021-12-28 at 11 33 18

Screen Shot 2022-01-25 at 15 17 38

gojun077 avatar Jan 26 '22 09:01 gojun077

Thanks for the detailed bug report! For these scanner connections, can you tell if they are only sending the USER command (for e.g. "admin"), or do they appear to hang after sending a PASS command? There's a lot more machinery involved in the processing of a PASS command; if these "unkillable connections" happen only with USER commands, that will narrow down the code paths to examine...

Castaglia avatar Jan 30 '22 17:01 Castaglia

One way to tell about the above question is to use ps auxwww (especially as root); ProFTPD updates the process title for each session with its current/recent command. This means that ps auxwww will show a little more information than ftptop; I'm hoping that it can show us the last command/activity for these hanging connections.

Castaglia avatar Jan 30 '22 19:01 Castaglia

One way to tell about the above question is to use ps auxwww (especially as root)

Sure -- I need to add ps to the stripped-down fedora35 container environment running proftpd and then next time I see a hanging connection I will run ps auxwww and share my findings with you. btw I am currently on holiday for the Lunar New Year in Asia so it will be several days before I can take a look at this. Thanks for your help.

gojun077 avatar Jan 31 '22 12:01 gojun077

Sorry for the delay in getting back to you! I created a new proftpd container image that includes ps and when I got another hanging connection I ran ftptop and ps auxwww. Here are the screenshots (although I blurred out the user name and IP from ps). ftptop_hanging_cnxn

psauxwww_hanging_cnxn

Note that this time the hanging connection came from a normal user I gave access to, not a scanner like Qualys. The only difference between the ftptop and ps output is that the latter also shows the username of the connection stuck in "authenticating".

Thanks

gojun077 avatar Mar 08 '22 03:03 gojun077

Thanks! For that normal user session, did the client eventually send the PASS command (to complete the authentication), or did it just send USER, and then do nothing?

Castaglia avatar Mar 19 '22 22:03 Castaglia

For the normal user session w/ PID 32148 the connection got stuck after USER was sent, but I think this is an isolated incident b/c user al... has been using proftpd without problems.

gojun077 avatar Mar 20 '22 13:03 gojun077

OK. I'm poring over the code base, to see what else might be happening here. In the mean time, the next time you see one of these stuck/unkillable processes, can you try doing a strace -f -p $PID on the process ID? That might provide some more information about what system calls it is doing, which can then help me narrow down where in the code those system calls are happening. Thanks!

Castaglia avatar Apr 02 '22 21:04 Castaglia

We are also seeing this issue, though only from malicious connections so far.

An strace -f -p of the PID shows the same thing looping (two entries here):

--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0
rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0
alarm(5) = 0
rt_sigreturn({mask=[]}) = 35984944
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0
rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0
rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0
alarm(5) = 0
rt_sigreturn({mask=[]}) = 35952464

Our config:

#  /opt/proftpd/sbin/proftpd -V
Compile-time Settings:
  Version: 1.3.7d (git)
  Platform: LINUX [Linux 3.10.0-1160.53.1.el7.x86_64 x86_64]
  Built: Fri Feb 25 2022 20:09:42 GMT
  Built With:
    configure  '--prefix=/opt/proftpd-1.3.7d-56d500e' '--enable-ctrls' '--with-modules=mod_sftp:mod_tls:mod_ifsession:mod_ban:mod_auth_pam:mod_vroot'

  CFLAGS: -g2 -O2 -Wall -fno-omit-frame-pointer -fno-strict-aliasing
  LDFLAGS: -L$(top_srcdir)/lib -L$(top_builddir)/lib  -rdynamic
  LIBS:  -lssl -lcrypto -lpam  -lssl  -lcrypto -lsupp -lcrypt -ldl

  Files:
    Configuration File:
      /opt/proftpd-1.3.7d-56d500e/etc/proftpd.conf
    Pid File:
      /opt/proftpd-1.3.7d-56d500e/var/proftpd.pid
    Scoreboard File:
      /opt/proftpd-1.3.7d-56d500e/var/proftpd.scoreboard

  Info:
    + Max supported UID: 4294967295
    + Max supported GID: 4294967295

  Features:
    - Autoshadow support
    + Controls support
    - curses support
    - Developer support
    - DSO support
    + IPv6 support
    + Largefile support
    - Lastlog support
    - Memcache support
    - ncurses support
    - NLS support
    + OpenSSL support (OpenSSL 1.0.2k  26 Jan 2017, FIPS enabled)
    - PCRE support
    - POSIX ACL support
    - Redis support
    + Sendfile support
    + Shadow file support
    - Sodium support
    + Trace support
    + xattr support

  Tunable Options:
    PR_TUNABLE_BUFFER_SIZE = 1024
    PR_TUNABLE_DEFAULT_RCVBUFSZ = 8192
    PR_TUNABLE_DEFAULT_SNDBUFSZ = 8192
    PR_TUNABLE_ENV_MAX = 2048
    PR_TUNABLE_GLOBBING_MAX_MATCHES = 100000
    PR_TUNABLE_GLOBBING_MAX_RECURSION = 8
    PR_TUNABLE_HASH_TABLE_SIZE = 40
    PR_TUNABLE_LOGIN_MAX = 256
    PR_TUNABLE_NEW_POOL_SIZE = 512
    PR_TUNABLE_PATH_MAX = 4096
    PR_TUNABLE_SCOREBOARD_BUFFER_SIZE = 80
    PR_TUNABLE_SCOREBOARD_SCRUB_TIMER = 30
    PR_TUNABLE_SELECT_TIMEOUT = 30
    PR_TUNABLE_TIMEOUTIDENT = 10
    PR_TUNABLE_TIMEOUTIDLE = 600
    PR_TUNABLE_TIMEOUTLINGER = 10
    PR_TUNABLE_TIMEOUTLOGIN = 300
    PR_TUNABLE_TIMEOUTNOXFER = 300
    PR_TUNABLE_TIMEOUTSTALLED = 3600
    PR_TUNABLE_XFER_SCOREBOARD_UPDATES = 10

The OS is Centos 7.9, kernel 3.10.0-1160.53.1.el7.x86_64.

jpfinger avatar Apr 05 '22 14:04 jpfinger

@jpfinger Thanks for that info, that helps! The next time you can, can you run strace -f -p -t -T? The additional -t -T options will add the wall clock timestamps, and time spent in syscalls, to the strace output. For the above SIGALRM loop you provided, the timing can help us narrow down which timers are causing the SIGARLM signals, which in turn can narrow down the code path in question.

Castaglia avatar Apr 06 '22 14:04 Castaglia

Unfortunately I had to kill the existing stuck sessions we had and we haven't seen any more yet. This is still on my radar and will get an update when we see it again.

jpfinger avatar Apr 20 '22 15:04 jpfinger

Ok, we have one currently stuck at ~20 hours. Someone knows we're talking about them! I'll hold off on killing this one for a bit.

# strace -f -t -T -p 28387
strace: Process 28387 attached
19:13:36 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
19:13:36 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000019>
19:13:36 rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0 <0.000022>
19:13:36 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000016>
19:13:36 alarm(5)                       = 0 <0.000031>
19:13:36 rt_sigreturn({mask=[]})        = 40489792 <0.000017>
19:13:41 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
19:13:41 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000026>
19:13:41 rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0 <0.000021>
19:13:41 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000021>
19:13:41 alarm(5)                       = 0 <0.000022>
19:13:41 rt_sigreturn({mask=[]})        = 36750912 <0.000022>
19:13:46 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
19:13:46 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000025>
19:13:46 rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0 <0.000027>
19:13:46 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000020>
19:13:46 alarm(5)                       = 0 <0.000021>
19:13:46 rt_sigreturn({mask=[]})        = 35991104 <0.000021>
19:13:51 --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
19:13:51 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000025>
19:13:51 rt_sigaction(SIGALRM, NULL, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, 8) = 0 <0.000022>
19:13:51 rt_sigaction(SIGALRM, {sa_handler=0x422ab0, sa_mask=[], sa_flags=SA_RESTORER|SA_INTERRUPT, sa_restorer=0x7fc76b4f0400}, NULL, 8) = 0 <0.000021>
19:13:51 alarm(5)                       = 0 <0.000021>
19:13:51 rt_sigreturn({mask=[]})        = 38513872 <0.000021>

jpfinger avatar Apr 21 '22 19:04 jpfinger

Thanks! From that trace output, we do see that the SIGARLM signal is being sent to the process every 5 secs. None of the hardcoded default timeouts are 5s; @jpfinger do you have any 5s timeouts in your ProFTPD configuration anywhere (including contrib modules)?

Castaglia avatar Apr 21 '22 20:04 Castaglia

In fact we do in mod_ctrls:


...
<IfModule mod_ctrls.c>
  ControlsEngine        on
  ControlsACLs          all allow user root
  ControlsMaxClients    2
  ControlsLog           /some/path/logs/controls.log
  ControlsInterval      5
  ControlsSocket        /some/path/ctrls.sock
  ControlsSocketOwner   root root
  ControlsSocketACL     allow user root
  <IfModule mod_ctrls_admin.c>
    AdminControlsACLs all allow user root
  </IfModule>
</IfModule>
...

jpfinger avatar Apr 21 '22 20:04 jpfinger

OK, great. I'll work through the mod_ctrls code, with an eye toward newly forming connections/sessions, see if anything turns up.

In the mean time, for the next time you see such unkillable processes, in addition to the strace -f -p -t -T command, I'm thinking it might (or might not) be useful to configure trace logging for timers, e.g.:

TraceLog /path/to/proftpd/trace.log
Trace timer:30

and then search the generated TraceLog for the PID of the unkillable process. That should confirm that it is indeed the ControlsInterval timer which keeps firing. In theory, it should not be that particular timer, since mod_ctrls is supposed to remove that timer on TCP connect (see here); that timer is meant only for the listening daemon process, not the forked session process.

Castaglia avatar Apr 21 '22 20:04 Castaglia

Hmm; it turns out that there is another way that we might see these 5s SIGALRM signals, without any explicitly configured timeouts matching that 5s: see https://github.com/proftpd/proftpd/blob/master/src/timers.c#L191.

The Timer API is not entirely accurate, and in this case, it will default to checking any registered timer intervals/callbacks every 5s anyway. So that's possibly what's happening with this unkillable process.

Unfortunately, that doesn't provide any clues yet about how the process got into this state. But the fact that it's "unkillable" does suggest that timers/alarms are blocked for that process somehow -- I'll be adding trace logging, for the timer trace channel, for when alarms are being blocked/unblocked, to aid in this diagnosis.

Castaglia avatar Apr 23 '22 20:04 Castaglia

@jpfinger for your stuck PID 28387, in the strace output, is there anything other than the SIGALRM? That is, in addition to this 5s SIGLARM signal being delivered, I would also expect to see some other system calls, such as read(2) on a socket or somesuch...

Castaglia avatar Apr 24 '22 16:04 Castaglia

@Castaglia I ran the strace against it for 30 minutes this morning and all I got were the quoted SIGALRM lines.

jpfinger avatar Apr 25 '22 15:04 jpfinger

Here are the timer trace log lines for one of these connections:

2022-04-30 00:39:23,379 [7139] <timer:7>: added timer ID 1 ('TimeoutLogin', for module 'auth'), triggering in 300 seconds
2022-04-30 00:39:23,380 [7139] <timer:7>: added timer ID 2 ('TimeoutIdle', for module 'core'), triggering in 600 seconds
2022-04-30 00:39:28,580 [7139] <timer:7>: reset timer ID 2 ('TimeoutIdle', for module 'core')
2022-04-30 00:39:28,696 [7139] <timer:7>: reset timer ID 2 ('TimeoutIdle', for module 'core')
2022-04-30 00:39:33,897 [7139] <timer:7>: reset timer ID 2 ('TimeoutIdle', for module 'core')
2022-04-30 00:39:34,013 [7139] <timer:7>: reset timer ID 2 ('TimeoutIdle', for module 'core')

jpfinger avatar May 01 '22 13:05 jpfinger

Hi,

we are seeing the same issue since quite some time. I'm not sure we are running the same configuration, however today I actually attached the debugger.

(gdb) bt
#0  0x0000557bb1cb879c in free_blocks (pool_tag=<optimized out>, blok=0x557bb1fadf00) at pool.c:173
#1  clear_pool (p=0x557bb21ee8a0) at pool.c:580
#2  destroy_pool (p=0x557bb21ee8a0) at pool.c:611
#3  0x0000557bb1cb8f91 in destroy_pool (p=<optimized out>) at pool.c:591
#4  0x0000557bb1cecd69 in pr_fsio_close (fh=0x557bb21ee8e8) at fsio.c:5132
#5  0x0000557bb1d1dd14 in af_endgrent () at mod_auth_file.c:593
#6  authfile_endgrent (cmd=<optimized out>) at mod_auth_file.c:1238
#7  0x0000557bb1ce062e in pr_module_call (m=0x557bb1e25ec0 <auth_file_module>, func=0x557bb1d1dcf6 <authfile_endgrent>, cmd=cmd@entry=0x557bb21e2dc0)
    at modules.c:59
#8  0x0000557bb1ce33db in dispatch_auth (cmd=cmd@entry=0x557bb21e2dc0, match=match@entry=0x557bb1dc42ea "endgrent", m=m@entry=0x0) at auth.c:425
#9  0x0000557bb1ce3c16 in pr_auth_endgrent (p=<optimized out>) at auth.c:523
#10 0x0000557bb1d20ea0 in auth_pre_pass (cmd=0x557bb201bdf8) at mod_auth.c:2495
#11 0x0000557bb1ce062e in pr_module_call (m=0x557bb1e26500 <auth_module>, func=0x557bb1d20e6b <auth_pre_pass>, cmd=cmd@entry=0x557bb201bdf8) at modules.c:59
#12 0x0000557bb1cb3e83 in _dispatch (cmd=cmd@entry=0x557bb201bdf8, cmd_type=cmd_type@entry=1, validate=validate@entry=0, match=0x557bb201be90 "PASS",
    match@entry=0x0) at main.c:366
#13 0x0000557bb1cb6b17 in pr_cmd_dispatch_phase (cmd=0x557bb201bdf8, phase=phase@entry=0, flags=flags@entry=3) at main.c:678
#14 0x0000557bb1cb6d3e in pr_cmd_dispatch (cmd=<optimized out>) at main.c:801
#15 cmd_loop (server=<optimized out>, c=<optimized out>) at main.c:944
#16 0x0000557bb1cb4e21 in fork_server (fd=<optimized out>, l=<optimized out>, no_fork=<optimized out>) at main.c:1517
#17 0x0000557bb1cb5bb6 in daemon_loop () at main.c:1766
#18 0x0000557bb1cb351e in standalone_main () at main.c:1967
#19 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at main.c:2792

Looking into free_blocks it seems there is a reference to itself (fn_s_s_p_blok_hnext does s + s + p blok->h.next):

(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$19 = (union block_hdr *) 0x5573c6883e20
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$20 = (union block_hdr *) 0x5573c6882c50
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$21 = (union block_hdr *) 0x5573c68843c0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$22 = (union block_hdr *) 0x5573c6882b60
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$23 = (union block_hdr *) 0x5573c6884a50
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$24 = (union block_hdr *) 0x5573c6882a70
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$25 = (union block_hdr *) 0x5573c6883d30
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$26 = (union block_hdr *) 0x5573c6882980
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$27 = (union block_hdr *) 0x5573c68841e0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$28 = (union block_hdr *) 0x5573c69458c0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$29 = (union block_hdr *) 0x5573c6945af0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$30 = (union block_hdr *) 0x5573c694f3f0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$31 = (union block_hdr *) 0x5573c6950630
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$32 = (union block_hdr *) 0x5573c694f1c0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$33 = (union block_hdr *) 0x5573c68831f0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$34 = (union block_hdr *) 0x5573c68845a0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$35 = (union block_hdr *) 0x5573c6883010
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$36 = (union block_hdr *) 0x5573c6884000
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$37 = (union block_hdr *) 0x5573c6883100
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$38 = (union block_hdr *) 0x5573c6884690
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$39 = (union block_hdr *) 0x5573c6882f20
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$40 = (union block_hdr *) 0x5573c6883f10
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$41 = (union block_hdr *) 0x5573c6882e30
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$42 = (union block_hdr *) 0x5573c68844b0
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$43 = (union block_hdr *) 0x5573c6882d40
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
$44 = (union block_hdr *) 0x5573c6883e20
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
# !! at this point we are starting again !!
$45 = (union block_hdr *) 0x5573c6882c50
(gdb) fn_s_s_p_blok_hnext
175     in pool.c
173     in pool.c
[...]

One more thing to add. The process table entry always is: <pid> ftp .... proftpd: a.b.c.d:12345: USER xyz. xyz is a valid user, however no authentication has happened as the process is still running with the unprivileged user. Running ProFTPD 1.3.8rc3

manuelm avatar Sep 22 '22 10:09 manuelm

@manuelm Thanks, that is very helpful! I'll focus some attention in that area, see if I can figure out how that particular clear_pool() behavior can happen.

Castaglia avatar Sep 25 '22 18:09 Castaglia

@manuelm would you be able to try out https://github.com/proftpd/proftpd/commit/16870859b39b891b68e3708d76fa843747051409 in your environment, see if it helps clear up this issue in your environment?

Castaglia avatar Oct 01 '22 21:10 Castaglia

@manuelm would you be able to try out 1687085 in your environment, see if it helps clear up this issue in your environment?

Yeah sure. Rolling out the patch as we speak. It usually takes a few days, maybe even 1-2 weeks for the bug to happen.

manuelm avatar Oct 01 '22 21:10 manuelm

Due to kernel updates and subsequent reboots it took longer to verify than expected. However we are seeing this issue again.

backtrace is the same. Loop in pool.c:173

manuelm avatar Oct 24 '22 13:10 manuelm

@manuelm Got it. Thanks for the update. I should've looked closer at the stack trace you provided; the loop in the linked list is quite large, and not as simple as assumed by my PR.

To help unwind this linked list loop, would it be possible to print out the pool_tag variable in gdb, for each of those layers? I know it might be tedious (I wish I could reproduce this locally!), but knowing those tags of the pools would help identify where those pools come from in the source code -- which, in turn, would help me understand how this loop is created in the first place.

Castaglia avatar Oct 27 '22 00:10 Castaglia

To help unwind this linked list loop, would it be possible to print out the pool_tag variable in gdb, for each of those layers?

(gdb) bt
#0  0x0000562a6b1f773f in free_blocks (pool_tag=<optimized out>, blok=0x562a6cd6e530) at pool.c:175
#1  clear_pool (p=0x562a6cd73cb0) at pool.c:589
#2  destroy_pool (p=0x562a6cd73cb0) at pool.c:620
#3  0x0000562a6b1f7f2e in destroy_pool (p=<optimized out>) at pool.c:600
[...]

Verify loop:

(gdb) p ((pool *)0x562a6cd73cb0)->first
(union block_hdr *) 0x562a6cd73c90
(gdb) p ((pool *)0x562a6cd73cb0)->first->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next->h.next
(union block_hdr *) 0x562a6cd73c90

all ancestors:

(gdb) p ((pool *)0x562a6cd73cb0)->tag
0x0
(gdb) p ((pool *)0x562a6cd73cb0)->parent->tag
0x562a6b30439f "FS Pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->parent->tag
0x562a6b2ff430 "<VirtualHost> Pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->parent->parent->tag
0x562a6b2fb645 "permanent_pool"

Looks like the tag of the pool with the loop has been optimized out. I'll re-compile + rollout with less optimization.

parents sub pools:

(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->sub_next->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->tag
0x562a6b2fc48d "config_rec pool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->tag
0x562a6b2fc4af "pr_config_add() subpool"
(gdb) p ((pool *)0x562a6cd73cb0)->parent->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->sub_next->tag
0x562a6b2fbdd1 "table pool"

manuelm avatar Oct 27 '22 08:10 manuelm

@manuelm Would it be possible to see your ProFTPD configuration, and proftpd -V output, so that I can see which modules (and which vhost configuration) are involved with your stack trace? Thanks!

Castaglia avatar Oct 29 '22 15:10 Castaglia

Sure. Only users using the FTP vhost are looping. SFTP is working fine. All foreign modules should be available at https://manuel.mausz.at/coding/proftpd/

Compile-time Settings:
  Version: 1.3.8rc3 (devel)
  Platform: LINUX [Linux 5.15.74-gentoo-ud59 x86_64]
  OS/Release:
    NAME=Gentoo
    ID=gentoo
    PRETTY_NAME="Gentoo Linux"
    VERSION_ID="2.8"
  Built: Thu Oct 27 2022 10:45:45 CEST
  Built With:
    configure  '--prefix=/usr' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc' '--localstatedir=/var/lib' '--docdir=/usr/share/doc/proftpd-1.3.8_rc3' '--htmldir=/usr/share/doc/proftpd-1.3.8_rc3/html' '--libdir=/usr/lib64' '--localstatedir=/run/proftpd' '--sysconfdir=/etc/proftpd' '--disable-strip' '--enable-facl' '--enable-auth-file' '--enable-cap' '--disable-ctrls' '--disable-dso' '--disable-ident' '--enable-ipv6' '--disable-memcache' '--disable-ncurses' '--enable-nls' '--enable-openssl' '--disable-auth-pam' '--enable-pcre' '--enable-sodium' '--disable-tests' '--enable-trace' '--enable-shadow' '--enable-autoshadow' '--with-includes=/usr/include' '--with-modules=mod_facl:mod_ifsession:mod_sftp:mod_tls:mod_tls_shmcache:mod_wrap2:mod_wrap2_file:mod_vroot:mod_useralias:mod_fw_ipset:mod_log_login:mod_dovecot_anvil:mod_xclient' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CC=x86_64-pc-linux-gnu-gcc' 'CFLAGS=-march=nocona -O2 -pipe -g' 'LDFLAGS=-Wl,-O1 -Wl,--as-needed' 'CXXFLAGS=-march=nocona -O2 -pipe -g'

  CFLAGS: -g2 -march=nocona -O2 -pipe -g -Wall -fno-omit-frame-pointer -fno-strict-aliasing
  LDFLAGS: -Wl,-L$(top_srcdir)/lib,-L$(top_builddir)/lib -Wl,-O1 -Wl,--as-needed -rdynamic 
  LIBS:  -lpcreposix -lpcre -lssl -lcrypto -lsodium -lipset -lcap  -ldb-5.3  -lssl  -lcrypto -lattr -lcrypt  -pthread

  Files:
    Configuration File:
      /etc/proftpd/proftpd.conf
    Pid File:
      /run/proftpd/proftpd.pid
    Scoreboard File:
      /run/proftpd/proftpd.scoreboard

  Info:
    + Max supported UID: 4294967295
    + Max supported GID: 4294967295

  Features:
    + Autoshadow support
    - Controls support
    + curses support
    - Developer support
    - DSO support
    + IPv6 support
    + Largefile support
    - Lastlog support
    - Memcache support
    - ncurses support
    + NLS support
    + OpenSSL support (OpenSSL 1.1.1q  5 Jul 2022)
    + PCRE support
    - PCRE2 support
    + POSIX ACL support
    - Redis support
    + Sendfile support
    + Shadow file support
    + Sodium support
    + Trace support
    + xattr support

  Tunable Options:
    PR_TUNABLE_BUFFER_SIZE = 1024
    PR_TUNABLE_DEFAULT_RCVBUFSZ = 8192
    PR_TUNABLE_DEFAULT_SNDBUFSZ = 8192
    PR_TUNABLE_ENV_MAX = 2048
    PR_TUNABLE_GLOBBING_MAX_MATCHES = 100000
    PR_TUNABLE_GLOBBING_MAX_RECURSION = 8
    PR_TUNABLE_HASH_TABLE_SIZE = 40
    PR_TUNABLE_LOGIN_MAX = 256
    PR_TUNABLE_NEW_POOL_SIZE = 512
    PR_TUNABLE_PATH_MAX = 4096
    PR_TUNABLE_SCOREBOARD_BUFFER_SIZE = 80
    PR_TUNABLE_SCOREBOARD_SCRUB_TIMER = 30
    PR_TUNABLE_SELECT_TIMEOUT = 30
    PR_TUNABLE_TIMEOUTIDENT = 10
    PR_TUNABLE_TIMEOUTIDLE = 600
    PR_TUNABLE_TIMEOUTLINGER = 10
    PR_TUNABLE_TIMEOUTLOGIN = 300
    PR_TUNABLE_TIMEOUTNOXFER = 300
    PR_TUNABLE_TIMEOUTSTALLED = 3600
    PR_TUNABLE_XFER_SCOREBOARD_UPDATES = 10
ServerName  "FTP Service"
ServerType  standalone
ServerIdent on "FTP Server ready."

UseFtpUsers     off
UseReverseDNS   off
DefaultServer   off
SocketBindTight on
ScoreboardScrub off

User  ftp
Group ftp

MaxInstances 30

# see https://github.com/proftpd/proftpd/issues/1528
FSCachePolicy off

<Class xclient>
  From 194.117.254._
</Class>

<Global>
  DefaultRoot ~
  Umask 022
  VRootEngine on

  MaxClientsPerUser     6
  MaxClientsPerHost     6
  MaxConnectionsPerHost 6
  TimeoutLogin   30
  TimeoutIdle    180
  TimeoutStalled 180
  TimeoutSession 1800
  DelayEngine    off

  TimesGMT      off
  #IdentLookups off
  WtmpLog       off
  TransferLog   /var/log/xferlog

  AllowOverwrite    on
  AllowStoreRestart on

  # Groups that are not allowed to login
  <Limit LOGIN>
    AllowGroup ftponly
    DenyAll
  </Limit>
  RequireValidShell off

  # drop some default capabilities
  RootRevoke on
  CapabilitiesSet -CAP_CHOWN -CAP_SETUID

  # custom modules
  UserAliasDB /etc/proftpd/useralias
  UserAliasToLower on
  DovecotAnvil on
  <IfClass xclient>
    TLSRequired off
    XClientEngine on
    AllowForeignAddress on
  </IfClass>
</Global>

# disable the main server
Port 0

<VirtualHost ::>
  Port 21
  PassivePorts 49200 49399
  ListOptions "-a"

  AuthUserFile  /etc/proftpd/auth_passwd
  AuthGroupFile /etc/proftpd/auth_group
  # disable PAM as we have virtual users
  <IfModule mod_auth_pam.c>
    AuthPAM off
  </IfModule>
  AuthOrder mod_auth_file.c

  TLSEngine   on
  TLSRequired on
  TLSProtocol ALL -TLSv1 -TLSv1.1
  TLSRSACertificateFile /etc/proftpd/certs/server.pem
  TLSECCertificateFile  /etc/proftpd/certs/server2.pem
  TLSDHParamFile        /etc/proftpd/dhparams.pem
  TLSCipherSuite EECDH+AESGCM:EDH+AESGCM:EECDH+CHACHA20:EDH+CHACHA20:!AESCCM:!DSS
  TLSServerCipherPreference on
  # some ftp clients have problems with reusing the session
  TLSOptions NoSessionReuseRequired
  TLSRenegotiate none

  # TLS cipher logging
  TLSOptions NoSessionReuseRequired StdEnvVars

  # custom modules
  FwIPsetName4 ftpin4
  FwIPsetName6 ftpin6
</VirtualHost>

<VirtualHost ::>
  Port 33

  AuthUserFile  /etc/proftpd/auth_passwd
  AuthGroupFile /etc/proftpd/auth_group
  # disable PAM as we have virtual users
  <IfModule mod_sftp_pam.c>
    SFTPPAMEngine off
  </IfModule>
  AuthOrder mod_auth_file.c

  SFTPEngine on
  SFTPHostKey NoDSA
  SFTPHostKey /etc/ssh/ssh_host_rsa_key
  SFTPHostKey /etc/ssh/ssh_host_ecdsa_key
  SFTPHostKey /etc/ssh/ssh_host_ed25519_key
  SFTPKeyBlacklist /etc/proftpd/blacklist.dat
  # see https://github.com/proftpd/proftpd/commit/b307fba5957b3846b19b2dc71e663941686d6902
  #SFTPHostKeys ssh-ed25519 ecdsa-sha2-nistp256 rsa-sha2-256 rsa-sha2-512
  SFTPCiphers [email protected] [email protected] \
    aes256-ctr aes192-ctr aes128-ctr
  SFTPKeyExchanges \
    [email protected] \
    ecdh-sha2-nistp521 ecdh-sha2-nistp384 ecdh-sha2-nistp256 \
    diffie-hellman-group18-sha512 diffie-hellman-group16-sha512 \
    diffie-hellman-group14-sha256 diffie-hellman-group-exchange-sha256
  SFTPDigests [email protected] [email protected] [email protected] \
    [email protected] hmac-sha2-256 hmac-sha2-512
  SFTPAuthorizedUserKeys file:/etc/proftpd/authkeys/%u

  # Allow the same number of authentication attempts as OpenSSH.
  MaxLoginAttempts 6
</VirtualHost>

manuelm avatar Oct 29 '22 21:10 manuelm

@manuelm Thanks! I hadn't realized there were non-standard modules involved in your build; good to know. Hopefully they aren't causing this behavior -- I'll take a look at them.

And, separately, as a mitigation measure, I've been thinking of ways to properly handle signal interruption in the Pool API internals, where normally signals are blocked (and handled later). That would be a way of interrupting this infinite loop, and protecting against future occurrences...but it would also be nice to understand and address the cause, too.

Castaglia avatar Oct 30 '22 17:10 Castaglia

Looking at the details provided by @gojun077 and @jpfinger and @manuelm, I'm noticing that the mod_vroot module (which does use the FSIO API) is a common factor (as is mod_tls); let's see if I can make something of that. @manuelm which version of mod_vroot are you using?

Castaglia avatar Oct 30 '22 17:10 Castaglia

@manuelm Thanks! I hadn't realized there were non-standard modules involved in your build; good to know. Hopefully they aren't causing this behavior -- I'll take a look at them.

Hopefully :-) - I've disabled my mod_useralias module. It isn't used anyway.

@manuelm which version of mod_vroot are you using?

v0.9.9. However there's a new v0.9.10 tag. I can rebuild with that.

manuelm avatar Oct 30 '22 18:10 manuelm