ModSecurity
nginx worker hangs while logging to the audit log
Hello,
In our company we use ModSecurity with nginx. We noticed that sometimes nginx workers just hang and do nothing; however, when we strace the process (or attach gdb), the worker starts spinning again. This problem happens when we enable audit logging (e.g. SecAuditLog /var/log/modsec_audit.log). I have traced the issue to the following line: https://github.com/SpiderLabs/ModSecurity/blob/0eb3c123f447b8787ea726ad4d4439018a07ee31/src/utils/shared_files.cc#L236 It appears that the process is unable to wake up and acquire the lock for the file, and it just keeps waiting.
backtrace:
#1 0x00007f31debc4714 in __GI___pthread_mutex_lock (mutex=0x7f31debef008) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f31de95474a in modsecurity::utils::SharedFiles::write(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/x86_64-linux-gnu/libmodsecurity.so.3
#3 0x00007f31de8dd9a4 in modsecurity::audit_log::writer::Serial::write(modsecurity::Transaction*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) () from /usr/lib/x86_64-linux-gnu/libmodsecurity.so.3
#4 0x00007f31de8dcc0e in modsecurity::audit_log::AuditLog::saveIfRelevant(modsecurity::Transaction*, int) () from /usr/lib/x86_64-linux-gnu/libmodsecurity.so.3
#5 0x00007f31de8cd966 in modsecurity::Transaction::processLogging() () from /usr/lib/x86_64-linux-gnu/libmodsecurity.so.3
#6 0x0000563cedc9d11c in ngx_http_modsecurity_log_handler (r=<optimized out>) at ../modules/ngx_modsecurity_module/src/ngx_http_modsecurity_log.c:83
#7 0x0000563cedc0b41d in ngx_http_log_request (r=0x563cf1a308e0) at src/http/ngx_http_request.c:3684
#8 ngx_http_free_request (r=r@entry=0x563cf1a308e0, rc=<optimized out>) at src/http/ngx_http_request.c:3630
#9 0x0000563cedc0b4f8 in ngx_http_close_request (rc=0, r=<optimized out>) at src/http/ngx_http_request.c:3576
#10 0x0000563cedc1e90d in ngx_http_upstream_process_request (r=0x563cf1a308e0, u=0x563cf20a8538) at src/http/ngx_http_upstream.c:4122
#11 0x0000563cedc2275b in ngx_http_upstream_send_response (u=0x563cf20a8538, r=0x563cf1a308e0) at src/http/ngx_http_upstream.c:3306
#12 ngx_http_upstream_process_header (r=0x563cf1a308e0, u=0x563cf20a8538) at src/http/ngx_http_upstream.c:2479
#13 0x0000563cedc1dd92 in ngx_http_upstream_handler (ev=<optimized out>) at src/http/ngx_http_upstream.c:1329
#14 0x0000563cedbf6836 in ngx_epoll_process_events (cycle=<optimized out>, timer=<optimized out>, flags=<optimized out>) at src/event/modules/ngx_epoll_module.c:901
#15 0x0000563cedbece76 in ngx_process_events_and_timers (cycle=cycle@entry=0x563cf02d4250) at src/event/ngx_event.c:260
#16 0x0000563cedbf4a18 in ngx_worker_process_cycle (cycle=cycle@entry=0x563cf02d4250, data=data@entry=0x1) at src/os/unix/ngx_process_cycle.c:767
#17 0x0000563cedbf3159 in ngx_spawn_process (cycle=cycle@entry=0x563cf02d4250, proc=proc@entry=0x563cedbf4990 <ngx_worker_process_cycle>, data=data@entry=0x1, name=name@entry=0x563ceded3328 "worker process", respawn=respawn@entry=-3) at src/os/unix/ngx_process.c:203
#18 0x0000563cedbf5366 in ngx_start_worker_processes (cycle=cycle@entry=0x563cf02d4250, n=40, type=type@entry=-3) at src/os/unix/ngx_process_cycle.c:372
#19 0x0000563cedbf581b in ngx_master_process_cycle (cycle=0x563cf02d4250) at src/os/unix/ngx_process_cycle.c:144
#20 0x0000563cedbc8cad in main (argc=<optimized out>, argv=<optimized out>) at src/core/nginx.c:382
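For reference, here is a minimal, hypothetical sketch of the kind of cross-process locking the backtrace points at. This is not ModSecurity's actual code: it only illustrates how a serial audit log can serialize writes from all workers through one pthread mutex kept in shared memory, and how a robust mutex is one standard way to avoid every worker blocking forever if the lock holder dies while holding the lock.

// Hypothetical sketch, NOT ModSecurity's actual implementation.
#include <cerrno>
#include <cstdio>
#include <pthread.h>
#include <string>
#include <sys/mman.h>

struct SharedLock {
    pthread_mutex_t mutex;
};

// Created once in the master process; the mapping is inherited by every
// forked nginx worker, so they all see the same mutex.
static SharedLock *init_shared_lock() {
    void *mem = mmap(nullptr, sizeof(SharedLock), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return nullptr;

    auto *lock = static_cast<SharedLock *>(mem);
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    // The robust attribute turns "hangs forever" into a recoverable
    // EOWNERDEAD if the lock holder dies; a plain process-shared mutex
    // has no such escape hatch.
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&lock->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return lock;
}

static void write_audit_entry(SharedLock *lock, std::FILE *audit,
                              const std::string &entry) {
    int rc = pthread_mutex_lock(&lock->mutex);   // the call in frame #1 above
    if (rc == EOWNERDEAD) {
        // Previous owner died while holding the lock; mark the protected
        // state consistent so the mutex stays usable for everyone else.
        pthread_mutex_consistent(&lock->mutex);
    }
    std::fwrite(entry.data(), 1, entry.size(), audit);
    std::fflush(audit);
    pthread_mutex_unlock(&lock->mutex);
}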
We use the following connector
server:
- nginx 1.17 - 40 workers
- Debian 10 (buster)
- libmodsecurity3 (3.0.3)
Is this a bug, or can this directive not be used with multiple processes logging to the same file?
Hi, @wutchzone
That sounds like a bug. Further investigation is needed. Do you happen to have the audit log in Serial or Parallel mode?
We have set it as SecAuditLogType Serial.
I will try to create a reproducible example on a local machine. For now I am able to reproduce this behavior only in production, where we have many CPU cores (40+). I will also experiment with other types of locks.
Hi @wutchzone ,
I'm not certain whether this would help your particular situation, but you might want to consider upgrading your libmodsecurity version to v3.0.4. There were some fixes post-v3.0.3 that might be relevant. I'm thinking in particular of: https://github.com/SpiderLabs/ModSecurity/commit/3d2030426c22c41a85cda92289dc284094fba5fd
@wutchzone did you have the chance to test v3.0.4?
Here's the solution:
I was experiencing the same problem, and it would appear that Atomicorp wrote some 'bad rules' which have caused nginx processes to hang. If you are running ASL, or if you are just using Atomicorp's ModSecurity package, this is why your nginx processes are getting stuck. They have pushed out an update, so you will need to update your ModSecurity package as soon as possible, or you will probably keep getting the same errors.
Good luck.
We will update the library by the end of the week. I will let you know once I have the results.
Thank you all for quick response and help.
@zimmerle Sadly, updating the package to the latest version did not help either. Workers are still getting stuck, as can be seen in the image below.
@mrwizard64 I am not aware that we are using rules from Atomicorp. We are using this.
But after the update it gets stuck less frequently. Once I catch it again I will provide a full backtrace.
Thank you. The core file will lead you to the full backtrace. It could help a lot.
Thanks buddy
It looks like it is stuck on another pthread_mutex_lock(). I am unable to locate pthread_mutex_lock in this function (modsecurity::utils::find_resource), maybe something got inlined.
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
#1 0x00007fa81d99a714 in __GI___pthread_mutex_lock (mutex=0x7fa81d9c5008) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007fa81d72a74a in modsecurity::utils::find_resource(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*)
() from /usr/lib/x86_64-linux-gnu/libmodsecurity.so.3
#3 0x0000000000000008 in ?? ()
#4 0x384b505942563964 in ?? ()
#5 0x0000292934302100 in ?? ()
#6 0x00007fff254c5710 in ?? ()
#7 0x0000000000000000 in ?? ()
But this time it happens less often. When we turn off the Audit Log, the workers no longer get stuck.
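To illustrate the inlining guess above: if the small helper that actually takes the lock is inlined by the compiler, gdb attributes the blocked pthread_mutex_lock() frame directly to the caller, so no lock call is visible in the caller's source. A purely hypothetical example (the names here are made up; this is not ModSecurity code):

#include <pthread.h>
#include <string>

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

// Tiny helper; at -O2 the compiler will usually inline it into its caller,
// so it never gets its own stack frame.
static inline void lock_shared_state() {
    pthread_mutex_lock(&g_lock);   // the frame the worker actually blocks in
}

// Hypothetical caller: its source contains no pthread_mutex_lock() call, yet
// after inlining a backtrace shows this function directly below
// __GI___pthread_mutex_lock, similar to find_resource in the trace above.
bool find_resource_like(const std::string &key) {
    lock_shared_state();
    // ... look "key" up in some shared structure ...
    bool found = !key.empty();     // placeholder for the real lookup
    pthread_mutex_unlock(&g_lock);
    return found;
}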
Any chance that the server demand of actually writing to the audit log is draining system resources to the point of collapse?
Not possible. There is over 100 GiB of free RAM and a few TBs of disk space.
Is it possible to see a full backtrace with symbols (like in your original posting) for this new instance? The stack of function names with line numbers could be helpful.
Our nginx is compiled with Lua support; sometimes in GDB it is not possible to see function names in the backtrace because of the LuaVM. I don't know why, maybe they were somehow stripped or JITed out. I will check it out more.
@wutchzone Is this issue already solved?
No, the issue is still not resolved. Using mainline nginx 1.19 and the latest ModSecurity. After enabling the module, nginx just hangs after ~30 sec.
I am sorry for the late response. The issue still persists. I am unable to get a better backtrace, but the issue occurs more frequently on machines with more workers.
Hi @wutchzone, are you still facing this problem? I'm using the same spec as yours (ModSecurity v3.0.3 + OWASP CRS), and 4 of our pods crashed in the production environment yesterday. We have no idea what's going on because K8s just deleted those pods, so we couldn't check the logs afterwards.
How did you avoid this kind of problem in the end? Just by turning off audit logging?
Pod spec:
limits:
cpu: 400m
memory: 800Mi
requests:
cpu: 100m
memory: 300Mi
ModSecurity conf:
SecAuditEngine RelevantOnly
SecAuditLogRelevantStatus "^(?:5|4(?!04)(?!01))"
SecAuditLogParts ABIJDEFHKZ
SecAuditLogType Concurrent
SecAuditLog /var/log/modsec_audit.log
# And we include most of the OWASP CRS rules that are officially recommended
Include /etc/nginx/owasp-modsecurity-crs/crs-setup.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-900-EXCLUSION-RULES-BEFORE-CRS.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-901-INITIALIZATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-903.9001-DRUPAL-EXCLUSION-RULES.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-903.9002-WORDPRESS-EXCLUSION-RULES.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-905-COMMON-EXCEPTIONS.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-910-IP-REPUTATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-912-DOS-PROTECTION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-921-PROTOCOL-ATTACK.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-930-APPLICATION-ATTACK-LFI.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-931-APPLICATION-ATTACK-RFI.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-932-APPLICATION-ATTACK-RCE.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-933-APPLICATION-ATTACK-PHP.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-941-APPLICATION-ATTACK-XSS.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-942-APPLICATION-ATTACK-SQLI.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-943-APPLICATION-ATTACK-SESSION-FIXATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/REQUEST-949-BLOCKING-EVALUATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-950-DATA-LEAKAGES.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-951-DATA-LEAKAGES-SQL.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-952-DATA-LEAKAGES-JAVA.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-953-DATA-LEAKAGES-PHP.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-954-DATA-LEAKAGES-IIS.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-959-BLOCKING-EVALUATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-980-CORRELATION.conf
Include /etc/nginx/owasp-modsecurity-crs/rules/RESPONSE-999-EXCLUSION-RULES-AFTER-CRS.conf
We are using SecAuditLogType Concurrent, which the official docs recommend to avoid the performance impact.
But then we checked the Nginx ingress deployment resources on GKE (the left-side chart is CPU and the right one is memory; the blue line is the limit, the orange line is usage, and the yellow line is the request).
As you can see, CPU usage climbed slowly from 8 PM. After it peaked, the health check could not be answered within 5 seconds, so the pods were terminated; that is why CPU usage dropped dramatically. We were then unable to roll out the deployment normally, because the health check still failed even after we killed the broken pod. So we disabled ModSecurity, rolled out the deployment, and everything was alright again.
The pods couldn't respond to any requests even though CPU and memory usage hadn't reached half of the limits.
If you have any suggestions, please let me know! Thank you!
I have the same behavior. I've tried logging through the Concurrent and the Serial modes and, in both cases, the system hangs after running for a while. It does not seem to be closely related to the load nginx is receiving, as I have had issues both at the beginning of the day and during rush hour.
Below is a graph of the server CPU:
Not sure if this matters, but I'm running this on GCP, on an e2-standard-2 machine (2 CPUs and 8 GB of RAM).
Using nginx 1.20.2 and modsecurity-nginx 1.0.2. OWASP CoreRuleSet 3.3.2.