incubator-pagespeed-ngx icon indicating copy to clipboard operation
incubator-pagespeed-ngx copied to clipboard

Mod_pagespeed deadlock?

Open lanki567 opened this issue 6 years ago • 16 comments
trafficstars

Our apache 2.4 is configured to use latest stable version of mod_pagespeed but unfortunately after sometime apache doesn't serve the response for all the requests configured to use pagespeed. As per the Mod_forensic log, Response was never sent for these requests. During this period, all url's configured with "ModPagespeedDisallow" works as expected. Even the url's configured to use pagespeed works as expected when the header/request parameter "ModPagespeed=off" is set. Thread dump shows several threads stuck at the address "0x00007f1417d94253". We are consistently able to reproduce this on all our RHEL 7.6 servers(kernel version is 3.10.0-957.21.3.el7.x86_64. )

addr2line -e mod_pagespeed_ap24.so 0x00007f1417d94253 ??:0 #0 0x00007f1427f42d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f1417d94253 in ?? () from /apache/modules/mod_pagespeed_ap24.so #2 0x00007f1417bfccfc in ?? () from /apache/modules/mod_pagespeed_ap24.so #3 0x00007f1417bfcd8e in ?? () from /apache/modules/mod_pagespeed_ap24.so #4 0x00007f1417acda2e in ?? () from /apache/modules/mod_pagespeed_ap24.so #5 0x00007f1417aa44e8 in ?? () from /apache/modules/mod_pagespeed_ap24.so #6 0x00007f1417aa626f in ?? () from /apache/modules/mod_pagespeed_ap24.so #7 0x00007f1426f24194 in filter_harness (f=0x7f12580c6e48, bb=0x7f12580617e0) at mod_filter.c:323 #8 0x00007f1418621d4d in ajp_process_callback (msg=<optimized out>, pmsg=0x292bfc8, ae=0x292bf50, r=0x7f12d47f2a60, l=0x2727470) at jk_ajp_common.c:2146 #9 0x00007f14186243ca in ajp_get_reply (e=<optimized out>, s=0x7f12d47f2a60, l=0x2727470, p=0x292bf50, op=0x7f12d47f1830) at jk_ajp_common.c:2310 #10 0x00007f1418628cb7 in ajp_service (e=<optimized out>, s=0x7f12d47f2a60, l=<optimized out>, is_error=<optimized out>) at jk_ajp_common.c:2678 #11 0x00007f1418611b4f in service (e=<optimized out>, s=<optimized out>, l=<optimized out>, is_error=0x7f12d47f2cbc) at jk_lb_worker.c:1418 #12 0x00007f14185fe121 in jk_handler (r=<optimized out>) at mod_jk.c:2896 #13 0x0000000000453960 in ap_run_handler (r=r@entry=0x7f1258069ce0) at config.c:170 #14 0x0000000000453ea9 in ap_invoke_handler (r=r@entry=0x7f1258069ce0) at config.c:444 #15 0x00000000004697ba in ap_process_async_request (r=0x7f1258069ce0) at http_request.c:453 #16 0x0000000000469a7e in ap_process_request (r=r@entry=0x7f1258069ce0) at http_request.c:488 #17 0x0000000000465c45 in ap_process_http_sync_connection (c=0x7f12ac021098) at http_core.c:210 #18 ap_process_http_connection (c=0x7f12ac021098) at http_core.c:251 #19 0x000000000045d480 in ap_run_process_connection (c=c@entry=0x7f12ac021098) at connection.c:42 #20 0x000000000045d9a8 in ap_process_connection (c=c@entry=0x7f12ac021098, csd=csd@entry=0x7f12ac020e80) at connection.c:219 #21 0x00000000004706f2 in process_socket (bucket_alloc=0x7f125801ef08, my_thread_num=13, my_child_num=7, sock=0x7f12ac020e80, p=0x7f12ac020df8, thd=0x2f481e8) at worker.c:479 #22 worker_thread (thd=0x2f481e8, dummy=<optimized out>) at worker.c:808 #23 0x00007f1427f3edd5 in start_thread () from /lib64/libpthread.so.0 #24 0x00007f1427a6402d in clone () from /lib64/libc.so.6

lanki567 avatar Sep 11 '19 13:09 lanki567

would you be willing to try and reproduce this with a debug build? Hopefully that gives us a backtrace with names instead of addresses. (and then it might be interesting to see backtraces of all the threads)

oschaaf avatar Sep 11 '19 21:09 oschaaf

I will check internally if we are allowed to install the apache 2.4 debug build and update here. Does the same pagespeed rpm work or do we need to rebuild pagespeed from source?

lanki567 avatar Sep 13 '19 18:09 lanki567

Hi @lanki567 I think you have to rebuild it. RPM are build as Release branch and you need a Debug branch. If you see the docs for rebuild pagespeed you will see a parameter BUILDTYPE set to Release, you ust change it to Debug

Lofesa avatar Sep 14 '19 18:09 Lofesa

Hi @Lofesa

We don't possess skillset on c++ or python. it is taking forever to fix the build issues. All branches are currently broken due to missing git submodules hosted on apache infra. Is there any debug build RPM available for centos 7.x?

lanki567 avatar Sep 24 '19 05:09 lanki567

I don´t know if there is. But as stated in apache infra maybe the solution take too much time so maybe pagespeed module must change their deps to the github himself. @oschaaf and @jmarantz what do you think?

Lofesa avatar Sep 24 '19 08:09 Lofesa

@lanki567 you could try this to work around git.apache.org being down:

git config --global url.https://github.com/apache/.insteadOf https://git.apache.org/

@Lofesa Updating the submodules to point them to github sounds lika a good idea to me, but what makes me hesitate is that I'm not sure if it's according to ASF policy to do so. Maybe ask around on the dev list about that?

oschaaf avatar Sep 24 '19 09:09 oschaaf

@oschaaf As stated in this doc we should use the github url. "....There will no longer be a git.apache.org URL for git mirrors, to lessen the confusion with gitbox.apache.org. Projects wishing to point to a git copy of their subversion repository should use their respective GitHub URLs. "

Lofesa avatar Sep 25 '19 08:09 Lofesa

Oh wow, I did not see that coming; but yes, we should run replace git.apache.org with github.com/apache/ then

oschaaf avatar Sep 25 '19 08:09 oschaaf

We have executed the below command on the latest stable branch but the build fails for the BUILDTYPE=Debug however build is successful with BUILDTYPE=Release. Any help is much appreciated. Thanks in advance

Release build type generated binary for the following commands on RHEL 7

  • make BUILDTYPE=Release mod_pagespeed_test pagespeed_automatic_test
  • ./out/Release/mod_pagespeed_test
  • ./out/Release/pagespeed_automatic_test
  • make BUILDTYPE=Release

Debug build fails for the following command on RHEL 7

Please find the log below. Let me know if you need the complete log make -d BUILDTYPE=Debug mod_pagespeed_test pagespeed_automatic_test

Partial Log Rejecting impossible implicit prerequisitepagespeed/kernel/http/bot_checker.gperf.s'. Trying pattern rule with stem bot_checker.gperf'. Rejecting impossible implicit prerequisite pagespeed/kernel/http/bot_checker.gperf.S'. Trying pattern rule with stem bot_checker.gperf'. Rejecting impossible implicit prerequisite pagespeed/kernel/http/bot_checker.gperf.mod'. Trying pattern rule with stem bot_checker.gperf'. Trying implicit prerequisite pagespeed/kernel/http/bot_checker.gperf.sh'. Looking for a rule with intermediate file pagespeed/kernel/http/bot_checker.gperf.sh'. Avoiding implicit rule recursion. No implicit rule found for pagespeed/kernel/http/bot_checker.gperf'. Finished prerequisites of target file pagespeed/kernel/http/bot_checker.gperf'. No need to remake target pagespeed/kernel/http/bot_checker.gperf'. Pruning file FORCE_DO_CMD'. Finished prerequisites of target file out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Prerequisite pagespeed/kernel/http/bot_checker.gperf' is older than target out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Prerequisite FORCE_DO_CMD' of target out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc' does not exist. Must remake target out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Invoking recipe from pagespeed/pagespeed_http_gperf.target.mk:10 to update target out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Successfully remade target file out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Pruning file FORCE_DO_CMD'. Pruning file out/Debug/obj/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.cc'. Finished prerequisites of target file out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o'. Must remake target out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o'. Invoking recipe from pagespeed/pagespeed_http_gperf.target.mk:327 to update target out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o'. Putting child 0x2a9b070 (out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o) PID 11913 on the chain. Live child 0x2a9b070 (out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o) PID 11913 CXX(target) out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o Reaping winning child 0x2a9b070 PID 11913 Live child 0x2a9b070 (out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o) PID 11914 Reaping winning child 0x2a9b070 PID 11914 Live child 0x2a9b070 (out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o) PID 11915 In file included from /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/os_defines.h:39:0, from /usr/include/c++/4.8.2/x86_64-redhat-linux/bits/c++config.h:2097, from /usr/include/c++/4.8.2/cctype:41, from ./pagespeed/kernel/base/string_util.h:22, from ./pagespeed/kernel/http/bot_checker.h:22, from ../pagespeed/kernel/http/bot_checker.gperf:5: /usr/include/features.h:330:4: error: #warning _FORTIFY_SOURCE requires compiling with optimization (-O) [-Werror=cpp] warning _FORTIFY_SOURCE requires compiling with optimization (-O) ^ cc1plus: all warnings being treated as errors Reaping losing child 0x2a9b070 PID 11915 make: *** [out/Debug/obj.target/pagespeed_http_gperf/gen/gperf_out/instaweb/kernel/http/bot_checker.gp.o] Error 1 Removing child 0x2a9b070 PID 11915 from chain. `

lanki567 avatar Oct 02 '19 06:10 lanki567

Below is the stacktrace of 2 threads with the Release module compiled from latest stable build branch. We see 100's of threads with the below stacktrace `Thread 24 (Thread 0x7f3c425d6700 (LWP 12478)): #0 0x00007f3da8e984ed in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f3da8e93dcb in _L_lock_883 () from /lib64/libpthread.so.0 #2 0x00007f3da8e93c98 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f3d98e11afb in net_instaweb::SharedMemCache<64ul>::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #4 0x00007f3d98e05977 in net_instaweb::CacheStats::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #5 0x00007f3d98e05977 in net_instaweb::CacheStats::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #6 0x00007f3d98bb1888 in net_instaweb::CachePropertyStore::Get(std::string const&, std::string const&, std::string const&, std::vector<net_instaweb::PropertyCache::Cohort const*, std::allocator<net_instaweb::PropertyCache::Cohort const*> > const&, net_instaweb::PropertyPage*, net_instaweb::Callback1, net_instaweb::AbstractPropertyStoreGetCallback**) () from /apache/config/modules/mod_pagespeed_ap24.so #7 0x00007f3d98bb53b4 in net_instaweb::PropertyPage::Read(std::vector<net_instaweb::PropertyCache::Cohort const, std::allocator<net_instaweb::PropertyCache::Cohort const*> > const&) () from /apache/config/modules/mod_pagespeed_ap24.so #8 0x00007f3d989e0606 in net_instaweb::InstawebContext::BlockingPropertyCacheLookup() () from /apache/config/modules/mod_pagespeed_ap24.so #9 0x00007f3d989e0eb5 in net_instaweb::InstawebContext::InstawebContext(request_rec*, net_instaweb::RequestHeaders*, net_instaweb::ContentType const&, net_instaweb::ApacheServerContext*, std::string const&, net_instaweb::RefCountedPtr<net_instaweb::RequestContext> const&, net_instaweb::QueryParams const&, net_instaweb::QueryParams const&, bool, net_instaweb::RewriteOptions const&) () from /apache/config/modules/mod_pagespeed_ap24.so #10 0x00007f3d989e8169 in net_instaweb::(anonymous namespace)::build_context_for_request(request_rec*) () from /apache/config/modules/mod_pagespeed_ap24.so #11 0x00007f3d989e8755 in net_instaweb::(anonymous namespace)::instaweb_out_filter(ap_filter_t*, apr_bucket_brigade*) () from /apache/config/modules/mod_pagespeed_ap24.so #12 0x00007f3da7e77194 in filter_harness (f=0x7f3bf8071fa0, bb=0x7f3bf80723a0) at mod_filter.c:323 #13 0x00000000004421f7 in default_handler (r=0x7f3bf8078020) at core.c:4836 #14 0x0000000000453960 in ap_run_handler (r=r@entry=0x7f3bf8078020) at config.c:170 #15 0x0000000000453ea9 in ap_invoke_handler (r=r@entry=0x7f3bf8078020) at config.c:444 #16 0x00000000004697ba in ap_process_async_request (r=0x7f3bf8078020) at http_request.c:453 #17 0x0000000000469a7e in ap_process_request (r=r@entry=0x7f3bf8078020) at http_request.c:488 #18 0x0000000000465c45 in ap_process_http_sync_connection (c=0x7f3c3005b2c8) at http_core.c:210 #19 ap_process_http_connection (c=0x7f3c3005b2c8) at http_core.c:251 #20 0x000000000045d480 in ap_run_process_connection (c=c@entry=0x7f3c3005b2c8) at connection.c:42 #21 0x000000000045d9a8 in ap_process_connection (c=c@entry=0x7f3c3005b2c8, csd=csd@entry=0x7f3c3005b0b0) at connection.c:219 #22 0x00000000004706f2 in process_socket (bucket_alloc=0x7f3bf806aa08, my_thread_num=44, my_child_num=3, sock=0x7f3c3005b0b0, p=0x7f3c3005b038, thd=0xc47928) at worker.c:479 #23 worker_thread (thd=0xc47928, dummy=) at worker.c:808 #24 0x00007f3da8e91dd5 in start_thread () from /lib64/libpthread.so.0 #25 0x00007f3da89b702d in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f3c42fd7700 (LWP 12477)): #0 0x00007f3da8e984ed in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f3da8e93dcb in _L_lock_883 () from /lib64/libpthread.so.0 #2 0x00007f3da8e93c98 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007f3d98e11afb in net_instaweb::SharedMemCache<64ul>::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #4 0x00007f3d98e05977 in net_instaweb::CacheStats::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #5 0x00007f3d98e05977 in net_instaweb::CacheStats::Get(std::string const&, net_instaweb::CacheInterface::Callback*) () from /apache/config/modules/mod_pagespeed_ap24.so #6 0x00007f3d98bb1888 in net_instaweb::CachePropertyStore::Get(std::string const&, std::string const&, std::string const&, std::vector<net_instaweb::PropertyCache::Cohort const*, std::allocator<net_instaweb::PropertyCache::Cohort const*> > const&, net_instaweb::PropertyPage*, net_instaweb::Callback1, net_instaweb::AbstractPropertyStoreGetCallback**) () from /apache/config/modules/mod_pagespeed_ap24.so #7 0x00007f3d98bb53b4 in net_instaweb::PropertyPage::Read(std::vector<net_instaweb::PropertyCache::Cohort const, std::allocator<net_instaweb::PropertyCache::Cohort const*> > const&) () from /apache/config/modules/mod_pagespeed_ap24.so #8 0x00007f3d989e0606 in net_instaweb::InstawebContext::BlockingPropertyCacheLookup() () from /apache/config/modules/mod_pagespeed_ap24.so #9 0x00007f3d989e0eb5 in net_instaweb::InstawebContext::InstawebContext(request_rec*, net_instaweb::RequestHeaders*, net_instaweb::ContentType const&, net_instaweb::ApacheServerContext*, std::string const&, net_instaweb::RefCountedPtr<net_instaweb::RequestContext> const&, net_instaweb::QueryParams const&, net_instaweb::QueryParams const&, bool, net_instaweb::RewriteOptions const&) () from /apache/config/modules/mod_pagespeed_ap24.so #10 0x00007f3d989e8169 in net_instaweb::(anonymous namespace)::build_context_for_request(request_rec*) () from /apache/config/modules/mod_pagespeed_ap24.so #11 0x00007f3d989e8755 in net_instaweb::(anonymous namespace)::instaweb_out_filter(ap_filter_t*, apr_bucket_brigade*) () from /apache/config/modules/mod_pagespeed_ap24.so #12 0x00007f3da7e77194 in filter_harness (f=0x7f3bec071f88, bb=0x7f3bec072388) at mod_filter.c:323 #13 0x00000000004421f7 in default_handler (r=0x7f3bec078020) at core.c:4836 #14 0x0000000000453960 in ap_run_handler (r=r@entry=0x7f3bec078020) at config.c:170 #15 0x0000000000453ea9 in ap_invoke_handler (r=r@entry=0x7f3bec078020) at config.c:444 #16 0x00000000004697ba in ap_process_async_request (r=0x7f3bec078020) at http_request.c:453 #17 0x0000000000469a7e in ap_process_request (r=r@entry=0x7f3bec078020) at http_request.c:488 #18 0x0000000000465c45 in ap_process_http_sync_connection (c=0x7f3c300591e8) at http_core.c:210 #19 ap_process_http_connection (c=0x7f3c300591e8) at http_core.c:251 #20 0x000000000045d480 in ap_run_process_connection (c=c@entry=0x7f3c300591e8) at connection.c:42 #21 0x000000000045d9a8 in ap_process_connection (c=c@entry=0x7f3c300591e8, csd=csd@entry=0x7f3c30058fd0) at connection.c:219 #22 0x00000000004706f2 in process_socket (bucket_alloc=0x7f3bec06aa08, my_thread_num=43, my_child_num=3, sock=0x7f3c30058fd0, p=0x7f3c30058f58, thd=0xc478f8) at worker.c:479 #23 worker_thread (thd=0xc478f8, dummy=) at worker.c:808 #24 0x00007f3da8e91dd5 in start_thread () from /lib64/libpthread.so.0 #25 0x00007f3da89b702d in clone () from /lib64/libc.so.6`

lanki567 avatar Oct 03 '19 14:10 lanki567

Thanks so much for hanging in there and persisting. These backtraces look very very helpful. One last question; could you, while you're at it, post backtraces from all the threads somewhere for reference later on?

oschaaf avatar Oct 03 '19 14:10 oschaaf

Unfortunately, Policies here doesn't let me upload all the logs. Are you specifically looking for anything? I can paste those logs here.

Is there any workaround/property to disable this specific functionality?

lanki567 avatar Oct 03 '19 19:10 lanki567

Well, I was hoping to inspect what all the threads are at when the hang occurs. The two threads you posted both seem to be waiting to acquire a lock, but there's probably (hopefully!) also another one out there that holds it.

oschaaf avatar Oct 03 '19 19:10 oschaaf

Thank you for the backtraces. They reveal this is happening in the shared-memory cache, where multiple processes use a cross-process mutex to access shared memory.

I have always been concerned that a process crash, or a SIGKILL while holding onto a shared-memory mutex could make it impossible to release the lock. However the shared-memory cache has been running for probably 7 years now and I've never heard of this actually occurring.

If this is just a once in a decade fluke, it might be a reasonable workaround to simply reboot the machine to clear the shm-locks and hopefully you will not be unlucky twice.

If, after reboot, the problem re-appears, a workaround is to disable the shared-memory cache, and instead use per-process in-memory LRU caches as your L1s. See https://www.modpagespeed.com/doc/system for details.

Hope this helps!

jmarantz avatar Oct 03 '19 20:10 jmarantz

I just wanted to provide an update. Into 6th day, We didn't see the issue yet after disabling shared memory cache. Thanks for the workaround

We are now little bit concerned about performance. Should be consider memcache/redis as alternative solution?

lanki567 avatar Oct 09 '19 15:10 lanki567

Yes, actually memcache/redis would be good to use as an L2 independent of whether you are using the shared-memory cache or the per-child-process cache as an L1.

Before resorting to the per-child-process L1, did you try to reboot the machine to clear out the shm-locks? Did that not work?

On Wed, Oct 9, 2019 at 11:02 AM lanki567 [email protected] wrote:

I just wanted to provide an update. Into 6th day, We didn't see the issue yet after disabling shared memory cache. Thanks for the workaround

We are now little bit concerned about performance. Should be consider memcache/redis as alternative solution?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/incubator-pagespeed-ngx/issues/1662?email_source=notifications&email_token=AAO2IPOUMGDSR4H6335QBX3QNXXAPA5CNFSM4IVUCFE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAYGIEA#issuecomment-540042256, or mute the thread https://github.com/notifications/unsubscribe-auth/AAO2IPNTRXWTXY5EHZMPXY3QNXXAPANCNFSM4IVUCFEQ .

jmarantz avatar Oct 09 '19 15:10 jmarantz