apcu_fetch(): Failed to acquire read lock on production testing
Hello,
We are trying to replace php-fpm with nginx unit, running inside Docker. After tuning both setups to the same resources and configuration, we get this error from the APCu library:
Noticed exception 'ErrorException' with message 'apcu_fetch(): Failed to acquire read lock' in /var/www/.../ApcuStore.php:23
The code there is very simple:
/**
* @param array $items
* @param int $ttl
*/
public function store(array $items, int $ttl): void
{
apcu_store($items, null, $ttl);
}
public function fetch($identifiers): array
{
return apcu_fetch($identifiers);
}
Any idea why this is happening? Do we need to change something with regard to concurrency locks now that PHP is running in embed mode?
Thank you in advance! R
Hi @razvanphp Do you have some sources to reproduce this error / issue? We will have a look at it.
What do you mean by sources? A complete reproducible setup/code? Not really, but the relevant PHP part is already pasted above.
Hi, I am a colleague of @razvanphp. Unfortunately, we can't provide the complete code to reproduce it, but I can tell you a bit more about the issue.
We have a product that has categories. Categories are shared across many products and to avoid fetching them from other sources - we store them in APCu.
A typical request coming to our service has 100 products with around 20 categories. If a category is not found in APCu, it is fetched from Redis and stored in APCu. There is no rocket science in how it works - it is just the fetch/store methods from APCu. I believe the problem lies somewhere in parallel execution, as another product with the same category may be requested by another thread at the same time. We also cannot really see the issue when there is no traffic, so it is only visible under some load (10k rpm is enough).
Unfortunately, there is not much I can say about the issue on top of that :(
Hi @arku31 - thanks for making that clear. We will try to create a piece of software that can reproduce it (fill the APCu cache with data and try to read from / write to it), but on top of that - could you also share the Unit debug log?
What version of Unit and what PHP version are you using? Are you using our Docker images? Just asking to be able to reproduce the issue.
However, based on what you described, I have a feeling it is about writing to the cache while another thread is trying to read a cache entry by a given key. That would explain why you only see this when the app is under more load. But we need to check this to be sure. Looking forward to getting more information from you.
Good night Timo
Hi, sorry for the late reply.
We were using PHP 8.0.18 and nginx unit 1.25.0, running in Docker on AWS behind a regular nginx. We used a custom Docker image, but the nginx unit parts were copied from this repo - https://github.com/nginx/unit/blob/1.25.0/pkg/docker/Dockerfile.php8.0
Hi, no problem. We were also busy with KubeCon EU last week, but this issue is on my list this week (along with some other PHP-related ones). I'll keep you posted, and thanks for sharing the information with us.
Hi, I am facing the same issue (apcu_fetch(): Failed to acquire read lock) when routing some of the production traffic (over ~500 req/s) to an app based on the Lumen framework with the APCu cache driver enabled.
Actually, it makes APCu unusable with PHP 8 for higher-performance apps.
PHP 8.0.25 (cli) (built: Oct 28 2022 18:38:41) ( NTS )
Copyright (c) The PHP Group
Zend Engine v4.0.25, Copyright (c) Zend Technologies
    with Zend OPcache v8.0.25, Copyright (c), by Zend Technologies
php.ini:
apcu
apc.coredump_unmap => Off => Off
apc.enable_cli => On => On
apc.enabled => On => On
apc.entries_hint => 4096 => 4096
apc.gc_ttl => 3600 => 3600
apc.mmap_file_mask => no value => no value
apc.preload_path => no value => no value
apc.serializer => php => php
apc.shm_segments => 1 => 1
apc.shm_size => 32M => 32M
apc.slam_defense => Off => Off
apc.smart => 0 => 0
apc.ttl => 0 => 0
apc.use_request_time => Off => Off
config.json:
"applications": { "lumen": { "type": "php", "root": "/var/www/public/", "script": "index.php", "limits": { "timeout": 92, "requests": 10000 }, "processes": { "max": 64, "spare": 64, "idle_timeout": 30 } } }
Any updates in this issue?
Hi @hbernaciak, we haven't looked into the APCu
features again; we are currently working on other features of the Unit project. But as you mentioned, this issue seems to still be present. We would need to create some sort of lab setup for this issue to track it down, but this is currently not on the priority list. Any code / configuration you can share to reproduce the issue would be appreciated.
I'll see if I can somehow reproduce this.
Well that was easy! :) :(
Great @ac000! If you still need anything, just let me know :)
Here's what I've found so far.
tl;dr
The issue seems to be caused by our usage of clone(2) (well actually we call the kernel system call directly and not the glibc wrapper).
Gory details
This all applies to Linux (maybe others...)
Some background
The APCu module uses a mmap(2)'d file as a shared memory cache between processes.
At the start of this memory is a header, and in that header is a lock structure for whatever locking mechanism is in use, which on Linux will most likely be POSIX read/write locks, pthread_rwlock_*, which use a pthread_rwlock_t structure to store state.
One of the members of the pthread_rwlock_t structure is __cur_writer; that will be important.
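For illustration, here is a minimal sketch of the kind of layout described above (the struct name and layout are made up for illustration, not APCu's actual definitions): a shared mapping whose header begins with a process-shared pthread_rwlock_t. Compile with cc -pthread.

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sys/mman.h>

/* Sketch of a cache header living at the start of the shared mapping. */
typedef struct {
    pthread_rwlock_t lock;   /* contains __readers, __cur_writer, ... */
    /* ... rest of the header and the cache data follow ... */
} cache_header_t;

int main(void)
{
    pthread_rwlockattr_t attr;
    cache_header_t *hdr;

    /* Shared, anonymous mapping, inherited by the worker processes. */
    hdr = mmap(NULL, sizeof(*hdr), PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (hdr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The lock must be process-shared to synchronise across processes. */
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(&hdr->lock, &attr);

    pthread_rwlock_rdlock(&hdr->lock);
    puts("read lock acquired");
    pthread_rwlock_unlock(&hdr->lock);

    return 0;
}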
Test setup
Latest(ish) unit (1.28.0-30-ga0327445) compiled from source, latest APCu (5.1.22) compiled from source
Unit configuration
{
    "listeners": {
        "[::]:8080": {
            "pass": "applications/php"
        }
    },
    "applications": {
        "php": {
            "type": "php",
            "root": "/var/tmp/unit-php",
            "processes": 2
        }
    }
}
PHP test
<?php
$item = "Hello World";
apcu_store('foo', $item, 600);
var_dump(apcu_fetch('foo'));
?>
Then hit it with ApacheBench
$ ab -n 200 -c 10 http://localhost:8080/apcu.php
The problem
We get errors like
'apcu_fetch(): Failed to acquire read lock'
due to glibc (well, I've only been debugging this on Fedora) detecting a potential deadlock situation and failing the lock acquisition, via checks like
/* Make sure we are not holding the rwlock as a writer. This is a deadlock
situation we recognize and report. */
if (__glibc_unlikely (atomic_load_relaxed (&rwlock->__data.__cur_writer)
== THREAD_GETMEM (THREAD_SELF, tid)))
return EDEADLK;
Which is kind of weird. There are no threads involved, and a process will acquire a read or write lock, do its thing, then release it.
Testing the same exact thing under php-fpm showed no issues.
... some time later ...
Having another look with gdb after we have acquired a lock (either one) showed the pthread_rwlock_t structure as follows (this is for unit)
(gdb) p *lock
$2 = {
__data = {
__readers = 3,
__writers = 0,
__wrphase_futex = 1,
__writers_futex = 1,
__pad3 = 0,
__pad4 = 0,
__cur_writer = 95104,
__shared = 1,
__rwelision = 0 '\000',
__pad1 = "\000\000\000\000\000\000",
__pad2 = 0,
__flags = 0
},
__size = "\003\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 11 times>, "\200s\001\000\001", '\000' <repeats 26 times>,
__align = 3
}
(gdb)
At first glance it all looks sane.
However, remember that __cur_writer member? That's supposed to store the PID of the process which currently holds the write lock.
Let's look at what processes we have
andrew 95104 11765 0 18:56 pts/3 00:00:00 unit: main v1.29.0 [/opt/uni
andrew 95106 95104 0 18:56 pts/3 00:00:00 unit: controller
andrew 95107 95104 0 18:56 pts/3 00:00:00 unit: router
andrew 95108 95104 0 18:56 pts/3 00:00:00 unit: "php" prototype
andrew 95109 95108 0 18:56 pts/3 00:00:00 unit: "php" application
...
Here __cur_writer == pid of the main unit process. Hmm, that's not right! (It should be one of the php application processes).
Just for a sanity check, I looked at the same thing with php-fpm
(gdb) p *lock
$3 = {
__data = {
__readers = 3,
__writers = 0,
__wrphase_futex = 1,
__writers_futex = 1,
__pad3 = 0,
__pad4 = 0,
__cur_writer = 95200,
__shared = 1,
__rwelision = 0 '\000',
__pad1 = "\000\000\000\000\000\000",
__pad2 = 0,
__flags = 0
},
__size = "\003\000\000\000\000\000\000\000\001\000\000\000\001", '\000' <repeats 11 times>, "\340s\001\000\001", '\000' <repeats 26 times>,
__align = 3
}
root 95198 1 2 19:02 ? 00:00:00 php-fpm: master process (/et
apache 95200 95198 0 19:02 ? 00:00:00 php-fpm: pool www
...
Here __cur_writer == pid of the php-fpm worker process. Good, as expected.
I started suspecting it could have something to do with how we create the processes, using clone rather than fork(2).
Indeed, forcing unit to use fork(2) (which it will do if there is no clone available) made the problem go away. I also replicated this issue with a simple test program to verify.
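For reference, a standalone reproducer along these lines looks roughly like the sketch below (this is an illustration of the idea, not the actual test program; raw_clone() is a hypothetical helper): a process-shared rwlock in a shared mapping, children created via a raw SYS_clone with only SIGCHLD (as unit does), one holding the write lock while the other attempts a read lock. Swap raw_clone() for fork() and the read lock simply blocks until the writer unlocks instead of failing.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>

/* Bare clone(2) with only SIGCHLD, i.e. no CLONE_CHILD_SETTID, so glibc's
 * cached thread id is NOT refreshed in the child. */
static pid_t raw_clone(void)
{
    return (pid_t)syscall(SYS_clone, SIGCHLD, NULL, NULL, NULL, NULL);
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_rwlock_t *lock;

    lock = mmap(NULL, sizeof(*lock), PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (lock == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_rwlock_init(lock, &attr);

    if (raw_clone() == 0) {             /* writer child */
        pthread_rwlock_wrlock(lock);    /* stores the stale (parent) tid */
        sleep(2);
        pthread_rwlock_unlock(lock);
        _exit(0);
    }

    sleep(1);                           /* let the writer grab the lock */

    if (raw_clone() == 0) {             /* reader child */
        int err = pthread_rwlock_rdlock(lock);

        /* With raw clone this tends to fail with EDEADLK ("Resource
         * deadlock avoided"); with fork(2) it simply blocks until the
         * writer unlocks. */
        printf("rdlock: %s\n", err ? strerror(err) : "ok");
        fflush(stdout);
        _exit(0);
    }

    wait(NULL);
    wait(NULL);

    return 0;
}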
Solution
So to recap: when glibc checks the current thread id against what's stored in __cur_writer, they both seem to be set to the PID of the main unit process, and so glibc thinks the process is trying to acquire a lock (either one) while it already holds the write lock, but it doesn't really.
At the moment I'm not entirely sure how to fix this.
Currently our processes are created something like
clone(child_stack=NULL, flags=SIGCHLD) = 5754
whereas if you use fork(2) which is implemented with clone, you get
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f317e668a10) = 5857
I believe the key bits here are the CLONE_CHILD_SETTID flag and the setting of child_tidptr, which in glibc's case points into some core thread memory area; I assume this updates THREAD_SELF->tid as used in the above glibc check.
More investigation required.
@tippexs @ac000 is there anybody on your team who can push this forward?
@hbernaciak
It's a known issue with glibc (musl libc seems to be OK, and perhaps glibc < 2.25), but the glibc developers don't seem to be in a hurry to fix it; in fact, there seems to have been some disagreement about whether they even should.
At the moment it's looking like the way forward would be to drop our use of SYS_clone. We use it so that we can make use of namespaces(7), but we can achieve the same with a combination of fork(2)/unshare(2).
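To illustrate the general idea (a rough sketch of the approach, not the actual implementation): fork(2) goes through glibc, so the child's cached thread id is updated correctly, and the child can then detach whatever namespaces it needs via unshare(2).

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == -1) {
        perror("fork");
        return 1;
    }

    if (pid == 0) {
        /* Example namespaces only; the real set would depend on the
         * application's isolation configuration, and most of these
         * require CAP_SYS_ADMIN (or a new user namespace). */
        if (unshare(CLONE_NEWUTS | CLONE_NEWNS) == -1) {
            perror("unshare");
            _exit(1);
        }

        /* ... continue as the application process ... */
        _exit(0);
    }

    waitpid(pid, NULL, 0);

    return 0;
}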
I will keep this issue updated along the way.
@razvanphp @arku31 @hbernaciak
Hi folks,
I have put up a branch that should resolve this issue (my simple test case now works 100%); you can grab it here.
If anyone is able to test this, that would be good (feel free to ping me here if you need help setting it up). Report success or failure. If it fails, please let me know your Linux distribution and version (this only affects Linux) and your libc version - whether that's glibc, musl libc or something else - if you know it; no problem if not.
Cheers, Andrew
Updated branch with improved version.