Disappearing semaphore array

Open viraptor opened this issue 5 years ago • 2 comments

I've run into an issue with a semaphore array disappearing from the system while the app is running. I can't find any details in the logs about what would cause it, but it started happening at about the time we moved from Ubuntu 14.04 to 18.04. I couldn't see any other changes that would be related here.

The system is running Ruby 2.5.5. The exception we get is:

Semian::SyscallError: semop() failed, errno: 22 (Invalid argument)

File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 50 in acquire_bulkhead
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 24 in block in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 38 in block in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 141 in maybe_with_half_open_resource_timeout
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/circuit_breaker.rb line 30 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 37 in acquire_circuit_breaker
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/protected_resource.rb line 23 in acquire
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/adapter.rb line 34 in acquire_semian_resource
File .../vendor/bundle/ruby/2.5.0/gems/semian-0.8.8/lib/semian/net_http.rb line 83 in connect

This is with the latest released semian.

The issue starts occurring a number of hours after the deployment, without any obvious pattern of traffic.

I tracked the call down to:

10574.300 ( 0.015 ms): ruby/21041 semtimedop(semid: 131072, tsops: 0x7ffff92379c2, nsops: 1, timeout: 0x7ffff9237aa8) = -1 EINVAL Invalid argument

where the semid: 131072 doesn't exist on the system (normally we have 2 semaphore arrays, but this system had only 1). This was validated using ipcs -s.
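The same check can be scripted. Below is a minimal sketch, assuming the usual Linux `ipcs -s` table layout in which each data row starts with a hex key followed by the semid; the helper names `semids_from_ipcs` and `semid_present?` are made up for illustration and are not part of semian.

```ruby
# Return the semaphore array IDs listed in `ipcs -s` style output.
# Assumption: data rows look like "0x<hexkey> <semid> <owner> <perms> <nsems>".
def semids_from_ipcs(output)
  output.each_line.map do |line|
    fields = line.split
    next unless fields.length >= 2
    next unless fields[0].match?(/\A0x\h+\z/) && fields[1].match?(/\A\d+\z/)
    fields[1].to_i
  end.compact
end

# True if the given semid appears in the supplied `ipcs -s` output.
def semid_present?(semid, ipcs_output)
  semids_from_ipcs(ipcs_output).include?(semid)
end

# Example against the live system (not run here):
#   semid_present?(131072, `ipcs -s`)
```

Header and separator lines are skipped because they do not start with a hex key, so the helper only sees the rows that describe actual arrays.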

Please let me know if there's any more debugging information I can provide.

viraptor avatar May 27 '19 08:05 viraptor

It turns out there was a bit of misunderstanding of what happened. Additional details:

The Rails app that was affected is normally configured for 6 tickets and runs with 9 workers. The healthy status looks like this:

Semaphore Array semid=196608
uid=1001	 gid=1001	 cuid=1001	 cgid=1001
mode=0660, access_perms=0660
nsems = 4
otime = Mon May 27 18:25:46 2019
ctime = Mon May 27 18:25:46 2019
semnum     value      ncount     zcount     pid
0          1          0          0          14442
1          6          0          0          14442
2          6          0          0          14442
3          1          0          0          14442

For the affected app, the semid was incorrect, and the semaphore array actually present on the instance looked like this:

Semaphore Array semid=196608
uid=1001     gid=1001     cuid=1001     cgid=1001
mode=0660, access_perms=0660
nsems = 4
otime = Mon May 27 12:05:28 2019
ctime = Mon May 27 12:00:03 2019
semnum     value      ncount     zcount     pid
0          1          0          0          1225
1          1          0          0          1225
2          1          0          0          1225
3          0          0          0          1225

PID 1225 no longer existed on the system.

viraptor avatar May 27 '19 08:05 viraptor

After a bit of investigation, it turned out this was a side effect of switching to systemd, where logind.conf has RemoveIPC=yes as the default value. The user had a UID > 1000 and performed some operations via su to the required user before running the app. When that user logged out, systemd would wipe out the semaphores that semian was relying on, causing some unexpected behaviours.
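For anyone wanting to check a host for this, here is a rough sketch that reads the text of a logind.conf and reports whether RemoveIPC is effectively on. It assumes the documented systemd default of "yes" when the key is unset or commented out, and it ignores drop-in directories and section headers; `remove_ipc_enabled?` is a hypothetical helper name.

```ruby
# Decide whether RemoveIPC is effectively enabled for the given logind.conf text.
# Assumptions: systemd's default is "yes"; drop-in files are not considered.
def remove_ipc_enabled?(logind_conf_text)
  value = "yes"
  logind_conf_text.each_line do |line|
    line = line.strip
    # Skip blank lines and comments (systemd accepts both # and ;).
    next if line.empty? || line.start_with?("#", ";")
    key, val = line.split("=", 2)
    value = val.strip if val && key.strip == "RemoveIPC"
  end
  # systemd treats 1, yes, true and on as true.
  %w[1 yes true on].include?(value.downcase)
end

# Example against the real file (not run here):
#   remove_ipc_enabled?(File.read("/etc/systemd/logind.conf"))
```

If this returns true on a host where the app runs as a regular (non-system) user, setting RemoveIPC=no in logind.conf and restarting systemd-logind is the usual way to stop the cleanup.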

There are probably some safeguards we can put in place so that semian handles it better when the semaphores are pulled out from under the running processes. I'll open a PR if we can find anything worthwhile doing there.
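One possible shape for such a safeguard, purely as a sketch and not something semian does today: rescue the syscall error, and if it is the errno 22 case from the trace above, run a caller-supplied recovery step and retry once. The helper `with_semaphore_recovery` and its parameters are hypothetical, and how to actually recreate the semaphore array is left to the `recover` callable, since that part is still an open question.

```ruby
# Run the block; if it fails with the "errno: 22" error seen in this issue,
# call `recover` and retry, up to max_attempts total attempts.
# Assumption: error_class would be Semian::SyscallError in a real app.
def with_semaphore_recovery(error_class:, recover:, max_attempts: 2)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue error_class => e
    # Anything other than the stale-semaphore case is re-raised untouched.
    raise unless e.message.include?("errno: 22") && attempts < max_attempts
    recover.call
    retry
  end
end
```

Matching on the message text is crude; it is used here only because the message is the one detail of the error that the trace above shows.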

jacobbednarz avatar Jun 04 '19 06:06 jacobbednarz