posix_ipc icon indicating copy to clipboard operation
posix_ipc copied to clipboard

Context manager cleanup

Open rogerdahl opened this issue 5 years ago • 19 comments

On Linux, this script leaves an orphaned semaphore showing with ls -l /dev/shm. If the semaphore was created by the context manager, it seems like it would be safe to unlink it when exiting the context manager scope. Is that something that can be added to the library?

For now, I'm working around it by wrapping everything in a finally clause that performs the cleanup.

#!/usr/bin/env python

import posix_ipc

with posix_ipc.Semaphore(name='myname', flags=posix_ipc.O_CREAT, initial_value=1):
    print(1)

rogerdahl avatar Jun 03 '19 19:06 rogerdahl

Hi, thanks for your feature suggestion! I can see how it would be helpful in your particular case.

However, I'm not to inclined to add this for a few reasons, more or less in this order --

  • Adding this behavior would make posix_ipc backwards-incompatible with previous versions
  • Your excellent suggestion to unlink the semaphore in the finally block addresses the need elegantly
  • It would be the only place in posix_ipc where something is automatically unlinked
  • Creating a new semaphore via a with statement feels like an edge case to me (but I admit that I don't have any way to test that one way or another)

Do those sound like reasonable objections to you?

osvenskan avatar Jun 08 '19 20:06 osvenskan

Thank you for considering the suggestion.

Creating a new semaphore in the context manager is a good way to create a critical section, so it might not be that much that of an edge case.

Breaking backwards compatibility is the most troubleing issue. I think the expectation of most Python programmers is that a context manager will fully clean up the associated resource. So I think leaving the semaphore around if it was created by the context manager breaks with the principle of least astonishment. I think most existing code that expects the semaphore to stay around will be using the context manager to acquire an existing semaphore, but I agree that there may well be software out there that depends on the current behavior.

If it was me, I think I'd still change the behavior, but maybe bump the major version. It took a while for me to track down the bug that this caused, but I have the fix in place, so if it stays the way it is, that works for me as well :)

rogerdahl avatar Jun 09 '19 06:06 rogerdahl

If I was designing this from the ground up I might convince myself that the unlinking behavior is the best idea, I'm not sure. But I don't want to break backwards compatibility for a edge case behavior that I feel iffy about.

I'd like to leave this ticket open as a reminder to think about documenting this behavior for the next release. Does that work for you?

osvenskan avatar Jun 11 '19 00:06 osvenskan

I'm good with that decision. Yes, feel free to leave the ticket open. A description of the issue together with some example code that unlinks in finally might be the best way to go.

I did run into an issue with my approach of unlink in finally. On rare occations, the unlink would raise a "does not exist" exception, and I had to wrap it in a try. It didn't fit with my understanding of how it should behave under Linux. I'll add the code here later for your reference.

I think this is all a bit hacky because it's a workaround for a missing feature in POSIX. System V apparently had a flag one could set at create time to request that the semaphore be automatically cleaned up when no longer acqured.

Reading a bit about this, I found that a common recommendation is to use lock files instead.

rogerdahl avatar Jun 11 '19 21:06 rogerdahl

Here's the code where I had to wrap the unlink().

I use this context manager to serialize access to a filesystem paths that may or may not exist (yet).

@contextlib.contextmanager
def path_lock(path):
    path = str(path)
    logger.debug("Waiting for lock on path: {}".format(path))
    sem = posix_ipc.Semaphore(
        name="/{}".format(hashlib.md5(path.encode("utf-8")).hexdigest()),
        flags=posix_ipc.O_CREAT,
        initial_value=1,
    )
    # posix_ipc.Semaphore() can be used as a context manager, and ``with`` uses implicit
    # ``try/finally`` blocks. However, as a context manager, posix_ipc.Semaphore() does
    # not ``close()`` and ``unlink()``.
    try:
        sem.acquire()
        logger.debug("Acquired lock on path: {}".format(path))
        yield
    finally:
        sem.release()
        logger.debug("Released lock on path: {}".format(path))
        try:
            sem.unlink()
        except posix_ipc.ExistentialError:
            pass

rogerdahl avatar Jun 11 '19 22:06 rogerdahl

That looks like a great solution.

I'm also puzzled why sem.unlink() might fail in this code.

For tidiness (and depending on how this context manager is invoked), you might want to add a call to sem.close() --

finally:
    sem.release()
    logger.debug("Released lock on path: {}".format(path))
    sem.unlink()
    sem.close()

Since your process isn't explicitly closing the semaphore, the kernel retains a reference to it even though it's been unlinked. This can lead to surprising behavior. Consider --

>>> import posix_ipc
>>> name = 'foo'
>>> sem1 = posix_ipc.Semaphore(name, posix_ipc.O_CREAT)
>>> sem1.release()
>>> sem1.value
1
>>> sem1.unlink()
>>> sem1.release()
>>> sem1.value
2
>>> sem2 = posix_ipc.Semaphore(name, posix_ipc.O_CREX)
>>> sem2.value
0
>>> sem1.value
2
>>> sem1.name == sem2.name
True
>>> sem1.value == sem2.value
False

Note that sem2 is created with O_EXCL bit set, and the create still succeeds even though sem1 is still hanging around.

With a call to close() added to the steps above, sem1 disappears entirely from the system --

>>> import posix_ipc
>>> name = 'foo'
>>> sem1 = posix_ipc.Semaphore(name, posix_ipc.O_CREAT)
>>> sem1.release()
>>> sem1.value
1
>>> sem1.unlink()
>>> sem1.close()
>>> sem1.release()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
posix_ipc.ExistentialError: The semaphore has been closed
>>> sem1.value
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
posix_ipc.ExistentialError: The semaphore has been closed
>>> sem2 = posix_ipc.Semaphore(name)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
posix_ipc.ExistentialError: No semaphore exists with the specified name
>>> 

osvenskan avatar Jun 13 '19 02:06 osvenskan

Thanks, that's indeed suprising. I've added the close(). Wonder if that will help with the "not found" exceptions.

rogerdahl avatar Jun 13 '19 02:06 rogerdahl

Just FYI, adding close() did not help with the posix_ipc.ExistentialError: No semaphore exists with the specified name error. It's triggered by the unlink() in this finally block:


    finally:
        sem.release()
        sem.unlink()
        sem.close()

Wrapping the unlink() in a try works well.

The error is fairly rare. I tried doing a stress test but was unable to trigger it:

    @pytest.mark.parametrize("unique_path_count", [1, 10, 100])
    def test_1040(self, unique_path_count):
        """path_lock context manager."""
        worker_count = 100
        loop_count = 10
        lock_duration_min_sec = 0.0
        lock_duration_max_sec = 0.1

        path_list = [
            d1_test.instance_generator.random_data.random_lower_ascii(10, 20)
            for _ in range(unique_path_count)
        ]

        pool = multiprocessing.Pool(processes=worker_count)

        for _ in range(loop_count):
            random.shuffle(path_list)
            for path in path_list:
                pool.apply_async(
                    lock_path, args=(path, lock_duration_min_sec, lock_duration_max_sec)
                )

        pool.close()
        pool.join()


def lock_path(path, lock_duration_min_sec, lock_duration_max_sec):
    with d1_test.sample.path_lock(path):
        time.sleep(random.uniform(lock_duration_min_sec, lock_duration_max_sec))

rogerdahl avatar Jun 19 '19 16:06 rogerdahl

@rogerdahl @osvenskan Above path_lock context manager implementation with unlink doesn't perform locking. I'm running Ubuntu 18.04 and if I use above with unlink - I get race condition, here is complete test suite demonstrating this behavior:

import contextlib
import hashlib
import logging
import multiprocessing
import random
import time
from multiprocessing.pool import ThreadPool

import posix_ipc
import pytest


logger = logging.getLogger("test_path_lock")
_lock_name = "/lock_path_testing"


@contextlib.contextmanager
def path_lock(path):
    path = str(path)
    logger.debug("Waiting for lock on path: {}".format(path))
    sem = posix_ipc.Semaphore(
        name="/{}".format(hashlib.md5(path.encode("utf-8")).hexdigest()),
        flags=posix_ipc.O_CREAT,
        initial_value=1,
    )
    # posix_ipc.Semaphore() can be used as a context manager, and ``with`` uses implicit
    # ``try/finally`` blocks. However, as a context manager, posix_ipc.Semaphore() does
    # not ``close()`` and ``unlink()``.
    try:
        sem.acquire()
        logger.debug("Acquired lock on path: {}".format(path))
        yield
    finally:
        sem.release()
        logger.debug("Released lock on path: {}".format(path))
        try:
            sem.unlink()
        except posix_ipc.ExistentialError:
            pass
        sem.close()


@pytest.mark.parametrize("pool_cls", [multiprocessing.Pool, ThreadPool])
def test_named_lock(pool_cls):
    manager = multiprocessing.Manager()
    data = manager.Value("i", 0)
    worker_count = 5
    increment_count = 50
    lock_duration_min_sec = 0.01
    lock_duration_max_sec = 0.05
    pool = pool_cls(processes=worker_count)

    for i in range(increment_count):
        pool.apply_async(
            do_concurrent_increment, args=(data, lock_duration_min_sec, lock_duration_max_sec)
        )

    pool.close()
    pool.join()
    assert data.value == increment_count


def do_concurrent_increment(data, lock_duration_min_sec, lock_duration_max_sec):
    with path_lock(_lock_name):
        val = data.value
        time.sleep(random.uniform(lock_duration_min_sec, lock_duration_max_sec))
        data.value = val + 1

You can run it with pytest <file name>.py, lib version: posix_ipc==1.0.5 It seems like you need to call unlink only when absolutely sure that nobody else has opened semaphore

oleksii-sl avatar May 20 '22 06:05 oleksii-sl

@s-Oleksii I tried your test on Mint 20.3, based on Ubuntu 20.04, and it fails there as well.

This probably explains problems I've had when trying to implement some custom locking schemes. I think POSIX named semaphores may be fundamentally broken in that there is no generic way to safely clean them up.

rogerdahl avatar May 20 '22 14:05 rogerdahl

@s-Oleksii thanks for the report. Your test also fails for me under Mac OS.

Do you think I've implemented close() and unlink() incorrectly? Do you think you would get the same result if you could (somehow) code the same example in C?

osvenskan avatar May 23 '22 02:05 osvenskan

@osvenskan Here is what I came up with coding in C, please examine as I'm not a C dev. Issue was reproduced

// C program to demonstrate race condition when calling sem_unlink
// gcc posix_sem_test.c -lpthread -lrt -o posix_sem_test.out
// ./posix_sem_test.out
#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>

#define SEM_MUTEX_NAME "/sem-mutex"

//sem_t* mutex;
int counter = 0;

float float_rand( float min, float max )
{
    float scale = rand() / (float) RAND_MAX; /* [0, 1.0] */
    return min + scale * ( max - min );      /* [min, max] */
}

  
void* thread(void* arg)
{
    for (int i = 0; i < 10; i++) {
        sem_t* mutex = sem_open(SEM_MUTEX_NAME, O_CREAT, 0660, 1);
        if (mutex == SEM_FAILED) {
            perror ("sem_open"); exit (1);
        }
        //wait
        sem_wait(mutex);
        //critical section
        int local = counter;

        float wait = float_rand(.01, .08);
        printf("Wait %f\n", wait);
        sleep(wait);
        counter = local + 1;

        sem_post(mutex);
        sem_close(mutex);
        if (sem_unlink (SEM_MUTEX_NAME) == -1) {
            perror ("sem_unlink"); //exit (1);
        }
    }
}

int main()
{
    pthread_t t1,t2;
    pthread_create(&t1,NULL,thread,NULL);
    pthread_create(&t2,NULL,thread,NULL);
    pthread_join(t1,NULL);
    pthread_join(t2,NULL);

    printf("counter %d\n", counter);
    return 0;
}

oleksii-sl avatar May 23 '22 08:05 oleksii-sl

@oleksii-sl Thank you so much for coding up this example in pure C. Since posix_ipc is just a think wrapper around the operating system's API, I don't think there's anything I can do about this in my module except document it.

It's possible that this could be considered a bug in the Linux kernel if the standard can be interpreted to allow better behavior.

osvenskan avatar Jun 09 '22 00:06 osvenskan

As @oleksii-sl said,

It seems like you need to call unlink only when absolutely sure that nobody else has opened semaphore

Which basically means that some higher level information needs to be factored into the decision on when the semaphores can be cleaned up. It can't be done automatically by the library.

A few years ago, some really nice new IPC facilities were added to the Linux kernel. When I looked at it back then, I thought that it was probably just a question of time before someone used those to implement a new solid set of synchronization primitives. Wonder if that has happened...

rogerdahl avatar Jun 09 '22 02:06 rogerdahl

If the semaphore was created by the context manager, it seems like it would be safe to unlink it when exiting the context manager scope. Is that something that can be added to the library?

As far as I understand, unlink() is not safe. In my use case (on macOS Ventura), many short-living processes (executors) were opening a shared semaphore

try:
    sem = Semaphore("/my_sem", O_CREAT, initial_value=8)
    sem.acquire(timeout=0)
    … do something
    sem.release()
    sem.unlink()
    sem.close()
except:
    print("could not acquire the semaphore")

I was surprised to find over-subscription. I changed my code to never call unlink() (I reset /my_sem manually before starting the process that spawns the executors), and now my throttling is working correctly. I believe that there is some race condition (in the kernel) which results in that more than one instance of /my_sem is actually live at a time.

PS the order of unlink() and close() does not seem to matter.

alexcohn avatar Mar 26 '23 10:03 alexcohn

Thanks @alexcohn for the report. I'm glad you got your app working the way you wanted it to. What you did makes sense to me. I think if you don't call unlink() eventually, the semaphore won't get cleaned up, even when your application exits. But I'm not sure. :-) This may or may not be a problem depending on context.

I believe that there is some race condition (in the kernel) which results in that more than one instance of /my_sem is actually live at a time.

The idea that more than one instance of /my_sem can exist is documented in the standard. (Also in my posix_ipc doc for unlink(), but of course the standard documentation carries a lot more weight!) Maybe this is what you're observing. The standard says --

Calls to sem_open() to recreate or reconnect to the semaphore refer to a new semaphore after sem_unlink() is called.

https://pubs.opengroup.org/onlinepubs/9699919799/

The implication is that two semaphores can exist in the system with the same name but different values, which means that the name isn't a unique identifier. You can see a demonstration of that here: https://github.com/osvenskan/posix_ipc/issues/17#issuecomment-501520476

This seems like a significant weakness in the standard, and I think is the cause of much of the confusion for those involved in this conversation. I would love to go back in time to when this standard was being developed to understand why this choice was made. Maybe there was a good reason for it, maybe it was an oversight, maybe both (it seemed like a good idea at the time but the practical consequences weren't clearly foreseen).

osvenskan avatar Mar 26 '23 15:03 osvenskan

Right, the semaphore stays there after my program completes. But this is not a problem for me: I have code that resets it when I restart the program, so if it is not "cleaned up", its dirty state does not matter.

You speak about a problem with the standard. I understand this limitation, and I hoped that I could live with it, but this was not the case, unfortunately.

It's OK that a call to sem_open() after sem_unlink() will create a new copy of the semaphore. What is not OK, is that when two processes call sem_open(…, O_CREAT) with the same path, these may create two copies of the semaphore, i.e. the sem_open(…, O_CREAT) seems to be not atomic, at least on macOS. Only sem_open(…, O_EXCL) is officially declared to be atomic.

alexcohn avatar Mar 26 '23 19:03 alexcohn

when two processes call sem_open(…, O_CREAT) with the same path, these may create two copies of the semaphore, i.e. the sem_open(…, O_CREAT) seems to be not atomic

You're right about that, but it's not because sem_open() fails to be atomic. If two calls to sem_open('foo') return references to different semaphores, it's because that's what the standard calls for.

In the example in https://github.com/osvenskan/posix_ipc/issues/17#issuecomment-501520476 in the code block immediately after the phrase "This can lead to surprising behavior", sem2 is created with O_EXCL, and it still ends up with a reference to a different semaphore than sem1 even though both have the same name. There's no race condition involved. If you replay that example yourself, you can have lunch, make a cup of tea, and read the newspaper before creating sem2 and you'll get the same result.

Apple's implementation of POSIX IPC primitives could be better. They don't support message queues at all, nor do they support sem_getvalue() nor sem_timedwait(). Since their interest in POSIX IPC primitives seems like an attempt to meet some minimum requirement rather than provide full support, I wouldn't be surprised to find other shortcomings in Apple's implementation. In this case, though, I think sem_open() is behaving according to the standard.

osvenskan avatar Mar 26 '23 21:03 osvenskan

The difference in the scenario you show in #issuecomment-501520476 is that here sem_unlink() is not called. That's why it is a race: let the first sem_open() call settle, and the second will reference the same object. I am not saying that this implementation of sem_open() breaks the standard; I only wanted to expose a subtle detail of its behavior.

alexcohn avatar Mar 27 '23 06:03 alexcohn