
Profiler triggers `BufferError: cannot close exported pointers exist`

uellue opened this issue · 4 comments

What happened:

Running the ncempy DM file reader with on_memory=True inside distributed futures raises BufferError: cannot close exported pointers exist, while running the same code by other means does not. Either a short wait between the actual workload and closing the reader object, or disabling the distributed profiler, makes the error go away.

A minimal reproducer that mimics the behavior is included below.

What you expected to happen:

Code that runs natively should also run when submitted as dask.distributed futures.

Minimal Complete Verifiable Example:

import tempfile
import mmap

import dask
# Set the profiling interval so short that the bug is very likely to be triggered.
# It also happens with the default interval, just more rarely.
dask.config.set({"distributed.worker.profile.interval": "1 ms"})

# The bug doesn't happen when the profiler is disabled
# dask.config.set({"distributed.worker.profile.enabled": False})


import numpy as np
import distributed


def recurse(arr, index):
    '''
    Create many references into the array on the call stack.
    '''
    if index >= len(arr):
        return
    # This slice references memory from the memory map
    data = arr[index:index+1]
    return recurse(arr, index + 1)


def work_on_mmap(mm):
    '''
    Create a NumPy array backed by the memory map
    and do some work on it.
    '''
    aa = np.frombuffer(mm, dtype=np.uint8)
    recurse(aa, 0)


def do_map():
    '''
    Entry point, function to run on distributed cluster
    '''
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"abc"*100)
        f.seek(0)
        mm = mmap.mmap(f.fileno(), 0)
        work_on_mmap(mm)
        mm.close()


if __name__ == '__main__':
    for i in range(1000):
        do_map()  # works

    with distributed.Client() as client:
        for i in range(1000):
            # breaks
            future = client.submit(do_map, priority=1, pure=False)
            future.result()

Anything else we need to know?:

The profiler seems to hold references to temporary objects within futures.
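
For completeness, the two workarounds look roughly like this (a sketch only; close_with_retry is a hypothetical helper and the retry parameters are made up, not part of ncempy or distributed):

import time

import dask

# Workaround 1: disable the statistical profiler entirely
# (same config key as in the reproducer above).
dask.config.set({"distributed.worker.profile.enabled": False})


# Workaround 2: retry closing until the profiler thread has dropped its
# references to objects that still reference the buffer.
def close_with_retry(mm, attempts=10, delay=0.01):
    for _ in range(attempts):
        try:
            mm.close()
            return
        except BufferError:
            time.sleep(delay)  # give the profiler time to release its references
    mm.close()  # still failing: let the BufferError propagate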

Debugged in collaboration with @sk1p

CC @ercius FYI

Environment:

  • Dask version: 2022.7.0
  • Python version: 3.9.12
  • Operating System: CentOS 7.9
  • Install method (conda, pip, source): pip

uellue · Jul 21, 2022

Could you please clarify what is meant by "profiler"? I'm not seeing how profiling comes into the example above (so I'm guessing some context is just missing).

jakirkham · Jul 22, 2022

By "profiler" I mean this one: https://distributed.dask.org/en/stable/diagnosing-performance.html#statistical-profiling

The issue only appears when it is enabled, and shortening the interval makes the issue more likely to appear. It looks like the profiler thread holds additional references to objects owned by the workload. As long as these references include objects that reference memory in the memory map, trying to close the memory map triggers the BufferError. That means a function within the worker can't reliably release resources. The Dask profiler or any other part of Dask shouldn't hold additional references to objects owned by the workload at any time.
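
For illustration, here is a much simplified sketch of the mechanism (not distributed's actual implementation, just the general shape of a sampling profiler): a thread that captures frames via sys._current_frames() keeps those frame objects, and with them the workload's local variables, alive until the samples are dropped.

import sys
import threading
import time

# Simplified sketch of a statistical profiler thread; NOT distributed's code.
samples = []

def sample_stacks(stop_event, interval=0.001):
    while not stop_event.is_set():
        # Each captured frame keeps its locals alive, including any
        # NumPy views into a memory map held by the sampled thread.
        samples.append(dict(sys._current_frames()))
        time.sleep(interval)

stop = threading.Event()
threading.Thread(target=sample_stacks, args=(stop,), daemon=True).start()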

The ncempy DM reader and the example above make the issue likely to appear by creating a complex temporary data structure that references parts of a memory map. In the DM reader that's parsing a tag tree in the file header. Normally this data structure is released before the memory map is closed and only copies of relevant parts are retained. With the Dask profiler thread holding additional references, however, this temporary data structure survives longer than it should, and closing the memory map triggers a BufferError.
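
The underlying Python behavior can be shown without Dask at all: as long as any exported view of the buffer is alive, mmap refuses to close. A small standalone sketch:

import mmap
import tempfile

import numpy as np

with tempfile.NamedTemporaryFile() as f:
    f.write(b"abc" * 100)
    f.flush()
    mm = mmap.mmap(f.fileno(), 0)
    view = np.frombuffer(mm, dtype=np.uint8)  # exports the mmap's buffer
    try:
        mm.close()  # BufferError: cannot close exported pointers exist
    except BufferError as e:
        print(e)
    del view  # drop the last reference to the exported buffer
    mm.close()  # now closing succeeds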

uellue · Jul 23, 2022

I don't think this will fix your problem, but there was a recent update to ncempy where a similar buffer error was seen only on Python 3.9. See these two issues and a PR: https://github.com/ercius/openNCEM/issues/39 https://github.com/ercius/openNCEM/issues/41 https://github.com/ercius/openNCEM/pull/40

I released ncempy version 1.10 (https://github.com/ercius/openNCEM/releases/tag/v1.10.0) to include this fix. Does this problem persist with the newest ncempy version?

ercius · Jul 23, 2022

@ercius yes, this was tested with the version that includes the fix. Before the fix it triggered a BufferError outside of Dask as well; with the fix it only happens when run as a future. From what I can tell, the current behavior of the DM reader seems correct.

uellue · Jul 23, 2022