Profiler triggers `BufferError: cannot close exported pointers exist`
What happened:

Running the `ncempy` DM file reader with `on_memory=True` in `distributed` futures causes `BufferError: cannot close exported pointers exist` errors, while running it with other methods doesn't. Both a short wait between the actual workload and closing the reader object, and disabling the distributed profiler, fix the issue (a sketch of the triggering pattern and the profiler workaround follows below).
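For concreteness, the failing pattern looks roughly like the following. This is only a hedged sketch, not code from the actual project: it assumes the `ncempy.io.dm.fileDM` reader API (`on_memory` flag, `getDataset`, a `'data'` key in the result), `"my_file.dm4"` is a placeholder file name, and the commented-out config line uses the same key as in the reproducer further down.

```python
# Hypothetical sketch of the triggering pattern -- adjust to the actual reader calls used.
import dask
import distributed
from ncempy.io.dm import fileDM  # assumed reader API

# Workaround: uncommenting this line (disabling the statistical profiler) avoids the error.
# dask.config.set({"distributed.worker.profile.enabled": False})


def load_dm(path):
    # The on_memory reader references the memory-mapped file contents while parsing.
    # Releasing/closing the reader afterwards is the step that raises BufferError
    # when the profiler thread still holds references into the map.
    reader = fileDM(path, on_memory=True)
    data = reader.getDataset(0)['data']
    del reader  # stand-in for "closing the reader object"
    return data


with distributed.Client() as client:
    future = client.submit(load_dm, "my_file.dm4", pure=False)  # placeholder file name
    result = future.result()
```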
A minimal reproducer that mimics the behavior is included below.
What you expected to happen:
If code runs natively, it will also run in `dask.distributed` futures.
Minimal Complete Verifiable Example:
```python
import tempfile
import mmap

import dask

# Set the profiling interval so short that the bug is very likely to be triggered.
# It also happens with the default interval, just more rarely.
dask.config.set({"distributed.worker.profile.interval": "1 ms"})
# The bug doesn't happen when the profiler is disabled:
# dask.config.set({"distributed.worker.profile.enabled": False})

import numpy as np
import distributed


def recurse(arr, index):
    '''
    Create many references into the array on the call stack.
    '''
    if index >= len(arr):
        return
    # This slice references memory from the memory map
    data = arr[index:index+1]
    return recurse(arr, index + 1)


def work_on_mmap(mm):
    '''
    Create a NumPy array backed by the memory map
    and do some work on it.
    '''
    aa = np.frombuffer(mm, dtype=np.uint8)
    recurse(aa, 0)


def do_map():
    '''
    Entry point, function to run on distributed cluster
    '''
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"abc" * 100)
        f.seek(0)
        mm = mmap.mmap(f.fileno(), 0)
        work_on_mmap(mm)
        mm.close()


if __name__ == '__main__':
    for i in range(1000):
        do_map()  # works

    with distributed.Client() as client:
        for i in range(1000):
            # breaks
            future = client.submit(do_map, priority=1, pure=False)
            future.result()
```
Anything else we need to know?:
The profiler seems to hold references to temporary objects within futures.
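This is not Dask's actual profiler implementation, just a sketch of the suspected mechanism: a frame snapshot, of the kind a sampling profiler takes from another thread, keeps a function's locals alive even after the function has returned.

```python
import sys
import threading


def workload():
    # A local we expect to be released as soon as workload() returns,
    # e.g. a temporary structure referencing a memory map.
    temporary = bytearray(b"temporary data")
    # Stand-in for a profiler thread sampling this thread's stack.
    return sys._current_frames()[threading.get_ident()]


sampled_frame = workload()
# The sampled frame still references the function's locals after the return:
print(sampled_frame.f_locals["temporary"])  # bytearray(b'temporary data')
```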
Debugged in collaboration with @sk1p
CC @ercius FYI
Environment:
- Dask version: 2022.7.0
- Python version: 3.9.12
- Operating System: CentOS 7.9
- Install method (conda, pip, source): pip
Could you please clarify what is meant by "profiler"? I'm not seeing how profiling comes into the example above (so I'm guessing some context is just missing).
By "profiler" I mean this one: https://distributed.dask.org/en/stable/diagnosing-performance.html#statistical-profiling

The issue only appears when it is enabled, and shortening the interval makes the issue more likely to appear. It looks like the profiler thread holds additional references to objects owned by the workload. As long as these references include objects that reference memory in the memory map, trying to close the memory map triggers the `BufferError`. That means a function within the worker can't reliably release resources. The Dask profiler, or any other part of Dask, shouldn't hold additional references to objects owned by the workload at any time.

The ncempy DM reader and the example above make the issue likely to appear by creating a complex temporary data structure that references parts of a memory map; in the DM reader, that's the result of parsing the tag tree in the file header. Normally this data structure is released before the memory map is closed, and only copies of the relevant parts are retained. With the Dask profiler thread holding additional references, however, this temporary data structure survives longer than it should, and closing the memory map triggers a `BufferError`.
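To make the failure mode concrete, here is a minimal sketch independent of Dask and ncempy: as long as any view into the memory map is still alive, the buffer stays exported and `mmap.close()` raises the `BufferError`. The `extra_ref` below stands in for the reference the profiler thread is suspected to keep alive.

```python
import mmap
import tempfile

import numpy as np

with tempfile.NamedTemporaryFile() as f:
    f.write(b"abc" * 100)
    f.seek(0)

    # All views released before closing: close() succeeds.
    mm = mmap.mmap(f.fileno(), 0)
    view = np.frombuffer(mm, dtype=np.uint8)  # exports a pointer into the mmap
    del view
    mm.close()  # fine

    # One surviving reference into the map: close() raises.
    mm = mmap.mmap(f.fileno(), 0)
    view = np.frombuffer(mm, dtype=np.uint8)
    extra_ref = view[:1]  # stand-in for the reference held by the profiler
    del view              # not enough: extra_ref keeps the buffer exported
    try:
        mm.close()
    except BufferError as e:
        print(e)  # cannot close exported pointers exist
```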
I don't think this will fix your problem, but there was a recent update to ncempy where a similar buffer error was seen only in Python 3.9. See these two issues and a PR:
https://github.com/ercius/openNCEM/issues/39
https://github.com/ercius/openNCEM/issues/41
https://github.com/ercius/openNCEM/pull/40
I released ncempy version 1.10 (https://github.com/ercius/openNCEM/releases/tag/v1.10.0) to include this fix. Does this problem persist with the newest ncempy version?
@ercius yes, this was tested using the version with the fix. Before the fix it would trigger a `BufferError` outside of Dask as well; with the fix, only when run as a future. From what I can tell, the current behavior of the DM reader seems correct.