SegmentMIF fails and causes program to hang
I think this is related to #211, but the error message is different on GPU. There are two types of errors, both shown below:
Error 1:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999996185302735 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
result = await result
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
Traceback (most recent call last):
File "mif-slidedataset-to-tiledataset-to-dataloader-test-via-dataset.py", line 60, in <module>
slide_dataset.run(pipeline = pipeline, client = client, write_dir = write_dir, distributed = True, tile_size = 512)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_dataset.py", line 57, in run
slide.run(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_data.py", line 320, in run
big_future = client.scatter(tile)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2354, in scatter
return self.sync(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
return sync(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 363, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 348, in f
result[0] = yield future
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2239, in _scatter
await self.scheduler.scatter(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 900, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 693, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
result = await result
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
Error 2:
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 686.67 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.79 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.42 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.79 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 691.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 696.19 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 695.44 MiB -- Worker memory limit: 738.85 MiB
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect
comm = await fut
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 3011, in gather_dep
response = await get_data_from_worker(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 4305, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 4282, in _get_data
comm = await rpc.connect(worker)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect
raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fce64a99c70>>, <Task finished name='Task-14' coro=<Worker.gather_dep() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:2955> exception=CommClosedError('Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:46876 remote=tcp://127.0.0.1:35715> already closed.')>)
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 3077, in gather_dep
self.batched_stream.send(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/batched.py", line 137, in send
raise CommClosedError(f"Comm {self.comm!r} already closed.")
distributed.comm.core.CommClosedError: Comm <TCP (closed) Worker->Scheduler local=tcp://127.0.0.1:46876 remote=tcp://127.0.0.1:35715> already closed.
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fce64a99c70>>, <Task finished name='Task-11' coro=<Worker.handle_scheduler() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:1318> exception=CommClosedError('ConnectionPool not running. Status: Status.closed')>)
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect
comm = await fut
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1331, in handle_scheduler
await self.close(report=False)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1561, in close
await r.close_gracefully()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect
raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f3015274c70>>, <Task finished name='Task-17' coro=<Worker.close() done, defined at /opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py:1537> exception=CommClosedError('ConnectionPool not running. Status: Status.closed')>)
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1067, in connect
comm = await fut
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1561, in close
await r.close_gracefully()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1078, in connect
raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 660.48 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 661.51 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 662.43 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 664.92 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 668.09 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 672.64 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 682.02 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1273, in heartbeat
response = await retry_operation(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1054, in connect
raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 688.68 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:38615 -> tcp://127.0.0.1:34199
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1752, in get_data
response = await comm.read(deserializers=serializers)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/comm/tcp.py", line 220, in read
convert_stream_closed_error(self, e)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:38615 remote=tcp://127.0.0.1:39154>: ConnectionResetError: [Errno 104] Connection reset by peer
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 692.82 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/worker.py", line 1273, in heartbeat
response = await retry_operation(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
return await retry(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 897, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 1054, in connect
raise CommClosedError(
distributed.comm.core.CommClosedError: ConnectionPool not running. Status: Status.closed
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 693.57 MiB -- Worker memory limit: 738.85 MiB
distributed.nanny - WARNING - Worker process still alive after 3.999999237060547 seconds, killing
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Worker process still alive after 3.9999994277954105 seconds, killing
2022-02-10 20:04:05.835765: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 535.11 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 542.93 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 558.33 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 571.01 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 577.54 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 589.02 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 594.77 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 594.77 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 596.30 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 605.31 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 622.14 MiB -- Worker memory limit: 738.85 MiB
distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 676.49 MiB -- Worker memory limit: 738.85 MiB
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
send_bytes(obj)
File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/opt/conda/envs/pathml/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
distributed.core - ERROR - Exception while handling op scatter
Traceback (most recent call last):
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
result = await result
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
Traceback (most recent call last):
File "mif-slidedataset-to-tiledataset-to-dataloader-test-via-dataset.py", line 60, in <module>
slide_dataset.run(pipeline = pipeline, client = client, write_dir = write_dir, distributed = True, tile_size = 512)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_dataset.py", line 57, in run
slide.run(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/pathml/core/slide_data.py", line 320, in run
big_future = client.scatter(tile)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2354, in scatter
return self.sync(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 309, in sync
return sync(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 363, in sync
raise exc.with_traceback(tb)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/utils.py", line 348, in f
result[0] = yield future
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/client.py", line 2239, in _scatter
await self.scheduler.scatter(
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 900, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 693, in send_recv
raise exc.with_traceback(tb)
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/core.py", line 520, in handle_comm
result = await result
File "/opt/conda/envs/pathml/lib/python3.8/site-packages/distributed/scheduler.py", line 5832, in scatter
raise TimeoutError("No valid workers found")
asyncio.exceptions.TimeoutError: No valid workers found
I think this is because we are using Dask. I'm not sure how to proceed; I can try using pathos and see whether it works.
Thanks for posting this, Surya. Something is clearly going wrong with the Dask cluster. We'll need to dig into what is causing it. I'm guessing that part of it is caused on the pathml side by not making optimal use of Dask, and part is probably caused by some details of how that specific Dask cluster is configured.
Are you able to proceed with distributed=False?
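In the meantime, it may be worth giving the workers more headroom, since the logs show them hitting a ~739 MiB per-worker limit. A minimal sketch of what I mean, assuming a single-machine LocalCluster (the worker count and memory limit below are placeholders, not tested values):

from dask.distributed import Client, LocalCluster

# fewer workers, each with a larger memory budget, instead of the defaults
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    memory_limit="8GB",  # well above the ~739 MiB per-worker limit in the logs
)
client = Client(cluster)

# then the same call as in the traceback:
# slide_dataset.run(pipeline=pipeline, client=client, write_dir=write_dir,
#                   distributed=True, tile_size=512)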
Not using the GPU, since I need cudnn==8.1.x and I'm still sorting that out. I have a machine with CUDA 11.0, and Mesmer needs cuDNN 8.1.x, but I have 8.0.5; I find the cuDNN installation instructions rather complicated. Do you have any experience installing cuDNN 8.1.x?
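As a sanity check, something like this should show what the pathml environment actually provides (a sketch, assuming the TensorFlow 2.x that DeepCell/Mesmer pulls in):

import tensorflow as tf

# CUDA/cuDNN versions this TensorFlow build was compiled against
build_info = tf.sysconfig.get_build_info()
print("CUDA: ", build_info.get("cuda_version"))
print("cuDNN:", build_info.get("cudnn_version"))

# whether TensorFlow can see the GPU with the currently installed CUDA/cuDNN
print("GPUs:", tf.config.list_physical_devices("GPU"))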
Btw, with distributed=False, I am able to get it to work, but it might take over 4.5 hours per slide. Do you have any ideas on how to leverage the fact that the Mesmer model can handle batches of size > 1?
Yeah, that's a good point. Batching tiles would probably make it more efficient.
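As a rough sketch of what batched inference could look like, calling DeepCell's Mesmer application directly rather than going through SegmentMIF (here tiles is a hypothetical list of (512, 512, C) numpy tile arrays, channels 0 and 1 are assumed to hold the nuclear and membrane signal, and image_mpp would need to match the slide's resolution):

import numpy as np
from deepcell.applications import Mesmer

app = Mesmer()  # load the pretrained model once, not once per tile

# stack the nuclear + membrane channels of every tile into one (N, 512, 512, 2) batch
batch = np.stack([tile[..., :2] for tile in tiles])

# a single predict call over the whole batch instead of one call per tile
cell_masks = app.predict(batch, image_mpp=0.5, compartment="whole-cell")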