distributed zarr write fail - OSError: too many open files; P2PConsistencyError: No active shuffle with
Describe the issue: I'm processing NetCDF files and converting them to Zarr with xarray. For this, I'm using a Coiled cluster, dask, xarray, s3fs...
As a normal user who just wants to process data, I'm ending up with random dask behaviour: sometimes the processing works (rarely), but most of the time it fails with what look like various race conditions. I have changed vm_types, nthreads, max_pool_connections and heaps of other settings without any success, and the log errors I get are anything but useful.
Minimal Complete Verifiable Example:
"coiled_cluster_options": {
    "n_workers": [
        20,
        100
    ],
    "scheduler_vm_types": "m7i-flex.large",
    "worker_vm_types": "m7i-flex.xlarge",
    "allow_ingress_from": "me",
    "compute_purchase_option": "spot_with_fallback",
    "worker_options": {
        "nthreads": 2
    }
},
I use either p2p or tasks, but end up with the same behaviour:
import dask

dask.config.set(
    {
        "array.slicing.split_large_chunks": False,
        "distributed.scheduler.worker-saturation": "inf",
        "dataframe.shuffle.method": "p2p",
    }
)
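One thing that may matter here: as far as I understand, `dataframe.shuffle.method` only applies to DataFrame shuffles, while array rechunking (which is what xarray triggers) reads `array.rechunk.method`. A minimal sketch of setting the array path explicitly, assuming that is the key actually in play for this workload:

```python
import dask

# Assumption: the rechunk triggered through xarray follows
# "array.rechunk.method" ("tasks" or "p2p"), not "dataframe.shuffle.method".
dask.config.set({"array.rechunk.method": "tasks"})

# Confirm what the client process will use when it builds the graph.
print(dask.config.get("array.rechunk.method"))
```

If the P2PConsistencyError still shows up with this set to "tasks", the P2P code path is being selected somewhere else.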
and my distributed config file:
scheduler:
  work-stealing: False
  allowed-failures: 1  # fail fast
worker:
  memory:
    spill: False
    pause: False
    terminate: False
I also tried setting spill, pause and terminate to values such as 0.90, without any improvement.
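Since one of the failures is errno 24 (too many open files), the file descriptor limit is probably more relevant than the memory thresholds. A minimal sketch for checking it with the standard library only; `fd_usage` is a hypothetical helper name, and it assumes a Unix host where `/proc/self/fd` or `/dev/fd` exists:

```python
import os
import resource


def fd_usage() -> dict:
    """Report the descriptor limits and the number of descriptors open right now."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Linux exposes open descriptors under /proc/self/fd; macOS/BSD under /dev/fd.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    try:
        open_fds = len(os.listdir(fd_dir))
    except OSError:
        open_fds = None  # could not inspect descriptors on this platform
    return {"soft_limit": soft, "hard_limit": hard, "open_fds": open_fds}


if __name__ == "__main__":
    print(fd_usage())
```

With a `distributed.Client` called `client`, `client.run(fd_usage)` should return one result per worker and `client.run_on_scheduler(fd_usage)` the scheduler's. The scheduler is worth checking because the frames in the dump go through `distributed.scheduler` `update_graph`.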
The log output I get is not human-readable:
\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x1dunpickle_exception_with_attrs\x94\x93\x94(\x8c\x08builtins\x94\x8c\x0cRuntimeError\x94\x93\x94}\x94(\x8c\x08__dict__\x94}\x94\x8c\x04args\x94\x8c\xfaError during deserialization of the task graph. This frequently\noccurs if the Scheduler and Client have different environments.\nFor more information, see\nhttps://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments\n\x94\x85\x94uh\x00\x8c\x12unpickle_exception\x94\x93\x94(\x8c\x13botocore.exceptions\x94\x8c\x1b_exception_from_packed_args\x94\x93\x94h\x0e\x8c\x17EndpointConnectionError\x94\x93\x94N}\x94(\x8c\x0cendpoint_url\x94\x8c\x90https://imos-data.s3.ap-southeast-2.amazonaws.com/IMOS/SRS/SST/ghrsst/L3SM-1d/dn/2012/20120430092000-ABOM-L3S_GHRSST-SSTfnd-MultiSensor-1d_dn.nc\x94\x8c\x05error\x94h\x02(\x8c\x19aiohttp.client_exceptions\x94\x8c\x17ClientConnectorDNSError\x94\x93\x94}\x94(h\x07}\x94(\x8c\t_conn_key\x94\x8c\x15aiohttp.client_reqrep\x94\x8c\rConnectionKey\x94\x93\x94(\x8c)imos-data.s3.ap-southeast-2.amazonaws.com\x94M\xbb\x01\x88\x88NNNt\x94\x81\x94\x8c\t_os_error\x94h\r(h\x03\x8c\x07OSError\x94\x93\x94K\x18\x8c\x13Too many open 
files\x94\x86\x94Nh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x11aiohttp.connector\x94\x8c\x08__file__\x94\x8cA/opt/coiled/env/lib/python3.12/site-packages/aiohttp/connector.py\x94u\x8c\x06f_code\x94h*\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h6\x8c\x07co_name\x94\x8c\x19_create_direct_connection\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94M\x02\x06ubM\xfc\x05h*\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\r_resolve_host\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM}\x04ub\x8c\ttb_lineno\x94M|\x04\x8c\x07tb_next\x94hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x1b_resolve_host_with_throttle\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xab\x04ubhSM\x9b\x04hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x10aiohttp.resolver\x94h5\x8c@/opt/coiled/env/lib/python3.12/site-packages/aiohttp/resolver.py\x94uh7h9)\x81\x94}\x94(h<heh=\x8c\x07resolve\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK(ubhSK(hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x13asyncio.base_events\x94h5\x8c5/opt/coiled/env/lib/python3.12/asyncio/base_events.py\x94uh7h9)\x81\x94}\x94(h<hph=\x8c\x0bgetaddrinfo\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x89\x03ubhSM\x89\x03hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x19concurrent.futures.thread\x94h5\x8c;/opt/coiled/env/lib/python3.12/concurrent/futures/thread.py\x94uh7h9)\x81\x94}\x94(h<h{h=\x8c\x03run\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK?ubhSK;hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x06socket\x94h5\x8c(/opt/coil
ed/env/lib/python3.12/socket.py\x94uh7h9)\x81\x94}\x94(h<h\x86h=hsh?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xd2\x03ubhSM\xd2\x03ubububububub\x87\x94R\x94N\x89Nt\x94R\x94uh\th"h\x8c\x86\x94\x8c\x05errno\x94K\x18\x8c\x08strerror\x94h&uh\x8ch)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x17aiobotocore.httpsession\x94h5\x8cG/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/httpsession.py\x94uh7h9)\x81\x94}\x94(h<h\x95h=\x8c\x04send\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x16\x01ubK\xe0hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0eaiohttp.client\x94h5\x8c>/opt/coiled/env/lib/python3.12/site-packages/aiohttp/client.py\x94uh7h9)\x81\x94}\x94(h<h\xa0h=\x8c\x08_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xa6\x03ubhSM\x0b\x03hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\x9fh5h\xa0uh7h9)\x81\x94}\x94(h<h\xa0h=\x8c\x19_connect_and_send_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xe1\x02ubhSM\xde\x02hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x07connect\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x88\x02ubhSM\x82\x02hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x12_create_connection\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x04ubhSM\xb9\x04hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=h>h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x02\x06ubhSM\x02\x06ububububub\x87\x94R\x94h\x8c\x88N)t\x94R\x94h\x1bbu\x87\x94Nh)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x15distributed.scheduler\x94h5\x8cE/opt/coiled/env/lib/python3.12/site-packages/distributed/scheduler.py\x94uh7h9)\x81\x94}\x94(h<h\xd1h=\x8c\x0cupdate_graph\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMt\x13ubM\n\x13hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1edistributed.protocol.serialize\x94h5\x8cN/opt/coiled/env/lib/python3.12/site-packages/distributed/protocol/serialize.py\x94uh7
h9)\x81\x94}\x94(h<h\xdch=\x8c\x0bdeserialize\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xc4\x01ubhSM\xc4\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xdbh5h\xdcuh7h9)\x81\x94}\x94(h<h\xdch=\x8c\x0cpickle_loads\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKoubhSKohThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1bdistributed.protocol.pickle\x94h5\x8cK/opt/coiled/env/lib/python3.12/site-packages/distributed/protocol/pickle.py\x94uh7h9)\x81\x94}\x94(h<h\xf0h=\x8c\x05loads\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKbubhSK]hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1cxarray.backends.file_manager\x94h5\x8cL/opt/coiled/env/lib/python3.12/site-packages/xarray/backends/file_manager.py\x94uh7h9)\x81\x94}\x94(h<h\xfbh=\x8c\x0c__setstate__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x17\x01ubhSM\x17\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh=\x8c\x08__init__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x94ubhSK\x94hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh=\x8c\t_make_key\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xa7ubhSK\xa7hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh=j\x07\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMM\x01ubhSMM\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0bfsspec.spec\x94h5\x8c;/opt/coiled/env/lib/python3.12/site-packages/fsspec/spec.py\x94uh7h9)\x81\x94}\x94(h<j \x01\x00\x00h=\x8c\x08__hash__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x9f\x07ubhSM\x9f\x07hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x1f\x01\x00\x00h5j \x01\x00\x00uh7h9)\x81\x94}\x94(h<j 
\x01\x00\x00h=\x8c\x07details\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x85\x07ubhSM\x85\x07hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0bfsspec.asyn\x94h5\x8c;/opt/coiled/env/lib/python3.12/site-packages/fsspec/asyn.py\x94uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x07wrapper\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKvubhSKvhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j3\x01\x00\x00h5j4\x01\x00\x00uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x04sync\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKgubhSKghThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j3\x01\x00\x00h5j4\x01\x00\x00uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x07_runner\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK<ubhSK8hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\ts3fs.core\x94h5\x8c9/opt/coiled/env/lib/python3.12/site-packages/s3fs/core.py\x94uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=\x8c\x05_info\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x05ubhSM\xa5\x05hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=\x8c\x08_call_s3\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMs\x01ubhSMs\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=\x8c\x0e_error_wrapper\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x92ubhSK\x92hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=jf\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x92ubhSKrhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x13aiobotocore.context\x94h5\x8cC/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/context.py\x94uh7h9)\x81\x94}\x94(h<jv\x01\x00\x00h=j7\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK$ubhSK$hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x12aiobotocore.client\x94h5\x8cB/opt/coiled/env/
lib/python3.12/site-packages/aiobotocore/client.py\x94uh7h9)\x81\x94}\x94(h<j\x80\x01\x00\x00h=\x8c\x0e_make_api_call\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x96\x01ubhSM\x96\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x7f\x01\x00\x00h5j\x80\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x80\x01\x00\x00h=\x8c\r_make_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x01ubhSM\xb0\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x14aiobotocore.endpoint\x94h5\x8cD/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/endpoint.py\x94uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\r_send_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKxubhSKxhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x0c_needs_retry\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x18\x01ubhSM\x18\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x11aiobotocore.hooks\x94h5\x8cA/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/hooks.py\x94uh7h9)\x81\x94}\x94(h<j\xa8\x01\x00\x00h=\x8c\x05_emit\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKDubhSKDhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x14aiobotocore._helpers\x94h5\x8cD/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/_helpers.py\x94uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=\x8c\x11resolve_awaitable\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x18aiobotocore.retryhandler\x94h5\x8cH/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/retryhandler.py\x94uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=\x8c\x05_call\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKkubhSKkhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xb2\x01\x00\x00h5j\xb3\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=j\xb6\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x9
4(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=j\xc1\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK~ubhSK~hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=\x8c\r_should_retry\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xa5ubhSK\xa5hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xb2\x01\x00\x00h5j\xb3\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=j\xb6\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=j\xc1\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xaeubhSK\xaehThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x15botocore.retryhandler\x94h5\x8cE/opt/coiled/env/lib/python3.12/site-packages/botocore/retryhandler.py\x94uh7h9)\x81\x94}\x94(h<j\xf2\x01\x00\x00h=\x8c\x08__call__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xf7ubhSK\xf7hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xf1\x01\x00\x00h5j\xf2\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xf2\x01\x00\x00h=\x8c\x17_check_caught_exception\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xa0\x01ubhSM\xa0\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x10_do_get_response\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xd0ubhSK\xc9hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x05_send\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM/\x01ubhSM/\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\x94h5h\x95uh7h9)\x81\x94}\x94(h<h\x95h=h\x98h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x16\x01ubhSM\x16\x01ububububububububububububububububububububububububububububu
bububububub\x87\x94R\x94h\xca\x89Nt\x94R\x94h)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xd0h5h\xd1uh7h9)\x81\x94}\x94(h<h\xd1h=h\xd4h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMt\x13ubM\x13\x13N\x87\x94R\x94j\x1c\x02\x00\x00\x88N)t\x94R\x94.'. Traceback (most recent call last): File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1007, in publish_cloud_optimised_fileset_batch self._write_ds(ds, idx) File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1786, in _write_ds self._append_zarr_store(ds) File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1840, in _append_zarr_store ds.to_zarr( File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/core/dataset.py", line 2292, in to_zarr return to_zarr( # type: ignore[call-overload,misc] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/api.py", line 2246, in to_zarr writes = writer.sync( ^^^^^^^^^^^^ File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/common.py", line 357, in sync delayed_store = chunkmanager.store( ^^^^^^^^^^^^^^^^^^^ File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py", line 247, in store return store( ^^^^^^ File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/dask/array/core.py", line 1221, in store dask.compute(arrays, **kwargs) File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/dask/base.py", line 681, in compute results = schedule(expr, keys, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/distributed/client.py", line 2416, in _gather raise 
exception.with_traceback(traceback) Exception: b'\x80\x05\x95\xed \x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x1dunpickle_exception_with_attrs\x94\x93\x94(\x8c\x08builtins\x94\x8c\x0cRuntimeError\x94\x93\x94}\x94(\x8c\x08__dict__\x94}\x94\x8c\x04args\x94\x8c\xfaError during deserialization of the task graph. This frequently\noccurs if the Scheduler and Client have different environments.\nFor more information, see\nhttps://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments\n\x94\x85\x94uh\x00\x8c\x12unpickle_exception\x94\x93\x94(\x8c\x13botocore.exceptions\x94\x8c\x1b_exception_from_packed_args\x94\x93\x94h\x0e\x8c\x17EndpointConnectionError\x94\x93\x94N}\x94(\x8c\x0cendpoint_url\x94\x8c\x90https://imos-data.s3.ap-southeast-2.amazonaws.com/IMOS/SRS/SST/ghrsst/L3SM-1d/dn/2012/20120430092000-ABOM-L3S_GHRSST-SSTfnd-MultiSensor-1d_dn.nc\x94\x8c\x05error\x94h\x02(\x8c\x19aiohttp.client_exceptions\x94\x8c\x17ClientConnectorDNSError\x94\x93\x94}\x94(h\x07}\x94(\x8c\t_conn_key\x94\x8c\x15aiohttp.client_reqrep\x94\x8c\rConnectionKey\x94\x93\x94(\x8c)imos-data.s3.ap-southeast-2.amazonaws.com\x94M\xbb\x01\x88\x88NNNt\x94\x81\x94\x8c\t_os_error\x94h\r(h\x03\x8c\x07OSError\x94\x93\x94K\x18\x8c\x13Too many open 
files\x94\x86\x94Nh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x11aiohttp.connector\x94\x8c\x08__file__\x94\x8cA/opt/coiled/env/lib/python3.12/site-packages/aiohttp/connector.py\x94u\x8c\x06f_code\x94h*\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h6\x8c\x07co_name\x94\x8c\x19_create_direct_connection\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94M\x02\x06ubM\xfc\x05h*\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\r_resolve_host\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM}\x04ub\x8c\ttb_lineno\x94M|\x04\x8c\x07tb_next\x94hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x1b_resolve_host_with_throttle\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xab\x04ubhSM\x9b\x04hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x10aiohttp.resolver\x94h5\x8c@/opt/coiled/env/lib/python3.12/site-packages/aiohttp/resolver.py\x94uh7h9)\x81\x94}\x94(h<heh=\x8c\x07resolve\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK(ubhSK(hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x13asyncio.base_events\x94h5\x8c5/opt/coiled/env/lib/python3.12/asyncio/base_events.py\x94uh7h9)\x81\x94}\x94(h<hph=\x8c\x0bgetaddrinfo\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x89\x03ubhSM\x89\x03hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x19concurrent.futures.thread\x94h5\x8c;/opt/coiled/env/lib/python3.12/concurrent/futures/thread.py\x94uh7h9)\x81\x94}\x94(h<h{h=\x8c\x03run\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK?ubhSK;hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x06socket\x94h5\x8c(/opt/coil
ed/env/lib/python3.12/socket.py\x94uh7h9)\x81\x94}\x94(h<h\x86h=hsh?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xd2\x03ubhSM\xd2\x03ubububububub\x87\x94R\x94N\x89Nt\x94R\x94uh\th"h\x8c\x86\x94\x8c\x05errno\x94K\x18\x8c\x08strerror\x94h&uh\x8ch)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x17aiobotocore.httpsession\x94h5\x8cG/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/httpsession.py\x94uh7h9)\x81\x94}\x94(h<h\x95h=\x8c\x04send\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x16\x01ubK\xe0hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0eaiohttp.client\x94h5\x8c>/opt/coiled/env/lib/python3.12/site-packages/aiohttp/client.py\x94uh7h9)\x81\x94}\x94(h<h\xa0h=\x8c\x08_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xa6\x03ubhSM\x0b\x03hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\x9fh5h\xa0uh7h9)\x81\x94}\x94(h<h\xa0h=\x8c\x19_connect_and_send_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xe1\x02ubhSM\xde\x02hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x07connect\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x88\x02ubhSM\x82\x02hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=\x8c\x12_create_connection\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x04ubhSM\xb9\x04hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h4h5h6uh7h9)\x81\x94}\x94(h<h6h=h>h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x02\x06ubhSM\x02\x06ububububub\x87\x94R\x94h\x8c\x88N)t\x94R\x94h\x1bbu\x87\x94Nh)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x15distributed.scheduler\x94h5\x8cE/opt/coiled/env/lib/python3.12/site-packages/distributed/scheduler.py\x94uh7h9)\x81\x94}\x94(h<h\xd1h=\x8c\x0cupdate_graph\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMt\x13ubM\n\x13hH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1edistributed.protocol.serialize\x94h5\x8cN/opt/coiled/env/lib/python3.12/site-packages/distributed/protocol/serialize.py\x94uh7
h9)\x81\x94}\x94(h<h\xdch=\x8c\x0bdeserialize\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xc4\x01ubhSM\xc4\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xdbh5h\xdcuh7h9)\x81\x94}\x94(h<h\xdch=\x8c\x0cpickle_loads\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKoubhSKohThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1bdistributed.protocol.pickle\x94h5\x8cK/opt/coiled/env/lib/python3.12/site-packages/distributed/protocol/pickle.py\x94uh7h9)\x81\x94}\x94(h<h\xf0h=\x8c\x05loads\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKbubhSK]hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x1cxarray.backends.file
_manager\x94h5\x8cL/opt/coiled/env/lib/python3.12/site-packages/xarray/backends/file_manager.py\x94uh7h9)\x81\x94}\x94(h<h\xfbh=\x8c\x0c__setstate__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x17\x01ubhSM\x17\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh=\x8c\x08__init__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x94ubhSK\x94hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh
=\x8c\t_make_key\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xa7ubhSK\xa7hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xfah5h\xfbuh7h9)\x81\x94}\x94(h<h\xfbh=j\x07\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMM\x01ubhSMM\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0bfsspec.spec\x94h5\x8c;/opt/coiled/env/lib/python3.12/site-packages/fsspec/spec.py\x94uh7h9)\x81\x94}\x94(h<j \x01\x00\x00h=\x8c\x08__hash__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x0
0hDK@hEK\x00ubhFM\x9f\x07ubhSM\x9f\x07hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x1f\x01\x00\x00h5j \x01\x00\x00uh7h9)\x81\x94}\x94(h<j \x01\x00\x00h=\x8c\x07details\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x85\x07ubhSM\x85\x07hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x0bfsspec.asyn\x94h5\x8c;/opt/coiled/env/lib/python3.12/site-packages/fsspec/asyn.py\x94uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x07wrapper\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@
hEK\x00ubhFKvubhSKvhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j3\x01\x00\x00h5j4\x01\x00\x00uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x04sync\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKgubhSKghThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j3\x01\x00\x00h5j4\x01\x00\x00uh7h9)\x81\x94}\x94(h<j4\x01\x00\x00h=\x8c\x07_runner\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK<ubhSK8hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\ts3fs.core\x94h5\x8c9/o
pt/coiled/env/lib/python3.12/site-packages/s3fs/core.py\x94uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=\x8c\x05_info\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x05ubhSM\xa5\x05hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=\x8c\x08_call_s3\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMs\x01ubhSMs\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94
(h<jQ\x01\x00\x00h=\x8c\x0e_error_wrapper\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x92ubhSK\x92hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3jP\x01\x00\x00h5jQ\x01\x00\x00uh7h9)\x81\x94}\x94(h<jQ\x01\x00\x00h=jf\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x92ubhSKrhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x13aiobotocore.context\x94h5\x8cC/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/context.py\x94uh7h9)\x81\x94}\x94(h<jv\x01
\x00\x00h=j7\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK$ubhSK$hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x12aiobotocore.client\x94h5\x8cB/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/client.py\x94uh7h9)\x81\x94}\x94(h<j\x80\x01\x00\x00h=\x8c\x0e_make_api_call\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x96\x01ubhSM\x96\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x7f\x01\x00\x00h5j\x80\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x
80\x01\x00\x00h=\x8c\r_make_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xb9\x01ubhSM\xb0\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x14aiobotocore.endpoint\x94h5\x8cD/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/endpoint.py\x94uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\r_send_request\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKxubhSKxhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x8
1\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x0c_needs_retry\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x18\x01ubhSM\x18\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x11aiobotocore.hooks\x94h5\x8cA/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/hooks.py\x94uh7h9)\x81\x94}\x94(h<j\xa8\x01\x00\x00h=\x8c\x05_emit\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKDubhSKDhThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x14aiobotocore._helpers\x94h5\x8cD/
opt/coiled/env/lib/python3.12/site-packages/aiobotocore/_helpers.py\x94uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=\x8c\x11resolve_awaitable\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x18aiobotocore.retryhandler\x94h5\x8cH/opt/coiled/env/lib/python3.12/site-packages/aiobotocore/retryhandler.py\x94uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=\x8c\x05_call\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFKkubhSKkhThH
)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xb2\x01\x00\x00h5j\xb3\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=j\xb6\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=j\xc1\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK~ubhSK~hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01
\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=\x8c\r_should_retry\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xa5ubhSK\xa5hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xb2\x01\x00\x00h5j\xb3\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xb3\x01\x00\x00h=j\xb6\x01\x00\x00h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\x06ubhSK\x06hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xbd\x01\x00\x00h5j\xbe\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xbe\x01\x00\x00h=j\xc1\x01\x00\x00h
?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xaeubhSK\xaehThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3\x8c\x15botocore.retryhandler\x94h5\x8cE/opt/coiled/env/lib/python3.12/site-packages/botocore/retryhandler.py\x94uh7h9)\x81\x94}\x94(h<j\xf2\x01\x00\x00h=\x8c\x08__call__\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xf7ubhSK\xf7hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\xf1\x01\x00\x00h5j\xf2\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\xf2\x01\x00\x00h=\x8c\x17_ch
eck_caught_exception\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\xa0\x01ubhSM\xa0\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x10_do_get_response\x94h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFK\xd0ubhSK\xc9hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3j\x93\x01\x00\x00h5j\x94\x01\x00\x00uh7h9)\x81\x94}\x94(h<j\x94\x01\x00\x00h=\x8c\x05_send\x94h?K\x00h@K\x00hA)hBK\x00hCK\x0
0hDK@hEK\x00ubhFM/\x01ubhSM/\x01hThH)\x81\x94}\x94(hKh,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\x94h5h\x95uh7h9)\x81\x94}\x94(h<h\x95h=h\x98h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFM\x16\x01ubhSM\x16\x01ubububububububububububububububububububububububububububububububububub\x87\x94R\x94h\xca\x89Nt\x94R\x94h)h,)\x81\x94}\x94(h/}\x94h1}\x94(h3h\xd0h5h\xd1uh7h9)\x81\x94}\x94(h<h\xd1h=h\xd4h?K\x00h@K\x00hA)hBK\x00hCK\x00hDK@hEK\x00ubhFMt\x13ubM\x13\x13N\x87\x94R\x94j\x1c\x02\x00\x00\x88N)t\x94R\
x94.'
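The payload above is a pickled exception printed as a Python bytes literal. A minimal sketch for pulling the readable strings out of such a payload without unpickling it (so neither tblib nor a matching environment is needed), using `ast.literal_eval` and `pickletools.genops` from the standard library. `readable_strings` is a hypothetical helper name, and it assumes the whole `b'...'` literal was copied intact from the log:

```python
import ast
import pickletools


def readable_strings(payload_repr: str, min_len: int = 4) -> list[str]:
    """Return the text strings stored in a pickle, without executing it."""
    # Turn the logged text "b'\\x80\\x05...'" back into a bytes object.
    payload = ast.literal_eval(payload_repr)
    found = []
    # genops walks the pickle opcodes one by one; nothing gets imported or called.
    for _opcode, arg, _pos in pickletools.genops(payload):
        if isinstance(arg, str) and len(arg) >= min_len:
            found.append(arg)
    return found


if __name__ == "__main__":
    import pickle

    # Stand-in payload; a real one would be pasted from the log.
    demo = repr(pickle.dumps({"args": ("Too many open files",), "errno": 24}))
    for text in readable_strings(demo):
        print(text)
```

Applied to the dump above, this should surface the embedded messages, module names and file paths in order. It still does not rebuild the real traceback.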
Another error:
2025-11-27 06:18:18,481 - ERROR - GenericZarrHandler.py:1020 - publish_cloud_optimised_fileset_batch - 39b496bb-31ad-45e1-9ede-6fca2888f8c7: An unexpected error occurred during batch 2 processing: b'\x80\x05\x95\x92\x0b\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x1dunpickle_exception_with_attrs\x94\x93\x94(\x8c\x1fdistributed.shuffle._exceptions\x94\x8c\x13P2PConsistencyError\x94\x93\x94}\x94(\x8c\x08__dict__\x94}\x94\x8c\x04args\x94\x8cBNo active shuffle with id=\'dbe46e5700b3cd9c0e51aa5b1ec8602d\' found\x94\x85\x94uh\x02(\x8c\x08builtins\x94\x8c\x08KeyError\x94\x93\x94}\x94(h\x07}\x94h\t\x8c dbe46e5700b3cd9c0e51aa5b1ec8602d\x94\x85\x94uNh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c%distributed.shuffle._scheduler_plugin\x94\x8c\x08__file__\x94\x8cU/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_scheduler_plugin.py\x94u\x8c\x06f_code\x94h\x15\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h!\x8c\x07co_name\x94\x8c\x03get\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94K\xafubK\xafh\x15\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(h\x1eh\x1fh 
h!uh"h$)\x81\x94}\x94(h\'h!h(\x8c\x04_get\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xbeub\x8c\ttb_lineno\x94K\xbeub\x87\x94R\x94N\x89N)t\x94R\x94h\x10bh\x14h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(\x8c\x08__name__\x94\x8c\x12distributed.worker\x94\x8c\x08__file__\x94\x8cB/opt/coiled/env/lib/python3.12/site-packages/distributed/worker.py\x94uh"h$)\x81\x94}\x94(h\'hJh(\x8c\x10_run_task_simple\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xb7\x0bubM\xaa\x0bh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x0fdask._task_spec\x94hI\x8c?/opt/coiled/env/lib/python3.12/site-packages/dask/_task_spec.py\x94uh"h$)\x81\x94}\x94(h\'hUh(\x8c\x08__call__\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xf7\x02ubh>M\xf7\x02\x8c\x07tb_next\x94h3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x19distributed.shuffle._core\x94hI\x8cI/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_core.py\x94uh"h$)\x81\x94}\x94(h\'hah(\x8c\x0bp2p_barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1MB\x02ubh>M>\x02hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c"distributed.shuffle._worker_plugin\x94hI\x8cR/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py\x94uh"h$)\x81\x94}\x94(h\'hlh(\x8c\x07barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\x87\x01ubh>M\x87\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x11distributed.utils\x94hI\x8cA/opt/coiled/env/lib/python3.12/site-packages/distributed/utils.py\x94uh"h$)\x81\x94}\x94(h\'hwh(\x8c\x04sync\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xc4\x01ubh>M\xc4\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhvhIhwuh"h$)\x81\x94}\x94(h\'hwh(\x8c\x01f\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xae\x01ubh>M\xaa\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x0btornado.gen\x94hI\x8c;/opt/coiled/env/lib/python3.12/site-packages/tornado/gen.py\x94uh"h$)\
x81\x94}\x94(h\'h\x8bh(\x8c\x03run\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M6\x03ubh>M\x0f\x03hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x08_barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1Mj\x01ubh>Mj\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x0fget_most_recent\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xb1ubh>K\xb1hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x0fget_with_run_id\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1Kwubh>KwhYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x08_refresh\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xdeubh>K\xdehYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x06_fetch\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xc8ubh>K\xc8hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(\x8c\x08__name__\x94\x8c%distributed.shuffle._scheduler_plugin\x94\x8c\x08__file__\x94\x8cU/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_scheduler_plugin.py\x94uh"h$)\x81\x94}\x94(h\'h\xc5h(\x8c\x03get\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xb2ubh>K\xb2ubububububububububububub\x87\x94R\x94hB\x88N)t\x94R\x94h\x08b.'. 
Traceback (most recent call last):
  File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1007, in publish_cloud_optimised_fileset_batch
    self._write_ds(ds, idx)
  File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1786, in _write_ds
    self._append_zarr_store(ds)
  File "/home/ubuntu/github_repo/aodn_cloud_optimised/aodn_cloud_optimised/lib/GenericZarrHandler.py", line 1840, in _append_zarr_store
    ds.to_zarr(
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/core/dataset.py", line 2292, in to_zarr
    return to_zarr( # type: ignore[call-overload,misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/api.py", line 2246, in to_zarr
    writes = writer.sync(
             ^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/common.py", line 357, in sync
    delayed_store = chunkmanager.store(
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/namedarray/daskmanager.py", line 247, in store
    return store(
           ^^^^^^
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/dask/array/core.py", line 1221, in store
    dask.compute(arrays, **kwargs)
  File "/home/ubuntu/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/dask/base.py", line 681, in compute
    results = schedule(expr, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_core.py", line 574, in p2p_barrier
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 391, in barrier
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 362, in _barrier
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 177, in get_most_recent
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 119, in get_with_run_id
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 222, in _refresh
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py", line 200, in _fetch
  File "/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_scheduler_plugin.py", line 178, in get
Exception: b'\x80\x05\x95\x92\x0b\x00\x00\x00\x00\x00\x00\x8c\x16tblib.pickling_support\x94\x8c\x1dunpickle_exception_with_attrs\x94\x93\x94(\x8c\x1fdistributed.shuffle._exceptions\x94\x8c\x13P2PConsistencyError\x94\x93\x94}\x94(\x8c\x08__dict__\x94}\x94\x8c\x04args\x94\x8cBNo active shuffle with id=\'dbe46e5700b3cd9c0e51aa5b1ec8602d\' found\x94\x85\x94uh\x02(\x8c\x08builtins\x94\x8c\x08KeyError\x94\x93\x94}\x94(h\x07}\x94h\t\x8c dbe46e5700b3cd9c0e51aa5b1ec8602d\x94\x85\x94uNh\x00\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c%distributed.shuffle._scheduler_plugin\x94\x8c\x08__file__\x94\x8cU/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_scheduler_plugin.py\x94u\x8c\x06f_code\x94h\x15\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h!\x8c\x07co_name\x94\x8c\x03get\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94K\xafubK\xafh\x15\x8c\tTraceback\x94\x93\x94)\x81\x94}\x94(\x8c\x08tb_frame\x94h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(h\x1eh\x1fh 
h!uh"h$)\x81\x94}\x94(h\'h!h(\x8c\x04_get\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xbeub\x8c\ttb_lineno\x94K\xbeub\x87\x94R\x94N\x89N)t\x94R\x94h\x10bh\x14h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(\x8c\x08__name__\x94\x8c\x12distributed.worker\x94\x8c\x08__file__\x94\x8cB/opt/coiled/env/lib/python3.12/site-packages/distributed/worker.py\x94uh"h$)\x81\x94}\x94(h\'hJh(\x8c\x10_run_task_simple\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xb7\x0bubM\xaa\x0bh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x0fdask._task_spec\x94hI\x8c?/opt/coiled/env/lib/python3.12/site-packages/dask/_task_spec.py\x94uh"h$)\x81\x9}\x94(h\'hUh(\x8c\x08__call__\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xf7\x02ubh>M\xf7\x02\x8c\x07tb_next\x94h3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x19distributed.shuffle._core\x94hI\x8cI/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_core.py\x94uh"h$)\x81\x94}\x94(h\'hah(\x8c\x0bp2p_barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1MB\x02ubh>M>\x02hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c"distributed.shuffle._worker_plugin\x94hI\x8cR/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_worker_plugin.py\x94uh"h$)\x81\x94}\x94(h\'hlh(\x8c\x07barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\x87\x01ubh>M\x87\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x11distributed.utils\x94hI\x8cA/opt/coiled/env/lib/python3.12/site-packages/distributed/utils.py\x94uh"h$)\x81\x94}\x94(h\'hwh(\x8c\x04sync\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xc4\x01ubh>M\xc4\x01hYh
3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhvhIhwuh"h$)\x81\x94}\x94(h\'hwh(\x8c\x01f\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M\xae\x01ubh>M\xaa\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hG\x8c\x
0btornado.gen\x94hI\x8c;/opt/coiled/env/lib/python3.12/site-packages/tornado/gen.py\x94uh"h$)\x81\x94}\x94(h\'h\x8bh(\x8c\x03run\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1M6\x03ubh>M\x0f\x03hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\
x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x08_barrier\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1Mj\x01ubh>Mj\x01hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x0fget_most_
recent\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xb1ubh>K\xb1hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x0fget_with_run_id\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1Kwubh
>KwhYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGhkhIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x08_refresh\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xdeubh>K\xdehYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(hGh
khIhluh"h$)\x81\x94}\x94(h\'hlh(\x8c\x06_fetch\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xc8ubh>K\xc8hYh3)\x81\x94}\x94(h6h\x17)\x81\x94}\x94(h\x1a}\x94h\x1c}\x94(\x8c\x08__name__\x94\x8c%distributed.shuffle._scheduler_plugin\x94\x8c\x0
8__file__\x94\x8cU/opt/coiled/env/lib/python3.12/site-packages/distributed/shuffle/_scheduler_plugin.py\x94uh"h$)\x81\x94}\x94(h\'h\xc5h(\x8c\x03get\x94h*K\x00h+K\x00h,)h-K\x00h.K\x00h/K@h0K\x00ubh1K\xb2ubh>K\xb2ubububububububububububub\x87\x94R
\x94hB\x88N)t\x94R\x94h\x08b.'
The only way I can digest this is to use an AI.
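For what it's worth, a dump like the one above can be made readable without unpickling it (unpickling would need the same packages installed and would execute code from the payload). A minimal sketch, assuming the full `b'...'` payload was captured intact, since a truncated dump will make `pickletools` raise; `readable_strings` is a hypothetical helper name, not part of Dask:

```python
import pickletools


def readable_strings(payload: bytes, min_len: int = 4) -> list[str]:
    """Collect the string constants stored in a pickle without unpickling it.

    pickletools.genops only walks the opcode stream; it never imports
    modules or calls constructors, so it is safe on untrusted payloads.
    """
    found = []
    for _opcode, arg, _pos in pickletools.genops(payload):
        # String-carrying opcodes (e.g. SHORT_BINUNICODE) expose their value as `arg`.
        if isinstance(arg, str) and len(arg) >= min_len:
            found.append(arg)
    return found
```

Calling `readable_strings(payload)` on the bytes object from the log should surface the exception class names, the error message, and the file paths in the order they were pickled.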
In the Coiled dashboard, I can see that most of the time none of the workers or the scheduler report any error, and memory/CPU usage looks healthy.
Anything else we need to know?:
Environment:
- Dask version: distributed 2025.10.0
- Python version:
- Operating System:
- Install method (conda, pip, source):
It looks like there are two problems here:
- You're getting an unpickling error on the scheduler
- The exception is being mangled
Usually unpickling errors happen when you have a different software environment on your client and scheduler/workers. Most likely a different Python version. Given that you're using Coiled I suggest you reach out to their support to help with this.
I'll leave this open though because the exception mangling isn't great. When you get your environment issues resolved could you comment back here to let us know what it was as that might give us a clue to what is happening.
@jacobtomlinson Thanks a lot for your help.
TL;DR: env diff was the problem!
As suggested, I fixed my software env. First I did a poetry update on my package. But even after that, when creating my cluster I would get this message:
+---------+--------+-----------+---------+
| Package | Client | Scheduler | Workers |
+---------+--------+-----------+---------+
| lz4     | 4.4.4  | 4.4.5     | 4.4.5   |
+---------+--------+-----------+---------+
Initially, I didn't care too much about it and, TBH, barely saw it. My code would start and output a lot of logs to my terminal. (My script would run for 10 min, sometimes 30 min, and then throw the logs I mentioned above. I spent maybe a week on this, trying various Dask configs, from p2p/tasks to other obscure options.)
As it was only a patch-level version difference in a package (lz4) I didn't even know about, I would just let my code proceed. And I assumed my poetry update should have fixed any package version mismatch anyway.
But my code failed again miserably.
I then decided to update lz4 of my client to 4.4.5. My code has been running for the last 4-5 hours without a single issue...
Now, there are two things I don't really get. First, my client is... a client! I don't quite understand why it has such an impact on the running code, but OK, I get it: some data needs to be serialised back from the scheduler to the client. My bigger problem is this: if matching environments across client, scheduler and workers is so important, why does a mismatch only raise a somewhat quiet warning? IMO, this should raise a
raise RuntimeError("Package version mismatch detected: client and worker versions do not match")
Is there an option to trigger this?
Having this enforced would be a massive quality-of-life improvement.
It's a bit of a thorny problem that's been discussed in Dask for many years. Often things work fine with slight mismatches, so we don't want to fail too aggressively, but mismatches in some core things like Python, dask, distributed, and anything related to serialization/compression like lz4 can cause problems like the one you experienced. I did start some work on this in #5582 but it never got over the line. This was also discussed again in #7017.
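In the meantime, a strict check can be done on the user side. A rough sketch, not an existing Dask option: it assumes the dictionary returned by `Client.get_versions()` has `"client"`, `"scheduler"` and `"workers"` entries that each carry a `"packages"` mapping, and the `critical` list and the function names are hypothetical choices made for illustration:

```python
def find_mismatches(versions, critical=("python", "dask", "distributed", "lz4")):
    """Return {package: {node: version}} for critical packages that differ from the client."""
    client_pkgs = versions["client"]["packages"]
    # Gather the package maps of the scheduler and of every worker.
    remote = {"scheduler": versions["scheduler"]["packages"]}
    for address, info in versions["workers"].items():
        remote[address] = info["packages"]

    mismatches = {}
    for name in critical:
        expected = client_pkgs.get(name)
        differing = {
            node: pkgs.get(name)
            for node, pkgs in remote.items()
            if pkgs.get(name) != expected
        }
        if differing:
            mismatches[name] = {"client": expected, **differing}
    return mismatches


def require_matching_versions(versions, critical=("python", "dask", "distributed", "lz4")):
    """Fail fast instead of warning when a critical package differs."""
    mismatches = find_mismatches(versions, critical)
    if mismatches:
        raise RuntimeError(f"Package version mismatch detected: {mismatches}")
```

It would be called once right after the cluster is up, for example `require_matching_versions(client.get_versions())`, so the job stops before any work is submitted.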
@jacobtomlinson, thanks for the background. I had a look at https://github.com/dask/distributed/pull/5582
I imagine Dask developers rarely run into issues like this because they already know the quirks (such as the one discussed here) and understand why failures occur. For a user/consumer like me, using Dask largely as a black box, debugging is difficult enough that without deeper knowledge it’s very easy to head down the wrong rabbit holes.
I'm in favour of your PR, and maybe even a new Dask config option in distributed.yaml to cancel the creation of the cluster if a critical mismatch is detected.