
Identify lack of scalability in gwas_linear_regression

Open eric-czech opened this issue 5 years ago • 60 comments

It appears that this function does not scale well when run on a cluster.

Notes from my most recent attempt:

  • The code I ran is here: https://github.com/related-sciences/ukb-gwas-pipeline-nealelab/blob/4f862e31b8093d25fdaa8da7f841b9be8583cda4/scripts/gwas.py#L268
  • This works on a dataset with fewer variants (chr XY, which has ~8k variants compared to ~141k for chr 21)
  • The operation utilizes only one or two workers in a cluster of 20 n1-highmem-8 instances:

[Screenshot: CPU utilization across worker VMs]

[Screenshot: status page]

  • Drilling into one of the workers that is running all the tasks, the only task it seems to be running that is not obviously parallelizable is "solve-triangular":

[Screenshot: task stream on that worker]

Full Task List


The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".

Several of the workers show log messages like this:

distributed.worker - INFO - Dependent not found: ('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler

Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.

I will run this again on a smaller dataset that didn't fail, to get a performance report and a task graph screenshot (the latter doesn't work on this data because the UI won't render so many nodes).
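For context, the heart of the linked script is a call to sgkit's gwas_linear_regression. A rough sketch of that kind of call is below; the store path and variable names are placeholder assumptions rather than the actual pipeline code (see the linked gwas.py for the real thing):

```python
import sgkit as sg
import xarray as xr

# Hypothetical store and variable names, for illustration only.
ds = xr.open_zarr("gs://bucket/ukb_chr21.zarr")

# One linear model per variant: regress each trait on dosage plus covariates.
result = sg.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    covariates=["sample_covariate_age", "sample_covariate_sex"],
    traits=["sample_trait_height"],
    add_intercept=True,
)

# The regression is lazy; the betas/p-values only materialize on compute(),
# which is where the scheduling problems above show up.
result.compute()
```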

eric-czech avatar Nov 17 '20 18:11 eric-czech

Notes from a more detailed performance report resulting from running this for a smaller dataset (that succeeds):

(renamed to .txt to avoid github attachment filter) gwas-height-performance-report.html.txt
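For anyone wanting to reproduce this kind of report, it can be captured with distributed's performance_report context manager. A minimal sketch follows; the scheduler address and the computation are placeholders standing in for the real GWAS run:

```python
import dask.array as da
from dask.distributed import Client, performance_report

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Everything computed inside the context is recorded into a standalone HTML
# report with the task stream, worker profiles, and transfer statistics.
with performance_report(filename="gwas-height-performance-report.html"):
    # Stand-in computation; in the real run this is the GWAS regression output.
    da.random.random((50_000, 50_000), chunks=(5_000, 5_000)).sum().compute()
```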

This zoomed-out view of the task stream in the report doesn't strike me as very healthy:

[Screenshot: zoomed-out task stream from the performance report]

Task graph:

[Screenshot: task graph]

@mrocklin (cc: @ravwojdyla) do you have any suggestions on how to identify why the work isn't being distributed well on larger datasets for this workflow?

eric-czech avatar Nov 17 '20 19:11 eric-czech

Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/.

I've only looked very briefly at it, but the thing that stands out the most is the 760s transfer times leading up to sum tasks. At expected bandwidths, these would be 150GB payloads, which I'm assuming is higher than you're expecting. It's also odd how synchronized these transfers are; they all end within a few seconds of each other.

mrocklin avatar Nov 17 '20 19:11 mrocklin

ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs

I'm curious, do worker logs report anything strange?

Also I'm curious, which version of distributed are you running? (adding this to the performance report here: https://github.com/dask/distributed/pull/4249)

cc'ing @quasiben, who cares a bit about this space and has a lot of experience tracking down similar performance problems.
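(For reference, a quick way to answer the version question across the whole cluster is Client.get_versions; a minimal sketch, with the scheduler address as a placeholder:)

```python
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

# Reports dask/distributed/python versions as seen by the client, the
# scheduler, and every worker; check=True raises if they are mismatched.
versions = client.get_versions(check=True)
print(versions)
```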

mrocklin avatar Nov 17 '20 20:11 mrocklin

Also it looks like some of the workers are under memory pressure. I wonder if maybe they're struggling to accept a large-ish payload.


mrocklin avatar Nov 17 '20 20:11 mrocklin

When I looked I saw some fairly lengthy disk reads/writes, which I assume to be Dask spilling. That might correspond to the workers being under memory pressure.

[Screenshot: task stream showing long disk-read/write tasks]

quasiben avatar Nov 17 '20 20:11 quasiben

Two minutes is a long time for a disk read/write. My initial guess is that there is some other external factor causing contention. That's just a wild guess though.

Looking at the worker/thread ratio, I wonder if it would make more sense to have far more workers with fewer threads each. Perhaps try four threads per worker?


mrocklin avatar Nov 17 '20 20:11 mrocklin

Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ .

👍

I'm curious, do worker logs report anything strange?

I see messages like "Worker stream died during communication" so I'm sure a couple of workers in the cluster died. At the end I was down to 18 instead of 20. Full log from one worker is here, just in case.

When I looked I saw some fairly lengthy disk reads/writes, which I assume to be Dask spilling. That might correspond to the workers being under memory pressure.

Thanks Ben, what do you make of the "transfer-sub" tasks (the long red bars)? Do you have any intuition for what's happening in those?

Memory pressure does seem to be part of the problem -- I rechunked my input to 1/16th of the original chunk size and the whole job has progressed further. Parallel utilization is still pretty disappointing across the whole cluster:

[Screenshot: CPU utilization across workers after rechunking input to 1/16th of original chunk size]
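The rechunking itself was just Dask's standard rechunk. A sketch of the idea is below, with an arbitrary array shape and chunking standing in for the real genotype data (quartering both axes gives chunks with 1/16th as many elements):

```python
import dask.array as da

# Arbitrary placeholder shape/chunks, not the real genotype array.
x = da.zeros((140_000, 360_000), chunks=(8_000, 8_000), dtype="f8")

# Quarter each axis of the chunking -> 1/16th the elements per chunk.
x_small = x.rechunk((8_000 // 4, 8_000 // 4))

print(x.chunksize, "->", x_small.chunksize)  # (8000, 8000) -> (2000, 2000)
```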

eric-czech avatar Nov 17 '20 20:11 eric-czech

Looking at the worker/thread ratio, I wonder if it would make more sense to have far more workers with fewer threads each. Perhaps try four threads per worker?

Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

eric-czech avatar Nov 17 '20 21:11 eric-czech

Maybe you can choose smaller VMs?


mrocklin avatar Nov 17 '20 21:11 mrocklin

Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

That's a good question. I don't think this is supported, but it shouldn't be too hard. You can control the number of threads with worker_options={"nthreads": 2}. As @mrocklin suggests, this is often tuned with smaller VMs. Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM?
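For what it's worth, here is roughly what that looks like with dask-cloudprovider. The project, machine type, and worker count are placeholders, and the import path is the one used by recent dask-cloudprovider releases, so treat this as a sketch rather than a drop-in snippet:

```python
from dask.distributed import Client
from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    projectid="my-gcp-project",      # placeholder
    machine_type="n1-highmem-4",     # smaller VMs, per the suggestion above
    n_workers=40,
    worker_options={"nthreads": 4},  # forwarded to the worker on each VM
)
client = Client(cluster)
```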

quasiben avatar Nov 17 '20 21:11 quasiben

Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM ?

You bet! https://github.com/dask/dask-cloudprovider/issues/173

eric-czech avatar Nov 17 '20 22:11 eric-czech

The larger dataset with smaller chunks did ultimately finish with no errors. Here are a couple readouts:

[Screenshot: readout from the completed run]

Performance report (26M): https://drive.google.com/file/d/1feWLKNrjQkslKDIZ7T39fPCNQQDBrQFs/view?usp=sharing

It doesn't seem like any of network, disk, or CPU are even close to being saturated, so I assume there is some room for improvement.

As a very rough estimate, this task takes about 5 hours on a single 64 vCPU VM and 3 hours on a cluster of 20 8-vCPU VMs (160 vCPUs). Perfect scaling would imply roughly 2 hours (5 h × 64/160), so that should be an approximate ceiling for improvements.

Log: dask_gwas_chr21_log.txt

I will try again with smaller VMs and see if there are any major differences.

eric-czech avatar Nov 18 '20 12:11 eric-czech

FYI @ravwojdyla and I have been talking a bit about some similar observations on a much simpler workflow in https://github.com/related-sciences/data-team/issues/38 (private). One conclusion there was that the individual objects on GCS are so small that API requests aren't efficient, yet bigger chunk sizes would start to make the workloads fail. The GCS objects for a chunk are roughly 2MiB on disk but >100MB in memory. Another potential explanation for this behavior is that the GCS objects aren't being loaded asynchronously. We're still investigating both.
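A quick way to see that compressed vs. in-memory ratio for a given store is to compare Zarr's stored and logical sizes per chunk. A sketch, with placeholder bucket and array names:

```python
import fsspec
import zarr

# Placeholder bucket/array names; any GCS-backed Zarr store works the same way.
group = zarr.open_group(fsspec.get_mapper("gs://bucket/ukb_chr21.zarr"), mode="r")
arr = group["call_genotype"]

mib = 2 ** 20
print("compressed per chunk: %.1f MiB" % (arr.nbytes_stored / arr.nchunks / mib))
print("in-memory per chunk:  %.1f MiB" % (arr.nbytes / arr.nchunks / mib))
```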

eric-czech avatar Nov 18 '20 12:11 eric-czech

I don't know if you're aware of the work done by @martindurant on nicer async IO for remote storage and zarr, but he might be good to talk to.

It looks like you're still primarily blocked by a few oddly long transfers. They're less prominent now, which is good, but probably worth investigating further.

mrocklin avatar Nov 18 '20 15:11 mrocklin

(sorry, wrong thread)

martindurant avatar Nov 18 '20 16:11 martindurant

@martindurant I suspect that this is not the issue for which you intended this comment :)


mrocklin avatar Nov 18 '20 16:11 mrocklin

Indeed - but I did want to comment here too.

Yes, gcsfs allows fetching many objects from the store with a single call, and they will be processed concurrently, returning the set of bytes objects. In the case that these are to be expanded in memory, you would of course have to deal with the blocks of bytes one at a time, so as not to exceed RAM. For zarr, this does mean transiently higher memory usage during decompression, depending on the number of storage chunks per dask partition; the best tradeoff would be workload-dependent.
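Concretely, that batched fetch goes through fsspec's cat interface. A minimal sketch (the bucket and chunk keys are placeholders):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# Passing a list of paths fetches them concurrently on gcsfs's event loop
# and returns a dict mapping path -> raw (still compressed) bytes.
paths = [f"gs://bucket/ukb_chr21.zarr/call_genotype/{i}.0.0" for i in range(16)]
blobs = fs.cat(paths)

# Decompressing each block back to its full in-memory size is then done one
# block at a time, which is the transient memory cost described above.
print(sum(len(b) for b in blobs.values()), "compressed bytes fetched")
```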

martindurant avatar Nov 18 '20 16:11 martindurant

In trying this on a cluster of 40 4-vCPU machines (instead of 20 8-vCPU machines), the workflow failed with errors similar to my first attempt on this issue. It doesn't look like memory pressure was an issue this time, but I'm not certain. I didn't see log messages about it anyhow, and this was with chunks 1/16th the size of the original, or ~6MB in memory. The client-side error I hit was again:

ValueError: Could not find dependent ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 130, 105). Check worker logs

Here are some worker logs:

Worker 1

distributed.worker - INFO - Start worker at: tcp://10.142.15.198:37407

distributed.worker - INFO - Listening to: tcp://10.142.15.198:37407

distributed.worker - INFO - dashboard at: 10.142.15.198:38769

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-jst4aey2

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.198:37407 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 0, 13)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 0, 13) 0 . Asking scheduler

Worker 2

distributed.worker - INFO - Start worker at: tcp://10.142.15.201:43081

distributed.worker - INFO - Listening to: tcp://10.142.15.201:43081

distributed.worker - INFO - dashboard at: 10.142.15.201:43701

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-9plktoms

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 33, 40)

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 988, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1169, in write_to_fd return self.socket.send(data) # type: ignore ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1281, in get_data compressed = await comm.write(msg, serializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 256, in write convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 32, 40)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 34, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 32, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 33, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 35, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 34, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 35, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 137, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 136, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 136, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 172, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 138, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 173, 22)

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 173, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 175, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 148, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 149, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 225, 32)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 174, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 225, 32) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 174, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 150, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 244, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 151, 94)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 151, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 244, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Stopping worker at tcp://10.142.15.201:43081

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 
'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect
    _raise(error)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep
    who_has = await retry_operation(self.scheduler.who_has, keys=list(deps))
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect
    comm = await connect(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect
    _raise(error)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

(the same pair of connect-timeout tracebacks repeats twice more in this log entry)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

Logs for most of the workers looked like "Worker 1" above, and I didn't notice anything particularly noteworthy in perusing a bunch of them. What is odd about this run is that I still had 40 nodes in the cluster at the end; it appears that one of them became temporarily unavailable or was otherwise unreachable long enough to crash the job. @mrocklin what should Dask do in a scenario where one worker is unreachable? Does it try to reschedule the work elsewhere, or fail the whole job?

eric-czech avatar Nov 18 '20 17:11 eric-czech

Yes, gcsfs allows fetching many objects from the store with a single call, and they will be processed concurrently, returning the set of bytes objects

Thanks @martindurant. Does dask need to do anything in particular to use that (presumably what was in https://github.com/zarr-developers/zarr-python/issues/536)? @ravwojdyla mentioned that you need at least zarr 2.5.0, but we weren't sure whether anything more needs to be done in dask or xarray to integrate it.

eric-czech avatar Nov 18 '20 17:11 eric-czech

Zarr 2.5 is enough - but you need more than one zarr block per dask task, else you see no benefit.
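
For concreteness, here is a minimal sketch of mapping several zarr blocks onto one dask chunk so gcsfs can fetch them concurrently; the bucket path, array name, and chunk sizes below are hypothetical, not taken from this pipeline:

```python
import dask.array as da

# Suppose the on-disk zarr chunks are (652, 724). Asking dask for chunks that
# are an integer multiple of that means each dask task reads a grid of zarr
# blocks, which gcsfs can then fetch concurrently in a single call.
x = da.from_zarr(
    "gs://my-bucket/genotypes.zarr",   # hypothetical store
    component="call_dosage",           # hypothetical array name
    chunks=(652 * 4, 724 * 8),
)
```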

martindurant avatar Nov 18 '20 18:11 martindurant

I see, thanks @martindurant. Well, I'm down to <10 MB chunks being necessary to make this workflow run without OOM errors, so this seems like an important point of contention we're likely to run into again. Even if I wrote the zarr chunks small enough that multiple of them would fit in one dask chunk, I can't imagine that parallel reading of <100 kB chunks (~1 MB in memory) would provide much of a benefit.

I'll see if the workload will tolerate large but uneven chunks (i.e. tall-skinny or short-fat). There are a lot of multiplications in it, and given https://github.com/pystatgen/sgkit/issues/375, it stands to reason that we'll have to rethink chunking in every workflow as a function of the number of columns involved.

eric-czech avatar Nov 18 '20 18:11 eric-czech

Eric, if you're able to reproduce this same problem but with a bit less of the machinery here that would also make it easier for some of the other performance experts to try things out on their own and weigh in. I'm not sure if that's easy for you, but if you can recreate this with a random dataset and some tensordot calls that would be ideal.

mrocklin avatar Nov 18 '20 18:11 mrocklin

We could even turn it into a performance case study and try to get a few folks interested in the problem.

mrocklin avatar Nov 18 '20 18:11 mrocklin

Eric, if you're able to reproduce this same problem but with a bit less of the machinery here that would also make it easier for some of the other performance experts to try things out on their own and weigh in

Hey @mrocklin, here is a notebook that isolates the dask code being used here: https://gist.github.com/eric-czech/daae30d54a5c96fd09f13ffa58a3bafe.

I'm fairly certain the problem is https://stackoverflow.com/questions/64774771/does-blockwise-allow-iteration-over-out-of-core-arrays, or rather the lack of scalability of matrix multiplication in dask. I was able to get this workflow to run on the 40-node cluster by reducing the chunk size in the variants dimension to something far smaller, since variant_chunk_size x n_samples arrays are being loaded into memory by blockwise. At my original chunking (5216 variants, 5792 samples), these arrays would have been about 5216 variants * 365941 samples * 4 bytes ≈ 7.6 GB, which is bigger than the ~6.5 GB of RAM available per vCPU on n1-highmem-* instances.
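
As a quick sanity check on that arithmetic (all numbers taken from the figures above, assuming float32 dosages):

```python
# Back-of-the-envelope footprint of one in-memory block in the blockwise step.
n_variants_chunk = 5216
n_samples = 365_941
bytes_per_value = 4  # float32

chunk_gb = n_variants_chunk * n_samples * bytes_per_value / 1e9
print(f"{chunk_gb:.1f} GB per block")  # ~7.6 GB, vs ~6.5 GB of RAM per vCPU
```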

eric-czech avatar Nov 19 '20 15:11 eric-czech

Also, I don't think using a different worker/core ratio changed much. The job was a good bit slower on 40 nodes instead of 20 (3h 20m vs 2h 50m), but that may be more attributable to the different chunking needed. Either way, a few GCP monitoring readouts looked like this for the cluster (about the same as before):

Screen Shot 2020-11-18 at 5 55 54 PM

eric-czech avatar Nov 19 '20 15:11 eric-czech

Eric, should I wait on publishing the problem as a dask issue until after you've had a chance to digest dask array reductions? Or was the performance here a different issue?

mrocklin avatar Nov 19 '20 18:11 mrocklin

I think it's ok to publish it now. This workflow only uses lstsq and matrix multiplication, not any compiled functions or blockwise like the pairwise functions we were talking about in the context of array reductions on the call.
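
To make that concrete, here is a toy sketch of the shapes of those two operations in dask; the data is random and the sizes are made up, so this is not sgkit's actual implementation or a statistically complete GWAS, just the pattern of lstsq plus matmul:

```python
import dask.array as da

n_samples, n_covariates, n_variants = 10_000, 5, 2_000

G = da.random.random((n_variants, n_samples), chunks=(200, n_samples))  # dosages
X = da.random.random((n_samples, n_covariates))                         # covariates
y = da.random.random(n_samples)                                         # trait

beta_cov, _, _, _ = da.linalg.lstsq(X, y)   # least-squares fit of the covariates
resid = y - X @ beta_cov                    # residual trait after removing covariates
scores = G @ resid                          # one matrix multiplication per variant block
scores.compute()
```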

eric-czech avatar Nov 19 '20 18:11 eric-czech

Wow, rechunking to very short-fat chunks and re-running on a cluster of 40 n1-highmem-8 instances resulted in the whole workflow finishing in ~10 minutes (as opposed to 2 hours on a 20-node cluster), and I saw utilization like this:

Screen Shot 2020-11-19 at 3 25 00 PM

Performance report: https://drive.google.com/file/d/1vLZEwY0xea6Jc3VT_mS9HiZrv6NXeF2X/view?usp=sharing

This seems to summarize the differences fairly well:

Screen Shot 2020-11-19 at 4 43 13 PM

Based only on the performance reports, though, I'm not sure how we could have known there was so much room for improvement. Nothing else jumps out at me as being predictive of that yet.

eric-czech avatar Nov 19 '20 21:11 eric-czech

As a negative control, and because I was a little incredulous about this latest improvement, I reran this once again with small square chunks and saw memory usage swell, followed by a severe slowdown in task processing rates. There really is something magical about the short-fat chunking. To summarize what I've tried (w.r.t. chunking) and the results so far:

| input chunks | result |
| --- | --- |
| (5216, 5792) - original | OOM |
| (1304, 1448) - "1/16th" as big | works but takes hours |
| (5216, 724) - tall-skinny | OOM |
| (652, 5792) - short-fat | works, takes minutes |
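
For reference, a minimal sketch of applying that short-fat chunking before running the regression; the store path and the covariate/trait variable names are hypothetical placeholders:

```python
import xarray as xr
import sgkit as sg

ds = xr.open_zarr("gs://my-bucket/genotypes.zarr")   # hypothetical path

# Rechunk the dosage array to the short-fat shape that worked above:
# few variants per chunk, a wide slab of samples.
ds["call_dosage"] = ds["call_dosage"].chunk({"variants": 652, "samples": 5792})

res = sg.gwas_linear_regression(
    ds,
    dosage="call_dosage",
    covariates="sample_covariate",   # hypothetical covariate variable
    traits="sample_trait",           # hypothetical trait variable
)
```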

eric-czech avatar Nov 19 '20 22:11 eric-czech

I don't know the exact shape of your problem, but generally speaking the ideal chunk shape for tensordot style operations is non-intuitive. I think I linked to the automatic rechunk for tensordot issue at one point, which has a video of a lecture by Tammy Kolda which is pretty informative. For tensordot style operations you want to chunk so that you minimize communication during the contraction phase. Often this is counter to most folks' intuition built by thinking about matmul algorithms.
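
For anyone trying to build intuition here, a toy illustration (random data, made-up sizes) of why the contracted dimension matters: if each block spans the full contracted (samples) axis, the tensordot needs no inter-block partial sums.

```python
import dask.array as da

n_variants, n_samples, n_covariates = 20_000, 50_000, 25

# "Short-fat" blocks: each block covers every sample.
G = da.random.random((n_variants, n_samples), chunks=(500, n_samples))
C = da.random.random((n_samples, n_covariates), chunks=(n_samples, n_covariates))

# Each output block depends on exactly one block of G and one block of C,
# so the contraction over samples stays local to a block and no partial
# products need to be summed across workers.
XtC = da.tensordot(G, C, axes=1)
```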

mrocklin avatar Nov 20 '20 00:11 mrocklin