
occasional lockups during dask reads

Open mukhery opened this issue 3 years ago • 5 comments

It seems that stackstac will occasionally hang indefinitely while doing a dataset read.

call stack:

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run()

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/_concurrent_futures_thread.py", line 66, in run result = self.fn(*self.args, **self.kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3616, in apply_function result = function(*args, **kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/optimization.py", line 963, in __call__ return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 151, in get result = _execute_task(task, cache)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task return func(*(_execute_task(a, cache) for a in args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/to_dask.py", line 172, in fetch_raster_window data = reader.read(current_window)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 425, in read result = reader.read(

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 249, in read return self.dataset.read(1, window=window, **kwargs)

Is it possible to pass in a timeout parameter (or something similar), or would I be better off just cancelling the job entirely when this happens?

mukhery commented on Jun 22, 2021

Yeah, a timeout on the read would be reasonable. Then we could have that timeout trigger a retry via whatever logic we implement for #18.
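
A rough sketch of what that could look like around the reader (just a sketch, not actual stackstac code; the `read_with_timeout` helper name and the timeout/attempt numbers here are made up, and `reader.read(window)` mirrors the call you can see in the traceback):

from concurrent.futures import ThreadPoolExecutor, TimeoutError

def read_with_timeout(reader, window, timeout=30, attempts=3):
    # Run the potentially-hanging GDAL read in its own thread so the caller
    # can give up after `timeout` seconds and try again. The abandoned read
    # can't actually be interrupted from Python; its thread lives on until
    # the read eventually returns or errors.
    last_exc = None
    for _ in range(attempts):
        executor = ThreadPoolExecutor(max_workers=1)
        try:
            future = executor.submit(reader.read, window)
            return future.result(timeout=timeout)
        except TimeoutError as exc:
            last_exc = exc
        finally:
            executor.shutdown(wait=False)  # don't block on a stuck thread
    raise last_exc

The obvious limitation is that the underlying GDAL read can't be cancelled, so a timed-out read keeps its thread and HTTP connection alive in the background; the timeout only unblocks the dask task so retry or failure logic can run.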

@mukhery I'm curious if you have a reproducer for this, or have noticed cases/datasets/patterns that tend to cause it more often?

For now, you might try experimenting with GDAL_HTTP_TIMEOUT, GDAL_HTTP_MAX_RETRY, and GDAL_HTTP_RETRY_DELAY via a LayeredEnv. See https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access and https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_TIMEOUT.

Maybe something like:

retry_env = stackstac.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_HTTP_TIMEOUT=45,       # seconds before an HTTP request is considered failed
    GDAL_HTTP_MAX_RETRY=5,      # retry transient HTTP errors up to 5 times
    GDAL_HTTP_RETRY_DELAY=0.5,  # seconds to wait between retries
))
stackstac.stack(..., gdal_env=retry_env)

gjoseph92 commented on Jun 25, 2021

I tried to come up with a reproducer but haven't been able to. We've also been seeing several other network/comms issues, so it's possible that our specific workload and the way we've implemented the processing are causing some of them. For now, I ended up just adding timeouts to the task futures and cancelling and/or restarting the cluster when needed. Feel free to close this issue if you'd like, and I can reopen it later if I'm able to reproduce reliably.

mukhery commented on Jun 29, 2021

I'll keep it open, since I think it's a reasonable thing to implement.

"I ended up just adding timeouts to the task futures"

Curious how you implemented this?

gjoseph92 commented on Jun 29, 2021

Sounds good, thanks!

I did something like this:

import dask.distributed

try:
    fut = cluster.client.compute(<task_involving_stackstac_data>)
    dask.distributed.wait(fut, timeout=600)  # give up after 10 minutes
except dask.distributed.TimeoutError as curr_exception:
    error_text = f'{curr_exception}'[:100]  # sometimes the error messages are crazy long
    print(f'task failed with exception: {error_text}')
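
For the cancel/restart part, it's roughly something like this afterwards (same `fut` and `cluster` as above; not the exact code):

fut.cancel()                # drop the timed-out computation
# cluster.client.restart()  # or, if the workers look wedged, restart them all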

mukhery commented on Jun 29, 2021

Nice! That makes sense.

gjoseph92 commented on Jul 1, 2021