stackstac
occasional lockups during dask reads
It seems that stackstac will occasionally hang indefinitely while doing a dataset read:
call stack:
File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 890, in _bootstrap self._bootstrap_inner()
File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run()
File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 55, in _worker task.run()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/_concurrent_futures_thread.py", line 66, in run result = self.fn(*self.args, **self.kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3616, in apply_function result = function(*args, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/optimization.py", line 963, in __call__ return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 151, in get result = _execute_task(task, cache)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task return func(*(_execute_task(a, cache) for a in args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/to_dask.py", line 172, in fetch_raster_window data = reader.read(current_window)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 425, in read result = reader.read(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 249, in read return self.dataset.read(1, window=window, **kwargs)
Is it possible to pass in a timeout parameter or something like that, or would I be better off just cancelling the job entirely when something like this happens?
Yeah, a timeout on the read would be reasonable. Then we could have that timeout trigger a retry via whatever logic we implement for #18.
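Roughly, the idea would be to run the blocking read in a separate thread and stop waiting for it after some deadline. Just a sketch of the shape it could take, not the actual implementation (read_with_timeout and its parameters are hypothetical):

import concurrent.futures

def read_with_timeout(dataset, window, timeout=60, retries=3):
    # Hypothetical sketch, not stackstac code: run the blocking GDAL read in a
    # worker thread and stop waiting for it after `timeout` seconds.
    for attempt in range(retries):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(dataset.read, 1, window=window)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            # A hung read can't really be interrupted; we just abandon that
            # thread and try again with a fresh one.
            pass
        finally:
            pool.shutdown(wait=False)
    raise RuntimeError(f"read timed out after {retries} attempts")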
@mukhery I'm curious if you have a reproducer for this, or have noticed cases/datasets/patterns that tend to cause it more often?
For now, you might try playing with setting GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY via LayeredEnv. See https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access and https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_TIMEOUT.
Maybe something like:
retry_env = stackstac.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_HTTP_TIMEOUT=45,
    GDAL_HTTP_MAX_RETRY=5,
    GDAL_HTTP_RETRY_DELAY=0.5,
))

stackstac.stack(..., gdal_env=retry_env)
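For reference on what those options do, per the GDAL docs linked above: GDAL_HTTP_TIMEOUT caps how long any single HTTP request may take (in seconds), GDAL_HTTP_MAX_RETRY is the number of times GDAL retries a request that fails with a 429/502/503/504 error, and GDAL_HTTP_RETRY_DELAY is the delay in seconds between those retries.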
I tried to come up with something to reproduce it but haven't been able to. We've also been seeing several other network-related/comms issues, so it's possible that our specific workload and how we've implemented the processing are causing some of these issues. I ended up just adding timeouts to the task futures and then cancelling and/or restarting the cluster when needed, which covers our current need. Feel free to close this issue if you'd like; I can reopen it later if I'm able to reproduce it reliably.
I'll keep it open, since I think it's a reasonable thing to implement.
I ended up just adding timeouts to the task futures
Curious how you implemented this?
Sounds good, thanks!
I did something like this:
import dask.distributed

try:
    fut = cluster.client.compute(<task_involving_stackstac_data>)
    dask.distributed.wait(fut, timeout=600)
except dask.distributed.TimeoutError as curr_exception:
    error_text = f'{curr_exception}'[:100]  # sometimes the error messages are crazy long
    print(f'task failed with exception: {error_text}')
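And for the cancelling/restarting part mentioned earlier, a rough sketch of how that except branch can be extended (same cluster and placeholder task as above; Future.cancel() and Client.restart() are standard distributed calls, though whether a full restart is warranted depends on the workload):

try:
    fut = cluster.client.compute(<task_involving_stackstac_data>)
    dask.distributed.wait(fut, timeout=600)
except dask.distributed.TimeoutError:
    fut.cancel()              # drop the stuck task
    cluster.client.restart()  # heavier option: restart the workers if cancelling isn't enough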
Nice! That makes sense.