Many S3 clients created when using `RetryingFunctionExecutor`
I was successfully using the FunctionExecutor on AWS Lambda and S3, but seeing individual tasks fail at scale. So I tried switching my code to use the RetryingFunctionExecutor that @tomwhite added in #1291, to automatically retry each task (I set it to a maximum of 2 retry attempts).
However, as a result I now get this weird behaviour:
It completes one batch of tasks (I'm running .map in a loop), but then, as it tries to get the results, it somehow creates hundreds of new S3 clients. These just keep being added, and it never seems to make it to the next batch of tasks.
It didn't do this before I switched to the RetryingFunctionExecutor, so the problem must be with that (or possibly with how I'm calling it).
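For context, the calling pattern is roughly the following (a simplified sketch, not my exact code; `process` and `batches` are placeholders for my real function and inputs):

```python
import lithops
from lithops.retries import RetryingFunctionExecutor

def process(item):
    # placeholder for my real per-task function
    return item

batches = [["a", "b"], ["c", "d"]]  # placeholder for my real batched inputs

lithops_client = lithops.FunctionExecutor()
with RetryingFunctionExecutor(lithops_client) as fexec:
    all_results = []
    for batch in batches:
        futures = fexec.map(process, batch, retries=2)
        finished, pending = fexec.wait(futures)
        all_results.extend(f.result() for f in finished)
```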
I've established that my code is creating one extra S3 client per task...
I feel like if I could follow the executor.get_result(futures) pattern then this would work... But RetryingFunctionExecutor doesn't have that method yet (see https://github.com/lithops-cloud/lithops/pull/1291#issue-2212994457). So I looked at the implementation of FunctionExecutor.get_result(), which seems to basically just call executor.wait(download_results=True). However, when I explicitly call executor.wait(download_results=True) I immediately get
OSError: [Errno 28] No space left on device
which makes no sense to me. My results are each only a few kB! I was previously quite happily downloading 700 of them at a time to my laptop! The docstring of download_results isn't super helpful on why this might be the case.
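For what it's worth, the workaround I had in mind is a tiny helper that mimics `FunctionExecutor.get_result()` using only the methods `RetryingFunctionExecutor` already has (a sketch of my own, not existing Lithops API):

```python
# Hypothetical helper (not part of Lithops): gather results from a
# RetryingFunctionExecutor using only its existing wait() method.
def get_results(fexec, futures):
    finished, pending = fexec.wait(futures)
    if pending:
        raise RuntimeError(f"{len(pending)} futures did not complete")
    return [f.result() for f in finished]
```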
It seems the client connection gets created when the S3Backend class is instantiated (S3Backend being one specific StorageBackend implementation). So I'm somehow triggering lots of StorageBackends to be created, but I can't follow the structure of the code well enough to see why that might be happening.
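One crude way to confirm that (my own diagnostic hack, nothing Lithops-specific, and it assumes the backend creates its clients via `boto3.client` in the driver process rather than via a `boto3.Session`) is to wrap `boto3.client` and count S3 client creations:

```python
import boto3

# Diagnostic hack: count how many S3 clients get created in this process
# by wrapping boto3.client before running the map/wait.
_original_client = boto3.client
s3_client_count = 0

def _counting_client(*args, **kwargs):
    global s3_client_count
    service = args[0] if args else kwargs.get('service_name')
    if service == 's3':
        s3_client_count += 1
    return _original_client(*args, **kwargs)

boto3.client = _counting_client
# ...run the map / wait here, then print(s3_client_count)...
```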
I'm not sure what's causing this, but in Cubed I've been using RetryingFunctionExecutor with ~1000 AWS Lambda instances without any problems... I wonder if it's something to do with the way it's being called? The relevant function is here: https://github.com/cubed-dev/cubed/blob/main/cubed/runtime/executors/lithops.py#L43-L164
Yeah, I thought the same thing. I looked at your implementation and tried experimenting locally to copy what you did in Cubed, @TomWhite, but all your straggler logic made it harder for me to extract the essence of what was needed...
EDIT: It might be interesting to delete that code from Cubed to simplify it, and see whether the same problem emerges.
I wondered if it had something to do with using / not using the executor as a context manager... I would have naively expected that the S3 client would connect when the executor was instantiated, but it's not clear to me if that's true.
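i.e. the two styles I'm comparing are just these (whether the storage client gets created at construction time or only once the executor is entered/used is exactly what I don't know):

```python
import lithops
from lithops.retries import RetryingFunctionExecutor

lithops_client = lithops.FunctionExecutor()

# used directly, no context manager...
fexec = RetryingFunctionExecutor(lithops_client)

# ...vs. used as a context manager
with RetryingFunctionExecutor(lithops_client) as fexec:
    ...
```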
Here's what I think should be a reproducer:
Script
import lithops
from lithops.retries import RetryingFunctionExecutor


def hello(name: str) -> str:
    return 'Hello {}!'.format(name)


def map_over_names(names: list[str], *, use_retries: bool) -> list[str]:
    if use_retries:
        # RetryingFunctionExecutor wraps a normal FunctionExecutor
        lithops_client = lithops.FunctionExecutor()
        with RetryingFunctionExecutor(lithops_client) as fexec:
            # TODO retries should be exposed as a configuration arg in lithops, see https://github.com/lithops-cloud/lithops/issues/1412
            futures = fexec.map(hello, names, retries=2)
            finished_futures, pending = fexec.wait(futures)
            results = [future.result() for future in finished_futures]
    else:
        fexec = lithops.FunctionExecutor()
        futures = fexec.map(hello, names)
        results = fexec.get_result(futures)
    return results


if __name__ == '__main__':
    names = ["Alice", "Bob", "Eve"]
    USE_RETRIES = False
    results = map_over_names(names=names, use_retries=USE_RETRIES)
    print(results)
When I run this as a script with no lithops config file (so all defaults), it runs locally using the lithops localhost executor. The output looks like this:
Output
2025-06-27 13:27:12,543 [INFO] config.py:139 -- Lithops v3.6.1.dev0 - Python3.12
2025-06-27 13:27:12,544 [INFO] localhost.py:39 -- Localhost storage client created
2025-06-27 13:27:12,544 [INFO] localhost.py:78 -- Localhost compute v2 client created
2025-06-27 13:27:12,642 [INFO] invokers.py:119 -- ExecutorID 6c536d-0 | JobID M000 - Selected Runtime: python
2025-06-27 13:27:12,644 [INFO] invokers.py:186 -- ExecutorID 6c536d-0 | JobID M000 - Starting function invocation: hello() - Total: 3 activations
2025-06-27 13:27:12,647 [INFO] invokers.py:225 -- ExecutorID 6c536d-0 | JobID M000 - View execution logs at /private/var/folders/6x/yyxxlhtd3db164lh5zgfhgp80000gn/T/lithops-tom/logs/6c536d-0-M000.log
2025-06-27 13:27:12,648 [INFO] executors.py:494 -- ExecutorID 6c536d-0 - Getting results from 3 function activations
2025-06-27 13:27:12,648 [INFO] wait.py:101 -- ExecutorID 6c536d-0 - Waiting for 3 function activations to complete
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3
2025-06-27 13:27:12,995 [INFO] executors.py:618 -- ExecutorID 6c536d-0 - Cleaning temporary data
['Hello Alice!', 'Hello Bob!', 'Hello Eve!']
Notice that even though we ran 3 tasks, it says
2025-06-27 13:15:55,668 [INFO] localhost.py:39 -- Localhost storage client created
2025-06-27 13:15:55,668 [INFO] localhost.py:78 -- Localhost compute v2 client created
only once, as expected.
Now I need to try this on S3 - if there is a real bug, I think we will see one S3 storage client created per task (i.e. 3 of them).
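If I remember the Lithops API correctly, the only change needed in the reproducer should be something like the line below (the backend/storage kwargs are from memory and assume my AWS credentials and bucket are already set in the lithops config, so treat this as a sketch):

```python
# Sketch: point the reproducer at AWS instead of the localhost defaults.
# Assumes credentials and a bucket are already configured for lithops.
lithops_client = lithops.FunctionExecutor(backend='aws_lambda', storage='aws_s3')
```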
(cc @chuckwondo)
Thanks for reporting this, I'll check it as well.
@TomNicholas, did you ever sort this out?
I did not, and still need to sort it out