
Many S3 clients created when using `RetryingFunctionExecutor`

Open TomNicholas opened this issue 8 months ago • 10 comments

I was successfully using the FunctionExecutor on AWS Lambda and S3, but seeing individual tasks fail at scale. So I tried switching my code to use the RetryingFunctionExecutor that @tomwhite added in #1291, to automatically retry each task (I set it to a maximum of 2 retry attempts).

However, as a result I now get this weird behaviour:

[Screenshot: log output showing hundreds of new S3 storage clients being created]

It's completed one batch of tasks (I'm running .map in a loop), but then, as it tries to get the results, it somehow creates hundreds of new S3 clients. These just keep piling up, and it never seems to make it to the next batch of tasks.

It didn't do this before I switched to the RetryingFunctionExecutor, so the problem must be with that (or possibly with how I'm calling it).

TomNicholas avatar Apr 27 '25 01:04 TomNicholas

I've established that my code is creating one extra S3 client per task...

TomNicholas avatar Apr 27 '25 02:04 TomNicholas

I feel like if I were able to follow the executor.get_result(futures) pattern then this would work... But RetryingFunctionExecutor doesn't have that method yet (see https://github.com/lithops-cloud/lithops/pull/1291#issue-2212994457). So I tried looking at the implementation of FunctionExecutor.get_result(), which seems to basically just call executor.wait(download_results=True). However, when I try explicitly calling executor.wait(download_results=True) I immediately get

OSError: [Errno 28] No space left on device

which makes no sense to me. My results are each only a few kB! I was previously quite happily downloading 700 of them at a time to my laptop! The docstring of download_results isn't super helpful on why this might be the case.
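
For reference, here is a minimal sketch of the get_result()-style pattern I'm trying to reproduce on top of RetryingFunctionExecutor, assuming only that wait() returns a (finished, pending) pair and that each retrying future exposes result(); note that the finished futures come back as a set, so results may not be in input order the way get_result() guarantees:

import lithops
from lithops.retries import RetryingFunctionExecutor


def get_results_with_retries(func, iterdata, retries=2):
    # Sketch: emulate FunctionExecutor.get_result() using only
    # RetryingFunctionExecutor.map() and .wait()
    with RetryingFunctionExecutor(lithops.FunctionExecutor()) as fexec:
        futures = fexec.map(func, iterdata, retries=retries)
        finished, pending = fexec.wait(futures)
        # finished is not ordered like get_result()'s output would be
        return [future.result() for future in finished]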

TomNicholas avatar Apr 27 '25 02:04 TomNicholas

It seems the client connection gets created when the S3Backend class gets instantiated; S3Backend is the S3-specific implementation of the StorageBackend. So I'm somehow triggering lots of StorageBackends to be created, but I can't follow the structure of the code well enough to see why that might be happening.
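
One way to narrow this down (a debugging sketch, assuming the S3Backend ultimately creates its clients through boto3): wrap boto3's Session.client so that every new S3 client prints the stack that created it, before any lithops executor is constructed.

import traceback
import boto3

_original_client = boto3.session.Session.client


def _traced_client(self, service_name, *args, **kwargs):
    # Print the call stack whenever an S3 client is created, so we can
    # see which code path is triggering the extra clients
    if service_name == 's3':
        print('--- new S3 client ---')
        traceback.print_stack()
    return _original_client(self, service_name, *args, **kwargs)


# Apply the patch before creating any FunctionExecutor
boto3.session.Session.client = _traced_client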

TomNicholas avatar Apr 27 '25 03:04 TomNicholas

I'm not sure what's causing this, but in Cubed I've been using RetryingFunctionExecutor with ~1000 AWS Lambda instances without any problems... I wonder if it's something to do with the way it's being called? The relevant function is here: https://github.com/cubed-dev/cubed/blob/main/cubed/runtime/executors/lithops.py#L43-L164

tomwhite avatar Apr 28 '25 16:04 tomwhite

Yeah, I thought the same thing. I looked at your implementation and tried experimenting locally to copy what you did in Cubed, @TomWhite, but all your straggler logic made it harder for me to extract the essence of what was needed...

EDIT: It might be interesting to delete that code from Cubed to simplify it, and see if the same problem emerges.

TomNicholas avatar Apr 28 '25 16:04 TomNicholas

I wondered if it had something to do with using / not using the executor as a context manager... I would have naively expected that the S3 client would connect when the executor was instantiated, but it's not clear to me if that's true.
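
A quick way to check that, using only calls that already appear in this thread: create a bare FunctionExecutor and watch whether the "storage client created" INFO log appears at construction time or only once work is submitted and results are fetched.

import lithops


def inc(x):
    return x + 1


# If the storage client connects at construction time, the
# "storage client created" log line should appear here already...
fexec = lithops.FunctionExecutor()

# ...otherwise it should only show up once we submit work and
# download the results
futures = fexec.map(inc, [1, 2, 3])
print(fexec.get_result(futures))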

TomNicholas avatar Apr 28 '25 16:04 TomNicholas

Here's what I think should be a reproducer:

Script
import lithops
from lithops.retries import RetryingFunctionExecutor


def hello(name: str) -> str:
    return 'Hello {}!'.format(name)


def map_over_names(names: list[str], *, use_retries: bool) -> list[str]:

    if use_retries:
        # RetryingFunctionExecutor wraps a normal FunctionExecutor
        lithops_client = lithops.FunctionExecutor()
        with RetryingFunctionExecutor(lithops_client) as fexec:
        
            # TODO retries should be exposed as a configuration arg in lithops, see https://github.com/lithops-cloud/lithops/issues/1412
            futures = fexec.map(hello, names, retries=2)
        
            finished_futures, pending = fexec.wait(futures)
            results = [future.result() for future in finished_futures]
    else:
        fexec = lithops.FunctionExecutor()
        
        futures = fexec.map(hello, names)
        results = fexec.get_result(futures)
    
    return results


if __name__ == '__main__':
    names = ["Alice", "Bob", "Eve"]
    USE_RETRIES = False

    results = map_over_names(names=names, use_retries=USE_RETRIES)
    print(results)

When I run this as a script with no lithops config file (so all defaults), it runs using the lithops localhost executor. The output looks like this:

Output
2025-06-27 13:27:12,543 [INFO] config.py:139 -- Lithops v3.6.1.dev0 - Python3.12
2025-06-27 13:27:12,544 [INFO] localhost.py:39 -- Localhost storage client created
2025-06-27 13:27:12,544 [INFO] localhost.py:78 -- Localhost compute v2 client created
2025-06-27 13:27:12,642 [INFO] invokers.py:119 -- ExecutorID 6c536d-0 | JobID M000 - Selected Runtime: python 
2025-06-27 13:27:12,644 [INFO] invokers.py:186 -- ExecutorID 6c536d-0 | JobID M000 - Starting function invocation: hello() - Total: 3 activations
2025-06-27 13:27:12,647 [INFO] invokers.py:225 -- ExecutorID 6c536d-0 | JobID M000 - View execution logs at /private/var/folders/6x/yyxxlhtd3db164lh5zgfhgp80000gn/T/lithops-tom/logs/6c536d-0-M000.log
2025-06-27 13:27:12,648 [INFO] executors.py:494 -- ExecutorID 6c536d-0 - Getting results from 3 function activations
2025-06-27 13:27:12,648 [INFO] wait.py:101 -- ExecutorID 6c536d-0 - Waiting for 3 function activations to complete

  100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3  

2025-06-27 13:27:12,995 [INFO] executors.py:618 -- ExecutorID 6c536d-0 - Cleaning temporary data
['Hello Alice!', 'Hello Bob!', 'Hello Eve!']

Notice that even though we ran 3 tasks, it says

2025-06-27 13:15:55,668 [INFO] localhost.py:39 -- Localhost storage client created
2025-06-27 13:15:55,668 [INFO] localhost.py:78 -- Localhost compute v2 client created

only once, as expected.

Now I need to try this on S3 - if there is a real bug, I think we will see one S3 storage client created per task (i.e. 3 of them).

(cc @chuckwondo)
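
To point the same reproducer at Lambda/S3 rather than the localhost defaults, passing a config dict explicitly should work; this is a sketch, the exact keys depend on the lithops version, and the region and 'my-bucket' are placeholders:

import lithops

config = {
    'lithops': {'backend': 'aws_lambda', 'storage': 'aws_s3'},
    'aws': {'region': 'us-east-1'},             # placeholder region
    'aws_s3': {'storage_bucket': 'my-bucket'},  # placeholder bucket
}

# No ~/.lithops/config file is needed when the config is passed explicitly;
# then count how many "S3 client created" log lines appear per task
fexec = lithops.FunctionExecutor(config=config)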

TomNicholas avatar Jun 27 '25 17:06 TomNicholas

Thanks for reporting this, I'll check it as well.

JosepSampe avatar Jun 27 '25 19:06 JosepSampe

@TomNicholas, did you ever sort this out?

chuckwondo avatar Jul 30 '25 17:07 chuckwondo

I did not, and I still need to sort it out.

TomNicholas avatar Jul 30 '25 17:07 TomNicholas