
docs: clarify AWS Lambda storage

Open connorads opened this issue 1 year ago • 2 comments

There is ephemeral storage in /tmp: https://docs.aws.amazon.com/lambda/latest/api/API_EphemeralStorage.html

It could technically be used if desired (CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage)

connorads avatar May 20 '24 08:05 connorads

Which could technically be used if desired (CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage)

This is only true to an extent - the ephemeral storage can be shared between different Lambda invocations, provided they run in the same execution environment (i.e. if you call the Lambdas one after another, AWS will reuse the running Lambda environment). This might cause some very hard-to-debug issues (stale shared state left over from previous runs) - even though Crawlee should always purge the previous state, you can never be too cautious with these things :) This is especially important if you want to run multiple crawler instances in one Lambda.
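To make the shared-state concern concrete, here is a minimal sketch that uses only Node's standard library and does not touch Crawlee itself. The `handler` function and the `seen.json` file are hypothetical stand-ins for a Lambda handler and its on-disk state, and `storageDir` plays the role of `CRAWLEE_STORAGE_DIR`. Calling the handler twice with the same directory imitates two invocations landing in the same warm execution environment.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical handler that persists its state on disk under `storageDir`
// (an illustrative stand-in for CRAWLEE_STORAGE_DIR, not Crawlee's real layout).
function handler(storageDir: string, url: string): string[] {
  fs.mkdirSync(storageDir, { recursive: true });
  const stateFile = path.join(storageDir, "seen.json");
  // Load whatever a previous invocation left behind, if anything.
  const seen: string[] = fs.existsSync(stateFile)
    ? JSON.parse(fs.readFileSync(stateFile, "utf8"))
    : [];
  seen.push(url);
  fs.writeFileSync(stateFile, JSON.stringify(seen));
  return seen;
}

// Warm environment: both invocations see the same directory.
const sharedDir = fs.mkdtempSync(path.join(os.tmpdir(), "crawlee-demo-"));
const first = handler(sharedDir, "https://example.com/a");
const second = handler(sharedDir, "https://example.com/b");

// Fresh directory: no state carried over from earlier invocations.
const isolatedDir = fs.mkdtempSync(path.join(os.tmpdir(), "crawlee-demo-"));
const isolated = handler(isolatedDir, "https://example.com/c");

console.log(first.length, second.length, isolated.length);

// Clean up the temporary directories created for the demonstration.
fs.rmSync(sharedDir, { recursive: true, force: true });
fs.rmSync(isolatedDir, { recursive: true, force: true });
```

The second call returns the URL from the first call as well as its own, which is the kind of leftover state described above. Keeping storage in memory avoids this class of problem, because nothing is written to disk where a later invocation could find it.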

I agree w/ @B4nan that explaining all these whys and wherefores is rather counterproductive - I'd show the one and only way to do this rather than confusing the reader with (more or less) irrelevant details.

barjin avatar May 23 '24 09:05 barjin

Thanks for your feedback @B4nan and @barjin

Sounds like you're saying we should use in-memory storage not because of the read-only Lambda filesystem but because on-disk storage introduces "statefulness" and potentially hard-to-debug issues. I've tried to update it to express that: https://github.com/apify/crawlee/pull/2477/commits/70a4fdd8978b15a9a26432fba4b5e50ea9143182

If you still think it's worse than before, feel free to edit it and/or close this pull request.

connorads avatar May 25 '24 13:05 connorads