docs: clarify AWS Lambda storage
There is ephemeral storage in /tmp
https://docs.aws.amazon.com/lambda/latest/api/API_EphemeralStorage.html
Which could technically be used if desired
CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage
Which could technically be used if desired (
CRAWLEE_STORAGE_DIR=/tmp/crawlee/storage)
This is only true to an extent - the ephemeral storage can be shared between different Lambda invocations, provided they run in the same execution environment (i.e. if you call the Lambdas one after another, AWS will repurpose the running Lambda environment). This might cause some very hard-to-debug issues (stuck shared state from the previous runs) - even though Crawlee should always purge the previous state, you can never be too cautious with these things :) This is especially important if you want to run multiple crawler instances in one Lambda.
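To make the warm-environment point concrete, here is a minimal sketch that uses only Node's standard library. It is not Crawlee code and does not model Crawlee's purge-on-start behaviour. The handler names, the `queue.json` file, and the temp directory layout are all hypothetical, and two consecutive function calls stand in for two invocations that land in the same execution environment.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical handler that persists its state under a fixed directory,
// the way a crawler pointed at /tmp/crawlee/storage would.
function handlerWithSharedDir(storageDir: string, url: string): string[] {
    fs.mkdirSync(storageDir, { recursive: true });
    const queueFile = path.join(storageDir, "queue.json");
    // Anything left behind by an earlier invocation is picked up here.
    const seen: string[] = fs.existsSync(queueFile)
        ? JSON.parse(fs.readFileSync(queueFile, "utf8"))
        : [];
    seen.push(url);
    fs.writeFileSync(queueFile, JSON.stringify(seen));
    return seen;
}

// Hypothetical variant: every invocation gets a fresh directory and
// removes it afterwards, so nothing leaks into the next run.
function handlerWithIsolatedDir(baseDir: string, url: string): string[] {
    fs.mkdirSync(baseDir, { recursive: true });
    const runDir = fs.mkdtempSync(path.join(baseDir, "run-"));
    try {
        return handlerWithSharedDir(runDir, url);
    } finally {
        fs.rmSync(runDir, { recursive: true, force: true });
    }
}

// Stand-in for the ephemeral storage of one warm execution environment.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "lambda-demo-"));
const sharedDir = path.join(base, "storage");

const sharedFirst = handlerWithSharedDir(sharedDir, "https://example.com/a");
const sharedSecond = handlerWithSharedDir(sharedDir, "https://example.com/b");
const isolatedFirst = handlerWithIsolatedDir(base, "https://example.com/a");
const isolatedSecond = handlerWithIsolatedDir(base, "https://example.com/b");

console.log("shared, 2nd call:", sharedSecond.length, "entries");
console.log("isolated, 2nd call:", isolatedSecond.length, "entries");

// Clean up the demo directory.
fs.rmSync(base, { recursive: true, force: true });
```

With the fixed directory, the second call still sees the URL left by the first. With a per-invocation directory, each call starts clean. In-memory storage gives the same isolation without touching the filesystem at all.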
I agree w/ @B4nan that explaining all these whys and wherefores is rather counterproductive - I'd show the one and only way to do this rather than confusing the reader with (more or less) irrelevant details.
Thanks for your feedback, @B4nan and @barjin.
Sounds like you're saying we should use in-memory storage not because of the read-only Lambda filesystem, but because storing state in /tmp would introduce "statefulness" and potentially hard-to-debug issues. I've tried to update it to express that: https://github.com/apify/crawlee/pull/2477/commits/70a4fdd8978b15a9a26432fba4b5e50ea9143182.
If you still think it's worse than before, feel free to edit it and/or close this pull request.