Optional batching and locking of KVS objects
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/core
Feature
What about introducing a simple Map of keys that are currently being queried? Whenever another request to the KVS for the same key gets fired, we wouldn't send it, but simply wait for the already running one to finish and return its value.
We could also use this for a sort of "locking". Whenever a write to KVS is happening, we could postpone all reads on the key until the write is done.
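Roughly something like this - just a sketch to illustrate both ideas, none of these names (`KeyCoordinator`, `read`, `write`) exist anywhere in crawlee:

```ts
type Fetch<T> = () => Promise<T>;

class KeyCoordinator {
    // Keys with a read currently in flight, mapped to the pending promise.
    private readonly inFlightReads = new Map<string, Promise<unknown>>();
    // Keys currently being written, mapped to the pending write promise.
    private readonly pendingWrites = new Map<string, Promise<void>>();

    // Coalesce concurrent reads of the same key into a single request.
    async read<T>(key: string, fetch: Fetch<T>): Promise<T> {
        // If a write to this key is running, wait for it to finish first.
        await this.pendingWrites.get(key);

        const running = this.inFlightReads.get(key);
        if (running) return running as Promise<T>;

        const promise = fetch().finally(() => this.inFlightReads.delete(key));
        this.inFlightReads.set(key, promise);
        return promise;
    }

    // Register a write so reads of the same key are postponed until it's done.
    async write(key: string, performWrite: () => Promise<void>): Promise<void> {
        const promise = performWrite().finally(() => this.pendingWrites.delete(key));
        this.pendingWrites.set(key, promise);
        return promise;
    }
}
```

A real implementation would need proper queuing (a read that starts while a write is being registered could still slip through), but it shows the idea.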
Motivation
When running under high concurrency, there's always a chance of race conditions between KVS reads and writes. This could eliminate the problem and make the KVS easier to use and more predictable for newcomers.
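For example, a classic lost update - made-up user code, just to illustrate the race:

```ts
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

// Two handlers running concurrently, both doing a read-modify-write on one key.
async function increment() {
    const counter = (await store.getValue<number>('counter')) ?? 0; // both may read 0
    await store.setValue('counter', counter + 1); // both then write 1
}

// 'counter' can end up as 1 instead of 2 - one increment is silently lost.
await Promise.all([increment(), increment()]);
```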
Ideal solution or implementation, and any additional constraints
Alternative solutions or implementations
Technically, we could achieve both by caching the objects locally, but I'm afraid that would explode on memory, so it's probably not the best idea.
Other context
No response
Maybe one question before everything - did you encounter any issues/performance problems that motivated this? If so, please share :) If not (but you feel like it could happen in an actual scenario), please share some example snippet 🙏🏽
Regarding the first paragraph - sounds like caching with extra steps. I'm not too sure how much we cache already, but there is something... maybe cc @vladfrangu? I doubt that checking for and waiting on currently running reads will be more efficient than a simple cache (given that we'll have at most around 20 concurrent runs, and most likely not all of them will read from the same key). I'm not against running some experiments though, it might be interesting.
> Whenever a write to KVS is happening, we could postpone all reads on the key until the write is done.
Sounds like transaction management with extra steps :) Here I'm interested in the motivation even more - did this ever cause any problems? Basic causality is afaik ensured, i.e. writes are immediately visible (`await write(A = B); await read(A) == B`). Anything else (e.g. preventing non-repeatable reads) would require some extended API for acquiring the locks and committing the changes, and might cause even more user confusion.
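i.e. with the public API - nothing new here, just spelling out the guarantee I mean:

```ts
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open();

await store.setValue('A', 'B');
// Once the write has been awaited, a subsequent read sees the new value.
const value = await store.getValue<string>('A');
console.log(value === 'B'); // true
```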
> Regarding the first paragraph - sounds like caching with extra steps. I'm not too sure how much we cache already, but there is something... maybe cc @vladfrangu? I doubt that checking for and waiting on currently running reads will be more efficient than a simple cache (given that we'll have at most around 20 concurrent runs, and most likely not all of them will read from the same key). I'm not against running some experiments though, it might be interesting.
To me it sounds more like: if two separate places request a key, we don't do two calls but one, and return the same data to both requests. It sounds like it could be a nice optimization on the apify-client side tho (as for memory-storage, it's all in memory and shouldn't cause issues).
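Something like this on the client side - pure sketch, `getRecordDeduped` and the store id are made up:

```ts
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const store = client.keyValueStore('my-store-id'); // placeholder id

// Keys with a request currently on the wire, mapped to the pending promise.
const inFlight = new Map<string, Promise<unknown>>();

function getRecordDeduped(key: string) {
    const running = inFlight.get(key);
    if (running) return running; // piggyback on the request already running

    const promise = store.getRecord(key).finally(() => inFlight.delete(key));
    inFlight.set(key, promise);
    return promise;
}

// Both of these resolve from a single API call.
const [a, b] = await Promise.all([getRecordDeduped('state'), getRecordDeduped('state')]);
console.log(a === b); // true - same response object
```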
> We could also use this for a sort of "locking". Whenever a write to KVS is happening, we could postpone all reads on the key until the write is done.
Implementation-wise it's probably simple, but I concur with Jindra about the _why_s. We could score some small wins with the first paragraph, especially for users on Apify. The latter feels very... niche, but I can see a use case for it too.
Yeah, sorry, this is a typical example of "making a note" without enough context. This thread prompted it, but I don't remember what exactly I was trying to solve 🤦♂️
I guess it was just to make sure that newbie users who use a single KVS item for both reads and writes in high-concurrency scenarios get reasonably correct results. But yeah, caching could probably solve it as well - I don't even remember if it's there. It solves both the locking and protects the platform from unnecessary requests. I feel like it occurred to me back then, but I can't recall why I thought it wasn't enough. I guess I was thinking about multi-actor scenarios? 😅