Add delay in PutMulti to avoid datastore contention
With what kind of data have you been getting the contention errors? I think contention errors are exclusive to concurrent writes to the same entity group. Do you use PutMulti to write multiple entities to the same entity group?
I had over 4000 entities in the same entity group, but I didn't test what the maximum number was before getting the contention errors. Should I change the pull request to only add the delay for entities in the same entity group?
The delay has no positive effect for PutMulti usage where the entities aren't in the same entity group, and that is how PutMulti is designed to be used.
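For illustration, here's a minimal sketch of the contention-prone usage (the `Message` type and field names are hypothetical, using goon's struct tags): entities end up in the same entity group whenever their keys share a root ancestor.

```go
package example

import (
	"appengine"
	"appengine/datastore"

	"github.com/mjibson/goon"
)

// Message is a hypothetical entity type using goon's struct tags.
type Message struct {
	Id     int64          `datastore:"-" goon:"id"`
	Parent *datastore.Key `datastore:"-" goon:"parent"`
	Text   string
}

func putAll(c appengine.Context, texts []string) error {
	g := goon.FromContext(c)
	// All messages share this root ancestor, so they all land in one
	// entity group and the whole batch is subject to the write limit.
	root := datastore.NewKey(c, "User", "", 42, nil)

	msgs := make([]*Message, len(texts))
	for i, t := range texts {
		msgs[i] = &Message{Parent: root, Text: t}
	}
	_, err := g.PutMulti(msgs)
	return err
}
```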
You can read more about entity group write contention and how to avoid it here & here.
I would like to point out one specific passage:
there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second
Trying to batch write 4000 entities that belong to the same entity group goes way beyond the "1 write per second" guideline.
I don't think this delay is acceptable for goon right now, as goon's design assumes that same 1 wps guideline. What's more, I highly recommend that you redesign your data structures if you seriously encounter the need for 4000 batch writes to the same entity group. Occasional bursts of 10 wps could be fine, but in general everything should stay at or below 1 wps.
If the data is small, perhaps you could merge it together into only a handful of entities? (Keeping in mind the 1MB entity size limit.) You should also consider whether strong consistency is something you truly need. For example, it could be tempting to put a user's chat messages under a "user's chat messages" entity group, but a far superior design would be to just have a property on the chat message entity that contains the user id.
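A sketch of that suggestion (type and field names are illustrative): with no `goon:"parent"` field, every message is a root entity in its own entity group, so batch writes don't contend, at the cost of the lookup query being only eventually consistent.

```go
package example

import (
	"appengine"
	"appengine/datastore"

	"github.com/mjibson/goon"
)

// ChatMessage as a root entity: each message is its own entity
// group, so batch writes don't contend with each other.
type ChatMessage struct {
	Id     int64 `datastore:"-" goon:"id"`
	UserId int64 // plain indexed property instead of an ancestor key
	Text   string
}

// messagesFor fetches a user's messages with a regular (eventually
// consistent) query instead of an ancestor query.
func messagesFor(c appengine.Context, uid int64) ([]ChatMessage, error) {
	g := goon.FromContext(c)
	var msgs []ChatMessage
	q := datastore.NewQuery("ChatMessage").Filter("UserId =", uid)
	_, err := g.GetAll(q, &msgs)
	return msgs, err
}
```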
You are right that strong consistency probably isn't really needed in this case. But I think a single import of many items should still be supported by goon. In the docs it says:
The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception.
So goon is putting too many items at once into the queue. Maybe there should be a PutMultiGroup or something similar to restrict the number of concurrent requests. Without fixing goon, the only solution would be to do a PutMulti loop, as sketched below.
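Such a loop could look roughly like this (a hypothetical helper, not part of goon's API; the `Message` type is the same made-up one as in the earlier sketch): split the batch into datastore-sized chunks of 500 and issue them sequentially, so each call waits for the previous one to return.

```go
package example

import (
	"appengine/datastore"

	"github.com/mjibson/goon"
)

// Message is a hypothetical entity type; the goon:"parent" tag is
// what puts the whole batch into one entity group.
type Message struct {
	Id     int64          `datastore:"-" goon:"id"`
	Parent *datastore.Key `datastore:"-" goon:"parent"`
	Text   string
}

// putMultiChunked is a hypothetical workaround helper. It splits a
// large batch into chunks of 500 and writes them one after another,
// so each PutMulti waits for the previous one to finish.
func putMultiChunked(g *goon.Goon, msgs []*Message) error {
	const chunk = 500 // datastore batch put limit
	for len(msgs) > 0 {
		n := chunk
		if n > len(msgs) {
			n = len(msgs)
		}
		if _, err := g.PutMulti(msgs[:n]); err != nil {
			return err
		}
		msgs = msgs[n:]
	}
	return nil
}
```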
I agree with @xStrom that this doesn't belong in goon. This is an application-level problem outside of the goals and scope of goon.
But I think a single import of many items should still be supported by goon.
Yes. Have you gotten contention errors with GetMulti?
@mjibson I don't think it's an application problem. Of course it could be solved differently, but this is still a problem with PutMulti, especially since you can fetch that many items with GetMulti, yet writing them back with PutMulti leads to an error. At the very least there should be some documentation of the constraints.
@xStrom No, I think this is only a write problem.
I would again like to point towards the 1 wps guideline. Batch writing 4000 entities into the same entity group goes far beyond that guideline.
Let's say for a moment that goon could efficiently detect entity groups in PutMulti and delay the writes. To conform to the datastore usage guideline of 1 wps, we would need to space the 4000 writes out over 4000 seconds. That goes far beyond the 60 seconds available (the request processing limit), even if PutMulti gets called at the very start of the request.
There's just no way around this. So many writes into the same entity group in so little time is not behavior that is expected to work.
Frankly I'm skeptical that this 100ms delay between batches of 500 has any significant impact. Have you tested this in production? If this truly helps, then maybe there is some really efficient write combining going on by Google. If that's the case, then batch writes of 500 entities into the same entity group should never return a contention error. [1] Have you found this to be the case in production?
If batch writes of 500 entities to the same entity group don't throw errors, then I agree that the errors are a product of goon, and we could look into what can be done to solve it. However, if the contention errors are still there (and thus also appear when using the Google-provided API directly), then it's a simple case of everything having to conform to the 1 wps guideline.
[1] Provided that there is no concurrent writing to the same entity group by some unrelated code.
If you are truly convinced this is a goon problem, this PR will need to have a convincing test that demonstrates how the problem is in goon. (I'm still not convinced.)
@xStrom I tested it in production and it works. As I understand the docs, the 1 update per second applies to a request to the datastore, so a request can be a single entity or a batch of at most 500 entities (the datastore limit). It worked with 10 updates per second when I tested it; that's why I used 100ms. Maybe that only holds for a short time, so the default delay should perhaps be longer than that.
@mjibson I can only create a demo project, since this happens on the production datastore. Or is there a way to create tests against the production datastore?
I tested it again with 9212 entities.
Fails without delay:
Success with 1s delay:
It did not work with 500ms, so maybe the requests should just run consecutively in this case.
Interesting, and indeed I think consecutive calls, each waiting for the previous batch to return, seem like a better solution. It's more dynamic and adapts based on how long the datastore is taking to perform the writes. Could you test this in production to see if it works without errors?
Then there's the question of how to do this efficiently. For batch puts of 500 or fewer, basically no new code should run. For larger batches, I guess we could generate a string set of entity groups [1] for every sub-batch. The first sub-batch would run instantly, but every subsequent one would check whether any of its entity groups are present in any previous sub-batch's set, and if so, wait for those sub-batches to finish before proceeding. This way there would be no extra waiting, and sub-batches would remain concurrent when their entity groups don't overlap. A rough sketch follows the footnote below.
[1] Namespace + root parent's kind + id should be the correct combination I think. May have to confirm this first.
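To make that concrete, here's a rough sketch, not actual goon code: the entity-group string follows the unconfirmed format from [1], and the `batch` type, `groupID`, `schedule`, and the `put` callback are all placeholders for whatever goon would do internally.

```go
package example

import (
	"fmt"
	"sync"

	"appengine/datastore"
)

// groupID builds the entity-group string from footnote [1]:
// namespace + root ancestor's kind + id. Format unconfirmed.
func groupID(k *datastore.Key) string {
	for k.Parent() != nil {
		k = k.Parent()
	}
	return fmt.Sprintf("%s/%s/%s/%d", k.Namespace(), k.Kind(), k.StringID(), k.IntID())
}

// batch is a placeholder for one sub-batch of at most 500 entities.
type batch struct {
	keys   []*datastore.Key
	groups map[string]bool
	done   chan struct{}
}

// schedule runs sub-batches concurrently, but a sub-batch whose
// entity groups overlap an earlier one waits for that one to finish.
func schedule(batches []*batch, put func(*batch) error) error {
	var wg sync.WaitGroup
	errc := make(chan error, len(batches))
	for i, b := range batches {
		b.groups = map[string]bool{}
		for _, k := range b.keys {
			b.groups[groupID(k)] = true
		}
		b.done = make(chan struct{})
		// Collect earlier sub-batches sharing an entity group.
		var waitOn []*batch
		for _, prev := range batches[:i] {
			for g := range b.groups {
				if prev.groups[g] {
					waitOn = append(waitOn, prev)
					break
				}
			}
		}
		wg.Add(1)
		go func(b *batch, waitOn []*batch) {
			defer wg.Done()
			defer close(b.done)
			for _, p := range waitOn {
				<-p.done // wait out the contention window
			}
			if err := put(b); err != nil {
				errc <- err
			}
		}(b, waitOn)
	}
	wg.Wait()
	close(errc)
	return <-errc // nil if no sub-batch reported an error
}
```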