
Add delay in PutMulti to avoid datastore contention

Open bashtian opened this issue 8 years ago • 11 comments

bashtian avatar Nov 15 '16 10:11 bashtian

With what kind of data have you been getting the contention errors? I think contention errors are exclusive to concurrent writes to the same entity group. Do you use PutMulti to write multiple entities to the same entity group?

xStrom avatar Nov 15 '16 15:11 xStrom

I had over 4000 entities in the same entity group, but I didn't test what the maximum number was before getting the contention errors. Should I change the pull request to only add the delay for entities in the same entity group?

bashtian avatar Nov 15 '16 16:11 bashtian

The delay has no positive effect for PutMulti usage where the entities aren't in the same entity group, and that is how PutMulti is designed to be used.

You can read more about entity group write contention and how to avoid it here & here.

I would like to point out one specific passage:

there's a limit on how quickly you can update one entity group. In general, this works out to somewhere between 1 and 5 updates per second; a good guideline is that you should consider rearchitecting if you expect an entity group to have to sustain more than one update per second

Trying to batch write 4000 entities that belong to the same entity group goes way beyond the "1 write per second" guideline.

I don't think this delay is acceptable for goon right now, as goon design is done with that same 1 wps guideline in mind. What's more, I highly recommend you redesign your data structures if you seriously encounter the need for 4000 batch writes to the same entity group. Occasional bursts of 10 wps could be fine, but in general everything should be at or below 1 wps.

If the data is small, perhaps you could merge it together into only a handful of entities? (Keeping in mind the 1MB entity size limit.) You should also consider whether strong consistency is something you truly need. For example, it could be tempting to put a user's chat messages under a "user's chat messages" entity group, but a far superior design would be to just have a property on the chat message entity that contains the user id.
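To sketch the second design: a hedged, minimal example in Go (the `ChatMessage` struct and its fields are hypothetical, not goon's or anyone's actual schema). Instead of hanging messages off a parent key, the message is a root entity and the user is an indexed property, trading strong ancestor-query consistency for contention-free writes.

```go
package main

import "fmt"

// ChatMessage is a hypothetical contention-free design: each message is
// its own root entity (its own entity group), so writes by one user
// don't contend with each other. Queries filter on UserID instead of
// using an ancestor query, which means eventual consistency.
type ChatMessage struct {
	ID     int64 `datastore:"-" goon:"id"`
	UserID int64 // filter("UserID =", uid) instead of an ancestor key
	Text   string
}

func main() {
	m := ChatMessage{ID: 1, UserID: 42, Text: "hi"}
	fmt.Println(m.UserID) // 42
}
```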

xStrom avatar Nov 15 '16 18:11 xStrom

You are right that strong consistency probably isn't really needed in this case. But I think a single import of many items should still be supported by goon. In the docs it says:

The datastore will queue concurrent requests to wait their turn. Requests waiting in the queue past the timeout period will throw a concurrency exception.

So goon is putting too many items at once into the queue. Maybe there should be a PutMultiGroup or something similar to restrict the number of concurrent requests. Without fixing goon, the only solution would be to do a PutMulti loop.

bashtian avatar Nov 15 '16 20:11 bashtian

I agree with @xStrom that this doesn't belong in goon. This is an application-level problem outside of the goals and scope of goon.

maddyblue avatar Nov 15 '16 20:11 maddyblue

But I think a single import of many items should still be supported by goon.

Yes. Have you gotten contention errors with GetMulti?

xStrom avatar Nov 15 '16 21:11 xStrom

@mjibson I don't think it's an application problem. Of course it could be solved differently, but this is still a problem with PutMulti. Especially since you can get that many items from GetMulti, but writing them with PutMulti leads to an error. At least there should be some documentation of the constraints.

@xStrom No, I think this is only a write problem.

bashtian avatar Nov 16 '16 07:11 bashtian

I would again like to point towards the 1 wps guideline. Batch writing 4000 entities into the same entity group goes heavily against this guideline.

Let's say for a moment that goon could efficiently detect entity groups in PutMulti and delay the writes. To conform to the datastore usage guideline of 1 wps, we would need to space out the 4000 writes over 4000 seconds. That goes far beyond the 60 seconds available (the request processing limit), even if PutMulti gets called at the very start of the request.
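The arithmetic as a back-of-envelope check (the constants are just the figures from this thread):

```go
package main

import "fmt"

func main() {
	const writes = 4000       // entities destined for one entity group
	const wps = 1.0           // guideline: sustained writes per second
	const requestBudget = 60.0 // seconds available to one frontend request

	secondsNeeded := writes / wps
	fmt.Println(secondsNeeded)                 // 4000
	fmt.Println(secondsNeeded > requestBudget) // true: nowhere near fitting
}
```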

There's just no way around this. So many writes into the same entity group in so little time is not behavior that is expected to work.

Frankly I'm skeptical that this 100ms delay between batches of 500 has any significant impact. Have you tested this in production? If this truly helps, then maybe there is some really efficient write combining going on by Google. If that's the case, then batch writes of 500 entities into the same entity group should never return a contention error. [1] Have you found this to be the case in production?

If batch writes of 500 entities to the same entity group don't throw errors, then I agree that the errors are a product of goon and we could look into what can be done to solve it. However, if the contention errors are still there (and thus also appear when using the Google-provided API directly), then it's a simple case of everything having to conform to the 1 wps guideline.

[1] Provided that there is no concurrent writing to the same entity group by some unrelated code.

xStrom avatar Nov 16 '16 08:11 xStrom

If you are truly convinced this is a goon problem, this PR will need to have a convincing test that demonstrates how the problem is in goon. (I'm still not convinced.)

maddyblue avatar Nov 16 '16 09:11 maddyblue

@xStrom I tested it in production and it works. As I understood the docs, the 1 update per second applies per request to the datastore, so a request can be a single entity or a batch of at most 500 entities (the datastore limit). It worked at 10 updates per second when I tested it, which is why I used 100ms. Maybe that only works for a short time, so perhaps the default should be more than that.

@mjibson I can only create a demo project since this happens on the production datastore, or is there a way to create tests for the production datastore?

I tested it again with 9212 entities.

Fails without delay: (screenshot)

Succeeds with 1s delay: (screenshot)

It did not work with 500ms, so maybe the requests should just run consecutively in this case.

bashtian avatar Nov 16 '16 11:11 bashtian

Interesting and indeed I think consecutive calls after waiting for a return on the previous batch seems like a better solution. It's more dynamic and adapts based on how long the datastore is taking to perform the writes. Could you test this in production to see if it works without errors?

Then there's the question of how to do this efficiently. For batch puts of 500 or less, basically no new code should run. For larger batches, I guess we could generate a string set of entity groups [1] for every sub-batch. The first sub-batch would run instantly, but every subsequent one would check whether any of its entity groups are present in any previous sub-batch's set, and if so, wait for that sub-batch to finish before proceeding. This way there would be no extra waiting, and the sub-batches would continue to run concurrently when their entity groups don't overlap.

[1] Namespace + root parent's kind + id should be the correct combination I think. May have to confirm this first.

xStrom avatar Nov 16 '16 12:11 xStrom