
Exit fast on connection pool full

Open gburanov opened this issue 5 years ago • 18 comments

gburanov avatar Nov 24 '19 21:11 gburanov

Thanks for the PR. What is the use case in which this is necessary?

khaf avatar Nov 24 '19 21:11 khaf

@khaf Thank you for replying so fast. Our server needs to reply to each request within a fixed, small amount of time. Replying with something like "Cannot calculate" is acceptable, though.

Without this fix, when the connection pool is full, aerospike-client-go waits up to the full Timeout in the worst case and only replies then.

In our case (and I think in many other cases) it is much better to fail fast with an error.

gburanov avatar Nov 24 '19 22:11 gburanov

However, I added a flag, WaitForConnectionsToBeRealeased, to preserve the old behavior when needed.
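
For illustration, a minimal sketch of how the flag from this PR might be wired up. The WaitForConnectionsToBeRealeased field is hypothetical in the released client (it exists only in the PR), so it is commented out below:

package main

import (
	"log"

	aerospike "github.com/aerospike/aerospike-client-go"
)

func main() {
	clientPolicy := aerospike.NewClientPolicy()
	// Hypothetical field from this PR, not in the released client: setting it
	// to false would make requests fail immediately when the connection pool
	// is exhausted instead of blocking until the timeout.
	// clientPolicy.WaitForConnectionsToBeRealeased = false

	client, err := aerospike.NewClientWithPolicy(clientPolicy, "127.0.0.1", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}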

gburanov avatar Nov 24 '19 22:11 gburanov

This is a really nice feature; there should be an option to fail fast if no connections are available.

Any timeline for when this feature will be merged?

saifat29 avatar Nov 25 '19 12:11 saifat29

The client has an internal retry mechanism, and it does not regard a lack of available pool connections as a failure, since another connection may become available a microsecond later.

I don't think I'll merge this PR, but we may change the client in a way that supports this use case by counting a lack of connections in the pool as a failure. In that case, Policy.MaxRetries = 1 would be the same as a fail-fast option.

Feedback is welcome.

khaf avatar Nov 25 '19 12:11 khaf

@khaf You are right; currently fail fast can be achieved with Policy.MaxRetries = 0 (zero, because the first request doesn't count as a retry and I don't want to try again). Mind explaining why you chose Policy.MaxRetries = 1?

If I am not wrong, policy.TotalTimeout can also be used for the same purpose.
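
For concreteness, a minimal sketch of the fail-fast call under discussion, assuming the client were changed so that an empty pool counts as a failed attempt (failFastGet is an illustrative name; import path github.com/aerospike/aerospike-client-go):

// failFastGet sketches the desired behavior: if an empty connection pool
// counted as a failed attempt, MaxRetries = 0 would surface the error
// immediately instead of waiting for a connection to be released.
func failFastGet(client *aerospike.Client, key *aerospike.Key) (*aerospike.Record, error) {
	policy := aerospike.NewPolicy()
	policy.MaxRetries = 0 // first attempt only; no internal retries
	return client.Get(policy, key)
}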

saifat29 avatar Nov 25 '19 13:11 saifat29

@saifat29 You are correct, that was a slip. This does not work at the moment, because the iteration count begins at -1, combined with this logic.

I will need to change that a bit to allow for this mechanism.
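
For intuition, a simplified illustration of the off-by-one (this is not the client's actual code; the loop shape is assumed):

// Illustrative only: with the attempt counter starting at -1, the condition
// below allows two passes even when maxRetries is 0, so a "no retries"
// policy still waits for a second chance at a connection.
func attemptLoop(maxRetries int) {
	for attempts := -1; attempts <= maxRetries; attempts++ {
		// ... try to acquire a pool connection and execute the command ...
	}
}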

khaf avatar Nov 25 '19 14:11 khaf

If you change the code so that MaxRetries = 1 works, that would be great =) However, what we have today is:

  • MaxRetries = 2 (to try a second time)
  • Timeout = 5s (a very big value; it should trigger only when there is a network problem. Lowering it does not really work for us, because when a connection times out it gets closed, and creating a new connection takes time and resources. So Timeout = 50ms will not work: many connections would be closing/reopening.)

I guess if MaxRetries = 1 is the easiest fix, let's do it. We can retry on the client side; not a big deal.
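
For reference, a rough sketch of that client-side retry, assuming the client call itself fails fast (getWithRetry and the attempt count are illustrative):

// getWithRetry retries on the caller's side instead of inside the client:
// it tries the fail-fast call a bounded number of times before giving up.
func getWithRetry(client *aerospike.Client, policy *aerospike.BasePolicy, key *aerospike.Key) (*aerospike.Record, error) {
	var rec *aerospike.Record
	var err error
	for attempt := 0; attempt < 2; attempt++ {
		if rec, err = client.Get(policy, key); err == nil {
			return rec, nil
		}
	}
	return nil, err
}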

gburanov avatar Nov 25 '19 15:11 gburanov

@gburanov Are you aware of the two distinct timeout settings in the policies? There is SocketTimeout and then TotalTimeout, which together let you fine-tune this. Have you tried them?
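
For concreteness, a minimal sketch of tuning the two independently, assuming a client version whose BasePolicy exposes both fields (imports: time and github.com/aerospike/aerospike-client-go):

// tunedPolicy sketches the two-level timeout: a short per-attempt socket
// deadline inside a much larger overall transaction budget.
func tunedPolicy() *aerospike.BasePolicy {
	policy := aerospike.NewPolicy()
	policy.SocketTimeout = 50 * time.Millisecond // per-attempt network deadline
	policy.TotalTimeout = 5 * time.Second        // total budget across all retries
	return policy
}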

khaf avatar Nov 25 '19 15:11 khaf

Yes. The socket timeout description in https://github.com/aerospike/aerospike-client-go/blob/d33d4fcf291570041ca447b1e76c0f252fcaa90c/policy.go says: "If SocketTimeout is not zero and SocketTimeout is reached before an attempt completes, the Timeout above is checked. If Timeout is not exceeded, the transaction is retried." This is not really what we want, I think...

So what we want is

  • Close Aerospike connections rarely, only when there is a networking issue => TotalTimeout should be big (5s)

  • However, we should fail fast in the case I described (no connections available)

  • We should exit within at most xxx ms (to cut off the 1% of slow reads/writes to Aerospike). For now we do it by wrapping the calls:

var rec *aerospike.Record // shared with the goroutine; read only after doneChan closes
var err error

doneChan := make(chan struct{})
go func() {
	rec, err = c.client.Get(policy, key, binNames...)
	close(doneChan)
}()
select {
case <-ctx.Done():
	// The abandoned goroutine (and its pool connection) lives on until the
	// client call itself returns; only the caller gives up early here.
	c.userLoadAerospikeContextCancelled.Inc(1)
	return nil, ctx.Err()
case <-doneChan:
}
return rec, err

If you have some other way of implementing it, it would be great to know =)

gburanov avatar Nov 25 '19 15:11 gburanov

In my opinion, the best method is to set Policy.MaxRetries = 0, or to whatever number of times the client is willing to wait when there are no connections in the pool.

This way the client sets Policy.MaxRetries = 0 if it wants to return immediately, or Policy.MaxRetries = n if it is willing to try n times. I tried setting Policy.MaxRetries to zero, but it doesn't work. Check this condition.

@gburanov is correct about not using TotalTimeout as a fast-exit mechanism. I tried it and noticed that TotalTimeout closes the connection, which is not good, as resources are wasted.

So to sum it up: if Policy.MaxRetries = 0 returns instantly with an appropriate error, this issue will be solved.

saifat29 avatar Nov 25 '19 16:11 saifat29

So, what was decided? How should we proceed?

saifat29 avatar Nov 26 '19 16:11 saifat29

We need to discuss this internally since it may affect all the clients. The next meeting is early next week. I'll get back to you as soon as we have a decision.

khaf avatar Nov 26 '19 16:11 khaf

@khaf how is it going? =)

gburanov avatar Jan 30 '20 08:01 gburanov

Sorry for the late reply @gburanov , we are still at it, and discussed it again in a meeting today.

I'll get back to you soon.

khaf avatar Feb 10 '20 19:02 khaf

@khaf any update on this?

dynamix avatar Apr 06 '21 17:04 dynamix

Yes, this is coming with the new v5 release towards the end of this week.

khaf avatar Apr 06 '21 18:04 khaf

> Yes, this is coming with the new v5 release towards the end of this week.

@khaf is there a commit you can reference for that change? I could not find it on the v5 branch.

> @gburanov is correct about not using TotalTimeout as a fast-exit mechanism. I tried it and noticed that TotalTimeout closes the connection, which is not good, as resources are wasted.

@saifat29 I think I have been bitten by this problem for some time now. This is more or less what I observe in a tcpdump traffic capture:

  1. The client has a long-standing TCP connection open with the server (say, 10+ minutes) that has been used for many queries/commands
  2. Most queries have a TotalTimeout of 1 second on the policy
  3. When such queries time out, the connections are closed (!)
  4. This aggravates the timeout problem, because now new connections have to be opened

NOTE: I am using the master branch

Shall I open a new bug about (4)? If it is confirmed that an expiring TotalTimeout leads to closed connections, then I think we should fix it, because a user-set timeout does not imply that the underlying connection is bad; TCP has its own mechanisms to detect that. I guess I could use @gburanov's approach until this is solved.

gmaz42 avatar Aug 25 '21 16:08 gmaz42

Done in Nov 2021 per this commit: https://github.com/aerospike/aerospike-client-go/commit/943c03c74b61380fe3e2d416c99881401b436fff

Unfortunately, I forgot to mention it here.

khaf avatar Apr 02 '24 13:04 khaf