chisel icon indicating copy to clipboard operation
chisel copied to clipboard

Fix client backoff

Open fin3ss3g0d opened this issue 1 year ago • 6 comments

The code here fails to set a minimum value for the backoff based on this project.

image

This means if the server is down, the backoff minimum for every chisel user is set to a 100ms value by default according to the screenshot above and will result in extremely short backoffs (default behavior is shown below).

image

Specifically, the sequence will always be 100ms, 200ms, 400ms, 800ms, and 1.6s (if the server is down). This technically gives the user about 2 seconds to get the server back up before the client exits entirely, and it is noisy over the wire to send this many packets this fast. I hope you can see how this can be problematic based on use case. The updated behavior with this commit can be seen below and gives users more time to get the server back up and running:

image

fin3ss3g0d avatar Oct 24 '24 16:10 fin3ss3g0d

That’s strange, chisel client should retry forever…

On Fri, 25 Oct 2024 at 3:55 AM Dylan Evans @.***> wrote:

The code here https://github.com/jpillora/chisel/blob/ab8f06a83048dca0c24dc0b06932dc98df54e8b1/client/client_connect.go#L22C1-L22C55 fails to set a minimum value for the backoff based on this https://github.com/jpillora/backoff project.

image.png (view on web) https://github.com/user-attachments/assets/36f8c2cf-3dbc-4b07-9126-ec61cfdb37f7

This means if the server is down, the backoff minimum for every chisel user is set to a 100ms value by default according to the screenshot above and will result in extremely short backoffs (default behavior is shown below).

image.png (view on web) https://github.com/user-attachments/assets/953b22cc-3f60-440a-b6a6-25c0eba6b60c

Specifically, the sequence will always be 100ms, 200ms, 400ms, 800ms, and 1.6s (if the server is down). This technically gives the user about 2 seconds to get the server back up before the client exits entirely, and it is noisy over the wire to send this many packets this fast. I hope you can see how this can be problematic based on use case. The updated behavior with this commit can be seen below and gives users more time to get the server back up and running:

image.png (view on web) https://github.com/user-attachments/assets/e555a341-765e-4078-b472-7c375db2c57f

You can view, comment on, or merge this pull request online at:

https://github.com/jpillora/chisel/pull/537 Commit Summary

File Changes

(3 files https://github.com/jpillora/chisel/pull/537/files)

Patch Links:

  • https://github.com/jpillora/chisel/pull/537.patch
  • https://github.com/jpillora/chisel/pull/537.diff

— Reply to this email directly, view it on GitHub https://github.com/jpillora/chisel/pull/537, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE2X46MWG3Y3YKF5JPCFC3Z5ERB7AVCNFSM6AAAAABQRQXWTKVHI2DSMVQWIX3LMV43ASLTON2WKOZSGYYTEMJRGY4DSMQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jpillora avatar Oct 24 '24 20:10 jpillora

Respectfully, I never said retrying was the issue or that the client would stop retrying unintentionally... the title of the PR is Fix client backoff because of the way you have configured the backoff for the client and is about backoff time. There should be more time in between the first 5 retries and the user should be able to control the time with a new flag... The first 5 retries will all take place within ~2 seconds which doesn't make sense from my standpoint. You are not giving the user enough time in between each retry in order to get the server back up in case it goes down. With your current logic, the backoff will eventually get above 2 seconds if you go above 5 retries, but it would be better to increase the minimum value of the backoff and allow the user to control it. This way, not as many retries are required and we can still get the wait time desired.

You always set Min of the Backoff to 100ms here and don't give the user the option to set it, so the first 5 backoffs will always be 100ms, 200ms, 400ms, 800ms, and 1.6s if the server goes down. This also means that if the user sets --max-retry-count to a value between 0 - 5 and the server goes down, the client will retry for ~2 seconds before exiting and the user will lose their chisel connection due to the extremely short wait times. Original:

image

After fix (and --min-retry-interval set to 1m):

image

You can do your own testing to verify what I am reporting. Start a server, connect a client, kill the server with the client still connected and you'll see the first 5 backoffs will always be 100ms, 200ms, 400ms, 800ms, 1.6s, etc. with no ability to customize.

I would classify this as a bug due to the unintentional nature of this behavior. You don't document this anywhere and I'm pretty sure a lot of people are unaware unless they take a look at the code surrounding the backoff logic. If you don't plan on fixing, you should document it is a known issue so people at least have the heads up. The logic might work for a development branch or testing, but when it comes to production version of the code, the user should be able to customize this behavior relating to the backoff.

fin3ss3g0d avatar Oct 24 '24 21:10 fin3ss3g0d

If not exposed as a client flag, I believe enabling Jitter by default would be a good thing as well: b := &backoff.Backoff{Jitter: true, Max: c.config.MaxRetryInterval, Min: c.config.MinRetryInterval}

liojacqs avatar Nov 04 '24 14:11 liojacqs

If not exposed as a client flag, I believe enabling Jitter by default would be a good thing as well: b := &backoff.Backoff{Jitter: true, Max: c.config.MaxRetryInterval, Min: c.config.MinRetryInterval}

I'll add this to another commit @liojacqs soon 😄

fin3ss3g0d avatar Nov 04 '24 14:11 fin3ss3g0d

@liojacqs added the jitter option as a bool flag. @jpillora any updated thoughts so far on this PR?

fin3ss3g0d avatar Nov 07 '24 00:11 fin3ss3g0d

Just one thing on the first commit, for the default minimum retry inteval: image

This code makes it impossible to have <1s retry interval other than 100ms, and I would personnaly add more flexibility, like:

	if c.MinRetryInterval <= 0 {
		c.MinRetryInterval = 100 * time.Millisecond
	}

That way the default will remain 100ms, but we can also set something like 1ms/250ms/600ms....

liojacqs avatar Nov 08 '24 09:11 liojacqs