Low: nfsserver: more appropriate default timeouts

Open davidvossel opened this issue 10 years ago • 11 comments

davidvossel avatar Apr 29 '15 15:04 davidvossel

It would be good to know the motivation behind extending timeouts.

dmuhamedagic avatar May 05 '15 17:05 dmuhamedagic

@dmuhamedagic In some environments that use clvmd + clustered volume groups for shared storage, we noticed that it was possible for nfs to take longer than the default 40 second timeout to start for the first time. This is a common deployment, so we'd like the default timeout to "just work" for most people.

davidvossel avatar May 06 '15 14:05 davidvossel

What about the stop and monitor timeouts?

It is not a big issue to increase the defaults, but we need to understand what's behind the change.

dmuhamedagic avatar May 06 '15 14:05 dmuhamedagic

Default timeouts should err on the side of being too conservative. Using a 20s timeout for any nfs action is too aggressive. The nfs monitor action is low impact, so I used 30s (which I'd consider to be the lowest default timeout I'd feel comfortable imposing on people). Stop is a bit more involved, so I chose 60s.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover". It is more difficult and time consuming to field support questions that consist of "why is my resource timing out" only to realize in their specific setup they need a more conservative timeout period. I've had to deal with this more than I'd like over the last few years so I'm beginning to think more conservative timeouts make better defaults.

davidvossel avatar May 06 '15 15:05 davidvossel

On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too conservative.

That makes two of us. I'm all for setting longer timeouts.

Using a 20s timeout for any nfs action is too aggressive. The nfs monitor action is low impact, so I used 30s (which I'd consider to be the lowest default timeout I'd feel comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing special happening in the operation which justifies extending that, then 20s should be the default. Can you please give the reason why the default needs to be extended here?

Stop is a bit more involved, so I chose 60s.

Again, this doesn't give us the reason.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover".

Normally, tighter timeouts don't result in faster failover.

It is more difficult and time consuming to field support questions that consist of "why is my resource timing out" only to realize in their specific setup they need a more conservative timeout period.

Timeouts often need to be defined by the user. Note that these defaults are actually not observed by pacemaker, but, optionally, by various UI.

I've had to deal with this more than I'd like over the last few years so I'm beginning to think more conservative timeouts make better defaults.

More conservative timeouts are IMO always better and I can understand your sentiment here quite well. But we still had to put the line somewhere. If we need to cross that line, we should give some justification which pertains to the nature of the RA and the actual actions performed within that RA. Note also that the default timeout should be the minimum advisable timeout for that particular kind of resource (but never less than 20s).

dmuhamedagic avatar May 06 '15 16:05 dmuhamedagic

----- Original Message -----

On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too conservative.

That makes two of us. I'm all for setting longer timeouts.

Excellent. I still can't tell whether you're arguing for or against this change, though.

Using a 20s timeout for any nfs action is too aggressive. The nfs monitor action is low impact, so I used 30s (which I'd consider to be the lowest default timeout I'd feel comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing special happening in the operation which justifies extending that, then 20s should be the default. Can you please give the reason why the default needs to be extended here?

I disagree with pacemaker's default timeout of 20s.

Stop is a bit more involved, so I chose 60s.

Again, this doesn't give us the reason.

Because 60s is more conservative than 40s. This is philosophical, not technical.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover".

Normally, tighter timeouts don't result in faster failover.

Yes, normally it's a shorter interval that results in faster failover, but that's not always the case.

It is more difficult and time consuming to field support questions that consist of "why is my resource timing out" only to realize in their specific setup they need a more conservative timeout period.

Timeouts often need to be defined by the user. Note that these defaults are actually not observed by pacemaker, but, optionally, by various UI.

right.

I've had to deal with this more than I'd like over the last few years so I'm beginning to think more conservative timeouts make better defaults.

More conservative timeouts are IMO always better and I can understand your sentiment here quite well. But we still had to put the line somewhere. If we need to cross that line, we should give some justification which pertains to the nature of the RA and the actual actions performed within that RA. Note also that the default timeout should be the minimum advisable timeout for that particular kind of resource (but never less than 20s).

I'm really not interested in defending anything other than the start timeout change. I was in the area and decided that we should be advertising more conservative timeout periods in the metadata for other actions as well. Honestly, if there's any pushback here, I don't care enough (or feel strongly enough) about the non-start default timeout changes to discuss it further.


Reply to this email directly or view it on GitHub: https://github.com/ClusterLabs/resource-agents/pull/607#issuecomment-99534943

davidvossel avatar May 11 '15 16:05 davidvossel

On Mon, May 11, 2015 at 09:29:16AM -0700, David Vossel wrote:

----- Original Message -----

On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too conservative.

That makes two of us. I'm all for setting longer timeouts.

Excellent. I still can't tell whether you're arguing for or against this change, though.

Well, both. We just need to justify the change.

Using a 20s timeout for any nfs action is too aggressive. The nfs monitor action is low impact, so I used 30s (which I'd consider to be the lowest default timeout I'd feel comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing special happening in the operation which justifies extending that, then 20s should be the default. Can you please give the reason why the default needs to be extended here?

I disagree with pacemaker's default timeout of 20s.

Most of the time it's fine. So, I guess that it is fine as a default too. But if you disagree, why not raise the issue on the ML?

Stop is a bit more involved, so I chose 60s.

Again, this doesn't give us the reason.

Because 60s is more conservative than 40s. This is philosophical, not technical.

I'd say that it is really technical. It's about knowing the RA and what it does, and estimating how much time particular commands in the particular operation's path may take.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover".

Users should never set timeouts lower than the RA defaults.

Normally, tighter timeouts don't result in faster failover.

Yes, normally it's a shorter interval that results in faster failover, but that's not always the case.

It is more difficult and time consuming to field support questions that consist of "why is my resource timing out" only to realize in their specific setup they need a more conservative timeout period.

Timeouts often need to be defined by the user. Note that these defaults are actually not observed by pacemaker, but, optionally, by various UI.

right.

I've had to deal with this more than I'd like over the last few years so I'm beginning to think more conservative timeouts make better defaults.

More conservative timeouts are IMO always better and I can understand your sentiment here quite well. But we still had to put the line somewhere. If we need to cross that line, we should give some justification which pertains to the nature of the RA and the actual actions performed within that RA. Note also that the default timeout should be the minimum advisable timeout for that particular kind of resource (but never less than 20s).

I'm really not interested in defending anything other than the start timeout change. I was in the area and decided that we should be advertising more conservative timeout periods in the metadata for other actions as well. Honestly, if there's any pushback here, I don't care enough (or feel strongly enough) about the non-start default timeout changes to discuss it further.

The defaults should be conservative, but not more conservative than necessary. And to stress again:

The default timeout is the _minimum_ advisable timeout for
that particular kind of resource (but never less than 20s).

Further, once we increase defaults for the existing RA, the working configurations will suddenly produce warnings about insufficient operation timeouts. That wouldn't make a good impression.
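The concern above can be made concrete with a minimal sketch of the kind of check a configuration tool might run. This is an illustration only, not the code of any real tool: the metadata fragments are hypothetical, the helper names `parse_seconds`, `advertised_timeouts` and `timeout_warnings` are invented here, and only the monitor values (20s raised to 30s) are taken from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragments. Only the monitor values (20s -> 30s)
# come from this discussion; everything else is an assumption.
OLD_METADATA = """<resource-agent name="nfsserver">
  <actions>
    <action name="monitor" timeout="20s"/>
  </actions>
</resource-agent>"""
NEW_METADATA = OLD_METADATA.replace('timeout="20s"', 'timeout="30s"')


def parse_seconds(value):
    """Convert a duration such as '30s' or '30' into whole seconds."""
    return int(value.rstrip("s"))


def advertised_timeouts(metadata_xml):
    """Return {action name: advertised timeout in seconds}."""
    root = ET.fromstring(metadata_xml)
    return {
        action.get("name"): parse_seconds(action.get("timeout"))
        for action in root.iter("action")
        if action.get("timeout") is not None
    }


def timeout_warnings(configured, metadata_xml):
    """List operations whose configured timeout is below the advertised one."""
    advertised = advertised_timeouts(metadata_xml)
    return [
        f"{name}: configured {seconds}s is below the advertised {advertised[name]}s"
        for name, seconds in sorted(configured.items())
        if name in advertised and seconds < advertised[name]
    ]


# A configuration written against the old advertised value.
existing_config = {"monitor": 20}
print(timeout_warnings(existing_config, OLD_METADATA))  # no warnings
print(timeout_warnings(existing_config, NEW_METADATA))  # one warning for monitor
```

The configuration itself did not change between the two calls; only the advertised value did, which is why a previously clean setup would start producing warnings.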

dmuhamedagic avatar May 12 '15 06:05 dmuhamedagic

----- Original Message -----

The defaults should be conservative, but not more conservative than necessary. And to stress again:

The default timeout is the minimum advisable timeout for that particular kind of resource (but never less than 20s).

That's interesting. I didn't realize that's how we documented this in the metadata, and I'm not sure I agree with it.

If we're advertising something as being an "okay" value to use why would we give the absolute minimum value? The minimum value represents the most aggressive timing we consider safe. In the case of nfsserver there are far too many variables involved for us to advertise a safe minimum value. What is safe for 90% of users might not be safe for another 10% of users. If we raise the minimum value just to account for the 10% use cases, then we're telling the other 90% of people that they should never go below our advertised minimum value even though in reality it would be safe.

The minimum values for some agents' actions vary so drastically between deployments that it would be impractical for us to even attempt to recommend a minimum.

Take the galera or redis agents for example. A galera promotion involves syncing a galera instance with another active galera instance in the cluster... How could I give a minimum value that makes any sense for that? The timing period depends on network speed, how large the database is, and potentially how loaded the donor galera instance is. The minimum value for a small database could actually be 20 seconds... but in practice we're seeing it can take nearly 300s in the real world. In this case, the minimum timeout of 20s would work for probably 1% of users, the 300s timeout would work for around 90% of users, and out of that 90% most of them could tighten up the timeout value by entire minutes.

For galera I advertised promote timeout as 300s because I just want people to be able to use these agents and for them to work.
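To illustrate the difference between a "minimum that works for someone" and a "default that works for most", here is a rough sketch that picks a value from a set of observed durations using a nearest-rank percentile. The durations are made up for illustration and are not measurements from any real galera deployment; the function name `nearest_rank_percentile` is also just an assumption of this sketch.

```python
def nearest_rank_percentile(samples, percent):
    """Return the smallest sample that covers `percent` percent of all samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ordered = sorted(samples)
    # Integer ceiling of percent * n / 100, avoiding floating point.
    rank = -(-percent * len(ordered) // 100)
    return ordered[rank - 1]


# Made-up promote durations in seconds, one per imaginary deployment.
observed = [15, 40, 75, 90, 120, 150, 180, 210, 260, 290]

fastest = min(observed)                          # works only for the quickest setup
covers_most = nearest_rank_percentile(observed, 90)

print(fastest)      # 15
print(covers_most)  # 260
```

With numbers like these, advertising the fastest observed value would time out for nearly everyone, while the 90th-percentile value works for most deployments and still leaves many of them room to tighten it.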

Further, once we increase defaults for the existing RA, the working configurations will suddenly produce warnings about insufficient operation timeouts. That wouldn't make a good impression.



davidvossel avatar May 15 '15 16:05 davidvossel

Take the galera or redis agents for example. A galera promotion involves syncing a galera instance with another active galera instance in the cluster... How could I give a minimum value that makes any sense for that?

On this note, I'd argue for requiring explicit configuration of all timeouts. As a compromise, the defaults should be pessimistic rather than optimistic. (To sum up, I agree with both of you).

krig avatar May 18 '15 19:05 krig

On Fri, May 15, 2015 at 09:45:49AM -0700, David Vossel wrote:

That's interesting. I didn't realize that's how we documented this in the metadata, and I'm not sure I agree with it.

If we're advertising something as being an "okay" value to use why would we give the absolute minimum value?

To allow users to make better estimates for their installations.

The minimum value represents the most aggressive timing we consider safe.

Yes. For some "typical" setup. What counts as a "typical" setup is up to the RA author to decide. After all, they should have the necessary expertise.

In the case of nfsserver there are far too many variables involved for us to advertise a safe minimum value. What is safe for 90% of users might not be safe for another 10% of users. If we raise the minimum value just to account for the 10% use cases, then we're telling the other 90% of people that they should never go below our advertised minimum value even though in reality it would be safe.

The minimum values for some agents' actions vary so drastically between deployments that it would be impractical for us to even attempt to recommend a minimum.

Take the galera or redis agents for example. A galera promotion involves syncing a galera instance with another active galera instance in the cluster... How could I give a minimum value that makes any sense for that? The timing period depends on network speed, how large the database is, and potentially how loaded the donor galera instance is. The minimum value for a small database could actually be 20 seconds... but in practice we're seeing it can take nearly 300s in the real world. In this case, the minimum timeout of 20s would work for probably 1% of users, the 300s timeout would work for around 90% of users, and out of that 90% most of them could tighten up the timeout value by entire minutes.

For galera I advertised promote timeout as 300s because I just want people to be able to use these agents and for them to work.

Yes, it is very difficult to make estimates for some agents.

dmuhamedagic avatar May 22 '15 15:05 dmuhamedagic

On Mon, May 18, 2015 at 12:26:47PM -0700, Kristoffer Grönlund wrote:

Take the galera or redis agents for example. A galera promotion involves syncing a galera instance with another active galera instance in the cluster... How could I give a minimum value that makes any sense for that?

On this note, I'd argue for requiring explicit configuration of all timeouts.

I can see your point, and that's certainly true for things such as databases or resources which depend on the network. It is up to the user to set those timeouts depending on their environment. Otherwise, setting timeouts for everything would probably make the configuration even more unreadable than it is.

Perhaps we need a special value for some defaults: SET_THIS_ONE_YOURSELF.
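As a purely hypothetical sketch of how such a sentinel could behave in a configuration tool: nothing in this thread says the OCF metadata format or any existing tool supports it, and the `ADVERTISED` table, its values, and the `resolve_timeouts` helper are all invented for illustration.

```python
# Sentinel name taken from the suggestion above; its handling is an assumption.
SET_THIS_ONE_YOURSELF = "SET_THIS_ONE_YOURSELF"

# Hypothetical advertised defaults for an imaginary agent, in seconds.
ADVERTISED = {"monitor": 30, "promote": SET_THIS_ONE_YOURSELF}


def resolve_timeouts(advertised, configured):
    """Merge user-configured timeouts with advertised defaults.

    Raises ValueError when an action marked with the sentinel has no
    explicit timeout in the user's configuration.
    """
    resolved = {}
    missing = []
    for action, default in advertised.items():
        if action in configured:
            resolved[action] = configured[action]
        elif default == SET_THIS_ONE_YOURSELF:
            missing.append(action)
        else:
            resolved[action] = default
    if missing:
        raise ValueError(
            "timeout must be set explicitly for: " + ", ".join(sorted(missing))
        )
    return resolved


# The user supplied a promote timeout, so the merge succeeds.
print(resolve_timeouts(ADVERTISED, {"promote": 300}))

# Without it, the tool refuses instead of silently picking a value.
try:
    resolve_timeouts(ADVERTISED, {})
except ValueError as error:
    print(error)
```

The design choice here is that only the actions whose duration depends heavily on the environment demand an explicit value, so the rest of the configuration stays as short as it is today.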

As a compromise, the defaults should be pessimistic rather than optimistic. (To sum up, I agree with both of you).

:)

dmuhamedagic avatar May 22 '15 15:05 dmuhamedagic