cloudstack icon indicating copy to clipboard operation
cloudstack copied to clipboard

Flexible timeout for commands MS sends to the agents

Open GutoVeronezi opened this issue 1 year ago • 22 comments

Description

Currently, almost every command MS sends to the agents uses the global setting wait as the timeout's value. That means, if you want to extend or shorten the execution of a command, you will affect every other command. Therefore, currently we do not have a way to customize the timeout for specific commands.

For some edge cases, we have "wait"/"timeout" hardcoded values or values that come from global settings to set the commands timeout, like for GetAutoScaleMetricsCommand and for MigrateCommand. Currently, we have 385 classes that extend com.cloud.agent.api.Command (I skipped classes that extend com.cloud.agent.api.Answer); therefore, externalizing global settings for each command to provide flexibility is not feasible; however, we can reach flexibility by other means.

Currently, if we do not set a wait to the command, AgentManagerImpl will fallback to the wait global setting:

https://github.com/apache/cloudstack/blob/af8a582055c4194d1e42c8c1abd96969692046ed/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java#L441-L443

I propose to create a table of command_timeout to store the classpath of the command and its wait/timeout. Then, instead of fall backing directly to the wait global setting, we check if there is an entry of the command in this table. If it exists, then we use the entry's value; otherwise, we fall back to the wait global setting (the current behavior). Thus, we can change the commands wait/timeout on demand and in a flexible way.

The first step of the proposal is to create the table and changing AgentManagerImpl to check the table. Some entries will already be created for hardcoded values. Further steps can be found in #8506.

Types of changes

  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] Enhancement (improves an existing feature and functionality)
  • [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
  • [ ] build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • [ ] Major
  • [x] Minor

How Has This Been Tested?

TBD

GutoVeronezi avatar Feb 03 '24 00:02 GutoVeronezi

Codecov Report

Attention: Patch coverage is 37.50000% with 25 lines in your changes missing coverage. Please review.

Project coverage is 15.28%. Comparing base (19f9635) to head (6d44719). Report is 15 commits behind head on main.

Files Patch % Lines
...ck/framework/config/dao/CommandTimeoutDaoImpl.java 0.00% 13 Missing :warning:
...dstack/framework/config/impl/CommandTimeoutVO.java 0.00% 7 Missing :warning:
...java/com/cloud/agent/manager/AgentManagerImpl.java 78.94% 4 Missing :warning:
...rc/main/java/com/cloud/agent/manager/Commands.java 0.00% 1 Missing :warning:
Additional details and impacted files
@@             Coverage Diff             @@
##               main    #8605     +/-   ##
===========================================
  Coverage     15.28%   15.28%             
- Complexity    11535    11538      +3     
===========================================
  Files          5424     5426      +2     
  Lines        474334   474361     +27     
  Branches      60970    58437   -2533     
===========================================
+ Hits          72498    72503      +5     
- Misses       393778   393804     +26     
+ Partials       8058     8054      -4     
Flag Coverage Δ
uitests 4.25% <ø> (ø)
unittests 16.02% <37.50%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 03 '24 00:02 codecov[bot]

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Feb 05 '24 08:02 github-actions[bot]

@blueorangutan package

GutoVeronezi avatar Feb 07 '24 17:02 GutoVeronezi

@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Feb 07 '24 17:02 blueorangutan

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Feb 19 '24 09:02 github-actions[bot]

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Feb 26 '24 14:02 github-actions[bot]

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Mar 08 '24 10:03 github-actions[bot]

Guys, as a cloud operator (user of ACS, operator) - I don't like the idea how we do this:

  • For a developer, it would be fine - you can look for commands, classes etc
  • For a cloud operator, using this in this way - will be a nightmare and is error prone

If I got it wrong, please correct me - my understanding is we have an "empty" table/configuration list to start from, and then we expect the cloud operator to add specific command names and timeouts to this table?

That being said, I'm struggling to think of a better way, except perhaps somehow having a prepopulated list of commands in that table and very detailed documentation - to ensure that this (absolutely needed feature...) can be used by a non-developer.

andrijapanicsb avatar Mar 26 '24 17:03 andrijapanicsb

Guys, as a cloud operator (user of ACS, operator) - I don't like the idea how we do this:

  • For a developer, it would be fine - you can look for commands, classes etc
  • For a cloud operator, using this in this way - will be a nightmare and is error prone

If I got it wrong, please correct me - my understanding is we have an "empty" table/configuration list to start from, and then we expect the cloud operator to add specific command names and timeouts to this table?

That being said, I'm struggling to think of a better way, except perhaps somehow having a prepopulated list of commands in that table and very detailed documentation - to ensure that this (absolutely needed feature...) can be used by a non-developer.

@andrijapanicsb, please check https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687.

GutoVeronezi avatar Mar 26 '24 19:03 GutoVeronezi

Hey @GutoVeronezi , I did read it - I see we can initially add some commands in phase1 later use APIs etc - this sounds overwhelmingly complex for a non-developer human - this is my concern - how manageable this will be? In theory, we need to document every command/class (what it is, what it does, etc) etc - which doesn't sound feasible - so that an operator can add it. And this - again - is developer-friendly, but not human/user/operator-friendly (my take on it, thus my commensts here)

Does it also cover (and again, I'm not a developer, so apologies if asking a silly one) specific timeouts for e.g. start a VM, configuring DHCP reservations inside a VR etc?

To give you a use-case: I'm an operator, and I want to be able to easily configure timeout for some basic stuff, e.g.:

  • deploy a VM
  • starting a VM
  • stopping a VM
  • attach/detach a volume
  • DHCP programming of a VR while the VM is starting
  • etc

I'm not sure if my "expectations" (described here ^^^) are a good fit for this specific PR, as I see this PR/issue is related to the agent-side command timeouts (e.g. there is no StartVMCmdByAdmin on the list of command/classes which you provided in the https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687 - which is a management server command)

andrijapanicsb avatar Mar 27 '24 16:03 andrijapanicsb

@andrijapanicsb, I made some changes in the PR's title and description to make the proposal clearer.

To put us on the same page, the proposal with this PR is to flexibilize the timeout of the commands MS sends to the agents. It was never aimed to provide API calls for operators/users.

Currently, almost every command MS sends to the agents uses the global setting wait as the timeout's value. That means, if you want to extend or shorten the execution of a command, you will affect every other command. Therefore, currently we do not have a way to customize the timeout for specific commands.

For some edge cases, there are specific global settings, like kvm.storage.online.migration.wait and kvm.storage.offline.migration.wait. Externalizing global settings for each command to provide flexibility is not interesting, as we would increase the number of global settings by almost 50%(we would also have the work to create and describe every setting, which you mentioned "doesn't sound feasible"). Furthermore, if we continue working with global settings, each command timeout externalization would require the process of (i) providing the use case; (ii) extending the platform; and (iii) waiting for a release. We know that process can take several months and, if you are an operator, you know that waiting months for a simple timeout configuration is not feasible.

The proposal is to create a mechanism to allow operators to define the timeout of the commands MS sends to the agents dynamically and in a flexible fashion. The first phase is focused on the definition of the concept; thus, it is expected to be more developer-friendly at first. However, we already have the next steps planned, as you can see in https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687, and those will make the feature user/operator-friendly (via API and GUI).

Indeed, the process of documenting all the commands will require some work, but we have contributors who are willing to do that (me and some others); thus, as it will not require efforts from the operators to document them, I do not see the problem.

It is also important to mention that it is a new feature and the current behavior is kept; thus, you do not need to use the feature if you do not want.

If you have another effortless and more straightforward proposal that addresses all of the use cases we are talking about, I would be glad to discuss it with you.

Regarding the timeout of the API calls the users do the MS, as they are a different context, we can create another issue to discuss it.

GutoVeronezi avatar Mar 27 '24 18:03 GutoVeronezi

@GutoVeronezi thx for the detailed explanation on this one, I guess we'll have to go with that - I'm now more clear on the idea, and, indeed, it doesn't align with what I was "expecting" to be able to do as on operator, that would have to be a separate effort/PR

andrijapanicsb avatar Mar 29 '24 11:03 andrijapanicsb

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar May 30 '24 14:05 github-actions[bot]

@GutoVeronezi do we need more work on this? (except for conflict resolution)

DaanHoogland avatar Jun 04 '24 08:06 DaanHoogland

@GutoVeronezi do we need more work on this? (except for conflict resolution)

I think we are good with the code; however, I want to try it a bit more and post the tests.

GutoVeronezi avatar Jun 04 '24 15:06 GutoVeronezi

@blueorangutan package

DaanHoogland avatar Jun 05 '24 07:06 DaanHoogland

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

blueorangutan avatar Jun 05 '24 07:06 blueorangutan

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9787

blueorangutan avatar Jun 05 '24 08:06 blueorangutan

@blueorangutan test

DaanHoogland avatar Jun 06 '24 06:06 DaanHoogland

@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

blueorangutan avatar Jun 06 '24 06:06 blueorangutan

[SF] Trillian test result (tid-10370) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 49632 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8605-t10370-kvm-centos7.zip Smoke tests completed. 132 look OK, 0 have errors, 0 did not run Only failed and skipped tests results shown below:

Test Result Time (s) Test File

blueorangutan avatar Jun 06 '24 21:06 blueorangutan

This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.

github-actions[bot] avatar Jun 10 '24 05:06 github-actions[bot]