cloudstack
cloudstack copied to clipboard
Flexible timeout for commands MS sends to the agents
Description
Currently, almost every command MS sends to the agents uses the global setting wait
as the timeout's value. That means, if you want to extend or shorten the execution of a command, you will affect every other command. Therefore, currently we do not have a way to customize the timeout for specific commands.
For some edge cases, we have "wait"/"timeout" hardcoded values or values that come from global settings to set the commands timeout, like for GetAutoScaleMetricsCommand
and for MigrateCommand
. Currently, we have 385 classes that extend com.cloud.agent.api.Command
(I skipped classes that extend com.cloud.agent.api.Answer
); therefore, externalizing global settings for each command to provide flexibility is not feasible; however, we can reach flexibility by other means.
Currently, if we do not set a wait
to the command, AgentManagerImpl
will fallback to the wait
global setting:
https://github.com/apache/cloudstack/blob/af8a582055c4194d1e42c8c1abd96969692046ed/engine/orchestration/src/main/java/com/cloud/agent/manager/AgentManagerImpl.java#L441-L443
I propose to create a table of command_timeout
to store the classpath of the command and its wait/timeout. Then, instead of fall backing directly to the wait
global setting, we check if there is an entry of the command in this table. If it exists, then we use the entry's value; otherwise, we fall back to the wait
global setting (the current behavior). Thus, we can change the commands wait/timeout on demand and in a flexible way.
The first step of the proposal is to create the table and changing AgentManagerImpl
to check the table. Some entries will already be created for hardcoded values. Further steps can be found in #8506.
Types of changes
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] Enhancement (improves an existing feature and functionality)
- [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
- [ ] build/CI
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
- [ ] Major
- [x] Minor
How Has This Been Tested?
TBD
Codecov Report
Attention: Patch coverage is 37.50000%
with 25 lines
in your changes missing coverage. Please review.
Project coverage is 15.28%. Comparing base (
19f9635
) to head (6d44719
). Report is 15 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #8605 +/- ##
===========================================
Coverage 15.28% 15.28%
- Complexity 11535 11538 +3
===========================================
Files 5424 5426 +2
Lines 474334 474361 +27
Branches 60970 58437 -2533
===========================================
+ Hits 72498 72503 +5
- Misses 393778 393804 +26
+ Partials 8058 8054 -4
Flag | Coverage Δ | |
---|---|---|
uitests | 4.25% <ø> (ø) |
|
unittests | 16.02% <37.50%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@blueorangutan package
@GutoVeronezi a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
Guys, as a cloud operator (user of ACS, operator) - I don't like the idea how we do this:
- For a developer, it would be fine - you can look for commands, classes etc
- For a cloud operator, using this in this way - will be a nightmare and is error prone
If I got it wrong, please correct me - my understanding is we have an "empty" table/configuration list to start from, and then we expect the cloud operator to add specific command names and timeouts to this table?
That being said, I'm struggling to think of a better way, except perhaps somehow having a prepopulated list of commands in that table and very detailed documentation - to ensure that this (absolutely needed feature...) can be used by a non-developer.
Guys, as a cloud operator (user of ACS, operator) - I don't like the idea how we do this:
- For a developer, it would be fine - you can look for commands, classes etc
- For a cloud operator, using this in this way - will be a nightmare and is error prone
If I got it wrong, please correct me - my understanding is we have an "empty" table/configuration list to start from, and then we expect the cloud operator to add specific command names and timeouts to this table?
That being said, I'm struggling to think of a better way, except perhaps somehow having a prepopulated list of commands in that table and very detailed documentation - to ensure that this (absolutely needed feature...) can be used by a non-developer.
@andrijapanicsb, please check https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687.
Hey @GutoVeronezi , I did read it - I see we can initially add some commands in phase1 later use APIs etc - this sounds overwhelmingly complex for a non-developer human - this is my concern - how manageable this will be? In theory, we need to document every command/class (what it is, what it does, etc) etc - which doesn't sound feasible - so that an operator can add it. And this - again - is developer-friendly, but not human/user/operator-friendly (my take on it, thus my commensts here)
Does it also cover (and again, I'm not a developer, so apologies if asking a silly one) specific timeouts for e.g. start a VM, configuring DHCP reservations inside a VR etc?
To give you a use-case: I'm an operator, and I want to be able to easily configure timeout for some basic stuff, e.g.:
- deploy a VM
- starting a VM
- stopping a VM
- attach/detach a volume
- DHCP programming of a VR while the VM is starting
- etc
I'm not sure if my "expectations" (described here ^^^) are a good fit for this specific PR, as I see this PR/issue is related to the agent-side command timeouts (e.g. there is no StartVMCmdByAdmin on the list of command/classes which you provided in the https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687 - which is a management server command)
@andrijapanicsb, I made some changes in the PR's title and description to make the proposal clearer.
To put us on the same page, the proposal with this PR is to flexibilize the timeout of the commands MS sends to the agents. It was never aimed to provide API calls for operators/users.
Currently, almost every command MS sends to the agents uses the global setting wait
as the timeout's value. That means, if you want to extend or shorten the execution of a command, you will affect every other command. Therefore, currently we do not have a way to customize the timeout for specific commands.
For some edge cases, there are specific global settings, like kvm.storage.online.migration.wait
and kvm.storage.offline.migration.wait
. Externalizing global settings for each command to provide flexibility is not interesting, as we would increase the number of global settings by almost 50%(we would also have the work to create and describe every setting, which you mentioned "doesn't sound feasible"). Furthermore, if we continue working with global settings, each command timeout externalization would require the process of (i) providing the use case; (ii) extending the platform; and (iii) waiting for a release. We know that process can take several months and, if you are an operator, you know that waiting months for a simple timeout configuration is not feasible.
The proposal is to create a mechanism to allow operators to define the timeout of the commands MS sends to the agents dynamically and in a flexible fashion. The first phase is focused on the definition of the concept; thus, it is expected to be more developer-friendly at first. However, we already have the next steps planned, as you can see in https://github.com/apache/cloudstack/issues/8506#issuecomment-1924648687, and those will make the feature user/operator-friendly (via API and GUI).
Indeed, the process of documenting all the commands will require some work, but we have contributors who are willing to do that (me and some others); thus, as it will not require efforts from the operators to document them, I do not see the problem.
It is also important to mention that it is a new feature and the current behavior is kept; thus, you do not need to use the feature if you do not want.
If you have another effortless and more straightforward proposal that addresses all of the use cases we are talking about, I would be glad to discuss it with you.
Regarding the timeout of the API calls the users do the MS, as they are a different context, we can create another issue to discuss it.
@GutoVeronezi thx for the detailed explanation on this one, I guess we'll have to go with that - I'm now more clear on the idea, and, indeed, it doesn't align with what I was "expecting" to be able to do as on operator, that would have to be a separate effort/PR
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.
@GutoVeronezi do we need more work on this? (except for conflict resolution)
@GutoVeronezi do we need more work on this? (except for conflict resolution)
I think we are good with the code; however, I want to try it a bit more and post the tests.
@blueorangutan package
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9787
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests
[SF] Trillian test result (tid-10370) Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7 Total time taken: 49632 seconds Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8605-t10370-kvm-centos7.zip Smoke tests completed. 132 look OK, 0 have errors, 0 did not run Only failed and skipped tests results shown below:
Test | Result | Time (s) | Test File |
---|
This pull request has merge conflicts. Dear author, please fix the conflicts and sync your branch with the base branch.