Load-Balancer: Add a "Balance VCPUs" option in performance plan
Context
- xo-server 5.70.0, xo-web 5.74.0
- Considering pool1 with HOST1 (50% CPU) and HOST2 (1% CPU), and the following VMs:
- VM1 (1 vCPU / 2 GB) on HOST1, VM2 (4 vCPUs / 8 GB) on HOST2, and VM3 (4 vCPUs / 8 GB) on HOST2
I configure a load-balancing plan in performance mode on the pool. No critical thresholds are defined.
Expected behavior
I have two big VMs on the same host.
One should be moved to the other host, to end up with:
- HOST1: 25% CPU
- HOST2: 25% CPU
Current behavior
No migration happens....
I found that I have to define a threshold....
Why is this required?
It should be smart and automatically move something when the CPU is used too much.
I had to define a threshold of 50%, but I shouldn't have to do that; shouldn't it be computed live and automatically?
The objective is to live migrate VMs to have a perfect balance between hosts, not to load balance only when the hypervisor is about to die because of a high threshold, am I right?
Because currently I configured:
- performance mode
- thresholds: CPU 50%
But if:
- HOST1: 35%
- HOST2: 0%
nothing happens.
That is not normal.
Third example
- Host1: 90%
- Host2: 89%
What is going to happen? Will it migrate in a loop forever?
Can you please give more detail about this in the documentation?
Best regards
Hi, @Wescoeur is assigned, he'll take a look when he can.
I found that I have to define a threshold.... Why is this required? It should be smart and automatically move something when the CPU is used too much.
What's the meaning of "too much"? Personally I don't have the answer: it's subjective. That is why we leave the choice to the user. By default, using the performance mode, the VMs are migrated when the CPU usage is 80% or higher.
I had to define a threshold of 50%, but I shouldn't have to do that; shouldn't it be computed live and automatically?
Can you provide a way to compute the threshold automatically? Or do you have an example?
The objective is to live migrate VMs to have a perfect balance between hosts, not to load balance only when the hypervisor is about to die because of a high threshold, am I right?
There are two goals:
- In performance mode, the CPU/RAM must be below a threshold to give the best overall performance.
- In density mode, the objective is to use as few hosts as possible and to concentrate your VMs. After that, you can shut down unused hosts.
There is no "perfect balance", why start a migration if CPU usage is below 50% in your case? Not all host resources are used yet. If the 50% limit is not reached, it's useless to migrate.
HOST1: 35%, HOST2: 0%. Nothing happens. It's not normal.
As I said earlier, it's normal. But if the limit had been reached, with one VM representing 40% of the CPU usage and another one 20% (so a total CPU usage of 60%), then the VM with the lower CPU usage would have been migrated.
Host1: 90%, Host2: 89%. What is going to happen? Will it migrate in a loop forever?
No. In performance mode a migration is never executed if there is no benefit. For example, in this case, migrating a VM using 5% of the CPU would create an imbalance, so the VM is not moved. FYI, a VM can only be moved from a host "A" to a host "B" if the CPU usage of "A" remains higher than that of "B" after the migration. This test is performed before the migration: https://github.com/vatesfr/xen-orchestra/blob/6973b92c4acf771700f365c35aad5e5665745fef/packages/xo-server-load-balancer/src/performance-plan.js#L125-L127
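To make that rule concrete, here is a rough sketch (the function name and variables are assumptions for illustration, not the exact code from performance-plan.js linked above):

```js
// Only migrate a VM from source host A to destination host B if A still has
// the higher CPU usage once the VM's own usage is moved over, i.e. the
// migration must not simply invert the imbalance.
function isMigrationWorthIt (srcCpuUsage, dstCpuUsage, vmCpuUsage) {
  return srcCpuUsage - vmCpuUsage >= dstCpuUsage + vmCpuUsage
}

isMigrationWorthIt(90, 89, 5)  // false: 85 < 94, moving the 5% VM would just invert the imbalance
isMigrationWorthIt(60, 10, 20) // true: 40 >= 30, the migration really relieves the source host
```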
Another idea in this plan is to use the CPU statistics of the last 30 minutes with a ratio, to avoid useless migrations when the CPU is used intensely only for a short period of time (in that case it's not useful to migrate): https://github.com/vatesfr/xen-orchestra/blob/6973b92c4acf771700f365c35aad5e5665745fef/packages/xo-server-load-balancer/src/plan.js#L157-L167
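As a minimal sketch of that idea, assuming a simple weighted blend of the current usage and the 30-minute average (the names and exact formula are illustrative, not the actual plan.js implementation):

```js
// Blend the current CPU usage with the average of the last
// MINUTES_OF_HISTORICAL_DATA minutes so that a short spike alone does not
// look like sustained load.
const MINUTES_OF_HISTORICAL_DATA = 30
const CURRENT_USAGE_WEIGHT = 0.75 // hypothetical name for the ratio

// samples: one CPU usage value (0-100) per minute, most recent last
function smoothedCpuUsage (samples) {
  const window = samples.slice(-MINUTES_OF_HISTORICAL_DATA)
  const average = window.reduce((sum, value) => sum + value, 0) / window.length
  const current = samples[samples.length - 1]
  return CURRENT_USAGE_WEIGHT * current + (1 - CURRENT_USAGE_WEIGHT) * average
}

// A host idling at ~10% that spikes to 95% for one minute yields ≈ 74, which
// stays under the default 80% threshold, so no migration is triggered.
console.log(smoothedCpuUsage([...Array(29).fill(10), 95]))
```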
Official doc: https://xen-orchestra.com/docs/load_balancing.html
Hello,
Thank you very much for your explanation!
VMware DRS doesn't ask for a "threshold": we only choose a point between aggressive and passive,
and VMs are automatically moved across hosts to reach such a "perfect" balance.
Sorry, I don't have a way to compute this....
Indeed, there is no point in migrating if the CPU is under 50%; however, the problem is the case where the CPU usage grows very quickly.
The case:
- VM1: 20 cores, 10 GB RAM
- VM2: 20 cores, 10 GB RAM
Currently, if my threshold is 50%, VM2 will be moved once the host CPU is above 50%. But take the case of Black Friday or a TV show: the CPU can grow very fast, a host can end up under heavy load, and the VMs will suffer CPU steal, whereas if the VM had been migrated beforehand, this problem would never occur.
Was that clear for you?
Sorry for my bad English :-)
Conversely, could you maybe propose the reverse of the density mode, to use all hypervisors if possible?
That might answer my problem :-)
I’m not too familiar with VMware DRS. :slightly_smiling_face: I suppose several algorithms are used when the "migration threshold" mode is updated, and the thresholds are set internally (and/or computed using the CPU models and the number of hosts).
However, we can get closer to it on some points:
- If you want an aggressive mode, you can try to use a lower threshold to force migration.*
- We have a similar "AggressiveCPUActive" option (https://blogs.vmware.com/vsphere/2016/05/load-balancing-vsphere-clusters-with-drs.html) in our load balancer, but it's currently hardcoded... As I said in my previous answer, we have this code: https://github.com/vatesfr/xen-orchestra/blob/6973b92c4acf771700f365c35aad5e5665745fef/packages/xo-server-load-balancer/src/plan.js#L157-L167
Currently, if my threshold is 50%, VM2 will be moved once the host CPU is above 50%. But take the case of Black Friday or a TV show: the CPU can grow very fast, a host can end up under heavy load, and the VMs will suffer CPU steal, whereas if the VM had been migrated beforehand, this problem would never occur.
A possible solution could be to make the weight (currently 0.75 in the source code above) and the time interval (currently 30 minutes, see MINUTES_OF_HISTORICAL_DATA) configurable. With a small interval and a big weight, a VM whose CPU is spiky can be migrated more easily.
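As a purely hypothetical illustration of the effect of these two knobs, reusing the same assumed blending formula as in the sketch above (not the real implementation):

```js
// Same spiky load as before: 29 minutes at 10% CPU, then one minute at 95%.
const samples = [...Array(29).fill(10), 95]

const smoothed = (samples, weight, minutes) => {
  const window = samples.slice(-minutes)
  const average = window.reduce((sum, value) => sum + value, 0) / window.length
  return weight * samples[samples.length - 1] + (1 - weight) * average
}

console.log(smoothed(samples, 0.75, 30)) // ≈ 74: current defaults, the spike stays under an 80% threshold
console.log(smoothed(samples, 0.95, 5))  // ≈ 92: small interval + big weight, the spike alone crosses it
```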
*I think the load balancer can be improved :wink:, but for the moment I don't see other solutions to your problem with the current state of the load balancer. You can try with a low threshold; if it's not sufficient for you, we can probably offer the possibility to modify the weight and the time interval.
Hello,
Okay, thank you.
What do you think about my feature proposal? The opposite of the density mode?
With this, I could imagine:
Host1:
- VM1
- VM2
- VM3
- VM4
Host2:
- (empty)
1AM --> 6AM: reverse density mode
VM1 --> Host2
VM2 --> Host2
Every host now has the same number of VMs.
Host1:
- VM3
- VM4
Host2:
- VM1
- VM2
Reverse density mode applied.
6AM --> 1AM: Performance Mode
What do you think about this?
This could answer my need.
@henri9813 The opposite of the density mode could be a good idea, but I think we can add an option directly in the performance mode: we can count the number of vCPUs used by the VMs on each host, and migrate to the hosts with the fewest vCPUs when possible. (So the percentage usage is not used by this algorithm; using the vCPU count, we are sure to balance the VMs correctly, like you want.) In parallel, we must still always respect the thresholds based on the percentage usage.
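A minimal sketch of what such a vCPU-balancing pass could look like (hypothetical names and heuristic, not the implementation that was merged later):

```js
// Count the vCPUs assigned to the VMs of a host.
function vcpuCount (host) {
  return host.vms.reduce((sum, vm) => sum + vm.cpus, 0)
}

// Propose one migration: take the smallest VM of the host with the most
// assigned vCPUs and move it to the host with the fewest, but only if this
// actually reduces the vCPU gap between the two hosts. The usual thresholds
// on real CPU/RAM usage would still be checked in parallel, as said above.
function nextVcpuMigration (hosts) {
  const sorted = [...hosts].sort((a, b) => vcpuCount(b) - vcpuCount(a))
  const source = sorted[0]
  const destination = sorted[sorted.length - 1]

  const vm = [...source.vms].sort((a, b) => a.cpus - b.cpus)[0]
  if (vm === undefined) {
    return undefined
  }

  const gapBefore = vcpuCount(source) - vcpuCount(destination)
  const gapAfter = Math.abs(
    vcpuCount(source) - vm.cpus - (vcpuCount(destination) + vm.cpus)
  )
  return gapAfter < gapBefore ? { vm, source, destination } : undefined
}

// Example from the top of this issue: VM2 and VM3 (4 vCPUs each) on HOST2,
// VM1 (1 vCPU) on HOST1 -> one of the 4-vCPU VMs is proposed for migration.
nextVcpuMigration([
  { id: 'HOST1', vms: [{ id: 'VM1', cpus: 1 }] },
  { id: 'HOST2', vms: [{ id: 'VM2', cpus: 4 }, { id: 'VM3', cpus: 4 }] },
])
```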
Is it clear for you? Do you agree with this proposal?
Hello,
Hmmm, this is very clear, but what you described looks like the reverse of the density mode.
Defining this in the performance plan could be confusing; I think it should have a dedicated plan: "Distributed"?
But your proposal is quite amazing! :D
Best regards
To me, the only difference with "perf" is that it avoids migrating VMs based on performance counters before the VMs are under load. It spreads them to get the best vCPU/CPU ratio on all hosts. It's not incompatible with it, it's more "complementary": you prepare the spread of your VMs based not on their current performance but on their potential (vCPU number and host).
Also, this spread won't be enough on its own; that's why it's only useful to spread VMs when they are under the "load balancing" limit. When there's load, the spread is less relevant because performance counters are what really matters in the end.
It's like "prepositioning" if you prefer.
Hello,
I like the name "prepositioning". So, will it be a dedicated mode, not just a "performance" sub-mode?
Best regards :-)
As I said, it doesn't make sense alone: just "prepositioning" the VMs based on theoretical counters (like the vCPU number) doesn't make sense as soon as you get real load. I mean, what if the VM with the fewest vCPUs on a host is in fact doing all the work (while the others are idle)?
So this feature only makes sense inside a mode based on counters. Otherwise, you'll have a placement that won't reflect the real requirements.
Hello,
After re-reading carefully, I agree with you (my English is limited and I had misunderstood your response).
Best regards :-)
No problem :)
@Wescoeur feel free to create an issue or rename this one, whatever you feel is best.
Hi @henri9813
Can you please point me to a tutorial on how to configure the balancing on my hosts?
I have two hosts under the same XO, and I have a VM that uses 90% of HOST1's resources. I want this VM to also use the resources of HOST2 (CPU & RAM). Is that possible?
No, it's not possible. A single VM can't use resources of 2 hosts.
Hello,
Do you have any news about the planning for this?
Best regards,
Pinging @Wescoeur about this
Hello @henri9813, we have many important tasks to do before that (XCP-ng maintenance, DRBD/Linstor driver improvements, ...). Because it is not a complex problem, I think I can probably look at your problem in detail in a few weeks :wink:.
Don't hesitate to ping me if I don't give any news!
@Wescoeur can brief you, @b-Nollet, so it's easier to get the context and move forward on this related milestone :)
Done in #7333