ceph-balancer
balancer generates pg-upmap that fails verify_upmap
On a fairly large cluster running Ceph 15.2.16, placementoptimizer.py balance generates pg-upmap-items that violate the crush failure domain constraint by mapping multiple OSDs of an EC PG into the same rack. Any ideas on what I am doing wrong would be greatly appreciated.
[root@ceph-admin ~]# ./placementoptimizer.py show
poolid name type size min pg_num stored used avail shrdsize crush
10 fs.meta.mirror repl 3 2 1024 2.1G 2.1G 1.2T 2.1M 6:replicated_chassis_mddm default~mddm*1.000
11 fs.meta.user repl 3 2 1024 2.7G 2.7G 1.2T 2.7M 7:replicated_chassis_mddu default~mddu*1.000
12 fs.data.mirror.ssd.rep repl 3 2 8192 0.0B 0.0B 267.4T 0.0B 1:replicated_chassis_ssd default~ssd*1.000
13 fs.data.mirror.nvme.rep repl 3 2 2048 0.0B 0.0B 53.4T 0.0B 3:replicated_chassis_nvme default~nvme*1.000
14 fs.data.user.ssd.rep repl 3 2 8192 94.4T 94.4T 267.4T 11.8G 1:replicated_chassis_ssd default~ssd*1.000
15 fs.data.user.nvme.rep repl 3 2 2048 0.0B 0.0B 53.4T 0.0B 3:replicated_chassis_nvme default~nvme*1.000
16 fs.data.mirror.ssd.ec ec6+2 8 7 4096 356.8T 356.8T 601.6T 14.9G 2:fs.data.mirror.ssd.ec default~ssd*1.000
17 fs.data.mirror.hdd.ec ec6+2 8 7 4096 1.2P 1.2P 315.0T 53.1G 4:fs.data.mirror.hdd.ec default~hdd*1.000
18 fs.data.user.hdd.ec ec6+2 8 7 4096 3.5G 3.5G 315.0T 148.3K 5:fs.data.user.hdd.ec default~hdd*1.000
20 device_health_metrics repl 3 2 32 3.1G 3.1G 53.4T 99.2M 3:replicated_chassis_nvme default~nvme*1.000
21 fs.data.user.ssd.ec ec6+2 8 7 4096 82.5T 82.5T 601.6T 3.4G 0:fs.data.user.ssd.ec default~ssd*1.000
default~mddm 2.07G 2.07G 1.625%
default~mddu 2.73G 2.73G 1.708%
default~ssd 533.62T 533.62T 40.830%
default~nvme 3.10G 3.10G 0.157%
default~hdd 1.24P 1.24P 51.871%
sum 1.76P 1.76P
While 128 PGs are active+remapped+backfilling from the mgr balancer, I tried running the JJ balancer to see if it can do a better job, since I currently have OSDs ranging from 22% to 82% %USE (a quick way to check that spread is sketched below). However, the generated pg-upmap-items fail the crush failure domain constraint by using two OSDs in the same rack.
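For reference, here is roughly how the per-OSD spread can be checked (purely illustrative, not part of the balancer; it assumes ceph osd df --format json reports per-OSD "utilization" and "kb" fields):

#!/usr/bin/env python3
# Rough sketch: print the %USE spread across OSDs.
# Assumes `ceph osd df --format json` reports per-OSD "utilization" and "kb".
import json
import subprocess

nodes = json.loads(subprocess.check_output(["ceph", "osd", "df", "--format", "json"]))["nodes"]

# Skip OSDs that report no capacity (e.g. down/out devices).
usage = {n["id"]: n["utilization"] for n in nodes if n.get("kb", 0) > 0}

lo = min(usage, key=usage.get)
hi = max(usage, key=usage.get)
print(f"min: osd.{lo} {usage[lo]:.2f}%  max: osd.{hi} {usage[hi]:.2f}%")

Here is the balance run and the upmaps it generated: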
[root@ceph1 ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps.2
[2022-06-25 15:28:31,297] gathering cluster state via ceph api...
[2022-06-25 15:29:10,227] running pg balancer
[2022-06-25 15:29:10,506] current OSD fill rate per crushclasses:
[2022-06-25 15:29:10,506] mddu: average=0.20%, median=0.20%, without_placement_constraints=1.67%
[2022-06-25 15:29:10,506] mddm: average=0.15%, median=0.15%, without_placement_constraints=1.58%
[2022-06-25 15:29:10,506] nvme: average=0.01%, median=0.00%, without_placement_constraints=0.15%
[2022-06-25 15:29:10,509] ssd: average=44.80%, median=44.48%, without_placement_constraints=40.82%
[2022-06-25 15:29:10,510] hdd: average=93.10%, median=58.60%, without_placement_constraints=51.84%
[2022-06-25 15:29:10,510] smr: average=0.00%, median=0.00%, without_placement_constraints=0.05%
[2022-06-25 15:29:10,517] cluster variance for crushclasses:
[2022-06-25 15:29:10,517] mddu: 0.000
[2022-06-25 15:29:10,517] mddm: 0.000
[2022-06-25 15:29:10,517] nvme: 0.000
[2022-06-25 15:29:10,517] ssd: 6.502
[2022-06-25 15:29:10,517] hdd: 3915.508
[2022-06-25 15:29:10,517] smr: 0.000
[2022-06-25 15:29:10,517] min osd.253 0.000%
[2022-06-25 15:29:10,517] max osd.2323 385.361%
[2022-06-25 15:29:10,517] osd.2353 has calculated usage >= 100%: 100.63859057273221%
[2022-06-25 15:29:10,517] osd.447 has calculated usage >= 100%: 100.95507949049384%
...
[2022-06-25 15:29:12,803] osd.1960 has calculated usage >= 100%: 288.98203225266775%
[2022-06-25 15:29:12,803] osd.2323 has calculated usage >= 100%: 289.00484092949876%
[2022-06-25 15:29:12,803] enough remaps found
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] generated 10 remaps.
[2022-06-25 15:29:12,803] total movement size: 532.0G.
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] old cluster variance per crushclass:
[2022-06-25 15:29:12,803] mddu: 0.000
[2022-06-25 15:29:12,803] mddm: 0.000
[2022-06-25 15:29:12,803] nvme: 0.000
[2022-06-25 15:29:12,803] ssd: 6.502
[2022-06-25 15:29:12,803] hdd: 3915.508
[2022-06-25 15:29:12,803] smr: 0.000
[2022-06-25 15:29:12,803] old min osd.253 0.000%
[2022-06-25 15:29:12,803] old max osd.2323 385.361%
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] new min osd.253 0.000%
[2022-06-25 15:29:12,803] new max osd.2323 289.005%
[2022-06-25 15:29:12,803] new cluster variance:
[2022-06-25 15:29:12,803] mddu: 0.000
[2022-06-25 15:29:12,804] mddm: 0.000
[2022-06-25 15:29:12,804] nvme: 0.000
[2022-06-25 15:29:12,804] ssd: 6.502
[2022-06-25 15:29:12,804] hdd: 3655.503
[2022-06-25 15:29:12,804] smr: 0.000
[2022-06-25 15:29:12,804] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 17.966 2323 2128
ceph osd pg-upmap-items 17.ce8 1986 2128
ceph osd pg-upmap-items 17.be4 2323 2128
ceph osd pg-upmap-items 17.55e 1986 2128
ceph osd pg-upmap-items 17.afb 2397 2099
ceph osd pg-upmap-items 17.67a 1982 2128
ceph osd pg-upmap-items 17.8c4 2177 2099
ceph osd pg-upmap-items 17.1c3 2215 2127
ceph osd pg-upmap-items 17.450 2313 2128
ceph osd pg-upmap-items 17.b03 2397 2099
[root@ceph-admin ~]# ceph osd pg-upmap-items 17.b03 2397 2099
set 17.b03 pg_upmap_items mapping to [2397->2099]
[root@ceph1 ceph]# tail -f /var/log/ceph/ceph-mon.ceph1.log
...
2022-06-25T15:30:31.076-0700 7fe8a0f67700 -1 verify_upmap multiple osds 2099,2141 come from same failure domain -4487
2022-06-25T15:30:31.076-0700 7fe8a0f67700 0 check_pg_upmaps verify_upmap of pg 17.b03 returning -22
which is true, since both osd.2099 and osd.2141 are in the same rack:
[root@ceph-admin ~]# ceph osd find 2099 | grep rack
"rack": "s14",
[root@ceph-admin ~]# ceph osd find 2141 | grep rack
"rack": "s14",
And for reference, here is the PG being modified:
[root@ceph-admin ~]# ceph pg dump | awk '$1 == "17.b03"'
dumped all
17.b03 82138 0 0 0 0 343693292761 0 0 2150 2150 active+clean 2022-06-22T21:31:45.461326-0700 706729'221745 706729:3645914 [2343,1947,2496,2397,2210,2141,465,1839] 2343 [2343,1947,2496,2397,2210,2141,465,1839] 2343 671601'208723 2022-06-22T21:31:45.461159-0700 671601'208723 2022-06-22T21:31:45.461159-0700 0
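For anyone hitting the same thing, here is a rough pre-flight check I would sketch (purely illustrative, not the balancer's own logic; it assumes ceph pg map and ceph osd find expose "up" and "crush_location" in their JSON output, as they do on this cluster): it flags a proposed pg-upmap-items move whose target OSD sits in a rack already occupied by another shard of the PG.

#!/usr/bin/env python3
# Rough sketch: before applying `ceph osd pg-upmap-items <pgid> <from> <to>`,
# check that the target OSD does not land in a rack already used by the PG's up set.
import json
import subprocess
import sys

def ceph_json(*args):
    return json.loads(subprocess.check_output(["ceph", *args, "--format", "json"]))

def rack_of(osd_id):
    # `ceph osd find` reports the OSD's crush location, including its rack.
    return ceph_json("osd", "find", str(osd_id))["crush_location"].get("rack")

pgid, src, dst = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
up = ceph_json("pg", "map", pgid)["up"]

# racks the PG would occupy after the proposed move
racks = [rack_of(o) for o in up if o != src] + [rack_of(dst)]
dups = {r for r in racks if racks.count(r) > 1}
if dups:
    print(f"{pgid}: {src}->{dst} reuses rack(s) {dups}, verify_upmap will reject it")
else:
    print(f"{pgid}: {src}->{dst} looks rack-clean")

Saved as, say, check_upmap.py and run as ./check_upmap.py 17.b03 2397 2099, it should flag the s14 collision shown above.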
Hi! It may well be that further bugs are hiding in my implementation :) Can you please create a state dump and send it to me (some cloud storage tool, via mail, ...)? Then I can easily play around locally.
Of course, first make sure you're on the latest git version, since I've fixed a bug generating invalid movements not long ago. I was quite sure it is now really fixed, oh well :)
Updating to the latest git version got rid of the invalid pg-upmap-items; however, it now fails to generate any remaps. If you have time to look at this state file, that would be great. I am looking for something better than manually adjusting OSD reweight values to handle a large imbalance in disk usage.
P.S. It took me quite a few tries to get ./placementoptimizer.py -v gather /tmp/lolfile.xz to run without throwing an exception for Cluster topology changed during information gathering, even though the cluster was fairly idle, with just one client writing in new data and all PGs either active+clean, active+clean+scrubbing, or active+clean+scrubbing+deep according to "ceph -s". Is that expected?
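In the meantime, a crude workaround sketch is to simply retry the gather until the topology check passes (this assumes the script exits non-zero when it throws, which a normal Python exception does):

#!/usr/bin/env python3
# Rough sketch: retry `gather` a few times, since the
# "Cluster topology changed during information gathering" check can trip
# even on a mostly idle cluster. Assumes a non-zero exit code on failure.
import subprocess
import sys
import time

for attempt in range(1, 11):
    ret = subprocess.call(["./placementoptimizer.py", "-v", "gather", "/tmp/lolfile.xz"])
    if ret == 0:
        print(f"gather succeeded on attempt {attempt}")
        sys.exit(0)
    time.sleep(5)  # let pg stats settle before retrying
sys.exit("gather kept failing; the cluster state may really be changing")

And here is the balance run that now generates no remaps: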
[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 --max-full-move-attempts=100 | tee /tmp/balance-upmpas
[2022-06-26 15:40:52,117] gathering cluster state via ceph api...
[2022-06-26 15:41:26,934] running pg balancer
[2022-06-26 15:41:27,242] current OSD fill rate per crushclasses:
[2022-06-26 15:41:27,243] mddu: average=0.20%, median=0.21%, without_placement_constraints=1.11%
[2022-06-26 15:41:27,243] mddm: average=0.16%, median=0.16%, without_placement_constraints=1.05%
[2022-06-26 15:41:27,243] nvme: average=0.01%, median=0.01%, without_placement_constraints=0.10%
[2022-06-26 15:41:27,245] ssd: average=44.89%, median=44.53%, without_placement_constraints=40.86%
[2022-06-26 15:41:27,246] hdd: average=96.88%, median=60.80%, without_placement_constraints=53.38%
[2022-06-26 15:41:27,247] smr: average=0.00%, median=0.00%, without_placement_constraints=0.03%
[2022-06-26 15:41:27,254] cluster variance for crushclasses:
[2022-06-26 15:41:27,254] mddu: 0.000
[2022-06-26 15:41:27,254] mddm: 0.000
[2022-06-26 15:41:27,254] nvme: 0.000
[2022-06-26 15:41:27,254] ssd: 7.100
[2022-06-26 15:41:27,254] hdd: 4273.052
[2022-06-26 15:41:27,254] smr: 0.000
[2022-06-26 15:41:27,254] min osd.253 0.000%
[2022-06-26 15:41:27,254] max osd.2323 399.894%
[2022-06-26 15:41:27,255] osd.445 has calculated usage >= 100%: 100.69996474075616%
[2022-06-26 15:41:27,255] osd.347 has calculated usage >= 100%: 102.01944874032769%
[2022-06-26 15:41:27,255] osd.355 has calculated usage >= 100%: 103.58332359721759%
...
[2022-06-26 15:41:27,261] osd.2397 has calculated usage >= 100%: 337.4984239706796%
[2022-06-26 15:41:27,261] osd.1986 has calculated usage >= 100%: 399.57821433986345%
[2022-06-26 15:41:27,261] osd.2323 has calculated usage >= 100%: 399.894190800478%
[2022-06-26 15:41:27,361] in descending full-order, couldn't empty osd.2327, so we're done. if you want to try more often, set --max-full-move-attempts=$nr, this may unlock more balancing possibilities.
[2022-06-26 15:41:27,361] osd.445 has calculated usage >= 100%: 100.69996474075616%
[2022-06-26 15:41:27,361] osd.347 has calculated usage >= 100%: 102.01944874032769%
[2022-06-26 15:41:27,361] osd.355 has calculated usage >= 100%: 103.58332359721759%
...
[2022-06-26 15:41:27,367] osd.2397 has calculated usage >= 100%: 337.4984239706796%
[2022-06-26 15:41:27,367] osd.1986 has calculated usage >= 100%: 399.57821433986345%
[2022-06-26 15:41:27,367] osd.2323 has calculated usage >= 100%: 399.894190800478%
[2022-06-26 15:41:27,367] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,367] generated 0 remaps.
[2022-06-26 15:41:27,367] total movement size: 0.0B.
[2022-06-26 15:41:27,367] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,367] old cluster variance per crushclass:
[2022-06-26 15:41:27,367] mddu: 0.000
[2022-06-26 15:41:27,367] mddm: 0.000
[2022-06-26 15:41:27,367] nvme: 0.000
[2022-06-26 15:41:27,367] ssd: 7.100
[2022-06-26 15:41:27,367] hdd: 4273.052
[2022-06-26 15:41:27,367] smr: 0.000
[2022-06-26 15:41:27,367] old min osd.253 0.000%
[2022-06-26 15:41:27,368] old max osd.2323 399.894%
[2022-06-26 15:41:27,368] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,368] new min osd.253 0.000%
[2022-06-26 15:41:27,368] new max osd.2323 399.894%
[2022-06-26 15:41:27,368] new cluster variance:
[2022-06-26 15:41:27,368] mddu: 0.000
[2022-06-26 15:41:27,368] mddm: 0.000
[2022-06-26 15:41:27,368] nvme: 0.000
[2022-06-26 15:41:27,368] ssd: 7.100
[2022-06-26 15:41:27,368] hdd: 4273.052
[2022-06-26 15:41:27,368] smr: 0.000
[2022-06-26 15:41:27,368] --------------------------------------------------------------------------------
For the Cluster topology changed exception, see #13.
Since it apparently can't empty the fullest device, did you try setting --max-full-move-attempts=10 or more?
I haven't had time to look at your dump yet, but that time will surely come :)
I've had a look now: it seems that the constraint of distributing across racks prevents the balancer from finding a suitable solution:
[2022-11-15 00:55:42,912] trying to empty osd.2323 (400.110562 %)
[2022-11-15 00:55:42,912] TRY-0 moving pg 17.966 (1/13) with 55.1G from osd.2323
[2022-11-15 00:55:42,913] OK => taking pg 17.966 from source osd.2323 since it has too many of pool=17 (8 > 1.1150)
[2022-11-15 00:55:42,913] prepare crush check for pg 17.966 currently up=[2279, 2323, 345, 2453, 493, 329, 470, 2471]
[2022-11-15 00:55:42,913] rule:
{'name': 'fs.data.mirror.hdd.ec',
'steps': [{'num': 5, 'op': 'set_chooseleaf_tries'},
{'num': 100, 'op': 'set_choose_tries'},
{'item': -4565, 'item_name': 'default~hdd', 'op': 'take'},
{'num': 0, 'op': 'chooseleaf_indep', 'type': 'rack'},
{'op': 'emit'}]}
[2022-11-15 00:55:42,914] allowed reuses per rule step, starting at root: [8, 1, 1]
[2022-11-15 00:55:42,914] trace for 2279: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4232, 'type_name': 'rack'}, {'id': -12067, 'type_name': 'host'}, {'id': 2279, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 2323: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4367, 'type_name': 'rack'}, {'id': -12228, 'type_name': 'host'}, {'id': 2323, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 345: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4253, 'type_name': 'rack'}, {'id': -4238, 'type_name': 'host'}, {'id': 345, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 2453: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4487, 'type_name': 'rack'}, {'id': -12676, 'type_name': 'host'}, {'id': 2453, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 493: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4535, 'type_name': 'rack'}, {'id': -4490, 'type_name': 'host'}, {'id': 493, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 329: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4211, 'type_name': 'rack'}, {'id': -4208, 'type_name': 'host'}, {'id': 329, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 470: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4463, 'type_name': 'rack'}, {'id': -4448, 'type_name': 'host'}, {'id': 470, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] trace for 2471: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4556, 'type_name': 'rack'}, {'id': -12739, 'type_name': 'host'}, {'id': 2471, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] processing crush step {'op': 'take', 'item': -4565, 'item_name': 'default~hdd'} with tree_depth=0, rule_depth=0, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-15 00:55:42,914] processing crush step {'op': 'chooseleaf_indep', 'num': 0, 'type': 'rack'} with tree_depth=0, rule_depth=1, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-15 00:55:42,914] processing crush step {'op': 'emit'} with tree_depth=3, rule_depth=2, item_uses=defaultdict(<class 'dict'>, {1: {-4565: 0, -4232: 1, -4367: 1, -4253: 1, -4487: 1, -4535: 1, -4211: 1, -4463: 1, -4556: 1}, 0: {-4565: 8}})
[2022-11-15 00:55:42,914] after crush check preparations: rule_tree_depth={0: 0, 1: 3, 2: 5} item_uses=defaultdict(<class 'dict'>, {1: {-4565: 0, -4232: 1, -4367: 1, -4253: 1, -4487: 1, -4535: 1, -4211: 1, -4463: 1, -4556: 1}, 0: {-4565: 8}, 2: {2279: 1, 2323: 1, 345: 1, 2453: 1, 493: 1, 329: 1, 470: 1, 2471: 1}})
[2022-11-15 00:55:42,914] crush check preparation success!
[2022-11-15 00:55:42,918] TRY-1 move 17.966 osd.2323 => osd.2128 (1/704)
[2022-11-15 00:55:42,919] OK => osd.2128 has too few of pool=17 (45 < 113.1569)
[2022-11-15 00:55:42,919] trace for 2128: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4487, 'type_name': 'rack'}, {'id': -11528, 'type_name': 'host'}, {'id': 2128, 'type_name': 'osd'}]
[2022-11-15 00:55:42,919] item reuse check ok: osd.2323@[0]=-4565 -> osd.2128@[0]=-4565 x uses=8 <= max_allowed=8
[2022-11-15 00:55:42,919] item reuse check fail: osd.2323@[3]=-4367 -> osd.2128@[3]=-4487 x uses=2 > max_allowed=1
[2022-11-15 00:55:42,919] BAD move 17.966 osd.2323 => osd.2128
This goes on for all other PGs on that OSD: every destination attempt fails because the candidate OSD's rack is already occupied by another shard of that PG. I haven't added this to the code yet, but maybe you know: how many racks are theoretically suitable for placement of pool 17?
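If you want to check, a rough sketch like the following should give the answer (purely illustrative; it assumes ceph osd tree --format json lists buckets with "children" and OSDs with a "device_class" field): count the racks in the crush tree that contain at least one hdd-class OSD, since the rule takes default~hdd and then does chooseleaf_indep over rack.

#!/usr/bin/env python3
# Rough sketch: count racks that contain at least one hdd-class OSD,
# i.e. racks that can hold a shard of pool 17 (rule: take default~hdd,
# chooseleaf_indep over rack). Assumes `ceph osd tree --format json`
# lists buckets with "children" and OSDs with "device_class".
import json
import subprocess

tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "--format", "json"]))
nodes = {n["id"]: n for n in tree["nodes"]}

def osds_below(node_id):
    node = nodes.get(node_id)
    if node is None:
        return
    if node["type"] == "osd":
        yield node
    for child in node.get("children", []):
        yield from osds_below(child)

racks_with_hdd = sorted(
    n["name"]
    for n in nodes.values()
    if n["type"] == "rack"
    and any(o.get("device_class") == "hdd" for o in osds_below(n["id"]))
)
print(f"{len(racks_with_hdd)} racks contain hdd OSDs: {racks_with_hdd}")

With an ec 6+2 rule doing chooseleaf_indep over rack, a PG needs 8 distinct racks, so at least 8 racks must contain hdd OSDs; if there are exactly 8, every shard can only ever move to another hdd OSD inside its own rack, which drastically limits what the balancer can do.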
If you're interested in further analysis, please send a recent state dump so I can have a look again :)