
balancer generates pg-upmap that fails verify_upmap

Open stuartthebruce opened this issue 2 years ago • 3 comments

On a fairly large cluster running Ceph 15.2.16, placementoptimizer.py balance generates pg-upmap-items that violate the CRUSH failure domain constraint by trying to map multiple OSDs of an EC PG to the same rack. Any ideas on what I am doing wrong would be greatly appreciated.

[root@ceph-admin ~]# ./placementoptimizer.py show

poolid name                    type     size min pg_num  stored    used   avail shrdsize crush
    10 fs.meta.mirror          repl        3   2   1024    2.1G    2.1G    1.2T     2.1M 6:replicated_chassis_mddm default~mddm*1.000
    11 fs.meta.user            repl        3   2   1024    2.7G    2.7G    1.2T     2.7M 7:replicated_chassis_mddu default~mddu*1.000
    12 fs.data.mirror.ssd.rep  repl        3   2   8192    0.0B    0.0B  267.4T     0.0B 1:replicated_chassis_ssd default~ssd*1.000
    13 fs.data.mirror.nvme.rep repl        3   2   2048    0.0B    0.0B   53.4T     0.0B 3:replicated_chassis_nvme default~nvme*1.000
    14 fs.data.user.ssd.rep    repl        3   2   8192   94.4T   94.4T  267.4T    11.8G 1:replicated_chassis_ssd default~ssd*1.000
    15 fs.data.user.nvme.rep   repl        3   2   2048    0.0B    0.0B   53.4T     0.0B 3:replicated_chassis_nvme default~nvme*1.000
    16 fs.data.mirror.ssd.ec   ec6+2       8   7   4096  356.8T  356.8T  601.6T    14.9G 2:fs.data.mirror.ssd.ec default~ssd*1.000
    17 fs.data.mirror.hdd.ec   ec6+2       8   7   4096    1.2P    1.2P  315.0T    53.1G 4:fs.data.mirror.hdd.ec default~hdd*1.000
    18 fs.data.user.hdd.ec     ec6+2       8   7   4096    3.5G    3.5G  315.0T   148.3K 5:fs.data.user.hdd.ec default~hdd*1.000
    20 device_health_metrics   repl        3   2     32    3.1G    3.1G   53.4T    99.2M 3:replicated_chassis_nvme default~nvme*1.000
    21 fs.data.user.ssd.ec     ec6+2       8   7   4096   82.5T   82.5T  601.6T     3.4G 0:fs.data.user.ssd.ec default~ssd*1.000
default~mddm                                              2.07G   2.07G    1.625%
default~mddu                                              2.73G   2.73G    1.708%
default~ssd                                             533.62T 533.62T   40.830%
default~nvme                                              3.10G   3.10G    0.157%
default~hdd                                               1.24P   1.24P   51.871%
sum                                                       1.76P   1.76P

While 128 PGs were active+remapped+backfilling from the mgr balancer, I tried running the JJ balancer to see if it can do a better job, since I currently have OSDs ranging from 22% to 82% %USE. However, the generated pg-upmap-items fail the CRUSH failure domain check by trying to use two OSDs in the same rack:

[root@ceph1 ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps.2
[2022-06-25 15:28:31,297] gathering cluster state via ceph api...
[2022-06-25 15:29:10,227] running pg balancer
[2022-06-25 15:29:10,506] current OSD fill rate per crushclasses:
[2022-06-25 15:29:10,506]   mddu: average=0.20%, median=0.20%, without_placement_constraints=1.67%
[2022-06-25 15:29:10,506]   mddm: average=0.15%, median=0.15%, without_placement_constraints=1.58%
[2022-06-25 15:29:10,506]   nvme: average=0.01%, median=0.00%, without_placement_constraints=0.15%
[2022-06-25 15:29:10,509]   ssd: average=44.80%, median=44.48%, without_placement_constraints=40.82%
[2022-06-25 15:29:10,510]   hdd: average=93.10%, median=58.60%, without_placement_constraints=51.84%
[2022-06-25 15:29:10,510]   smr: average=0.00%, median=0.00%, without_placement_constraints=0.05%
[2022-06-25 15:29:10,517] cluster variance for crushclasses:
[2022-06-25 15:29:10,517]   mddu: 0.000
[2022-06-25 15:29:10,517]   mddm: 0.000
[2022-06-25 15:29:10,517]   nvme: 0.000
[2022-06-25 15:29:10,517]   ssd: 6.502
[2022-06-25 15:29:10,517]   hdd: 3915.508
[2022-06-25 15:29:10,517]   smr: 0.000
[2022-06-25 15:29:10,517] min osd.253 0.000%
[2022-06-25 15:29:10,517] max osd.2323 385.361%
[2022-06-25 15:29:10,517] osd.2353 has calculated usage >= 100%: 100.63859057273221%
[2022-06-25 15:29:10,517] osd.447 has calculated usage >= 100%: 100.95507949049384%
...
[2022-06-25 15:29:12,803] osd.1960 has calculated usage >= 100%: 288.98203225266775%
[2022-06-25 15:29:12,803] osd.2323 has calculated usage >= 100%: 289.00484092949876%
[2022-06-25 15:29:12,803] enough remaps found
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] generated 10 remaps.
[2022-06-25 15:29:12,803] total movement size: 532.0G.
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] old cluster variance per crushclass:
[2022-06-25 15:29:12,803]   mddu: 0.000
[2022-06-25 15:29:12,803]   mddm: 0.000
[2022-06-25 15:29:12,803]   nvme: 0.000
[2022-06-25 15:29:12,803]   ssd: 6.502
[2022-06-25 15:29:12,803]   hdd: 3915.508
[2022-06-25 15:29:12,803]   smr: 0.000
[2022-06-25 15:29:12,803] old min osd.253 0.000%
[2022-06-25 15:29:12,803] old max osd.2323 385.361%
[2022-06-25 15:29:12,803] --------------------------------------------------------------------------------
[2022-06-25 15:29:12,803] new min osd.253 0.000%
[2022-06-25 15:29:12,803] new max osd.2323 289.005%
[2022-06-25 15:29:12,803] new cluster variance:
[2022-06-25 15:29:12,803]   mddu: 0.000
[2022-06-25 15:29:12,804]   mddm: 0.000
[2022-06-25 15:29:12,804]   nvme: 0.000
[2022-06-25 15:29:12,804]   ssd: 6.502
[2022-06-25 15:29:12,804]   hdd: 3655.503
[2022-06-25 15:29:12,804]   smr: 0.000
[2022-06-25 15:29:12,804] --------------------------------------------------------------------------------
ceph osd pg-upmap-items 17.966 2323 2128
ceph osd pg-upmap-items 17.ce8 1986 2128
ceph osd pg-upmap-items 17.be4 2323 2128
ceph osd pg-upmap-items 17.55e 1986 2128
ceph osd pg-upmap-items 17.afb 2397 2099
ceph osd pg-upmap-items 17.67a 1982 2128
ceph osd pg-upmap-items 17.8c4 2177 2099
ceph osd pg-upmap-items 17.1c3 2215 2127
ceph osd pg-upmap-items 17.450 2313 2128
ceph osd pg-upmap-items 17.b03 2397 2099
[root@ceph-admin ~]# ceph osd pg-upmap-items 17.b03 2397 2099
set 17.b03 pg_upmap_items mapping to [2397->2099]
[root@ceph1 ceph]# tail -f /var/log/ceph/ceph-mon.ceph1.log
...
2022-06-25T15:30:31.076-0700 7fe8a0f67700 -1 verify_upmap multiple osds 2099,2141 come from same failure domain -4487
2022-06-25T15:30:31.076-0700 7fe8a0f67700  0 check_pg_upmaps verify_upmap of pg 17.b03 returning -22

which is true, since both osd.2099 and osd.2141 are in the same rack:

[root@ceph-admin ~]# ceph osd find 2099 | grep rack
        "rack": "s14",
[root@ceph-admin ~]# ceph osd find 2141 | grep rack
        "rack": "s14",

And for reference, here is the PG the upmap attempts to modify:

[root@ceph-admin ~]# ceph pg dump | awk '$1 == "17.b03"'
dumped all
17.b03     82138                   0         0          0        0  343693292761            0           0   2150      2150                 active+clean  2022-06-22T21:31:45.461326-0700  706729'221745  706729:3645914   [2343,1947,2496,2397,2210,2141,465,1839]        2343   [2343,1947,2496,2397,2210,2141,465,1839]            2343  671601'208723  2022-06-22T21:31:45.461159-0700    671601'208723  2022-06-22T21:31:45.461159-0700              0
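
As a sanity check before applying such a mapping, the same two commands can be combined into a small pre-check: compare the target OSD's rack against the racks of the PG's current up set. A minimal sketch (not part of placementoptimizer.py; it assumes the rack failure domain of this pool and the crush_location output of ceph osd find shown above):

#!/usr/bin/env python3
# Sketch: check whether a proposed "ceph osd pg-upmap-items PGID FROM TO"
# would place two shards of the PG into the same rack. Uses "ceph pg map"
# and "ceph osd find", assuming crush_location contains a "rack" entry as
# in the grep output above. The example values are the rejected mapping
# for pg 17.b03 (2397 -> 2099).
import json
import subprocess

def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format=json"])
    return json.loads(out)

def osd_rack(osd_id):
    """Return the rack an OSD lives in, according to its crush location."""
    return ceph_json("osd", "find", str(osd_id))["crush_location"]["rack"]

def upmap_breaks_rack_rule(pgid, osd_from, osd_to):
    """True if replacing osd_from with osd_to would put two shards of the
    PG into one rack, i.e. verify_upmap would reject the mapping."""
    up = ceph_json("pg", "map", pgid)["up"]
    used_racks = {osd_rack(o) for o in up if o != osd_from}
    return osd_rack(osd_to) in used_racks

if __name__ == "__main__":
    print(upmap_breaks_rack_rule("17.b03", 2397, 2099))  # expected: True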

stuartthebruce · Jun 25 '22 22:06

Hi! It may well be that further bugs are hiding in my implementation :) Can you please create a state dump and send it to me (via some cloud storage tool, mail, ...)? Then I can easily play around with it locally.

Of course, first make sure you're on the latest git version, since I fixed a bug that generated invalid movements not long ago. I was quite sure it was really fixed now, oh well :)

TheJJ · Jun 26 '22 14:06

Updating to the latest git version got rid of the invalid pg-upmap-items; however, it now fails to generate any remaps at all. If you have time to look at this state file, that would be great. I am looking for something better than manually adjusting OSD reweight values to handle a large imbalance in disk usage.

P.S. It took quite a few tries to get ./placementoptimizer.py -v gather /tmp/lolfile.xz to run without throwing a "Cluster topology changed during information gathering" exception, even though the cluster was fairly idle, with just one client writing in new data and all PGs either active+clean, active+clean+scrubbing, or active+clean+scrubbing+deep according to "ceph -s". Is that expected?
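
A simple workaround for that intermittent failure is to just retry the gather until it gets a consistent snapshot; a rough sketch (the retry count, sleep time, and output path are arbitrary):

#!/usr/bin/env python3
# Sketch: retry "./placementoptimizer.py gather" a few times, as a workaround
# for the intermittent "Cluster topology changed during information gathering"
# error. Retry count, sleep time, and output path are arbitrary choices.
import subprocess
import sys
import time

CMD = ["./placementoptimizer.py", "-v", "gather", "/tmp/lolfile.xz"]

for attempt in range(1, 11):
    if subprocess.run(CMD).returncode == 0:
        print(f"gather succeeded on attempt {attempt}")
        break
    print(f"gather failed (attempt {attempt}), retrying in 10s...", file=sys.stderr)
    time.sleep(10)
else:
    sys.exit("gather kept failing; the cluster may still be changing")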

[root@ceph-admin ~]# ./placementoptimizer.py -v balance --max-pg-moves 10 --max-full-move-attempts=100 | tee /tmp/balance-upmpas
[2022-06-26 15:40:52,117] gathering cluster state via ceph api...
[2022-06-26 15:41:26,934] running pg balancer
[2022-06-26 15:41:27,242] current OSD fill rate per crushclasses:
[2022-06-26 15:41:27,243]   mddu: average=0.20%, median=0.21%, without_placement_constraints=1.11%
[2022-06-26 15:41:27,243]   mddm: average=0.16%, median=0.16%, without_placement_constraints=1.05%
[2022-06-26 15:41:27,243]   nvme: average=0.01%, median=0.01%, without_placement_constraints=0.10%
[2022-06-26 15:41:27,245]   ssd: average=44.89%, median=44.53%, without_placement_constraints=40.86%
[2022-06-26 15:41:27,246]   hdd: average=96.88%, median=60.80%, without_placement_constraints=53.38%
[2022-06-26 15:41:27,247]   smr: average=0.00%, median=0.00%, without_placement_constraints=0.03%
[2022-06-26 15:41:27,254] cluster variance for crushclasses:
[2022-06-26 15:41:27,254]   mddu: 0.000
[2022-06-26 15:41:27,254]   mddm: 0.000
[2022-06-26 15:41:27,254]   nvme: 0.000
[2022-06-26 15:41:27,254]   ssd: 7.100
[2022-06-26 15:41:27,254]   hdd: 4273.052
[2022-06-26 15:41:27,254]   smr: 0.000
[2022-06-26 15:41:27,254] min osd.253 0.000%
[2022-06-26 15:41:27,254] max osd.2323 399.894%
[2022-06-26 15:41:27,255] osd.445 has calculated usage >= 100%: 100.69996474075616%
[2022-06-26 15:41:27,255] osd.347 has calculated usage >= 100%: 102.01944874032769%
[2022-06-26 15:41:27,255] osd.355 has calculated usage >= 100%: 103.58332359721759%
...
[2022-06-26 15:41:27,261] osd.2397 has calculated usage >= 100%: 337.4984239706796%
[2022-06-26 15:41:27,261] osd.1986 has calculated usage >= 100%: 399.57821433986345%
[2022-06-26 15:41:27,261] osd.2323 has calculated usage >= 100%: 399.894190800478%
[2022-06-26 15:41:27,361] in descending full-order, couldn't empty osd.2327, so we're done. if you want to try more often, set --max-full-move-attempts=$nr, this may unlock more balancing possibilities.
[2022-06-26 15:41:27,361] osd.445 has calculated usage >= 100%: 100.69996474075616%
[2022-06-26 15:41:27,361] osd.347 has calculated usage >= 100%: 102.01944874032769%
[2022-06-26 15:41:27,361] osd.355 has calculated usage >= 100%: 103.58332359721759%
...
[2022-06-26 15:41:27,367] osd.2397 has calculated usage >= 100%: 337.4984239706796%
[2022-06-26 15:41:27,367] osd.1986 has calculated usage >= 100%: 399.57821433986345%
[2022-06-26 15:41:27,367] osd.2323 has calculated usage >= 100%: 399.894190800478%
[2022-06-26 15:41:27,367] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,367] generated 0 remaps.
[2022-06-26 15:41:27,367] total movement size: 0.0B.
[2022-06-26 15:41:27,367] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,367] old cluster variance per crushclass:
[2022-06-26 15:41:27,367]   mddu: 0.000
[2022-06-26 15:41:27,367]   mddm: 0.000
[2022-06-26 15:41:27,367]   nvme: 0.000
[2022-06-26 15:41:27,367]   ssd: 7.100
[2022-06-26 15:41:27,367]   hdd: 4273.052
[2022-06-26 15:41:27,367]   smr: 0.000
[2022-06-26 15:41:27,367] old min osd.253 0.000%
[2022-06-26 15:41:27,368] old max osd.2323 399.894%
[2022-06-26 15:41:27,368] --------------------------------------------------------------------------------
[2022-06-26 15:41:27,368] new min osd.253 0.000%
[2022-06-26 15:41:27,368] new max osd.2323 399.894%
[2022-06-26 15:41:27,368] new cluster variance:
[2022-06-26 15:41:27,368]   mddu: 0.000
[2022-06-26 15:41:27,368]   mddm: 0.000
[2022-06-26 15:41:27,368]   nvme: 0.000
[2022-06-26 15:41:27,368]   ssd: 7.100
[2022-06-26 15:41:27,368]   hdd: 4273.052
[2022-06-26 15:41:27,368]   smr: 0.000
[2022-06-26 15:41:27,368] --------------------------------------------------------------------------------

stuartthebruce · Jun 26 '22 23:06

For the Cluster topology changed error, see #13. Since it apparently can't empty the fullest device, did you try setting --max-full-move-attempts=10 or more? I haven't had time to look at your dump yet, but that time will surely come :)

TheJJ · Jul 09 '22 22:07

I've had a look now: it seems that the constraint of distributing across racks prevents the balancer from finding a suitable solution:

[2022-11-15 00:55:42,912] trying to empty osd.2323 (400.110562 %)
[2022-11-15 00:55:42,912] TRY-0 moving pg 17.966 (1/13) with 55.1G from osd.2323
[2022-11-15 00:55:42,913]   OK => taking pg 17.966 from source osd.2323 since it has too many of pool=17 (8 > 1.1150)
[2022-11-15 00:55:42,913] prepare crush check for pg 17.966 currently up=[2279, 2323, 345, 2453, 493, 329, 470, 2471]
[2022-11-15 00:55:42,913] rule:
{'name': 'fs.data.mirror.hdd.ec',
 'steps': [{'num': 5, 'op': 'set_chooseleaf_tries'},
           {'num': 100, 'op': 'set_choose_tries'},
           {'item': -4565, 'item_name': 'default~hdd', 'op': 'take'},
           {'num': 0, 'op': 'chooseleaf_indep', 'type': 'rack'},
           {'op': 'emit'}]}
[2022-11-15 00:55:42,914] allowed reuses per rule step, starting at root: [8, 1, 1]
[2022-11-15 00:55:42,914]    trace for 2279: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4232, 'type_name': 'rack'}, {'id': -12067, 'type_name': 'host'}, {'id': 2279, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for 2323: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4367, 'type_name': 'rack'}, {'id': -12228, 'type_name': 'host'}, {'id': 2323, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for  345: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4253, 'type_name': 'rack'}, {'id': -4238, 'type_name': 'host'}, {'id': 345, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for 2453: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4487, 'type_name': 'rack'}, {'id': -12676, 'type_name': 'host'}, {'id': 2453, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for  493: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4535, 'type_name': 'rack'}, {'id': -4490, 'type_name': 'host'}, {'id': 493, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for  329: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4211, 'type_name': 'rack'}, {'id': -4208, 'type_name': 'host'}, {'id': 329, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for  470: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4463, 'type_name': 'rack'}, {'id': -4448, 'type_name': 'host'}, {'id': 470, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914]    trace for 2471: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4556, 'type_name': 'rack'}, {'id': -12739, 'type_name': 'host'}, {'id': 2471, 'type_name': 'osd'}]
[2022-11-15 00:55:42,914] processing crush step {'op': 'take', 'item': -4565, 'item_name': 'default~hdd'} with tree_depth=0, rule_depth=0, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-15 00:55:42,914] processing crush step {'op': 'chooseleaf_indep', 'num': 0, 'type': 'rack'} with tree_depth=0, rule_depth=1, item_uses=defaultdict(<class 'dict'>, {})
[2022-11-15 00:55:42,914] processing crush step {'op': 'emit'} with tree_depth=3, rule_depth=2, item_uses=defaultdict(<class 'dict'>, {1: {-4565: 0, -4232: 1, -4367: 1, -4253: 1, -4487: 1, -4535: 1, -4211: 1, -4463: 1, -4556: 1}, 0: {-4565: 8}})
[2022-11-15 00:55:42,914] after crush check preparations: rule_tree_depth={0: 0, 1: 3, 2: 5} item_uses=defaultdict(<class 'dict'>, {1: {-4565: 0, -4232: 1, -4367: 1, -4253: 1, -4487: 1, -4535: 1, -4211: 1, -4463: 1, -4556: 1}, 0: {-4565: 8}, 2: {2279: 1, 2323: 1, 345: 1, 2453: 1, 493: 1, 329: 1, 470: 1, 2471: 1}})
[2022-11-15 00:55:42,914] crush check preparation success!
[2022-11-15 00:55:42,918] TRY-1 move 17.966 osd.2323 => osd.2128 (1/704)
[2022-11-15 00:55:42,919]  OK => osd.2128 has too few of pool=17 (45 < 113.1569)
[2022-11-15 00:55:42,919]    trace for 2128: [{'id': -4565, 'type_name': 'root'}, {'id': -4562, 'type_name': 'datacenter'}, {'id': -4559, 'type_name': 'pod'}, {'id': -4487, 'type_name': 'rack'}, {'id': -11528, 'type_name': 'host'}, {'id': 2128, 'type_name': 'osd'}]
[2022-11-15 00:55:42,919]    item reuse check ok:   osd.2323@[0]=-4565 -> osd.2128@[0]=-4565 x uses=8 <= max_allowed=8
[2022-11-15 00:55:42,919]    item reuse check fail: osd.2323@[3]=-4367 -> osd.2128@[3]=-4487 x uses=2 > max_allowed=1
[2022-11-15 00:55:42,919]  BAD move 17.966 osd.2323 => osd.2128

This goes on for all other PGs on that OSD: every destination attempt fails because the candidate destination OSD sits in a rack that already holds a shard of that PG. I haven't added this to the code yet, but maybe you know: how many racks are theoretically suitable for placing pool 17?
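
To answer that from the cluster side, counting the racks that contain at least one hdd OSD should be enough, since the 6+2 EC rule needs eight distinct racks per PG. A rough sketch using the JSON output of ceph osd tree (it assumes the root/rack/host/osd hierarchy from the traces above and the device_class field on OSD nodes):

#!/usr/bin/env python3
# Sketch: count how many racks contain at least one OSD of a given device
# class, using "ceph osd tree" JSON output. A 6+2 EC pool with a rack
# failure domain needs at least 8 such racks.
import json
import subprocess

def racks_with_class(device_class="hdd"):
    tree = json.loads(subprocess.check_output(["ceph", "osd", "tree", "--format=json"]))
    nodes = {n["id"]: n for n in tree["nodes"]}
    racks = set()

    def walk(node_id, current_rack=None):
        node = nodes[node_id]
        if node["type"] == "rack":
            current_rack = node["name"]
        if node["type"] == "osd" and node.get("device_class") == device_class:
            if current_rack is not None:
                racks.add(current_rack)
        for child in node.get("children", []):
            walk(child, current_rack)

    # Start from every root bucket (e.g. "default").
    for node in tree["nodes"]:
        if node["type"] == "root":
            walk(node["id"])
    return racks

if __name__ == "__main__":
    racks = racks_with_class("hdd")
    print(f"{len(racks)} racks contain hdd OSDs: {sorted(racks)}")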

TheJJ · Nov 15 '22 00:11

If you're interested in further analysis, please send a recent state dump so I can have a look again :)

TheJJ · May 08 '23 20:05