
ceph-csi very slow on vm

Open plano-fwinkler opened this issue 1 year ago • 16 comments

Proxmox with Ceph, and Talos as a VM with ceph-csi, is much slower than openebs-hostpath. Are there any kernel modules missing?

Environment

  • Talos version: 1.8.2
  • Kubernetes version: 1.31.2
  • Platform: proxmox with ceph storage

plano-fwinkler avatar Nov 19 '24 14:11 plano-fwinkler

The issue you posted doesn't have any relevant details, including the performance numbers, the way you set up things, etc.

Ceph is a complicated subject, and setting it up properly is not trivial.

smira avatar Nov 19 '24 14:11 smira

We have a Proxmox cluster with 5 nodes and a Ceph cluster on the Proxmox hosts. The Ceph cluster has a 100 Gbit/s NIC.

Testing with kubestr fio:

With the local-path StorageClass (openebs-hostpath):

./kubestr fio -s openebs-hostpath
PVC created kubestr-fio-pvc-qqb7w
Pod created kubestr-fio-pod-4z7zc
Running FIO test (default-fio) on StorageClass (openebs-hostpath) with a PVC of Size (100Gi)
Elapsed time- 28.089900025s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=49767.750000 BW(KiB/s)=199087
  iops: min=41961 max=61272 avg=49501.585938
  bw(KiB/s): min=167847 max=245088 avg=198006.484375

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=21245.320312 BW(KiB/s)=84993
  iops: min=9028 max=39728 avg=35385.707031
  bw(KiB/s): min=36112 max=158912 avg=141543.125000

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=36891.605469 BW(KiB/s)=4722663
  iops: min=31849 max=45298 avg=36709.964844
  bw(KiB/s): min=4076761 max=5798144 avg=4698881.500000

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=33320.179688 BW(KiB/s)=4265520
  iops: min=17652 max=40996 avg=33119.656250
  bw(KiB/s): min=2259456 max=5247488 avg=4239321.500000

Disk stats (read/write):
  sda: ios=1454972/1046364 merge=0/22 ticks=1907168/1466570 in_queue=3393654, util=29.229431%
  -  OK

And with the Ceph block StorageClass (rbd.csi.ceph.com):

./kubestr fio -s ceph-block
PVC created kubestr-fio-pvc-n7m9z
Pod created kubestr-fio-pod-4jnqw
Running FIO test (default-fio) on StorageClass (ceph-block) with a PVC of Size (100Gi)
Elapsed time- 27.566283667s
FIO test results:

FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=242.109741 BW(KiB/s)=983
  iops: min=98 max=496 avg=257.322571
  bw(KiB/s): min=392 max=1987 avg=1030.129028

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=224.676819 BW(KiB/s)=914
  iops: min=2 max=768 avg=264.464294
  bw(KiB/s): min=8 max=3072 avg=1058.357178

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=213.964386 BW(KiB/s)=27884
  iops: min=90 max=462 avg=223.967743
  bw(KiB/s): min=11520 max=59254 avg=28694.708984

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=219.214661 BW(KiB/s)=28548
  iops: min=4 max=704 avg=258.035706
  bw(KiB/s): min=512 max=90112 avg=33048.785156

Disk stats (read/write):
  rbd2: ios=8696/8655 merge=0/267 ticks=2245425/1975831 in_queue=4221257, util=99.504547%
  -  OK

The Talos machine has two NICs; one is used only to communicate with the Ceph monitors.

It works, but I think it's too slow.

plano-fwinkler avatar Nov 19 '24 14:11 plano-fwinkler

Then you need to dig further to understand where the bottleneck is. Ceph block storage is certainly expected to be slower, since it goes over the network, does replication, etc.

You can watch resource utilization to identify the bottleneck.
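For example, a minimal way to do that while the fio test runs (a sketch only; the node IP and namespace are placeholders, and kubectl top requires metrics-server):

# On the Talos node: live CPU/memory/network view while the benchmark runs.
talosctl --nodes 192.168.1.10 dashboard

# In the cluster: per-node and per-pod utilization (requires metrics-server).
kubectl top nodes
kubectl top pods -n ceph-csi

# On the Ceph side: cluster health, client IO, and per-OSD latency.
ceph -s
ceph osd perf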

We are not aware of anything missing from the Talos side, and we do use Ceph a lot ourselves with Talos.

smira avatar Nov 19 '24 16:11 smira

OK, as a first step we upgraded from 1.7.5 to 1.8.3:

talos 1.8.3:

./kubestr fio -s ceph-block
PVC created kubestr-fio-pvc-rr88n
Pod created kubestr-fio-pod-tfhwn
Running FIO test (default-fio) on StorageClass (ceph-block) with a PVC of Size (100Gi)
Elapsed time- 28.61114439s
FIO test results:
  
FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=225.275375 BW(KiB/s)=916
  iops: min=58 max=547 avg=245.451614
  bw(KiB/s): min=232 max=2188 avg=982.258057

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=208.858887 BW(KiB/s)=850
  iops: min=118 max=480 avg=251.928574
  bw(KiB/s): min=472 max=1923 avg=1008.285706

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=171.147690 BW(KiB/s)=22384
  iops: min=32 max=382 avg=186.451614
  bw(KiB/s): min=4096 max=48896 avg=23881.837891

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=210.829285 BW(KiB/s)=27469
  iops: min=18 max=486 avg=251.142853
  bw(KiB/s): min=2304 max=62208 avg=32166.677734

Disk stats (read/write):
  rbd7: ios=7798/8137 merge=0/266 ticks=2268458/2110792 in_queue=4379250, util=99.517471%
  -  OK
  

talos 1.7.5

 ./kubestr fio -s ceph-block
PVC created kubestr-fio-pvc-gz78h
Pod created kubestr-fio-pod-w6q9h
Running FIO test (default-fio) on StorageClass (ceph-block) with a PVC of Size (100Gi)
Elapsed time- 25.926723803s
FIO test results:
  
FIO version - fio-3.36
Global options - ioengine=libaio verify=0 direct=1 gtod_reduce=1

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=3099.707031 BW(KiB/s)=12415
  iops: min=2904 max=3330 avg=3104.266602
  bw(KiB/s): min=11616 max=13322 avg=12417.200195

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=1818.115234 BW(KiB/s)=7289
  iops: min=1597 max=1963 avg=1821.033325
  bw(KiB/s): min=6388 max=7855 avg=7284.466797

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=3061.892822 BW(KiB/s)=392458
  iops: min=2860 max=3300 avg=3065.199951
  bw(KiB/s): min=366080 max=422400 avg=392351.312500

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=1826.963989 BW(KiB/s)=234388
  iops: min=1712 max=1960 avg=1829.699951
  bw(KiB/s): min=219136 max=250880 avg=234209.000000

Disk stats (read/write):
  rbd3: ios=104828/62036 merge=0/701 ticks=2173229/1309682 in_queue=3482912, util=99.467682%
  -  OK

But I think that's still too slow, and I don't know where to look.

f-wi-plano avatar Dec 02 '24 12:12 f-wi-plano

I tested with a Fedora VM in which I deployed a Ceph single-node cluster using standard cephadm tooling. The VM as well as its storage disk (both qcow2) reside in tmpfs for stability of results.

In Talos I installed ceph-csi with the attached csi-configs.zip alongside https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-provisioner-rbac.yaml and https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-nodeplugin-rbac.yaml
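For reference, applying those two RBAC manifests is just the following (the ceph-csi driver and StorageClass themselves come from the attached csi-configs.zip, which is not reproduced here):

kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-provisioner-rbac.yaml
kubectl apply -f https://raw.githubusercontent.com/ceph/ceph-csi/master/deploy/rbd/kubernetes/csi-nodeplugin-rbac.yaml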

I measured the bandwidth between nodes at approximately 40 Gbit/s for Talos 1.7.7 and 1.8.3, as well as for 1.7.5, which I also tested. The VMs are routed through the host (I modified iptables so that my system could route packets between the talosctl cluster nodes and the libvirt virtual machine containing Ceph).

I was unable to find any actual performance difference between Talos versions; here are the test results: Google Spreadsheet

So please tell me what's different in your setup that might have affected the behavior. Also, please check the network bandwidth between the Ceph nodes and the Talos nodes for both versions.

dsseng avatar Dec 13 '24 09:12 dsseng

Hello,

Thanks for the answer.

I have two questions:

  1. What exactly did you do for “I modified iptables in a way to allow my system to route packets between talosctl cluster nodes and libvirt virtual machine containing Ceph”?

  2. What is the best way to measure network bandwidth on Talos (between the Ceph nodes and the Talos nodes, for both versions)?

I still have to export the ceph-csi configuration; we are finishing the Helm chart.

f-wi-plano avatar Jan 06 '25 12:01 f-wi-plano

  1. I meant the configuration of the host system while debugging. I just made sure the nodes could reach the Ceph VM running in a different virtual network. That can't be your issue, since you do have Ceph accessible, just with degraded performance.
  2. I used iperf3 in Kubernetes debug pods (plus an iperf3 server running on the Ceph node) and did some measurements in both normal and reverse mode. Other solutions are possible as well; see the sketch below.
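A minimal sketch of that kind of measurement (the image, pod name, and server address are illustrative, not taken from this thread):

# On the Ceph node (or any host on the Ceph public network): start an iperf3 server.
iperf3 -s

# From the Talos cluster: run a throwaway pod with iperf3 and test both directions.
kubectl run iperf3-client --rm -it --restart=Never --image=networkstatic/iperf3 \
  --command -- iperf3 -c 192.168.1.50        # normal mode: pod -> Ceph node
kubectl run iperf3-client --rm -it --restart=Never --image=networkstatic/iperf3 \
  --command -- iperf3 -c 192.168.1.50 -R     # reverse mode: Ceph node -> pod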

dsseng avatar Jan 06 '25 15:01 dsseng

From Proxmox host to Proxmox host (iperf -c 192.168.1.1 C-m -P 1000):
[SUM] 0.0000-10.0517 sec 116 GBytes 99.2 Gbits/sec

From Kubernetes (Talos) to Proxmox:
[SUM] 0.00-11.37 sec 30.5 GBytes 23.1 Gbits/sec

ceph-csi-config

f-wi-plano avatar Jan 08 '25 07:01 f-wi-plano

Is there any difference between good/bad Talos versions (as in Ceph performance)?

dsseng avatar Jan 08 '25 08:01 dsseng

That's a wild guess, but your performance gap could suggest the ceph-csi "mounter" changed between the two Talos versions. In my experience only KRBD (= rbd) provides decent performance for now; rbd-nbd hasn't caught up yet.

Please try this:

  • in your Helm release, set storageClass.mounter: "rbd" explicitly
  • check that the StorageClass has parameters.mounter: "rbd", and no trace of "nbd"
  • benchmark with new PersistentVolumes, because the mounter parameter is set at volume creation and is immutable (a sketch of these checks follows the list)
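A minimal sketch of those checks (the StorageClass name matches the one used earlier in this thread; everything else is illustrative):

# In the ceph-csi-rbd chart values, set the kernel RBD mounter explicitly, e.g.:
#   storageClass:
#     mounter: "rbd"
# (StorageClass parameters are immutable, so the class may need to be recreated.)

# Verify the StorageClass really uses krbd and has no trace of rbd-nbd:
kubectl get storageclass ceph-block -o jsonpath='{.parameters.mounter}{"\n"}'

# The mounter is captured at volume creation time, so benchmark against a fresh PVC:
./kubestr fio -s ceph-block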

npdgm avatar Mar 03 '25 09:03 npdgm

Hi,

I think I have found the reason for the difference in performance. We have two networks: one for the Ceph public network at 100 Gbit/s, and one for the “Kubernetes public” network, a bond at 30 Gbit/s. Cilium is used as the network provider. According to my tests, the traffic is routed via the wrong network card.

I don't have a solution for the problem yet.
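A quick way to confirm which link the Ceph traffic actually takes (a sketch; the node IP, monitor IP, and debug image are placeholders):

# Show the Talos node's routes and addresses to see which NIC owns the route
# towards the Ceph public network.
talosctl --nodes 192.168.1.10 get routes
talosctl --nodes 192.168.1.10 get addresses

# From a host-network debug pod, ask the kernel which interface a monitor IP uses.
kubectl debug node/<talos-node> -it --image=nicolaka/netshoot -- ip route get 10.0.10.5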

plano-fwinkler avatar Mar 03 '25 09:03 plano-fwinkler

Talos sometimes has issues reconciling routes, because it gets stuck trying to merge configuration changes that cannot coexist in the route table. It can happen if you changed things past the initial configuration in an inconvenient order.

You can have a look at RouteSpec objects in their default namespace (network) and in the configuration namespace (network-config). If you see duplicated routes, only one of them is active on Linux at a time; it may change across reboots and be inconsistent between nodes of the cluster too.

This is an example where it went bad when changing the default route from 172.16.100.1 to 172.16.100.230 via MachineConfig. There should be only one default route, not two.

❯ talosctl1.9.3 --talosconfig=./clusterconfig/talosconfig --nodes=172.16.100.31 get routespec
NODE            NAMESPACE   TYPE        ID                           VERSION
172.16.100.31   network     RouteSpec   inet4/172.16.100.1//1024     2
172.16.100.31   network     RouteSpec   inet4/172.16.100.230//1024   2

❯ talosctl1.9.3 --talosconfig=./clusterconfig/talosconfig --nodes=172.16.100.31 get -n network-config routespec
NODE            NAMESPACE   TYPE        ID                           VERSION
172.16.100.31   network     RouteSpec   inet4/172.16.100.1//1024     2
172.16.100.31   network     RouteSpec   inet4/172.16.100.230//1024   2

And the conflict is shown every minute in the machined logs:

❯ talosctl1.9.3 --talosconfig=./clusterconfig/talosconfig --nodes=172.16.100.31 logs machined | grep network.RouteSpecController

172.16.100.31: 2025/03/03 10:38:43.077447 [talos] controller failed {"component": "controller-runtime", "controller": "network.RouteSpecController", "error": "1 error occurred:\n\t* error adding route: netlink receive: file exists, message {Family:2 DstLength:0 SrcLength:0 Tos:0 Table:0 Protocol:4 Scope:0 Type:1 Flags:0 Attributes:{Dst:<nil> Src:<nil> Gateway:172.16.100.1 OutIface:8 Priority:1024 Table:254 Mark:0 Pref:<nil> Expires:<nil> Metrics:<nil> Multipath:[]}}\n\n"}
172.16.100.31: 2025/03/03 10:39:45.499612 [talos] controller failed {"component": "controller-runtime", "controller": "network.RouteSpecController", "error": "1 error occurred:\n\t* error adding route: netlink receive: file exists, message {Family:2 DstLength:0 SrcLength:0 Tos:0 Table:0 Protocol:4 Scope:0 Type:1 Flags:0 Attributes:{Dst:<nil> Src:<nil> Gateway:172.16.100.1 OutIface:8 Priority:1024 Table:254 Mark:0 Pref:<nil> Expires:<nil> Metrics:<nil> Multipath:[]}}\n\n"}
172.16.100.31: 2025/03/03 10:40:24.601504 [talos] controller failed {"component": "controller-runtime", "controller": "network.RouteSpecController", "error": "1 error occurred:\n\t* error adding route: netlink receive: file exists, message {Family:2 DstLength:0 SrcLength:0 Tos:0 Table:0 Protocol:4 Scope:0 Type:1 Flags:0 Attributes:{Dst:<nil> Src:<nil> Gateway:172.16.100.1 OutIface:8 Priority:1024 Table:254 Mark:0 Pref:<nil> Expires:<nil> Metrics:<nil> Multipath:[]}}\n\n"}

Maybe that can help you find the root cause of routing issues, because it seems unlikely Cilium would affect that traffic (krbd is in kernel space and main netns).

npdgm avatar Mar 03 '25 11:03 npdgm

Hey all, I'm also running into this exact issue. My setup, for reference:

  • 6 metal hosts running Proxmox
  • 3 metal hosts with 4 VMs each, running Talos v1.10.2 (2 NICs @ 1G)
  • 3 metal hosts with 6, 6, and 12 OSDs on them (2 NICs @ 10G)

Ceph is set up via Proxmox and imported into the k8s cluster using Rook's external cluster option.

The issue:

Reading large files (even just running head on them), e.g. a large video file, from CephFS mounts takes a long time:

real 1m47.566s
user 0m0.000s
sys  0m44.729s

Interesting Facts:

  • Subsequent reads are fast.
  • Reads on Ceph mounts on the metal hosts are fast (tested all 6):
    real    0m0.539s
    user    0m0.000s
    sys     0m0.056s
  • Reads on an Ubuntu VM on the same host are fast (using the same VM bridge network interface as the Talos VMs):
    real    0m0.262s
    user    0m0.002s
    sys     0m0.018s
  • Reads on manual mounts (i.e. mount -t ceph ...) within a pod on Talos are slow.
  • This seems to have happened overnight, between 12:00 AM and 4:00 AM on Friday.

I've tried this both with the ceph kernel module added in the machine config and without (i.e. the default Talos setup); both show the issue.
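For what it's worth, one way to confirm whether the Ceph kernel modules are actually loaded on a node (a sketch; the node IP is a placeholder):

# List loaded kernel modules on the Talos node and look for the CephFS/RBD clients.
talosctl --nodes 192.168.1.10 read /proc/modules | grep -E '^(ceph|rbd|libceph) '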

Happy to provide anything else to help!

FreekingDean avatar May 25 '25 11:05 FreekingDean

@FreekingDean

Do you have spinning disks or SSDs (if SSDs, do they have PLP)? Also, are the servers within one region, and what is the latency in ms between them?

Still, what I found is that Ceph is best enjoyed over 10G, also in the Talos VMs, with the lowest latency possible.

Also, I'm not really sure whether it's necessary, but it can't hurt: try a traceroute from within a Talos machine to see if there are any unnecessary hops.
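Since Talos has no shell on the node, a trace can be run from a host-network debug pod (a sketch; the debug image and monitor IP are illustrative):

# Trace the path towards a Ceph monitor from the node's network namespace.
kubectl debug node/<talos-node> -it --image=nicolaka/netshoot -- traceroute 10.0.10.5

# Or use mtr for a per-hop latency report.
kubectl debug node/<talos-node> -it --image=nicolaka/netshoot -- mtr -rw 10.0.10.5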

Munsio avatar May 25 '25 22:05 Munsio

Thank you! It is very odd. Ceph is definitely more enjoyable on 10G, and I do have that upgrade coming. These are spinning disks, but I've had this exact setup for ~2 years with Fedora CoreOS and k3s without any issue, so I'd be surprised if the disks or the network were the cause. I also don't have this issue in an Ubuntu VM using the same setup. ~The only difference with the Ubuntu VM is that it's not running 6.12, so I'm updating it to see if I get the same issue.~ Just updated the kernel, and the Ubuntu VM is still fine on 6.12.

FreekingDean avatar May 26 '25 01:05 FreekingDean

The trace was a great callout! I am seeing a hop to my router first that nothing else is doing!

FreekingDean avatar May 26 '25 11:05 FreekingDean

Okay, I worked through this and set up the proper routes, and I'm still running into it. This seems to be exclusive to Talos.

FreekingDean avatar Jun 25 '25 17:06 FreekingDean

probably https://github.com/siderolabs/talos/issues/11129 ?

steverfrancis avatar Jun 25 '25 17:06 steverfrancis

Wow! Yes! Thank you <3 We can likely close this older ticket too. (I'm testing this right now)

This was 100% it, we can close this imho.

FreekingDean avatar Jun 25 '25 17:06 FreekingDean

Should be fixed in v1.10.4 and later.

smira avatar Jun 30 '25 16:06 smira

After upgrading to 1.10.4 we still get this error message. Do we have to change anything else?

[Image: screenshot of the error message]

plano-fwinkler avatar Jul 01 '25 13:07 plano-fwinkler

@plano-fwinkler please open a separate issue, and don't comment on unrelated closed issues.

smira avatar Jul 01 '25 14:07 smira