linstor-server
Slow server responses causing PV attachment timeouts
We're using piraeusdatastore to provide linstor PVs to a k8s cluster which we're currently building.
We're testing various scenarios with 2 worker nodes and a tiebreaker, where we want automatic fail-over in case of a single node failure.
We initially noticed that upon node failure, the attachments of PVs that had been attached to pods on the failed node were timing out on the non-failing node during fail-over and ultimately never completed (the pods were stuck in the Pending state):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 23m (x2 over 30m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[kube-api-access-flfr4 data]: timed out waiting for the condition
Warning FailedAttachVolume 108s (x37 over 63m) attachdetach-controller AttachVolume.Attach failed for volume "pvc-ccc2551c-f7d6-44ae-a148-7b567c5a5801" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Warning FailedMount 42s (x26 over 61m) kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data kube-api-access-flfr4]: timed out waiting for the condition
We submitted an issue over at the piraeusdatastore repo here https://github.com/piraeusdatastore/linstor-csi/issues/172.
We then did some pruning on the system, deleting old images and stale containers. We're running ZFS on root, so each container and image uses a ZFS dataset.
After the pruning, the PVs would attach on the non-failing node after about 15 mins.
When the PVs were not attaching we had around 1300 datasets per node (per zfs list | wc -l); after the pruning this came down to 679. However, in production we expect this number to be higher than 1300.
With 679 ZFS datasets, zfs list takes about 1.6-1.8 secs. By contrast, if we do zfs list -r on the pool assigned to linstor, it only takes 14 ms. Would it be much faster to list all of the pools attached to linstor individually?
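For illustration, with our numbers (rpool/linstor is a stand-in name for the pool we've assigned to linstor):
time zfs list                    # walks every dataset on the system: ~1.6-1.8 s with 679 datasets
time zfs list -r rpool/linstor   # recurses only into the linstor pool: ~14 ms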
You can see the exact requests which are timing out on the linstor-server in this reply on the issue https://github.com/piraeusdatastore/linstor-csi/issues/172#issuecomment-1233157045
Listing resources is also slow (unsure if this is normal):
time linstor r l -a
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-3eb665ec-27ec-4875-8df9-106751f1a832 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7008 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:39:18 ┊
┊ pvc-3eb665ec-27ec-4875-8df9-106751f1a832 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7008 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:39:41 ┊
┊ pvc-3eb665ec-27ec-4875-8df9-106751f1a832 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7008 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:40:27 ┊
┊ pvc-3ec4cd7c-a369-4fe8-b0c3-68528ad1ef37 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7007 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:39:18 ┊
┊ pvc-3ec4cd7c-a369-4fe8-b0c3-68528ad1ef37 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7007 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:39:39 ┊
┊ pvc-3ec4cd7c-a369-4fe8-b0c3-68528ad1ef37 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7007 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:39:56 ┊
┊ pvc-95f607ac-89fb-43d3-81a1-2ad89bdfa438 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7004 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:38:34 ┊
┊ pvc-95f607ac-89fb-43d3-81a1-2ad89bdfa438 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7004 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:38:53 ┊
┊ pvc-95f607ac-89fb-43d3-81a1-2ad89bdfa438 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7004 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:39:30 ┊
┊ pvc-8932ceaf-ebc6-4c7f-855f-b3c8da2a9564 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7009 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:39:19 ┊
┊ pvc-8932ceaf-ebc6-4c7f-855f-b3c8da2a9564 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7009 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:39:44 ┊
┊ pvc-8932ceaf-ebc6-4c7f-855f-b3c8da2a9564 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7009 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:40:27 ┊
┊ pvc-28069ce0-b3a0-4f46-94db-ae5f2eda5d4d ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7006 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:38:34 ┊
┊ pvc-28069ce0-b3a0-4f46-94db-ae5f2eda5d4d ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7006 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:38:59 ┊
┊ pvc-28069ce0-b3a0-4f46-94db-ae5f2eda5d4d ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7006 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:39:30 ┊
┊ pvc-d2b15e29-2594-40b3-90ae-9a14c0fda0bc ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7003 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:38:34 ┊
┊ pvc-d2b15e29-2594-40b3-90ae-9a14c0fda0bc ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7003 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:38:50 ┊
┊ pvc-d2b15e29-2594-40b3-90ae-9a14c0fda0bc ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7003 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:39:08 ┊
┊ pvc-fbce07ea-aa30-4093-acba-82661d040903 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7010 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:39:19 ┊
┊ pvc-fbce07ea-aa30-4093-acba-82661d040903 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7010 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:39:46 ┊
┊ pvc-fbce07ea-aa30-4093-acba-82661d040903 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7010 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:40:27 ┊
┊ pvc-fc12a115-ace3-49d2-8fb1-a3cb8b1a3b38 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7005 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:38:34 ┊
┊ pvc-fc12a115-ace3-49d2-8fb1-a3cb8b1a3b38 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7005 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:38:56 ┊
┊ pvc-fc12a115-ace3-49d2-8fb1-a3cb8b1a3b38 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7005 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:39:30 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7000 ┊ ┊ ┊ Unknown ┊ 2022-08-31 15:37:33 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7000 ┊ Unused ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ TieBreaker ┊ 2022-08-31 15:37:50 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7000 ┊ InUse ┊ Connecting(dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com) ┊ UpToDate ┊ 2022-08-31 15:38:00 ┊
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
________________________________________________________
Executed in 3.88 secs fish external
usr time 270.16 millis 563.00 micros 269.60 millis
sys time 77.70 millis 342.00 micros 77.36 millis
Here are some benchmarks with 679 ZFS datasets.
ZFS version:
zfs-2.1.2-1ubuntu3
zfs-kmod-2.1.4-0ubuntu0.1
https://github.com/LINBIT/linstor-server/blob/a758bf07796c374fd2004465b0d8690209b74356/satellite/src/main/java/com/linbit/linstor/layer/storage/zfs/utils/ZfsCommands.java#L25-L30
Executed in 3.42 secs fish external
usr time 0.15 secs 499.00 micros 0.15 secs
sys time 3.27 secs 390.00 micros 3.27 secs
https://github.com/LINBIT/linstor-server/blob/a758bf07796c374fd2004465b0d8690209b74356/satellite/src/main/java/com/linbit/linstor/layer/storage/zfs/utils/ZfsCommands.java#L306-L311
Executed in 1.59 secs fish external
usr time 0.08 secs 472.00 micros 0.08 secs
sys time 1.51 secs 369.00 micros 1.51 secs
The rest of the commands in ZfsCommands.java complete on the order of milliseconds.
I could try passing the volume and snapshot data through here:
https://github.com/LINBIT/linstor-server/blob/a758bf07796c374fd2004465b0d8690209b74356/satellite/src/main/java/com/linbit/linstor/layer/storage/zfs/ZfsProvider.java#L167
then construct the fullQualifiedId and run the command in a loop over those specific datasets.
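Concretely, each iteration would then run something like the following, assuming the same flags as the existing listing in ZfsCommands.java (the dataset name is illustrative):
zfs list -H -p -o name,used,refer,volsize,available,type -t volume,snapshot rpool/team-100/pvc-example_00000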
I will need to figure out how to build the linstor-server and dockerize it (using https://github.com/piraeusdatastore/piraeus/tree/master/dockerfiles/piraeus-server) into the piraeus-server image (I can host it on my own repo for testing).
I am not sure if I understood you correctly, but if you mean that Linstor might optimize some zfs list commands, I could not reproduce the issue here. It might be because of different ZFS versions:
root@bravo:~# zfs --version
zfs-0.8.3-1ubuntu12.14
zfs-kmod-0.8.3-1ubuntu12.14
root@bravo:~# time zfs list -H -p -o name,used,refer,volsize,available,type -t volume,snapshot,filesystem | wc -l
701
real 0m0.144s
user 0m0.041s
sys 0m0.103s
root@bravo:~# time zfs list -r | wc -l
702
real 0m0.098s
user 0m0.021s
sys 0m0.076s
root@bravo:~# uname -a
Linux bravo 5.4.0-122-generic #138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@bravo:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.4 LTS"
@ghernadi do you have any load?
The slowness of zfs list is quite well documented, going all the way back to v6.
As stated here:
https://github.com/openzfs/zfs/issues/5558#issuecomment-270286066
it may be better to move away from the ZFS CLI entirely, as it doesn't look like there's a fix for this; if a user has many datasets in their Linstor pool, they will come across this issue eventually.
In the meantime I will still work to see if listing the datasets individually will improve performance (it definitely will in our case).
Here's a list of issues about the slowness of zfs list:
https://github.com/openzfs/zfs/discussions/8898
https://github.com/openzfs/zfs/issues/11491
https://github.com/openzfs/zfs/issues/5558
https://github.com/openzfs/zfs/issues/8587
https://github.com/openzfs/zfs/issues/5218
https://github.com/openzfs/zfs/issues/2131
https://github.com/openzfs/zfs/issues/722
https://github.com/openzfs/zfs/issues/450
do you have any load?
No, and from the links you shared I do believe you that zfs list (still) has performance issues, but I am not sure how we can improve that. Linstor uses zfs list not only to see what volumes exist and what sizes they are (which could be achieved by looking for /dev/... entries, for example), but also to see how much free space the storage pools have.
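For example, the free-space part on its own could presumably be answered with a much narrower query than a full listing (pool dataset name illustrative):
zfs get -H -p available rpool/team-100   # prints only the 'available' property of that one dataset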
Feel free to make suggestions on how to improve Linstor or make it less dependent on zfs list. I can certainly help implement it, but I don't know what to implement right now in this regard.
@ghernadi Can you take a look at https://github.com/Rid/linstor-server/commit/427185c94a02dfb731e3bb53121d26e910ba03a3? It's just a quick idea which could speed up systems where there are many datasets outside of linstor (such as in our case).
I'd like to be able to build deb packages for testing; however, I can't find the tooling for doing the build. I can see the debian folder but no DEBIAN folder.
Can you let me know how to build the .debs?
I ended up putting the built source into a docker container, however running satellites is generating errors and not connecting:
ERROR REPORT 6318F776-47208-000055
============================================================
Application: LINBIT® LINSTOR
Module: Satellite
Version: 0.1
Build ID: 427185c94a02dfb731e3bb53121d26e910ba03a3
Build time: 2022-09-07T16:15:05+00:00
Error time: 2022-09-07 19:59:21
Node: dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com
Peer: 23.106.61.232:54208
============================================================
Reported error:
===============
Category: Error
Class name: ImplementationError
Class canonical name: com.linbit.ImplementationError
Generated at: Method 'run', Source file 'TcpConnectorService.java', Line #734
Error message: Unhandled IllegalStateException
Call backtrace:
Method Native Class:Line number
run N com.linbit.linstor.netcom.TcpConnectorService:734
run N java.lang.Thread:829
Caused by:
==========
Category: RuntimeException
Class name: IllegalStateException
Class canonical name: java.lang.IllegalStateException
Generated at: Method 'doHandshake', Source file 'SslTcpConnectorHandshaker.java', Line #103
Error message: com.linbit.linstor.netcom.ssl.SslTcpConnectorService indicates requiring a handshake, but the sun.security.ssl.SSLEngineImpl instance is not in handshake mode
Call backtrace:
Method Native Class:Line number
doHandshake N com.linbit.linstor.netcom.ssl.SslTcpConnectorHandshaker:103
read N com.linbit.linstor.netcom.ssl.SslTcpConnectorPeer:162
run N com.linbit.linstor.netcom.TcpConnectorService:543
run N java.lang.Thread:829
END OF ERROR REPORT.
Do the servers & satellites all need to be the same version? At the moment I only have the satellites running this code.
The previous error was resolved by updating the server containers to the same version.
However, the zfs list fails, as it appears that zpool and identifier are not initially set in vlmDataListRef & snapVlmsRef.
I'll take another look at it and see if I can work around that.
OK, I've tested the following commit: https://github.com/Rid/linstor-server/commit/458216ae0fe2168a77ac49f6d5afc14a3433b703
Before the changes:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7004 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:41 ┊
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7004 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:50 ┊
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7004 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:07:04 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7008 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7008 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:59 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7008 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:59 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7009 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7009 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:38 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7009 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:18 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7005 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:41 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7005 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:53 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7005 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:07:16 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7003 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:24 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7003 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:29 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7003 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:44 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7007 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7007 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:55 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7007 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:39 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7006 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:07:32 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7006 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:07:54 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7006 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:06 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-08-31 15:37:33 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7000 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-08-31 15:37:50 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-08-31 15:38:00 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
________________________________________________________
Executed in 4.23 secs fish external
usr time 259.05 millis 489.00 micros 258.56 millis
sys time 101.30 millis 260.00 micros 101.03 millis
After the changes:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName ┊ Node ┊ Port ┊ Usage ┊ Conns ┊ State ┊ CreatedOn ┊
╞═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7004 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:41 ┊
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7004 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:50 ┊
┊ pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7004 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:07:04 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7008 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7008 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:59 ┊
┊ pvc-2144d02d-acaf-41cd-ac62-96066fe5abef ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7008 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:59 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7009 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7009 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:38 ┊
┊ pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7009 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:18 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7005 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:41 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7005 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:53 ┊
┊ pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7005 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:07:16 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7003 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:24 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7003 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 13:06:29 ┊
┊ pvc-bc09e2d0-5c21-410e-8346-d597586060c8 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7003 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 13:06:44 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7007 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:19 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7007 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:08:55 ┊
┊ pvc-ca568d40-223b-4aa9-a933-1d577169079b ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7007 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:09:39 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7006 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:07:32 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7006 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-09-07 18:07:54 ┊
┊ pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7006 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-09-07 18:08:06 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ dedi1-node1.23-106-60-155.lon-01.uk.appsolo.com ┊ 7000 ┊ Unused ┊ Ok ┊ UpToDate ┊ 2022-08-31 15:37:33 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm6-cplane1.23-106-61-231.lon-01.uk.appsolo.com ┊ 7000 ┊ Unused ┊ Ok ┊ TieBreaker ┊ 2022-08-31 15:37:50 ┊
┊ pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0 ┊ vm9-node2.23-106-61-193.lon-01.uk.appsolo.com ┊ 7000 ┊ InUse ┊ Ok ┊ UpToDate ┊ 2022-08-31 15:38:00 ┊
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
________________________________________________________
Executed in 679.12 millis fish external
usr time 241.85 millis 708.00 micros 241.15 millis
sys time 78.43 millis 362.00 micros 78.06 millis
Would you like me to submit a PR?
Actually, I'm thinking we could make this a lot more efficient: instead of looping through every volume, we just group them by storage pool and then list the storage pools.
I'll adjust the commit.
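Roughly, the satellite would then run one recursive listing per distinct storage pool instead of one per volume, something like (pool name and flags illustrative):
zfs list -H -p -r -o name,used,refer,volsize,type -t volume,snapshot rpool/team-100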
Here is the commit for only listing unique storage pools https://github.com/Rid/linstor-server/commit/7ccceac946985e3e9f948bcfce3a8891017b94bf
It doesn't actually make any noticeable difference compared to the previous commit, but it should if you have pools with more datasets.
I'm not sure what the bottleneck is in this case that is pushing the response up to 700 ms in the linstor command.
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 381.95 millis fish external
usr time 295.58 millis 0.00 millis 295.58 millis
sys time 93.22 millis 1.45 millis 91.77 millis
Looks like some of it is spent creating the JSON object. I think 300 ms should eliminate the timeout problem.
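As a rough way to separate the server response time from the jq/JSON processing, the same request can be timed without the jq step:
time kubectl exec deployment/piraeus-op-cs-controller -- curl -s --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' > /dev/null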
@ghernadi Can you take a look at Rid@427185c it's just a quick idea which could speed up systems where there are many datasets outside of linstor (such as our case).
Here is the commit for only listing unique storage pools https://github.com/Rid/linstor-server/commit/7ccceac946985e3e9f948bcfce3a8891017b94bf
Good idea, but instead of looping over all storage pools and calling individual zfs list ... commands, one could also group all storage pools into one zfs list -r ... sp1 sp2 sp3 ... command. I started this branch earlier today and just pushed it here; I would appreciate it if you could test it: https://github.com/LINBIT/linstor-server/tree/gh/zfs-performance
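In other words, a single invocation covering all relevant pools at once rather than one call per pool (pool names and flags illustrative):
zfs list -H -p -r -o name,used,refer,volsize,type -t volume,snapshot rpool/team-100 rpool/team-101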
I'd like to be able to build deb packages for testing; however, I can't find the tooling for doing the build. I can see the debian folder but no DEBIAN folder.
Can you let me know how to build the .debs?
Something like VERSION=1.19.1 make debrelease should do the trick. Make sure your git is clean, i.e. everything committed, nothing in staging.
But I assume you already figured that out :)
I ended up putting the built source into a docker container, however running satellites is generating errors and not connecting:
Yes, but that does not seem to be the problem shown in the ErrorReport. After the controller establishes the connection to the satellite, the satellite reports its own Linstor version along with other things. If the controller has a different version than the satellite, the controller closes the connection again and marks the satellite as Offline (Version mismatch) or something like that.
However, the ErrorReport you are showing looks like an SSL error, where the connection cannot be established (i.e. before the verification of the version).
That makes sense, I will test it now and report back.
@ghernadi tests were successful, performance is fine, I think it can be merged.
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 409.24 millis fish external
usr time 328.96 millis 0.00 millis 328.96 millis
sys time 85.19 millis 1.23 millis 83.95 millis
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 394.14 millis fish external
usr time 348.51 millis 694.00 micros 347.82 millis
sys time 83.14 millis 364.00 micros 82.78 millis
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 397.08 millis fish external
usr time 306.30 millis 731.00 micros 305.57 millis
sys time 104.89 millis 384.00 micros 104.50 millis
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 382.34 millis fish external
usr time 305.62 millis 781.00 micros 304.84 millis
sys time 103.11 millis 410.00 micros 102.70 millis
time kubectl exec deployment/piraeus-op-cs-controller -- curl --cacert /etc/linstor/client/ca.crt --key /etc/linstor/client/tls.key --cert /etc/linstor/client/tls.crt 'https://piraeus-op-cs.default.svc:3371/v1/view/resources?limit=0&offset=0' | jq '.[].layer_object.drbd.drbd_resource_definition.secret = ""'
________________________________________________________
Executed in 360.52 millis fish external
usr time 287.06 millis 756.00 micros 286.30 millis
sys time 100.86 millis 398.00 micros 100.47 millis
The actual zfs list now takes around 18 ms:
rpool/team-100/pvc-2144d02d-acaf-41cd-ac62-96066fe5abef_00000 16248960 16248960 209797120 volume
rpool/team-100/pvc-2bb12d1e-4029-42a3-b2c1-3ffa7c4e567c_00000 833990976 833990976 2147983360 volume
rpool/team-100/pvc-4dda6ad4-5e31-4ac2-bba7-8ed5046cf311_00000 2690688 2690688 8591810560 volume
rpool/team-100/pvc-6448d0a4-d0b0-4a87-b073-f204e3128bda_00000 2856672 2856672 2147983360 volume
rpool/team-100/pvc-b8df4a81-6260-4b81-9e1c-5f29e28e243e_00000 2935296 2935296 2147983360 volume
rpool/team-100/pvc-bc09e2d0-5c21-410e-8346-d597586060c8_00000 2856672 2856672 11813724160 volume
rpool/team-100/pvc-ca568d40-223b-4aa9-a933-1d577169079b_00000 2865408 2865408 209797120 volume
rpool/team-100/pvc-ed95bc34-dbf1-4504-ae30-a4c5f32952e0_00000 2882880 2882880 22553436160 volume
rpool/team-100/pvc-ff369c1b-3abd-4e90-9481-7aeb376626e0_00000 2996448 2996448 11813724160 volume
________________________________________________________
Executed in 18.76 millis fish external
usr time 7.62 millis 273.00 micros 7.35 millis
sys time 11.20 millis 180.00 micros 11.02 millis
Do you have any idea where the extra 350 ms could be coming from on view/resources?
The linstor r l -a command takes around 600 ms.
I might be wrong here, but there might be quite a delay in opening a socket to the controller or something like that. I'd have to investigate to refresh my memory, but I believe about 200-300 ms are lost between client <-> controller.
Maybe @rp- knows more in this regard.
view/resources fetches the current allocated space from thin volumes, so it has to query all satellites, which takes time.
view/resources fetches the current allocated space from thin volumes, so it has to query all satellites, which takes time.
There are only 2 satellites, each taking around 20 ms to do a zfs list, but shouldn't these requests be asynchronous? In that case we'd expect all values to be returned within about 20 ms; worst case, if it's not async, would be 40 ms.
The servers are both in the same rack, so RTT should be close to 0-5 ms.