swarmkit icon indicating copy to clipboard operation
swarmkit copied to clipboard

manager/csi/manager.go: getPlugin: Issue with handling plugin names related to image tags

Open ppignet opened this issue 3 months ago • 2 comments

I was working on debugging compatibility with the CephCSI driver and stumbled upon an issue in how plugin names and aliases are handled.

Installed the driver using this cmd:

docker plugin rm rbd.csi.ceph.com --force && \
docker plugin install <privaterepo>/ceph-csi-swarm/cephcsi-aio:canary \
--grant-all-permissions \
NODE_ID=test-docker-1 \
CEPHCSI_VERBOSITY=10 \
DEBUG_ENTRYPOINT=true \
--alias rbd.csi.ceph.com

Created CSI volume and a service using it fine. Basically, everything worked up until volume publishing, where I was met with an error message from Docker stating: CSI node ID not found for given Swarm node ID.

Looking at the code:

That log only triggers if this lookup fails:

csiNodeID := p.swarmToCSI[nodeID]
if csiNodeID == "" { ... }

There's two maps in the plugin object swarmToCSI and csiToSwarm. These are correctly populated upon daemon startup, which I checked using debug statements:

Image

At the point in time where these maps are accessed, they seem to be emtpy:

Image

I further tested this by logging the ptr of the underlying plugin object which confirmed there were two "virtual" instances of the same plugin at runtime: At start:

Image

At use:

Image

Looking at where plugin object instances are created, manager/csi/manager.go, and the usages thereof, it seems like handleNode gets the plugin's name from info.PluginName, (https://github.com/moby/swarmkit/blob/3a23580599a45034467695b859a0e794e62bddfa/manager/csi/manager.go#L406) whereas all other codepaths get it from v.Spec.Driver.Name.

This effectively causes two driver objects to be created from one installed driver, one with the tag in the alias and one without, where one instance is correctly populated with values at startup, but the other unpopulated instance is what is actually used later in the code, causing the CSI driver to fail at NodePublishVolume.


To fix this I modified the getPlugin method to normalize/canonicalize plugin names correctly. https://github.com/ppignet/swarmkit/tree/csi/manager_normalize_plugin_name

I confirmed that this fixes the issue by veryfing the ptr is now identical:

upon startup: Image

upon usage:

Image Image

..., and now being able to spin up many volumes and services without errors anymore! 😺

Lmk if this is a non-issue and just me doing something wrong. But I think this addition would make handling more fail-proof in any case.

ppignet avatar Nov 05 '25 16:11 ppignet

cc @dperny PTAL

thaJeztah avatar Nov 05 '25 16:11 thaJeztah

@thaJeztah @dperny

Would someone with deeper knowledge of this area be able to take a look at the issue and the associated PR? Thanks in advance!

gschwienbacher avatar Dec 03 '25 07:12 gschwienbacher