Ceph-CSI: connecting to an existing CephFS subvolume fails in Nomad
When using this plugin with Nomad to mount an existing subvolume in the default Ceph subvolume group /volumes/_nogroup/ and registering it with nomad volume register volume.hcl, the following error occurs:
Error: GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = missing required field monitors
Since the same CSI plugin works for RBD volumes as well as for CephFS volumes created by Nomad on the same Ceph cluster, this might be a bug.
CSI-Controller configuration:
job "ceph-csi-cephfs-plugin-controller" {
namespace = "system-infrastructure"
datacenters = ["dc1"]
priority = 100
update {
max_parallel = 1
min_healthy_time = "10s"
healthy_deadline = "3m"
auto_revert = true
auto_promote = false
canary = 1
stagger = "30s"
}
group "controller" {
network {
port "metrics" {}
}
task "ceph-cephfs-controller" {
template {
data = <<EOF
[{
"clusterID": "<ClusterID>",
"monitors": [
"<MonitorIP1>",
"<MonitorIP2>",
"<MonitorIP3>",
"<MonitorIP4>",
"<MonitorIP5>"
]
}]
EOF
destination = "local/config.json"
change_mode = "restart"
}
driver = "docker"
config {
image = "quay.io/cephcsi/cephcsi:v3.13.0"
volumes = [
"./local/config.json:/etc/ceph-csi-config/config.json"
]
mounts = [
{
type = "tmpfs"
target = "/tmp/csi/keys"
readonly = false
tmpfs_options = {
size = 1000000 # size in bytes
}
}
]
args = [
"--type=cephfs",
"--controllerserver=true",
"--drivername=cephfs.csi.ceph.com",
"--endpoint=unix://csi/csi.sock",
"--nodeid=${node.unique.name}",
"--instanceid=${node.unique.name}-controller",
"--pidlimit=-1",
"--logtostderr=true",
"--v=5",
"--metricsport=$${NOMAD_PORT_metrics}"
]
}
resources {
cpu = 50
memory = 64
memory_max = 256
}
service {
name = "ceph-csi-cephfs-controller"
port = "metrics"
tags = [ "prometheus" ]
}
csi_plugin {
id = "ceph-csi-cephfs"
type = "controller"
mount_dir = "/csi"
}
}
}
}
CSI-Node configuration:
job "ceph-csi-cephfs-plugin-nodes" {
namespace = "system-infrastructure"
datacenters = ["dc1"]
priority = 100
type = "system"
update {
max_parallel = 1
min_healthy_time = "10s"
healthy_deadline = "3m"
auto_revert = true
auto_promote = false
canary = 1
stagger = "30s"
}
group "nodes" {
network {
port "metrics" {}
}
task "ceph-node" {
driver = "docker"
template {
data = <<EOF
[{
"clusterID": "<ClusterID>",
"monitors": [
"<MonitorIP1>",
"<MonitorIP2>",
"<MonitorIP3>",
"<MonitorIP4>",
"<MonitorIP5>"
]
}]
EOF
destination = "local/config.json"
change_mode = "restart"
}
config {
image = "quay.io/cephcsi/cephcsi:v3.13.0"
volumes = [
"./local/config.json:/etc/ceph-csi-config/config.json"
]
mounts = [
{
type = "tmpfs"
target = "/tmp/csi/keys"
readonly = false
tmpfs_options = {
size = 1000000 # size in bytes
}
}
]
args = [
"--type=cephfs",
"--drivername=cephfs.csi.ceph.com",
"--nodeserver=true",
"--endpoint=unix://csi/csi.sock",
"--nodeid=${node.unique.name}",
"--instanceid=${node.unique.name}-nodes",
"--pidlimit=-1",
"--logtostderr=true",
"--v=5",
"--metricsport=$${NOMAD_PORT_metrics}"
]
privileged = true
}
resources {
cpu = 50
memory = 64
memory_max = 256
}
service {
name = "ceph-csi-cephfs-nodes"
port = "metrics"
tags = [ "prometheus" ]
}
csi_plugin {
id = "ceph-csi-cephfs"
type = "node"
mount_dir = "/csi"
}
}
}
}
Nomad-Volume configuration:
id = "<random Volume ID/Name>"
name = "<random Volume Name>"
namespace = "<namespace>"
type = "csi"
plugin_id = "ceph-csi-cephfs"
external_id = "<Volume ID in ceph>"
capability {
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"
}
mount_options {
fs_type = "ceph"
mount_flags = ["noatime"]
}
secrets {
userID = "<ceph User>"
userKey = "<Secret>"
}
parameters {
clusterID = "<ceph-Cluster ID>"
staticVolume = "true"
fsName = "<cephFS Name>"
rootPath = "/volumes/_nogroup/<SubVolume>"
}
Nomad-Job configuration used for testing:
job "csi-volume-test" {
datacenters = ["dc1"]
namespace = "<namespace>"
type = "batch"
group "test" {
task "write-read-volume" {
driver = "docker"
config {
image = "alpine"
command = "sh"
args = ["-c", "echo 'Hello from CSI volume!' > /mnt/testvol/hello.txt && cat /mnt/testvol/hello.txt"]
}
volume_mount {
volume = "testvol"
destination = "/mnt/testvol"
read_only = false
}
resources {
cpu = 25
memory = 25
}
}
volume "testvol" {
type = "csi"
read_only = false
source = "<Volume ID>"
attachment_mode = "file-system" # oder "block" je nach Plugin
access_mode = "multi-node-multi-writer"
}
}
}
Ceph user rights:
[client.<username>]
key = <Secret>
caps mds = "allow rw fsname=<fsName>"
caps mon = "allow r fsname=<fsName>"
caps osd = "allow rw tag cephfs data=<fsName>"
Ceph Volume Info:
{
  "mon_addrs": [
    "<MonitorIP1>:6789",
    "<MonitorIP2>:6789",
    "<MonitorIP3>:6789",
    "<MonitorIP4>:6789",
    "<MonitorIP5>:6789"
  ],
  "pending_subvolume_deletions": 0,
  "pools": {
    "data": [
      {
        "avail": 424216067309568,
        "name": "cephfs.<Volume Name>.data",
        "used": 12288
      }
    ],
    "metadata": [
      {
        "avail": 424216067309568,
        "name": "cephfs.<Volume Name>.meta",
        "used": 2698643
      }
    ]
  },
  "used_size": 127
}
Ceph Subvolume info:
{
  "atime": "2025-05-26 06:39:20",
  "bytes_pcent": "0.00",
  "bytes_quota": 1099511627776,
  "bytes_used": 0,
  "created_at": "2025-05-26 06:39:20",
  "ctime": "2025-06-03 13:36:05",
  "data_pool": "cephfs.<data pool name>",
  "features": [
    "snapshot-clone",
    "snapshot-autoprotect",
    "snapshot-retention"
  ],
  "flavor": 2,
  "gid": 0,
  "mode": 16895,
  "mon_addrs": [
    "<MonitorIP1>",
    "<MonitorIP2>",
    "<MonitorIP3>",
    "<MonitorIP4>",
    "<MonitorIP5>"
  ],
  "mtime": "2025-06-03 13:36:05",
  "path": "/volumes/_nogroup/<SubVolumeName>/<ID>",
  "pool_namespace": "",
  "state": "complete",
  "type": "subvolume",
  "uid": 0
}
clusterID = "<ceph-Cluster ID>"
[{ "clusterID": "<ClusterID>", "monitors": [ "<MonitorIP1>", "<MonitorIP2>", "<MonitorIP3>", "<MonitorIP4>", "<MonitorIP5>" ] }]
The clusterID specified when creating the volume should match the clusterID and monitor mapping created above.
Yes, the clusterID is the same across all the configurations, and they all use the same monitors. As stated, the plugin works for another CephFS pool on the same Ceph cluster, where Nomad creates the subvolumes in a dedicated subvolume group, and also for RBD storage on another pool. Only using an already existing subvolume from Nomad (the job configuration above) seems to cause problems.
GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = missing required field monitors
@uvensys-kirchen okay, in that case I am not aware of it; from the CSI logs it looks like a configuration issue.
Can you share some of the logs where the error pops up? The Ceph-CSI containers produce quite a bit of logging, and that can help to point to the area where the monitors are missing or incorrectly configured (or possibly something else).
At a glance, I do not see anything missing, at least compared to the static CephFS volume docs.
I0603 13:40:22.284107 1 utils.go:266] ID: 6782 Req-ID: <nomad-ceph-volume-ID> GRPC call: /csi.v1.Node/NodeStageVolume
I0603 13:40:22.284218 1 utils.go:267] ID: 6782 Req-ID: <nomad-ceph-volume-ID> GRPC request: {"secrets":"***stripped***","staging_target_path":"/local/csi/staging/<Nomad-Namespace>/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_id":"<nomad-ceph-volume-ID>"}
E0603 13:40:22.284256 1 utils.go:271] ID: 6782 Req-ID: <nomad-ceph-volume-ID> GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = missing required field monitors
I0603 13:40:22.308266 1 utils.go:266] ID: 6783 Req-ID: <nomad-ceph-volume-ID> GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0603 13:40:22.308329 1 utils.go:267] ID: 6783 Req-ID: <nomad-ceph-volume-ID> GRPC request: {"target_path":"/local/csi/per-alloc/20fced5b-42f1-969f-8933-34d72b5ac69f/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer","volume_id":"<nomad-ceph-volume-ID>"}
E0603 13:40:22.308379 1 nodeserver.go:620] ID: 6783 Req-ID: <nomad-ceph-volume-ID> stat failed: stat /local/csi/per-alloc/20fced5b-42f1-969f-8933-34d72b5ac69f/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer: no such file or directory
I0603 13:40:22.308387 1 nodeserver.go:624] ID: 6783 Req-ID: <nomad-ceph-volume-ID> targetPath: /local/csi/per-alloc/20fced5b-42f1-969f-8933-34d72b5ac69f/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer has already been deleted
I0603 13:40:22.308395 1 utils.go:273] ID: 6783 Req-ID: <nomad-ceph-volume-ID> GRPC response: {}
I0603 13:40:22.308683 1 utils.go:266] ID: 6784 Req-ID: <nomad-ceph-volume-ID> GRPC call: /csi.v1.Node/NodeUnstageVolume
I0603 13:40:22.308708 1 utils.go:267] ID: 6784 Req-ID: <nomad-ceph-volume-ID> GRPC request: {"staging_target_path":"/local/csi/staging/<Nomad-Namespace>/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer","volume_id":"<nomad-ceph-volume-ID>"}
I0603 13:40:22.308752 1 utils.go:273] ID: 6784 Req-ID: <nomad-ceph-volume-ID> GRPC response: {}
I0603 13:40:22.284218 1 utils.go:267] ID: 6782 Req-ID: <nomad-ceph-volume-ID> GRPC request: {"secrets":"***stripped***","staging_target_path":"/local/csi/staging/<Nomad-Namespace>/<nomad-ceph-volume-ID>/rw-file-system-multi-node-multi-writer","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":5}},"volume_id":"<nomad-ceph-volume-ID>"}
The request does not seem to contain the volume_context that contains the cluster-id and other required parameters (see volumeAttributes in this example). You may need to add that to your volume "testvol" section in the Nomad Job.
Please double-check against the parameters block defined in the Nomad-Volume configuration I posted. The same file with different parameters is able to create a new subvolume with a dedicated subvolume group; the only major change there is the rootPath. If there are further attributes that need to be specified here, I am not aware of what they are.
PS: I also tested specifying the monitors of the Ceph cluster directly in the parameters block of the Nomad-Volume configuration, but got the same result.
A NodeStageVolume procedure does not use parameters. Instead, it should get a volume_context, which is missing.
I do not know how to place that in a Nomad job, but the documentation suggests that there is a context as part of the parameters.
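For illustration, here is a minimal sketch of how that could look in the Nomad volume registration file, assuming the keys from the parameters block posted above are simply moved into a context block and the monitors are added as a comma-separated string (all values are placeholders; a confirmed working example is shared later in this thread):
# Hypothetical volume.hcl sketch - key names follow the ceph-csi static-volume
# docs and the parameters block above; all values are placeholders.
id        = "<random Volume ID/Name>"
name      = "<random Volume Name>"
type      = "csi"
plugin_id = "ceph-csi-cephfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

# context (rather than parameters) is what ends up in the CSI volume_context
context {
  clusterID    = "<ceph-Cluster ID>"
  fsName       = "<cephFS Name>"
  monitors     = "<MonitorIP1>:6789,<MonitorIP2>:6789,<MonitorIP3>:6789"
  staticVolume = "true"
  rootPath     = "/volumes/_nogroup/<SubVolume>"
}

secrets {
  userID  = "<ceph User>"
  userKey = "<Secret>"
}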
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
A NodeStageVolume procedure does not use parameters. Instead, it should get a volume_context, which is missing. I do not know how to place that in a Nomad job, but the documentation suggests that there is a context as part of the parameters.
This might explain my problems. I have yet to test it; I'll let you know if it works for me.
I have lost many hours trying to debug this myself; the documentation is not very good at all, and Nomad being less popular than Kubernetes definitely does not help.
Here was my issue: I had the same controller/node jobs as yourself, and a very similar volume trying to mount an existing CephFS volume (key word: existing). First of all, unlike the RBD side of Ceph, the volume parameters need to be provided via context instead of parameters, as @nixpanic was so kind to point out. Frustratingly, Ceph has documentation for deploying RBD in Nomad, but nothing like it for CephFS. Nomad has official examples as well, but nothing about CephFS.
After trying the context suggestion, I got this new error: E0722 21:02:50.191707 1 utils.go:270] ID: 506 Req-ID: test-cephfs GRPC error: rpc error: code = Internal desc = rpc error: code = Internal desc = missing required field provisionVolume, which was beyond confusing, as nowhere on the internet could I find a good match for it. If you look deep enough, however, you can find that the static-pvc docs mention staticVolume, and looking at previous PRs you can see that provisionVolume was actually removed in favor of staticVolume as part of https://github.com/ceph/ceph-csi/pull/390/files#diff-ac32bb87c315551d410bb3d2be14eefb4f84953c90f4a92e70ccf4657ae9d7c3R298-R301. It would be nice if this was documented better somewhere, or perhaps the error message updated.
Anyway... if you have an already created volume, you need to add staticVolume = true and a rootPath = "/your/path" along with the monitors as part of your context. Here is an example of a working volume definition; hope it helps someone out there.
id = "test-cephfs"
name = "test-cephfs"
type = "csi"
plugin_id = "ceph-csi-cephfs"
capacity_max = "10G"
capacity_min = "1G"
capability {
access_mode = "multi-node-multi-writer"
attachment_mode = "file-system"
}
context {
clusterID = "c3ae25e7-45c6-4acb-8d45-06c71bcb5c9f"
fsName = "cephfs_test_volume"
monitors = "ip_addr:6789,ip_addr:6789,ip_addr:6789"
staticVolume = true
rootPath = "/test"
}
secrets {
userID = "user"
userKey = "user_key
}
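For completeness, a usage sketch (not from the original thread; the job, group, task, and mount-path names are illustrative): once the volume above is registered with nomad volume register, a job can mount it via volume and volume_mount blocks, mirroring the test job earlier in this issue:
# Illustrative consumer job - assumes the "test-cephfs" volume above has
# already been registered with `nomad volume register`.
job "cephfs-consumer" {
  datacenters = ["dc1"]

  group "app" {
    volume "test-cephfs" {
      type            = "csi"
      source          = "test-cephfs"
      read_only       = false
      attachment_mode = "file-system"
      access_mode     = "multi-node-multi-writer"
    }

    task "app" {
      driver = "docker"

      config {
        image   = "alpine"
        command = "sh"
        args    = ["-c", "ls /mnt/cephfs && sleep 3600"]
      }

      volume_mount {
        volume      = "test-cephfs"
        destination = "/mnt/cephfs"
        read_only   = false
      }
    }
  }
}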
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
Closing the issue as the cause was misconfiguration.
@iPraveenParihar the issue still exists, as the system should not fail without a clear error in such scenarios. Would a new issue help?