ceph-nvmeof
Host-specific discovery service responses
This depends on the basic NVMe-oF discovery feature (#63). This feature is part of the requirement in https://github.com/ceph/ceph-nvmeof/issues/115
We'd eventually like the DS to allow hosts to authenticate to/with it (#67), which enables the DS to know the identity of each connected host.
Even with an unauthenticated host ID (so, possibly before #67) it would be useful if the DS could advertise only the ports each host needs for the namespaces it will access. While a discovery service doesn't advertise namespaces, it can limit the number of NVM subsystems a host must contact to find the namespaces accessible to it.
So if a Ceph cluster has two gateway groups (one group per OSD pool, one gateway group per group of hosts, etc.) and a host is configured to use a namespace in one of those, it's not helpful for that host to connect to the other gateway group only to find there aren't any namespaces (for that host) there.
This feature enables the DS to determine which namespaces a host is associated with, and which NVMe subsystems those namespaces are in. The discovery log page returned to that host should contain only the ports of the subsystems that contain the namespaces associated with that specific host.
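As a rough illustration (a Python sketch only, with an invented OMAP key layout; the real ceph-nvmeof state format may differ), the per-host view the DS needs is just a map from host NQN to the subsystems whose namespaces that host may access:

```python
from collections import defaultdict

def build_host_subsystem_map(omap_entries: dict) -> dict:
    """Map each host NQN to the subsystem NQNs whose namespaces it may access.

    The key layout assumed here ("host_<subsystem-nqn>_<host-nqn>") is invented
    for illustration; it also assumes NQNs contain no underscores.
    """
    host_to_subsystems = defaultdict(set)
    for key in omap_entries:
        if key.startswith("host_"):
            _, subsystem_nqn, host_nqn = key.split("_", 2)
            host_to_subsystems[host_nqn].add(subsystem_nqn)
    return dict(host_to_subsystems)
```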
The assumption is that the DS can see the gateway config in OMAP, which has to specify all the namespaces the gateway should have. Since it's a goal that the gateway
"Even with an unauthenticated host ID (so, possibly before https://github.com/ceph/ceph-nvmeof/issues/67) it would be useful if the DS could advertise only the ports each host needs for the namespaces it will access. While a discovery service doesn't advertise namespaces, it can limit the number of NVM subsystems a host must contact to find the namespaces accessible to it?"
I don't quite understand; can't the current SPDK already meet this requirement? Can you show an example?
So if a Ceph cluster has two gateway groups (one group per OSD pool, one gateway group per group of hosts, etc.) and a host is configured to use a namespace in one of those, it's not helpful for that host to connect to the other gateway group only to find there aren't any namespaces (for that host) there.
Can you explain more? How could the gateway group be involved?
The assumption is that the DS can see the gateway config in OMAP
This is a question about the DS data source, which is key to the design of the DS, and we should determine whether this assumption holds. If it doesn't, where does the data for the DS come from?
"Even with an unauthenticated host ID (so, possibly before #67) it would be useful if the DS could advertise only the ports each host needs for the namespaces it will access. While a discovery service doesn't advertise namespaces, it can limit the number of NVM subsystems a host must contact to find the namespaces accessible to it?"
I don't quite understand; can't the current SPDK already meet this requirement? Can you show an example?
We'd like to enable the cluster admin to use as few as one NVMe subsystem and one gateway node (or HA pair of gateways in a gateway group) for all the RBD images in a Ceph cluster accessible via NVMe-oF (5k-10k namespaces is the goal). In this case, all hosts will connect to the ports of the same subsystem, and the cluster DS will return the same set of ports to all hosts.
For several reasons it might make sense for that admin to split a large number of namespaces across several subsystems and several gateway nodes. You might do this if a single gateway node (or pair) couldn't handle all the traffic for all the namespaces. You might also do this to limit the number of hosts impacted by the failure of a gateway node.
Now the namespaces for some hosts will be in subsystem A, some will be in subsystem B, etc.
In this case the DS could just return all the ports of all the subsystems to all the hosts that inquire. If subsystems A and B each have two ports, the discovery log page would have four entries. If the DS does this, each host will have to (attempt to) connect to each of those subsystems to find the one that (will allow it to connect and) has its namespace.
It would be better if the DS only returned to hosts the subsystem ports of the subsystem with the namespace the host can attach to. Hosts using namespaces in subsystem A should only get the ports for subsystem A, and not waste time trying to connect to subsystem B.
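For example (a sketch only; the port records and NQNs below are made up), the filtering itself is just a matter of keeping the entries whose subsystem appears in that host's map:

```python
def discovery_entries_for_host(host_nqn, host_to_subsystems, ports):
    """Return only the (subsystem NQN, traddr, trsvcid) records this host needs."""
    allowed = host_to_subsystems.get(host_nqn, set())
    return [p for p in ports if p[0] in allowed]

# Subsystems A and B each listen on two ports; a host whose namespace is in A
# gets two discovery log page entries back instead of four.
ports = [
    ("nqn.2016-06.io.spdk:A", "192.168.0.10", "4420"),
    ("nqn.2016-06.io.spdk:A", "192.168.0.11", "4420"),
    ("nqn.2016-06.io.spdk:B", "192.168.0.12", "4420"),
    ("nqn.2016-06.io.spdk:B", "192.168.0.13", "4420"),
]
host_map = {"nqn.2014-08.org.nvmexpress:uuid:host-1": {"nqn.2016-06.io.spdk:A"}}
print(discovery_entries_for_host("nqn.2014-08.org.nvmexpress:uuid:host-1", host_map, ports))
```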
When namespace access control (#70) is accomplished by constructing a subsystem per host, there may be thousands of subsystems. If the DS returns all of the several thousand subsystem ports to each host that inquires, every host may have to make thousands of attempts to connect to subsystem ports before it finds one that enables it to attach to a namespace. This will greatly delay booting.
@sdpeters Your reasons make sense. Another question: since we chose namespace masking, why do we still need "Host-specific discovery service responses"? I mean, with https://github.com/PepperJo/spdk/commit/094a9d70f5a8f32313d06b62c0d0f9d3e6ba74ca we can add all namespaces into one subsystem for the discovery service, using that patch and the libspdk_nvmf lib. Are there other reasons why this feature is still needed?
We want to enable the cluster admin to use just one subsystem, but we can't require them to do that. There are several reasons why a cluster admin might want or need to divide the NVMe-oF accessible namespaces into multiple subsystems. When they do, it makes sense to hide from each host the subsystem ports that will be useless to it.
When ADNN is used in clusters with several OSD pools, there may be some subsystem ports on OSD nodes that don't contain any of the objects in a host's namespace (because the OSD node has no OSDs in the pool containing the host's image). Those ports are useless to that host, and it would be better if the DS didn't list them in its response to that host. The DS has all the information it needs to determine this.
So even with namespace masking, host-specific discovery responses are helpful. If namespace masking can't be upstreamed, and we can only accomplish namespace access control by creating one subsystem per host, host-specific discovery responses will be essential.
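As a sketch of the ADNN case (everything here is hypothetical; the point is only that the DS can apply a second, pool-aware filter on top of the subsystem filter):

```python
def adnn_entries_for_host(host_pools, port_to_node, node_pools, entries):
    """Drop ports on OSD nodes holding no OSDs in any pool backing the host's images.

    host_pools:    pools containing this host's RBD images
    port_to_node:  (traddr, trsvcid) -> OSD node name
    node_pools:    OSD node name -> pools that node has OSDs in
    All three lookup tables are assumed to be derivable from the cluster map and
    the gateway config; none of them is defined in this issue.
    """
    kept = []
    for subnqn, traddr, trsvcid in entries:
        node = port_to_node.get((traddr, trsvcid))
        if node and node_pools.get(node, set()) & host_pools:
            kept.append((subnqn, traddr, trsvcid))
    return kept
```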
@sdpeters, I agree with the reasoning for this feature.
About how to implement the host-specific discovery response: in your email you said to "register a user-defined nvmf_generate_discovery_log function".
Could we instead add more filter flags to spdk_nvmf_tgt_discovery_filter (https://github.com/spdk/spdk/blob/master/include/spdk/nvmf.h#L45)?
I'd like to use libnvmf as much as possible instead of developing an in-house nvmf implementation.
"Could we instead add more filter flags to spdk_nvmf_tgt_discovery_filter (https://github.com/spdk/spdk/blob/master/include/spdk/nvmf.h#L45)?"
That feature only allows the existing nvmf_generate_discovery_log() to exclude some of the listeners in the local SPDK app from the discovery log page. We need the discovery log page in the Ceph gateway DS to include ports on other gateway nodes in the cluster (not just those in the same app as the DS).
Even if we used @PepperJo's discovery-only patch to construct "discovery only" SPDK listeners (listeners with non-local IP addresses, etc. in an SPDK app that will be listed in the discovery log page but won't actually listen for connections), the SPDK discovery filter won't allow us to return different sets of ports to different hosts based on what subsystem their namespace is in, or what OSD pool that RBD image is stored in.
All those things are possible if we can customize the existing nvmf_generate_discovery_log().
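To make that concrete, here's a rough sketch (Python, with illustrative constants and values; the field offsets follow the NVMe-oF spec's discovery log page entry layout) of the kind of entry a customized generator could pack for any port we choose to show a host, local or on another gateway node:

```python
import struct

NVMF_TRTYPE_TCP = 3    # transport type: TCP
NVMF_ADRFAM_IPV4 = 1   # address family: IPv4
NVMF_SUBTYPE_NVME = 2  # entry describes an NVM subsystem (not a discovery referral)

def pack_discovery_entry(subnqn: str, traddr: str, trsvcid: str, portid: int) -> bytes:
    """Pack one 1024-byte discovery log page entry."""
    entry = bytearray(1024)
    entry[0] = NVMF_TRTYPE_TCP
    entry[1] = NVMF_ADRFAM_IPV4
    entry[2] = NVMF_SUBTYPE_NVME
    struct.pack_into("<H", entry, 4, portid)  # PORTID
    struct.pack_into("<H", entry, 6, 0xFFFF)  # CNTLID: dynamic controller model
    struct.pack_into("<H", entry, 8, 128)     # ASQSZ: admin submission queue size
    entry[32:32 + len(trsvcid)] = trsvcid.encode("ascii")  # TRSVCID
    entry[256:256 + len(subnqn)] = subnqn.encode("ascii")  # SUBNQN
    entry[512:512 + len(traddr)] = traddr.encode("ascii")  # TRADDR
    return bytes(entry)

# The per-host log page is then just the header (GENCTR, NUMREC) followed by
# the entries the DS selected for that host, wherever those ports live.
```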