
DAOS-16473 Support QLC NVMe SSD in control plane and data plane

junzeng1 opened this issue 1 year ago • 12 comments

These are the changes for the feature of using QLC NVMe SSDs to store bulk data. It mainly includes three points:

  1. Support configuring a QLC NVMe tier with the newly added bulk_data role in both PMem and MD-on-SSD modes, which extends the maximum tier count to 2 or 4 for each mode respectively.
  2. Support creating a pool with a qlc-size parameter that sets the size of the QLC NVMe SSD in manual mode, with no impact on existing usage (qlc-size defaults to 0 automatically).
  3. Display QLC NVMe SSDs and non-QLC NVMe SSDs separately in usage and pool-related output.

Currently, the QLC NVMe tier is identified through the role configured in the yml config file. With the commits in this PR, you can define a QLC NVMe tier with the bulk_data role in the yml config file and then start daos_server to deploy a DAOS storage system that uses QLC NVMe SSDs for bulk data. Everything still works without a QLC NVMe tier configured (no bulk_data role in the yml config file), and pools can also be created without QLC NVMe media even when a QLC NVMe tier is configured.
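For illustration, a minimal configuration sketch along the lines described above. The tier layout, PCI addresses, and mount paths are placeholders, and the spelling of the bulk_data role and the qlc-size flag is taken from this description rather than from released DAOS, so treat this as a sketch of the proposed usage, not a reference:

```yaml
# Excerpt of a hypothetical daos_server yml storage section (PMem mode):
engines:
  - targets: 16
    storage:
      - class: dcpm                  # tier 0: SCM (PMem on the memory bus)
        scm_mount: /mnt/daos0
        scm_list: [/dev/pmem0]
      - class: nvme                  # tier 1: regular NVMe data tier
        bdev_list: ["0000:81:00.0"]
      - class: nvme                  # tier 2: QLC capacity tier for bulk data
        bdev_list: ["0000:82:00.0"]
        bdev_roles: [bulk_data]      # new role proposed by this PR
```

Pool creation in manual mode would then look something like the following; per the description, omitting the new flag behaves as before (qlc-size is set to 0):

```sh
dmg pool create mypool --scm-size 100G --nvme-size 1T --qlc-size 4T
```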

Required-githooks: true

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

junzeng1 avatar Aug 30 '24 05:08 junzeng1

Also, do we really want to add QLC support details for PMem mode? If the additions are only going to be relevant to MD-on-SSD then the changes should be specifically switched based on mode and this PR should be targeted at the feature/vos_on_blob_p2 branch. @gnailzenh @NiuYawei thoughts?

tanabarr avatar Sep 03 '24 21:09 tanabarr

Before we continue reviewing the control-plane side: @NiuYawei, does it make sense to add an explicit extra tier to the "SCM,NVME" list in the pool query/create schema? Also, does the additional "bulk_data" role make sense as applied in this PR? The MD-on-SSD branch will have conflicts with this change, including updates I am working on for phase 2. TIA

We plan to make QLC a capacity NVMe tier for bulk data only; it should support both PMem and MD-on-SSD modes. See the design at https://daosio.atlassian.net/wiki/spaces/DC/pages/11768692750/QLC+SSD+Support

NiuYawei avatar Sep 04 '24 02:09 NiuYawei

Also, do we really want to add QLC support details for PMem mode? If the additions are only going to be relevant to MD-on-SSD then the changes should be specifically switched based on mode and this PR should be targeted at the feature/vos_on_blob_p2 branch. @gnailzenh @NiuYawei thoughts?

Yes, I think both PMem and MD-on-SSD modes should be able to support a QLC capacity tier, but I agree with you that it might be more convenient to do the QLC support development on a dedicated feature branch. @gnailzenh , what do you think?

NiuYawei avatar Sep 04 '24 02:09 NiuYawei

One thing I find myself wondering generally is whether we should present the QLC tier as "QLC" explicitly to the user, as shown here, or whether we should think about a different way to present it. SCM/NVMe tiers are based on which bus the storage is sitting on, with SCM sitting on the memory bus, and NVMe sitting on the PCIe bus. The devices in question may be using any media. Maybe it should simply be considered a second NVMe tier.

@tanabarr has done a lot of work in the MD-on-SSD implementation for the control plane, so I have added him explicitly for review. @mjmac may have some suggestions about how to handle how we present this to users as well.

I agree that we should consider it as a second NVMe tier rather than explicitly "QLC". In PMem mode display as "SCM/NVMe 1/NVMe 2" and in md-on-ssd maybe "Meta/Data/Bulk Data"?

tanabarr avatar Sep 04 '24 10:09 tanabarr

Also, do we really want to add QLC support details for PMem mode? If the additions are only going to be relevant to MD-on-SSD then the changes should be specifically switched based on mode and this PR should be targeted at the feature/vos_on_blob_p2 branch.

@tanabarr I agree with this. Unless we have a customer asking for support for this new feature in pmem mode, I think this should be targeted to MD-on-SSD only. Pmem is essentially a legacy feature at this point. New customers aren't likely to be using it, and IMO we should avoid changes to it unless a customer is requesting it.

I agree that we should consider it as a second NVMe tier rather than explicitly "QLC". In PMem mode display as "SCM/NVMe 1/NVMe 2" and in md-on-ssd maybe "Meta/Data/Bulk Data"?

If we must allow it in pmem mode, maybe "SCM/NVMe/NVMe Bulk"? The names for MD on SSD sound good.

@NiuYawei @gnailzenh In the future it would be good to give Control Plane team members a chance to review design docs for a change like this.

kjacque avatar Sep 04 '24 16:09 kjacque

These are the changes for the feature of using QLC NVMe SSDs to store bulk data. It mainly includes three points:

  1. Support configuring a QLC NVMe tier with the newly added bulk_data role in both PMem and MD-on-SSD modes, which extends the maximum tier count to 2 or 4 for each mode respectively.
  2. Support creating a pool with a qlc-size parameter that sets the size of the QLC NVMe SSD in manual mode, with no impact on existing usage.
  3. Display QLC NVMe SSDs and non-QLC NVMe SSDs separately in usage and pool-related output.

Currently, the QLC NVMe tier is identified through the role configured in the yml config file. With the commits in this PR, you can define a QLC NVMe tier with the bulk_data role in the yml config file and then start daos_server to deploy a DAOS storage system that uses QLC NVMe SSDs for bulk data. Currently there are a few restrictions on this feature:

  1. The meta blob header only covers new deployments for now; existing pools created with an older version are not yet supported.
  2. The RPC interface is not yet compatible with older client versions, because DAOS_MEDIA_MAX changed when QLC NVMe was added as a new media type.

Follow-up commits will address both restrictions to restore compatibility with older versions.
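To illustrate the second restriction, here is a paraphrased sketch of the media enum and the pool-space struct from the public DAOS headers; the DAOS_MEDIA_BULK enumerator name is hypothetical. Any on-wire structure sized by DAOS_MEDIA_MAX changes layout once a new media type is inserted, so a client built against the old headers decodes pool space information with the wrong array length:

```c
#include <stdint.h>

/* Paraphrased from daos_pool.h; DAOS_MEDIA_BULK is a hypothetical name. */
typedef enum {
	DAOS_MEDIA_SCM = 0,
	DAOS_MEDIA_NVME,
	DAOS_MEDIA_BULK,	/* new media type for the QLC capacity tier */
	DAOS_MEDIA_MAX		/* grows from 2 to 3 */
} daos_media_type_t;

struct daos_space {
	/* Old clients pack/unpack these arrays with DAOS_MEDIA_MAX == 2,
	 * so replies from a server built with the new enum mis-decode. */
	uint64_t s_total[DAOS_MEDIA_MAX];
	uint64_t s_free[DAOS_MEDIA_MAX];
};
```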

Required-githooks: true

Hello everyone, the data-plane implementation has been added to this PR. Please help review and give suggestions, thanks.

junzeng1 avatar Sep 19 '24 06:09 junzeng1

Please don't rebase and force-push once your work has begun to be reviewed, or if you really need to (such as for changing base branch), please don't fixup or squash commits. It makes it very hard to tell the changes between versions, especially if old commits are modified.

Instead of rebasing, you can merge the base branch into your PR branch to pick up changes and resolve any conflicts. Commits are cleaned up by the gatekeepers when the PR lands.
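For example, with plain git (nothing project-specific here):

```sh
# Pick up the latest base branch without rewriting the PR's history.
git fetch origin
git merge origin/master   # conflicts are resolved in a new merge commit

# Address review comments as new commits instead of amending old ones.
git commit -am "Address review comments"
git push                  # no --force needed; reviewers see only the delta
```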

OK, I will do as you suggest here in the future; thanks for the suggestion. Yesterday I rebased the dev branch onto master; I thought it might be helpful for the reviewers. Maybe that was the wrong instinct; sorry for any inconvenience it caused. But as you can see, there are only two commits in this PR, and most of the changes in them are independent of the commits inherited from master during the rebase, so it should still be possible to figure out the changes. Thanks.

junzeng1 avatar Sep 24 '24 00:09 junzeng1

OK, I will do as you suggest here in the future; thanks for the suggestion. Yesterday I rebased the dev branch onto master; I thought it might be helpful for the reviewers. Maybe that was the wrong instinct; sorry for any inconvenience it caused. But as you can see, there are only two commits in this PR, and most of the changes in them are independent of the commits inherited from master during the rebase, so it should still be possible to figure out the changes. Thanks.

To be clear, the main concern is that if you make small changes to address comments, and squash/fixup to make them part of your original commits, then we can't easily see the updates. It's unclear to me if that happened here. The Github interface marked some previously-reviewed files as modified, but it's hard to tell if they were meaningful changes, or just a result of rebasing. It's really a problem with how Github displays results from rebasing/force-pushing, but alas, we are all stuck working around it. Thanks.

kjacque avatar Sep 26 '24 00:09 kjacque

I think both PMem and MD-on-SSD modes should be able to support a QLC capacity tier.

@NiuYawei please reconsider whether this is a valuable investment right now. Implementing it in PMem mode requires all of the following: coming up with a way to fit it together (which will take time for many participants of this discussion), development, writing tests, review, expanding the DAOS test infrastructure (already insufficient to cover our needs, @brianjmurrell, @JohnMalmberg), and growing the DAOS test matrix (already too big to return a result in a reasonable time; @ryon-jensen).

I hope we have real customer needs to back up this investment. We are already stretched pretty thin.


@NiuYawei @gnailzenh In the future it would be good to give Control Plane team members a chance to review design docs for a change like this.

@kjacque Amen to this.


If the additions are only going to be relevant to MD-on-SSD then the changes should be specifically switched based on mode and this PR should be targeted at the feature/vos_on_blob_p2 branch.

The MD-on-SSD branch will have conflicts with this change, including updates I am working on for phase 2.

@tanabarr From what I know, the development in the MD-on-SSD mode is far more rapid than what is going on in the PMem mode. So, I would argue that this change ought to target the feature/vos_on_blob_p2 branch regardless. Otherwise, merging the feature/vos_on_blob_p2 branch to the master will be quite painful, I would expect.


One thing I find myself wondering generally is whether we should present the QLC tier as "QLC" explicitly to the user

I agree that we should consider it as a second NVMe tier rather than explicitly "QLC". In PMem mode display as "SCM/NVMe 1/NVMe 2" and in md-on-ssd maybe "Meta/Data/Bulk Data"?

If we must allow it in pmem mode, maybe "SCM/NVMe/NVMe Bulk"? The names for MD on SSD sound good.

@junzeng1 IMHO the word "QLC" should not appear in the code base under any circumstances. Otherwise we tie the DAOS naming convention to a device characteristic and a technology that may change over time. The shortcomings of QLC technology may be solved in the future, so we might want to use QLC for a different tier as well; or octa-level cells will be invented and we will want to use them in the same place as QLC today. Either way, it will be awkward. Please don't do it.

My vote is for "SCM/NVMe/NVMe Bulk" (if we end up adding it to the PMem mode) and "Meta/Data/Bulk Data".


This review is massive, and since it is already split into two commits, can we review it in two steps as well? Meaning two separate pull requests:

  1. Control plane.
  2. Data plane.

It will be easier both for @junzeng1, who won't need to apply renaming and clean-ups to both commits at the same time (and we can reasonably expect some back and forth), and for the gatekeepers, who won't need to pick by hand which fix applies to which of the two commits before landing the whole change.

Does it make sense?

janekmi avatar Oct 11 '24 10:10 janekmi

I have added commits to support existing pools created by older software versions that do not support QLC NVMe, and to be compatible with the client RPC interface compiled from older software versions that do not support QLC NVMe.

junzeng1 avatar Oct 12 '24 05:10 junzeng1

I think both PMem and MD-on-SSD modes should be able to support a QLC capacity tier.

@NiuYawei please reconsider whether this is a valuable investment right now. Implementing it in PMem mode requires all of the following: coming up with a way to fit it together (which will take time for many participants of this discussion), development, writing tests, review, expanding the DAOS test infrastructure (already insufficient to cover our needs, @brianjmurrell, @JohnMalmberg), and growing the DAOS test matrix (already too big to return a result in a reasonable time; @ryon-jensen).

I hope we have real customer needs to back up this investment. We are already stretched pretty thin.

Yes, I agree that's an issue. @gnailzenh any thoughts?

@NiuYawei @gnailzenh In the future it would be good to give Control Plane team members a chance to review design docs for a change like this.

@kjacque Amen to this.

Sure.

NiuYawei avatar Oct 21 '24 13:10 NiuYawei

My vote is for "SCM/NVMe/NVMe Bulk" (if we end up adding it to the PMem mode) and "Meta/Data/Bulk Data".

This review is massive, and since it is already split into two commits, can we review it in two steps as well? Meaning two separate pull requests:

  1. Control plane.
  2. Data plane.

Does it make sense?

Yes, I will try to split this PR into several small PRs and get them reviewed and merged gradually. As for the tier name, I will change it per the discussion here.

junzeng1 avatar Oct 22 '24 00:10 junzeng1