User Interface Design for Object Storage Pools
This issue has been opened to provide a venue for discussion and feedback on the proposed CLI design for interacting with and creating pools using the upcoming Object Storage feature.
Introduction and Context
For those who have not yet been introduced to the concept, here is some background on the project.
Modern cloud services provide access to Object Storage APIs. These APIs offer low storage costs and high throughput compared to more traditional device-based storage options in the cloud, with the drawback that they tend to have higher latency and charge per request. Users who wish to store extremely large amounts of data with low-frequency access patterns may wish to use object storage instead of traditional device-based storage. However, ZFS currently provides no way to interface with these objstores, as it assumes a traditional disk will back its storage. Naively using a single object to provide a "disk" for use with ZFS would result in extremely poor performance and high costs.
The proposed project would instead add support for a new type of vdev. This object storage vdev uses a special storage format, reducing costs and increasing performance. A new caching infrastructure is also being developed, which allows the bulk of reads to be served without accessing the objstore, again reducing costs and increasing performance. This caching infrastructure can be viewed as an alternative to the L2ARC with improved scalability.
New Parameters
In order to interface with the cloud-provided object storage, a few pieces of information are required. First, ZFS needs to know what endpoints to use to reach the objstore. Second, most objstores have a concept of "buckets", which are collections of objects that are managed and access controlled together. ZFS needs to know which bucket to store its data in. Third, credentials are required to access the objstore. Some objstores also have a concept of "regions", which represent different geographical locales one might store data in. These regions are often also reflected in the endpoints, but not always.
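For concreteness, here is how those four pieces map onto the AWS S3 values used in the examples below (the shell variable names are purely illustrative and not part of the proposal):
ENDPOINT="https://s3-us-west-2.amazonaws.com"    # endpoint used to reach the objstore
BUCKET="bucketname"                              # bucket that will hold the pool's objects
CREDENTIALS="file:///etc/zpool_credentials"      # where to retrieve access credentials
REGION="us-west-2"                               # geographical region (often implied by the endpoint)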
User Interface Design
The only part of the existing UI that needs to change to support the feature is the zpool command. A new command is also added to launch the ZFS Object Agent, which is the component responsible for communicating with the objstore on the kernel's behalf.
zpool create
When creating a new object storage pool, the objstore-related parameters need to be specified. In addition, the pool layout must be specified in such a way that ZFS knows which vdevs are objstore vdevs and which are not. A sample command looks something like the following:
zpool create -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname s3 bucketname log sdd
This would create a new pool named poolname, using AWS S3 as the objstore. It would retrieve the credentials from the file at /etc/zpool_credentials (other options besides a file:// URI include env and prompt, for retrieving credentials from environment variables or prompting the user for them). The region would be us-west-2 and the endpoint would be https://s3-us-west-2.amazonaws.com. It would also add a log device using a locally attached disk, sdd.
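As a hedged illustration (the exact spelling of the non-file credential sources isn't pinned down here), the same pool could plausibly be created with credentials taken from environment variables instead of a file:
zpool create -o object-credentials-location="env" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname s3 bucketname log sdd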
For AWS, the region could be derived from the endpoint; however, this would require cloud-specific logic beyond their APIs, and may not be reliable.
The endpoint, region, and credential location (but not credential value, which must be retrieved at each import) will be stored in the pool config. The bucket name is a vdev property.
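Assuming these end up exposed as ordinary pool properties (an assumption on my part, not something stated above), they could presumably be inspected after creation with something like:
zpool get object-endpoint,object-region,object-credentials-location poolname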
zpool import
Cachefile import
zpool import -c cachefile poolname will import a pool that is located in the cachefile, using the pool properties stored in the cachefile.
zpool import -d bucketname -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" scans for pools in the bucketname bucket. Multiple bucket names can be specified with multiple -d flags.
zpool import -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" scans for pools in all buckets accessible by the given credentials.
If a pool name or guid is provided to either of the previous commands, the pool with the given name will be imported.
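Putting those pieces together, importing a specific pool discovered in a given bucket would look something like the following (same illustrative values as above):
zpool import -d bucketname -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname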
zpool *
All other pool operations require no additional parameters or changes to their usage. One note is that in the current design, if an objstore vdev is in use, no other top-level vdevs are allowed. Additionally, objstore vdevs cannot (currently) be mirrored, though that might be possible to extend in the future.
zfs_object_agent
Currently, the userland agent that communicates with the cloud provider on the kernel's behalf requires few parameters; the -v flag can be provided repeatedly for increased logging verbosity, and it takes an optional argument that specifies the directory it should use to store its sockets. It also accepts the AWS_PREFIX environment variable, which is intended for use in testing and causes all objects to be written to and read from a path with that prefix prepended.
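As a rough illustration only (the socket directory is described above merely as "an optional argument", so its placement here is an assumption, and the path and prefix are made up):
AWS_PREFIX="test-run/" zfs_object_agent -vv /run/zfs_object_agent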
Conclusion
This is the current proposed UI design for object storage pools in ZFS. If people have questions or comments about the proposed UI design, feel free to bring them up here. If you have questions about the project more generally, please reach out to me via email ([email protected]), on the OpenZFS Slack (@pcd), or on the openzfs IRC channel (currently #openzfs on Freenode, in the process of moving to #openzfs on Libera; I am pcd in both).
For slightly more background info, I talked about this project briefly on the April video call, and again on the June video call.
creating pools using the upcoming Object Storage feature.
Where can I find more information on this "Object Storage feature"?
I'm interested in knowing more about how this would be implemented. There's a few technical challenges that I'm curious about:
- Most S3-like object pools do not provide any concept of a transaction. How do you make it such that a set of mutations to multiple objects can be atomically committed or not? In the case of a block device, it seems (I'm still learning about the internals of ZFS) that ZFS writes to unused blocks. If the uberblock has not yet been updated, then the intermediate block updates are functionally invisible in the event of a sudden crash. However, for object storage, the intermediate objects are already stored into the pool. If ZFS crashes, then those intermediate objects stay allocated, but the system (upon booting back up) may not know about their existence.
- Also, what type of consistency would be expected for the S3-like storage? It wasn't until relatively recently that S3 provided strong read-after-write consistency. Many other S3-like storage options may not necessarily provide the same consistency guarantees.
- The article from Amazon seems to describe read-after-write (RAW) consistency for the same object key. Presumably, creating a zpool on top of S3 would require some form of RAW consistency guarantee across disjoint object keys.
- Whatever consistency guarantees are expected, it might be worth exploring having zpool create do a small unit test where it does a sanity check to ensure that the S3-like store actually provides the consistency guarantees that the rest of the implementation depends on (see the sketch after this list).
- "zpool * ... All other pool operations require no additional parameters or changes to their usage." I assume zpool scrub will fully scrub the actual data of an S3-like bucket. Assuming we can trust that the S3-like service will provide the guarantee of reliable data storage, I suspect there's a use-case for scrubbing just the metadata to make sure it is consistent.
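To make the sanity-check idea above concrete, here is a minimal sketch of the kind of probe zpool create could run, expressed with the stock AWS CLI (the bucket name, key, and pass/fail handling are all illustrative; a real implementation would live inside ZFS rather than shelling out):
# write a fresh object, immediately read it back, and compare
echo "consistency-probe-$(date +%s)" > /tmp/probe
aws s3api put-object --bucket bucketname --key zfs-consistency-probe --body /tmp/probe
aws s3api get-object --bucket bucketname --key zfs-consistency-probe /tmp/probe.out
cmp -s /tmp/probe /tmp/probe.out && echo "read-after-write OK" || echo "stale or missing read"
aws s3api delete-object --bucket bucketname --key zfs-consistency-probe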
Overall the proposed user interface sounds great. It doesn't seem to overly assume AWS S3 as the backing store, and I would recommend trying to make the implementation agnostic towards any specific S3-like implementation and avoid using any specialized AWS S3 features.
@dsnet
Where can I find more information on this "Object Storage feature"?
All the available information is in the issue, and in the development branch.
How do you make it such that a set of mutations to multiple objects can be atomically committed or not?
Roughly similar to block-based ZFS: there's an object for each of the last few TXGs, which points to other objects that contain more metadata, which point to other objects that contain metadata... which forms a tree.
for an object storage, the intermediate objects are already stored into the pool. If ZFS crashes, then those intermediate objects stay allocated, but the system (upon booting back up), may not know about their existence
Correct. Upon starting back up, we'll find these objects and remove them. They have keys that are lexicographically after the last valid objects, so we can use the ListObjects API to find them quickly.
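For illustration only (not the agent's actual code), the S3 ListObjectsV2 API supports exactly this kind of lexicographic scan; with a hypothetical last-valid key of pool/data/000123, the leftover objects could be listed with:
aws s3api list-objects-v2 --bucket bucketname --start-after "pool/data/000123"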
what type of consistency would be expected for the S3-like storage
The design doesn't rely on read-after-write consistency; eventual consistency is fine. This is straightforward because we only overwrite objects with contents that are logically identical (and self-describing). E.g. overwriting data objects to remove freed blocks.
it might be worth exploring having zpool create do a small unit test where it does a sanity check to ensure that the S3-like store actually provides the consistency guarantees that the rest of the implementation depends on
That's a great idea. It would also be useful to be able to test object storage that doesn't have the strong consistency of S3. Do you know of any S3-protocol object storage services (or better, software we can run in-house for testing) that implements weaker consistency?
I assume zpool scrub will fully scrub the actual data of an S3-like bucket.
Correct.
I suspect there's a use-case for scrubbing just the metadata to make sure it is consistent
I think that would be just as useful as it is with block-based pools. We aren't planning to add it as part of this work, but if you feel inspired, feel free to work on it. Note that you can approximate this behavior with the zfs_no_scrub_io module parameter (all metadata will be read, although not fully scrubbed, meaning that we only read one copy of the metadata, not all copies/parity).
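For example, on Linux the approximation described above could be tried like this (zfs_no_scrub_io is an existing module parameter; the pool name is a placeholder):
echo 1 > /sys/module/zfs/parameters/zfs_no_scrub_io    # skip data reads during scrub
zpool scrub poolname
echo 0 > /sys/module/zfs/parameters/zfs_no_scrub_io    # restore normal scrub behavior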
I would recommend trying to make the implementation agnostic towards any specific S3-like implementation and avoid using any specialized AWS S3 features
The current implementation should work with any S3-protocol (although we haven't tested it yet). The eventual goal of the project is for this to work with Azure blob storage as well. (That will probably come as a later PR.)
That's a great idea. It would also be useful to be able to test object storage that doesn't have the strong consistency of S3. Do you know of any S3-protocol object storage services (or better, software we can run in-house for testing) that implements weaker consistency?
S3 provides varying levels of consistency; its claims of strong read-after-write consistency only apply to a single caller, and there's software called s3backer that demonstrates how this consistency doesn't really hold in the context of ZFS.
If you're updating an object, there's eventual consistency. If you're creating a new object, it's replicated immediately so long as you don't call HEAD before you PUT it; if you call HEAD first, then it's all eventual consistency.
It should also be noted that AWS's cost model for S3 makes operations like zpool scrub prohibitively expensive. How does this proposal aim to reduce that burden?
The current implementation should work with any S3-protocol (although we haven't tested it yet). The eventual goal of the project is for this to work with Azure blob storage as well. (That will probably come as a later PR.)
They have very different requirements; please do the work before the S3 implementation is included to ensure that Azure Blob and Backblaze B2 will work with this as well.
Just piping up to mention Ceph object storage as another S3 compatible object store:
https://ceph.io/en/discover/technology/#object
We've been considering putting ZFS on a Ceph block store to complete our storage consolidation - having ZFS on Ceph object store would be excellent to remove a level of indirection.
I was wondering what the current state on this feature is? The development branch mentioned above seems to have been removed(?)
I was wondering what the current state on this feature is? The development branch mentioned above seems to have been removed(?)
Yep, it's concerning that there were no responses to the issues of eventual consistency or the cost to scrub the datasets. As someone who worked on this for half a decade, I think it's probable that they decided it wasn't worth the upfront dev investment or maintenance burden.
The design doesn't rely on read-after-write consistency; eventual consistency is fine. This is straightforward because we only overwrite objects with contents that are logically identical (and self-describing). E.g. overwriting data objects to remove freed blocks.
I don't see how overwriting an object removes the freed block; you must send a DELETE command to S3.
Unless plans have radically changed in the past month, this feature is still very much under active development. There were two great presentations about it at the OpenZFS Developer Summit on November 8th:
"ZFS on Object Storage" and "ZettaCache: fast access to slow storage"
This sounds terrific. I'm watching the dev summit talks right now.
We have a lot of backup zpools where ingress performance is important but egress is infrequent, some long-term log stores where we stash things that are unlikely to be retrieved, and also cluster nodes in distributed databases where we want at least one replica with stronger stability requirements than a typical ephemeral server.
I'm not sure how comfortable I'd be using these across the internet; how would ZFS handle occasional latency spikes? A mode like failmode=continue or even failmode=wait_patiently_it_will_eventually_come_back_i_promise might be needed.
mirrors
While mirrored vdevs are explicitly excluded in this initial release, they would be really useful.
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
Currently we zfs send the entire zpool through restic, which chunks, deduplicates, and syncs the data to cloud storage. This would remove a step and, most importantly, potentially make recovering from complete server loss significantly simpler:
- boot new replacement system
- import from cloud vdev
- add local vdevs back into mirror
- wait patiently for sufficient resilvering before rebooting
In my case, these zpools are all around 100-200 GiB in size, not huge.
migration
Is it planned in future to be able to migrate a pool between cloud vendors? Or would this need to be done via a full zfs send | recv instead?
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
Because it stores data differently, I don't think it can ever be mirrored. I could be wrong.
@dch
We have a lot of backup zpools
Object-based pools could be a good fit for this workload.
I'm not sure how comfortable I'd be using these across the internet, how would zfs handle occasional latency spikes? A mode like failmode=continue or even failmode=wait_patiently_it_will_eventually_come_back_i_promise might be needed.
Occasional latency spikes should be handled fine, up to the point that txgs get backed up (they can buffer up to 4GB of dirty data by default). Outages of the object store (or if the agent process dies) are always handled patiently (i.e. the kernel just waits forever for it to come back). It might make sense to add a mode that fails reads after some (long) timeout.
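As a hedged aside, the 4GB dirty-data ceiling mentioned above most likely corresponds to the existing zfs_dirty_data_max tunable; on Linux it can be raised to give more headroom to absorb latency spikes, e.g.:
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max    # 8 GiB, illustrative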
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
The design doesn't currently handle that. We'll have to think more about how possible that would be.
is it planned in future to be able to migrate a pool between cloud vendors? Or would this need to be done via a full zfs send | recv instead?
If the cloud vendors both use the same object store protocol (e.g. S3 protocol, which AFAIK is used by almost everyone except Azure), then you should be able to use external tools to copy all the objects to the other cloud and then import the new (copied) pool. You could also use this method to copy/replicate a pool to a different region (datacenter) within the same cloud vendor.
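As an illustrative sketch of that external copy (rclone is just one such tool; "aws:" and "othercloud:" are hypothetical remotes the user has already configured, and the bucket names are placeholders):
rclone sync aws:source-bucket othercloud:dest-bucket    # copy every object to the destination provider
After the copy completes, the pool would be imported on the destination side as described in the zpool import section above.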
The "development branch" that @ahrens linked above is gone, as is the whole repository in fact. "Object storage" doesn't seem to have already been merged. So to ask again:
Where can I find more information on this "Object Storage feature"?
@infogulch We aren't quite ready to share the code yet. For now, the best place to learn more would be from the talks that we gave at the OpenZFS Developer Summit: Object Store (slides and video) and Zettacache (slides and video).
@ahrens ok thanks for the links, I found this thread after watching those and came here digging for more. Godspeed!
I too am very interested in testing this feature out.
A use-case I have:
I have a NAS at home, for which I have two 6TB drives (mirrored) to back up some important local data. ZFS has been instrumental in keeping down anxiety about my data silently corrupting, which I value highly.
However, I would of course not have good data protection policies without backing up off-site, so I back up using restic to a remote S3 bucket over a relatively slow (50 Mbit/s down, 10 Mbit/s up) connection; this is done with snapshots, overnight and over the week(s).
If it were possible for the S3 upload to be "async" (where data is stored locally, redundantly and quickly, but synced to S3 over a long period of time), then I would absolutely love this feature, as it means I could upload all the data I'd possibly need, have it served from the zettacache if it's recently relevant, and be able to have it "synced" to S3 after I've recently uploaded it to my NAS.
TL;DR: allow the S3 upload to be not strictly synchronous, but async over a very long window of time (persistent across crashes/reboots).
Quite excited for this feature!
One thing I think might be worth considering is support for some more traditional file sharing protocols. In my particular use case, we have a storage box that is accessible via FTP(S), SFTP/SCP, Samba and WebDAV that I would love to use with this feature.
As is, I could fuse mount the box using one of the supported protocols, and then use something like Minio to slap an S3-like API on top. That would probably work well enough, as performance isn't particularly important here, but it sounds kind of convoluted.
Bespoke support for some of those protocols would be nice (I think WebDAV might be quite close to the S3 API), but one way to get a bunch of them at once might be to support a local directory of files, which would work with anything you can mount.
One more thought: What would happen when importing the pool on multiple systems at once? I'd assume multiple writers wouldn't be supported, but could single-writer multiple-reader work by importing the pool read-only on all but one node?
@ShadowJonathan We aren't planning to support that use case (although you could have separate pools locally and on S3, and use send/recv to copy the data to the cloud). But it might be possible to design this as an extension on what we're doing.
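A minimal sketch of that separate-pools workaround, assuming "localpool" is the local pool and "cloudpool" is an object-store pool created as described in this proposal (names and snapshot label are placeholders):
zfs snapshot -r localpool@nightly                              # snapshot the local data
zfs send -R localpool@nightly | zfs recv -F cloudpool/backup   # replicate it to the object-store pool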
@ahti Storing an "object store" pool on other protocols is a neat idea. We're using Minio for testing, but adding a backend of one file per object as a replacement for the S3 protocol would fit into the design well.
What would happen when importing the pool on multiple systems at once? I'd assume multiple writers wouldn't be supported, but could single-writer multiple-reader work by importing the pool read-only on all but one nodes?
You can't import a pool writeable on multiple systems (and we have an always-on MMP-like mechanism to prevent this). You could import a pool readonly on multiple systems and, like with block-based pools, use zpool checkpoint to allow modifying one writeable import while the readonly imports (of the checkpoint) are guaranteed to continue working. I don't think that zpool checkpoint works with object store pools currently, but I think it would be a relatively straightforward extension.
Potentially relevant is the recently announced Neon.tech, which maps the object store API onto an arbitrary number of block devices by splitting reads and writes into pageserver and walserver services; see "Architecture decisions in Neon". Neon is currently targeting postgresql specifically, but the developers have indicated that it could be used for arbitrary block devices.
This could be useful directly as a backend for ZFS (like this proposal), or as a model for designing such a backend. In any case it may be interesting to compare. What makes me interested in a system that is implemented at the FS layer such as ZFS is that it would support many applications right away (tuning notwithstanding), including databases other than postgresql as well as raw file datasets.
Hello, I want to know what the current status of this feature is. At the '21 summit it was planned to be version 3.0 in '22, but I didn't hear anything about this feature at this year's Developer Summit.
Delphix's plans for open-sourcing this have been put on hold. I'll close this issue for now.
N.b. even if you had a copy of the source before they closed it: because there are no explicit license headers on any of it, you probably shouldn't have ever looked at it, so don't think that's a basis for starting an open-source version.
Surprise.
(Edit: to be clear, this wasn't intended to do anything other than communicate something I found very surprising after learning it.)
Thank you @ahrens and @delphix for even considering open-sourcing this.
As someone who's involved in many open source projects, I know how much work it is to maintain these projects. That's also not even considering the financial and competitive loss that would come from open-sourcing a technology that your (relatively small) company has spent significant headcount to develop. As such, I respect the decision to not open source it (though I continue to hope it becomes open source eventually).
It's easy to be disappointed that this specific feature will not be open sourced, but there's already so much that's been open-sourced and I'm grateful to be able to benefit from this technology that I don't pay for. Thank you for all the work you have already spent on maintaining OpenZFS.