User Interface Design for Object Storage Pools
This issue has been opened to provide a venue for discussion and feedback on the proposed CLI design for interacting with and creating pools using the upcoming Object Storage feature.
Introduction and Context
For those who have not yet been introduced to the concept, here is some background on the project.
Modern cloud services provide access to Object Storage APIs. These APIs offer low storage costs and high throughput compared to more traditional device-based storage options in the cloud, with the drawback that they tend to have higher latency and charge per request. Users who wish to store extremely large amounts of data with low-frequency access patterns may wish to use object storage instead of traditional device-based storage. However, ZFS currently provides no way to interface with these objstores, as it assumes a traditional disk will back its storage. Naively using a single object to provide a "disk" for use with ZFS would result in extremely poor performance and high costs.
The proposed project would instead add support for a new type of vdev. This object storage vdev uses a special storage format, reducing costs and increasing performance. A new caching infrastructure is also being developed, which allows the bulk of reads to be served without accessing the objstore, again reducing costs and increasing performance. This caching infrastructure can be viewed as an alternative to the L2ARC with improved scalability.
New Parameters
In order to interface with the cloud-provided object storage, a few pieces of information are required. First, ZFS needs to know what endpoints to use to reach the objstore. Second, most objstores have a concept of "buckets", which are collections of objects that are managed and access controlled together. ZFS needs to know which bucket to store its data in. Third, credentials are required to access the objstore. Some objstores also have a concept of "regions", which represent different geographical locales one might store data in. These regions are often also reflected in the endpoints, but not always.
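For concreteness, here is how those four pieces map onto the AWS S3 values used in the examples below (the shell variable names are purely illustrative and not part of the proposal):
ENDPOINT="https://s3-us-west-2.amazonaws.com"    # endpoint used to reach the objstore
BUCKET="bucketname"                              # bucket that will hold the pool's objects
CREDENTIALS="file:///etc/zpool_credentials"      # where to retrieve access credentials
REGION="us-west-2"                               # geographical region (often implied by the endpoint)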
User Interface Design
The only part of the existing UI that needs to change to support the feature is the zpool command. A new command is also added to launch the ZFS Object Agent, which is the component responsible for communicating with the objstore on the kernel's behalf.
zpool create
When creating a new object storage pool, the objstore-related parameters need to be specified. In addition, the pool layout must be specified in such a way that ZFS knows which vdevs are objstore vdevs and which are not. A sample command looks something like the following:
zpool create -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname s3 bucketname log sdd
This would create a new pool named poolname, using AWS S3 as the objstore. It would retrieve the credentials from the file at /etc/zpool_credentials (other options besides a file:// URI include env and prompt, for retrieving credentials from environment variables or prompting the user for them). The region would be us-west-2 and the endpoint would be https://s3-us-west-2.amazonaws.com. It would also add a log device using a locally attached disk, sdd.
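As a hedged illustration (the exact spelling of the non-file credential sources isn't pinned down here), the same pool could plausibly be created with credentials taken from environment variables instead of a file:
zpool create -o object-credentials-location="env" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname s3 bucketname log sdd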
For AWS, the region could be derived from the endpoint; however, this would require cloud-specific logic beyond their APIs, and may not be reliable.
The endpoint, region, and credential location (but not credential value, which must be retrieved at each import) will be stored in the pool config. The bucket name is a vdev property.
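Assuming these end up exposed as ordinary pool properties (an assumption on my part, not something stated above), they could presumably be inspected after creation with something like:
zpool get object-endpoint,object-region,object-credentials-location poolname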
zpool import
Cachefile import
zpool import -c cachefile poolname will import a pool that is located in the cachefile, using the pool properties stored in the cachefile.
zpool import -d bucketname -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" scans for pools in the bucketname bucket. Multiple bucket names can be specified with multiple -d flags.
zpool import -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" scans for pools in all buckets accessible by the given credentials.
If a pool name or guid is provided to either of the previous commands, the pool with the given name will be imported.
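Putting those pieces together, importing a specific pool discovered in a given bucket would look something like the following (same illustrative values as above):
zpool import -d bucketname -o object-credentials-location="file:///etc/zpool_credentials" -o object-endpoint="https://s3-us-west-2.amazonaws.com" -o object-region="us-west-2" poolname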
zpool *
All other pool operations require no additional parameters or changes to their usage. One note is that in the current design, if an objstore vdev is in use, no other top-level vdevs are allowed. Additionally, objstore vdevs cannot (currently) be mirrored, though that might be possible to extend in the future.
zfs_object_agent
Currently, the userland agent that communicates with the cloud provider on the kernel's behalf requires few parameters; the -v flag can be provided repeatedly for increased logging verbosity, and it takes an optional argument that specifies the directory it should use to store its sockets. It also accepts the AWS_PREFIX environment variable, which is intended for use in testing and causes all objects to be written to and read from a path with that prefix prepended.
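As a rough illustration only (the socket directory is described above merely as "an optional argument", so its placement here is an assumption, and the path and prefix are made up):
AWS_PREFIX="test-run/" zfs_object_agent -vv /run/zfs_object_agent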
Conclusion
This is the current proposed UI design for object storage pools in ZFS. If people have questions or comments about the proposed UI design, feel free to bring them up here. If you have questions about the project more generally, please reach out to me via email ([email protected]), on the OpenZFS Slack (@pcd), or on the openzfs IRC channel (currently #openzfs on Freenode, in the process of moving to #openzfs on Libera; I am pcd in both).
For slightly more background info, I talked about this project briefly on the April video call, and again on the June video call.
creating pools using the upcoming Object Storage feature.
Where can I find more information on this "Object Storage feature"?
I'm interested in knowing more about how this would be implemented. There's a few technical challenges that I'm curious about:
- Most S3-like object pools do not provide any concept of a transaction. How do you make it such that a set of mutations to multiple objects can be atomically committed or not? In the case of a block device, it seems (I'm still learning about the internals of ZFS) that ZFS writes to unused blocks. If the uberblock has not yet been updated, then the intermediate block updates are functionally invisible in the event of a sudden crash. However, for object storage, the intermediate objects are already stored into the pool. If ZFS crashes, then those intermediate objects stay allocated, but the system (upon booting back up) may not know about their existence.
- Also, what type of consistency would be expected for the S3-like storage? It wasn't until relatively recently that S3 provided strong read-after-write consistency. Many other S3-like storage options may not necessarily provide the same consistency guarantees.
- The article from Amazon seems to describe read-after-write (RAW) consistency for the same object key. Presumably, creating a zpool on top of S3 would require some form of RAW consistency guarantee across disjoint object keys.
- Whatever consistency guarantees are expected, it might be worth exploring having zpool create do a small unit test where it does a sanity check to ensure that the S3-like store actually provides the consistency guarantees that the rest of the implementation depends on (see the sketch after this list).
- "zpool * ... All other pool operations require no additional parameters or changes to their usage." I assume zpool scrub will fully scrub the actual data of an S3-like bucket. Assuming we can trust that the S3-like service will provide the guarantee of reliable data storage, I suspect there's a use-case for scrubbing just the metadata to make sure it is consistent.
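To make the sanity-check idea above concrete, here is a minimal sketch of the kind of probe zpool create could run, expressed with the stock AWS CLI (the bucket name, key, and pass/fail handling are all illustrative; a real implementation would live inside ZFS rather than shelling out):
# write a fresh object, immediately read it back, and compare
echo "consistency-probe-$(date +%s)" > /tmp/probe
aws s3api put-object --bucket bucketname --key zfs-consistency-probe --body /tmp/probe
aws s3api get-object --bucket bucketname --key zfs-consistency-probe /tmp/probe.out
cmp -s /tmp/probe /tmp/probe.out && echo "read-after-write OK" || echo "stale or missing read"
aws s3api delete-object --bucket bucketname --key zfs-consistency-probe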
Overall the proposed user interface sounds great. It doesn't seem to overly assume AWS S3 as the backing store, and I would recommend trying to make the implementation agnostic towards any specific S3-like implementation and avoid using any specialized AWS S3 features.
@dsnet
Where can I find more information on this "Object Storage feature"?
All the available information is in the issue, and in the development branch.
How do you make it such that a set of mutations to multiple objects can be atomically committed or not?
Roughly similar to block-based ZFS: there's an object for each of the last few TXGs, which points to other objects that contain more metadata, which point to other objects that contain metadata... which forms a tree.
for an object storage, the intermediate objects are already stored into the pool. If ZFS crashes, then those intermediate objects stay allocated, but the system (upon booting back up), may not know about their existence
Correct. Upon starting back up, we'll find these objects and remove them. They have keys that are lexicographically after the last valid objects, so we can use the ListObjects API to find them quickly.
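For illustration only (not the agent's actual code), the S3 ListObjectsV2 API supports exactly this kind of lexicographic scan; with a hypothetical last-valid key of pool/data/000123, the leftover objects could be listed with:
aws s3api list-objects-v2 --bucket bucketname --start-after "pool/data/000123"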
what type of consistency would be expected for the S3-like storage
The design doesn't rely on read-after-write consistency; eventual consistency is fine. This is straightforward because we only overwrite objects with contents that are logically identical (and self-describing). E.g. overwriting data objects to remove freed blocks.
it might be worth exploring having zpool create do a small unit test where it does a sanity check to ensure that the S3-like store actually provides the consistency guarantees that the rest of the implementation depends on
That's a great idea. It would also be useful to be able to test object storage that doesn't have the strong consistency of S3. Do you know of any S3-protocol object storage services (or better, software we can run in-house for testing) that implements weaker consistency?
I assume zpool scrub will fully scrub the actual data of an S3-like bucket.
Correct.
I suspect there's a use-case for scrubbing just the metadata to make sure it is consistent
I think that would be just as useful as it is with block-based pools. We aren't planning to add it as part of this work, but if you feel inspired, feel free to work on it. Note that you can approximate this behavior with the zfs_no_scrub_io module parameter (all metadata will be read, although not fully scrubbed, meaning that we only read one copy of the metadata, not all copies/parity).
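For example, on Linux the approximation described above could be tried like this (zfs_no_scrub_io is an existing module parameter; the pool name is a placeholder):
echo 1 > /sys/module/zfs/parameters/zfs_no_scrub_io    # skip data reads during scrub
zpool scrub poolname
echo 0 > /sys/module/zfs/parameters/zfs_no_scrub_io    # restore normal scrub behavior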
I would recommend trying to make the implementation agnostic towards any specific S3-like implementation and avoid using any specialized AWS S3 features
The current implementation should work with any S3-protocol (although we haven't tested it yet). The eventual goal of the project is for this to work with Azure blob storage as well. (That will probably come as a later PR.)
That's a great idea. It would also be useful to be able to test object storage that doesn't have the strong consistency of S3. Do you know of any S3-protocol object storage services (or better, software we can run in-house for testing) that implements weaker consistency?
S3 provides varying levels of consistency; its claims of strong read-after-write consistency only apply to a single caller, and there's software called s3backer that demonstrates how this consistency doesn't really hold in the context of ZFS.
If you're updating an object, there's eventual consistency. If you're creating a new object, it's replicated immediately so long as you don't call HEAD before you PUT it; if you call HEAD first, then it's all eventual consistency.
It should also be noted that AWS's cost model for S3 makes operations like zpool scrub prohibitively expensive. How does this proposal aim to reduce that burden?
The current implementation should work with any S3-protocol (although we haven't tested it yet). The eventual goal of the project is for this to work with Azure blob storage as well. (That will probably come as a later PR.)
They have very different requirements; please do the work before the S3 implementation is included to ensure that Azure Blob and Backblaze B2 will work with this as well.
Just piping up to mention Ceph object storage as another S3 compatible object store:
https://ceph.io/en/discover/technology/#object
We've been considering putting ZFS on a Ceph block store to complete our storage consolidation - having ZFS on Ceph object store would be excellent to remove a level of indirection.
I was wondering what the current state on this feature is? The development branch mentioned above seems to have been removed(?)
I was wondering what the current state on this feature is? The development branch mentioned above seems to have been removed(?)
Yep, it's concerning that there were no responses to the issues of eventual consistency or the cost to scrub the datasets. As someone who worked on this for half a decade, I think it's probable that they decided it wasn't worth the upfront dev investment or maintenance burden.
The design doesn't rely on read-after-write consistency; eventual consistency is fine. This is straightforward because we only overwrite objects with contents that are logically identical (and self-describing). E.g. overwriting data objects to remove freed blocks.
I don't see how overwriting an object removes the freed block; you must send a DELETE command to S3.
Unless plans have radically changed in the past month, this feature is still very much under active development. There were two great presentations about it at the OpenZFS Developer Summit on November 8th:
"ZFS on Object Storage" and "ZettaCache: fast access to slow storage"
This sounds terrific. I'm watching the dev summit talks right now.
We have a lot of backup zpools where ingress performance is important but egress is infrequent, some long-term log stores where we stash things that are unlikely to be retrieved, and also cluster nodes in distributed databases where we want at least one replica with stronger stability requirements than a typical ephemeral server.
I'm not sure how comfortable I'd be using these across the internet; how would ZFS handle occasional latency spikes? A mode like failmode=continue or even failmode=wait_patiently_it_will_eventually_come_back_i_promise might be needed.
mirrors
While mirrored vdevs are explicitly excluded in this initial release, they would be really useful.
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
Currently we zfs send the entire zpool through restic, which chunks, deduplicates, and syncs the data to cloud storage. This would remove a step and, most importantly, potentially make recovering from complete server loss significantly simpler:
- boot new replacement system
- import from cloud vdev
- add local vdevs back into mirror
- wait patiently for sufficient resilvering before rebooting
In my case, these zpools are all around 100-200 GiB in size, not huge.
migration
Is it planned in future to be able to migrate a pool between cloud vendors? Or would this need to be done via a full zfs send | recv instead?
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
Because it stores data differently, I don't think it can ever be mirrored. I could be wrong.
@dch
We have a lot of backup zpools
Object-based pools could be a good fit for this workload.
I'm not sure how comfortable I'd be using these across the internet, how would zfs handle occasional latency spikes? A mode like failmode=continue or even failmode=wait_patiently_it_will_eventually_come_back_i_promise might be needed.
Occasional latency spikes should be handled fine, up to the point that txgs get backed up (they can buffer up to 4GB of dirty data by default). Outages of the object store (or if the agent process dies) are always handled patiently (i.e. the kernel just waits forever for it to come back). It might make sense to add a mode that fails reads after some (long) timeout.
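As a hedged aside, the 4GB dirty-data ceiling mentioned above most likely corresponds to the existing zfs_dirty_data_max tunable; on Linux it can be raised to give more headroom to absorb latency spikes, e.g.:
echo $((8 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max    # 8 GiB, illustrative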
I would love to have a vdev in a mirror on our key systems, that is detached/offline most of the time, and brought online periodically to resilver, that is cloud backed.
The design doesn't currently handle that. We'll have to think more about how possible that would be.
is it planned in future to be able to migrate a pool between cloud vendors? Or would this need to be done via a full zfs send | recv instead?
If the cloud vendors both use the same object store protocol (e.g. S3 protocol, which AFAIK is used by almost everyone except Azure), then you should be able to use external tools to copy all the objects to the other cloud and then import the new (copied) pool. You could also use this method to copy/replicate a pool to a different region (datacenter) within the same cloud vendor.
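As an illustrative sketch of that external copy (rclone is just one such tool; "aws:" and "othercloud:" are hypothetical remotes the user has already configured, and the bucket names are placeholders):
rclone sync aws:source-bucket othercloud:dest-bucket    # copy every object to the destination provider
After the copy completes, the pool would be imported on the destination side as described in the zpool import section above.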
The "development branch" that @ahrens linked above is gone, as is the whole repository in fact. "Object storage" doesn't seem to have already been merged. So to ask again:
Where can I find more information on this "Object Storage feature"?
@infogulch We aren't quite ready to share the code yet. For now, the best place to learn more would be from the talks that we gave at the OpenZFS Developer Summit: Object Store (slides and video) and Zettacache (slides and video).
@ahrens ok thanks for the links, I found this thread after watching those and came here digging for more. Godspeed!
I too am very interested in testing this feature out.
A use-case I have:
I have a NAS at home, for which I have two 6TB drives (mirrored) to back up some important local data. ZFS has been instrumental in keeping down anxiety about my data silently corrupting, which I value highly.
However, I would of course not have good data protection policies without backing up off-site, so I back up using restic to a remote S3 bucket over a relatively slow (50 Mbit/s down, 10 Mbit/s up) connection; this is done with snapshots, overnight and over the week(s).
If it were possible for the S3 upload to be "async" (where data is stored locally, redundantly and quickly, but synced to S3 over a long period of time), then I would absolutely love this feature, as it means I could upload all the data I'd possibly need, have it served from the zettacache if it's recently relevant, and be able to have it "synced" to S3 after I've recently uploaded it to my NAS.
TL;DR: allow the S3 upload to be not strictly synchronous, but async over a very long window of time (persistent across crashes/reboots).
Quite excited for this feature!
One thing I think might be worth considering is support for some more traditional file sharing protocols. In my particular use case, we have a storage box that is accessible via FTP(S), SFTP/SCP, Samba and WebDAV that I would love to use with this feature.
As is, I could fuse mount the box using one of the supported protocols, and then use something like Minio to slap an S3-like API on top. That would probably work well enough, as performance isn't particularly important here, but it sounds kind of convoluted.
Bespoke support for some of those protocols would be nice (I think WebDAV might be quite close to the S3 API), but one way to get a bunch of them at once might be to support a local directory of files, which would work with anything you can mount.
One more thought: What would happen when importing the pool on multiple systems at once? I'd assume multiple writers wouldn't be supported, but could single-writer multiple-reader work by importing the pool read-only on all but one node?
@ShadowJonathan We aren't planning to support that use case (although you could have separate pools locally and on S3, and use send/recv to copy the data to the cloud). But it might be possible to design this as an extension on what we're doing.
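A minimal sketch of that separate-pools workaround, assuming "localpool" is the local pool and "cloudpool" is an object-store pool created as described in this proposal (names and snapshot label are placeholders):
zfs snapshot -r localpool@nightly                              # snapshot the local data
zfs send -R localpool@nightly | zfs recv -F cloudpool/backup   # replicate it to the object-store pool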
@ahti Storing an "object store" pool on other protocols is a neat idea. We're using Minio for testing, but adding a backend of one file per object as a replacement for the S3 protocol would fit into the design well.
What would happen when importing the pool on multiple systems at once? I'd assume multiple writers wouldn't be supported, but could single-writer multiple-reader work by importing the pool read-only on all but one nodes?
You can't import a pool writeable on multiple systems (and we have an always-on MMP-like mechanism to prevent this). You could import a pool readonly on multiple systems and, like with block-based pools, use zpool checkpoint to allow modifying one writeable import while the readonly imports (of the checkpoint) are guaranteed to continue working. I don't think that zpool checkpoint works with object store pools currently, but I think it would be a relatively straightforward extension.
Potentially relevant is the recently announced Neon.tech, which maps the object store API onto an arbitrary number of block devices by splitting reads and writes into pageserver and walserver services; see "Architecture decisions in Neon". Neon is currently targeting postgresql specifically, but the developers have indicated that it could be used for arbitrary block devices.
This could be useful directly as a backend for ZFS (like this proposal), or as a model for designing such a backend. In any case it may be interesting to compare. What makes me interested in a system that is implemented at the FS layer such as ZFS is that it would support many applications right away (tuning notwithstanding), including databases other than postgresql as well as raw file datasets.
Hello, I want to know what the current status of this feature is. At the '21 summit it was planned to be version 3.0 in '22, but I didn't hear anything about this feature at this year's Developer Summit.
Delphix's plans for open-sourcing this have been put on hold. I'll close this issue for now.
N.b. even if you had a copy of the source before they closed it: because there are no explicit license headers on any of it, you probably shouldn't have ever looked at it, so don't think that's a basis for starting an open-source version.
Surprise.
(Edit: to be clear, this wasn't intended to do anything other than communicate something I found very surprising after learning it.)
Thank you @ahrens and @delphix for even considering open-sourcing this.
As someone who's involved in many open source projects, I know how much work it is to maintain these projects. That's also not even considering the financial and competitive loss that would come from open-sourcing a technology that your (relatively small) company has spent significant headcount to develop. As such, I respect the decision to not open source it (though I continue to hope it becomes open source eventually).
It's easy to be disappointed that this specific feature will not be open sourced, but there's already so much that's been open-sourced and I'm grateful to be able to benefit from this technology that I don't pay for. Thank you for all the work you have already spent on maintaining OpenZFS.