
HDDS-11237. Implement BucketCache with expiry for FileSystem client rename and delete

tanvipenumudy opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

  • The patch implements a client-side, size-bounded LRU cache for managing bucket metadata, designed to handle up to 100 buckets by default.
  • The cache is particularly useful for operations such as fs.rename() and fs.delete(), where it significantly reduces the number of getBucket() RPC calls required for fetching simple information such as the volume/bucket names, the bucket layout, and the bucket link information.

The patch leverages CacheBuilder from Guava to implement an LRU cache. CacheBuilder enables us to define a maximum cache size and set an expiration time (defaulting to 10 minutes) for the cache entries (both maxCacheSize and expirationDuration are configurable Ozone properties).
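
For reference, a minimal sketch of how such a Guava cache can be constructed (the variable names and configuration wiring below are illustrative, not the actual Ozone property names):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

// maxCacheSize defaults to 100 entries and expirationDuration to 10 minutes;
// both are read from configuration per the description above.
Cache<String, OmBucketInfo> bucketCache = CacheBuilder.newBuilder()
    .maximumSize(maxCacheSize)                        // size-bound, LRU-style eviction
    .expireAfterWrite(expirationDuration, TimeUnit.MINUTES)
    .build();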

How does BucketCache help?

For instance, a simple test such as this one: AbstractRootedOzoneFileSystemTest.testRenameFile() originally required 6-7 round trips for getBucket() calls alone (i.e., InfoBucketRequest calls) between the client and OM (for a single bucket entry), even after the client-side optimization: HDDS-11238.

With the new BucketCache implementation in place, this is reduced to a single round trip for the getBucket() call. The subsequent requests are served from the cache, significantly lowering the number of round trips (and hence the round trip latency) between the client and OM.
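
In code, the saving comes from a get-or-load pattern along these lines (a sketch; the loader call is the getBucketInfo RPC from OzoneManagerProtocol, while the key format is illustrative):

String cacheKey = volumeName + "/" + bucketName;
// The first call triggers an InfoBucketRequest round trip to OM; later
// calls within the expiry window are answered from the cache.
// Cache.get(key, loader) may throw ExecutionException wrapping a loader failure.
OmBucketInfo bucketInfo = bucketCache.get(cacheKey,
    () -> ozoneManagerClient.getBucketInfo(volumeName, bucketName));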

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-11237

How was this patch tested?

Existing unit and integration tests should cover the changes; green CI run: https://github.com/tanvipenumudy/ozone/actions/runs/10142602029

tanvipenumudy avatar Jul 26 '24 08:07 tanvipenumudy

Hi @tanvipenumudy, thank you for your work on this. I have a few general doubts; let me share them.

When I moved off the getBucket call in HDDS-11235 and HDDS-11232, most of the tests touching it failed, because the semantics had changed. Here I don't see any test failures even though the semantics are changing, and this makes me nervous about the cache itself...

Let's look at a few cases of what can happen:

  1. In case the client initializes a FileSystem object, makes a call, and closes the object, nothing changes, other than that we have one element in a cache that is used only once before the cache is discarded. (This is completely fine.)
  2. In case the client initializes a FileSystem object and issues subsequent calls that should succeed, and the calls do not involve any manipulation of the bucket(s) after they are created and cached, then things are fine.
  3. In case the client initializes a FileSystem object and issues subsequent calls that should succeed, but a bucket is modified in the meantime (e.g. a property is changed), then the change goes undetected from the client's point of view unless it causes a subsequent call to fail. (I am not sure if there is such a scenario at the moment, though.)
  4. In case the client initializes a FileSystem object and issues subsequent calls that should fail in a particular way after a call removes the bucket, they might fail in a very different way, which can affect retries as well: in this case it is the server side that fails, and the failure handling in getBucket won't be effective, since we get the bucket from the cache and never see the error there.

This last case, as far as I can tell, is not handled in the current code, and it can lead to unexpected behaviour, or behaviour we do not have tests for.

There is another thing we need to put more thought into when it comes to caching buckets: what happens to the client if the bucket is recreated with different properties, or removed, after the client has cached it? The client then uses stale information, and I don't think the behaviour in these cases is fully defined, let alone fully correct... I do not see any handling that would clear the cached bucket info, so if my assumptions are correct, the client may behave incorrectly until the cache evicts the entry at the end of the 10-minute lifetime.
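
To make the concern concrete, here is a minimal sketch of the kind of explicit invalidation that appears to be missing, assuming a Guava cache on the client and the existing OMException result codes (the surrounding call and key format are illustrative):

// Sketch: if a bucket-dependent call fails on the server because the
// bucket is gone, drop the stale cache entry before rethrowing, so the
// next call re-fetches fresh bucket info from OM.
try {
  ozoneManagerClient.renameKey(keyArgs, toKeyName);
} catch (OMException e) {
  if (e.getResult() == OMException.ResultCodes.BUCKET_NOT_FOUND) {
    bucketCache.invalidate(volumeName + "/" + bucketName);
  }
  throw e;
}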

Let me add one more generic thought... I believe our first goal should be to change our code to use at most one VolumeInfo and at most one BucketInfo lookup, plus only the RPC calls required for the actual operation: for example, 3 for createKey (openKey, commitKey, and renameKey from x.COPYING to x), 1 for renameKey, 1 for deleteKey, and so on. Once that is done, we can introduce the cache and cache VolumeInfo and BucketInfo objects, in order to reduce the overall volume of RPC calls caused by long-running clients.

Why? Currently the code calls getBucket here and there; by introducing caching now, we would have to do error handling everywhere we use the bucket object, and at the moment that is a bit hectic and error-prone. If all our code that touches RPC is simplified, and it is clear how and when we issue an RPC, that helps greatly with handling errors properly. I know this is arguable, and I am sure it is just one point of view based on assumptions about parts of the code I have not examined much, but this is how I think about the problem. Please argue with me if I am wrong; I would be happy to introduce the cache and get rid of unnecessary InfoBucket calls first, but at the moment I don't think it is that simple, even for just the delete/rename flow...

fapifta avatar Jul 29 '24 15:07 fapifta

This change looks good but we do need to validate if there are any side effects of caching these. In the longer term I do believe we should change the client to OM interaction itself.

kerneltime avatar Jul 29 '24 20:07 kerneltime

Thank you @fapifta for taking the time to review the patch and sharing your thoughts, I do agree with most of the points shared.

Regarding points 2 and 3:

In case the client initializes a FileSystem object and issues subsequent calls that should succeed, but a bucket is modified in the meantime (e.g. a property is changed), then the change goes undetected from the client's point of view unless it causes a subsequent call to fail. (I am not sure if there is such a scenario at the moment, though.)

Currently, the properties being cached are the bucket layout semantics and the bucket link information (for the FS client rename and delete flow). Neither of these properties can be changed through the Ozone shell on an existing bucket; they can only change if the bucket is deleted and recreated with the same name. In that scenario, the client would keep using the stale cached information until the entry expires after the configured expiration time.

In such a case, we might also have to bring the objectID, which is unique (or the creation/modification time), into the cache key, and figure out a way for the client to retrieve cache entries based on this new key information.
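
For illustration, such a composite key might look like this (a hypothetical sketch only; how the client learns the objectID before the first lookup is exactly the open question):

// Hypothetical cache key that includes the immutable objectID, so a
// deleted-and-recreated bucket with the same name maps to a new entry.
final class BucketCacheKey {
  private final String volumeName;
  private final String bucketName;
  private final long objectID;

  BucketCacheKey(String volumeName, String bucketName, long objectID) {
    this.volumeName = volumeName;
    this.bucketName = bucketName;
    this.objectID = objectID;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof BucketCacheKey)) {
      return false;
    }
    BucketCacheKey k = (BucketCacheKey) o;
    return objectID == k.objectID
        && volumeName.equals(k.volumeName)
        && bucketName.equals(k.bucketName);
  }

  @Override
  public int hashCode() {
    return java.util.Objects.hash(volumeName, bucketName, objectID);
  }
}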

Regarding point 4:

In case the client initializes a FileSystem object and issues subsequent calls that should fail in a particular way after a call removes the bucket, they might fail in a very different way, which can affect retries as well: in this case it is the server side that fails, and the failure handling in getBucket won't be effective, since we get the bucket from the cache and never see the error there.

Right, this is a valid point! The current code does not handle this retry behaviour; we would need to analyze it and compare it against the current behaviour if we move ahead with this approach.

I believe our first goal should be to change our code to use at most one VolumeInfo and at most one BucketInfo lookup, plus only the RPC calls required for the actual operation: for example, 3 for createKey (openKey, commitKey, and renameKey from x.COPYING to x), 1 for renameKey, 1 for deleteKey, and so on. Once that is done, we can introduce the cache and cache VolumeInfo and BucketInfo objects, in order to reduce the overall volume of RPC calls caused by long-running clients.

This makes sense. We may revisit this as needed once we converge all RPC calls into at most one for every operation.

As @kerneltime suggested, in the long run we may have to let OM resolve the getBucketInfo() and getVolumeInfo() calls internally rather than have the client resolve this information; however, this would involve major API redesign (introducing new OM APIs and modifying a lot of existing code) along with the corresponding compatibility handling.

tanvipenumudy avatar Jul 30 '24 06:07 tanvipenumudy

In case the client initializes a FileSystem object and issues subsequent calls that should fail in a particular way after a call removes the bucket, they might fail in a very different way, which can affect retries as well: in this case it is the server side that fails, and the failure handling in getBucket won't be effective, since we get the bucket from the cache and never see the error there.

@fapifta IMO, we should use the cache only for the existing sanity checks and bucket type checks inside write calls like rename/delete. For read calls like getFileStatus/getBucketInfo, the cache should not be used, as the results can be stale and do not guarantee consistency. For the case you described, where the cache gives out a bucket that has been deleted, here are the steps we should follow:

Say the operation is a rename between the files vol/buck/dir/a and vol/buck/dir/b.

  1. The rename API on the client wants to know the type of the bucket on which the rename is being run, so it invokes getBucket(buck).
  2. Say someone deletes buck at this point, so OM no longer maintains buck on the server.
  3. The cache happens to still hold the details of buck, and the rename logic moves forward after getting the bucket type of buck.
  4. The rename request is sent to the server.
  5. This is the crucial step: the server also checks getBucketInfo(buck) while handling the request, and verifies that the response is non-null before actually processing the rename. Here the response will be null because the bucket is deleted, and OM should send a "BUCKET_NOT_FOUND" response to the rename (see the sketch below).
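
A minimal sketch of the server-side check in step 5 (the names approximate the OM metadata-manager API; treat them as illustrative):

// Server-side (OM) sketch: re-validate bucket existence while processing
// the rename, regardless of what the client has cached.
OmBucketInfo bucketInfo = metadataManager.getBucketTable()
    .get(metadataManager.getBucketKey(volumeName, bucketName));
if (bucketInfo == null) {
  // The client's cached view was stale; fail the rename explicitly.
  throw new OMException("Bucket not found: " + bucketName,
      OMException.ResultCodes.BUCKET_NOT_FOUND);
}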

This case is also applicable today without the cache, since the bucket can be deleted before the actual rename request reaches the server; we don't hold a lock between these operations. We already have that handling today, so we should be good. I'd say we should add some tests to cover this scenario.

sadanand48 avatar Jul 30 '24 06:07 sadanand48

To handle recreation of a bucket with the same name, something like this could be done:

  1. The bucket cache should store the bucket object ID along with the other required metadata, such as the bucket/volume name.
  2. When a request is sent to the server, say a rename request, the OM needs to compare the bucketObjID in the request against the bucketObjID of the BucketInfo object stored in the DB. If they mismatch, it should send an error response asking the client to invalidate its cache, roughly:
OmBucketInfo rocksdbBucketInfo =
    bucketTable.get(request.getRenameRequest().getBucketName());
if (request.getRenameRequest().getBucketObjectID()
    != rocksdbBucketInfo.getBucketObjectID()) {
  // Return an error response with status 'INVALIDATE_BUCKET_CACHE'.
}
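
The client-side counterpart could then look roughly like this (INVALIDATE_BUCKET_CACHE is the hypothetical new result code proposed above; the retry wiring is illustrative):

try {
  ozoneManagerClient.renameKey(keyArgs, toKeyName);
} catch (OMException e) {
  // Hypothetical result code from the server-side objectID check above.
  if (e.getResult() == OMException.ResultCodes.INVALIDATE_BUCKET_CACHE) {
    bucketCache.invalidate(cacheKey);  // drop the stale entry
    // Re-fetch fresh bucket info and retry the operation once.
  } else {
    throw e;
  }
}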

sadanand48 avatar Jul 30 '24 06:07 sadanand48

/pending different approach suggested by Pifta

adoroszlai avatar Apr 03 '25 09:04 adoroszlai

Thank you very much for the patch. I am closing this PR temporarily as there was no activity recently and it is waiting for response from its author.

It doesn't mean that this PR is not important or ignored: feel free to reopen the PR at any time.

It only means that attention of committers is not required. We prefer to keep the review queue clean. This ensures PRs in need of review are more visible, which results in faster feedback for all PRs.

If you need ANY help to finish this PR, please contact the community on the mailing list or the slack channel.

github-actions[bot] avatar Apr 25 '25 00:04 github-actions[bot]