HDDS-10527. KeyOverwrite with optimistic locking
What changes were proposed in this pull request?
This change introduces the ability to re-create / overwrite a key in Ozone using an optimistic locking technique.
Say there is a desire to replace a key with some additional data added somewhere in the key, or perhaps change its replication type from Ratis to EC. To do this, you can read the current key data, write a new key with the same name, and then on commitKey, the new key version will be visible.
However, there is a possibility that some other client deletes the original key, or re-writes it at the same time, resulting in potential lost updates.
To replace a key in this way, the proposal is to use the existing objectID and updateID on the key to ensure the key has not changed since it was read. The flow would be:
- Get the keyInfo for the current key.
- Call the new bucket.overwriteKey() method, passing the details of the existing key.
- This call will adjust the keyArgs to pass two new fields, overwriteObjectID and updateObjectID, which are taken from the objectID and updateID of the existing key.
- When OM receives the open key request, it checks that an existing key is present having the passed keyname, objectID and updateID. If not, an error is returned. Otherwise the key is added to the openKeyTable, storing the overwrite IDs.
- The data is written to the key as usual.
- On key commit, the values stored in the openKey table for the overwrite IDs are checked against the current key. If the current key is absent, or its IDs have changed, the commit will fail and an error is returned. Otherwise the key is committed as usual.
This technique is similar to optimistic locking used in relational databases, to avoid holding a lock on an object for a long period of time.
Notably there are no additional locks needed on OM and no additional calls or rocksDB reads required to implement this - passing and storing the IDs in the openKey table is all that is required. The overwriteIDs don't need to be stored in the keyTable.
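To make the commit-time check concrete, here is a minimal sketch of the verification described above. This is illustrative only, not the actual OM commit-request code; the helper name and the error result code are assumptions.

```java
// Illustrative sketch only: compare the IDs recorded with the open key
// against the key currently in the keyTable before allowing the commit.
// The result code used here is an assumption.
static void verifyKeyUnchanged(OmKeyInfo currentKey, long overwriteObjectID,
    long overwriteUpdateID) throws OMException {
  if (currentKey == null
      || currentKey.getObjectID() != overwriteObjectID
      || currentKey.getUpdateID() != overwriteUpdateID) {
    // The key was deleted or modified since the client read it, so fail the
    // commit and let the client re-read and retry - the same idea as
    // optimistic locking in a relational database.
    throw new OMException("Key was modified since it was read; overwrite aborted",
        OMException.ResultCodes.KEY_NOT_FOUND);
  }
}
```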
This change only adds the feature for Object Store buckets for now.
Additionally, there is a question over what to do about metadata and ACLs: should they be copied from the existing key, or passed from the client?
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10527
How was this patch tested?
New integration and unit tests added.
Why not just introduce an object generation field and expose this API (in the future, to applications themselves)? Reference: https://cloud.google.com/storage/docs/json_api/v1/objects/update
| Parameter Name | Value | Description |
|---|---|---|
| *Optional query parameters* | | |
| generation | long | If present, selects a specific revision of this object (as opposed to the latest version, the default). |
| ifGenerationMatch | long | Makes the operation conditional on whether the object's current generation matches the given value. Setting to 0 makes the operation succeed only if there are no live versions of the object. |
| ifGenerationNotMatch | long | Makes the operation conditional on whether the object's current generation does not match the given value. If no live object exists, the precondition fails. Setting to 0 makes the operation succeed only if there is a live version of the object. |
| ifMetagenerationMatch | long | Makes the operation conditional on whether the object's current metageneration matches the given value. |
| ifMetagenerationNotMatch | long | Makes the operation conditional on whether the object's current metageneration does not match the given value. |
| overrideUnlockedRetention | boolean | Applicable for objects that have an unlocked retention configuration: required to be set to true if the operation includes a retention property that changes the mode to Locked, reduces the retainUntilTime, or removes the retention configuration from the object. |
I think it would be possible to extend to this in the future, but it may need more thought about what operations it impacts. Implementing this sort of generic approach probably needs to cover key delete and create (overwrite) at least, and perhaps move / rename too. We would need to think through what it would mean for FSO buckets (perhaps nothing different, but I have not given it deep thought).
The workings of the proposed solution here are hidden behind the new overwriteKey API on the bucket object, and that is the only external API I have changed. If we were able to get the generation concept added, overwriteKey could be changed to use it and further enhance this sort of feature.
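Purely as an illustration of that idea, an overwrite based on a generation field might look like the sketch below. None of these accessors or preconditions exist in Ozone today; the names are hypothetical.

```java
// Hypothetical sketch only: a GCS-style conditional overwrite based on a future
// generation field. getGeneration() and ifGenerationMatch() do not exist in
// Ozone today.
OzoneKeyDetails existing = bucket.getKey(keyName);
long expectedGeneration = existing.getGeneration();          // hypothetical accessor

try (OutputStream out = bucket.overwriteKey(keyName, newRepConfig,
    ifGenerationMatch(expectedGeneration))) {                // hypothetical precondition
  // Write the replacement data; the commit would fail if the key's
  // generation had changed since it was read.
}
```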
Created https://issues.apache.org/jira/browse/HDDS-10558 for the idea suggested by @kerneltime, as it is something that is worth exploring, but it probably needs a bit more design work than just this PR.
It would not make sense to have a generation and also use objectID. It is essentially the same capability, one using just the objectID and the other having a more developer-friendly generation ID. I would recommend introducing a generation ID here instead of using objectID. It would not make sense to expose the object ID to developers.
ObjectID is already there, as is updateID. They have not been introduced here, and they are already persisted and managed in OM. This PR only publicly exposes the new method on the bucket, with no intention of exposing it further.
If we are going to expose GenerationID through the APIs in the way the Google Cloud docs indicate, then we need to decide if that is something we want to start with. E.g. Google Cloud supports it, but AWS does not. It's more test surface, more code to write and expose via all the interfaces, and most importantly more to be supported going forward. We would also have to worry about forward / backward compatibility if we are adding another ID to be persisted in OM. Old keys will not have a generationID, but perhaps it can be derived from the object / update ID. How does it tie in with the object version? If it is a new ID to be stored, it will add a small storage / memory overhead on OM. I don't believe we should implement something like that without giving it due consideration.
What is in this PR could easily be changed to use a generationID if one were introduced later and this feature sees some adoption, as the use of object / update ID is contained and hidden from users of the API; that is why I would suggest exploring GenerationID in the other Jira I raised. I am trying to make the smallest useful change possible to allow for atomic key overwrite, without closing any paths for future improvements. Therefore I would like to move any design around GenerationID into the other Jira and move ahead with this one in its current direction.
The generationID in Google Cloud appears to be like the version in AWS and in Ozone to some extent (I believe versioning is not fully implemented in Ozone).
From the docs:
There is no guarantee that generation numbers increase for successive versions, only that each new version has a unique generation number. There is no relationship between the generation numbers of unrelated objects, even if the objects are in the same bucket.
From that, it is not clear if each generationID is unique across the bucket, or if two different keys can have the same generationID.
In Ozone, if you create a key, with objectID=5, and then create a new key of the same name, the objectID, I believe, remains the same, but the update_id is incremented.
If you deleted the key (without versioning enabled) and recreated it, the object_ID will change.
UpdateID comes from the Ratis transaction ID (if Ratis is enabled), so it is probably unique on its own without the objectID, but I am not sure if that can be trusted without Ratis enabled. This is also suggested by comments in WithObjectID.java.
Therefore I think that to guarantee an object has not changed, we need both object and update ID.
For now, I don't think we should expose any of this via the S3 or Hadoop compatible filesystems. We could also tag the new bucket.overwriteKey() as experimental for now, giving us scope to remove it or change it later if we decide it is not useful. However I still think it could be adapted to a new approach easily behind the public interface. Basically, I think the whole Ozone version story, and the version ID along with object and update ID, needs to be fully worked out before we expose anything more widely than is done in this PR.
> The generationID in Google Cloud appears to be like the version in AWS and in Ozone to some extent (I believe versioning is not fully implemented in Ozone).

Versioning != generation ID. Generation is just a monotonically increasing number without preserving the older generations.
> From the docs:
> There is no guarantee that generation numbers increase for successive versions, only that each new version has a unique generation number. There is no relationship between the generation numbers of unrelated objects, even if the objects are in the same bucket.
> From that, it is not clear if each generationID is unique across the bucket, or if two different keys can have the same generationID.

Each object has its own generation field; it is not based on the bucket.
@devmadhuu Thanks for the review. I added one more commit that adds details to the audits.
@kerneltime In Google Cloud, they use the generationID to access versions of an object like:
gs://bucket/object#generation_ID
So there is surely some overlap with the Ozone version and how AWS does things. Objects that are not versioned also have a generationID, so it's not tied to versioning, but it is used in versioning.
> In Ozone, if you create a key, with objectID=5, and then create a new key of the same name, the objectID, I believe, remains the same, but the update_id is incremented.

Object ID is derived from the Ratis transaction index of the create request. See the method doing the calculation that is called from all create requests. Note the extra space created around each object ID to account for directories that might need to be created with a file create request. Creating a new key with the same name will have a different object ID.
Update ID is also set on key create, but additionally changed when metadata like native ACLs are changed. This is just the Ratis transaction index with no additional modifications. See this method and everywhere it is used for what operations modify the update ID. I'm not sure what you mean by "update ID is incremented". The update ID of the new key will be arbitrarily larger than that of the old key since it is a Ratis transaction index, but its value is not related to the update ID of the previous version of the key.
@errose28 I think that for HSync (HBase-related changes), when a key is appended or synced, the update ID should be increased while the objectID stays the same. When I first read your comment, I thought maybe I could cut this down to just checking the objectID, but I think the append / hsync thing changes the picture slightly, needing the update ID to be checked too if we want to ensure there are no changes.
By "update ID is incremented" I just meant it was changed. Not that it is increased by 1. I was aware that both IDs are derived from the Ratis Transaction ID.
> When I first read your comment, I thought maybe I could cut this down to just checking the objectID

Actually I'm thinking we can just check the update ID to determine if the key changed. The case where object ID alone would be useful is if only a key's native ACLs were updated and we wanted to disregard this change. However, I think we want this API to preserve native ACLs in the original key. This makes it consistent with Ranger ACLs, which will remain the same as long as the key is re-written to the same location.
> but I think the append / hsync thing changes the picture slightly, needing the update ID to be checked too if we want to ensure there are no changes.

I don't think a key with an open lease being hsync'ed should be eligible for overwrite. I think this overwrite case is already blocked by existing hsync changes on the feature branch but we should double check to make sure we don't have issues later. I believe append will require an OM side update every time it is called, so yes that should increment the update ID.
> I don't think a key with an open lease being hsync'ed should be eligible for overwrite

A key being hsync'ed is "open but visible" in Ozone and hence should have a lease which blocks other writers.
I think the "hsync / hbase" work also allows for a key to be appended - ie the key is closed and committed. Then a writer reopens it and appends some new data and commits / closes it again.
While the first scenario should be blocked, I need to ensure we do the right thing if an append happens to a closed key that is currently being overwritten, and whether that is even possible! I will ask around about that.
I think you could be right that update_id is all we need as it will change on key append, hsync and object delete and recreate. It would be much nicer to only need to use 1 field.
Can this be done in a feature branch instead of the master? Right now there is no use of this change and it does not work for FSO and needs more testing.
> Can this be done in a feature branch instead of the master? Right now there is no use of this change and it does not work for FSO and needs more testing.

I am pretty strongly against feature branches, because:
- Nobody reviews changes landing in them except the few people working on them.
- They are painful to merge to the main branch
- They are painful to keep up to date with frequent merge conflicts.
- Merging to the main branch causes a lot of change dumped in all at once.
- Merging to main requires a vote for a changeset that frankly nobody ever reviews.
- We have no release in sight for Ozone, so there is plenty of time to add any additional PRs and tests.
These changes will do no harm if nobody uses them. We also plan to add FSO support afterward, but this PR is large enough already.
@errose28 I have addressed all the comments except extending the tests to ensure the ACLs are copied over. I will try to do that later today.
@sodonnel If there is a snapshot, then an overwrite will actually increase the space used, as with this solution there will be two copies: one in the snapshot and one in the active DB.
> @sodonnel If there is a snapshot, then an overwrite will actually increase the space used, as with this solution there will be two copies: one in the snapshot and one in the active DB.

Correct. This feature, at least initially, will only change the latest version of a key. If the key is under a snapshot it will cause the data to be duplicated.
> @sodonnel If there is a snapshot, then an overwrite will actually increase the space used, as with this solution there will be two copies: one in the snapshot and one in the active DB.
> Correct. This feature, at least initially, will only change the latest version of a key. If the key is under a snapshot it will cause the data to be duplicated.

@sodonnel IMO there are other impacts:
- When the snapshot is deleted, the old key's blocks also need to be deleted. Since there is no delete event, this could cause orphan blocks (where the active DB key has already been overwritten). We need to verify whether this is handled implicitly or not (as when another client overwrites a key with a new set of blocks).
- Quota size calculation -- the key's type can change from Ratis to EC or vice versa, so the impact on used size needs to be handled when calculating the difference between the old and new keys.
- Parallel operations -- if the key gets deleted before the overwrite commits, the blocks created for the failed overwrite need to be released. Is the key present in the open key table in this scenario, so that on failure, when the open key is cleaned up, the blocks are released?

Any reason for keeping the same updateID? Since we already have key overwrite, can the same logic be followed on the server side?
@sumitagrawl This change is the same as any other key overwrite, only it will not go through (ie commit) unless the update IDs match - that is the only change. All the delete / quota handling is unchanged. The original key gets deleted if the bucket is not versioned, in exactly the same way as if someone writes the same key again.
In the case of snapshot, the original key is deleted, but the snapshot causes the blocks to be retained. When the snapshot is deleted, the original blocks get deleted too. Same behavior as now. This PR doesn't change that at all.
@kerneltime After a discussion we had, you suggested the OzoneBucket.overwriteKey(...) API does not seem correct. I agree that the Java doc and the method name are not good and should be changed. However the way the API works is in the same pattern as the existing similar methods.
For example createKey looks like:
```java
public OzoneOutputStream createKey(String key, long size,
    ReplicationConfig replicationConfig,
    Map<String, String> keyMetadata)
    throws IOException {
  return proxy
      .createKey(volumeName, name, key, size, replicationConfig, keyMetadata);
}
```
Note it takes the keyName and repConfig and returns an output stream.
The way the new API is intended to be used is to replace an existing key with a new replication type. To do that, the user must first get the OzoneKeyDetails of the existing key, which will contain the updateID along with the keyName etc. Therefore the new API allows those key details to be passed in directly.
```java
OzoneKeyDetails existingKey = bucket.getKey(keyName);
try (InputStream in = bucket.readKey(keyName);
     OutputStream os = bucket.overwriteKey(existingKey, newRepConfig)) {
  IOUtils.copyBytes(in, os, 4096); // copy the existing data into the new key
}
```
To me this makes perfect sense as a usable API without any extra manipulation of the key read. It is pretty much in the same pattern as the other existing APIs, which was my intention.
It would be possible to change the parameter list to something like:
```java
bucket.atomicallyOverwriteKey(String keyName, long existingKeyUpdateID, ReplicationConfig newRepConfig)
```
But I actually think that is worse to use, as you then have to extract pieces from the keyInfo you loaded. And it directly exposes the updateID, which at the moment is hidden. Perhaps we want to expose it, but perhaps not at this stage. It may also be possible to overload the API to have both options.
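For comparison, a hypothetical call site for that alternative signature would have to pull the ID out of the key details first. atomicallyOverwriteKey and the updateID accessor below are illustrative only; updateID is not currently exposed on OzoneKeyDetails.

```java
// Hypothetical usage of the alternative signature above - atomicallyOverwriteKey
// does not exist; this just illustrates why it is clumsier for the caller.
OzoneKeyDetails existingKey = bucket.getKey(keyName);
try (OutputStream os = bucket.atomicallyOverwriteKey(
    keyName,
    existingKey.getUpdateID(),   // hypothetical accessor - updateID is not exposed today
    newRepConfig)) {
  // write the replacement data as before
}
```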
@sodonnel I think it would help if you were to write up a markdown document summarizing the new API and contribute that as its own PR that people could read and comment as an easier medium for discussion. A lot of the discussion happening on this PR is not related to the code, but to higher level design decisions that were made like ACL handling, tracking ID throughout the process, and the API name and spec itself. There may be good reason for doing it this way, but for myself and others in the community both now and in the future, it would be best to write out the options that were considered and why a given approach was chosen, rather than submitting code that has already made a lot of these decisions without showing the process to get there.
Some things that might be helpful to write out:
- What does the new API look like from the client side
- How will clients use the new API
- How does the API compare to what other storage systems offer? Are there any ideas there we can re-use?
- General correctness guarantees in terms of metadata and data
- What key metadata is preserved or updated at the time of create and commit
More generally I hope this extra step can become a normal part of any change having moderate impact, like a new client API we must support indefinitely.
@sodonnel Another query: if an overwrite happens and the old blocks are removed as cleanup, but a client is already in the middle of reading the blocks of a big file from the datanodes, the client may get a no-such-block exception from the DN. Are we planning to handle this on the client side by refreshing the metadata on this failure?
Added a design doc via PR #6482.
@sumitagrawl The problem you describe is already present for overwriting a key, or indeed if the balancer moves a container, or if the OM cache is holding stale container locations. There is already a retry facility in the client which attempts to obtain new block locations in the event that the locations it has obtained are invalid.
Converted to draft while #6482 is outstanding.
@errose28 @kerneltime I have updated this PR to match what we discussed in the design doc PR.
The only things in the design doc that are not included here are:
- Using the version framework in the client to detect an old server. As this is going to go on a branch, we will do that in a followup PR.
- Have not considered failing an atomic write at a block boundary. That will be investigated later, and is somewhat of a nice-to-have at this stage.
I also plan to send this PR to a branch - it will not be committed to master. I need to create the branch and then see if we can re-point the PR at the branch.
@errose28 @kerneltime, @adoroszlai has created the branch HDDS-10656-atomic-key-overwrite and pointed this PR at that branch.
Please let us know if you are happy with the changes here, or if you have further comments so we can progress toward committing this onto the branch.
@errose28 @kerneltime @adoroszlai have you got any further comments on this PR?
@errose28 I believe I have addressed your further comments. Please have a look and let me know what you think.
Looking good overall, I think we just need to determine when to call it rewriteGeneration and when to just use generation from this thread. I still think we need one method used by all tests that defines metadata expectations before and after an overwrite as described in this comment but if you want to do a follow up PR for that I'm ok.