Improvement: Remove Token Validation from Get&LinkPhysicalAddress

Open itaiad200 opened this issue 1 year ago • 0 comments

Based on this design, we introduced a token for GetPhysicalAddress & LinkPhysicalAddress. The concerns were around data integrity with/without garbage collection:

  1. When GC runs, it may pick up objects from the storage namespace that are not pointed to by a commit or an uncommitted entry. The mitigation is to keep a list of unused physical addresses (each one is a token) and list them during GC. This can already be mitigated today by setting the minimum age for object retention (default is 6 hours).
  2. Another concern was the ability to relink the same object to multiple entries. The token is used for integrity and can only be used once; linking the same token again will fail. The reason is that linking objects during GC could lead to an integrity issue:
    1. GC marks a committed object for cleanup
    2. LinkPhysicalAddress is called to link that object to another branch and succeeds
    3. The object is cleaned by the GC.

I suggest we remove this limitation and shift the responsibility to the user side. If a token is misused, it's unsafe when GC runs. In real-world usage, this is mostly handled by lakeFS published clients, like lakectl or lakeFS HadoopFS.
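The three-step race above can be sketched with a toy in-memory model. This is only an illustration: `run_race`, the object address, and the entry path are hypothetical names, not lakeFS APIs, and the single-use check is reduced to a set lookup.

```python
# Toy model of the GC/link race. All names are hypothetical, not lakeFS APIs.

def run_race(enforce_single_use: bool) -> list[str]:
    """Return the entries left pointing at objects that GC deleted."""
    storage = {"data/obj1"}        # objects in the storage namespace
    used_tokens = {"data/obj1"}    # obj1 was linked once already, then committed
    entries: dict[str, str] = {}   # "branch/path" -> physical address

    # Step 1: GC marks the committed object for cleanup.
    marked = {"data/obj1"}

    # Step 2: a client tries to link the same address on another branch.
    address = "data/obj1"
    if enforce_single_use and address in used_tokens:
        pass  # link rejected: the token was already consumed
    else:
        entries["dev/a.parquet"] = address

    # Step 3: GC deletes what it marked.
    storage -= marked

    # Any entry whose address is gone from storage is now dangling.
    return [path for path, addr in entries.items() if addr not in storage]
```

With the single-use check enabled, `run_race(True)` leaves no dangling entries. Without it, `run_race(False)` leaves `dev/a.parquet` pointing at a deleted object, which is the risk that would move to the client side under this proposal.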

Managing the tokens in the backend can lead to bad performance. All tokens in a repo are managed in the repo partition. Currently a single write leads to 3 operations on that partition: a SetIf of the token on GetPhysicalAddress, then a Get and a Delete of it on LinkPhysicalAddress. Having a partition that is written to twice during a single write halves the write capacity.
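To make the operation count concrete, here is a minimal sketch that mimics the token lifecycle against an in-memory key-value store and counts operations per partition. The `CountingKV` class, the `token/` key prefix, and the function bodies are assumptions made for illustration; they do not reproduce the actual lakeFS implementation.

```python
from collections import Counter


class CountingKV:
    """In-memory key-value store that counts reads and writes per partition."""

    def __init__(self):
        self.data = {}
        self.ops = Counter()

    def set_if_absent(self, partition, key, value):
        # Conditional write: counted as a write even if the key already exists.
        self.ops[(partition, "write")] += 1
        if (partition, key) in self.data:
            return False
        self.data[(partition, key)] = value
        return True

    def get(self, partition, key):
        self.ops[(partition, "read")] += 1
        return self.data.get((partition, key))

    def delete(self, partition, key):
        self.ops[(partition, "write")] += 1
        self.data.pop((partition, key), None)


def get_physical_address(kv, repo, address):
    # Hypothetical stand-in: register the token with a conditional set.
    if not kv.set_if_absent(repo, "token/" + address, "pending"):
        raise ValueError("address already issued")
    return address


def link_physical_address(kv, repo, address):
    # Hypothetical stand-in: validate the token, then consume it.
    if kv.get(repo, "token/" + address) is None:
        raise ValueError("unknown or already used token")
    kv.delete(repo, "token/" + address)


kv = CountingKV()
addr = get_physical_address(kv, "repo1", "data/obj1")
link_physical_address(kv, "repo1", addr)
```

After one upload, `kv.ops` records two writes and one read on the `repo1` partition, all for token bookkeeping. Dropping the validation would remove these three operations from the write path.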

itaiad200 · May 02 '24 17:05