Support client-side purge in REST catalog

Open flyrain opened this issue 1 year ago • 4 comments

Proposed Change

Current REST clients rely on the REST server to delete table files when a table is dropped with purge. There are two concerns with this approach:

  1. The REST server can't necessarily access users' storage. It can't delete table files if it doesn't have permission.
  2. The REST server may take a performance hit when purging a table with a large number of files.

I propose supporting client-side purge while still allowing server-side deletion, so that the current behavior remains compatible.

Option 1: put the purge state in the delete-table response.

DeleteTableResponse:
  type: object
  properties:
    purged:
      type: boolean

Clients can decide whether to delete files based on the response. If the files were deleted on the server side, the client does nothing; otherwise, it deletes them on the client side.
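The client-side decision in Option 1 could look roughly like the sketch below. It assumes the response carries only the proposed `purged` boolean. `ClientPurgeSketch`, `finishDrop`, and the `deleteFile` callback are hypothetical names for illustration, not part of Iceberg or the REST spec.

```java
import java.util.List;
import java.util.function.Consumer;

final class ClientPurgeSketch {
  /** Hypothetical client-side model of the proposed DeleteTableResponse. */
  static final class DeleteTableResponse {
    final boolean purged;

    DeleteTableResponse(boolean purged) {
      this.purged = purged;
    }
  }

  /**
   * Finishes a drop on the client. Files are deleted here only when the caller
   * asked for a purge and the server reported that it did not purge them.
   * Returns the number of files this client deleted.
   */
  static int finishDrop(
      boolean purgeRequested,
      DeleteTableResponse response,
      List<String> tableFiles,
      Consumer<String> deleteFile) {
    if (!purgeRequested || response.purged) {
      return 0; // nothing to do: no purge wanted, or the server already did it
    }
    tableFiles.forEach(deleteFile);
    return tableFiles.size();
  }
}
```

In a sketch like this, the client would probably need to collect the file list before issuing the drop, since the table metadata may no longer be loadable afterward.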

Option 2: check for the existence of table files on the client side.

The client can check whether the files exist and then decide whether to delete them. This doesn't need spec changes, but clients would rely on a convention instead of the spec, which is a bit ambiguous.
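A minimal sketch of the existence check in Option 2, with local filesystem paths standing in for table storage. `ExistenceCheckSketch` and `deleteRemaining` are hypothetical names; a real client would go through the table's own file I/O layer rather than `java.nio.file`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class ExistenceCheckSketch {
  /**
   * Deletes whichever of the known table files still exist and returns how
   * many this client removed. Files that are already gone are skipped.
   */
  static int deleteRemaining(List<Path> knownTableFiles) {
    int deleted = 0;
    for (Path file : knownTableFiles) {
      try {
        // deleteIfExists returns false when the file is already gone,
        // for example because the server purged it.
        if (Files.deleteIfExists(file)) {
          deleted++;
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return deleted;
  }
}
```

The ambiguity mentioned above shows up here: a file that still exists might mean the server will never purge it, or only that the server has not purged it yet.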

WDYT? Please share your feedback.

cc @RussellSpitzer @aokolnychyi @rdblue @danielcweeks @Fokko

Proposal document

No response

Specifications

  • [ ] Table
  • [ ] View
  • [X] REST
  • [ ] Puffin
  • [ ] Encryption
  • [ ] Other

flyrain avatar Apr 05 '24 19:04 flyrain

@flyrain I'm a little confused: how can the REST server not have access to the files? Currently the server needs access to at least the metadata files. Are you considering a situation where data files and metadata files are protected separately?

The way we've been thinking about REST puts the responsibility of the delete on the server (the client shouldn't be responsible for how or when the delete happens).

danielcweeks avatar Apr 05 '24 19:04 danielcweeks

That's right. In our case, the REST server cannot access every table file for the following reasons:

  1. Security policy doesn't allow the REST catalog, or any other catalog, to access users' data. Metadata access is fine.
  2. Some Iceberg tables are in HDFS with Kerberos, which makes them pretty hard to access from a centralized server.

We still write metadata.json files, but they are located in server-side storage instead of users' table storage. I understand this use case is a bit different from the one the REST catalog was introduced for, but I believe it is a valid use case, and we can extend the scope of the REST catalog a bit to support it. cc @RussellSpitzer

flyrain avatar Apr 05 '24 20:04 flyrain

@flyrain, I think your use case makes sense and that we should support some version of client-side purge. That said, I don't think that either option proposed here is the right solution. The problem with both is that this assumes that the purge needs to happen immediately, which isn't necessarily the case.

There's a lot of confusion about purging because in Hive there was no background process to clean up tables and file ownership wasn't clear. As a result, purge has conflicting meanings. It could be either that the table data is sensitive and needs to be deleted immediately, or it could be used to indicate that the data is owned by the table and should be cleaned up rather than left sitting in storage indefinitely. To make this worse, defaults are based on the second and more common interpretation: Iceberg's dropTable(Identifier) calls dropTable(identifier, true /* purge */) in the default implementation.

I want to avoid a case where we have purge-by-default trigger client behavior to actually delete files because catalogs can have much better handling now. For instance, our catalog will keep tables around for a few days that can be restored in case of accidental deletes. In that case, purge uses the first definition and if a client deleted all of the files immediately it would be a problem. We also have to ignore the client-side purge flag because we don't know whether it was defaulted or not.
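The "defaulted or not" problem can be illustrated with a simplified catalog interface. This is a sketch that uses plain strings instead of Iceberg's identifier type; `PurgeDefaultSketch` and `RecordingCatalog` are hypothetical names. Only the delegation from the one-argument `dropTable` to `dropTable(identifier, true /* purge */)` comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

final class PurgeDefaultSketch {
  /** Simplified stand-in for a catalog interface with a purge-by-default drop. */
  interface Catalog {
    boolean dropTable(String identifier, boolean purge);

    // Mirrors the default described above: a plain drop delegates with purge = true.
    default boolean dropTable(String identifier) {
      return dropTable(identifier, true /* purge */);
    }
  }

  /** Records the purge flag the implementation actually receives. */
  static final class RecordingCatalog implements Catalog {
    final List<Boolean> purgeFlags = new ArrayList<>();

    @Override
    public boolean dropTable(String identifier, boolean purge) {
      purgeFlags.add(purge);
      return true;
    }
  }
}
```

Calling `dropTable("db.t")` and `dropTable("db.t", true)` on a `RecordingCatalog` records the same flag both times, so the receiving side cannot tell an explicit purge request from the default.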

To solve this, what about adding a config default property that can be sent back by the service? Then all you'd need to do is send a config to the client to tell it to purge tables itself because the service can't. Would that work for your case?
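A rough sketch of how a client might consume such a server-sent default. It assumes the server's config response supplies `defaults` applied before the client's own properties and `overrides` applied after them. The property name `client-side-purge-enabled` is hypothetical; neither the thread nor the spec defines it.

```java
import java.util.HashMap;
import java.util.Map;

final class PurgeConfigSketch {
  // Hypothetical property name, used only for illustration.
  static final String CLIENT_PURGE = "client-side-purge-enabled";

  /** Merges properties in the order: server defaults, client config, server overrides. */
  static Map<String, String> merge(
      Map<String, String> serverDefaults,
      Map<String, String> clientConfig,
      Map<String, String> serverOverrides) {
    Map<String, String> merged = new HashMap<>(serverDefaults);
    merged.putAll(clientConfig);
    merged.putAll(serverOverrides);
    return merged;
  }

  /** The client purges files itself only when the merged config says so. */
  static boolean clientShouldPurge(Map<String, String> merged) {
    return Boolean.parseBoolean(merged.getOrDefault(CLIENT_PURGE, "false"));
  }
}
```

With this shape, a server that cannot reach table storage would send the property as a default, and a server that keeps dropped tables around for restore would simply not send it.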

rdblue avatar Jun 19 '24 17:06 rdblue

Having a config to describe the server's capability sounds like a good idea. Although, I think this use case could be resolved in a different way.

> our catalog will keep tables around for a few days that can be restored in case of accidental deletes.

Can we distinguish immediate deletion from soft deletion (putting a table in a trash can) more explicitly? Users may need to be aware of the difference. The current solution seems ambiguous: users don't know whether the server actually deletes immediately, because that depends entirely on the implementation. This is not OK when users have to delete a table and its data immediately for compliance. I understand that the default dropTable(Identifier) purges. Does it make sense to introduce a new method for soft deletion so that users can invoke it explicitly?
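One way to picture the explicit split suggested here is the sketch below. Every name in it (`DroppableCatalog`, `purgeTable`, `softDropTable`, `restoreTable`, `InMemoryCatalog`) is hypothetical and does not exist in Iceberg; the in-memory maps only stand in for real catalog state.

```java
import java.util.HashMap;
import java.util.Map;

final class SoftDeleteSketch {
  /** Hypothetical catalog surface that separates the two meanings of "drop". */
  interface DroppableCatalog {
    boolean purgeTable(String identifier);    // remove now, together with the data
    boolean softDropTable(String identifier); // move to trash; still restorable
    boolean restoreTable(String identifier);  // bring a trashed table back
  }

  static final class InMemoryCatalog implements DroppableCatalog {
    final Map<String, String> tables = new HashMap<>(); // identifier -> metadata location
    final Map<String, String> trash = new HashMap<>();

    @Override
    public boolean purgeTable(String identifier) {
      // Purge removes the table whether it is live or already in the trash.
      boolean wasLive = tables.remove(identifier) != null;
      boolean wasTrashed = trash.remove(identifier) != null;
      return wasLive || wasTrashed;
    }

    @Override
    public boolean softDropTable(String identifier) {
      String location = tables.remove(identifier);
      if (location == null) {
        return false;
      }
      trash.put(identifier, location);
      return true;
    }

    @Override
    public boolean restoreTable(String identifier) {
      String location = trash.remove(identifier);
      if (location == null) {
        return false;
      }
      tables.put(identifier, location);
      return true;
    }
  }
}
```

With separate entry points, a caller with a compliance requirement picks the immediate path explicitly, without relying on how a given server interprets the purge flag.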

flyrain avatar Jul 01 '24 20:07 flyrain

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Dec 29 '24 00:12 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jan 12 '25 00:01 github-actions[bot]