io icon indicating copy to clipboard operation
io copied to clipboard

Support for VersionId in s3 file system

Open sebastianschramm opened this issue 4 years ago • 9 comments

  • TensorFlow version 2.4.1

Support VersionId argument in s3_file_system operations like reading (currently this is not supported by tensorflow even though it is supported by the underlying aws sdk).

I want to be able to load a model with a specific VersionId from a versioned AWS S3-bucket in tf-serving.

(I posted this issue originally here https://github.com/tensorflow/tensorflow/issues/48331#issue-851456710)

sebastianschramm avatar Apr 07 '21 19:04 sebastianschramm

cc @vnvo2409

yongtang avatar May 01 '21 22:05 yongtang

I am still thinking about it along with #1342. Since environment variable is the only option and reading them is quite expensive, we have to be careful not to affect our performance. I will try to publish a design doc for some improvements in reading / managing the envs for the filesystem soon.

vnghia avatar May 01 '21 23:05 vnghia

@yongtang @sebastianschramm Please take a look at this design doc ! Thank you !

vnghia avatar May 04 '21 01:05 vnghia

@yongtang Gentle ping

vnghia avatar May 18 '21 21:05 vnghia

@vnvo2409 Specifc to versionId', I am wondering if suffix of ?versionId=L4kqtJlcpXroDTDmpUMLUo` is OK? This will be in consistent with S3's Rest API (https://docs.aws.amazon.com/AmazonS3/latest/userguide/RetrievingObjectVersions.html)

yongtang avatar May 20 '21 16:05 yongtang

@yongtang I think it's OK. In addition, the aws-sdk-cpp has already supported it so we only need to call SetVersionId instead of adding the suffix ourselves.

vnghia avatar May 20 '21 16:05 vnghia

@vnvo2409 In case of VersionId a suffix in URL like ?versionId=L4kqtJlcpXroDTDmpUMLUo is enough as the VersionId is for object-level.

For file system level configurations (auth token in GCS (this is implemented in gcs-config). ), maybe we can just re-use already expose API like: https://github.com/tensorflow/tensorflow/commit/618730825cc1e2f57c62f71f09b5c63167227d05

I think this can solve the run-time configuration change issue?

yongtang avatar May 25 '21 01:05 yongtang

@yongtang

In case of VersionId a suffix in URL like ?versionId=L4kqtJlcpXroDTDmpUMLUo is enough as the VersionId is for object-level.

Sorry I misunderstood your previous comment. If you mean that we add the suffix to the url which is then passed into tensorflow like s3://bucket/object?versionId=id, I find this idea is really simple, more convenient and more extendable to other options than the envs. However, according to AWS Doc, object?versionId=id is a valid object name. Therefore, we won't be able to distinguish between the query params and the name. I am thinking of 2 solutions:

  • Have a separator between the name and the query params. Maybe something like s3://bucket/object\0?versionId=id. \0 is oddly enough that I think no one will use it for the object name.
  • Have an env to control where we parse the URL as an encoded URL or not. Maybe if it is on, Something like s3://bucket/object%2Fsub%2Fdir?versionId=id will become s3://bucket/object/sub/dir/ and query params. Otherwise, something like s3://bucket/object/sub/dir/?versionId=id will become s3://bucket/object/sub/dir/ and query params.

For file system level configurations (auth token in GCS (this is implemented in gcs-config). ), maybe we can just re-use already expose API like: tensorflow/tensorflow@6187308

I think this is also a better alternative for env. However, I couldn't see a way to call them from python. Am I missing something ?

vnghia avatar May 27 '21 01:05 vnghia

@vnvo2409 Ah I think the API set_filesystem_configuration has been exposed, though it is not wired up with FileSystem C++ API yet. It will need to be wired with FileSystem C++ API like other ops (e.g, create_dir, delete_file, etc.)

yongtang avatar May 28 '21 04:05 yongtang