io
io copied to clipboard
Support for VersionId in s3 file system
- TensorFlow version 2.4.1
Support VersionId argument in s3_file_system operations like reading (currently this is not supported by tensorflow even though it is supported by the underlying aws sdk).
I want to be able to load a model with a specific VersionId from a versioned AWS S3-bucket in tf-serving.
(I posted this issue originally here https://github.com/tensorflow/tensorflow/issues/48331#issue-851456710)
cc @vnvo2409
I am still thinking about it along with #1342. Since environment variable is the only option and reading them is quite expensive, we have to be careful not to affect our performance. I will try to publish a design doc for some improvements in reading / managing the envs for the filesystem soon.
@yongtang @sebastianschramm Please take a look at this design doc ! Thank you !
@yongtang Gentle ping
@vnvo2409 Specifc to versionId', I am wondering if suffix of ?versionId=L4kqtJlcpXroDTDmpUMLUo` is OK? This will be in consistent with S3's Rest API (https://docs.aws.amazon.com/AmazonS3/latest/userguide/RetrievingObjectVersions.html)
@yongtang I think it's OK. In addition, the aws-sdk-cpp has already supported it so we only need to call SetVersionId instead of adding the suffix ourselves.
@vnvo2409 In case of VersionId a suffix in URL like ?versionId=L4kqtJlcpXroDTDmpUMLUo is enough as the VersionId is for object-level.
For file system level configurations (auth token in GCS (this is implemented in gcs-config). ), maybe we can just re-use already expose API like: https://github.com/tensorflow/tensorflow/commit/618730825cc1e2f57c62f71f09b5c63167227d05
I think this can solve the run-time configuration change issue?
@yongtang
In case of VersionId a suffix in URL like
?versionId=L4kqtJlcpXroDTDmpUMLUois enough as the VersionId is for object-level.
Sorry I misunderstood your previous comment. If you mean that we add the suffix to the url which is then passed into tensorflow like s3://bucket/object?versionId=id, I find this idea is really simple, more convenient and more extendable to other options than the envs. However, according to AWS Doc, object?versionId=id is a valid object name. Therefore, we won't be able to distinguish between the query params and the name. I am thinking of 2 solutions:
- Have a separator between the name and the query params. Maybe something like
s3://bucket/object\0?versionId=id.\0is oddly enough that I think no one will use it for the object name. - Have an env to control where we parse the URL as an encoded URL or not. Maybe if it is on, Something like
s3://bucket/object%2Fsub%2Fdir?versionId=idwill becomes3://bucket/object/sub/dir/and query params. Otherwise, something likes3://bucket/object/sub/dir/?versionId=idwill becomes3://bucket/object/sub/dir/and query params.
For file system level configurations (auth token in GCS (this is implemented in gcs-config). ), maybe we can just re-use already expose API like: tensorflow/tensorflow@6187308
I think this is also a better alternative for env. However, I couldn't see a way to call them from python. Am I missing something ?
@vnvo2409 Ah I think the API set_filesystem_configuration has been exposed, though it is not wired up with FileSystem C++ API yet. It will need to be wired with FileSystem C++ API like other ops (e.g, create_dir, delete_file, etc.)