[#3379] feat(catalog-hadoop): Add S3 support for Fileset Hadoop catalog

Open xiaozcy opened this issue 1 year ago • 20 comments

What changes were proposed in this pull request?

Add S3 support for the Fileset Hadoop catalog. In fact, we only add the hadoop-aws dependency; most of the work is in the tests.

Why are the changes needed?

Fix: #3379

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Integration tests (ITs).

xiaozcy avatar Jul 22 '24 02:07 xiaozcy

@xiaozcy Please resolve the conflicts if you are free.

yuqi1129 avatar Jul 22 '24 07:07 yuqi1129

@FANNG1 can you please also take a look at this?

Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties. The reason is that such properties are essential to make a fileset on S3 work, so we'd better not hide them in the Hadoop site XML.

jerryshao avatar Jul 24 '24 03:07 jerryshao
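
For illustration of the property approach suggested above, catalog-level S3 settings might look something like the sketch below; the key names are hypothetical placeholders, not properties defined by this PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: S3 settings as catalog-level properties instead of hadoop site xml entries. */
public class CatalogS3PropertiesSketch {
  public static void main(String[] args) {
    // The key names below are hypothetical placeholders, not keys defined by this PR.
    Map<String, String> catalogProperties = new HashMap<>();
    catalogProperties.put("location", "s3a://my-bucket/warehouse");
    catalogProperties.put("s3-access-key-id", "<access-key>");
    catalogProperties.put("s3-secret-access-key", "<secret-key>");
    catalogProperties.put("s3-endpoint", "https://s3.us-west-2.amazonaws.com");
    catalogProperties.forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```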

@xiaozcy could you provide a document about how to make S3 work for the Fileset Hadoop catalog?

FANNG1 avatar Jul 24 '24 13:07 FANNG1

@xiaozcy can you please address the comments here?

jerryshao avatar Aug 01 '24 12:08 jerryshao

@jerryshao, sorry for the late reply. I have already upgraded the Hadoop version and done some abstraction of the IT, and I'm still working on managing some S3-related configurations in Gravitino.

xiaozcy avatar Aug 01 '24 13:08 xiaozcy

To make the fileset work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129

xiaozcy avatar Aug 05 '24 06:08 xiaozcy
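
For context, the fs.s3a.* keys mentioned above are standard hadoop-aws configuration keys. A minimal sketch of wiring them into a Hadoop Configuration follows, assuming hadoop-aws and the AWS SDK are on the classpath; the bucket, endpoint, and credentials are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Minimal sketch: hand the s3a credentials to a Hadoop Configuration and open an S3A FileSystem. */
public class S3aHadoopConfSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Standard hadoop-aws keys, regardless of whether Gravitino ultimately sources them
    // from a hadoop site xml file or from catalog/schema/fileset properties.
    conf.set("fs.s3a.access.key", "<access-key>");
    conf.set("fs.s3a.secret.key", "<secret-key>");
    conf.set("fs.s3a.endpoint", "s3.us-west-2.amazonaws.com");

    try (FileSystem fs = FileSystem.get(new URI("s3a://my-bucket/"), conf)) {
      fs.mkdirs(new Path("s3a://my-bucket/fileset/example"));
    }
  }
}
```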

@yuqi1129 what do you think?

jerryshao avatar Aug 05 '24 11:08 jerryshao

I think it would be best to add a flag to clearly indicate the type of authentication we will be using. The Gravitino fileset currently supports simple and Kerberos authentication. Once we set the type, we can verify that the required properties have been provided before initializing the corresponding SDK client.

https://github.com/apache/gravitino/blob/2d57f6308bbdf6f5f3a0fa7a4d751e08d04d293a/docs/hadoop-catalog.md?plain=1#L26-L48

yuqi1129 avatar Aug 05 '24 12:08 yuqi1129
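
A rough sketch of the check described above is shown below: the authentication-type flag decides which properties must be present before the corresponding client is initialized. The Kerberos keys follow the hadoop-catalog doc linked above; the "s3" type and its keys are hypothetical placeholders, not something this PR defines.

```java
import java.util.Map;

/** Sketch: validate required properties based on an authentication-type flag. */
public class AuthTypeValidationSketch {

  public static void validate(Map<String, String> properties) {
    String authType = properties.getOrDefault("authentication.type", "simple");
    switch (authType) {
      case "simple":
        // No extra credentials required.
        break;
      case "kerberos":
        require(properties, "authentication.kerberos.principal");
        require(properties, "authentication.kerberos.keytab-uri");
        break;
      case "s3":
        // Hypothetical type and key names, for illustration only.
        require(properties, "s3-access-key-id");
        require(properties, "s3-secret-access-key");
        break;
      default:
        throw new IllegalArgumentException("Unsupported authentication type: " + authType);
    }
  }

  private static void require(Map<String, String> properties, String key) {
    String value = properties.get(key);
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("Missing required property: " + key);
    }
  }
}
```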

@yuqi1129, could you help review this again?

xiaozcy avatar Aug 07 '24 02:08 xiaozcy

Sure.

yuqi1129 avatar Aug 07 '24 02:08 yuqi1129

https://github.com/apache/gravitino/blob/13f3b3f3118e2a7f8b63aa324e11c688159f4633/catalogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/SecureHadoopCatalogOperations.java#L92-L94

I wonder why the value passed for the Hadoop conf is null in places like this. This way, we cannot pass configurations at the schema/fileset level. Should we use hadoopCatalogOperations.getHadoopConf() instead?

xiaozcy avatar Aug 07 '24 03:08 xiaozcy

We can't pass the Hadoop Conf from the method signature, so I use null instead. If you need the value, please go ahead and make the necessary changes.

yuqi1129 avatar Aug 07 '24 09:08 yuqi1129

After thinking about it carefully, I'm not sure whether it is appropriate to support multi-level authentication for S3. We have to pass some configurations to the Hadoop conf when performing operations on a schema/fileset, but the hadoopConf in HadoopCatalogOperations is defined at the catalog level. If we want to modify it, we have to lock hadoopConf to avoid conflicts, which means we cannot perform some operations at the same time (like creating two different schemas in one catalog). Do you have any suggestions? @yuqi1129

xiaozcy avatar Aug 08 '24 07:08 xiaozcy
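
One way to sidestep the locking concern, sketched under the assumption that schema/fileset-level properties only need to overlay the catalog configuration for a single operation, is to build a per-operation copy of the catalog-level Configuration instead of mutating the shared instance:

```java
import java.util.Map;
import org.apache.hadoop.conf.Configuration;

/** Sketch: overlay per-schema/fileset properties on a copy of the catalog-level conf. */
public class PerOperationConfSketch {
  private final Configuration catalogConf;

  public PerOperationConfSketch(Configuration catalogConf) {
    this.catalogConf = catalogConf;
  }

  /** Returns a fresh Configuration for one operation; the shared catalogConf is never mutated. */
  public Configuration confFor(Map<String, String> schemaOrFilesetProperties) {
    Configuration conf = new Configuration(catalogConf); // copy constructor, no locking needed
    schemaOrFilesetProperties.forEach(conf::set);
    return conf;
  }
}
```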

We can start with catalog-level authentication first; when there is a clear need to support multi-level authentication for S3, we can revisit it.

yuqi1129 avatar Aug 08 '24 07:08 yuqi1129

Hi @xiaozcy thanks a lot for your work. I think most of us are debating how to make the properties meaningful. This is indeed a problem that should be thought through carefully, but I was thinking of a more fundamental one: how do we support different storages?

As we may continue to add more storage support, like ADLS, GCS, and OSS, adding all of them as dependencies of the Hadoop catalog will turn this catalog into a dependency hell. Meanwhile, for most companies, one or two storages should be enough.

So I was thinking of introducing a pluggable framework to support different cloud storages; users can configure the storage they want to make it work, with no need to package all of them together.

Can you guys please investigate a bit how other projects like Iceberg handle this problem and design an elegant framework? Basically, what I want is a pluggable framework for different storages: users can configure it to enable a storage, the storages can live in separate packages, and unless you enable one explicitly, it is not available automatically.

Meanwhile, I think that to support S3, we should also make it work in both the Java and Python GVFS implementations, otherwise it cannot be used end to end.

So @xiaozcy @yuqi1129 @FANNG1 can you guys please think about my questions and propose a better solution?

jerryshao avatar Aug 13 '24 15:08 jerryshao
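
One possible shape for such a pluggable framework (a sketch only, not the design adopted in this PR) is a provider interface shipped per storage module and discovered via Java's ServiceLoader:

```java
import java.io.IOException;
import java.util.Map;
import java.util.ServiceLoader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical SPI: each storage (s3, gcs, oss, ...) ships its implementation in its own module. */
interface StorageProviderSketch {
  /** Scheme this provider handles, e.g. "s3a". */
  String scheme();

  /** Builds a FileSystem from catalog properties; the property keys are up to the provider. */
  FileSystem getFileSystem(Path path, Map<String, String> properties, Configuration conf)
      throws IOException;
}

class StorageProviderLoader {
  /** Discovers providers declared under META-INF/services on the classpath. */
  static StorageProviderSketch load(String scheme) {
    for (StorageProviderSketch provider : ServiceLoader.load(StorageProviderSketch.class)) {
      if (provider.scheme().equals(scheme)) {
        return provider;
      }
    }
    throw new IllegalArgumentException("No storage provider found for scheme: " + scheme);
  }
}
```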

I agree that it's proper to provide a plugin mechanism, but this doesn't seem urgent or blocking. Iceberg is a library, which should not introduce too many dependencies to users, whereas Gravitino is a server that hides dependency issues from users; Polaris and UnityCatalog don't handle this either. IMO, it's more important to provide a unified storage configuration for all catalogs than to provide pluggable storage.

FANNG1 avatar Aug 14 '24 02:08 FANNG1

I agree with @FANNG1. Adding all of the dependencies doesn't seem to be a huge problem on the server side, and making users configure them to enable a storage may make it more complicated to use. I think this is something to consider when dealing with GVFS, which I will think about later. For this PR, I think we should focus on how to manage these configurations first.

xiaozcy avatar Aug 14 '24 02:08 xiaozcy

No, I would rather have a thorough solution beforehand. Supporting one cloud storage is easy, but when we add more, maintaining them will be a heavy burden, and that burden is one the community should take on, not the user. The solution should not only cover the server-side Fileset catalog support; the client-side GVFS should also be considered, and the unified configuration should be part of this solution as well.

So I would suggest we write a complete design doc about how to support multiple cloud storages that covers both the server- and client-side solutions. We can discuss based on the design doc; it will make us clearer and more thorough about how to support different storages well. Currently, all our discussion is around S3, but what if the current solution cannot fit ADLS or GCS? How do we handle that? Besides, whether a pluggable framework should be introduced or not should be decided based on thorough investigation and discussion, not just on whether it is urgent or blocking.

jerryshao avatar Aug 14 '24 03:08 jerryshao

Hi all, let me share some actual production cases. Since I'm mainly responsible for the integration between Gravitino Fileset and internal engines and platforms, I did encounter some real dependency conflict problems during the launch process.

  1. On the server side: our fileset catalog currently supports both HDFS and MiFS (an internal storage that supports various object storages). When we introduced MiFS, the biggest problem the server hit during integration testing was that the MiFS dependencies conflict with some Gravitino dependencies. We had to take one of two approaches:
     a. Contact the MiFS R&D team to shade these dependencies (for public cloud storage, I don't think this is feasible).
     b. Exclude the MiFS dependencies during integration testing, which means we cannot run MiFS integration tests at all.
  2. In the GVFS client: since it needs to be used in Spark, we tried adding the MiFS dependencies directly to GVFS, but they may still conflict with some Spark dependencies. We had to shade these dependencies in GVFS, and some dependencies simply cannot be shaded. The solution we finally adopted was to keep the MiFS dependencies independent of the GVFS client and have the MiFS developers shade them before introducing them separately in Spark.

What I want to say is that as the number of supported storages grows, dependency conflicts are inevitable, so on-demand loading may be a more reasonable approach. I still hope that one catalog can support multiple storage types, with the supported types determined by the maintainer.

xloya avatar Aug 15 '24 07:08 xloya

We could reuse the S3 properties defined in https://github.com/apache/gravitino/pull/4897/files#diff-7434c367b3597195902b8b064a5efe5810cb0cf5e7f55228044bfcc4cd9b2abd

FANNG1 avatar Sep 13 '24 07:09 FANNG1