Support Ceph Storage in Delta by adding a dependency
Hi, we have implemented a file system for Delta Lake on Ceph RGW through the OpenStack Swift API. The path scheme is ceph; the full path is ceph://<your cephrgw container>/<path to delta Table>. It works fine with S3SingleDriverLogStore as the LogStore. We talked about this before in #877, and we have now published this package to the Maven Central repository. Here are the SBT and Maven coordinates.
SBT:
libraryDependencies += "io.github.nanhu-lab" % "hadoop-cephrgw" % "1.0.1"
Maven:
<dependency>
  <groupId>io.github.nanhu-lab</groupId>
  <artifactId>hadoop-cephrgw</artifactId>
  <version>1.0.1</version>
</dependency>
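With the dependency on the classpath, usage would look roughly like the sketch below (Java API shown; spark-shell works the same way). The fs.ceph.* property keys come from the configuration shown later in this thread; the endpoint, container name, and credential values are placeholders, not documented values.

```java
// Sketch only: configure Spark to use the ceph:// scheme together with
// S3SingleDriverLogStore. The fs.ceph.* keys are taken from this thread;
// the URI, username, and password values are placeholders.
import org.apache.spark.sql.SparkSession;

public class CephDeltaExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("delta-on-ceph")
            .config("spark.hadoop.fs.ceph.impl",
                    "org.apache.hadoop.fs.ceph.rgw.CephStoreSystem")
            .config("spark.hadoop.fs.ceph.uri", "<your RGW swift endpoint>")
            .config("spark.hadoop.fs.ceph.username", "<swift user>")
            .config("spark.hadoop.fs.ceph.password", "<swift secret key>")
            .config("spark.delta.logStore.class",
                    "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
            .getOrCreate();

        // Write a small Delta table into a Ceph RGW container (placeholder path).
        spark.range(5).write().format("delta").save("ceph://SparkTest/delta-table");
    }
}
```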
In addition, could this be added to your official online documentation as a supported storage system?
Hi @zhangt-nhlab - thanks very much for your contributions. We're in the middle of restructuring the delta.io website, including the documentation. One of the things we were about to start is a "community packages" section that would include your storage system and others. If this makes sense, could you create an issue on github.com/delta-io/website (or I could, and link back to this issue) so we can take care of this? Thanks!
OK, we will create an issue on github.com/delta-io/website. @dennyglee
Hello @zhangt-nhlab, thanks for this great feature! It is very useful for sites that want to integrate Delta with the Ceph Storage Gateway in their pipelines. I was testing this implementation at our site but ran into some problems.
Following the instructions from here, I ran Spark interactively, but it was impossible to make it work:
$ export _JAVA_OPTIONS='-Djdk.tls.maxCertificackages'
$ spark-shell --packages io.delta:delta-core_2.12:1.1.0,io.github.nanhu-lab:hadoop-cephrgw:1.0.1,org.apache.spark:spark-hadoop-cloud_2.12:3.2.1 \
    --conf spark.hadoop.fs.ceph.username=tester \
    --conf spark.hadoop.fs.ceph.password=testXXXXX \
    --conf spark.hadoop.fs.ceph.uri=https://site:8080/swift/v1/ \
    --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
    --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
    --conf spark.hadoop.fs.ceph.impl=org.apache.hadoop.fs.ceph.rgw.CephStoreSystem
scala> spark.range(5).write.format("delta").save("ceph://SparkTest/delta-table")
2022-04-01 09:44:28,565 ERROR core.AbstractCommand: JOSS / HTTP GET call https://site:8080/swift/v1/, HTTP status 404, Error UNKNOWN
2022-04-01 09:44:28,565 ERROR core.AbstractCommand: * X-Auth-User=tester
2022-04-01 09:44:28,565 ERROR core.AbstractCommand: * X-Auth-Key=testXXXX
2022-04-01 09:44:28,567 WARN fs.FileSystem: Failed to initialize fileystem ceph://SparkTest/delta-table/_delta_log: Command exception, HTTP Status code: 404 => UNKNOWN
org.javaswift.joss.exception.CommandException
        at org.javaswift.joss.command.impl.core.httpstatus.HttpStatusChecker.verifyCode(HttpStatusChecker.java:45)
The same issue occurs if another format is used instead of Delta. What could be missing? Maybe something is missing in our configuration?
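One way to narrow this down is to reproduce the request outside Spark. The sketch below (plain Java 11+, not part of hadoop-cephrgw) builds a GET with the same X-Auth-User/X-Auth-Key headers that appear in the JOSS log above; sending it to the endpoint should show whether the 404 comes from the URL itself rather than from Spark or Delta. Endpoint and credentials are the placeholder values from this thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SwiftProbe {
    // Build a GET request carrying the same X-Auth headers JOSS logs above.
    public static HttpRequest probe(String endpoint, String user, String key) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("X-Auth-User", user)
                .header("X-Auth-Key", key)
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpRequest req = probe("https://site:8080/swift/v1/", "tester", "testXXXXX");
        // Uncomment to actually send it; a 404 here would mean the endpoint
        // path is wrong independently of Spark/Delta.
        // HttpResponse<String> resp = HttpClient.newHttpClient()
        //         .send(req, HttpResponse.BodyHandlers.ofString());
        // System.out.println(resp.statusCode());
        System.out.println(req.method() + " " + req.uri());
    }
}
```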
PS: Correct me if this is not the proper page to post this issue ;)
All the best, A.
Hello @aidaph, thank you for your reply. Is the client Ceph RGW? I ask because your spark.hadoop.fs.ceph.uri is different. If so, you need to register a Swift user after setting up Ceph RGW; after registration, a username and password will be generated.
The following is an example.
URI :"http://localhost:7480/auth/1.0"
swift user:
{
  "user_id": "testuser",
  "display_name": "First User",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
    { "id": "testuser:swift", "permissions": "full-control" }
  ],
  "keys": [
    {
      "user": "testuser",
      "access_key": "XUTBVKD9R6ELLF8FZR7R",
      "secret_key": "20uZMBiVHoS1Y9REr5slQHEQo1HTGHVnPfGDuziE"
    }
  ],
  "swift_keys": [
    {
      "user": "testuser:swift",
      "secret_key": "kijAijVYMJ7vxZxGDBWchQoRc4x3W077ZBt1gjWE"  // Use this user information
    }
  ],
  ...
}
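Assuming the fs.ceph.* properties shown earlier in this thread map onto these fields (an assumption on my part; the package documentation is not quoted here), the generated swift_keys entry would be wired in roughly like this:

```java
// Sketch: wire the generated swift subuser credentials into the fs.ceph.*
// properties used earlier in this thread (the mapping is assumed, and
// "spark" is an existing SparkSession).
spark.sparkContext().hadoopConfiguration().set("fs.ceph.uri", "http://localhost:7480/auth/1.0");
spark.sparkContext().hadoopConfiguration().set("fs.ceph.username", "testuser:swift");
spark.sparkContext().hadoopConfiguration().set("fs.ceph.password", "kijAijVYMJ7vxZxGDBWchQoRc4x3W077ZBt1gjWE");
```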
If this does not help solve the problem, please reply again. Thank you.
Thanks a lot for your hint @zhangt-nhlab! I think we've found what my problem is. In our current system, the Swift implementation is delegated to the OpenStack Identity service, so authentication for Ceph RGW is handled by OpenStack Identity (Keystone) rather than by the local auth from ceph-rgw that you suggested here.
I guess this Keystone auth part is missing from your hadoop-cephrgw package. Could I take a look at your code and try to contribute (if I figure out how, of course ;) )? Thanks again!!
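For reference, Keystone password authentication (OpenStack Identity API v3) expects a JSON body POSTed to /v3/auth/tokens, with the issued token returned in the X-Subject-Token response header. Below is a minimal sketch of that request body; it is not taken from the hadoop-cephrgw code (whose internals are not shown in this thread), and all names are placeholders.

```java
// Sketch: build the JSON body for a Keystone v3 password-auth request
// (POST /v3/auth/tokens). User, domain, and project names are placeholders.
public class KeystoneAuth {
    public static String authBody(String user, String password,
                                  String domain, String project) {
        return "{\n"
            + "  \"auth\": {\n"
            + "    \"identity\": {\n"
            + "      \"methods\": [\"password\"],\n"
            + "      \"password\": {\n"
            + "        \"user\": {\n"
            + "          \"name\": \"" + user + "\",\n"
            + "          \"domain\": { \"name\": \"" + domain + "\" },\n"
            + "          \"password\": \"" + password + "\"\n"
            + "        }\n"
            + "      }\n"
            + "    },\n"
            + "    \"scope\": {\n"
            + "      \"project\": {\n"
            + "        \"name\": \"" + project + "\",\n"
            + "        \"domain\": { \"name\": \"" + domain + "\" }\n"
            + "      }\n"
            + "    }\n"
            + "  }\n"
            + "}";
    }
}
```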
Hi @aidaph, thanks a lot for your attention to this issue! Here is our GitHub repository: https://github.com/nanhu-lab/hadoop-ceph
Please feel free to contact us if you have any questions, thanks!
Hello @zhangt-nhlab , thanks a lot for the answer. I've submitted a PR to your repo that includes a short implementation to get authenticated with the Keystone credentials.
Hello @aidaph, thanks a lot for your PR. We have received it and published a new version of hadoop-cephrgw to Maven. What should we do next so that more people can use hadoop-cephrgw if they want to store Delta tables on Ceph? Would it be possible to add content about Ceph to the integrations on the Delta Lake website and link to our GitHub repository? We are looking forward to your reply, thank you!
Maven URL : https://mvnrepository.com/artifact/io.github.nanhu-lab/hadoop-cephrgw/1.0.2
Github URL: https://github.com/nanhu-lab/hadoop-ceph
Hi @zhangt-nhlab - yes, we will be updating https://github.com/delta-io/website/issues/11 shortly to include Ceph. Could you message me via the Delta Users Slack at https://go.delta.io/slack (my Slack handle is dennyglee)? I may be able to get something in a little earlier while we wait for the documentation, but I wanted to check with you first. Thanks!