io icon indicating copy to clipboard operation
io copied to clipboard

Tensorflow-IO DAOS Plugin

Open omar-hmarzouk opened this issue 3 years ago • 17 comments

Integrating Intel's new DAOS file-system into tensorflow io's supported file-systems. Documentation for building instructions and an introduction to the DAOS environment can be found here and an example for the intergrated plugin in action can be tested in a DAOS-aware environment here

omar-hmarzouk avatar Jan 02 '22 16:01 omar-hmarzouk

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@omar-hmarzouk Thanks for the PR! Can you try rebase against the master branch and see if this can resolve the CI build issues?

yongtang avatar Jan 04 '22 04:01 yongtang

@omar-hmarzouk Looks like some of the build failed. Can you take a look and see if the failure can be resolved?

For lint, you can run bazel run //tools/lint:lint which will resolve most of the lint failures.

yongtang avatar Jan 05 '22 05:01 yongtang

@yongtang The build is mostly failing because it expects a running DAOS environment to build. You will find that the failures are due to missing DAOS dependencies. As for the lint failures, bazel run //tools/lint:lint passes locally and adds no changes to the files that fail here on the workflow?

Please advise on any further steps to take. You can also find docs for the added plugin and the DAOS environment here

omar-hmarzouk avatar Jan 10 '22 09:01 omar-hmarzouk

@omar-hmarzouk Sorry I have been occupied for the past several days and was not able to get to this. Let me take a look and see how to fix it.

yongtang avatar Jan 12 '22 18:01 yongtang

@yongtang Any updates? Let me know if I can help/support in any way

omaraziz255 avatar Jan 24 '22 14:01 omaraziz255

@omaraziz255 There are quite a few places that has to be changed. First of all, the daos header files has to be included in Bazel BUILD files:

diff --git a/third_party/daos.BUILD b/third_party/daos.BUILD
index 6baeaba9..dca2a5ab 100644
--- a/third_party/daos.BUILD
+++ b/third_party/daos.BUILD
@@ -2,6 +2,31 @@ package(default_visibility = ["//visibility:public"])
 
 cc_library(
     name = "daos",
+    hdrs = glob(
+        [
+            "src/include/**/*.h",
+        ],
+    ) + [
+        "src/include/daos_version.h",
+    ],
     copts = [],
     includes = ["src/include"],
+    deps = [
+        "@util_linux//:uuid",
+    ],
+)
+
+genrule(
+    name = "daos_version_h",
+    srcs = [
+        "src/include/daos_version.h.in",
+    ],
+    outs = [
+        "src/include/daos_version.h",
+    ],
+    cmd = ("sed " +
+           "-e 's/@TMPL_MAJOR@/2/g' " +
+           "-e 's/@TMPL_MINOR@/0/g' " +
+           "-e 's/@TMPL_FIX@/0/g' " +
+           "$< >$@"),
 )

With the above patch, the build can move a little longer.

yongtang avatar Jan 24 '22 22:01 yongtang

I am also looking into some other places and see how to fix those.

yongtang avatar Jan 24 '22 22:01 yongtang

One other thing I noticed is that the plugin use C++ API to implement the file system. Is there any chance to convert this to C API like HDFS? Using C++ API means it will be very much restrictive in versions of TF it can support (most of the time it can only support exactly one version of TF).

yongtang avatar Jan 24 '22 22:01 yongtang

Hello @yongtang. I have integrated the build patch you proposed and am currently looking into replacing any calls to the tensorflow C++ API. Feel free to let me know if there are any more suggested changes in the meantime.

omaraziz255 avatar Jan 30 '22 09:01 omaraziz255

@yongtang Any status updates? Let me know what are the next steps when you can

omaraziz255 avatar Feb 15 '22 13:02 omaraziz255

@omar-hmarzouk I have been working on trying to get the build works. One issue is that directly depending on -ldaos /etc will not work as this means users without daos in their system will not be able to run tensorflow-io (we don't want to force user to install daos while they may not use this feature).

There are two solutions:

  1. Build daos's client library into part of the package so that there will be no external dependencies.
  2. Split out daos file system as a separate .so file, and dynamically load the .so. In that case, if load of .so fails, then user at least can use other features.

I am trying to get 1) work https://github.com/yongtang/io/commit/28c62f224aeea736115197c173bb21d87a772d70

Think I am close though it may still find out all dependencies of daos itself (e.g., daos depends on libyaml, etc).

If 1) turns out to be too much of an issue, 2) is still possible.

yongtang avatar Feb 25 '22 15:02 yongtang

Hello @yongtang . Any updates on your work with the building process. Please let me know what's the status as of now.

omaraziz255 avatar Mar 23 '22 10:03 omaraziz255

Also, would Bazel's build environment support checking if daos libraries are available and only build them if they are? Maybe this conditional approach would be simpler?

omaraziz255 avatar Mar 31 '22 16:03 omaraziz255

@yongtang I have written a small stub cc_library called dfs_dummy which implements an empty function for each daos function used by the dfs plugin. The dfs cc_library has dfs_dummy listed in its deps field. Both are compiled with "linkstatic = False" so they both produce .so libraries. The dfs plugin now compiles cleanly without requiring daos libraries to be installed on the build node. I believe this is the #2 solution you previously mentioned above. My remaining problem is that bazel adds dfs_dummy.so into libtensorflow_io_plugins.so, so at runtime the daos libraries on an execution node do not get used. How do I use the dfs_dummy cc_library to satisfy build dependencies for the dfs cc_library, yet keep dfs_dummy.so from being included in libtensorflow_io_plugins.so?

Hmm, I guess this won't work. Instead of creating stub routines, I need to load the daos libraries at run time with dlopen(), and then instead of using a stub function, use dlsym() to look up the real daos function, and call that instead .

krehm avatar Jun 15 '22 00:06 krehm

@krehm @omar-hmarzouk Sorry for the long wait! It looks like there are some CI build failures. Can you apply the https://github.com/tensorflow/io/pull/1696/commits/96ad01d26db5c41b6356b911f72aef8f55c05ccf and https://github.com/tensorflow/io/pull/1696/commits/df361e51c33219a98e61e7f7b24fc1cf23ee205e to your PR? That will skip macOS and windows .

There are still some other failures but I will try to see if they can be fixed in the next several days.

yongtang avatar Jul 26 '22 01:07 yongtang

@yongtang Thanks for the update! I just applied the patches. Please let us know of any status updates

omar-hmarzouk avatar Jul 26 '22 14:07 omar-hmarzouk