Tensorflow-IO DAOS Plugin
Integrating Intel's new DAOS file system into tensorflow-io's supported file systems. Documentation with build instructions and an introduction to the DAOS environment can be found here, and an example of the integrated plugin in action can be tested in a DAOS-aware environment here.
@omar-hmarzouk Thanks for the PR! Can you try rebasing against the master branch and see if that resolves the CI build issues?
@omar-hmarzouk Looks like some of the builds failed. Can you take a look and see if the failures can be resolved?
For lint, you can run bazel run //tools/lint:lint, which will resolve most of the lint failures.
@yongtang The build is mostly failing because it expects a running DAOS environment; you will find that the failures are due to missing DAOS dependencies. As for the lint failures, bazel run //tools/lint:lint passes locally and makes no changes to the files that fail here in the workflow.
Please advise on any further steps to take. You can also find docs for the added plugin and the DAOS environment here.
@omar-hmarzouk Sorry I have been occupied for the past several days and was not able to get to this. Let me take a look and see how to fix it.
@yongtang Any updates? Let me know if I can help/support in any way
@omaraziz255 There are quite a few places that have to be changed. First of all, the daos header files have to be included in the Bazel BUILD files:
diff --git a/third_party/daos.BUILD b/third_party/daos.BUILD
index 6baeaba9..dca2a5ab 100644
--- a/third_party/daos.BUILD
+++ b/third_party/daos.BUILD
@@ -2,6 +2,31 @@ package(default_visibility = ["//visibility:public"])
 
 cc_library(
     name = "daos",
+    hdrs = glob(
+        [
+            "src/include/**/*.h",
+        ],
+    ) + [
+        "src/include/daos_version.h",
+    ],
     copts = [],
     includes = ["src/include"],
+    deps = [
+        "@util_linux//:uuid",
+    ],
+)
+
+genrule(
+    name = "daos_version_h",
+    srcs = [
+        "src/include/daos_version.h.in",
+    ],
+    outs = [
+        "src/include/daos_version.h",
+    ],
+    cmd = ("sed " +
+           "-e 's/@TMPL_MAJOR@/2/g' " +
+           "-e 's/@TMPL_MINOR@/0/g' " +
+           "-e 's/@TMPL_FIX@/0/g' " +
+           "$< >$@"),
 )
With the above patch, the build gets a little further.
I am also looking into some other places to see how to fix those.
One other thing I noticed is that the plugin uses the C++ API to implement the file system. Is there any chance to convert this to the C API, like HDFS? Using the C++ API means it will be very restrictive in the versions of TF it can support (most of the time it can only support exactly one version of TF).
Hello @yongtang. I have integrated the build patch you proposed and am currently looking into replacing any calls to the tensorflow C++ API. Feel free to let me know if there are any more suggested changes in the meantime.
@yongtang Any status updates? Let me know what the next steps are when you can.
@omar-hmarzouk I have been working on trying to get the build working. One issue is that directly depending on -ldaos etc. will not work, as this means users without daos on their system will not be able to run tensorflow-io (we don't want to force users to install daos when they may not use this feature).
There are two solutions:
- Build daos's client library as part of the package so that there are no external dependencies.
- Split out the daos file system as a separate .so file and dynamically load that .so. In that case, if loading the .so fails, the user can at least still use the other features.
I am trying to get 1) working: https://github.com/yongtang/io/commit/28c62f224aeea736115197c173bb21d87a772d70
I think I am close, though it may still need to pull in all of daos's own dependencies (e.g., daos depends on libyaml, etc.).
If 1) turns out to be too much of an issue, 2) is still possible.
Hello @yongtang. Any updates on your work on the build process? Please let me know what the status is as of now.
Also, would Bazel's build environment support checking whether the daos libraries are available and only building them if they are? Maybe this conditional approach would be simpler?
@yongtang I have written a small stub cc_library called dfs_dummy which implements an empty function for each daos function used by the dfs plugin. The dfs cc_library has dfs_dummy listed in its deps field. Both are compiled with "linkstatic = False" so they both produce .so libraries. The dfs plugin now compiles cleanly without requiring daos libraries to be installed on the build node. I believe this is the #2 solution you previously mentioned above. My remaining problem is that bazel adds dfs_dummy.so into libtensorflow_io_plugins.so, so at runtime the daos libraries on an execution node do not get used. How do I use the dfs_dummy cc_library to satisfy build dependencies for the dfs cc_library, yet keep dfs_dummy.so from being included in libtensorflow_io_plugins.so?
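A minimal sketch of the stub idea described above, assuming C linkage and purely placeholder parameter lists (the real dfs_dummy sources in the PR may look different):

// dfs_dummy.cc (illustrative sketch only): empty stand-ins whose sole job
// is to satisfy the linker on build nodes that do not have DAOS installed.
// Symbols are matched by name at link time, so the parameter lists below
// are placeholders, not the actual DAOS prototypes.
#include <cerrno>

extern "C" {
// Each stub just reports "function not implemented" if it is ever reached.
int dfs_mount() { return -ENOSYS; }
int dfs_umount() { return -ENOSYS; }
}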
Hmm, I guess this won't work. Instead of creating stub routines, I need to load the daos libraries at run time with dlopen(), and then use dlsym() to look up the real daos function and call that instead.
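For reference, a minimal sketch of that dlopen()/dlsym() pattern, assuming the client library is installed as libdaos.so and using daos_init() (which takes no arguments) as the symbol to resolve; error handling and the actual DFS entry points are simplified away:

// dlopen/dlsym sketch (illustrative only; link with -ldl):
// resolve DAOS at run time so the build never needs -ldaos,
// and fail gracefully when the library is absent.
#include <dlfcn.h>
#include <cstdio>

int main() {
  // If the shared library is missing, keep going without DAOS support
  // instead of failing to load the whole tensorflow-io plugin.
  void* handle = dlopen("libdaos.so", RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    std::fprintf(stderr, "DAOS not available: %s\n", dlerror());
    return 0;
  }

  // Look up the real symbol by name and call through a function pointer,
  // replacing the stub-function approach described earlier.
  using daos_init_fn = int (*)(void);
  auto daos_init_ptr =
      reinterpret_cast<daos_init_fn>(dlsym(handle, "daos_init"));
  if (daos_init_ptr != nullptr) {
    int rc = daos_init_ptr();  // daos_init() returns 0 on success
    std::fprintf(stderr, "daos_init rc=%d\n", rc);
  }

  dlclose(handle);
  return 0;
}

In the actual plugin the same lookup would presumably be done once per DFS entry point and cached, but the mechanism is the same.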
@krehm @omar-hmarzouk Sorry for the long wait! It looks like there are some CI build failures. Can you apply the commits https://github.com/tensorflow/io/pull/1696/commits/96ad01d26db5c41b6356b911f72aef8f55c05ccf and https://github.com/tensorflow/io/pull/1696/commits/df361e51c33219a98e61e7f7b24fc1cf23ee205e to your PR? That will skip the macOS and Windows builds.
There are still some other failures but I will try to see if they can be fixed in the next several days.
@yongtang Thanks for the update! I just applied the patches. Please let us know of any status updates