
Docker Container for Support between tf-io and tf-serving.

Ouwen opened this issue 6 years ago · 11 comments

Can there be an official docker build or documentation to integrate tensorflow-io with tensorflow-serving? Using the tf.io ops in the tf.serving ecosystem would be a large development convenience and likely decrease inference latency.

Ouwen avatar Aug 08 '19 19:08 Ouwen

@ewilderj

Ouwen avatar Aug 08 '19 19:08 Ouwen

cc @tensorflow/serving-members @rcrowe-google

ewilderj avatar Aug 08 '19 19:08 ewilderj

Hi all, just a note that I'm following up on this in the sister issue. From the tf-serving perspective, we are extremely conscious about op robustness and supportability. What are tf.io's policies regarding backward/forward compatibility, unit testing, support SLOs, ownership, etc.?

peddybeats avatar Aug 09 '19 22:08 peddybeats

Thanks @unclepeddy @Ouwen @ewilderj. TFIO consists of C++ code, and we have been very careful about binary compatibility with TensorFlow (core)'s released binaries. Let me briefly cover building and testing, plus some miscellaneous points about tensorflow-io.

  1. Building of tensorflow-io:

In order to make sure any system that can run tensorflow can equally run tensorflow-io, we have been building tensorflow-io on the same systems as tensorflow's main repo. This is quite challenging, as tensorflow is built on old systems to maximize the number of supported platforms.

Here are the platforms we use to build tensorflow-io on Linux (matching the TensorFlow core repo):

| Python | TF 1.x compatible tensorflow-io | TF 2.0 compatible tensorflow-io |
| --- | --- | --- |
| 2.7 | Ubuntu 14.04 + GCC 4.8.2 | Ubuntu 16.04 + Dev Toolset 7 (GCC 7.3) |
| 3.4 | Ubuntu 14.04 + GCC 4.8.2 | N/A |
| 3.5 | Ubuntu 14.04 + GCC 4.8.2 | Ubuntu 16.04 + Dev Toolset 7 (GCC 7.3) |
| 3.6 | Ubuntu 14.04 + GCC 4.8.2 | Ubuntu 16.04 + Dev Toolset 7 (GCC 7.3) |
| 3.7 | Ubuntu 16.04 + GCC 5.3.1 | Ubuntu 16.04 + Dev Toolset 7 (GCC 7.3) |

Note:

  • Ubuntu 14.04 + GCC 4.8.2 and Ubuntu 16.04 + GCC 5.3.1 are the system defaults.
  • Ubuntu 16.04 + Dev Toolset 7 (GCC 7.3) builds are linked against the old GLIBC library to match tensorflow's setting.

  2. Testing platforms for tensorflow-io:

While tensorflow-io packages are built on old systems, which in theory should also run on newer systems, we want to make sure newer systems are properly covered. For that reason, testing is done on Ubuntu 16.04 and 18.04, two of the most widely used Linux distributions.

All of our tests run multiple times across different systems and Python versions:

| Python | Ubuntu 16.04 | Ubuntu 18.04 |
| --- | --- | --- |
| 2.7 | :heavy_check_mark: | :heavy_check_mark: |
| 3.5 | :heavy_check_mark: | N/A |
| 3.6 | N/A | :heavy_check_mark: |
| 3.7 | N/A | :heavy_check_mark: |
  3. Testing contents of tensorflow-io:

The tensorflow-io package mostly consists of custom kernel ops and tf.data.Dataset implementations. Our tests make sure that:

  • custom kernel ops (e.g., decode_webp) work in eager and non-eager mode with standard Tensors
  • our tf.data.Dataset implementations work in eager and non-eager mode with direct tf.keras integration.
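To give a rough illustration of what "works in eager and non-eager mode" means in practice, a test can run the same op eagerly and inside a `tf.function`-traced graph, then compare results. This is a sketch only, assuming TensorFlow 2.x is installed; it uses `tf.strings.length` as a stand-in for a custom op such as `decode_webp`:

```python
import tensorflow as tf

def check_both_modes(op, *args):
    """Run `op` eagerly and as a traced graph function, compare results."""
    eager_result = op(*args)               # eager mode: executes immediately
    graph_result = tf.function(op)(*args)  # non-eager: traced into a graph first
    return bool(tf.reduce_all(tf.equal(eager_result, graph_result)))

# Stand-in for a custom kernel op such as decode_webp
same = check_both_modes(tf.strings.length, tf.constant(["webp", "image"]))
print(same)  # True
```

The same helper pattern can wrap any op whose eager and graph behavior a test suite needs to keep in sync.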
  4. Integration testing with external systems:

The tensorflow-io package contains many ops that are cloud-vendor specific or that integrate with other systems. Testing is done against either live systems or emulators during Google's Kokoro CI run.

Note:

  • Live system means we set up a real server inside the virtual machine (e.g., Kafka/Ignite).
  • Emulator means we set up an emulator inside the virtual machine.
  • Offline means the test will not run in Google Kokoro CI.
| Dataset | Live System | Emulator | Google Kokoro CI | Offline |
| --- | --- | --- | --- | --- |
| Apache Kafka | :heavy_check_mark: | | :heavy_check_mark: | |
| Apache Ignite | :heavy_check_mark: | | :heavy_check_mark: | |
| Prometheus | :heavy_check_mark: | | :heavy_check_mark: | |
| Google PubSub | | :heavy_check_mark: | :heavy_check_mark: | |
| Azure File System | | :heavy_check_mark: | :heavy_check_mark: | |
| AWS Kinesis | | :heavy_check_mark: | :heavy_check_mark: | |
| AlibabaCloud OSS | | | | :heavy_check_mark: |
| Google BigTable/BigQuery | To be added | | | |
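For readers curious how CI-run versus offline tests are typically gated, one common pattern (a sketch of the general technique, not necessarily what tensorflow-io's suite does; the `KAFKA_BOOTSTRAP` variable name is hypothetical) is to skip an integration test unless the live endpoint is configured via an environment variable:

```python
import os
import unittest

class KafkaDatasetTest(unittest.TestCase):
    # Hypothetical gate: only run when a live broker address is configured
    @unittest.skipUnless("KAFKA_BOOTSTRAP" in os.environ,
                         "no live Kafka broker configured")
    def test_read_from_topic(self):
        # ...would connect to os.environ["KAFKA_BOOTSTRAP"] and read records
        pass

# Without KAFKA_BOOTSTRAP set, the test is reported as skipped, not failed
suite = unittest.defaultTestLoader.loadTestsFromTestCase(KafkaDatasetTest)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped))
```

This keeps the full suite green on machines without the external dependency while still exercising the integration path in environments (like CI with a live server or emulator) where it is available.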
  5. API compatibility:

This is a little challenging at the moment, as TF 1.x and 2.0 are fundamentally different even in the upstream core tensorflow repo. For example, in tf.data.Dataset, there is a big API difference between TF 1.x and 2.0:

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# In 1.x:
iterator = dataset.make_one_shot_iterator()
next_ele = iterator.get_next()
with tf.Session() as sess:
  try:
    while True:
      val = sess.run(next_ele)
      print(val)
  except tf.errors.OutOfRangeError:
    pass

# In 2.0:
for val in dataset:
  print(val)
```

There are also some subtle differences in the exposed python methods. For example, in 1.x construction of a Dataset is done by passing output_types and output_shapes. In 2.0, construction of a Dataset is done by passing element_structure (soon to be changed to element_spec).
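To make the construction-argument difference concrete, here is a small sketch using `tf.data.Dataset.from_generator`, which accepts both the 1.x-style `output_types`/`output_shapes` arguments and the 2.x-style `output_signature` (a structure of `tf.TensorSpec`, i.e. the element spec). This assumes TensorFlow 2.4 or later is installed:

```python
import tensorflow as tf

def gen():
    yield from [1, 2, 3]

# 1.x style: describe elements with output_types (and optionally output_shapes)
ds_v1_style = tf.data.Dataset.from_generator(
    gen, output_types=tf.int32, output_shapes=())

# 2.x style: describe elements with a single output_signature
ds_v2_style = tf.data.Dataset.from_generator(
    gen, output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))

print([int(v) for v in ds_v2_style])  # [1, 2, 3]
print(ds_v2_style.element_spec)
```

Both datasets yield the same elements; the difference is purely in how the element structure is declared at construction time.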

We are still trying to strike the right balance on compatibility. @terrytangyuan @BryanCutler @dmitrievanthony may offer additional insights. Below are the guarantees I am thinking of:

  1. All primitive/custom kernel ops (in C++) will be supported in eager and non-eager mode.
  2. All Dataset implementations will support tf.keras integration in both eager and non-eager mode.
  3. Top-level APIs will always be backward-compatible once 2.0 is released (in eager mode only).

Let me know if there are any questions, and /cc @tensorflow/sig-io-maintainers

yongtang avatar Aug 10 '19 01:08 yongtang

Please pardon the cross-post, but, as indicated in https://github.com/tensorflow/serving/issues/1411 we are keenly interested in integrating tf-io (specifically the image operators for webp support) into our build of TFS. I don't have a lot of experience working with bazel, but I think the main difficulty is getting two dockerized builds to integrate with each other. The closest I've gotten is publishing a .a file and attempting to link that into TFS, but we still get the error about a missing image operator.

If there is someone with tf-io build expertise who can spare some time solving this puzzle and share their solution it would be much appreciated.

Cheers.

tinder-michaelallman avatar Aug 23 '19 19:08 tinder-michaelallman

@yongtang I love that great build-and-test explanation above; would it make sense to excerpt it as a standalone doc to put in the repo somewhere?

ewilderj avatar Aug 23 '19 19:08 ewilderj

@tinder-michaelallman @ewilderj Thanks for the work! I updated the README.md in PR #447 to include additional information about building, testing, and CI integration.

@tinder-michaelallman I don't know enough about how tf-serving builds, though I think I have enough experience with tensorflow's core repo bazel build (as well as tensorflow-io's repo build) that I could help with build issues in tf-serving if you can point me to the build script.

On the tensorflow-io side, because of recent changes in the tensorflow core repo (due to the manylinux2010 requirement), it is not exactly straightforward now. But if you have docker installed, then the following command:

```sh
bash -x -e .travis/python.release.sh
```

will generate four whl packages in the wheelhouse directory with python 2.7, 3.5, 3.6, and 3.7 support (all manylinux2010 compatible).

If you just want to build everything inside bazel, then the build command could be different depending on whether you need manylinux2010 or not. But I might be able to help with that as well.

One note about the missing operator: it might be that the .so file is not copied into the python path. In tensorflow-io we use bazel for the C++ build, but we decided to just package python with setup.py (as bazel's python build support is not very intuitive).

(I understand starlark in bazel is a dialect of python but that is a separate discussion.)
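On the "missing image operator" error specifically: one quick sanity check is to confirm the symbol you expect is actually exported by the shared library being linked or loaded. The snippet below is a generic illustration using the system math library as a stand-in for a tensorflow-io .so; for the real check you would point `CDLL` at the tensorflow-io shared object and look up the relevant symbol (the `decode_webp_v9` name here is made up for the demonstration):

```python
import ctypes
import ctypes.util

# Stand-in: load the C math library instead of a tensorflow-io .so
lib = ctypes.CDLL(ctypes.util.find_library("m"))

def has_symbol(library, name):
    """Return True if `name` resolves in the loaded shared library."""
    try:
        getattr(library, name)
        return True
    except AttributeError:
        return False

print(has_symbol(lib, "cos"))             # True: symbol is exported
print(has_symbol(lib, "decode_webp_v9"))  # False: symbol is missing
```

If the symbol is missing from the library you linked, the problem is on the build/link side; if it is present but the op is still unresolved at runtime, the library is likely not being loaded at all.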

I might be able to help if you could describe the exact steps that reproduce the issue.

yongtang avatar Aug 24 '19 19:08 yongtang

> @tinder-michaelallman I don't know enough about how tf-serving builds, though I think I have enough experience with tensorflow's core repo bazel build (as well as tensorflow-io's repo build) that I could help with build issues in tf-serving if you can point me to the build script.

Hi @yongtang. Thanks so much for your offer to help. I had to set aside my efforts, and I can't remember now exactly where I got stuck. But I feel confident we can make progress together.

Let me get back to you later in the week once I've had the opportunity to work on this integration again.

Cheers.

tinder-michaelallman avatar Aug 26 '19 20:08 tinder-michaelallman

Hi @yongtang. I want to follow up. We have switched to using the Python PIL library for image processing. We took a performance hit moving to PIL, but it's not so substantial. Integrating webp support into TFS is not a priority for us right now, and I haven't had time to go back to attempting the integration. I'll let you know if we decide to have another go at it.

Thank you.

tinder-michaelallman avatar Sep 11 '19 22:09 tinder-michaelallman

Any updates on this?

lminer avatar Jan 15 '21 19:01 lminer

Hi, I want to share my case here too.

I tried to build the TensorFlow serving with the S3 module in TF IO.

https://github.com/tensorflow/serving/issues/1963#issuecomment-1385792395

jeongukjae avatar Jan 17 '23 17:01 jeongukjae