How to pass custom component to DataflowRunner?

Open sadeel opened this issue 4 years ago • 5 comments

I am using TFX on Kubeflow. I have written a custom component that does some work. I want that library (defining the custom_component) to be available on the Dataflow runner. Is there an example of how to do it?

Right now, it is complaining that my "custom_component" does not exist.

sadeel avatar Mar 09 '20 19:03 sadeel

@sadeel can you use the TFX CLI [1] to package your custom component? See for example the part about building the image with skaffold mentioned in [2]. See also the custom component docs in [3]

[1] https://github.com/tensorflow/tfx/blob/master/docs/guide/cli.md [2] https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template_beam.ipynb [3] https://github.com/tensorflow/tfx/blob/master/tfx/examples/custom_components/slack/README.md#compile-the-pipeline-gcp

neuromage avatar Mar 09 '20 20:03 neuromage

I've done that, and it works well when running just in KFP. However, inside KFP I need to start a Dataflow job, and the Dataflow job needs to be aware of my custom component. I haven't figured out a good way to do that.

sadeel avatar Mar 09 '20 20:03 sadeel

Got it, this is a missing feature right now. We'll take a look at fixing this.

neuromage avatar Mar 09 '20 21:03 neuromage

Not sure if this issue is still relevant, but the TFX docs contain updated information on how to provide multiple dependencies to Dataflow. Two options:

  • Package your code as a tarball or via setup.py (the package needs to contain the TFX code too, see note below)
  • Build a custom image to be used by Dataflow's workers

If you provide your own packages, they will overwrite the TFX package that would otherwise be sent to Dataflow (see https://github.com/tensorflow/tfx/blob/master/tfx/utils/dependency_utils.py#L63).
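For the first option, here is a minimal sketch of the Beam flags involved, assuming the custom component lives in a local package with its own setup.py. `--setup_file` and `--extra_package` are standard Beam Python pipeline options; the helper name `dataflow_package_args` and all project, bucket, and path values are hypothetical placeholders. The resulting list would typically be passed as `beam_pipeline_args` when constructing the TFX pipeline.

```python
from typing import List, Optional, Sequence


def dataflow_package_args(
    project: str,
    region: str,
    temp_location: str,
    setup_file: Optional[str] = None,
    extra_packages: Sequence[str] = (),
) -> List[str]:
    """Builds Beam args that ship local Python packages to Dataflow workers.

    Hypothetical helper for illustration; only the flag names come from Beam.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
    ]
    if setup_file:
        # Beam builds a source distribution from this setup.py and stages it.
        args.append(f"--setup_file={setup_file}")
    for package in extra_packages:
        # One flag per locally built tarball (e.g. from `python setup.py sdist`).
        args.append(f"--extra_package={package}")
    return args


# Placeholder values; replace with your own project, bucket, and paths.
example_args = dataflow_package_args(
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    setup_file="./setup.py",
)
```

Per the note above, once you supply your own packages TFX no longer adds its own, so the package you ship (or its declared dependencies) has to bring TFX along as well.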

I couldn't get multiple dependencies to work with a tarball or setup.py (we have internal dependencies which aren't publicly available via PyPI), but the Docker image worked perfectly.
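For the Docker route, a rough sketch of the Beam arguments, assuming an image that already has TFX, the custom component, and any internal dependencies installed. `--sdk_container_image` is the flag in recent Beam releases (older releases used `--worker_harness_container_image`), and `--experiments=use_runner_v2` may or may not be required depending on the SDK version. The helper name `dataflow_container_args` and the image, project, and bucket values are hypothetical.

```python
from typing import List


def dataflow_container_args(
    project: str,
    region: str,
    temp_location: str,
    image: str,
    use_runner_v2: bool = True,
) -> List[str]:
    """Builds Beam args that make Dataflow workers run a custom container.

    Hypothetical helper for illustration; only the flag names come from Beam.
    """
    args = [
        "--runner=DataflowRunner",
        f"--project={project}",
        f"--region={region}",
        f"--temp_location={temp_location}",
        # Image must be pullable by the Dataflow workers.
        f"--sdk_container_image={image}",
    ]
    if use_runner_v2:
        # Assumption: needed on older SDKs for custom containers.
        args.append("--experiments=use_runner_v2")
    return args


# Placeholder values; replace with your own project, bucket, and image.
example_args = dataflow_container_args(
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    image="gcr.io/my-gcp-project/my-tfx-image:latest",
)
```

As with package flags, this list would typically go into `beam_pipeline_args` of the TFX pipeline. Using the same image for the KFP steps and the Dataflow workers keeps the two environments in sync.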

Further references:

  • https://www.tensorflow.org/tfx/guide/beam
  • https://cloud.google.com/dataflow/docs/guides/using-custom-containers#docker
  • https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies
  • https://github.com/tensorflow/tfx/issues/3994#issuecomment-873554029

Big thanks to @wizjo for deep diving into this issue!

hanneshapke avatar Oct 28 '21 15:10 hanneshapke

@sadeel As mentioned above, those steps work. Just make sure you are using the Docker image. Please go ahead and close the issue as it has been resolved. Thanks!

gowthamkpr avatar Aug 12 '22 16:08 gowthamkpr

Closing this issue as it has been stale for 2 weeks. Please post an update if it is still relevant, and we will reopen it. Thanks!!

gowthamkpr avatar Aug 26 '22 17:08 gowthamkpr