[Plugin] TypeTransformer for TensorFlow tf.data.Dataset
Motivation: Why do you think this is important?
The `tf.data.Dataset` object encapsulates data as well as a preprocessing pipeline. It can be passed to the model `fit`, `predict`, and `evaluate` methods. It is widely used in TensorFlow tutorials and documentation and is considered a best practice for building input pipelines that keep GPU resources saturated.
Goal: What should the final outcome look like, ideally?
Flyte tasks should be able to accept `tf.data.Dataset` objects as parameters and return them as outputs.
Describe alternatives you've considered
There are caveats to passing `tf.data.Dataset` objects between tasks. Because a `tf.data.Dataset` can include pipeline steps that call local Python functions (e.g., a `map` or `filter` step), there doesn't seem to be a way to serialize the object without effectively "computing" the pipeline. At times this could be beneficial (running an expensive preprocessing pipeline once can free up the CPU during training), but it could also confuse Flyte end users.
So while adding a type transformer for `tf.data.Dataset` is certainly possible, it's an open question whether Flyte should support it at all given these caveats. The alternative to consider here is to not support `tf.data.Dataset`. This seems like a question for the core Flyte team.
Propose: Link/Inline OR Additional context
There are at least three main ways to serialize/deserialize `tf.data.Dataset` objects, probably in order of least to most complex. Determining which method of serialization/deserialization to use is an open question.
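For illustration, one of the simpler candidates is tf.data's built-in save/load (instance/static methods on `tf.data.Dataset` in recent TF releases; `tf.data.experimental.save`/`load` in older ones). Note that it demonstrates the caveat above: saving walks the entire pipeline, so any `map`/`filter` steps are "computed" as a side effect, and the restored dataset replays stored elements rather than the original graph. A minimal sketch:

```python
import tempfile

import tensorflow as tf

# A dataset whose pipeline includes a local Python function (a `map` step).
ds = tf.data.Dataset.range(5).map(lambda x: x * 2)

# Saving iterates the full pipeline and writes the *elements* to disk,
# so the map step runs here as a side effect of serialization.
path = tempfile.mkdtemp()
ds.save(path)  # tf.data.experimental.save(ds, path) on older TF

# The restored dataset replays the stored elements; the original
# map/filter graph is gone.
restored = tf.data.Dataset.load(path)
print(list(restored.as_numpy_iterator()))  # [0, 2, 4, 6, 8]
```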
Some additional links:
- This TensorFlow GitHub issue asks about ways to serialize/deserialize a `tf.data.Dataset` as a deep copy without the side effect of "computing" the pipeline.
- I asked a similar question on the TensorFlow Forum.
Are you sure this issue hasn't been raised already?
- [X] Yes
Have you read the Code of Conduct?
- [X] Yes