Add a helper model to automatically apply preprocessing
This is a cool idea, very similar to Hugging Face pipelines. Is there any efficiency loss in taking this "lazy" approach to tokenization? Are all the optimizations described in the tfds guide still possible (doc)?
HF docs warn users to expect a performance hit for batched data when there is high variability in sequence lengths (doc).
Looking at tf.data's summary of performance optimizations, I would say largely yes (to preserving optimizations), with a few caveats:
- We are not loading the data, so sharding and interleaving on the loading side are still up to the user.
- We are not caching the data in memory, as that is not a good general-purpose approach (it might not fit in memory, and there might be random augmentations). We could consider adding an argument for it, though I would be inclined not to.
- Some people may want to preprocess in a separate job entirely, which will often be the fastest option, at the cost of an increased usability burden and an inability to do things like generate random masks on the fly. This is clearly out of scope for a helper like this, though it is still fully supported with `include_preprocessing=False` (see the sketch after this list).
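For concreteness, here is a minimal sketch of that opt-out path. `MyPreprocessor`, `MyTaskModel`, `raw_texts`, and `labels` are hypothetical placeholders; only the proposed `include_preprocessing` flag comes from this design, and the exact way it is wired in is an assumption.

```python
import tensorflow as tf

# Hypothetical names for illustration only.
preprocessor = MyPreprocessor(sequence_length=128)
model = MyTaskModel()

# The user owns loading, sharding and interleaving (first caveat above).
ds = tf.data.Dataset.from_tensor_slices((raw_texts, labels))
ds = (
    ds.map(lambda x, y: (preprocessor(x), y),
           num_parallel_calls=tf.data.AUTOTUNE)  # preprocessing runs inside tf.data
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
# ds = ds.cache()  # second caveat: only if the data fits in memory and there
#                  # is no random augmentation.

# Direct calls then skip the built-in preprocessing, per the proposed flag:
for x, y in ds.take(1):
    outputs = model(x, include_preprocessing=False)
```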
I would actually expect this preprocessing to do better than the HF approach (unless they are now leveraging tf.data under the hood?), but we would need to validate this.
The question of whether preprocessing is better applied batched or unbatched is interesting (see the sketch below for the two orderings). But I don't think we need to be everything to everyone with this, as preprocessing can be toggled on and off. If it were always better to preprocess unbatched rather than batched, this design could be worrisome, but I don't believe that is the case.
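Purely to illustrate the distinction (not a recommendation), the two orderings look like this in tf.data, assuming a hypothetical `raw_ds` of raw text and a `preprocessor` layer that handles both single examples and batches:

```python
import tensorflow as tf

# Unbatched: preprocess one example at a time, then batch the results.
per_example = raw_ds.map(
    preprocessor, num_parallel_calls=tf.data.AUTOTUNE
).batch(32)

# Batched: batch raw examples first, then preprocess a whole batch per map call.
per_batch = raw_ds.batch(32).map(
    preprocessor, num_parallel_calls=tf.data.AUTOTUNE
)
```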
tl;dr: for most users this is a speedup, where we correctly apply data preprocessing tricks they aren't aware of. For power users for whom we aren't doing things exactly as they want, just pass `include_preprocessing=False`.
Thought experiment: What if our users need to override `train_step`, `fit`, etc. for their own use case? Do they have to copy our boilerplate first? Could we provide a `train_step_override` method to be called by our own override? Or is this class not for advanced use cases?
With things as currently written, they could write any `train_step`/`test_step`/`predict_step` as normal, but would need to remember to pass `include_preprocessing=False` when calling the model. This boilerplate is line for line the same as upstream Keras modulo that change.
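As I read the current proposal, a custom `train_step` would look like the standard Keras example with that single change; data reaching `train_step` has already been preprocessed by the `fit()` input pipeline, so the inner call opts out. `MyTaskModel` is a hypothetical subclass of the proposed helper model, and `include_preprocessing` is the proposed call flag, not an existing Keras argument:

```python
import tensorflow as tf

class MyCustomModel(MyTaskModel):
    def train_step(self, data):
        # Assumes `data` is an (inputs, labels) tuple, as in the Keras docs example.
        x, y = data
        with tf.GradientTape() as tape:
            # Inputs here have already gone through the fit() preprocessing
            # pipeline, so skip preprocessing in the inner call (the one
            # change versus upstream Keras boilerplate).
            y_pred = self(x, training=True, include_preprocessing=False)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
```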
We could also consider never preprocessing in `model.call`. This might feel more inconsistent to a user, but would certainly cut down a lot of code here. And maybe the fact that preprocessing is only applied in `fit`/`predict`/`evaluate` would actually be a simpler mental model? Worth discussing!
After discussion, we decided to go with the more minimal version that does not override `call`.
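A rough sketch of the resulting usage contract, as I understand the decision; `MyTaskModel` and `preprocessor` are hypothetical names, and the point is only where preprocessing does and does not happen:

```python
model = MyTaskModel()  # hypothetical task model built with the minimal design

# fit / evaluate / predict accept raw inputs and apply preprocessing for you:
model.fit(raw_train_ds)
model.predict(["some raw text"])

# A direct call does not preprocess; it expects already-preprocessed inputs:
features = preprocessor(["some raw text"])  # hypothetical preprocessing layer
outputs = model(features)
```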