Supporting distributed adapt() in KPL
Hi TensorFlow team,
Our team is exploring distributed adapt() for data preprocessing (TextVectorization and Normalization), because we are seeing bottlenecks when doing large-scale preprocessing.
We noticed that horizontal scalability is not implemented (the multi-worker strategy doesn't work when we call strategy.run(adapt, data)). For example, Normalization simply assigns the locally merged mean and variance on each replica to the new value, without any cross-replica synchronization. We are currently experimenting with adding horizontal scalability to the preprocessing layers (see the sketch after the list below). Before we do that, we would like to:
- make sure there is no duplicated effort: do you have plans to add horizontal scalability to KPL in a later version, or are you already working on it?
- share a pull request in the future if we have updates, so we can make sure our design meets the open-source requirements.
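For reference, here is a minimal sketch of the pattern we tried (the strategy setup, feature shapes, and data are illustrative stand-ins, not our actual pipeline):

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    norm = tf.keras.layers.Normalization()

# Stand-in for this worker's shard of a much larger dataset.
local_shard = np.random.rand(100_000, 8).astype("float32")

# Each replica runs adapt() on its local shard only; we found that the
# resulting mean/variance are not synchronized across workers afterwards.
strategy.run(norm.adapt, args=(local_shard,))

# These weights reflect only the local shard's statistics.
print(norm.get_weights())
```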
Hi! Thanks for filing, sorry for the delay here.
Do you have an idea of how much horizontal scaling you would need? What's your use case/data size?
We talked about this extensively on the TF team, and the conclusions we came to...
- For true multi-worker scaling of dataset analysis, tf.distribute is not a great fit. When you are distributing a training job you usually want a lot of very expensive accelerators, but when distributing analysis before training you often just want a lot of CPU compute. We didn't want to make a square peg fit a round hole, and decided that using something like TensorFlow Transform, which is backed by Apache Beam, is the best solution when your dataset analysis needs are truly massive (see the sketch after this list).
- For a lot of real-world dataset analysis, though, we suspect that people don't really need a whole fleet of machines. A fairly large dataset can fit on a single machine. In these cases, we did want to make adapt multi-processed. So if you are developing on a machine with a ton of CPU cores sitting around, you could very quickly take a pass over your data to adapt. We think this would scale to a lot of fairly large use cases quite effectively, but we haven't had the bandwidth to prioritize this work.
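To make the first point concrete, here is a rough sketch of what the Beam-backed full-pass analysis looks like with TensorFlow Transform, following its getting-started pattern. The feature name `x` and the tiny in-memory list are placeholders; a real job would read from files and run on a distributed Beam runner instead of locally:

```python
import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Tiny in-memory stand-in for a dataset.
raw_data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(
        {"x": tf.io.FixedLenFeature([], tf.float32)}
    )
)

def preprocessing_fn(inputs):
    # scale_to_z_score triggers a full pass over the data to compute the
    # global mean and variance, then applies the normalization.
    return {"x_normalized": tft.scale_to_z_score(inputs["x"])}

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    (transformed_data, _), transform_fn = (
        (raw_data, raw_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn)
    )

print(transformed_data)
```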
Let me know if that makes sense!
tl;dr If your adapt() needs are really big (e.g. your dataset won't even fit on a single machine), take a look at TensorFlow Transform.
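For the single-machine case, until adapt() itself is multi-processed, you could also do the parallel pass by hand: compute per-shard moments across processes, pool them, and pass the result straight to Normalization. A rough, hypothetical sketch (the helper names and shard layout are made up for illustration, not an existing Keras API):

```python
import multiprocessing as mp

import numpy as np
import tensorflow as tf

def shard_moments(shard):
    # Per-shard count, mean, and (population) variance along the batch axis.
    return shard.shape[0], shard.mean(axis=0), shard.var(axis=0)

def pooled_moments(moments):
    # Merge per-shard moments: E[x] and E[x^2] combine linearly by count.
    total = sum(n for n, _, _ in moments)
    mean = sum(n * m for n, m, _ in moments) / total
    second = sum(n * (v + m ** 2) for n, m, v in moments) / total
    return mean, second - mean ** 2

if __name__ == "__main__":
    data = np.random.rand(1_000_000, 8).astype("float32")
    shards = np.array_split(data, mp.cpu_count())

    with mp.Pool() as pool:
        moments = pool.map(shard_moments, shards)

    mean, variance = pooled_moments(moments)

    # Normalization accepts precomputed statistics, so no adapt() is needed.
    norm = tf.keras.layers.Normalization(mean=mean, variance=variance)
    print(norm(data[:2]))
```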
Thanks @mattdangerw for the detailed explanation and the discussion. TensorFlow Transform is also one of the solutions we are currently exploring for large-scale in-model preprocessing. We will look into it and see if it fits our needs. Thanks!
Hello, Thank you for reporting an issue.
We're currently in the process of migrating the new Keras 3 code base from keras-team/keras-core to keras-team/keras. Consequently, this issue may not be relevant to the Keras 3 code base. After the migration is successfully completed, feel free to reopen this issue at keras-team/keras if you believe it remains relevant to the Keras 3 code base. If instead this issue is a bug or security issue in legacy tf.keras, you can report a new issue at keras-team/tf-keras, which hosts the TensorFlow-only, legacy version of Keras.
To know more about Keras 3, please read https://keras.io/keras_core/announcement/