ftlib
Support TensorFlow (2.0)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
Status:
So far FTLib does not support TensorFlow. When adopting it in ElasticDL, we take a NumPy ndarray and wrap it into a Tensor data structure defined in PyTorch. This approach not only incurs overhead but is also inelegant. It would be much better if FTLib supported TensorFlow natively.
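To illustrate the current bridging cost, here is a minimal sketch; `WrappedTensor` and `allreduce_via_wrapper` are hypothetical stand-ins for the PyTorch Tensor wrapping described above, not FTLib's actual code.

```python
import numpy as np

# Hypothetical stand-in for the current bridging: WrappedTensor mimics the
# framework-specific Tensor container that gradients are wrapped into; the
# names here are illustrative, not FTLib's real API.
class WrappedTensor:
    def __init__(self, array: np.ndarray):
        self.data = np.array(array, copy=True)  # the extra copy is the overhead

def allreduce_via_wrapper(grad: np.ndarray) -> np.ndarray:
    wrapped = WrappedTensor(grad)    # NumPy ndarray -> framework tensor
    # ... FTLib's collective allreduce would operate on `wrapped` here ...
    return np.asarray(wrapped.data)  # framework tensor -> back to NumPy

print(allreduce_via_wrapper(np.ones(3)))  # → [1. 1. 1.]
```

Native TensorFlow support would remove both conversion steps and the PyTorch dependency.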
Potential Approach(es):
Distribution Strategy was introduced with TF 2.0. The implementation of CollectiveAllReduceStrategy suggests that we can customize a new strategy with fault-tolerant/elastic ops defined in FTLib.
Regarding the enhanced ops,
- the logic FTLib uses to enhance collective ops can be assembled into a new cross_device_ops library customized by FTLib
- the logic FTLib uses to reconfigure the member list can be built into the new FTLib distribution strategy
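The elastic enhancement above can be sketched as a retry loop around a collective op: when membership changes mid-call, rebuild the member list and try again. All names below (`MemberChanged`, `elastic_allreduce`, `rebuild_members`) are assumptions for illustration, not FTLib's actual API.

```python
# Hypothetical sketch of an "elastic" collective op wrapper.
class MemberChanged(Exception):
    """Raised by the collective op when the worker set changed mid-call."""

def elastic_allreduce(tensor, allreduce, rebuild_members, max_retries=3):
    """Run `allreduce`; on a membership change, rebuild the member list and retry."""
    for _ in range(max_retries):
        try:
            return allreduce(tensor)
        except MemberChanged:
            rebuild_members()  # reconfigure the communicator, then retry
    raise RuntimeError("allreduce failed after repeated membership rebuilds")

# Toy demo: the first call fails with a membership change, the retry succeeds.
attempts = []
def toy_allreduce(t):
    attempts.append(t)
    if len(attempts) == 1:
        raise MemberChanged()
    return t * 2  # pretend two workers contributed

print(elastic_allreduce(21, toy_allreduce, rebuild_members=lambda: None))  # → 42
```

The same pattern would apply to broadcast and other collectives inside the customized cross_device_ops.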
Steps:
- Prepare new collective ops with elastic enhancement
- Create a customized distribution strategy
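The end state of the steps above would be a strategy that users drive like any tf.distribute strategy. The stub below mimics the `scope()`/`run()` shape of that interface; the class name and arguments are assumptions for illustration, not an existing FTLib API.

```python
import contextlib

# Hypothetical user-facing strategy, mimicking the tf.distribute.Strategy
# interface (scope() / run()); names are illustrative assumptions.
class ElasticCollectiveStrategy:
    def __init__(self, cross_device_ops):
        # `cross_device_ops` stands in for FTLib's enhanced collective ops.
        self.cross_device_ops = cross_device_ops

    @contextlib.contextmanager
    def scope(self):
        # A real tf.distribute strategy captures variable creation here;
        # this stub only marks the region that would run under the strategy.
        yield self

    def run(self, fn, *args):
        # A real strategy replicates `fn` across workers; the stub calls it once.
        return fn(*args)

# Intended usage, mirroring how MultiWorkerMirroredStrategy is driven today:
strategy = ElasticCollectiveStrategy(cross_device_ops="ftlib-allreduce")
with strategy.scope():
    result = strategy.run(lambda x: x + 1, 41)
print(result)  # → 42
```

If the strategy exposes this standard interface, existing TF 2.0 training loops should need no changes to gain fault tolerance.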
Potential Issues:
- While this proposal should work for TF 2.0, it cannot be applied to earlier versions.
- While it may look transparent to TF 2.0 users, this design diverges considerably from what FTLib does with PyTorch and NumPy.
/cc @gaocegege @QiJune @skydoorkai