
Support TensorFlow (2.0)

Open zw0610 opened this issue 4 years ago • 0 comments

Is this a BUG REPORT or FEATURE REQUEST?:

/kind feature

Status:

So far, FTLib does not support TensorFlow. When FTLib was adopted in ElasticDL, we took a NumPy ndarray and wrapped it in the Tensor data structure defined in PyTorch. This approach adds overhead and is inelegant. It would be much better if FTLib supported TensorFlow natively.

Potential Approach(es):

Distributed Strategy was introduced with TF 2.0. The implementation of CollectiveAllReduceStrategy suggests that we can build a custom strategy on top of fault-tolerant/elastic ops defined in FTLib.

Regarding the enhanced ops,

  1. the logic FTLib uses to enhance collective ops can be packaged into a new, FTLib-specific cross_device_ops library
  2. the logic FTLib uses to reconfigure the member list can be built into the new distributed strategy in FTLib
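
To make the two points above concrete, here is a minimal, framework-free sketch of the idea: a wrapper that retries a collective sum after dropping a failed worker from the member list. Everything in it (`MemberFailure`, `ElasticAllReduce`, `make_flaky_sum`) is a hypothetical name invented for illustration. It is not FTLib's API and not TensorFlow's cross_device_ops interface, and the "collective" is simulated in-process with plain Python lists.

```python
from typing import Callable, Dict, List, Sequence, Set


class MemberFailure(Exception):
    """Raised by the simulated collective when a worker is unreachable."""

    def __init__(self, rank: int):
        super().__init__(f"worker {rank} unreachable")
        self.rank = rank


Collective = Callable[[List[int], Dict[int, List[float]]], List[float]]


class ElasticAllReduce:
    """Hypothetical elastic wrapper: retry the collective with a shrunken member list."""

    def __init__(self, members: Sequence[int], collective: Collective, max_retries: int = 3):
        self.members = list(members)
        self._collective = collective
        self._max_retries = max_retries

    def reconfigure(self, failed_rank: int) -> None:
        # Member-list reconfiguration (point 2): drop the failed worker.
        self.members = [m for m in self.members if m != failed_rank]

    def __call__(self, local_values: Dict[int, List[float]]) -> List[float]:
        # Enhanced collective op (point 1): on failure, reconfigure and retry.
        for _ in range(self._max_retries + 1):
            if not self.members:
                break
            try:
                return self._collective(self.members, local_values)
            except MemberFailure as err:
                self.reconfigure(err.rank)
        raise RuntimeError("allreduce failed: no healthy members or retries exhausted")


def make_flaky_sum(dead_ranks: Set[int]) -> Collective:
    """Build a simulated sum-allreduce that fails if any member is in dead_ranks."""

    def collective(members: List[int], local_values: Dict[int, List[float]]) -> List[float]:
        for rank in members:
            if rank in dead_ranks:
                raise MemberFailure(rank)
        width = len(local_values[members[0]])
        return [sum(local_values[r][i] for r in members) for i in range(width)]

    return collective


# Three workers, worker 1 is down: the wrapper drops it and sums the rest.
op = ElasticAllReduce([0, 1, 2], make_flaky_sum({1}))
result = op({0: [1.0, 2.0], 1: [10.0, 20.0], 2: [3.0, 4.0]})
```

In a real TF 2.0 integration the retry logic would presumably live inside the customized cross_device_ops and the reconfiguration inside the strategy, but how those pieces hook into TensorFlow is exactly what this issue leaves open.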

Steps:

  1. Prepare new collective ops with elastic enhancement
  2. Create customized distributed strategy

Potential Issues:

  1. This proposal targets TF 2.0 and cannot be applied to earlier versions.
  2. While it may look transparent to TF 2.0 users, this design bears only a remote resemblance to what FTLib does with PyTorch and NumPy.

/cc @gaocegege @QiJune @skydoorkai

zw0610, Jul 05 '20 12:07