
Support tf.distribute strategies in TF-DF

Open sibyjackgrove opened this issue 3 years ago • 2 comments

Which tf.distribute strategy would be most suitable to use with TF-DF if we were to use it with multiple nodes of an HPC cluster?

sibyjackgrove avatar Aug 03 '21 17:08 sibyjackgrove

Hi sibyjackgrove.

TF-DF does not yet support tf.distribute strategies. This is because the currently implemented decision forest algorithms are not distributed algorithms and require the entire dataset in memory on a single machine. If you use a multiworker setup, the current algorithms will likely either crash or use only one of the machines -- this is undefined behavior.

However, a distributed gradient boosted tree algorithm is in the works and will hopefully be available later this year. I have relabeled this issue, and we will update it when the appropriate release is pushed out.

Thanks! Arvind

arvnds avatar Aug 09 '21 16:08 arvnds

Distributed training was published in the TF-DF 0.2.0 release. See the distributed training documentation for more details.

Note that the code is still experimental (the documentation is still being written), and that tf.distribute.experimental.ParameterServerStrategy is only compatible with a monolithic TF+TF-DF build. In other words, ParameterServerStrategy is not yet compatible with the PyPI TF-DF package. In the meantime, TF-DF distributed training is possible with the Yggdrasil Decision Forests GRPC Distribution Strategy.
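
For reference, a minimal sketch of what the ParameterServerStrategy setup looks like with a monolithic TF+TF-DF build. The cluster definition via TF_CONFIG and the exact constructor call are assumptions based on the distributed training documentation, not something spelled out in this thread:

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Assumes the cluster (chief, workers, parameter servers) is described in the
# TF_CONFIG environment variable on every machine.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
  # Only the distributed gradient boosted trees learner supports
  # distributed training.
  model = tfdf.keras.DistributedGradientBoostedTreesModel()
```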

This bug is left open and will be closed when ParameterServerStrategy is fully supported.

Mathieu

achoum avatar Nov 01 '21 17:11 achoum

Since this bug was opened, distributed training has been rewritten and is now stable. See https://github.com/tensorflow/decision-forests/blob/main/examples/distributed_training.py for an example. If there are still feature requests or issues with TF-DF distributed training, please open a new issue.
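
For convenience, a condensed sketch of the pattern shown in that example. The file paths, label column, and dataset format below are placeholders; see the linked script for the full, authoritative version:

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Same ParameterServerStrategy setup as in the earlier sketch.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
  model = tfdf.keras.DistributedGradientBoostedTreesModel()

# Train from a sharded dataset on disk: each worker reads its own shards
# instead of the whole dataset being materialized on a single machine.
# "train@128" expands to train-00000-of-00128, train-00001-of-00128, ...
model.fit_on_dataset_path(
    train_path="/path/to/dataset/train@128",  # placeholder path
    label_key="label",                        # placeholder label column
    dataset_format="csv")

model.summary()
```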

rstz avatar Sep 11 '23 16:09 rstz