
[Feat] Add CheckpointManager support with TFRA Dynamic Embedding Horovod Training.

Open MoFHeka opened this issue 1 year ago • 0 comments

Description

CheckpointManager is now available in Deepray when training with TFRA Dynamic Embedding.

[fix] In deepray/core/base_trainer.py, gpu_affinity did not take effect when the NVML shared library was not found.

[fix] In deepray/core/base_trainer.py line 784, self.loss_container.metrics may be empty when 'FLAGS.stop_steps = 0' is set in tools/testing/horovod_sync_train_test.py.
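For reviewers, the empty-metrics case can be illustrated with a minimal sketch. It is not the code in base_trainer.py: `collect_metric_results` and `FakeMetric` are hypothetical names, and the sketch only assumes that each metric exposes a `name` attribute and a `result()` method, as Keras-style metrics do.

```python
from typing import Dict, Iterable, Optional


class FakeMetric:
    """Hypothetical stand-in for a Keras-style metric (name + result())."""

    def __init__(self, name: str, value: float) -> None:
        self.name = name
        self._value = value

    def result(self) -> float:
        return self._value


def collect_metric_results(metrics: Optional[Iterable]) -> Dict[str, float]:
    """Return {metric name: value}, or an empty dict when no metrics exist.

    When no training step has run (for example stop_steps = 0), the metrics
    collection may be empty or None, so check before reading any result.
    """
    if not metrics:
        return {}
    return {m.name: float(m.result()) for m in metrics}


if __name__ == "__main__":
    print(collect_metric_results([]))
    print(collect_metric_results([FakeMetric("loss", 0.25)]))
```

Returning an empty dict rather than raising lets a zero-step run finish cleanly while still logging nothing misleading.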

The added test script also supports testing TF Embedding when training with Horovod.

Type of change

Checklist:

  • [x] I've properly formatted my code according to the guidelines
    • [ ] By running find ./ -name '*.py' -exec yapf --style=./.yapf -ir {} \;
    • [ ] By running pre-commit hooks
  • [ ] This PR addresses an already submitted issue for Deepray
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [ ] This PR contains modifications to C++ custom-ops

How Has This Been Tested?

mpirun -np 2 -H localhost:2 --allow-run-as-root pytest -v tools/testing/horovod_sync_train_test.py

or

horovodrun -np 2 pytest -v tools/testing/horovod_sync_train_test.py

MoFHeka, Dec 25 '23 07:12