DeePray
[Feat] Add CheckpointManager support for Horovod training with TFRA Dynamic Embedding.
Description
Now CheckpointManager is available in Deepray when training with TFRA Dynamic Embedding.
- [fix] In deepray/core/base_trainer.py, gpu_affinity did not take effect when the NVML shared library was not found.
- [fix] In deepray/core/base_trainer.py line 784, self.loss_container.metrics may be empty when 'FLAGS.stop_steps = 0' in tools/testing/horovod_sync_train_test.py.
The added test script also supports testing TF Embedding with Horovod training.
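For reviewers, here is a minimal sketch of the two kinds of guard the fixes above describe. It is not the code in deepray/core/base_trainer.py, and the actual change may differ. The helper names `nvml_available`, `maybe_set_gpu_affinity` and `summarize_metrics` are hypothetical, and the sketch assumes metrics can be viewed as a name-to-value mapping.

```python
def nvml_available():
    """Return True only if pynvml imports and NVML initializes."""
    try:
        import pynvml
    except ImportError:
        # pynvml is not installed at all.
        return False
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # Covers the "NVML Shared Library Not Found" case.
        return False
    pynvml.nvmlShutdown()
    return True


def maybe_set_gpu_affinity(gpu_id, set_affinity):
    """Hypothetical wrapper: call set_affinity(gpu_id) only when NVML is usable.

    Returns True if the affinity function ran, False if it was skipped.
    """
    if not nvml_available():
        return False
    set_affinity(gpu_id)
    return True


def summarize_metrics(metrics):
    """Hypothetical helper: convert metric values to floats.

    An empty container (for example, when zero training steps ran)
    yields an empty dict instead of raising.
    """
    if not metrics:
        return {}
    return {name: float(value) for name, value in metrics.items()}
```

With a guard like this, a run with 'FLAGS.stop_steps = 0' would log an empty metrics summary rather than fail, and hosts without NVML would skip affinity setup.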
Type of change
- [x] Bug fix
- [ ] New Tutorial
- [ ] Updated or additional documentation
- [x] Additional Testing
- [ ] New Activation and the changes conform to the activation contribution guidelines
- [ ] New Callback and the changes conform to the callback contribution guidelines
- [ ] New Image addition and the changes conform to the image op contribution guidelines
- [ ] New Layer and the changes conform to the layer contribution guidelines
- [ ] New Loss and the changes conform to the loss contribution guidelines
- [ ] New Metric and the changes conform to the metric contribution guidelines
- [ ] New Optimizer and the changes conform to the optimizer contribution guidelines
- [ ] New RNN Cell and the changes conform to the rnn contribution guidelines
- [ ] New Seq2seq addition and the changes conform to the seq2seq contribution guidelines
- [ ] New Text addition and the changes conform to the text op contribution guidelines
Checklist:
- [x] I've properly formatted my code according to the guidelines
- [ ] By running `find ./ -name '*.py' -exec yapf --style=./.yapf -ir {} \;`
- [ ] By running pre-commit hooks
- [ ] This PR addresses an already submitted issue for Deepray
- [ ] I have made corresponding changes to the documentation
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] This PR contains modifications to C++ custom-ops
How Has This Been Tested?
mpirun -np 2 -H localhost:2 --allow-run-as-root pytest -v tools/testing/horovod_sync_train_test.py
or
horovodrun -np 2 pytest -v tools/testing/horovod_sync_train_test.py