Jiajin Yu comments

Results: 7 comments by Jiajin Yu

Horovod converges slow for resnet

@alsrgv, thanks for the prompt reply. For NCCL, we use ``` python /tensorflow_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \ --data_format=NCHW --batch_size=64 --model=resnet50_v2 --optimizer=momentum \ --variable_update=replicated --nodistortions --allow_growth=True --all_reduce_spec=nccl\ --print_training_accuracy=True --num_epochs=1 --weight_decay=1e-4 \ --num_gpus=4 \...

Horovod converges slow for resnet

The commit is `d7b68b146c82ee9b936bd196c9f1ed6d54f4a1c7` (fixed) for v2, etc. This is not intentional; we tested both, and neither of them converges the same way. I just pasted two versions. That was...

Horovod converges slow for resnet

Sure. Let me rerun with the additional flag.

Horovod converges slow for resnet

I am checking the code. I think you are using a different data_dir? The code is like this ``` # Infer dataset name from data_dir if data_name is not provided. if...

Horovod converges slow for resnet

@alsrgv, thanks a lot for working on this so quickly. Looking forward to your solution.

Horovod converges slow for resnet

@alsrgv, thanks a lot for the fix. @lcytzk and I tested it in our case, and both ResNet and VGG, etc. work fine.

The EETQ quantization model cannot be performed locally

> It is possible to save a model by TGI and reuse it.

@dtlzhuangz, may I ask how to do that?