[ecapa-tdnn] [Ascend] The distributed training script needs to be modified
If this is your first time, please read our contributor guidelines: https://github.com/mindspore-lab/mindcv/blob/main/CONTRIBUTING.md
Describe the bug (Mandatory)
In the current run_distribute_train_ascend.sh, the log for card 0 (device 0) is not saved.
- Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved: /device ascend
- Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx): commit_id = '[sha1]:8a30fd67,[branch]:(HEAD,origin/master,origin/HEAD,master)'
-- Python version (e.g., Python 3.7.5): 3.7.5
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu
-- GCC/Compiler version (if compiled from source): 7.3.0
- Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved: /mode pynative /mode graph
To Reproduce (Mandatory)
Steps to reproduce the behavior:
- bash run_distribute_train_ascend.sh /data3/zl/Mindlab_data/dataset/hccl_8p.json
Expected behavior (Mandatory)
The log for card 0 should be saved during distributed training.
Screenshots / Logs (Mandatory)

if [ $# != 1 ]
then
    echo "Usage: bash run_distribute_train.sh [RANK_TABLE_FILE]"
    exit 1
fi

export RANK_TABLE_FILE=$1
export DEVICE_NUM=8
export RANK_SIZE=8

if [ ! -f $1 ]
then
    echo "RANK_TABLE_FILE Does Not Exist!"
    exit 1
fi

for((i=1; i<${DEVICE_NUM}; i++))
do
    export DEVICE_ID=$i
    export RANK_ID=$i
    rm -rf ./train_parallel$i
    mkdir ./train_parallel$i
    cp ./*.py ./train_parallel$i
    cp ./*.yaml ./train_parallel$i
    cd ./train_parallel$i || exit
    echo "start training for rank $RANK_ID, device $DEVICE_ID"
    env > env.log
    python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 > train.log 2>&1 &
    cd ..
done

export DEVICE_ID=0
export RANK_ID=0
rm -rf ./train_parallel0
mkdir ./train_parallel0
cp ./*.py ./train_parallel0
cp ./*.yaml ./train_parallel0
cd ./train_parallel0 || exit
echo "start training for rank $RANK_ID, device $DEVICE_ID"
env > env.log
python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1 2>&1
cd ..
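In the script above, ranks 1 to 7 redirect their output with `> train.log 2>&1`, while the rank 0 command ends in `2>&1` with no file target, which would explain why no log appears in ./train_parallel0. One possible fix, as a minimal sketch only: pipe rank 0 through `tee` so its output still shows on the console and is also written to train.log. The helper name `run_and_log` is hypothetical and not part of the repository, and this assumes the foreground console output for rank 0 is intentional.

```shell
# Hypothetical helper (not in the repository): run a command inside a
# working directory and save its stdout and stderr to train.log there,
# while still printing everything to the console.
run_and_log() {
    workdir=$1
    shift
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$workdir" || exit 1
        # 2>&1 merges stderr into stdout; tee writes the merged stream
        # to train.log and passes it through to the console.
        "$@" 2>&1 | tee train.log
    )
}
```

With this helper, the rank 0 launch could become `run_and_log ./train_parallel0 python train_speaker_embeddings.py --need_generate_data=False --run_distribute=1`. Note that the pipeline returns the exit status of `tee`, not of the training command. If console output for rank 0 is not needed, simply appending `> train.log` before `2>&1` on the existing line, as the other ranks do, would also save the log.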
Additional context (Optional)
Add any other context about the problem here.
please check @LiTingyu1997