CPU memory keeps increasing during Paraformer pretraining on a large-scale dataset
Notice: In order to resolve issues more efficiently, please raise the issue following the template and provide the relevant details.
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
When pretraining Paraformer with train.py on a large-scale dataset, CPU (host) memory keeps increasing until it reaches 100%, after which data loading fails with an error. How can this be resolved?
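For reference, a minimal monitoring sketch (not FunASR code; it assumes `psutil` is installed, and `log_cpu_memory` is a hypothetical helper called from the training loop) that can show whether the growth comes from the main process or from the dataloader workers:

```python
# Minimal monitoring sketch (not part of FunASR): assumes `psutil` is installed.
# `log_cpu_memory` is a hypothetical helper, called periodically from the training loop.
import os
import psutil

def log_cpu_memory(step, every=1000):
    """Print RSS of this process and of its children (the dataloader workers)."""
    if step % every != 0:
        return
    proc = psutil.Process(os.getpid())
    main_gb = proc.memory_info().rss / 1024 ** 3
    workers_gb = sum(c.memory_info().rss for c in proc.children(recursive=True)) / 1024 ** 3
    print(f"step {step}: main RSS {main_gb:.2f} GB, dataloader workers RSS {workers_gb:.2f} GB")
```

If the workers' RSS keeps climbing while the main process stays roughly flat, the growth is most likely in the data pipeline rather than in the model or optimizer.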
Code
[2024-08-14 21:57:58,231][root][INFO] - train, rank: 1, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.679), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.453), ('acc', 0.758), ('loss_pre', 0.048), ('loss', 0.501), ('batch_size', 172)], {'data_load': '0.001', 'forward_time': '0.324', 'backward_and_AllReaduce_time': '0.407', 'optim_time': '0.075', 'total_time': '0.808'}, GPU, memory: usage: 4.356 GB, peak: 39.970 GB, cache: 40.660 GB, cache_peak: 40.660 GB
[2024-08-14 21:57:58,231][root][INFO] - train, rank: 7, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.338), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.845), ('acc', 0.616), ('loss_pre', 0.148), ('loss', 0.993), ('batch_size', 200)], {'data_load': '0.000', 'forward_time': '0.339', 'backward_and_AllReaduce_time': '0.235', 'optim_time': '0.242', 'total_time': '0.816'}, GPU, memory: usage: 3.791 GB, peak: 38.241 GB, cache: 38.902 GB, cache_peak: 38.902 GB
[2024-08-14 21:57:58,232][root][INFO] - train, rank: 0, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.639), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.393), ('acc', 0.801), ('loss_pre', 0.047), ('loss', 0.441), ('batch_size', 185)], {'data_load': '0.000', 'forward_time': '0.331', 'backward_and_AllReaduce_time': '0.397', 'optim_time': '0.084', 'total_time': '0.813'}, GPU, memory: usage: 4.306 GB, peak: 41.477 GB, cache: 42.143 GB, cache_peak: 42.143 GB
[2024-08-14 21:57:58,235][root][INFO] - train, rank: 2, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.544), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.358), ('acc', 0.787), ('loss_pre', 0.044), ('loss', 0.402), ('batch_size', 155)], {'data_load': '0.000', 'forward_time': '0.338', 'backward_and_AllReaduce_time': '0.386', 'optim_time': '0.093', 'total_time': '0.819'}, GPU, memory: usage: 4.394 GB, peak: 55.395 GB, cache: 55.916 GB, cache_peak: 55.916 GB
[2024-08-14 21:57:58,235][root][INFO] - train, rank: 4, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.550), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.424), ('acc', 0.697), ('loss_pre', 0.062), ('loss', 0.486), ('batch_size', 103)], {'data_load': '0.000', 'forward_time': '0.334', 'backward_and_AllReaduce_time': '0.405', 'optim_time': '0.073', 'total_time': '0.813'}, GPU, memory: usage: 4.703 GB, peak: 67.206 GB, cache: 67.770 GB, cache_peak: 67.770 GB
[2024-08-14 21:57:58,237][root][INFO] - train, rank: 6, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.294), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.788), ('acc', 0.614), ('loss_pre', 0.162), ('loss', 0.95), ('batch_size', 199)], {'data_load': '0.000', 'forward_time': '0.329', 'backward_and_AllReaduce_time': '0.239', 'optim_time': '0.246', 'total_time': '0.814'}, GPU, memory: usage: 3.786 GB, peak: 60.709 GB, cache: 61.479 GB, cache_peak: 61.479 GB
[2024-08-14 21:57:58,237][root][INFO] - train, rank: 3, epoch: 0/500, data_slice: 0/5, step_in_slice: 44000/156088, step_in_epoch: 80000, total step: 80000, (loss_avg_rank: 0.547), (loss_avg_slice: 0.623), (ppl_avg_slice: 1.865e+00), (acc_avg_slice: 0.694), (lr: 6.124e-04), [('loss_att', 0.376), ('acc', 0.759), ('loss_pre', 0.041), ('loss', 0.417), ('batch_size', 138)], {'data_load': '0.000', 'forward_time': '0.333', 'backward_and_AllReaduce_time': '0.398', 'optim_time': '0.083', 'total_time': '0.815'}, GPU, memory: usage: 4.479 GB, peak: 47.688 GB, cache: 48.240 GB, cache_peak: 48.240 GB
[2024-08-14 21:57:58,246][root][INFO] - Validate epoch: 0, rank: 3
[2024-08-14 21:57:58,248][root][INFO] - Validate epoch: 0, rank: 5
[2024-08-14 21:57:58,250][root][INFO] - Validate epoch: 0, rank: 1
[2024-08-14 21:57:58,250][root][INFO] - Validate epoch: 0, rank: 7
[2024-08-14 21:57:58,251][root][INFO] - Validate epoch: 0, rank: 0
[2024-08-14 21:57:58,253][root][INFO] - Validate epoch: 0, rank: 4
[2024-08-14 21:57:58,254][root][INFO] - Validate epoch: 0, rank: 2
[2024-08-14 21:57:58,255][root][INFO] - Validate epoch: 0, rank: 6
[2024-08-14 21:57:58,409][root][INFO] - rank: 3, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,412][root][INFO] - rank: 5, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,413][root][INFO] - rank: 1, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,414][root][INFO] - rank: 4, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,414][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,417][root][INFO] - rank: 7, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,418][root][INFO] - rank: 2, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:57:58,419][root][INFO] - rank: 6, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:58:59,814][root][INFO] - rank: 5, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:02,055][root][INFO] - rank: 3, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:02,374][root][INFO] - rank: 6, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:03,529][root][INFO] - rank: 4, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:03,584][root][INFO] - rank: 1, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:05,045][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 65, after: 65
[2024-08-14 21:59:05,611][root][INFO] - rank: 2, dataloader start from step: 0, batch_num: 65, after: 65
[rank7]: File "/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer/../../../funasr/bin/train.py", line 270, in
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/lxc-data/FunASR/examples/industrial_data_pretraining/paraformer/../../../funasr/bin/train.py", line 270, in
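One common cause of this pattern (growth only on the host side, with `num_workers > 0`) is the known PyTorch copy-on-write behaviour when the dataset index is a large Python list of objects: each worker gradually duplicates the pages as it touches refcounts (see pytorch/pytorch#13246). Below is a hedged sketch of the usual workaround, keeping the jsonl manifest as one contiguous byte buffer plus offsets instead of a list of dicts; `JsonlManifest` and the record layout are illustrative assumptions, not FunASR's actual dataset class:

```python
# Hedged workaround sketch: assumes the CPU-memory growth comes from a large
# Python-list manifest being copy-on-write duplicated into every dataloader
# worker (pytorch/pytorch#13246). `JsonlManifest` and the jsonl layout are
# illustrative assumptions, not FunASR's actual dataset implementation.
import json
import numpy as np
from torch.utils.data import Dataset

class JsonlManifest(Dataset):
    """Holds the manifest as one numpy byte buffer plus an offsets array,
    so worker processes do not inflate RSS by touching per-item refcounts."""

    def __init__(self, jsonl_path):
        offsets = [0]
        chunks = []
        with open(jsonl_path, "rb") as f:
            for line in f:
                chunks.append(line)
                offsets.append(offsets[-1] + len(line))
        self.buf = np.frombuffer(b"".join(chunks), dtype=np.uint8)
        self.offsets = np.asarray(offsets, dtype=np.int64)

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, idx):
        start, end = self.offsets[idx], self.offsets[idx + 1]
        # e.g. {"source": "/path/to/utt.wav", "target": "transcript"}
        return json.loads(self.buf[start:end].tobytes())
```

Another quick check is to rerun the same configuration with `num_workers=0` (or fewer workers): if the memory growth slows down or disappears, the data pipeline is confirmed as the source; if it persists, the leak is elsewhere.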
What have you tried?
What's your environment?
CUDA 12.2, Linux, 4 * A100-8
- OS (e.g., Linux):
- FunASR Version (e.g., 1.0.0):
- ModelScope Version (e.g., 1.11.0):
- PyTorch Version (e.g., 2.0.0):
- How you installed funasr (pip, source):
- Python version:
- GPU (e.g., V100M32)
- CUDA/cuDNN version (e.g., cuda11.7):
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
- Any other relevant information: