prepare.py slow speed
I have been running some experiments to debug why prepare.py is so slow. I'll post the results here in a bit, but let me just give you a preview of why I believe it's so slow. By default, torch seems to use a lot of threads, but it does so in a very inefficient way that actually just wastes CPU. We attempt to get around this with:
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
but now that we are using "spawn" as the multiprocessing start method and main() is guarded by:
if __name__ == '__main__':
main()
that code is never reached in the spawned worker processes (each worker re-imports the module but skips the guarded block, so it never runs main()).
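To make that concrete, here is a minimal, self-contained sketch (illustration only, not the actual prepare.py; the worker function and pool size are made up) showing that under "spawn" the thread settings made inside main() never reach the workers:

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

def report_threads(_):
    # Runs inside a spawned worker; reports that worker's setting.
    return torch.get_num_threads()

def main():
    # Under "spawn" these calls only affect the parent process:
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)
    ctx = multiprocessing.get_context("spawn")
    with ProcessPoolExecutor(4, mp_context=ctx) as ex:
        # Each worker re-imports this module but skips the guarded
        # block below, so this likely prints the default thread
        # counts, not [1, 1, 1, 1].
        print(list(ex.map(report_threads, range(4))))

if __name__ == '__main__':
    main()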
Here are my notes from investigating the time to prepare dev-clean, including writing llc files. The script prepare.py was modified to skip the prepare_librispeech set and just load the relevant manifest from disk so the time is just the time taken in cut_set.compute_and_store_features().
Below I show various setups. TL;DR: the following statements:
torch.set_num_threads(1)
torch.set_num_interop_threads(1)
need to be moved OUTSIDE of main() for a roughly 100-fold speed improvement (see the sketch below). This should be done regardless of the num-jobs or execution setup.
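Concretely, the fix looks something like this (a sketch of the placement only, not the full prepare.py):

import torch

# At module scope: this runs in the parent AND in every "spawn"-ed
# worker, because each worker re-imports the module on startup.
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

def main():
    ...  # feature extraction as before

if __name__ == '__main__':
    main()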
with executor = None; torch.set_num_threads(1); torch.set_num_interop_threads(1):
real 1m17, user 0m51, sys 0m22
with executor = None and no torch.set_* calls, which means it uses about 25 CPUs:
real 1m48, user 45m45, sys 1m6
with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
[inside main(): torch.set_num_threads(1); torch.set_num_interop_threads(1)]; num_jobs = 1:
The single job still uses about 25 CPUs:
real 1m35, user 39m30, sys 1m16
with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
[inside main(): torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
Each job uses about 500% CPU:
real 10m16.768s, user 299m8.986s, sys 2m26.761s
with executor = ProcessPoolExecutor(num_jobs, mp_context=multiprocessing.get_context("spawn")),
[outside main(), at module scope: torch.set_num_threads(10); torch.set_num_interop_threads(1)]; num_jobs = 10:
Each job uses about 500% CPU:
real 0m13.860s, user 1m8.146s, sys 0m44.639s
Also notice there is about a factor of 300 in speed between the last two runs, if you look at the user time taken (i.e. total across CPUs).
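For the record, a quick sanity check of that ratio from the user times reported above:

slow = 299 * 60 + 8.986  # settings inside main():  user 299m8.986s -> 17948.986 s
fast = 1 * 60 + 8.146    # settings outside main(): user 1m8.146s   ->    68.146 s
print(slow / fast)       # ~263, in the ballpark of the factor of 300 quoted above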
Let me try your suggestion and make a PR to fix it.
Yes, it is much faster! 👍
Using 48 cores on ('dev-clean', 'test-clean'):
real 0m47.141s
user 3m11.562s
sys 1m39.085s
Using 72 cores on ('dev-clean', 'test-clean', 'train-clean-100'):
real 6m20.349s
user 32m31.692s
sys 7m57.317s
It is slower than Dan's result; I think it is caused by the NFS latency of my home directory.
🎉
I have disabled the OpenMP support for libsox on the torchaudio side: https://github.com/pytorch/audio/issues/1026. It will be available through the nightly binary distribution starting tomorrow. (Please check pytorch.org for how to install the nightly build.)
This should remove the suspected cause of crashing (GNU OpenMP and Intel OpenMP co-existing and interfering with each other), and will allow you to go back to the fork method without worrying about the OMP_NUM_THREADS configuration (or torch.set_num_threads).
We have a minor release scheduled for early December, so we would appreciate it if you could test it and confirm whether or not it resolves the issue. https://github.com/pytorch/audio/issues/1021
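For reference, the change being suggested is roughly the following (a hedged sketch; the variable names around the executor are assumptions, not the exact prepare.py code):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

num_jobs = 15  # placeholder for whatever prepare.py computes

# Revert the start method from "spawn" back to "fork":
executor = ProcessPoolExecutor(
    num_jobs,
    mp_context=multiprocessing.get_context("fork"),
)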
@jimbozhang would you mind testing this? (I think what is required is testing that the speed is OK after you revert "spawn" back to "fork" and remove the stuff about setting torch num_threads.)
OK, I'll work on this (and the k2 NVTX stuff) next Monday. @danpovey
With the latest versions of k2 + lhotse + snowfall, under the path snowfall/egs/librispeech/asr/simple_v1:
$ time python prepare.py
dev-clean
Extracting and storing features in parallel: 100%|_________| 2703/2703 [00:07<00:00, 364.28it/s]
test-clean
Extracting and storing features in parallel: 100%|_________| 2620/2620 [00:07<00:00, 342.19it/s]
train-clean-100
Extracting and storing features in parallel: 100%|_________| 28539/28539 [01:26<00:00, 331.35it/s]
real 3m41.975s
user 33m55.149s
sys 5m25.221s
I didn't modify the job number setting, so num_jobs = min(15, os.cpu_count()) should evaluate to 15.
I think to test what he's asking about, you need to install the latest version of TorchAudio (available through the nightly builds)... at least I think that is what he was saying. You might need to install PyTorch 1.7.0 in order to do that. Perhaps you should talk to Haowen about this to-morrow. On our test machine we have cuda 10.1 and you are probably using PyTorch 1.6.0, so you'll have to see whether there
Also you'd need to change the snowfall prepare.py code to use "fork" instead of "spawn", and probably remove the two statements setting torch's num-threads to 1, to verify that it still works the same after those changes.
Ah, apologies. I misunderstood it. I didn't notice mthrok's comment at all :sweat: and simply tested the latest snowfall. Sorry about that...
I'll try torchaudio with the fork method tomorrow.
I tested it. The snowfall/egs/librispeech/asr/simple_v1/prepare.py worked fine when I just replaced "spawn" with "fork", with torchaudio (0.8.0.dev20201125) + torch (1.8.0.dev20201126+cu101).
But after I removed the torch.set_num_threads settings, it became extremely slow, same as before. @mthrok @danpovey
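So, putting the tests above together, the combination that stays fast appears to be "fork" plus the module-level thread settings. A sketch of that combination (assumptions as before, not the exact prepare.py code):

import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Keep these at module scope: per the test above, removing them makes
# feature extraction extremely slow again, even with "fork" (forked
# workers inherit the parent's settings).
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

executor = ProcessPoolExecutor(
    15,  # placeholder for num_jobs
    mp_context=multiprocessing.get_context("fork"),
)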