cryodrgn
GPU parallelization
Hi,
I just wanted to check whether you have made progress on this. I saw some commits where you added DataParallel lines.
Thanks for developing such a nice tool for the cryo-EM field.
best, heejong
Hi Heejong,
Thanks for asking! The current top of tree has GPU parallelization (commit 3ba2439db6fef20922dd3c60c2a7ab1508475d76) and mixed precision training. Feel free to give it a shot -- I've been meaning to reorganize the documentation before an official release.
A few mini docs:
- In this version, cryodrgn train_vae will use all available GPUs by default. Use the CUDA_VISIBLE_DEVICES environment variable to restrict the run to specific GPUs (see the example commands after this list).
- Depending on the image size / model size, you may wish to increase the default batch size (e.g. -b 16) to take advantage of the parallelization. The current default batch size is optimized for model updates per second on a single GPU rather than for GPU utilization.
- Mixed precision training is now available on Nvidia tensor core architectures with the --amp flag. You'll need to install apex (https://github.com/NVIDIA/apex#quick-start). In my experience, this leads to ~3x faster training for larger model architectures (e.g. 1024x3).
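To make these concrete, here is a hedged sketch of typical invocations; the particle stack, pose/CTF pickle names, and output directory below are placeholders rather than files from this thread:

# restrict training to GPUs 0 and 1, and raise the batch size for multi-GPU throughput
CUDA_VISIBLE_DEVICES=0,1 cryodrgn train_vae particles.mrcs \
    --poses pose.pkl --ctf ctf.pkl --zdim 8 -n 50 -b 16 -o output_dir

# mixed precision training on tensor core GPUs (requires apex)
cryodrgn train_vae particles.mrcs \
    --poses pose.pkl --ctf ctf.pkl --zdim 8 -n 50 --amp -o output_dir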
Thanks, Ellen
Yes. I briefly tested with multiple GPUs and it improved performance substantially, which helped me quickly try out and settle on at least minimal working parameters for my cases.
One thing I couldn't get to work is the --amp flag.
With that flag I get the following error message:
Traceback (most recent call last):
File "/home/XXXX/miniconda3/envs/cryodrgn/bin/cryodrgn", line 11, in
If you have any ideas about what might have caused this issue, that would be tremendously helpful. Once I get it working, I will get back to you with a direct speed comparison.
Thanks.
Great to hear! I added some assertion messages for the assert that you ran into (commit f1de270a565592adc88602dfee313ed861afebb5).
It's checking that your image size is a multiple of 8. Mixed precision training leads to dramatic speed ups only if your tensor dimensions are even multiples of 8, so I added a few asserts to ensure this.
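If a particle stack's box size is not a multiple of 8, one way to satisfy this check is to downsample to a box size that is; a minimal sketch (the box size and file names here are just illustrative):

# downsample to a 128-pixel box (a multiple of 8) so the --amp size check passes
cryodrgn downsample particles.mrcs -D 128 -o particles.128.mrcs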
Fantastic! I just ran over 1 million particles at D128 with --amp --lazy --zdim 10 -n 50 --qdim 1024 --qlayers 3 --pdim 1024 --players 3, and it took only a little over one day.
Also, thanks for adding the argument for specifying the K. It's really helpful.
When I compared amp vs. no amp with a common base command of cryodrgn train_vae cryosparc_P4_J251_009_particles_cs_abs_w_mrcs_star_06_25.256.mrcs --poses pose_256.pkl --ctf ctf.pkl --zdim 8 -n 100 -o vae256_z8_e100 --lazy --batch-size 64 --beta 4, adding --amp runs ~17 times faster.
Although I didn't run a formal benchmark (i.e. multiple runs to rule out flukes in the data or conditions), I kept every other setting the same (same number of GPUs, same partition/hardware, same command). Therefore, I plan to add --amp from now on.
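In other words, the comparison was between the base command above and the same command with --amp appended; a sketch of the two runs (the second output directory name is just a placeholder to keep the runs separate):

# full precision baseline
cryodrgn train_vae cryosparc_P4_J251_009_particles_cs_abs_w_mrcs_star_06_25.256.mrcs \
    --poses pose_256.pkl --ctf ctf.pkl --zdim 8 -n 100 -o vae256_z8_e100 \
    --lazy --batch-size 64 --beta 4

# identical settings with mixed precision (~17x faster in this informal test)
cryodrgn train_vae cryosparc_P4_J251_009_particles_cs_abs_w_mrcs_star_06_25.256.mrcs \
    --poses pose_256.pkl --ctf ctf.pkl --zdim 8 -n 100 -o vae256_z8_e100_amp \
    --lazy --batch-size 64 --beta 4 --amp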
Wow 17x! Great! I haven't noticed any accuracy degradation when using mixed precision training (admittedly with limited benchmarking), so I usually leave it on by default as well. For smaller architectures, sometimes the overhead makes using amp slightly slower than full precision training so keep that in mind too.
Just as a quick note, I would caution against increasing the batch size too much since it may negatively affect the training dynamics. We're definitely not GPU-memory limited with the default batch size (-b 8), so increasing the batch size can lead to dramatic speedups in terms of time per epoch... except that it also means fewer model updates per epoch, so the model can converge more slowly and you can actually end up training slower in terms of wall clock time. I've noticed this in some initial tests, but it's something else to explore/benchmark before officially releasing the GPU parallelization version.
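As a rough, hypothetical illustration of that trade-off: for a run with about 1 million particles, -b 8 gives roughly 125,000 gradient updates per epoch, while -b 64 gives only about 15,600. Each epoch finishes much faster at the larger batch size, but the model sees far fewer updates per epoch, so it may take more epochs (and potentially more total wall clock time) to reach comparable quality.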
Thanks for your comment.
With the default architecture (i.e. --enc-layers QLAYERS, number of hidden layers, default 3; --enc-dim QDIM, number of nodes in hidden layers, default 256; --dec-layers PLAYERS, default 3; --dec-dim PDIM, default 256) and an NVIDIA A100 GPU, I see a ~2.5x wall clock speedup with amp (apex installed with the Python-only build).
@kimdn How did you install apex? I installed it in a separate apex folder alongside the cryodrgn folder, but the --amp option did not work, giving the following error: NameError: name 'amp' is not defined
Any suggestions would be greatly appreciated.
@donghuachensu
I ran pip install -v --disable-pip-version-check --no-cache-dir ./ according to https://github.com/NVIDIA/apex#quick-start
The C++/CUDA extension installation (e.g. pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./) has never worked on my system (it has always resulted in some error during installation, for many months).
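For reference, a minimal sketch of the Python-only install route I used, following the apex quick-start (the clone location is arbitrary; the conda activation step is an assumption based on the environment shown in the traceback above):

# activate the environment that cryodrgn runs in so apex is importable there
conda activate cryodrgn

# clone apex and do the Python-only build (no --cpp_ext/--cuda_ext extensions)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir ./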