openfold
training speed is about 2x slower than the trainable JAX version (Uni-Fold)
device: 1 A100 with 40GB memory
cuda: 11.3
Compared with https://github.com/dptech-corp/Uni-Fold, using the model_2 setting and the same data (only one sample, using the DummyDataLoader in openfold).
Following this issue, https://github.com/aqlaboratory/openfold/issues/19, I disabled clear_cache_between_blocks and DeepSpeed CPU offloading.
The commit I used is https://github.com/aqlaboratory/openfold/commit/c4d9f57f9005f3e9e0325eff97b8232e328b4813
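(A rough sketch of toggling that flag off in the model config, for illustration only; the model_config signature and the attribute layout here are assumptions and may differ from the actual test code at this commit:)

```python
from openfold.config import model_config

config = model_config("model_2", train=True)


def disable_cache_clearing(node):
    # Walk the ml_collections ConfigDict and flip the flag wherever it appears
    # (e.g. the Evoformer and extra-MSA stacks).
    for key in node.keys():
        value = node[key]
        if key == "clear_cache_between_blocks":
            node[key] = False
        elif hasattr(value, "keys"):
            disable_cache_clearing(value)


disable_cache_clearing(config.model)
```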
speed per example:
|  | FP32 | FP16 |
|---|---|---|
| openfold | 24.5 s | 17 s |
| Uni-Fold | 13.25 s | 8.9 s |
Is that expected? Are there any tricks I can use to get a further speed-up?
No, this is not expected. In our previous A100 experiments we observed single-example times of 6.5-7s for the 256 crop. I'll get back to you once this has been verified on a recent build.
Are you using DeepSpeed at all? What ZeRO stage are you using if so?
Also, how long did you run each model before recording times?
Have you made any other changes to the model config besides disabling the cache clearing option?
Here is the code for my test: https://github.com/guolinke/openfold/tree/guoke/test. The changesets are https://github.com/guolinke/openfold/commit/d36876319d745de2c6e921eb835fb750335778e3 and https://github.com/guolinke/openfold/commit/6fb440c1d6485bcd3b57837593f5a0600dda882d. For DeepSpeed, I still use it, but with stage 0 and cpu_offload=false.
To run the code, first gunzip test_data.pickle.gz, then run the training command:
python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1
For the timings, I wait several iterations until the speed is stable.
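A minimal sketch of a DeepSpeed config matching the stage-0, no-offload, fp16 settings mentioned above; the actual deepspeed_config.json in the test branch may contain more fields, and the key names here follow the public DeepSpeed schema rather than that exact file:

```python
import json

# Illustrative config only; adjust batch sizes to match the actual run.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 0,
        "cpu_offload": False,
    },
}

with open("deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```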

A few comments/questions:
- I don't believe that the sample batch used in the unit tests is very well-suited for tests like this. I believe it uses a much smaller crop size, and it may be out of date in certain respects. If possible, I'd generate a real batch using the actual dataloader, optionally pickling that for subsequent tests (see the sketch at the end of this reply).
- Removing the copy.deepcopy() from the DummyDataLoader will likely have unintended consequences, since grad is enabled for several of the tensors inserted therein. I'd put it back and try the same test again.
- Since this test is being run on just one GPU, I'd try just getting rid of DeepSpeed altogether.
- Is Uni-Fold definitely doing all of the same stuff as OpenFold (which hews pretty close to the original supplement/source code)? Does it maintain, for example, an EMA of the model parameters? Is it doing the same number of recycling iterations? Is it checkpointing in all of the same places? And so on.
- Try disabling the line in the training script that TorchScripts components of the model (the call to _script_preset seems to degrade performance on occasion).
- I'm pretty sure this has no effect on performance, but there's no need for the --replace_sampler_ddp flag if you're just using 1 GPU.
- Maybe try running training with the --benchmark flag enabled.
- 13 iterations might be too few for the runtime to stabilize. It usually takes a lot longer for me, but, granted, most of my testing is done on 2080 Ti's.
In any case, it's also possible that OpenFold's performance might have been affected by a recent change. Again, I'll be repeating our runtime tests on A100s ASAP (that might take a few days, though).
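As a concrete illustration of the first two points, here is a minimal sketch of caching one real batch and keeping the deepcopy when it is replayed; the helper names and the dict-of-tensors batch layout are assumptions, not OpenFold code:

```python
import copy
import pickle


def dump_real_batch(train_dataloader, path="real_batch.pickle"):
    """Grab one batch from the real dataloader and cache it for later benchmark runs."""
    batch = next(iter(train_dataloader))
    # Detach and move to CPU so no autograd graph or device state gets pickled.
    batch = {k: v.detach().cpu() for k, v in batch.items()}
    with open(path, "wb") as f:
        pickle.dump(batch, f)


def next_dummy_batch(cached_batch):
    """Return a fresh copy each step, mirroring the DummyDataLoader's copy.deepcopy().

    Reusing the exact same tensor objects every iteration can have unintended
    consequences once grad is enabled on some of them, so keep the deepcopy.
    """
    return copy.deepcopy(cached_batch)
```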
Thank you.
- I want to get rid of the effect of the data loader, so I created dummy data for the benchmark. BTW, this data isn't from tests/test_data; I created it myself. Its crop size is 256.
- just tried, and the same speed.
- removing DeepSpeed is slower (22 s for fp16). I think DeepSpeed's fused_adam may account for the difference (see the sketch after this list).
- I am not sure about that. I think the recycling number may cause this; I will make it the same and fixed. As for the other factors, do you think they could cause such a large speed difference? Update: I just checked the code at https://github.com/dptech-corp/Uni-Fold/blob/8cc7bcedc23efd53a2f0b11de6657c61ac9204f5/unifold/model/modules.py#L481 (search for hk.remat in that file), and it seems the placement of activation checkpointing is the same.
- removing it is indeed faster, about 1 s faster for fp16.
- just tried, and almost the same speed.
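For what it's worth, if the gap without DeepSpeed really comes down to its fused optimizer, that kernel can also be used standalone. A minimal sketch (assuming DeepSpeed is installed with its fused Adam op built; the toy module and hyperparameters are placeholders):

```python
import torch
from deepspeed.ops.adam import FusedAdam

# Toy module just to have parameters to optimize; in the benchmark this would be the model.
model = torch.nn.Linear(8, 8).cuda()
optimizer = FusedAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
optimizer.step()
```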
- Try disabling contiguous_gradients in DeepSpeed.
Continuing the discussion on previous points:
1. Since the crop size of the sample batch isn't right, I still think it's important to use a fresh one. If you want to discount the runtime of the dataloader, you can pickle it and then rerun the tests using the original DummyDataLoader.
3. I'm surprised to see such a big DeepSpeed performance difference; I've never seen a difference of more than a second or two. Quite odd.
4. Certainly, the placement and number of activation checkpoints, the number of recycling iterations, and so on can affect the training iteration runtime.
(N.B. - I added 7. and 8. straight to my previous reply, possibly after you already responded. Sorry about that!)
Here's a datapoint in the meantime. Using the right-out-of-the-box setting from the same commit (c4d9f57), with the real dataloader, the slow cache clearing, DeepSpeed stage 2, CPU offloading, and the slow TorchScripting (so basically the worst-case scenario), I ran
python3 train_openfold.py data/ alignments/ /data/ga122/alphafold/pdb_mmcif/mmcif_files/ train_op_16 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --gpus 1 --replace_sampler_ddp=True --seed 44 --default_root_dir train_op_16 --deepspeed_config deepspeed_config.json --precision 16
on 1 consumer-grade 2080 Ti. After 13 iterations, I got:
[screenshot of per-iteration training times]
It makes me think that something might be wrong with your torch/CUDA installation. I'm not sure.
I use the Docker image mmdog/pytorch:pytorch1.10.0-cuda11.3 to run it.
I think I found the problem: with my created dummy data, openfold fixes the recycling number at 4 (3 no_grad + 1 grad), while Uni-Fold randomly samples it from [0, 3] + 1. So I ran Uni-Fold again with the recycling number fixed at 3 + 1.
The updated results:
|  | FP32 | FP16 |
|---|---|---|
| openfold | 22.5 s | 16 s |
| Uni-Fold | 18.44 s | 12 s |
The result is much closer now.
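To make the recycling difference above concrete, here is a small sketch of the two schedules being compared; the constants come from the discussion, but the helper itself is purely illustrative:

```python
import torch

MAX_NO_GRAD_RECYCLES = 3  # recycling passes run under no_grad; one final pass keeps grad


def num_no_grad_recycles(fixed: bool) -> int:
    if fixed:
        # The dummy-data benchmark effectively always hit the most expensive case: 3 + 1.
        return MAX_NO_GRAD_RECYCLES
    # AlphaFold-style training samples uniformly from [0, 3], then adds the final grad pass.
    return int(torch.randint(0, MAX_NO_GRAD_RECYCLES + 1, (1,)).item())
```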
BTW, I also updated the comment above (https://github.com/aqlaboratory/openfold/issues/34#issuecomment-997321701). I will try disabling EMA and the other suggestions later.
- disabling ema is slightly faster, about 0.1 s
- --benchmark is almost the same speed
- disabling contiguous_gradients is almost the same speed
Hm. I'll try to think of more discrepancies. I think there still have to be more; even if the 6.5-7s A100 time doesn't pan out, we shouldn't be getting essentially the same times on the A100 and 2080 Ti, especially considering the optimizations you've made.
With uniform random recycling over [1, 4], the fp16 speed for openfold is about 11.9 s. I am trying to create real data for testing, but the download and preprocessing are very slow. It would be great if you could share a small toy dataset for the test, like the one you used in the screenshot above.
Yeah no problem. How best can I get it to you?
Thank you. Whichever way is convenient for you, like Google Drive or Dropbox. My email is [email protected]
Gently pinging @gahdritz about the data sharing.
Sent.
Our A100 results were obtained using the following:
- CUDA Driver 465.19.01
- CUDA 11.3 Update 1 (11.3.1.005)
- cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
- cuDNN 8.2.1.32
- NCCL 2.9.9
- PyTorch 1.9.0a0+c3d40fd
and with cache clearing disabled (but using the real dataloader).
Thank you very much! I received the data.
It seems the data doesn't include the template part (template_mmcif_dir and mmcif_cache.json); are those not needed?
The mmcif cache isn't required, but the template mmCIFs are. I'll send those over now.
Sent.
Have you tried running it with bfloat16? Doesn't seem to be working.
File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy
    log_p = torch.nn.functional.logsigmoid(logits)
RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'
I'm also a bit surprised that the model params size is not adjusted. It should be half the size, same as with fp16, right?
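For anyone hitting the same error, a minimal sketch of one way to sidestep the missing bf16 log_sigmoid kernel (an assumed workaround, not necessarily how OpenFold addresses it) is to compute that loss term in fp32 and cast back:

```python
import torch


def sigmoid_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    out_dtype = logits.dtype
    # log_sigmoid has no bf16 CUDA kernel in this torch build, so upcast first.
    logits, labels = logits.float(), labels.float()
    log_p = torch.nn.functional.logsigmoid(logits)
    log_not_p = torch.nn.functional.logsigmoid(-logits)
    loss = -labels * log_p - (1.0 - labels) * log_not_p
    return loss.to(dtype=out_dtype)
```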
Yes, we have tested bfloat16, and it's a lot better than fp16, but you'll need PyTorch 1.10 for that. The test I referenced previously used fp16.
Strange, I'm already on torch 1.10.1+cu113. Better in terms of what?
You won't NaN anymore.
Have you updated your DeepSpeed config for bf16 training?
I'm not using DeepSpeed in this experiment, just switched on precision="bf16" in PyTorch-lightning.
Hm. Could you test it with DeepSpeed one time? That's what our test used. I'd repeat the test without DeepSpeed myself, but the A100s we've been using are borrowed and not currently accessible.
It works, but it OOMs, which I believe it doesn't with fp16. I'm re-running the latter now. That's why I was wondering about the parameter size.
That's kind of weird. How much memory do you have on your A100s?
40GB. Single batch. I now cap validation targets at 700 AA, which did the trick.
Just 700? That's very odd. Is grad being enabled for validation runs or something?
I didn't really test the limit; it can probably be a bit larger (I tested it on a v100s with 32GB). There were some 1k+ AA targets in the set beforehand.
Not sure about grad being enabled. Was wondering the same, but manually switching to no_grad didn't do anything and it's much faster compared to training on crops.
Actually on second thought it's not very weird that really long validation proteins should fail---chunking isn't enabled by default during validation, so you'll get much worse memory performance than during inference.
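A minimal sketch of turning chunking on for long validation targets (the config path is an assumption based on OpenFold's config layout and may differ at this commit):

```python
from openfold.config import model_config

config = model_config("model_2", train=True)
# Smaller chunks lower peak memory on long validation targets, at some speed cost.
config.globals.chunk_size = 4
```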
No OOM with FP16...
Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.