
training speed is about 2x slower than JAX trainable version (Uni-Fold)

Open guolinke opened this issue 3 years ago • 44 comments

Device: 1 A100 with 40 GB memory; CUDA 11.3. I compared against https://github.com/dptech-corp/Uni-Fold using the model_2 setting and the same data (only one sample, via the DummyDataLoader in openfold).

Following https://github.com/aqlaboratory/openfold/issues/19, I disabled clear_cache_between_blocks and DeepSpeed CPU offload. The commit I used is https://github.com/aqlaboratory/openfold/commit/c4d9f57f9005f3e9e0325eff97b8232e328b4813

speed per example:

|          | FP32    | FP16  |
|----------|---------|-------|
| openfold | 24.5 s  | 17 s  |
| Uni-Fold | 13.25 s | 8.9 s |

Is that expected? Are there any tricks I can use to get a further speed-up?

guolinke avatar Dec 18 '21 10:12 guolinke

No, this is not expected. In our previous A100 experiments we observed single-example times of 6.5-7s for the 256 crop. I'll get back to you once this has been verified on a recent build.

Are you using DeepSpeed at all? What ZeRO stage are you using if so?

Also, how long are you running each model before you recorded times?

Have you made any other changes to the model config besides disabling the cache clearing option?

gahdritz avatar Dec 18 '21 21:12 gahdritz

Here is the code for my test: https://github.com/guolinke/openfold/tree/guoke/test (changesets: https://github.com/guolinke/openfold/commit/d36876319d745de2c6e921eb835fb750335778e3 and https://github.com/guolinke/openfold/commit/6fb440c1d6485bcd3b57837593f5a0600dda882d). For DeepSpeed, I still use it, but with ZeRO stage 0 and cpu_offload=false.
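For reference, the ZeRO-related part of that config looks roughly like the dict below (a sketch only, not the exact deepspeed_config.json from my branch; key layout can vary between DeepSpeed versions and the optimizer values are placeholders):

```python
# Sketch of a DeepSpeed config with ZeRO stage 0 and CPU offload disabled,
# written as the dict DeepSpeed parses from deepspeed_config.json.
deepspeed_config = {
    "zero_optimization": {
        "stage": 0,            # plain data parallelism, no ZeRO partitioning
        "cpu_offload": False,  # keep optimizer states on the GPU
    },
    "fp16": {"enabled": True},  # matches --precision 16
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},  # placeholder values
}
```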

To run the code, first gunzip test_data.pickle.gz, then run the training command: python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1

For the time recording, I wait several iterations until the speed is stable. [screenshot of per-iteration timings]

guolinke avatar Dec 19 '21 01:12 guolinke

A few comments/questions:

  1. I don't believe that the sample batch used in the unit tests is very well-suited for tests like this. I believe it uses a much smaller crop size, and it may be out of date in certain respects. If possible, I'd generate a real batch using the actual dataloader, optionally pickling that for subsequent tests.
  2. Removing the copy.deepcopy() from the DummyDataLoader will likely have unintended consequences, since grad is enabled for several of the tensors inserted therein. I'd put it back and try the same test again (see the sketch after this list).
  3. Since this test is being run on just one GPU, I'd try just getting rid of DeepSpeed altogether.
  4. Is Uni-Fold definitely doing all of the same stuff as OpenFold (which hews pretty close to the original supplement/source code)? Does it maintain, for example, an EMA of the model parameters? Is it doing the same number of recycling iterations? Is it checkpointing in all of the same places? And so on.
  5. Try disabling the line in the training script that TorchScripts components of the model (the call to _script_preset seems to degrade performance on occasion).
  6. I'm pretty sure this has no effect on performance, but there's no need for the --replace_sampler_ddp flag if you're just using 1 GPU.
  7. Maybe try running training with the --benchmark flag enabled.
  8. 13 iterations might be too few for the runtime to stabilize. It usually takes a lot longer for me, but, granted, most of my testing is done on 2080 Ti's.
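To make point 2 concrete, here is a minimal sketch of the kind of loader involved (illustrative only, not OpenFold's actual DummyDataLoader):

```python
import copy

class DummyDataLoader:
    """Yields the same pickled batch forever (illustrative sketch)."""

    def __init__(self, batch):
        # batch: a dict of tensors, several of which have requires_grad=True
        self.batch = batch

    def __iter__(self):
        while True:
            # Deep-copying hands each iteration fresh tensors. Yielding
            # self.batch directly would reuse the same grad-enabled tensors
            # across steps, which can change both behavior and timing.
            yield copy.deepcopy(self.batch)
```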

In any case, it's also possible that OpenFold's performance might have been affected by a recent change. Again, I'll be repeating our runtime tests on A100s ASAP (that might take a few days, though).

gahdritz avatar Dec 19 '21 02:12 gahdritz

Thank you .

  1. I wanted to remove the effect of the data loader, so I created dummy data for the benchmark. BTW, this data isn't from tests/test_data; I created it myself. Its crop size is 256.
  2. Just tried it; the speed is the same.
  3. Removing DeepSpeed is slower (22 s for fp16). I think DeepSpeed's fused_adam may cause this difference (see the sketch after this list).
  4. I am not sure about that. I think the recycling number may cause this; I will make it the same and fixed. For the other factors, do you think they could cause such a large speed difference? Update: I just checked the code at https://github.com/dptech-corp/Uni-Fold/blob/8cc7bcedc23efd53a2f0b11de6657c61ac9204f5/unifold/model/modules.py#L481 (search for hk.remat in this file), and it seems the places of activation checkpointing are the same.
  5. Removing it is indeed faster, by about 1 s for fp16.
  6. Just tried it; the speed is almost the same.
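Regarding point 3, the difference I have in mind is roughly this optimizer swap (a sketch with a placeholder model and hyperparameters, not the actual training code):

```python
import torch

try:
    # DeepSpeed's fused CUDA Adam kernel
    from deepspeed.ops.adam import FusedAdam as Adam
except ImportError:
    # plain PyTorch Adam, the fallback when DeepSpeed is removed
    from torch.optim import Adam

model = torch.nn.Linear(256, 256).cuda()
optimizer = Adam(model.parameters(), lr=1e-3, eps=1e-8)
```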

guolinke avatar Dec 19 '21 03:12 guolinke

  9. Try disabling contiguous_gradients in DeepSpeed.

Continuing the discussion on previous points:

  1. Since the crop size of the sample batch isn't right, I still think it's important to use a fresh one. If you want to discount the runtime of the dataloader, you can pickle a real batch and then rerun the tests using the original DummyDataLoader.
  3. I'm surprised to see such a big DeepSpeed performance difference; I've never seen a difference of more than a second or two. Quite odd.
  4. Certainly, the placement and number of activation checkpoints, the number of recycling iterations, and so on can affect the training iteration runtime (see the sketch below).
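As a toy illustration of how checkpoint placement affects iteration time (generic torch.utils.checkpoint usage, not OpenFold's or Uni-Fold's actual code):

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList([torch.nn.Linear(128, 128) for _ in range(48)])

def forward(x, checkpoint_every=1):
    # A checkpointed block discards its activations in the forward pass and
    # recomputes them during backward. Checkpointing more or fewer blocks
    # trades memory against extra compute, which shows up in iteration time.
    for i, block in enumerate(blocks):
        if i % checkpoint_every == 0:
            x = checkpoint(block, x)
        else:
            x = block(x)
    return x

x = torch.randn(4, 128, requires_grad=True)
forward(x, checkpoint_every=2).sum().backward()
```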

(N.B. - I added 7. and 8. straight to my previous reply, possibly after you already responded. Sorry about that!)

gahdritz avatar Dec 19 '21 03:12 gahdritz

Here's a datapoint in the meantime. Using the right-out-of-the-box setting from the same commit (c4d9f57), with the real dataloader, the slow cache clearing, DeepSpeed stage 2, CPU offloading, and the slow TorchScripting (so basically the worst-case scenario), I ran

python3 train_openfold.py data/ alignments/ /data/ga122/alphafold/pdb_mmcif/mmcif_files/ train_op_16 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --gpus 1 --replace_sampler_ddp=True --seed 44 --default_root_dir train_op_16 --deepspeed_config deepspeed_config.json --precision 16

on 1 consumer-grade 2080 Ti. After 13 iterations, I got:

[screenshot of per-iteration timings on the 2080 Ti]

It makes me think that something might be wrong with your torch/CUDA installation or something. I'm not sure.

gahdritz avatar Dec 19 '21 03:12 gahdritz

I use the Docker image mmdog/pytorch:pytorch1.10.0-cuda11.3 to run it. I think I found the problem: with my created dummy data, openfold fixes the recycling number at 4 (3 no_grad + 1 grad), while Uni-Fold randomly samples from [0, 3] + 1. So I ran Uni-Fold with a fixed 3+1 recycling number again.
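For clarity, the two recycling schemes differ roughly as follows (a sketch with a made-up model interface, not the actual OpenFold or Uni-Fold API):

```python
import torch

def run_with_recycling(model, batch, max_recycles=3, randomize=False):
    # Fixed mode runs max_recycles no_grad passes (what my dummy data triggers
    # in openfold); randomized mode samples the count uniformly from
    # [0, max_recycles] (what Uni-Fold does by default).
    n = torch.randint(0, max_recycles + 1, ()).item() if randomize else max_recycles

    prev = None
    with torch.no_grad():          # the "n no_grad" passes
        for _ in range(n):
            prev = model(batch, prev)
    return model(batch, prev)      # the final "+1 grad" pass
```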

The updated results:

|          | FP32    | FP16 |
|----------|---------|------|
| openfold | 22.5 s  | 16 s |
| Uni-Fold | 18.44 s | 12 s |

The result is much closer now.

BTW, I also updated the comment above (https://github.com/aqlaboratory/openfold/issues/34#issuecomment-997321701). I will try disabling ema and the other suggestions later.

guolinke avatar Dec 19 '21 04:12 guolinke

  1. Disabling ema is slightly faster, by about 0.1 s.
  2. --benchmark gives almost the same speed.
  3. Disabling contiguous_gradients gives almost the same speed.

guolinke avatar Dec 19 '21 04:12 guolinke

Hm. I'll try to think of more discrepancies. I think there still have to be more; even if the 6.5-7s A100 time doesn't pan out, we shouldn't be getting essentially the same times on the A100 and 2080 Ti, especially considering the optimizations you've made.

gahdritz avatar Dec 19 '21 04:12 gahdritz

With uniform random recycling over [1, 4], the fp16 speed for openfold is about 11.9 s. I am trying to create the real data for testing, but the download and preprocessing are very slow. It would be great if you could share a small toy dataset for the test, like the one you used in the screenshot above.

guolinke avatar Dec 19 '21 04:12 guolinke

Yeah no problem. How best can I get it to you?

gahdritz avatar Dec 19 '21 04:12 gahdritz

Thank you. Whichever way is convenient for you, e.g. Google Drive or Dropbox. My email is [email protected]

guolinke avatar Dec 19 '21 04:12 guolinke

Gently pinging @gahdritz about the data sharing.

guolinke avatar Dec 21 '21 09:12 guolinke

Sent.

Our A100 results were obtained using the following:

CUDA Driver 465.19.01
CUDA 11.3 Update 1 (11.3.1.005)
cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
cuDNN 8.2.1.32
NCCL 2.9.9
PyTorch 1.9.0a0+c3d40fd

and with cache clearing disabled (but using the real dataloader).

gahdritz avatar Dec 21 '21 19:12 gahdritz

Thank you very much! I received the data. It seems the data doesn't include the template part (template_mmcif_dir and mmcif_cache.json); is that not needed?

guolinke avatar Dec 22 '21 03:12 guolinke

The mmcif cache isn't required, but the template mmCIFs are. I'll send those over now.

gahdritz avatar Dec 22 '21 17:12 gahdritz

> Sent.
>
> Our A100 results were obtained using the following:
>
> CUDA Driver 465.19.01
> CUDA 11.3 Update 1 (11.3.1.005)
> cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
> cuDNN 8.2.1.32
> NCCL 2.9.9
> PyTorch 1.9.0a0+c3d40fd
>
> and with cache clearing disabled (but using the real dataloader).

Have you tried running it with bfloat16? Doesn't seem to be working.

File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy log_p = torch.nn.functional.logsigmoid(logits) RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'

I'm also a bit surprised that the model params size is not adjusted. It should be half the size, same as with fp16, right?

lhatsk avatar Dec 23 '21 18:12 lhatsk

Yes, we have tested bfloat16, and it's a lot better than fp16, but you'll need PyTorch 1.10 for that. The test I referenced previously used fp16.

gahdritz avatar Dec 23 '21 19:12 gahdritz

Strange, I'm already on torch 1.10.1+cu113. Better in terms of what?

lhatsk avatar Dec 23 '21 19:12 lhatsk

You won't NaN anymore.

Have you updated your DeepSpeed config for bf16 training?
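For reference, the switch I mean is roughly the following; treat it as a sketch, since the exact key names can differ between DeepSpeed versions:

```python
# Sketch: enable bf16 instead of fp16 in the DeepSpeed config dict.
deepspeed_config = {
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
}
```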

gahdritz avatar Dec 23 '21 19:12 gahdritz

I'm not using DeepSpeed in this experiment; I just switched on precision="bf16" in PyTorch Lightning.

lhatsk avatar Dec 23 '21 19:12 lhatsk

Hm. Could you test it with DeepSpeed one time? That's what our test used. I'd repeat the test without DeepSpeed myself, but the A100s we've been using are borrowed and not currently accessible.

gahdritz avatar Dec 23 '21 19:12 gahdritz

It works, but it OOMs, which I believe it doesn't with FP16. Re-running the latter now. That's why I was wondering about the parameter size.

lhatsk avatar Dec 23 '21 19:12 lhatsk

That's kind of weird. How much memory do you have on your A100s?

gahdritz avatar Dec 23 '21 19:12 gahdritz

40 GB. Single batch. I now cap validation targets at 700 AA, which did the trick.

lhatsk avatar Dec 23 '21 19:12 lhatsk

Just 700? That's very odd. Is grad being enabled for validation runs or something?

gahdritz avatar Dec 23 '21 19:12 gahdritz

I didn't really test the limit; it can probably be a bit larger (I tested it on a v100s with 32 GB). There were some 1k+ AA targets in the set beforehand.

Not sure about grad being enabled. Was wondering the same, but manually switching to no_grad didn't do anything and it's much faster compared to training on crops.

lhatsk avatar Dec 23 '21 20:12 lhatsk

Actually, on second thought, it's not very weird that really long validation proteins should fail: chunking isn't enabled by default during validation, so you'll get much worse memory performance than during inference.
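By "chunking" I mean, roughly, slicing a large axis and processing it piece by piece so only one slice's activations are live at a time (a generic sketch, not OpenFold's actual implementation):

```python
import torch

def chunked_apply(fn, x, chunk_size, dim=0):
    # Apply fn to slices of x along `dim` and concatenate the results,
    # capping peak activation memory at roughly one chunk's worth.
    outs = [fn(part) for part in torch.split(x, chunk_size, dim=dim)]
    return torch.cat(outs, dim=dim)

# Example with a row-independent op, so chunking gives identical results:
W = torch.randn(256, 1024)
fn = lambda t: torch.nn.functional.gelu(t @ W)
x = torch.randn(4096, 256)
y = chunked_apply(fn, x, chunk_size=256)  # same output, lower peak memory
```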

gahdritz avatar Dec 23 '21 20:12 gahdritz

No OOM with FP16...

lhatsk avatar Dec 23 '21 20:12 lhatsk

Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.

gahdritz avatar Dec 23 '21 20:12 gahdritz