Gustaf Ahdritz

170 comments by Gustaf Ahdritz

Here's a datapoint in the meantime. Using the right-out-of-the-box setting from the same commit (c4d9f57), with the real dataloader, the slow cache clearing, DeepSpeed stage 2, CPU offloading, and the...
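For reference, here is a minimal sketch of what a DeepSpeed ZeRO stage-2 config with optimizer CPU offloading looks like, expressed as a Python dict (the same keys go in a JSON config file). The key names are standard DeepSpeed fields; the values are illustrative, not the ones from the run described above.

```python
# Sketch of a DeepSpeed ZeRO stage-2 config with optimizer CPU offloading.
# Key names are standard DeepSpeed fields; values are illustrative only.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative
    "zero_optimization": {
        "stage": 2,                        # ZeRO stage 2: shard optimizer state and gradients
        "offload_optimizer": {
            "device": "cpu",               # keep optimizer state in CPU memory
            "pin_memory": True,
        },
    },
}
```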

Hm. I'll try to think of more discrepancies. I think there still have to be more; even if the 6.5-7s A100 time doesn't pan out, we shouldn't be getting essentially...

Sent. Our A100 results were obtained using the following:
- CUDA Driver 465.19.01
- CUDA 11.3 Update 1 (11.3.1.005)
- cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
- cuDNN 8.2.1.32
- NCCL 2.9.9
- PyTorch 1.9.0a0+c3d40fd...
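If it helps for comparison, a quick way to print the corresponding versions from your own environment (the driver version isn't exposed by PyTorch, so that one still comes from nvidia-smi):

```python
import torch

# Print library versions for comparison with the list above.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("NCCL:", torch.cuda.nccl.version())
# The CUDA driver version isn't reported by PyTorch; check `nvidia-smi` for that.
```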

The mmCIF cache isn't required, but the template mmCIFs are. I'll send those over now.

Yes, we have tested bfloat16, and it's a lot better than fp16, but you'll need PyTorch 1.10 for that. The test I referenced previously used fp16.
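In case it's useful, a small check (assuming PyTorch 1.10 or newer) for whether bf16 is available and how to request it from autocast; the tensors are just placeholders:

```python
import torch

# Requires PyTorch 1.10+ (assumed minimum version for bf16 autocast support).
print(torch.cuda.is_bf16_supported())  # True on A100-class GPUs

# Ask autocast for bfloat16 instead of the default float16.
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    x = torch.randn(8, 8, device="cuda")
    y = x @ x  # matmul runs in bf16 under autocast
```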

You won't NaN anymore. Have you updated your DeepSpeed config for bf16 training?
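For concreteness, a hypothetical fragment of the bf16 switch in a DeepSpeed config; the keys are standard DeepSpeed fields, and the rest of the config is omitted.

```python
# Hypothetical DeepSpeed config fragment for bf16 training (expressed as a
# Python dict; the same keys go in a JSON config file).
ds_config = {
    "bf16": {"enabled": True},   # turn bfloat16 training on
    "fp16": {"enabled": False},  # and make sure fp16 is off at the same time
}
```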

Hm. Could you test it with DeepSpeed one time? That's what our test used. I'd repeat the test without DeepSpeed myself, but the A100s we've been using are borrowed and...

That's kind of weird. How much memory do you have on your A100s?

Just 700? That's very odd. Is grad being enabled for validation runs or something?
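To illustrate the suspicion about grad being left on: validation should normally run under torch.no_grad(), otherwise autograd holds onto activations for a backward pass that never happens and memory climbs. A minimal, self-contained sketch (the Linear model is just a stand-in):

```python
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the real model
batch = torch.randn(4, 16)       # stand-in for a validation batch

model.eval()
with torch.no_grad():            # no graph is built, so activations aren't retained
    out = model(batch)

print(out.requires_grad)  # False: grad is disabled for this validation pass
```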