Gustaf Ahdritz

Results: 170 comments of Gustaf Ahdritz

Actually on second thought it's not very weird that really long validation proteins should fail---chunking isn't enabled by default during validation, so you'll get much worse memory performance than during...

Did you actually mean V100s, or was that a typo? V100s don't have bfloat16 support.

Hm I haven't seen this one before. [This thread](https://github.com/LuxCoreRender/LuxCore/issues/490) indicates that it might have something to do with your CUDA version?

Could you verify the git commit hash of your installation + that the git diff shows no modifications to the model code?

Could you try disabling the custom CUDA kernels on the bad system? Specifically, make sure that all occurrences of `use_memory_efficient_kernel` and `use_flash` are set to `False` in the config.
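For reference, a minimal sketch of what "disable every occurrence" could look like, assuming the config is a plain nested dict (the actual OpenFold config object and its structure may differ; the layout below is illustrative only):

```python
def disable_custom_kernels(config):
    """Recursively set every kernel flag in a nested dict config to False, in place."""
    for key, value in config.items():
        if key in ("use_memory_efficient_kernel", "use_flash"):
            config[key] = False
        elif isinstance(value, dict):
            disable_custom_kernels(value)
    return config

# Hypothetical config layout, just to show the traversal:
cfg = {
    "model": {
        "evoformer": {"use_memory_efficient_kernel": True, "use_flash": True},
        "structure_module": {"use_flash": True},
    }
}
disable_custom_kernels(cfg)
```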

Ah my bad I never added it to the config. You'll have to disable `use_memory_efficient_kernel` manually in `openfold/model/evoformer.py`. There should only be one occurrence of it there; change the setting...

Weird. I'm not really sure what to make of this. Do you think it's worth trying to install OF from scratch on the 1080 system as a sanity check?

Thanks for the diagnostics. Weird that it's crashing on a stock PyTorch matmul and not the custom kernel immediately thereafter... Equally weird is that this is happening halfway through the...

This all happens automatically. The inference-time "chunking" you described is controlled by the `chunk_size` parameter of the config. Activation checkpointing is controlled by `blocks_per_ckpt`.
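To give a rough idea of what `chunk_size` controls, here's a toy sketch of chunked evaluation (the real implementation is tensor-based and more general; the function below is purely illustrative):

```python
def chunk_map(fn, xs, chunk_size=None):
    """Apply fn to xs in slices along the leading dimension.

    With chunk_size=None, fn sees the whole input at once. With a small
    chunk_size, only one slice's activations are alive at a time, which
    lowers peak memory at inference at some cost in speed.
    """
    if chunk_size is None:
        return fn(xs)
    out = []
    for i in range(0, len(xs), chunk_size):
        out.extend(fn(xs[i:i + chunk_size]))
    return out

# Same result either way; only peak memory differs:
double = lambda chunk: [x * 2 for x in chunk]
chunk_map(double, [1, 2, 3, 4, 5], chunk_size=2)  # → [2, 4, 6, 8, 10]
```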

Chunking is disabled during training because it doesn't save any memory when grad is enabled---activations for each chunk are still stored for the backwards pass. Furthermore, activation checkpoints can't be...
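A toy model of the argument above, counting "live activations" instead of real memory: with grad enabled, every chunk's activations must be retained for the backward pass, so peak storage matches the unchunked forward; with grad disabled, each chunk can be freed before the next one runs (illustration only, not OpenFold code):

```python
def peak_activations(xs, chunk_size, grad_enabled):
    """Simulate peak activation count for a chunked forward pass."""
    saved = []  # activations retained for backward
    peak = 0
    for i in range(0, len(xs), chunk_size):
        acts = [x * 2 for x in xs[i:i + chunk_size]]
        if grad_enabled:
            saved.extend(acts)           # must keep every chunk for backward
            peak = max(peak, len(saved))
        else:
            peak = max(peak, len(acts))  # chunk can be freed immediately
    return peak

xs = list(range(8))
peak_activations(xs, chunk_size=2, grad_enabled=False)  # → 2: scales with chunk size
peak_activations(xs, chunk_size=2, grad_enabled=True)   # → 8: no saving under grad
```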