Parameterize truncation_seq_length?

Open wongkarenhy-hex opened this issue 2 years ago • 1 comments

Thanks for providing such a great tool for the community! I pulled the latest repo and noticed that docking fails on larger structures. I traced it to this one line of code. Would it be possible to parameterize truncation_seq_length? The error message was also a bit confusing: "LM embeddings for complex AF-P08518-F1 did not have the right length for the protein. Skipping AF-P08518-F1." Thanks!
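A minimal sketch of what the request above could look like, assuming a hypothetical standalone script. The names `build_parser`, `truncate_sequence`, and `check_embedding_length` are illustrative only and are not taken from the repository. The sketch exposes the limit as a command-line flag and makes the length-mismatch error say when truncation is the likely cause.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI: exposes the truncation limit instead of hard-coding it.
    parser = argparse.ArgumentParser(description="Hypothetical embedding preparation script")
    parser.add_argument(
        "--truncation_seq_length",
        type=int,
        default=None,
        help="Maximum number of residues kept per sequence; omit for no truncation.",
    )
    return parser


def truncate_sequence(sequence: str, max_len: Optional[int]) -> str:
    # Keep the whole sequence when no limit is given, else the first max_len residues.
    if max_len is None:
        return sequence
    return sequence[:max_len]


def check_embedding_length(
    name: str, embedding_len: int, sequence_len: int, max_len: Optional[int]
) -> None:
    # Raise a descriptive error when the embedding does not cover the full protein.
    if embedding_len == sequence_len:
        return
    hint = ""
    if max_len is not None and sequence_len > max_len and embedding_len == max_len:
        hint = (
            f" The embedding appears to have been truncated at {max_len} residues;"
            " consider raising --truncation_seq_length."
        )
    raise ValueError(
        f"LM embedding for {name} has length {embedding_len}, "
        f"but the protein has {sequence_len} residues.{hint}"
    )


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(truncate_sequence("MKTAYIAKQR", args.truncation_seq_length))
```

Defaulting the flag to `None` (no truncation) is a design choice of this sketch, not something the thread specifies; a real script might prefer to keep the current limit as the default so existing runs behave the same.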

May 04 '23 00:05 wongkarenhy-hex

Seconding this! With the truncation_seq_length specified in the README for the PDBBind dataset, only ~1% of data points came out uncorrupted.

[Image: Screenshot from 2023-05-07 14-51-15]

Removing the truncation_seq_length limit entirely allowed the cache_torsion for training to successfully process ~50% of the data points. (I had to rerun the embedding process, which took a surprisingly long time with no truncation. It generated a 42GB esm2_3billion_embeddings.pt, compared with the 10GB esm2_3billion_embeddings.pt produced with the README's truncation_seq_length.)

[Image: no_trunc_seq_length]
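The growth in file size is roughly what per-residue storage would predict. The sketch below is a back-of-the-envelope estimate, not the repository's code: `estimate_embedding_bytes` is a hypothetical helper, and it assumes one float32 vector per residue with 2560 values (the hidden size of the ESM-2 3B model), ignoring any file-format overhead.

```python
from typing import Iterable, Optional


def estimate_embedding_bytes(
    sequence_lengths: Iterable[int],
    embed_dim: int = 2560,        # assumption: hidden size of the ESM-2 3B model
    bytes_per_value: int = 4,     # assumption: embeddings stored as float32
    truncation_seq_length: Optional[int] = None,
) -> int:
    # Sum the residues that survive truncation, then multiply by bytes per residue.
    total_residues = 0
    for n in sequence_lengths:
        kept = n if truncation_seq_length is None else min(n, truncation_seq_length)
        total_residues += kept
    return total_residues * embed_dim * bytes_per_value


if __name__ == "__main__":
    lengths = [100, 2000]  # made-up sequence lengths, for illustration only
    print(estimate_embedding_bytes(lengths, truncation_seq_length=1000))
    print(estimate_embedding_bytes(lengths))
```

Under these assumptions each residue costs about 10 KB, so long sequences that were previously cut off account for most of the extra size.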

May 07 '23 18:05 JuLieAlgebra