Structural tokenizer (`PdbQuantizer`) is too slow at processing long proteins
Hi team,
Thanks for the great work. How long should it take your pre-trained PdbQuantizer to process a protein of length 400? On my machine it is surprisingly slow, and I am trying to figure out why.
Thanks for your help!
Hi, thanks for your interest in our work. We will release an accelerated PdbQuantizer with multithreaded parallel processing next month.
Timings for the accelerated version:
| Protein name (UniProt ID) | Length (no. of local structures) | Splitting into local structures | Encoding |
|---|---|---|---|
| CCDB_ECOLI_Adkar_2012 | 101 | 0.29s | 4.43s |
| ESTA_BACSU_Nutschel_2020 | 212 | 0.67s | 4.27s |
| PTEN_HUMAN_Matreyek_2021 | 403 | 1.06s | 4.45s |
| ENV_HV1B9_DuenasDecamp_2016 | 853 | 3.24s | 5.63s |
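The numbers above suggest that splitting scales roughly linearly with protein length, while encoding time is nearly constant across lengths, which points to a fixed start-up overhead rather than per-residue work. A quick sketch computing per-residue splitting cost from the reported figures (data copied from the table; no library calls assumed):

```python
# Reported timings: length -> (splitting_s, encoding_s), copied from the table above.
timings = {
    101: (0.29, 4.43),
    212: (0.67, 4.27),
    403: (1.06, 4.45),
    853: (3.24, 5.63),
}

for length, (split_s, encode_s) in timings.items():
    # Splitting cost per residue stays in the low-ms range; encoding barely
    # changes with length, consistent with a constant overhead.
    print(f"{length:4d} residues: {1000 * split_s / length:.1f} ms/residue split, "
          f"{encode_s:.2f}s encode")
```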
Thanks! Looking forward to the release!
Hello, when will it be released?
Hi, please check the new quantizer.py. (Make sure that you have installed pathos in your Python environment: `pip install pathos`.)
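For anyone curious how pathos-style parallelism speeds this up: below is a minimal sketch of quantizing many PDB files in parallel. It uses the stdlib `ProcessPoolExecutor`; pathos's `ProcessingPool` exposes an equivalent `map()` interface but pickles with dill, so it can also ship bound methods to workers. `quantize_one` here is a hypothetical stand-in for a real PdbQuantizer call, not the repo's actual function.

```python
from concurrent.futures import ProcessPoolExecutor

def quantize_one(pdb_path: str) -> tuple[str, int]:
    # Stand-in for: PdbQuantizer(structure_vocab_size=2048)(pdb_path).
    # Returns a placeholder "token count" so the sketch is self-contained.
    return pdb_path, len(pdb_path)

def quantize_batch(pdb_paths: list[str], workers: int = 4) -> dict[str, int]:
    # Each worker process handles one protein at a time; CPU-bound work like
    # structure splitting benefits from processes rather than threads.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(quantize_one, pdb_paths))

if __name__ == "__main__":
    print(quantize_batch(["p1.pdb", "p2.pdb", "p3.pdb"]))
```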
Still too slow on my side. Could you tell us how to use the new quantizer.py? I am using it as suggested:
```python
from prosst.structure.quantizer import PdbQuantizer

processor = PdbQuantizer(structure_vocab_size=2048)  # vocab size can be 20, 128, 512, 1024, 2048, or 4096
result = processor("example_data/p1.pdb", return_residue_seq=False)
```
I want to know too! It's super slow. :(
It takes me about 2 hours to process 30 protein complexes of about 400 AA each. How can I accelerate this process?
We're excited to share that we've just merged a significant optimization contributed by mdanzi, which took a batch of 100 proteins from about 7 hours down to about 80 seconds. Thanks again to mdanzi for the excellent contribution!