ProSST: Discrepancy in structure quantizer outputs and request for original PDB files used in the proteingym_benchmark.zip

Thank you for your open-source work!

Regarding the structure quantizer, I tried the example provided in the README but obtained different outputs. Below is my code and the generated output:

from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048)
structure_sequences = processor("example_data/p1.pdb", return_residue_seq=False)
structure_sequence_offset = [[i + 3 for i in struc_seq] for struc_seq in structure_sequences][0]

My Output:

[1689,774,774,1869,774,1880,1893,1978,1526,1526,1471,799,1526,789,58,664,1471,1471,664,664,935,1471,1715,935,799,799,1978,26,1528,45,1893,1893,26,216,45,1077,1077,45,1471,1655,1674,1893,1893,1528,1929,26,1988,1037,26,1978,799,935,58,1037,463,664,1471,1669,975,935,1526,1526,799,26,1674,1471,1526,1674,1526,26,1674,1988,1528,341,1279]

Expected Output from README:

[407, 998, 1841, 1421, 653, 450, 117, 822, ...]

Additionally, could you provide the original PDB files used in proteingym_benchmark.zip for validation? The structure sequences I generated from PDB files downloaded directly from ProteinGym also differ from the expected results.

Thank you for your help!

Mar 17 '25 09:03 yangguang8112

Thank you for your detailed feedback! After verification, we found that the discrepancy arises because the example outputs in the README were generated with an early pre-release checkpoint. The officially released checkpoints have since been optimized and adjusted, which is why their outputs differ from the README example.

May 29 '25 07:05 Tpan1039-ui

We have updated the structure token generator; please give it a try! The original PDB dataset is the same as the one used by ProtSSN, which can be downloaded from Hugging Face.
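
For anyone re-running the quantizer after the update, a quick way to see where a generated sequence diverges from a reference is to compare it token by token. The snippet below is only a minimal sketch: `first_mismatch` is a hypothetical helper, not part of ProSST, and the two lists are just the first eight tokens quoted in this issue (the README prefix is truncated there, so only those eight are compared).

```python
from typing import Optional, Sequence


def first_mismatch(actual: Sequence[int], expected_prefix: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the prefix matches.

    Hypothetical helper for illustration; it is not part of the ProSST package.
    """
    for idx, want in enumerate(expected_prefix):
        # A missing token in `actual` counts as a mismatch at that position.
        if idx >= len(actual) or actual[idx] != want:
            return idx
    return None


# First eight tokens reported in this issue (after the +3 offset).
reported = [1689, 774, 774, 1869, 774, 1880, 1893, 1978]
# Truncated prefix shown as the expected output in the README.
readme_prefix = [407, 998, 1841, 1421, 653, 450, 117, 822]

mismatch_at = first_mismatch(reported, readme_prefix)
if mismatch_at is None:
    print("Generated tokens match the reference prefix.")
else:
    print(f"Sequences diverge at position {mismatch_at}.")
```

With the values quoted above, the two sequences already differ at the first token, which is consistent with the outputs coming from different checkpoints rather than from a small local difference in the input file.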

May 29 '25 13:05 tyang816
