Discrepancy in structure quantizer outputs and request for original PDB files used in the proteingym_benchmark.zip

Open yangguang8112 opened this issue 9 months ago • 2 comments

Thank you for your open-source work!

Regarding the structure quantizer, I tried the example provided in the README but obtained different outputs. Below is my code and the generated output:

from prosst.structure.quantizer import PdbQuantizer
processor = PdbQuantizer(structure_vocab_size=2048)
structure_sequences = processor("example_data/p1.pdb", return_residue_seq=False)
structure_sequence_offset = [[i + 3 for i in struc_seq] for struc_seq in structure_sequences][0]

My Output:

[1689,774,774,1869,774,1880,1893,1978,1526,1526,1471,799,1526,789,58,664,1471,1471,664,664,935,1471,1715,935,799,799,1978,26,1528,45,1893,1893,26,216,45,1077,1077,45,1471,1655,1674,1893,1893,1528,1929,26,1988,1037,26,1978,799,935,58,1037,463,664,1471,1669,975,935,1526,1526,799,26,1674,1471,1526,1674,1526,26,1674,1988,1528,341,1279]

Expected Output from README:

[407, 998, 1841, 1421, 653, 450, 117, 822, ...]

Additionally, could you provide the original PDB files used in the proteingym_benchmark.zip for validation? The structure sequences I generated using PDB files downloaded directly from Proteingym also differ from the expected results.

Thank you for your help!

yangguang8112 avatar Mar 17 '25 09:03 yangguang8112

Thank you for your detailed feedback! After verification, the discrepancy occurs because the example outputs in the README were generated using an early pre-release checkpoint. The officially released checkpoints have since been optimized and adjusted, so they produce different structure tokens from the ones shown there.
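One quick way to rule out the `+ 3` offset in the reporter's snippet as the cause is to compare the token prefixes both with and without that offset. The following is a minimal sketch, not part of ProSST: `count_matches` and `compare_prefix` are hypothetical helper names, and the two lists are simply the first eight tokens quoted in this thread.

```python
# Tokens copied from this thread: the reporter's first eight outputs (after the
# "+ 3" offset in their snippet) and the eight tokens the README shows.
reported = [1689, 774, 774, 1869, 774, 1880, 1893, 1978]
readme = [407, 998, 1841, 1421, 653, 450, 117, 822]


def count_matches(a, b):
    """Count positions where two sequences hold the same token."""
    return sum(x == y for x, y in zip(a, b))


def compare_prefix(observed, expected, offset=3):
    """Hypothetical helper: compare prefixes as-is and with the offset removed."""
    n = min(len(observed), len(expected))
    raw = count_matches(observed[:n], expected[:n])
    shifted = count_matches([t - offset for t in observed[:n]], expected[:n])
    return {"compared": n, "raw_matches": raw, "offset_removed_matches": shifted}


result = compare_prefix(reported, readme)
print(result)
```

For the eight tokens available here, both counts are zero, which is consistent with the two outputs coming from different checkpoints rather than from the offset.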

Tpan1039-ui avatar May 29 '25 07:05 Tpan1039-ui

We have updated the structure token generator, so please give it a try! The original PDB dataset is the same one used by ProtSSN, which can be downloaded from Hugging Face.

tyang816 avatar May 29 '25 13:05 tyang816