Residue annotation always returns zero tensor

Open dandingsky opened this issue 9 months ago • 0 comments

Hi, I'm trying to run through the residue annotation pipeline. I noticed that in encode_decode.py we have:

ra_tokens = residue_annotations_tokenizer.tokenize(
    {
        "interpro_site_descriptions": descriptions,
        "interpro_site_starts": starts,
        "interpro_site_ends": ends,
    },
    sequence=sequence,
    fail_on_mismatch=True,
)

However, when I step into residue_annotations_tokenizer.tokenize(), I find that it always returns all pad tokens if the input is missing the interpro_site_residues field:

if any(
    sample.get(field) is None
    for field in [
        "interpro_site_descriptions",
        "interpro_site_starts",
        "interpro_site_ends",
        "interpro_site_residues",
    ]
):
    return ["<pad>"] * seqlen

That is exactly the situation in encode_decode.py, where the call above does not pass interpro_site_residues. As a result, all residue annotations come out as zeros. Is this intentional, or am I missing something?
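To make the behavior easy to reproduce without the full pipeline, here is a minimal sketch that isolates only the guard quoted above. The function name `early_return_check`, the example sequence, and the sample values are hypothetical and not taken from the library; the real tokenize() does more after this check, and the expected format of interpro_site_residues is not shown in this issue.

```python
# Field names copied from the guard quoted in this issue.
REQUIRED_FIELDS = [
    "interpro_site_descriptions",
    "interpro_site_starts",
    "interpro_site_ends",
    "interpro_site_residues",
]


def early_return_check(sample, seqlen):
    """Hypothetical stand-in that reproduces only the quoted guard."""
    if any(sample.get(field) is None for field in REQUIRED_FIELDS):
        return ["<pad>"] * seqlen
    return None  # the real tokenizer would continue tokenizing here


# Made-up example input with the same three keys that encode_decode.py passes.
sequence = "MKTAYIAK"
sample = {
    "interpro_site_descriptions": ["active site"],
    "interpro_site_starts": [2],
    "interpro_site_ends": [4],
}

# Which required fields are absent from the sample?
missing = [f for f in REQUIRED_FIELDS if sample.get(f) is None]
tokens = early_return_check(sample, len(sequence))

print("missing fields:", missing)
print("tokens:", tokens)
```

With only those three keys present, the guard fires because interpro_site_residues is absent, so every position becomes a pad token regardless of the annotations supplied.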

Apr 10 '25 11:04 dandingsky