
Issues regarding the usage of the ProtBERT tokenizer on multi-fasta-files

Open QuadratJunges opened this issue 1 year ago • 3 comments

While trying to tokenize sequences from a multi-FASTA list, the embeddings generated with the ProtBERT tokenizer and model are always the same numpy arrays, even though the input sequences differ significantly. Has anyone else faced this problem? Looking forward to some input.

The submitted sequence list looks like this: "['IISACLAGEKCRYTGDGFDYPALRKLVEEGKAIPVCPEVLGGLSVPRDPNEIIGGNGFDVLDGKAKVLTNRGVDTTAAFVKGAAEVLAIAQKKGARVAVLKERSPSCGSTMIYDGTFSGRRIPGCGCTAALLVKEGIRVFSEEN', 'RLLLIDGNSIAFRSFFALQNSLSRFTNADGLHTNAIYGFNKMLDIILDNVNPTDALVAFDAGKTTFRTKMYTNYKGGRAKTPSELTEQMPYLRDLLTGYGIKSYEL...]"

and the output arrays look like this: [array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149, -0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149, -0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149, -0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149, -0.04542159, 0.07880748]], dtype=float32), array([[-0.09921601, 0.05850809, -0.0922595 , ..., -0.00792149, -0.04542159, 0.07880748]], dtype=float32)]
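For what it's worth, here is a minimal sketch of the check I would run next (assuming the standard Hugging Face Rostlab/prot_bert tokenizer; SEQ_A and SEQ_B are just shortened copies of the first two sequences above), to see whether the tokenizer itself already maps different sequences to identical input_ids:

```python
from transformers import BertTokenizer

# Assumption: the tokenizer in question is the Hugging Face ProtBERT tokenizer.
TOKENIZER = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

# Shortened copies of the first two sequences from the list above.
SEQ_A = "IISACLAGEKCRYTGDGFDYPALRKLVEEGKAIPVCPEVLGG"
SEQ_B = "RLLLIDGNSIAFRSFFALQNSLSRFTNADGLHTNAIYGFNKM"

# Without spaces between residues, the whole sequence is one "word" that is not in
# the vocabulary, so it likely ends up as the same [CLS] [UNK] [SEP] ids for every input.
print(TOKENIZER(SEQ_A)["input_ids"])
print(TOKENIZER(SEQ_B)["input_ids"])

# With space-separated residues (as in the ProtBERT examples), the ids should differ.
print(TOKENIZER(" ".join(SEQ_A))["input_ids"])
print(TOKENIZER(" ".join(SEQ_B))["input_ids"])
```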

The function I'm currently using looks like the following:

```python
import torch

def EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL):
    EMBEDDINGS = []
    MODEL.eval()
    for SEQ in QUERY_SEQUENCES:
        INPUTS = TOKENIZER(SEQ, return_tensors="pt", padding=True,
                           truncation=True, max_length=1024)
        with torch.no_grad():
            OUTPUTS = MODEL(**INPUTS)
        # mean-pool the last hidden state into one vector per sequence
        EMBEDDING = OUTPUTS.last_hidden_state.mean(dim=1).cpu().numpy()
        EMBEDDINGS.append(EMBEDDING)
    return EMBEDDINGS

QUERIES = EMBED_SEQUENCE(QUERY_SEQUENCES, TOKENIZER, MODEL)
```
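And a quick sanity check on the returned list (a hypothetical follow-up, comparing only the first two entries), just to confirm the embeddings really are numerically identical:

```python
import numpy as np

# If the tokenizer is collapsing the inputs, every embedding compares equal.
print(np.allclose(QUERIES[0], QUERIES[1]))
```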

QuadratJunges · Sep 20 '24