pyhmmer
pyhmmer copied to clipboard
cannot reproduce example for ```pyhmmer.hmmer.hmmsearch```
Hello and hope all is well. I'm running into issues reproducing this code (https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Search-and-scan):
import time
import pyhmmer
t1 = time.time()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")
# pre-fetching targets - fast, but needs the whole target database in memory
t1 = time.time()
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")
My code is:
import os
import sys
import psutil
import argparse
import pyhmmer
from pyhmmer.easel import SequenceFile
from pyhmmer.plan7 import HMMFile
from pyhmmer.hmmer import hmmsearch
from pyhmmer.hmmer import hmmscan
def run_pyhmmer(proteins, hmmdb, tblout, domtblout, bitcutoff, cores_n):
available_memory = psutil.virtual_memory().available
proteins_size = os.stat(proteins).st_size
database_size = os.stat(proteins).st_size
input_size = proteins_size + database_size
with HMMFile(hmmdb) as hmm_file:
hmms = list(hmm_file)
Z_val=len(hmms)
with SequenceFile(proteins, digital=True) as seq_file:
if input_size < available_memory * 0.1:
print("Enough available memory!")
print("Pre-fetching targets into memory...")
seqs = seq_file.read_block()
else:
seqs = seq_file
print("Performing pyhmmer hmmsearch...")
all_hits = hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
print(type(all_hits)) # This output prints <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>
The documentation states that hmmsearch yields a TopHits
object, but I seem to be getting <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>
.
I also wanted to ask if I can run hmmsearch(hmms, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
instead of hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
if my system has enough memory to store the db.
My hmm database is a large database of 31,150 HMM profiles (~3 GB), and my protein is a FAA output from prodigal with 208,993 proteins (76 MB) .
I've noticed that putting the function into a list returns a list of TopHits objects:
tophits_list = list(hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff))
How would I then be able to write all the top hits into a single "domains" file? I've done this but seems like it's not the most efficient way:
with open(domtblout, "ab") as f:=
all_hits_list[0].write(f, format="domains", header=True)
for hits in all_hits_list[1:]:
hits.write(f, format="domains", header=False)
I've read all over the documentation and can't seem to see any example of many HMMs run against manny proteins through pyhmmer.hmmer.hmmsearch().
.
Would be nice to do a walk-through of what I'm trying to achieve for anyone planning to do the same!
Once I figure it out happy to contribute to the walkthrough.
Best,
Erfan