pyhmmer icon indicating copy to clipboard operation
pyhmmer copied to clipboard

cannot reproduce example for ```pyhmmer.hmmer.hmmsearch```

Open erfanshekarriz opened this issue 1 year ago • 1 comments

Hello and hope all is well. I'm running into issues reproducing this code (https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Search-and-scan):

import time
import pyhmmer

t1 = time.time()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seqs:
        total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
        print(f"- hmmsearch without prefetching took {time.time() - t1:.3} seconds")

# pre-fetching targets - fast, but needs the whole target database in memory
t1 = time.time()
with pyhmmer.easel.SequenceFile("data/seqs/938293.PRJEB85.HG003687.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("data/hmms/bin/t2pks.h3m") as hmms:
    total = sum(len(hits) for hits in pyhmmer.hmmer.hmmsearch(hmms, seqs))
    print(f"- hmmsearch with prefetching took {time.time() - t1:.3} seconds")

My code is:

import os
import sys
import psutil
import argparse

import pyhmmer
from pyhmmer.easel import SequenceFile
from pyhmmer.plan7 import HMMFile
from pyhmmer.hmmer import hmmsearch
from pyhmmer.hmmer import hmmscan


def run_pyhmmer(proteins, hmmdb, tblout, domtblout, bitcutoff, cores_n):

        available_memory = psutil.virtual_memory().available
        proteins_size = os.stat(proteins).st_size
        database_size = os.stat(proteins).st_size
        input_size = proteins_size + database_size

        with HMMFile(hmmdb) as hmm_file:
                hmms = list(hmm_file)
                Z_val=len(hmms)

                with SequenceFile(proteins, digital=True) as seq_file:
                        if input_size < available_memory * 0.1:
                                print("Enough available memory!")
                                print("Pre-fetching targets into memory...")
                                seqs = seq_file.read_block()

                        else:
                                seqs = seq_file

                        print("Performing pyhmmer hmmsearch...")
                        all_hits = hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff)
                        print(type(all_hits)) # This output prints <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>

The documentation states that hmmsearch yields a TopHits object, but I seem to be getting <generator object run_pyhmmer.<locals>.<genexpr> at 0x7f18801a87b0>.

I also wanted to ask if I can run hmmsearch(hmms, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) instead of hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff) if my system has enough memory to store the db.

My hmm database is a large database of 31,150 HMM profiles (~3 GB), and my protein is a FAA output from prodigal with 208,993 proteins (76 MB) .

erfanshekarriz avatar Dec 09 '23 03:12 erfanshekarriz

I've noticed that putting the function into a list returns a list of TopHits objects:

tophits_list = list(hmmsearch(hmm_file, seqs, cpus=cores_n, Z=Z_val, bit_cutoffs=bitcutoff))

How would I then be able to write all the top hits into a single "domains" file? I've done this but seems like it's not the most efficient way:

with open(domtblout, "ab") as f:=
  all_hits_list[0].write(f, format="domains", header=True)
  for hits in all_hits_list[1:]:
      hits.write(f, format="domains", header=False)

I've read all over the documentation and can't seem to see any example of many HMMs run against manny proteins through pyhmmer.hmmer.hmmsearch()..

Would be nice to do a walk-through of what I'm trying to achieve for anyone planning to do the same!

Once I figure it out happy to contribute to the walkthrough.

Best,

Erfan

erfanshekarriz avatar Dec 09 '23 05:12 erfanshekarriz