Extract full sequence in database of HHblits hits

Open realtman opened this issue 3 years ago • 2 comments

Is there a recommended method for extracting the full sequences (i.e. not just the aligned part) of hits from an hhr file output by HHblits, or even from individual sequence ids?

In particular I would like to do this for the BFD database, for which the ffdata file is too large to parse, and the ffindex file does not seem to have sequence ids listed along with the offsets.

Thanks!

realtman, Aug 11 '21 04:08

I am having the same problem: I would like to extract full sequences from the BFD database and cannot find a way to do so. Does anyone have a solution for this? Many thanks in advance!

eric-jm-lang, Oct 29 '21 14:10

My workflow is to create a separate text file connecting the a3m headers with the a3m file indexes in ffindex:

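# Paths to the BFD a3m database and to the name table to write
ffdata_file  = "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata"
ffindex_file = "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex"
output_file  = "bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_names.txt"

# Open ffdata in binary mode so that seek() takes plain byte offsets
with open(ffindex_file) as index, open(ffdata_file, 'rb') as data:
    with open(output_file, 'w') as out:
        for index_line in index:
            # Each ffindex line holds: entry name, byte offset, entry length
            file, start, length = index_line.split()
            # Skip the first byte of the entry (the marker that starts the header line)
            data.seek(int(start) + 1)
            data_line = data.readline().decode()
            # The first whitespace-delimited token is the sequence id
            header = data_line.split(maxsplit = 1)[0]
            out.write(f'{file}\t{header}\n')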

The a3m alignments can then be extracted by first looking up the file indexes in *names.txt and then fetching them with ffindex_get.
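
The lookup step could also be done directly in Python instead of calling ffindex_get. This is only a minimal sketch: the helper names `load_names`, `find_entry` and `read_entry` are made up for illustration, and it assumes that the names file is tab-separated as written by the script above and that each ffdata entry may end with a null byte. Note that the names file lists only the first header of each a3m, so only those ids can be looked up this way.

```python
def load_names(names_path):
    """Map each a3m header id to its ffindex entry name (hypothetical helper)."""
    names = {}
    with open(names_path) as handle:
        for line in handle:
            entry, header = line.rstrip("\n").split("\t")
            names[header] = entry
    return names


def find_entry(ffindex_path, entry_name):
    """Scan the ffindex for entry_name; return (offset, length) or None."""
    with open(ffindex_path) as index:
        for line in index:
            name, start, length = line.split()
            if name == entry_name:
                return int(start), int(length)
    return None


def read_entry(ffdata_path, offset, length):
    """Read one entry from ffdata by byte offset and length."""
    with open(ffdata_path, "rb") as data:
        data.seek(offset)
        # Assumption: the entry may be terminated by a null byte; drop it if present
        return data.read(length).rstrip(b"\0").decode()
```

With these helpers, `find_entry(ffindex_file, load_names(output_file)[seq_id])` would give the offset and length to pass to `read_entry`, where `seq_id` is a placeholder for the id of interest. For many lookups it would be cheaper to load the whole ffindex into a dictionary once rather than scanning it per query.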

alephreish, May 08 '23 12:05