hh-suite

scripts/pdbfilter.py does not filter out chains of the same protein in some cases.

Open huhlim opened this issue 2 years ago • 0 comments

:exclamation: Make sure to check out our User Guide.

Expected Behavior

When a PDB entry has multiple chains of the same protein, I expect the script to keep only one of them. For example, with the following input files:

  • cluster.tsv
4APC_A	4APC_A
4APC_A	4APC_B
4APC_A	4B9D_A
4APC_A	4B9D_B
  • pdb_filter.dat
#pdb_chain	resolution	r_free	completeness	method
4APC_A	2.1	0.248	0.837	X-RAY DIFFRACTION
4APC_B	2.1	0.248	0.846	X-RAY DIFFRACTION
4B9D_A	1.9	0.222	0.829	X-RAY DIFFRACTION
4B9D_B	1.9	0.222	0.843	X-RAY DIFFRACTION

I expected a single representative sequence in the output, but the script kept two sequences, 4B9D_B and 4APC_B.
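To make the expected behavior concrete, here is a minimal sketch that picks exactly one representative per cluster from the two inputs above. It is not the logic of pdbfilter.py. The helper names (`parse_clusters`, `parse_annotations`, `pick_representative`) are hypothetical, and the ranking is an assumption: lowest resolution first, then lowest R-free, then highest completeness as tie-breakers. It also assumes every annotation value is numeric, which may not hold for a real pdb_filter.dat.

```python
from collections import defaultdict

# Inline copies of the example inputs (whitespace-separated here; the real files use tabs).
CLUSTER_TSV = """\
4APC_A 4APC_A
4APC_A 4APC_B
4APC_A 4B9D_A
4APC_A 4B9D_B
"""

PDB_FILTER_DAT = """\
#pdb_chain resolution r_free completeness method
4APC_A 2.1 0.248 0.837 X-RAY DIFFRACTION
4APC_B 2.1 0.248 0.846 X-RAY DIFFRACTION
4B9D_A 1.9 0.222 0.829 X-RAY DIFFRACTION
4B9D_B 1.9 0.222 0.843 X-RAY DIFFRACTION
"""


def parse_clusters(text):
    """Map each cluster name (first column) to the list of its member chains."""
    clusters = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            cluster, member = line.split()[:2]
            clusters[cluster].append(member)
    return clusters


def parse_annotations(text):
    """Map each chain to (resolution, r_free, completeness), skipping the header."""
    annotations = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        chain, resolution, r_free, completeness = line.split()[:4]
        annotations[chain] = (float(resolution), float(r_free), float(completeness))
    return annotations


def pick_representative(members, annotations):
    """Return one chain per cluster, or None if no member is annotated."""
    annotated = [m for m in members if m in annotations]
    if not annotated:
        return None

    def rank(chain):
        resolution, r_free, completeness = annotations[chain]
        # Lower resolution and R-free are better; higher completeness is better.
        return (resolution, r_free, -completeness)

    return min(annotated, key=rank)


clusters = parse_clusters(CLUSTER_TSV)
annotations = parse_annotations(PDB_FILTER_DAT)
representatives = {
    cluster: pick_representative(members, annotations)
    for cluster, members in clusters.items()
}
print(representatives)
```

With this ranking the single cluster 4APC_A keeps only 4B9D_B. A different tie-breaking rule could legitimately pick 4B9D_A instead; the point is that one cluster yields one chain.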

Current Behavior

For some clusters, the script keeps multiple chains of the same protein instead of one.

Steps to Reproduce (for bugs)

pdbfilter_debug.zip

unzip pdbfilter_debug.zip
cd pdbfilter_debug
pdbfilter.py input.fas cluster.tsv pdb_filter.dat output.fas

Suggested debugging

Since these chains are all in the same cluster, I think they should yield a single representative sequence. I think this can be fixed by modifying the script like this:

        # Add exactly one representative per cluster, in priority order:
        # resolution, then R-free, then completeness.
        if best_entry_res is not None:
            selected_sequences.add(best_entry_res)

            if DEBUG:
                print(' - Selected {n} (best resolution = {r}).'.format(
                    n=best_entry_res,
                    r=best_res))

        elif best_entry_rfr is not None:
            selected_sequences.add(best_entry_rfr)

            if DEBUG:
                print(' - Selected {n} (best R-free = {r}).'.format(
                    n=best_entry_rfr,
                    r=best_rfr))

        elif best_entry_comp is not None:
            selected_sequences.add(best_entry_comp)

            if DEBUG:
                print(' - Selected {n} (best completeness = {r}).'.format(
                    n=best_entry_comp,
                    r=best_comp))

        else:
            # None of the three criteria produced a candidate.
            print('! Warning: Did not find any representative entry for cluster {c}.'.format(
                c=cluster))
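The following self-contained sketch illustrates why a chained selection matters. It rests on an unconfirmed hypothesis: that the extra chain appears when each criterion adds its own winner independently. That is consistent with the reported output (4APC_B has the best completeness) but has not been checked against the script source. The function names `best_by`, `select_independent` and `select_chained` are hypothetical, and which of the tied 4B9D chains wins depends on tie-breaking, which may differ from the script.

```python
# (resolution, r_free, completeness) for the four chains from pdb_filter.dat above.
ENTRIES = {
    "4APC_A": (2.1, 0.248, 0.837),
    "4APC_B": (2.1, 0.248, 0.846),
    "4B9D_A": (1.9, 0.222, 0.829),
    "4B9D_B": (1.9, 0.222, 0.843),
}


def best_by(entries, index, higher_is_better=False):
    """Return the chain with the best value in column `index`, or None if no values exist."""
    scored = {chain: values[index] for chain, values in entries.items()
              if values[index] is not None}
    if not scored:
        return None
    chooser = max if higher_is_better else min
    return chooser(scored, key=scored.get)


def candidates(entries):
    """Winners for resolution, R-free and completeness, in priority order."""
    return (
        best_by(entries, 0),
        best_by(entries, 1),
        best_by(entries, 2, higher_is_better=True),
    )


def select_independent(entries):
    """Hypothesized behavior: every criterion contributes its own winner."""
    return {winner for winner in candidates(entries) if winner is not None}


def select_chained(entries):
    """if/elif-style behavior: only the first available criterion contributes."""
    for winner in candidates(entries):
        if winner is not None:
            return {winner}
    return set()


independent = select_independent(ENTRIES)
chained = select_chained(ENTRIES)
print(len(independent), sorted(independent))
print(len(chained), sorted(chained))
```

On the example data the independent variant keeps two chains, one of them 4APC_B, while the chained variant keeps exactly one.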

huhlim, Aug 31 '21 00:08