gapseq

Subunit detection is returning wrong numbers

Open Porthmeus opened this issue 1 year ago • 0 comments

I just spotted this while rewriting the code for diamond. If the subunit nomenclature of the enzyme is denoted by letters, subunits I, V and X will get wrong numbers in complex_detection.R, because they are treated as Roman numerals.

So I would expect A -> 1, B -> 2, C -> 3, ..., I -> 9

but instead it returns I -> 1
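
One possible way to disambiguate is sketched below. It is written in Python for illustration only and is not the logic in complex_detection.R. The function name `labels_to_numbers` and the heuristic are assumptions: if all subunit labels seen for one enzyme are single letters and at least one of them is not I, V or X, the whole set is treated as alphabetical. Otherwise labels made of I, V and X are read as Roman numerals.

```python
import re

# Only the symbols mentioned in this issue; larger numerals are not handled.
ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(label):
    """Convert a numeral built from I, V and X (e.g. 'IV') to an integer."""
    total, highest = 0, 0
    for ch in reversed(label):
        value = ROMAN[ch]
        if value < highest:
            total -= value  # subtractive notation, e.g. the I in IV
        else:
            total += value
            highest = value
    return total


def labels_to_numbers(labels):
    """Map the subunit labels of ONE enzyme to numbers (hypothetical helper)."""
    labels = [label.strip().upper() for label in labels]
    single_letters = all(len(l) == 1 and l.isalpha() for l in labels)
    has_non_roman = any(l not in ROMAN for l in labels)

    if single_letters and has_non_roman:
        # Alphabetical nomenclature: A -> 1, B -> 2, ..., I -> 9
        return {l: ord(l) - ord("A") + 1 for l in labels}

    numbers = {}
    for l in labels:
        if re.fullmatch(r"[IVX]+", l):
            numbers[l] = roman_to_int(l)
        elif l.isdigit():
            numbers[l] = int(l)
        elif len(l) == 1 and l.isalpha():
            numbers[l] = ord(l) - ord("A") + 1
        else:
            numbers[l] = None  # unknown nomenclature
    return numbers
```

With this heuristic, `labels_to_numbers(["A", "B", "F", "I"])` maps I to 9, while `labels_to_numbers(["I", "II", "IV"])` maps I to 1. A set containing only I, V and X stays ambiguous and is read as Roman numerals here.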

You can test this, for example, with the single UniProt entry A0A7V5FFT7, or you can just use seq/Bacteria/unrev/1.6.5.3.fasta from the repository.

Furthermore, in the very same test case, the extraction of the subunits can fail if one of the keywords for detecting subunits is preceded by a single capital letter. For example, if the FASTA header looks like this: "UniRef50_U3TYP0 NADH dehydrogenase I chain F n=1 Tax=Plautia stali symbiont TaxID=891974 RepID=U3TYP0_9ENTR" the script will extract "I chain" as the subunit instead of the expected "chain F".
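
The extraction problem could be avoided by trying the "keyword followed by label" pattern before the "label followed by keyword" pattern. The following Python sketch illustrates that ordering. It is not the code in complex_detection.R: the keyword list, the label pattern and the function name `extract_subunit` are assumptions made for the example.

```python
import re

# Assumed keyword list; the keywords actually used by gapseq may differ.
KEYWORDS = r"(?i:subunit|chain|component|polypeptide)"
# A label is a single capital letter, a numeral built from I/V/X, or a number.
LABEL = r"(?:[A-Z]|[IVX]+|\d+)"

KEYWORD_FIRST = re.compile(rf"\b({KEYWORDS})\s+({LABEL})\b")  # "chain F"
LABEL_FIRST = re.compile(rf"\b({LABEL})\s+({KEYWORDS})\b")    # "F subunit"


def extract_subunit(header):
    """Return the subunit phrase from a FASTA header, or None if not found."""
    # Prefer "keyword label", so "I chain F" yields "chain F" and not "I chain".
    match = KEYWORD_FIRST.search(header)
    if match is None:
        # Fall back to "label keyword" only when nothing follows the keyword.
        match = LABEL_FIRST.search(header)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"
```

For the header quoted above this returns "chain F", and for a header such as "ATP synthase F subunit" the fallback still returns "F subunit".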

My current plan is to implement diamond only for the -p all option, and I will try to correct these errors there. However, I thought I would report it here, as I am not sure if and when I will find the time to finish it.

Porthmeus, Nov 07 '24 16:11