abricate
abricate copied to clipboard
`--mincov` filter applied after `culling_limit`
When looking for aminoglycoside resistance genes, Norelle and I noticed that using the ncbi
database, we weren't finding the genes we expected for some of our genomes.
For example, for this genome, we expected to find the gene aac(6')-Ib_1
, which is in both the resfinder
and ncbi
databases:
$ abricate ~/PacBio/final/fna/2014-19072.fna --db resfinder --mincov 90 | grep aac
Using nucl database resfinder: 2434 sequences - 2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 11 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_2 89936 90541 aac(6')-Ib_1 1-606/606 =============== 0/0 100.00 100.00 resfinder M21682 aac(6')-Ib_1
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_3 36519 37124 aac(6')-Ib_1 1-606/606 =============== 0/0 100.00 100.00 resfinder M21682 aac(6')-Ib_1
Instead, using the ncbi
database, we get this:
$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi | grep aminoglycoside
Using nucl database ncbi: 4324 sequences - 2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 17 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_2 76085 76900 aph(3')-Ia 1-816/816 =============== 0/0 100.00 100.00 ncbi A7J11_00274 aminoglycoside O-phosphotransferase APH(3')-Ia
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_2 89955 90541 A7J11_02581 806-1392/1392 ........======= 0/0 42.17 99.66 ncbi A7J11_02581 bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_2 90611 90829 aadA1 1-219/792 =====.......... 0/0 27.65 100.00 ncbi A7J11_04543 ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_3 27659 27769 aadA1 667-777/777 ............=== 0/0 14.29 99.10 ncbi A7J11_04568 ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_3 36538 37124 A7J11_02581 806-1392/1392 ........======= 0/0 42.17 99.66 ncbi A7J11_02581 bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_3 37194 37412 aadA1 1-219/792 =====.......... 0/0 27.65 100.00 ncbi A7J11_04543 ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
This identifies a partial hit to a bifunctional gene ANT(3'')-Ih/AAC(6')-Iid
instead. Applying the --mincov
filter in abricate also doesn't work:
$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi --mincov 90 | grep aminoglycoside
Using nucl database ncbi: 4324 sequences - 2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 9 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna 2014-19072_2 76085 76900 aph(3')-Ia 1-816/816 =============== 0/0 100.00 100.00 ncbi A7J11_00274 aminoglycoside O-phosphotransferase APH(3')-Ia
This just seems to apply this coverage filter after the blastn -culling_limit 1
filter has been applied.
When I make a blast database with the aac(6')
genes, the bifunctional gene is ranked higher due to a higher bitscore (longer alignment length).
I think it has something to do with where you apply the --mincov
filter in abricate
:
my $pccov = 100 * ($hit{length}-$hit{gaps}) / $hit{slen};
next unless $pccov >= $mincov;
You obviously still want to just report the top hit, but could you perhaps set -culling_limit 100, and then filter manually?
I am trying to get calls for the differen plasmidMLST (pmlst) genes. I have made a database for the schemes in Warwick. When I run abricate I get hits that I believe are not optimal. A longer hit with lower identity is preferred over a shorter hit with 100/100 cov/id. I checked this for the hit that abricate gives as FIA_17, but that I would expect to be FIA_2. After this I made a database with just the FIA_2 sequence and the 100/100 cov/id hit is called. Is there a way to prefer the 100/100 hits over the longer hits with less identity?
The results from the example: when using a database with only IncF FIA_2:
AWGS160030_S13_spades_jan19.fa NODE_27_length_40814_cov_43.958291 27638 28021 + FIA_2 1-384/384 =============== 0/0 100.00 100.00 pMLST GENE_ACC_unkown pMLST~~~FIA_2~~~GENE_ACC_unkown~~~incf incf
When using the full pmlst database:
AWGS160030_S13_spades_jan19.fa NODE_27_length_40814_cov_43.958291 27637 28045 + FIA_17 1-409/409 =============== 0/0 100.00 98.53 pMLST GENE_ACC_unkown pMLST~~~FIA_17~~~GENE_ACC_unkown~~~incf incf
AWGS160030_S13_spades_jan19.fa NODE_63_length_6039_cov_89.406800 4535 4691 - FII_1 1-157/157 =============== 0/0 100.00 100.00 pMLST GENE_ACC_unkown pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf incf
AWGS160030_S13_spades_jan19.fa NODE_63_length_6039_cov_89.406800 4649 4799 - FIC_3 51-200/200 ...=====/====== 1/1 75.00 97.35 pMLST GENE_ACC_unkown pMLST~~~FIC_3~~~GENE_ACC_unkown~~~incf incf
AWGS160030_S13_spades_jan19.fa NODE_65_length_5341_cov_56.738205 2736 3108 - FIB_20 1-373/373 =============== 0/0 100.00 100.00 pMLST GENE_ACC_unkown pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf incf
and applying the minid filter removes the hits, although the 100/100 gene is present:
AWGS160030_S13_spades_jan19.fa NODE_63_length_6039_cov_89.406800 4535 4691 - FII_1 1-157/157 =============== 0/0 100.00 100.00 pMLST GENE_ACC_unkown pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf incf
AWGS160030_S13_spades_jan19.fa NODE_65_length_5341_cov_56.738205 2736 3108 - FIB_20 1-373/373 =============== 0/0 100.00 100.00 pMLST GENE_ACC_unkown pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf incf
abricate version 0.9.8
Any suggestions on how to tackle this problem would be appreciated! (and I have posted in this issues because I think it may be the same problem)
@pepijnhuizinga Would mlst
be a better tool?
Can you provide a link to the Warwick plasmidMLST download page?
I think the problem is that I use -culling_limit
from the BLAST package to choose the "best hit" in a region, to avoid getting 100s of hits to the same locus.
I chose abricate above mlst because my goal was to get all the present alleles and with plasmid mlst its quite common for one isolate to have multiple variants of the same gene. For example the following would not be uncommon:
NODE_49_length_11049_cov_122.750778 3852 4008 + FII_31
NODE_50_length_10220_cov_127.648271 1419 1575 + FII_36
I was not sure that I could overcome this problem in mlst
. As it was less important for me to get an actual ST compared to presence/absence of the alleles I went for abricate. If mlst
can overcome this problem that would of course be great.
And I must apologize I wrote Warwick as source of the database, which was incorrect. It is the pubmlst website. The download page I used was the following: https://pubmlst.org/bigsdb?db=pubmlst_plasmid_seqdef&page=downloadAlleles. There is an IncA/C cgMLST and a pmlst scheme, for all the other plasmids there is one scheme. If it is in anyway useful to supply the downloaded databases/reworked for abricate I would be happy to do so of course.