abricate icon indicating copy to clipboard operation
abricate copied to clipboard

`--mincov` filter applied after `culling_limit`

Open kwongj opened this issue 6 years ago • 3 comments

When looking for aminoglycoside resistance genes, Norelle and I noticed that using the ncbi database, we weren't finding the genes we expected for some of our genomes.

For example, for this genome, we expected to find the gene aac(6')-Ib_1, which is in both the resfinder and ncbi databases:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db resfinder --mincov 90 | grep aac
Using nucl database resfinder:  2434 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 11 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    89936   90541   aac(6')-Ib_1    1-606/606       =============== 0/0     100.00  100.00  resfinder       M21682  aac(6')-Ib_1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    36519   37124   aac(6')-Ib_1    1-606/606       =============== 0/0     100.00  100.00  resfinder       M21682  aac(6')-Ib_1

Instead, using the ncbi database, we get this:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi | grep aminoglycoside
Using nucl database ncbi:  4324 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 17 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    76085   76900   aph(3')-Ia      1-816/816       =============== 0/0     100.00  100.00  ncbi    A7J11_00274     aminoglycoside O-phosphotransferase APH(3')-Ia
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    89955   90541   A7J11_02581     806-1392/1392   ........======= 0/0     42.17   99.66   ncbi    A7J11_02581     bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    90611   90829   aadA1   1-219/792       =====.......... 0/0     27.65   100.00  ncbi    A7J11_04543     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    27659   27769   aadA1   667-777/777     ............=== 0/0     14.29   99.10   ncbi    A7J11_04568     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    36538   37124   A7J11_02581     806-1392/1392   ........======= 0/0     42.17   99.66   ncbi    A7J11_02581     bifunctional aminoglycoside nucleotidyltransferase ANT(3'')-Ih/aminoglycoside N-acetyltransferase AAC(6')-Iid
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_3    37194   37412   aadA1   1-219/792       =====.......... 0/0     27.65   100.00  ncbi    A7J11_04543     ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA1

This identifies a partial hit to a bifunctional gene ANT(3'')-Ih/AAC(6')-Iid instead. Applying the --mincov filter in abricate also doesn't work:

$ abricate ~/PacBio/final/fna/2014-19072.fna --db ncbi --mincov 90 | grep aminoglycoside
Using nucl database ncbi:  4324 sequences -  2018-Apr-20
Processing: /home/jasonk1/PacBio/final/fna/2014-19072.fna
Found 9 genes in /home/jasonk1/PacBio/final/fna/2014-19072.fna
/home/jasonk1/PacBio/final/fna/2014-19072.fna   2014-19072_2    76085   76900   aph(3')-Ia      1-816/816       =============== 0/0     100.00  100.00  ncbi    A7J11_00274     aminoglycoside O-phosphotransferase APH(3')-Ia

This just seems to apply this coverage filter after the blastn -culling_limit 1 filter has been applied. When I make a blast database with the aac(6') genes, the bifunctional gene is ranked higher due to a higher bitscore (longer alignment length).

I think it has something to do with where you apply the --mincov filter in abricate:

my $pccov = 100 * ($hit{length}-$hit{gaps}) / $hit{slen};
next unless $pccov >= $mincov;

You obviously still want to just report the top hit, but could you perhaps set -culling_limit 100, and then filter manually?

kwongj avatar Jun 25 '18 06:06 kwongj

I am trying to get calls for the differen plasmidMLST (pmlst) genes. I have made a database for the schemes in Warwick. When I run abricate I get hits that I believe are not optimal. A longer hit with lower identity is preferred over a shorter hit with 100/100 cov/id. I checked this for the hit that abricate gives as FIA_17, but that I would expect to be FIA_2. After this I made a database with just the FIA_2 sequence and the 100/100 cov/id hit is called. Is there a way to prefer the 100/100 hits over the longer hits with less identity?

The results from the example: when using a database with only IncF FIA_2:

AWGS160030_S13_spades_jan19.fa	NODE_27_length_40814_cov_43.958291	27638	28021	+	FIA_2	1-384/384	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIA_2~~~GENE_ACC_unkown~~~incf	incf

When using the full pmlst database:

AWGS160030_S13_spades_jan19.fa	NODE_27_length_40814_cov_43.958291	27637	28045	+	FIA_17	1-409/409	===============	0/0	100.00	98.53	pMLST	GENE_ACC_unkown	pMLST~~~FIA_17~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4535	4691	-	FII_1	1-157/157	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4649	4799	-	FIC_3	51-200/200	...=====/======	1/1	75.00	97.35	pMLST	GENE_ACC_unkown	pMLST~~~FIC_3~~~GENE_ACC_unkown~~~incf	incf
AWGS160030_S13_spades_jan19.fa	NODE_65_length_5341_cov_56.738205	2736	3108	-	FIB_20	1-373/373	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf	incf

and applying the minid filter removes the hits, although the 100/100 gene is present:

AWGS160030_S13_spades_jan19.fa	NODE_63_length_6039_cov_89.406800	4535	4691	-	FII_1	1-157/157	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FII_1~~~GENE_ACC_unkown~~~incf	incf 
AWGS160030_S13_spades_jan19.fa	NODE_65_length_5341_cov_56.738205	2736	3108	-	FIB_20	1-373/373	===============	0/0	100.00	100.00	pMLST	GENE_ACC_unkown	pMLST~~~FIB_20~~~GENE_ACC_unkown~~~incf	incf

abricate version 0.9.8

Any suggestions on how to tackle this problem would be appreciated! (and I have posted in this issues because I think it may be the same problem)

pepijnhuizinga avatar Nov 20 '19 20:11 pepijnhuizinga

@pepijnhuizinga Would mlst be a better tool?

Can you provide a link to the Warwick plasmidMLST download page?

I think the problem is that I use -culling_limit from the BLAST package to choose the "best hit" in a region, to avoid getting 100s of hits to the same locus.

tseemann avatar Nov 23 '19 22:11 tseemann

I chose abricate above mlst because my goal was to get all the present alleles and with plasmid mlst its quite common for one isolate to have multiple variants of the same gene. For example the following would not be uncommon:

NODE_49_length_11049_cov_122.750778	3852	4008	+	FII_31
NODE_50_length_10220_cov_127.648271	1419	1575	+	FII_36

I was not sure that I could overcome this problem in mlst . As it was less important for me to get an actual ST compared to presence/absence of the alleles I went for abricate. If mlst can overcome this problem that would of course be great.

And I must apologize I wrote Warwick as source of the database, which was incorrect. It is the pubmlst website. The download page I used was the following: https://pubmlst.org/bigsdb?db=pubmlst_plasmid_seqdef&page=downloadAlleles. There is an IncA/C cgMLST and a pmlst scheme, for all the other plasmids there is one scheme. If it is in anyway useful to supply the downloaded databases/reworked for abricate I would be happy to do so of course.

pepijnhuizinga avatar Nov 24 '19 09:11 pepijnhuizinga