amr Presence/absence matrix suggestion

Presence/absence matrix suggestion

Open ireneortega opened this issue 4 years ago • 6 comments

Once the output is created, it would be nice if a matrix 1/0 representing presence/absence for each gene is generated. Users could select minimum coverage and identity treshold for considering a gene as present. In that way, results would be summarized and easy to read. You can include as many columns as you want from the original output as extra information in the matrix. What do you think?

May 15 '20 10:05 ireneortega

The desired matrix can be produced using the normal AMRFinder output as input. Therefore, it can be done by a separate script. Moreover, this work can be done for a set of genomes. Do you want us to write this script?

May 15 '20 13:05 vbrover

Hi Irene,

There are a few points you bring up. I'll discuss one-by-one:

it would be nice if a matrix 1/0 representing presence/absence for each gene is generated.

We have over 4,600 gene symbols that could be printed, do you want a matrix with 4,000 0's in columns or 4,000 zeros in rows? I'm not sure I understand the matrix format you're describing.

I can create a line of shell script to print out a uniq'd list of the gene symbols that are returned from AMRFinderPlus. Note that genes with "PARTIAL" should probably be excluded since they are less likely to be functional.

Users could select minimum coverage and identity treshold for considering a gene as present.

We do provide options to select minimum blast thresholds, but those options are primarily for testing and development, and they are not how we recommend running AMRFinderPlus. The decision tree AMRFinderPlus uses to select a gene symbol and name is a bit more complex and based on a tree of gene nomenclature including curated HMM searches, blastx, blastp, and blastn searches with curated blast cutoffs for some genes where the defaults are inappropriate.

You can include as many columns as you want from the original output as extra information in the matrix.

So this is making me think that you mean a gene symbol per column maybe? Not all gene symbols are defined by sequences, some higher level nodes would be defined by HMMs.

Based on how I often use AMRFinderPlus I'm thinking the following might be helpful to you and produce output a bit closer to what you're looking for.

Here's a line of shell to get a list of gene symbols and names for an assembly on the command-line. Here I'm using the example sequence test_dna.fa.

$ amrfinder -n test_dna.fa -O Escherichia --threads 10 | awk 'NR>1{print}' | cut -f 6,7 | sort -u
23S_A2058T	Escherichia azithromycin/erythromycin/telithromycin resistant 23S
ampC_T-14TGT	Escherichia cephalosporin resistant ampC
aph(3'')-Ib	aminoglycoside O-phosphotransferase APH(3'')-Ib
blaEC	BlaEC family class C beta-lactamase
blaOXA	OXA-48 family class D beta-lactamase
blaOXA	class D beta-lactamase
blaPDC	PDC family class C beta-lactamase
blaTEM	TEM family class A beta-lactamase
blaTEM-156	class A beta-lactamase TEM-156
nfsA_K141STOP	Escherichia nitrofurantoin resistant NfsA
nfsA_R15C	Escherichia nitrofurantoin resistant NfsA
pmrB_C84R	Escherichia colistin resistant PmrB
sul2	sulfonamide-resistant dihydropteroate synthase Sul2
vanG	D-alanine--D-serine ligase VanG

To just get a list of gene symbols skipping partials internal to contigs:

$ amrfinder -n test_dna.fa -O Escherichia --threads 10 | grep -Pv '\tPARTIAL' | awk 'NR>1{print}' | cut -f 6 | sort -u
23S_A2058T
ampC_T-14TGT
blaEC
blaOXA
blaPDC
blaTEM
blaTEM-156
nfsA_K141STOP
nfsA_R15C
pmrB_C84R
vanG

Obviously much more sophisticated manipulations are possible with short shell scripts, but I'm not sure exactly what you're looking for.

Thanks for your suggestions, Arjun

May 15 '20 14:05 evolarjun

Sorry, I didn't explain well.

As the database is huge, 1/0 matrix with all gene symbol will not be very intuitive due to high number of rows and columns, as you said. But I suggest you another option to visualize results of different samples at the same time.

For example, a data frame with gene symbols for each different Element type (columns) and genome (rows). That information could be showed in different ways. For that, something should be added in the otuput for the tool to know what sequences correspond to each genome. Perhaps, input file names could be used for that. I can give you an example of what I mean AMR_matrix.txt, using these results AMR_output.txt (A, B and C are genome names). I joined several outputs in one file, after I added a column in each single file with the file name, that corresponds to the genome id. This is highly interesting in your output if users want to join outputs in one file to know what results correspond to each genome. In the example I just considered genes. Maybe this is more complex than I think as I only use AMRFinder to look for antibiotic resistance genes and, as you said, not all gene symbols are defined by sequences.

Considering -i and -c options can be used as minimum identity and coverage, respectively, these parameteres are thresholds for gene presence/absence.

I am just thinking a way of visualizing results of a set of genomes and compare them quickly at the same time. Your tool would be excellent if you think in data visualization, too.

Thanks for your considerations.

May 17 '20 10:05 ireneortega

Hi Irene,

Thanks for posting the examples. Now I understand what you're talking about. We do ourselves transform the AMRFinderPlus output to a format similar to what you are describing for the Pathogen Detection Isolates Browser, so I see the utility.

I think this is probably better handled by post-processing the AMRFinderPlus output, with the possible exception of adding a 'sample name' option. I do a fair amount of AMRFinderPlus analysis myself, and certainly have my own ways of transforming the output to be useful to me. There have also been other requests for a friendlier output format (E.g., https://github.com/ncbi/amr/issues/25). We should create a page with scripts or command-lines to modify the output in different ways.

For example here's a little awk script to get gene lists from a file like your AMR_output.txt:

awk 'BEGIN{SUBSEP=OFS=FS="\t"} NR>1 {g[$1] = $7 " " g[$1]} END {for (s in g) print s"\t"g[s]}' AMR_output.txt

Arjun

May 18 '20 17:05 evolarjun

Hi Arjun,

That helped me, thanks. I had a look to https://github.com/ncbi/amr/issues/25 and that could be an enhancement, too. Otherwise, I agree with you that some examples of command-lines to modify the output will help users with no programming experience, I'm sure.

Irene

May 20 '20 07:05 ireneortega

amr amr copied to clipboard

Presence/absence matrix suggestion

amr
amr copied to clipboard