Add query comment to output of mash dist

Open brymerr921 opened this issue 7 years ago • 8 comments

I'm using Mash 2.0 and the file refseq.genomes.k21.s1000.msh. When I run "mash dist" on an individual FASTA file and query it against refseq.genomes.k21.s1000.msh, I noticed that the "query-comment" does not appear in the output, and I can't figure out how to get it to show up. It's less helpful for me to see the filename of the genome that my genome is most similar to, and I'd really love to be able to see the name of the genome most closely related to my query.

Example output:

S24_CONCOCTC16_MAG_00011-contigs.fa       GCF_000171215.1_ASM17121v1_genomic.fna.gz       0.0185579       0       512/1000
S24_CONCOCTC16_MAG_00011-contigs.fa       GCF_001052445.1_ASM105244v1_genomic.fna.gz       0.0210181       0       474/1000
S24_CONCOCTC16_MAG_00011-contigs.fa       GCF_000013285.1_ASM1328v1_genomic.fna.gz       0.0212923       0       470/1000

I can look up the accession of the top hit at NCBI (ASM17121v1) to see that it's Clostridium perfringens, but it'd be great to see that right in the output of mash dist.

brymerr921 avatar Oct 18 '17 23:10 brymerr921

This would be a great issue to address, and we would welcome input on how best to do it. On the one hand, we try to keep mash dist as generic as possible for varying use cases, but on the other, that can make it clunky for very common ones like this. For example, you would need "reference-comment" instead of "query-comment" if your inputs had been reversed, which would be perfectly valid. Adding comment fields for both reference and query I think would be too large and unreadable, but we also like to keep the option list clean so it is somewhat intuitive to use. One possibility is to detect a one-to-many comparison and show both the name and comment, but for the "many" file only. The issue could also be addressed at the sketch input level, as it was when we had to custom-combine and label RefSeq assemblies, by adding more info to the name itself. I'm curious if anyone has more thoughts on this!

ondovb avatar Oct 23 '17 22:10 ondovb

Hi, I'm coming back to this some time later, and I realized I wasn't very clear. Apologies!

When I run mash dist -h using Mash v2.0, it says the output fields are:

[identity, shared-hashes, median-multiplicity, p-value, query-ID]

When I run this command: mash dist refseq.genomes.k21s1000.msh genome1.fa I get five output columns:

GCF_000171215.1_ASM17121v1_genomic.fna.gz       genome1.fa     0.0185579       0       512/1000

Running mash screen -h also says that there is one additional output column, as follows:

[identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment]

Running this command:

mash screen refseq.genomes.k21s1000.msh genome1.fa

Gives me these six columns:

0.973992        575/1000        1       0       GCF_000171215.1_ASM17121v1_genomic.fna.gz       [55 seqs] NZ_ABDY01000055.1 Clostridium perfringens NCTC 8239 gcontig_1106202603752, whole genome shotgun sequence [...]

Is there a way to get the sixth column of the mash screen output to appear in the output of mash dist?

brymerr921 avatar Feb 14 '18 23:02 brymerr921

Bumping this for the same request as last posted by @brymerr921

dadahan avatar Mar 29 '19 23:03 dadahan

I found that if you change line 270 of src/mash/CommandDistance.cpp from

cout << output->sketchRef.getReference(j).name << '\t' << output->sketchQuery.getReference(i).name << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << endl;

to

cout << output->sketchRef.getReference(j).name << '\t' << output->sketchQuery.getReference(i).name << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << '\t' << output->sketchRef.getReference(j).comment << endl;

and then recompile, the comments are included, which will show the name of your hit and not just the accession number.
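For anyone who wants to see the effect of that change outside the Mash source tree, here is a minimal standalone sketch of the same row-building logic. SketchEntry, PairResult, formatDistRow, and the showComment switch are hypothetical names made up for illustration; they are not Mash's actual classes or options.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-ins for Mash's internal types; not the real classes.
struct SketchEntry {
    std::string name;     // e.g. the FASTA filename
    std::string comment;  // e.g. the description from the first sequence header
};

struct PairResult {
    double distance;
    double pValue;
    int numer;  // number of shared hashes
    int denom;  // sketch size
};

// Build one tab-separated row. The reference comment is appended as a
// sixth column only when requested, so the default five-column layout
// stays unchanged.
std::string formatDistRow(const SketchEntry& ref, const SketchEntry& query,
                          const PairResult& pair, bool showComment) {
    std::ostringstream row;
    row << ref.name << '\t' << query.name << '\t' << pair.distance << '\t'
        << pair.pValue << '\t' << pair.numer << '/' << pair.denom;
    if (showComment) {
        row << '\t' << ref.comment;
    }
    return row.str();
}
```

Putting the comment last keeps the first five columns in the same positions, so existing scripts that read columns 1 to 5 should keep working.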

brymerr921 avatar Mar 29 '19 23:03 brymerr921

This is addressed in latest source with -C. It appends the comment to the ID, separated by :. This avoids changing the field order. Let me know if this works; will be in the next release.
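If downstream code needs the ID and the comment separately again, a combined field can be split at the first colon. This is only a sketch under the assumption that the ID itself contains no ':' (which holds for typical filenames, while comments may contain colons); splitIdComment is a hypothetical helper, not part of Mash.

```cpp
#include <string>
#include <utility>

// Split a combined "ID:comment" field at the first ':'.
// Assumption: the ID contains no ':'; everything after the first ':'
// is treated as the comment, even if it contains further colons.
std::pair<std::string, std::string> splitIdComment(const std::string& field) {
    const std::string::size_type pos = field.find(':');
    if (pos == std::string::npos) {
        // No separator found: the whole field is the ID, with an empty comment.
        return std::make_pair(field, std::string());
    }
    return std::make_pair(field.substr(0, pos), field.substr(pos + 1));
}
```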

ondovb avatar May 01 '19 23:05 ondovb

Thank you!

dadahan avatar May 01 '19 23:05 dadahan

Hi, I tried using -C with Mash version 2.1.1, but it's not appending the comment. The uppercase -C returned the error "Unrecognized option", and the lowercase -c returned the error "Argument to -c must be a number". I ran the analysis as follows: mash dist -c .....

Batsi-2015 avatar Nov 02 '20 18:11 Batsi-2015

I managed to figure this out: it was the database that needed an update.

mabvakureb avatar Nov 09 '20 15:11 mabvakureb