Mash
Add query comment to output of mash dist
I'm using Mash 2.0 and the file refseq.genomes.k21.s1000.msh. When I run "mash dist" on an individual FASTA file and query it against refseq.genomes.k21.s1000.msh, I noticed that the "query-comment" does not appear in the output, and I can't figure out how to get it to show up. It's less helpful for me to see the filename of the genome that my genome is most similar to, and I'd really love to be able to see the name of the genome most closely related to my query.
Example output:
S24_CONCOCTC16_MAG_00011-contigs.fa GCF_000171215.1_ASM17121v1_genomic.fna.gz 0.0185579 0 512/1000
S24_CONCOCTC16_MAG_00011-contigs.fa GCF_001052445.1_ASM105244v1_genomic.fna.gz 0.0210181 0 474/1000
S24_CONCOCTC16_MAG_00011-contigs.fa GCF_000013285.1_ASM1328v1_genomic.fna.gz 0.0212923 0 470/1000
I can look up the accession of the top hit at NCBI (ASM17121v1) to see that it's Clostridium perfringens, but it'd be great to see that right in the output of mash dist.
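Until something like this exists in mash dist itself, one workaround is to annotate the output afterwards. Here is a minimal sketch, assuming you already have your own table mapping sequence IDs to descriptions from some other source. The names `splitTabs`, `annotateDistLine`, and `idColumn` are hypothetical helpers for illustration and are not part of Mash.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Split one tab-separated line into its fields.
std::vector<std::string> splitTabs(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream in(line);
    while (std::getline(in, field, '\t')) {
        fields.push_back(field);
    }
    return fields;
}

// Append a description to one line of tab-separated `mash dist` output.
// `names` maps an ID to a description and is supplied by the caller (assumption).
// `idColumn` is the 0-based column holding the ID to look up (hypothetical parameter).
// Lines with no matching ID are returned unchanged.
std::string annotateDistLine(const std::string& line,
                             const std::unordered_map<std::string, std::string>& names,
                             std::size_t idColumn) {
    const std::vector<std::string> fields = splitTabs(line);
    if (idColumn >= fields.size()) {
        return line;
    }
    const auto it = names.find(fields[idColumn]);
    if (it == names.end()) {
        return line;
    }
    return line + '\t' + it->second;
}
```

The description goes at the end of the line so the existing column order is untouched. Which column holds the RefSeq ID depends on the argument order given to mash dist, which is why `idColumn` is a parameter.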
This would be a great issue to address, and we would welcome input on how best to do it. On the one hand, we try to keep mash dist as generic as possible for varying use cases, but on the other, that can make it clunky for very common ones like this. For example, you would need "reference-comment" instead of "query-comment" if your inputs had been reversed, which would be perfectly valid. I think adding comment fields for both reference and query would make the output too large and unreadable, but we also like to keep the option list clean so it is somewhat intuitive to use. One possibility is to detect a one-to-many comparison and show both the name and comment, but for the "many" file only. The issue could also be addressed at the sketch input level, as it was when we had to custom-combine and label RefSeq assemblies, by adding more info to the name itself. I'm curious if anyone has more thoughts on this!
Hi, I'm coming back to this some time later and I realized I wasn't very clear. Apologies!
When I run this command using Mash v2.0:
mash dist -h
it says the output columns are:
The output fields are
[identity, shared-hashes, median-multiplicity, p-value, query-ID]
When I run this command:
mash dist refseq.genomes.k21s1000.msh genome1.fa
I get five output columns:
GCF_000171215.1_ASM17121v1_genomic.fna.gz genome1.fa 0.0185579 0 512/1000
Running mash screen -h
also says that there is one additional output column, as follows:
[identity, shared-hashes, median-multiplicity, p-value, query-ID, query-comment]
Running this command:
mash screen refseq.genomes.k21s1000.msh genome1.fa
Gives me these six columns:
0.973992 575/1000 1 0 GCF_000171215.1_ASM17121v1_genomic.fna.gz [55 seqs] NZ_ABDY01000055.1 Clostridium perfringens NCTC 8239 gcontig_1106202603752, whole genome shotgun sequence [...]
Is there a way to get the sixth column of the mash screen output to appear in the output for mash dist?
Bumping this for the same request as last posted by @brymerr921
I found that if you change line 270 of src/mash/CommandDistance.cpp from
cout << output->sketchRef.getReference(j).name << '\t' << output->sketchQuery.getReference(i).name << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << endl;
to
cout << output->sketchRef.getReference(j).name << '\t' << output->sketchQuery.getReference(i).name << '\t' << pair->distance << '\t' << pair->pValue << '\t' << pair->numer << '/' << pair->denom << '\t' << output->sketchRef.getReference(j).comment << endl;
and then recompile, the comments are included, which will show the name of your hit and not just the accession number.
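To make the effect of that change easier to see outside the Mash source tree, here is a standalone sketch of the same print logic. `DistRow` and `formatDistRow` are hypothetical stand-ins for illustration; they are not Mash's actual types, and the real code reads these values from `output` and `pair` as shown above.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for the values the patched line prints.
struct DistRow {
    std::string refName;
    std::string queryName;
    double distance;
    double pValue;
    int numer;
    int denom;
    std::string refComment;
};

// Build one tab-separated output line. With withComment == false this mirrors
// the original five columns; with true, the reference comment is appended as
// a sixth column, as in the modified line.
std::string formatDistRow(const DistRow& row, bool withComment) {
    std::ostringstream out;
    out << row.refName << '\t' << row.queryName << '\t' << row.distance << '\t'
        << row.pValue << '\t' << row.numer << '/' << row.denom;
    if (withComment) {
        out << '\t' << row.refComment;
    }
    return out.str();
}
```

Because the comment is appended as the last column, scripts that only read the first five columns should keep working.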
This is addressed in latest source with -C. It appends the comment to the ID, separated by ":". This avoids changing the field order. Let me know if this works; will be in the next release.
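For anyone parsing this downstream, here is a minimal sketch of pulling the ID and comment back apart. It assumes the ID itself contains no ":" and splits at the first one, so any colons inside the comment are preserved. `splitIdAndComment` is a hypothetical helper, not part of Mash.

```cpp
#include <string>
#include <utility>

// Split a combined "ID:comment" field at the first ':'.
// Assumption: the ID contains no ':' of its own.
// If there is no ':', the whole field is returned as the ID with an empty comment.
std::pair<std::string, std::string> splitIdAndComment(const std::string& field) {
    const std::string::size_type pos = field.find(':');
    if (pos == std::string::npos) {
        return {field, ""};
    }
    return {field.substr(0, pos), field.substr(pos + 1)};
}
```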
Thank you!
Hi, I have tried using -C with Mash version 2.1.1, but it's not appending the column. The uppercase -C returned the error "Unrecognized option", and the lowercase -c returned the error "Argument to -c must be a number". I ran the analysis as follows: mash dist -c .....
I managed to figure this out: it was the database that needed an update.