Mash
Guidelines for making your own database
Just wondering if you have any tips for making your own database. Specifically, can you use a multi-FASTA file with different species and strains to make the database, or are you better off using individual FASTA files? Thanks
This depends on the input data. If your genomes are single contigs (like polished bacterial or viral genomes), then a multi-FASTA with -i will work. However, if they are assemblies or have multiple chromosomes, there is no way for Mash to distinguish the genomes once they are concatenated, so they really need to be kept in separate files if you want a sketch for each genome.
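To make the two cases concrete, here is a minimal sketch of both invocations wrapped in small shell functions. The function names and file names are placeholders of mine, not part of Mash; the only Mash options used are -i (sketch each sequence individually) and -o (output prefix).

```shell
# Case 1: a multi-FASTA in which every record is one complete, single-contig genome.
# -i makes mash produce one sketch per sequence instead of one per file.
# $1 = output prefix (mash appends .msh), $2 = multi-FASTA file
sketch_per_record() {
  mash sketch -i -o "$1" "$2"
}

# Case 2: multi-contig assemblies kept as one FASTA file per genome.
# Without -i, each input file yields exactly one sketch.
# $1 = output prefix, remaining arguments = one FASTA file per genome
sketch_per_file() {
  out_prefix=$1
  shift
  mash sketch -o "$out_prefix" "$@"
}
```

For example, `sketch_per_record viral_db viruses.fasta` should give one sketch per record, while `sketch_per_file asm_db genomeA.fna genomeB.fna` should give one sketch per file.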
As far as general guidelines, this is definitely something we should add to the tutorials but haven't gotten to yet. To summarize our RefSeq sketching strategy: we mirror the NCBI genomes/ directory, use find to crawl it and create a flat directory of symlinks, and pass these to mash as a FOFN (file of filenames) with -l.
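The strategy described above could be sketched roughly as follows. This is my own illustration, not the maintainers' actual script: the function names and directory arguments are hypothetical, and the `*_genomic.fna.gz` pattern is assumed from the file names shown by mash info for the RefSeq sketch. The `*_from_genomic.fna.gz` exclusion is there because NCBI assembly directories also hold CDS and RNA files whose names end the same way.

```shell
# Crawl a local mirror, symlink every genome FASTA into one flat directory,
# and write the symlink paths to a file of filenames (FOFN), one per line.
# $1 = mirror directory, $2 = flat symlink directory, $3 = FOFN to write
make_fofn() {
  mkdir -p "$2"
  # Absolute paths keep the symlinks valid from any working directory.
  mirror_abs=$(cd "$1" && pwd)
  flat_abs=$(cd "$2" && pwd)

  find "$mirror_abs" -type f -name '*_genomic.fna.gz' \
       ! -name '*_from_genomic.fna.gz' |
  while IFS= read -r f; do
    ln -sf "$f" "$flat_abs/$(basename "$f")"
  done

  find "$flat_abs" -name '*_genomic.fna.gz' | sort > "$3"
}

# Sketch everything listed in the FOFN; -l tells mash that the input
# file contains paths to sequence files rather than sequences.
# $1 = output prefix, $2 = FOFN
sketch_fofn() {
  mash sketch -l -o "$1" "$2"
}
```

A run might look like `make_fofn genomes flat genomes.fofn` followed by `sketch_fofn my_refseq genomes.fofn`. Because -i is not used, each listed file becomes one sketch, which matches the one-file-per-genome advice for multi-contig assemblies.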
Hello, everyone~
When I run:
mash info RefSeq88n.msh | head -n 50
I get (partial output):
Header:
Hash function (seed): MurmurHash3_x64_128 (0)
K-mer size: 21 (64-bit hashes)
Alphabet: ACGT (canonical)
Target min-hashes per sketch: 1000
Sketches: 127219
Sketches:
[Hashes] [Length] [ID] [Comment]
1000 143726002 GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz [1870 seqs] NC_004354.4 Drosophila melanogaster chromosome X [...]
1000 3241953429 GCF_000001405.36_GRCh38.p10_genomic.fna.gz [557 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly [...]
1000 3257319537 GCF_000001405.38_GRCh38.p12_genomic.fna.gz [594 seqs] NC_000001.11 Homo sapiens chromosome 1, GRCh38.p12 Primary Assembly [...]
1000 3231170666 GCF_000001515.7_Pan_tro_3.0_genomic.fna.gz [44449 seqs] NC_006468.4 Pan troglodytes isolate Yerkes chimp pedigree #C0471 (Clint) chromosome 1, Pan_tro 3.0, whole genome shotgun sequence [...]
So, I want to build an up-to-date RefSeq bacterial Mash database from the FTP site below:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/*.genomics.fna.gz
But I can't find per-genome files like GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz there. Do I need to split the RefSeq release bacteria files back into one file per genome?
Thanks~
I think I should use ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/
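One possible way to get per-genome files from that location, offered as a sketch rather than a recommended procedure: the genomes/refseq/bacteria/ directory carries an assembly_summary.txt table, and a download list can be derived from it. The function name below is hypothetical, and the column positions (version_status in column 11, ftp_path in column 20) are an assumption that should be checked against the header line of the file you actually download.

```shell
# Print one genome FASTA URL per "latest" assembly in an assembly_summary.txt file.
# Assumes tab-separated columns: 11 = version_status, 20 = ftp_path (verify in the header).
# $1 = path to a local copy of assembly_summary.txt
list_genome_urls() {
  awk -F '\t' '
    /^#/ { next }                      # skip header and comment lines
    $11 == "latest" && $20 != "na" {
      n = split($20, parts, "/")       # last path component names the assembly directory
      print $20 "/" parts[n] "_genomic.fna.gz"
    }
  ' "$1"
}
```

The resulting list could be saved with `list_genome_urls assembly_summary.txt > urls.txt` and passed to a downloader of your choice, leaving one `*_genomic.fna.gz` file per genome, ready for sketching.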
Hi, does this mean that the pre-sketched RefSeq database we can download from the tutorial page is updated regularly? Could you specify which release of the NCBI database it corresponds to?
Hi, I would also like to know if the pre-sketched RefSeq is updated. If not, as a beginner in bioinformatics, what would you suggest I do to update it myself? When I used Mash in my project, it seemed that more recent entries were not included in the pre-sketched RefSeq. Thank you in advance for the help!