12S taxonomic classification databases

Open emmastrand opened this issue 1 year ago • 3 comments

Description of feature

Hi there - I'm trying to use ampliseq for 12S amplicon data and running into issues adding our own custom database because of incompatible formatting. It would be great for ampliseq to support 12S alongside CO1, 16S, 18S, etc. Here is an example of one database that we would use: https://mitofish.aori.u-tokyo.ac.jp/. Thanks!

emmastrand, Feb 28 '24 18:02

It's relatively easy to add a database, so maybe you could contribute this yourself? You need to provide one or two URLs for download and a formatting script that outputs files suitable for DADA2's assignTaxonomy and addSpecies functions. The URLs, together with some information about the database, go into conf/ref_databases.config, and the formatting scripts reside in the bin directory. Here's the documentation for contributing to nf-core pipelines: https://nf-co.re/docs/contributing/contributing_to_pipelines. Eternal glory as a contributor to Ampliseq awaits you! :-)

erikrikarddaniel, Feb 28 '24 18:02

Thanks for sharing this! Do other contributors have advice, tips, or example scripts for writing a formatting script that outputs files suitable for DADA2? This is mostly where I'm stuck.

emmastrand, Feb 28 '24 18:02

You can view all existing formatting scripts in the bin directory of the pipeline. The two output files look like the examples below.

assignTaxonomy.fna:

>Bacteria;Proteobacteria;Alphaproteobacteria;Rickettsiales;Rickettsiaceae;Rickettsia;Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

I.e., each sequence's header is just the full taxonomy string, with ranks separated by semicolons.

addSpecies.fna:

>GB_GCA_000012145.1 Rickettsia felis
TGAGAGTTTGATCCTGGCTCAGAACGAACGCTATCGGTATGCTTAACACATGCAAGTCGGACGGACTAATTGGGGCTTGCTCCAATTAGTTAGTGGCAGACGGGTGAGTAACACGTGGGAATCTGCCCATCAGTACGGAATAACTTTTAGAAATAAAAGCTAATACCGTATATTCTCTACAGAGGAAAGATTTATCGCTGATGGATGAGCCCGCGTCAGATTAGGTAGTTGGTGAGGTAACGGCTCACCAAGCCGACGATCTGTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATACCGAGTGAGTGATGAAGGCCCTAGGGTTGTAAAGCTCTTTTAGCAAGGAAGATAATGACGTTACTTGCAGAAAAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGAGTGCGTAGGCGGTTTAGTAAGTTGGAAGTGAAAGCCCGGGGCTTAACCTCGGAATTGCTTTCAAAACTACTAATCTAGAGTGTAGTAGGGGATGATGGAATTCCTAGTGTAGAGGTGAAATTCTTAGATATTAGGAGGAACACCGGTGGCGAAGGCGGTCATCTGGGCTACAACTGACGCTGATGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGATATCGGAAGATTCTCTTTCGGTTTCGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCTCGCACAAGCGGTGGAGCATGCGGTTTAATTCGATGTTACGCGAAAAACCTTACCAACCCTTGACATGGTGGTCGCGGATCGCAGAGATGCTTTCCTTCAGCTCGGCTGGACCACACACAGGTGTTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCATTCTTATTTGCCAGCGGGTAATGCCGGGAACTATAAGAAAACTGCCGGTGATAAGCCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGGGTTGGGCTACACGCGTGCTACAATGGTGTTTACAGAGGGAAGCAAGACGGCGACGTGGAGCAAATCCCTAAAAGACATCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGTTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCTCGGGCCTTGTACACACTGCCCGTCACGCCATGGGAGTTGGTTTTACCTGAAGGTGGTGAGCTAACGCAAGAGGCAGCCAACCACGGTAAAATTAGCGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAACCTGCGGCTGGATTACCTCCTTA

Here, each header consists of an accession followed by the species name. AFAIK, the accession is not used for anything, but I guess it has to be unique.

Your script just needs to produce these two files, with the above names, starting from whatever you can download.
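To make the two header formats concrete, here is a minimal Python sketch of the conversion step. It is not one of the pipeline's scripts, and the existing ones in bin may be written in another language. It assumes you have already parsed your download into records holding an accession, a lineage ending in the species name, and a sequence; the `Record` type and all function names are hypothetical, and the parsing itself depends on your source database.

```python
from typing import Iterable, NamedTuple, Sequence


class Record(NamedTuple):
    """One reference sequence (hypothetical structure, for illustration only)."""
    accession: str            # identifier for the sequence, assumed to be unique
    lineage: Sequence[str]    # ranks from highest to lowest, ending with the species name
    sequence: str             # nucleotide sequence


def assign_taxonomy_entry(rec: Record) -> str:
    """FASTA entry whose header is the semicolon-joined taxonomy string."""
    return f">{';'.join(rec.lineage)}\n{rec.sequence}\n"


def add_species_entry(rec: Record) -> str:
    """FASTA entry whose header is '<accession> <species name>'."""
    return f">{rec.accession} {rec.lineage[-1]}\n{rec.sequence}\n"


def write_dada2_files(
    records: Iterable[Record],
    assign_path: str = "assignTaxonomy.fna",
    species_path: str = "addSpecies.fna",
) -> int:
    """Write both files and return the number of records written.

    Raises ValueError on a repeated accession, on the assumption
    (not confirmed above) that accessions should be unique.
    """
    seen = set()
    count = 0
    with open(assign_path, "w") as assign_fh, open(species_path, "w") as species_fh:
        for rec in records:
            if rec.accession in seen:
                raise ValueError(f"duplicate accession: {rec.accession}")
            seen.add(rec.accession)
            assign_fh.write(assign_taxonomy_entry(rec))
            species_fh.write(add_species_entry(rec))
            count += 1
    return count
```

Calling `write_dada2_files(records)` in the working directory would produce files with the two names shown above. The rank depth and the exact species label conventions are up to the database you start from, so check the output headers against the examples before using them.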

You can also discuss this in the #ampliseq channel on nf-core's Slack.

erikrikarddaniel, Feb 28 '24 21:02

It seems to me the problem was resolved. Please re-open if any further action is needed.

d4straub, Mar 26 '25 07:03