sourmash
sourmash copied to clipboard
classify plant reads in metagenomes
Dear Sourmash team,
Hope you are all good. I have a project where I have some shotgun metagenomics data of wild rodents. I want to see if I can classify reads to plant genomes, to have an idea of their diet.
Is it possible to do it with sourmash? I suppose I would have to make my own database as I do not see any databases containing plants already available.
Thanks in advance.
Gabriele
Hi Gabri, sorry for ignoring your issue for so long 😭
Short version - we don't have anything formal for plants, BUT if you can find a listing of all the things you want - maybe an assembly_summary file? - we can put together a recipe for sketching it quickly. Sound good?
Hi Titus, sounds great. Here attached the list of all plant genomes marked as reference in the genbank, with their accession numbers and taxID. Is that fine or you need any other info?
Thanks a lot again! Gabri
Hi Titus,
Any chances of further progresses on this? I would love to help if needed?
Thanks! Gabri
Hi Gabri,
I went ahead and made this database for you. You can download the files here:
| K-mer size | Zipfile collection |
|---|---|
| k21 | download (7G) |
| k31 | download (8.8G) |
| k51 | download (11G) |
Lineage spreadsheet for sourmash tax commands: download
Two genbank accessions did not have genome FASTAs associated with them. They are:
| accession | name | ftp_path |
|---|---|---|
| GCA_033675215.1 | GCA_033675215.1 ASM3367521v1 Rhodiola kirilowii | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/215/GCA_033675215.1_ASM3367521v1 |
| GCA_033675395.1 | GCA_033675395.1 ASM3367539v1 Rhodiola chrysanthemifolia | https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/395/GCA_033675395.1_ASM3367539v1 |
The commands and files I used are here: https://github.com/bluegenes/2024-ds-plant. In short, I used our directsketch plugin. I first created a compatible input file from your csv, which looked like this:
accession,name,ftp_path
GCF_000001735.4,GCF_000001735.4 TAIR10.1 Arabidopsis thaliana,
GCF_030704535.1,GCF_030704535.1 ASM3070453v1 Vitis vinifera,
GCF_022201045.2,GCF_022201045.2 DVS_A1.0 Citrus sinensis,
I used your first 3 columns for the "name".
Then, I ran gbsketch:
sourmash scripts gbsketch Reference_genomes_plants_genbank.directsketch.csv -c 1 -r 1 --genomes-only -o Reference_genomes_plants_genbank.directsketch.zip --failed Reference_genomes_plants_genbank.directsketch.failed.csv --checksum-fail Reference_genomes_plants_genbank.directsketch.checksum-fail.csv --batch-size 200
to help with restart since there are large genomes, I use a batching process which actually yields a number of output zips. For the final db, I ran
sourmash sig catto unite the batches. I also used a parameter-n 2to download fewer genomes simultaneously since the are so large (helped with memory usage -- my first run-through got killed at 100G!). Unfortunately, that is currently only available in a development branch for now, but will be released soon.
There are a lot of very large genomes in here! It took a few days to run 😅 . If you have an updated version with any genomes that have been added since July, I'm happy to sketch those and update this db for you.
Please let me know how it goes - I haven't actually tested these with anything yet!
Tessa
@bluegenes this is sooo cool!!! thanks so much!!! @ctb and thanks Titus too you rock!!!!! I will let you know how it goes!
Hello @bluegenes. One last question. I successfully run gather between my metagenomes and the databases. Now I want to run taxonomy. But I get an error that the csv file is not formatted correctly. Nevertheless, i have troubles finding the correct way to format it from the instruction website.
This is the command I run and the error I get:
sourmash tax annotate -g ${infile} --taxonomy Rerence_genomes_plants_genbank.csv
error:
== This is sourmash version 4.6.1. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
[KERROR: cannot read taxonomy assignments from '/xdisk/laubitz/schiro/sourmash_plant/Rerence_genomes_plants_genbank.csv': No taxonomic identifiers found; headers are '\ufeffAssembly Accession','Assembly Name','Organism Name','Organism Taxonomic ID','Organism Infraspecific Names Breed','Organism Infraspecific Names Strain','Organism Infraspecific Names Cultivar','Organism Infraspecific Names Ecotype','Organism Infraspecific Names Isolate','Organism Infraspecific Names Sex','Annotation Name','Assembly Stats Total Sequence Length','Assembly Stats Total Number of Chromosomes','Assembly Level','Assembly Release Date','WGS project accession','Assembly Stats Number of Scaffolds'
hi @gabridinosauro sorry, we need to format the taxdb for you/make the lineages CSV - there are some instructions in the plant repo mentioned above, but I'm not sure if they'll work. It's on our list!
@gabridinosauro I made you a tax file that should work, though I didn't test it. Try downloading here: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.lineages.csv.gz
let me know if it works!
it worked thanks!!! I downloaded the file from the github repo tho, not from that link because I did not have permissions from that link. Overall, the method seem to work although I seem to get some false positives. I get hits of tropical plants when my wild rodents are living in Canada. I think it might be because there is bacterial contamination in plant genomes? I am trying now to use kmer 51 instead of 31. Another idea would be to use chloroplast genomes instead, I can try building my own database.
hi @gabridinosauro we have some new code that makes it easy to sketch all genomes under any given NCBI taxonomic rank; it's not particularly usable by others yet, but I wanted to link it for you, since you were part of the inspiration. see: https://github.com/sourmash-bio/sourmash/issues/3487
basically, if you can locate one or more NCBI taxonomic IDs under which you want all the things sketched, we can produce said database :)
@gabridinosauro updated reference databases described here: https://github.com/sourmash-bio/sourmash/issues/3504