sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

classify plant reads in metagenomes

Open gabridinosauro opened this issue 1 year ago • 2 comments
trafficstars

Dear Sourmash team,

Hope you are all good. I have a project where I have some shotgun metagenomics data of wild rodents. I want to see if I can classify reads to plant genomes, to have an idea of their diet.

Is it possible to do it with sourmash? I suppose I would have to make my own database as I do not see any databases containing plants already available.

Thanks in advance.

Gabriele

gabridinosauro avatar May 24 '24 00:05 gabridinosauro

Hi Gabri, sorry for ignoring your issue for so long 😭

Short version - we don't have anything formal for plants, BUT if you can find a listing of all the things you want - maybe an assembly_summary file? - we can put together a recipe for sketching it quickly. Sound good?

ctb avatar Jul 09 '24 20:07 ctb

Hi Titus, sounds great. Here attached the list of all plant genomes marked as reference in the genbank, with their accession numbers and taxID. Is that fine or you need any other info?

Thanks a lot again! Gabri

Rerence_genomes_plants_genbank.csv

gabridinosauro avatar Jul 09 '24 22:07 gabridinosauro

Hi Titus,

Any chances of further progresses on this? I would love to help if needed?

Thanks! Gabri

gabridinosauro avatar Dec 03 '24 23:12 gabridinosauro

Hi Gabri,

I went ahead and made this database for you. You can download the files here:

K-mer size Zipfile collection
k21 download (7G)
k31 download (8.8G)
k51 download (11G)

Lineage spreadsheet for sourmash tax commands: download

Two genbank accessions did not have genome FASTAs associated with them. They are:

accession name ftp_path
GCA_033675215.1 GCA_033675215.1 ASM3367521v1 Rhodiola kirilowii https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/215/GCA_033675215.1_ASM3367521v1
GCA_033675395.1 GCA_033675395.1 ASM3367539v1 Rhodiola chrysanthemifolia https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/033/675/395/GCA_033675395.1_ASM3367539v1

The commands and files I used are here: https://github.com/bluegenes/2024-ds-plant. In short, I used our directsketch plugin. I first created a compatible input file from your csv, which looked like this:

accession,name,ftp_path
GCF_000001735.4,GCF_000001735.4 TAIR10.1 Arabidopsis thaliana,
GCF_030704535.1,GCF_030704535.1 ASM3070453v1 Vitis vinifera,
GCF_022201045.2,GCF_022201045.2 DVS_A1.0 Citrus sinensis,

I used your first 3 columns for the "name".

Then, I ran gbsketch:

sourmash scripts gbsketch Reference_genomes_plants_genbank.directsketch.csv -c 1 -r 1 --genomes-only -o Reference_genomes_plants_genbank.directsketch.zip --failed Reference_genomes_plants_genbank.directsketch.failed.csv --checksum-fail Reference_genomes_plants_genbank.directsketch.checksum-fail.csv --batch-size 200

to help with restart since there are large genomes, I use a batching process which actually yields a number of output zips. For the final db, I ran sourmash sig cat to unite the batches. I also used a parameter -n 2 to download fewer genomes simultaneously since the are so large (helped with memory usage -- my first run-through got killed at 100G!). Unfortunately, that is currently only available in a development branch for now, but will be released soon.

There are a lot of very large genomes in here! It took a few days to run 😅 . If you have an updated version with any genomes that have been added since July, I'm happy to sketch those and update this db for you.

Please let me know how it goes - I haven't actually tested these with anything yet!

Tessa

bluegenes avatar Dec 10 '24 21:12 bluegenes

@bluegenes this is sooo cool!!! thanks so much!!! @ctb and thanks Titus too you rock!!!!! I will let you know how it goes!

gabridinosauro avatar Dec 11 '24 01:12 gabridinosauro

Hello @bluegenes. One last question. I successfully run gather between my metagenomes and the databases. Now I want to run taxonomy. But I get an error that the csv file is not formatted correctly. Nevertheless, i have troubles finding the correct way to format it from the instruction website.

This is the command I run and the error I get:

sourmash tax annotate -g ${infile} --taxonomy Rerence_genomes_plants_genbank.csv

error:

== This is sourmash version 4.6.1. ==

[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==


[KERROR: cannot read taxonomy assignments from '/xdisk/laubitz/schiro/sourmash_plant/Rerence_genomes_plants_genbank.csv': No taxonomic identifiers found; headers are '\ufeffAssembly Accession','Assembly Name','Organism Name','Organism Taxonomic ID','Organism Infraspecific Names Breed','Organism Infraspecific Names Strain','Organism Infraspecific Names Cultivar','Organism Infraspecific Names Ecotype','Organism Infraspecific Names Isolate','Organism Infraspecific Names Sex','Annotation Name','Assembly Stats Total Sequence Length','Assembly Stats Total Number of Chromosomes','Assembly Level','Assembly Release Date','WGS project accession','Assembly Stats Number of Scaffolds'

gabridinosauro avatar Dec 16 '24 02:12 gabridinosauro

hi @gabridinosauro sorry, we need to format the taxdb for you/make the lineages CSV - there are some instructions in the plant repo mentioned above, but I'm not sure if they'll work. It's on our list!

ctb avatar Dec 26 '24 18:12 ctb

@gabridinosauro I made you a tax file that should work, though I didn't test it. Try downloading here: https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-plant-2024-07/genbank-plants-2024-07.lineages.csv.gz

let me know if it works!

bluegenes avatar Dec 26 '24 18:12 bluegenes

it worked thanks!!! I downloaded the file from the github repo tho, not from that link because I did not have permissions from that link. Overall, the method seem to work although I seem to get some false positives. I get hits of tropical plants when my wild rodents are living in Canada. I think it might be because there is bacterial contamination in plant genomes? I am trying now to use kmer 51 instead of 31. Another idea would be to use chloroplast genomes instead, I can try building my own database.

gabridinosauro avatar Dec 28 '24 04:12 gabridinosauro

hi @gabridinosauro we have some new code that makes it easy to sketch all genomes under any given NCBI taxonomic rank; it's not particularly usable by others yet, but I wanted to link it for you, since you were part of the inspiration. see: https://github.com/sourmash-bio/sourmash/issues/3487

basically, if you can locate one or more NCBI taxonomic IDs under which you want all the things sketched, we can produce said database :)

ctb avatar Jan 13 '25 15:01 ctb

@gabridinosauro updated reference databases described here: https://github.com/sourmash-bio/sourmash/issues/3504

ctb avatar Jan 22 '25 12:01 ctb