Inquiry on result interpretation
Hi
I am attempting to dereplicate my reconstructed MAGs at the species level using the following commands:
In="CheckM2 output"
In2="Folder possessing MAGs"
Out="Output folder"
mkdir -p "$Out"
dRep dereplicate "$Out" -p 80 -comp 50 -con 10 -sa 0.965 --S_algorithm gANI -nc 0.6 --genomeInfo "$In" -g "$In2"/*
The thresholds (0.965 and 0.6) are those suggested in a previous study. Reference: Varghese, Neha J., et al. "Microbial species delineation using whole genome sequences." Nucleic Acids Research 43.14 (2015): 6761-6771.
The attached files are the dereplicated MAGs and their taxonomy profiles from GTDB-Tk.
If you look at the results, some of the dereplicated MAGs are assigned to the same species.
For example, bin29 and bin51 are both assigned to Nitrospira_sp009594995, with GCA_009594995.1 as the closest relative.
Does this mean that dereplication does not work well with these parameters? How should I interpret this result, and how can I dereplicate the MAGs at the species level if this is not the proper method?
Thanks !
Hi @LanSabb - GTDB uses a very complex method for its dereplication, which includes things like preserving the names of historic species on a case-by-case basis and adjusting thresholds between ~94.5% and 97% depending on the species in question. Because of this, it's not possible to exactly recapitulate GTDB species with dRep, but the thresholds you're using will get you very close. If you'd like to do species dereplication exactly like GTDB, you could always just manually pick the highest-scoring member of each GTDB species.
Best, Matt
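The manual selection suggested above (keeping the highest-scoring member of each GTDB species) could be scripted roughly as follows. This is only a minimal sketch under several assumptions: the function names are hypothetical, the lineage strings are shortened illustrative stand-ins for the classification strings in a GTDB-Tk summary, and the scores are made-up numbers standing in for whatever per-genome quality score you choose (for example the scores dRep reports). In practice you would load both mappings from your own output tables.

```python
from typing import Dict


def species_of(lineage: str) -> str:
    """Return the species name from a GTDB-style lineage string, or '' if absent."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("s__"):
            return rank[3:]
    return ""


def pick_best_per_species(
    classifications: Dict[str, str], scores: Dict[str, float]
) -> Dict[str, str]:
    """Map each species to the genome with the highest score.

    Genomes without a species-level assignment or without a score are skipped;
    on a tie, the genome encountered first is kept.
    """
    best: Dict[str, str] = {}
    for genome, lineage in classifications.items():
        species = species_of(lineage)
        if not species or genome not in scores:
            continue
        current = best.get(species)
        if current is None or scores[genome] > scores[current]:
            best[species] = genome
    return best


# Illustrative input only: lineages are abbreviated and scores are invented.
example_classifications = {
    "bin29": "g__Nitrospira;s__Nitrospira_sp009594995",
    "bin51": "g__Nitrospira;s__Nitrospira_sp009594995",
    "binX": "g__Example;s__Example_species",  # hypothetical genome
    "binY": "g__Example;s__",                 # no species-level assignment
}
example_scores = {"bin29": 80.0, "bin51": 90.0, "binX": 70.0, "binY": 60.0}

example_best = pick_best_per_species(example_classifications, example_scores)
for species, genome in sorted(example_best.items()):
    print(species, genome)
```

MAGs with no species-level assignment are left out by this sketch, so they would still need to be dereplicated separately, for instance with the dRep thresholds used in the question.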