Enable binning of eukaryotic genomes
Since we are not eukaryotic experts, perhaps @hyphaltip would have some useful suggestions on how to implement this. Here are my thoughts:
- We would need to use a eukaryotic gene finder like Augustus. However, it probably wouldn't be incredibly accurate without RNA data (although I guess it could be an option to include that), and I don't know if anyone has ever tried it on metagenomes.
- There is a standard gene set that eukaryotic genomics efforts use to determine how complete their genome is. Sorry I don't have a citation to hand. However, while this would give us estimated completeness, I'm not sure whether it would give us purity because I'm not sure if they are all single copy.
- As outlined in the NSF proposal, we are planning on checking the taxonomic congruence of the single-copy markers we find in bacteria/archaea. Perhaps a similar method could be used to estimate the purity of eukaryotic bins?
- It might be a good idea to include contigs that are unclassified at the kingdom level in the analysis. I have long suspected that a lot of the eukaryotic portion ends up there because there are relatively few eukaryotic genomes in the NCBI database.
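To make the completeness/purity distinction in the bullets above concrete, here is a minimal sketch. It assumes one common pair of definitions (completeness = expected markers found at least once; purity = found markers that occur exactly once), which only makes sense if the markers really are single copy. The function name and its arguments are hypothetical and are not part of Autometa.

```python
from collections import Counter


def completeness_and_purity(marker_hits, expected_markers):
    """Estimate bin completeness and purity (both in percent).

    marker_hits: iterable of marker IDs, one entry per hit found in the bin
                 (a marker hit twice appears twice).
    expected_markers: iterable of marker IDs in the reference gene set.

    Assumed definitions (hypothetical, for illustration only):
      completeness = markers found at least once / markers expected
      purity       = markers found exactly once  / markers found at least once
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")

    # Count hits per marker, ignoring anything outside the reference set.
    counts = Counter(m for m in marker_hits if m in expected)

    present = len(counts)
    single_copy = sum(1 for c in counts.values() if c == 1)

    completeness = 100.0 * present / len(expected)
    purity = 100.0 * single_copy / present if present else 0.0
    return completeness, purity
```

If the reference set contains markers that are legitimately multi-copy in some lineages, the purity value from a calculation like this would be biased downwards, which is the concern raised above.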
I think doing this would broaden the appeal of Autometa. It would also be an interesting project for a student or outside contributor, and I have been asked about it several times at meetings.
If an outside contributor is interested in this, please let us know, because it might be better to work together. Also, please base PRs off dev rather than main, because dev is pretty different right now (it is Python 3 and most of the code has been refactored).
> There is a standard gene set that eukaryotic genomics efforts use to determine how complete their genome is. Sorry I don't have a citation to hand. However, while this would give us estimated completeness, I'm not sure whether it would give us purity because I'm not sure if they are all single copy.
Perhaps BUSCO is a tool for this?
Yes, BUSCO sets would make sense. These are HMMs, each with an expected length and a bitscore cutoff that was calibrated to avoid overcalling paralogs.
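As a rough illustration of how per-marker cutoffs like these could be applied to HMM search results, here is a minimal sketch. It is not BUSCO's actual implementation: the `Hit` record, the `filter_hits` function, and the `max_length_deviation` tolerance are all hypothetical, and the cutoff and length tables are assumed to have been parsed from the marker set beforehand.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One HMM hit against a predicted gene (hypothetical record layout)."""
    marker: str      # ID of the marker HMM
    gene: str        # ID of the predicted gene that was hit
    bitscore: float  # full-sequence bitscore of the hit
    length: int      # length of the matched protein, in residues


def filter_hits(
    hits: Iterable[Hit],
    score_cutoffs: Dict[str, float],
    expected_lengths: Dict[str, float],
    max_length_deviation: float = 0.5,  # assumed tolerance, not a BUSCO value
) -> List[Hit]:
    """Keep hits that pass the marker's bitscore cutoff and length check."""
    kept = []
    for hit in hits:
        cutoff = score_cutoffs.get(hit.marker)
        expected = expected_lengths.get(hit.marker)
        if cutoff is None or expected is None:
            continue  # marker has no calibration data, so skip it
        if hit.bitscore < cutoff:
            continue  # below the calibrated score: likely a paralog or noise
        if abs(hit.length - expected) > max_length_deviation * expected:
            continue  # too short or too long relative to the expected length
        kept.append(hit)
    return kept
```

The hits that survive could then be counted per marker to decide whether each marker is missing, single copy, or duplicated in a bin.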
We can try a couple of euk predictors. One thing we do in funannotate is train gene predictors with BUSCO gene sets. I'd be willing to try a couple of scenarios. We have some low-complexity (only 1-2 eukaryotes) lichen datasets that might be a good test set to try this on.
Sorry to be slow on this. I didn't see the mention and the summer has been crazy. But I'd love to work on this some with you.
Hi @hyphaltip, thanks for your willingness to help out on this. I've put together a couple of links regarding the comments above. Are there any other euk predictors you would suggest? If so, would you mind listing them? Thanks!
Resources
This would be super cool. I'd use this feature if it's implemented.
Hi @hyphaltip, could you provide a link to these test datasets? I think we have a few members in the lab who would be interested in tackling this as a little side project.