Database update?

Open cmkobel opened this issue 11 months ago • 2 comments

How old is the UniRef database that CheckM2 currently uses? I see a reference to 3 June 2018 in the main publication, but I am not sure whether it has been updated since.

Am I correct in assuming that you downloaded UniRef100 and the ID mappings (https://www.uniprot.org/help/downloads), and then kept only the proteins that have a KEGG Orthology mapping?
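To make the assumed filtering step concrete, here is a minimal sketch. It is not CheckM2's actual build code. It assumes the ID mapping file has three tab-separated columns (accession, ID type, ID value) and that the type labels `UniRef100` and `KO` are present, which may only hold for older releases. The function name `uniref100_to_ko` and the toy rows are hypothetical.

```python
from collections import defaultdict


def uniref100_to_ko(idmapping_lines):
    """Return {UniRef100 cluster ID: set of KO IDs} from idmapping-style rows.

    Assumption: each row is "accession<TAB>id_type<TAB>id_value", and the
    id_type labels "UniRef100" and "KO" exist in the file being parsed.
    """
    acc_to_cluster = {}
    acc_to_kos = defaultdict(set)

    for line in idmapping_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed rows
        acc, id_type, value = parts
        if id_type == "UniRef100":
            acc_to_cluster[acc] = value
        elif id_type == "KO":
            acc_to_kos[acc].add(value)

    # Keep only clusters where at least one member accession has a KO mapping.
    cluster_to_kos = defaultdict(set)
    for acc, kos in acc_to_kos.items():
        cluster = acc_to_cluster.get(acc)
        if cluster is not None:
            cluster_to_kos[cluster].update(kos)
    return dict(cluster_to_kos)


# Toy rows (made up for illustration, not real UniProt entries).
example_rows = [
    "P00001\tUniRef100\tUniRef100_P00001",
    "P00001\tKO\tK00001",
    "P00002\tUniRef100\tUniRef100_P00002",  # no KO row, so this cluster is dropped
    "P00003\tUniRef100\tUniRef100_P00001",
    "P00003\tKO\tK00002",
]
mapping = uniref100_to_ko(example_rows)
```

On the toy rows, only `UniRef100_P00001` survives, carrying both KO IDs. For a real file one would pass an open file handle instead of a list and then subset the UniRef100 FASTA to the retained cluster IDs.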

Cheers.

cmkobel avatar Mar 12 '24 11:03 cmkobel

Hi,

Yes, that's correct. We used a 2018 database with KEGG-UniRef ID mappings during CheckM2 development, but UniRef has since stopped including KEGG ID mappings in its updates, so CheckM2 currently uses the last available database, from 2018. Because CheckM2 relies on fast DIAMOND-based protein annotation, we haven't switched to KEGG HMM searches. We are exploring an alternative annotation system that would apply DRAM (or other annotation tools, e.g. STRING/eggNOG) to the full GTDB protein database, but that is still at the benchmarking stage.

Although the annotation database is a bit old, we'll be using newly added publicly available genomes to update CheckM2 (the newest CheckM2 update, incorporating GTDB R214, should hopefully be out by the end of the month).

chklovski avatar Mar 12 '24 11:03 chklovski

Thanks for the quick answer! Okay, that explains why I have had such a hard time finding a mapping between UniRef and KEGG Orthology. Looking forward to testing the new protein setup. :)

Btw, I know you've been looking into using KEGG pathways as part of the completeness scoring (correct me if I am wrong). Do you think that Gene Ontology (GO) might be a better fit for pathway lookups, generally speaking?

cmkobel avatar Mar 12 '24 12:03 cmkobel