How to create a small test DB?
Is it possible to create a small test DB of the latest release, e.g. using only 5 genomes? The requirement would be that GTDB-Tk can work with this DB, even though most genomes would obviously not be assigned. We would need this for testing purposes for a wrapper of GTDB-Tk we added to Galaxy (https://github.com/galaxyproject/tools-iuc/tree/main/tools/gtdbtk), as well as for metagenomic / MAG workflows (https://github.com/galaxyproject/iwc/pull/769).
If you could guide us on creating such a mock DB, that would be great.
Hi,
Could you let me know what you'd like to test in GTDB-Tk? Are you aiming for a full ANI comparison, or are you focusing on either classify_wf or de_novo_wf? How small would you like the test to be? Is storage a limiting factor, or is memory the primary concern?
Hi @pchaumeil, I'm not the OP but myself and other collaborators working on the nf-core/mag pipeline have a similar question.
We have asked about this previously (see this post on the GTDB forum by @jfy133: https://forum.gtdb.ecogenomic.org/t/existance-of-or-how-to-create-a-tiny-version-of-gtdb-for-use-for-gtdb-tk-testing/665) - essentially, we are using the classify_wf command as a module in our pipeline, but we are currently unable to quickly test the pipeline on GitHub CI runners, as the GTDB database is too large (the runners have 14 GB of disk space) and the process itself requires too much memory. This means that to test pipeline runs we have to run local tests, which can be costly and slow even with our small datasets.
We would like to know if there exists, or if it would be possible to create, a ‘tiny’/dummy version of the GTDB release tarballs (e.g. maybe containing just two or three genomes) that we can use to properly ‘simulate’ running GTDB in our tests. From our perspective, the main output files we are interested in are the bac120.summary.tsv and ar53.summary.tsv describing the classification of MAGs - so building a small DB that can run the pipeline and output a summary file in the same format as a full run, even if most of the input test genomes are described as "Unclassified bacterium", is what we are after, in a way that is both memory- and space-efficient.
Alternatively, a description of the minimal required files/structure (and how to generate them) for our own mini replica of the database would also be helpful - either by building it up from a minimal set of genomes, or by stripping out most of the data from an actual release.
We have also noted that you have test data for unit testing - maybe it is possible to modify this to produce a compliant database?
Thank you for the quick reply @pchaumeil. The test is meant as a functional test, i.e. does the tool run and produce output? The output does not even have to be biologically reasonable. Currently, we are using the classify_wf workflow. If you could provide e.g. a small genome (a fraction is also fine) as input and a minimal DB that contains this genome, or a small set including it, that would be great. For our tool tests ideally < 1 MB, but if a larger DB is required for it to function, that is also fine - generally, as small as possible :) Thanks a lot for the help, we will make sure to credit your work and support!
Thanks for the feedback!
I think it should be possible to put together a mockup. The idea would be to pick a small clade from the archaeal tree and reduce all phyla not represented in that clade to a single genome each, bringing the total to around 25 genomes. To run the test, you will need to use the --skip_ani_screen flag so that GTDB-Tk skips ANI comparisons against the full database and pushes your genome through the full pipeline (identify, align, classify). The genome you are testing should also be part of the selected clade so that pplacer can assign it to an unmodified branch.
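A minimal sketch of such a test run, assuming the mock reference data has already been extracted locally (all paths and directory names below are placeholders, not files shipped with GTDB-Tk):

```bash
# Placeholder paths: point GTDBTK_DATA_PATH at wherever the pruned reference
# package lives, and --genome_dir at the directory holding the test genome(s).
export GTDBTK_DATA_PATH=/path/to/mock_gtdbtk_data

# --skip_ani_screen bypasses the Mash/ANI pre-screen against the full database,
# so each genome goes through identify -> align -> classify (pplacer placement).
gtdbtk classify_wf \
    --genome_dir genomes/ \
    --extension fna \
    --skip_ani_screen \
    --out_dir classify_wf_out \
    --cpus 2
```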
With that setup, I expect the GTDB-Tk database will be around 100 to 200 MB in size and require only a few hundred MB of RAM. For now, this is probably the lightest version I can come up with (requirements < 1 MB seem hard to reach 🙂).
We are releasing the new GTDB and GTDB-Tk package next week, so I will see if I can put together a small mockup database around then or shortly after.
Hi @pchaumeil
Conceptually that sounds great! As close to 100 MB as possible would be perfect. For the nf-core context, our preferred limit is 100 MB, as this is the maximum file size that GitHub allows for upload (most of our test data is stored there), so that works for us.
I have two extra questions mostly out of curiosity:
- Is there a particular reason why you think archaea would be better? Is it because the genomes and clades are smaller? (Just wondering, given that most of our existing pipeline-level test data around MAG building is typically bacterial - but we can work around this.)
- Do you have an automated workflow for building the database? One alternative that we considered previously, and which would work for our GTDB-Tk classify_wf 'module' test, was to build a database on the fly every time we wanted to test the classify_wf module, as that would allow us more wiggle room for database size. (This unfortunately doesn't work for our pipeline-level tests, so your offer to build the mockup database would still be extremely appreciated 🙏)
- I usually default to the archaeal tree when pruning because it has a small number of phyla, which makes it easier to maintain the full topology using fewer genomes (typically fewer than 20). Another reason is that GTDB-Tk v2 doesn’t use the divide-and-conquer approach for Archaea, so it always runs with the full tree. For testing purposes, though, I think I’ll create the mockup using the bacterial tree instead: I’ll just pick a small clade (maybe 4-5 genomes) plus one genome as an outgroup. If we choose the bacterial mockup, we'll need to run our tests using the --full_tree flag, since doing the divide-and-conquer setup for a mockup would require more effort (see the sketch after this list).
- Currently, I don’t have an automated workflow for building the Tk database. I use a set of scripts for specific steps, but they’re quite dependent on our current environment, so building the database still involves a fair amount of manual work.
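For the bacterial variant, the test invocation would simply add --full_tree to the classify_wf call sketched earlier (again with placeholder paths for illustration):

```bash
# Placeholder paths; --full_tree makes classify_wf place genomes in the single
# unsplit bacterial reference tree rather than the divide-and-conquer split trees.
export GTDBTK_DATA_PATH=/path/to/mock_gtdbtk_data

gtdbtk classify_wf \
    --genome_dir genomes/ \
    --extension fna \
    --skip_ani_screen \
    --full_tree \
    --out_dir classify_wf_out \
    --cpus 2
```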
OK cool, thanks for the clarifications 🙏
I look forward to the new release and the mini-test mockup :)
The first draft of the mock GTDB-Tk database is ready. You can find it here
It contains a minimal setup to run gtdbtk classify_wf using the provided test genome.
A HOWTO.txt file is included with instructions.
Please test it and let me know if everything works as expected.
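For anyone trying it out, a minimal end-to-end sketch might look like the following, assuming the mockup archive has already been downloaded; the archive name, data directory and genome directory below are placeholders, and the authoritative steps are the ones in the bundled HOWTO.txt:

```bash
# Placeholder names: the real archive, reference-data directory and bundled test
# genome are whatever ships in the mockup release (see its HOWTO.txt).
tar -xzf gtdbtk_mockup.tar.gz
export GTDBTK_DATA_PATH="$PWD/mockup_data"

gtdbtk classify_wf \
    --genome_dir test_genomes/ \
    --extension fna \
    --skip_ani_screen \
    --out_dir out/ \
    --cpus 2

# The classification summary should end up in the usual place, e.g.:
head out/gtdbtk.bac120.summary.tsv
```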
Hi @pchaumeil, many thanks for this! I have given it a test run - it seems to run nicely both with the provided test genome and with some of the test MAGs generated from my own test dataset. Total memory usage for 8 genomes was ~216 MB!
Thank you @pchaumeil! I will also see how it works with the nf-core module and pipeline and report back :D (But the fact that @prototaxites says it is already working well is a very good sign!)
Nice, this is very useful. I'll add it to our pipelines for testing :)
Thanks a lot! We will also use it for Galaxy tests!
@pchaumeil I can also confirm it works for me too :) - thank you very much!
The only issue is the download can be slow... would you be OK with us making a copy of the mockup and hosting it in a closer location? (Without warranty on your end of course, i.e. it would be up to us to update our copy of the mockup per release for our tests - and if yes, @paulzierep, you're welcome to use it too if necessary.)
Yes, I would also appreciate a fast and stable location. How about Zenodo? Thanks!
If Zenodo, then I think it is best for @pchaumeil to upload it, but otherwise a quick and dirty solution would be GitHub (we have a dedicated repo for nf-core: https://github.com/nf-core/test-datasets), or we also have some space on AWS - but 27 MB isn't so bad.
@pchaumeil A wee follow-up question: if one were to build a Mash DB for the handful of genomes in the test data, would GTDB-Tk be able to run with the addition of ANI screening?
For the time being we've uploaded the mock DB to GitHub here (https://github.com/nf-core/test-datasets/tree/mag/databases/gtdbtk), as our tests were really slow (the download was taking >1h in some cases), but we are happy to remove it if you prefer @pchaumeil
Thanks everyone for the feedback!
@prototaxites : Regarding the Mash database, you should be able to build it using the small set of genomes provided in the mockup and run GTDB-Tk with the ANI pre-screening step. I don’t see why that would be a problem.
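A minimal sketch of what that could look like, with placeholder paths; passing --mash_db (instead of --skip_ani_screen) tells GTDB-Tk where to write, or reuse, a Mash sketch of the reference genomes so the ANI pre-screen runs:

```bash
# Placeholder paths: GTDB-Tk builds the Mash sketch of the (mock) reference
# genomes at the --mash_db location on the first run and reuses it afterwards.
export GTDBTK_DATA_PATH=/path/to/mock_gtdbtk_data

gtdbtk classify_wf \
    --genome_dir genomes/ \
    --extension fna \
    --mash_db mockup_mash.msh \
    --out_dir classify_wf_out \
    --cpus 2
```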
@jfy133 : Not a problem at all 😊 I'm glad the test set is small enough to upload to GitHub. I'm currently exploring options to have GTDB/GTDB-Tk data mirrored at various locations worldwide (e.g., across universities and research centers). It might take a few months to set up, but hopefully this will help resolve the slow download speeds.
Oh, that's very exciting - thanks @pchaumeil!
Quick update,
I’m still working with UQ to improve the download speed on our Australian mirror. In the meantime, a new mirror has been deployed in Aalborg (https://data.gtdb.aau.ecogenomic.org/), which should offer better performance. I’ll update the conda recipe to point to the latest.
Cheers, Pierre