Collection of tools and databases required for the GTN that cannot easily be synchronized between the main servers:

- humann nucleotide and protein databases: https://github.com/galaxyproject/usegalaxy-tools/issues/855 (missing on AU/ORG)
- kraken2: 5 tutorials, multiple DBs; these should all be updated to the 2024 versions, both on the servers and in the tutorials
- blastn: https://training.galaxyproject.org/topics/assembly/tutorials/assembly-decontamination/tutorial.html
- DBs for https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html
- phyloseq IT (only on EU)
I will try to add to this list step by step. I still need to check the versions of the DBs on the servers, where they exist.
Another issue is tools that are configured differently in TPV. E.g. quast on .org cannot access the internet and therefore fails in this tutorial: https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/metagenomics-assembly/tutorial.html
Thanks for filing this @paulzierep ! Really appreciate it. cc @bgruening @natefoo @cat-bro since some databases/etc may be needed on each.
> quast on org cannot access the internet
I'm shocked that it needs internet access; that shouldn't be necessary :/ cc @jennaj
I'm looking through a couple of these to see if we can analyse this problem statically, but I fear we can't. E.g. for the blastn link, the tutorial mentions a database, but the workflow does not! Instead it uses a connected input parameter that's empty. Same for Kraken in https://training.galaxyproject.org/training-material/topics/microbiome/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html: they're empty input parameters. The test case for taxonomy_profiling_and_visualisation_with_krona-test.yml also doesn't mention the database, but maybe I've missed it? So without parsing the English-language text, there's no way to figure that out for that specific case.
For nanopore_preprocessing.ga, in the same tutorial, we could in theory. But it would involve:
- for every workflow
  - for every tool, recursively through subworkflows
    - for every parameter
      - for every supported server (~20-30)
        - check against /api/tools/build/{tool_id}?io_details=True to see if that value is available.
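The innermost check above could be sketched roughly as below. This is only a sketch: the endpoint path, response layout, and the `[label, value, selected]` shape of select options are assumptions about the tool build API, and a real implementation would still need the recursive subworkflow traversal described above.

```python
import json
import urllib.parse
import urllib.request

def select_options(server, tool_id):
    """Fetch a tool's build description from a Galaxy server and collect
    the legal values of every top-level select parameter.
    Endpoint path and response layout are assumptions."""
    url = (f"{server}/api/tools/{urllib.parse.quote(tool_id, safe='')}"
           "/build?io_details=True")
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    options = {}
    for inp in payload.get("inputs", []):
        if inp.get("type") == "select":
            # assumed encoding: options as [label, value, selected] triples
            options[inp["name"]] = {opt[1] for opt in inp.get("options", [])}
    return options

def missing_values(workflow_params, server_options):
    """Given the {name: value} pairs a workflow step uses and the
    {name: {legal values}} a server offers, return the parameters whose
    value that server cannot provide (e.g. an absent kraken2 DB)."""
    return {
        name: value
        for name, value in workflow_params.items()
        if name in server_options and value not in server_options[name]
    }
```

Even with this, the empty-input-parameter workflows discussed above would pass the check vacuously, which is exactly the false-negative problem.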
In a subset of cases it is technically possible, but I am very afraid of false positives/negatives there, which means we would maybe have to restrict it to tools we know use databases. Even then I'm not confident, I'd say, since it requires such deep parsing of Galaxy's data structures and API responses.
Especially since there's no flag or signal (as far as I can tell) in the API responses that a specific parameter is a "database select" parameter that might vary between servers. If that were exposed, i.e. if we had a convenient way to know which parameters are "database selects", this problem would look a lot more tractable (albeit still with the cases where the workflow doesn't match the tutorial and doesn't pre-select a DB).
> phyloseq IT (only in EU)
This should already be tested: we test for the tools used in the tutorial/workflow, so if you notice any bugs here please let me know! :) See the tools key in https://training.galaxyproject.org/training-material/api/topics/microbiome/tutorials/dada-16S//tutorial.json, where phyloseq is mentioned. That's the data that goes into compatibility checking.
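That per-tutorial metadata can be pulled and inspected programmatically. A minimal sketch, assuming the tutorial.json response is a JSON object with a top-level tools key listing tool IDs (the key is named above; the rest of the layout is an assumption):

```python
import json
import urllib.request

def tutorial_tools(tutorial_json_url):
    """Return the tool IDs the GTN API records for a tutorial; this is
    the list that feeds compatibility checking (layout assumed)."""
    with urllib.request.urlopen(tutorial_json_url) as resp:
        meta = json.load(resp)
    return meta.get("tools", [])

def mentions_tool(tool_ids, fragment):
    """True if any recorded tool ID contains the given fragment,
    e.g. 'phyloseq'."""
    return any(fragment in tool_id for tool_id in tool_ids)
```

Pointing `tutorial_tools` at the dada-16S tutorial.json linked above and checking `mentions_tool(..., "phyloseq")` would confirm whether phyloseq is part of that tutorial's compatibility data.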
I installed the phyloseq IT on .org, but I don't have data to test it. Can someone do that, please?
humann databases are updated as per the linked issue.
For kraken2, blastn, and the pathogen detection DBs: are there any specific details about what is needed there?