tools-iuc
MITOS: try to speed up tool form
The options filter seems to be evaluated on tool load (https://github.com/galaxyproject/tools-iuc/blob/b991ebbd9f59ffff1b3eae09f65b59b02ad7cf8e/tools/mitos/mitos.xml#L47), which slows down tool loading for large histories / collections.
Maybe replace it with a validator.
I'd maybe also check the performance on the Galaxy side, I don't think there's an inherent reason why this has to be slow.
Yep, seems like a good opportunity. Any suggestions as to which part(s) of the code could be the culprit?
I'd start by profiling https://github.com/mvdbeek/galaxy/blob/c0fc0a853edb102a685f1343efc04dace4b044aa/lib/galaxy/webapps/galaxy/api/tools.py#L203
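For anyone who has not profiled Python before, a minimal sketch using the standard library `cProfile` could look like the following. `build_tool_form` is a hypothetical stand-in for the slow call (it is not a Galaxy function), and `profile_call` is just an illustrative helper name.

```python
import cProfile
import io
import pstats


def build_tool_form():
    # Hypothetical stand-in for the slow code path; replace with the real call.
    return sum(i * i for i in range(100_000))


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the expensive call chains show up first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(build_tool_form)
    print(report)
```

Sorting by cumulative time is usually the quickest way to see which top-level function dominates; the per-function "tottime" column then helps narrow it down.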
Added some debug statements to my best guesses and loaded the tool with an active history containing
- 1 fasta including ~8000 sequences
- 1 collection with ~8000 fasta datasets each containing 1 sequence
On loading I see:
galaxy.tools.parameters.basic ERROR 2022-06-24 12:00:08,723 [pN:main.1,p:416895,tN:WSGI_2] DataToolParameter.from_json START
galaxy.tools.parameters.basic ERROR 2022-06-24 12:00:08,725 [pN:main.1,p:416895,tN:WSGI_2] BaseDataToolParameter.get_initial_value START
galaxy.tools.parameters.basic ERROR 2022-06-24 12:00:08,746 [pN:main.1,p:416895,tN:WSGI_2] DataToolParameter.to_dict START
galaxy.tools.parameters.basic ERROR 2022-06-24 12:00:08,754 [pN:main.1,p:416895,tN:WSGI_2] BaseDataToolParameter.get_initial_value START
...
galaxy.tools.parameters.basic ERROR 2022-06-24 12:06:42,827 [pN:main.1,p:416895,tN:WSGI_2] DataToolParameter.to_dict END
- so the tool needs 6 min to load :(
- some END debug statements are missing (but maybe I forgot some places to add debug statements)
- seems odd that the functions are called twice, or?
It is also strange that the process starts again after the tool form has loaded (and the above messages are seen in the log).
This was tested on https://github.com/galaxyproject/galaxy/commit/83d110bef72608ae7030dabbd27781761edd4d01
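One way to avoid missing END messages is to wrap the functions of interest in a decorator that logs END in a `finally` block, so it is emitted on early returns and exceptions too. This is only a sketch: `log_start_end` and `DemoParameter` are hypothetical names for illustration, not part of Galaxy.

```python
import functools
import logging
import time

log = logging.getLogger(__name__)


def log_start_end(func):
    """Log START/END around func; END is logged even if func raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        name = func.__qualname__
        log.debug("%s START", name)
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions alike.
            log.debug("%s END (%.3f s)", name, time.perf_counter() - started)

    return wrapper


class DemoParameter:
    # Hypothetical class standing in for a tool parameter.
    @log_start_end
    def to_dict(self):
        return {"options": []}
```

With the elapsed time in the END message it is also easier to see which of the repeated calls is the slow one.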
I have never done profiling before. I will see if I can work along the docs.
So ... the DataToolParameter.to_dict function uses a DatasetMatcherFactory, which always uses DatasetCollectionMatcher:
https://github.com/galaxyproject/galaxy/blob/60d851445c3f3792543cce96204fc12eea86befd/lib/galaxy/tools/parameters/basic.py#L2226
https://github.com/galaxyproject/galaxy/blob/60d851445c3f3792543cce96204fc12eea86befd/lib/galaxy/tools/parameters/dataset_matcher.py#L83
The runtime of to_dict seems to drop to close to zero if SummaryDatasetCollectionMatcher is used (I hardcoded it in a single experiment), but for this we would need to initialize the matcher factory using the tool parameter.
No idea how to get this into the call of the to_dict function ... trans seems to contain no reference to the tool, or?
Parts of my previous message were wrong:
We would get SummaryDatasetCollectionMatcher, but since the parameter has dynamic options we get the slow DatasetCollectionMatcher here
https://github.com/galaxyproject/galaxy/blob/60d851445c3f3792543cce96204fc12eea86befd/lib/galaxy/tools/parameters/dataset_matcher.py#L84
because of
https://github.com/galaxyproject/galaxy/blob/60d851445c3f3792543cce96204fc12eea86befd/lib/galaxy/tools/parameters/dataset_matcher.py#L43
Maybe that's why options_filter_attribute is marked as deprecated: https://docs.galaxyproject.org/en/master/dev/schema.html#other-ways-to-dynamically-generate-options
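The matcher selection described above can be pictured with a toy model. This is not Galaxy's code: `ToyDataInput`, `ToyTool`, `can_use_summary_matcher` and `choose_matcher` are hypothetical names, and the only rule modelled is the one observed in this thread, namely that a data input with dynamic options forces the slow per-element matcher.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDataInput:
    # Hypothetical stand-in for a data parameter of a tool.
    name: str
    has_dynamic_options: bool = False


@dataclass
class ToyTool:
    # Hypothetical stand-in for a tool with its data inputs.
    data_inputs: List[ToyDataInput] = field(default_factory=list)


def can_use_summary_matcher(tool):
    """The fast summary path is only assumed safe if no input filters its options dynamically."""
    return not any(inp.has_dynamic_options for inp in tool.data_inputs)


def choose_matcher(tool):
    # "summary" models SummaryDatasetCollectionMatcher,
    # "per-element" models the slow DatasetCollectionMatcher.
    return "summary" if can_use_summary_matcher(tool) else "per-element"


if __name__ == "__main__":
    plain = ToyTool([ToyDataInput("input")])
    filtered = ToyTool([ToyDataInput("input", has_dynamic_options=True)])
    print(choose_matcher(plain), choose_matcher(filtered))
```

Under this model, dropping the dynamic options from the data parameter is what lets the tool take the fast path again.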
Fixed in https://github.com/galaxyproject/tools-iuc/pull/5152