galaxy
Enable all-vs-all collection analysis patterns.
Galaxy matches corresponding datasets when multiple collections are used to map over a tool - this is a variation of a dot product pattern when mapping over collections. This can easily be adapted to perform all-vs-all mapping if we first produce two new input collections containing the Cartesian product (in math terms) or Cross Join (in SQL terms) of the inputs, where every combination is lined up at some corresponding element between the first and second list.
These are available in a list(n) x list(m) -> list(nxm) version (a Cartesian product that produces two flat lists) and a list(n) x list(m) -> list(n):list(m) version (a cross product that produces two nested lists).
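As a rough illustration of the two output shapes, here is a minimal Python sketch that uses plain lists in place of Galaxy collections. The function names `cross_product_flat` and `cross_product_nested` are made up for this example and are not the tool IDs or Galaxy's implementation; element identifiers are ignored entirely.

```python
from itertools import product


def cross_product_flat(a, b):
    """list(n) x list(m) -> two flat lists of length n*m, lined up by position."""
    pairs = list(product(a, b))
    return [x for x, _ in pairs], [y for _, y in pairs]


def cross_product_nested(a, b):
    """list(n) x list(m) -> two nested lists shaped list(n):list(m)."""
    out_a = [[x for _ in b] for x in a]  # each outer element repeats one item of a
    out_b = [list(b) for _ in a]         # each outer element is a full copy of b
    return out_a, out_b


flat_a, flat_b = cross_product_flat(["a1", "a2"], ["b1", "b2", "b3"])
nested_a, nested_b = cross_product_nested(["a1", "a2"], ["b1", "b2", "b3"])
```

In both versions, position `i` (or position `[i][j]` in the nested case) of the first output lines up with the same position of the second output, which is what lets ordinary collection mapping cover every combination.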
There are some cool pictures put together by Seven Bridges that demonstrate the CWL variants of these concepts. The middle part semantics are different but the inputs and resulting structures are the same:
Nested:
Flat:
After two lists have been run through one of these two tools - the result is two new lists that can be passed into another tool to perform all-against-all operations using Galaxy's normal collection mapping semantics. There is no extra work or thought needed here - it really is as simple as running the respective collections through one of these tools and then passing the output corresponding to the input to the next tool and the result will be an all-vs-all operation.
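To make the "no extra work" point concrete, the sketch below models normal collection mapping as a positional zip and feeds it the two flat outputs. Everything here is an assumption for illustration: `map_over` is a stand-in for Galaxy's dot product mapping, `compare` is a hypothetical two-input tool, and plain Python lists stand in for collections.

```python
from itertools import product


def map_over(tool, left, right):
    """Stand-in for dot product mapping: pair up elements by position."""
    if len(left) != len(right):
        raise ValueError("mapped-over lists must have the same length")
    return [tool(l, r) for l, r in zip(left, right)]


def compare(query, target):
    """Hypothetical two-input tool."""
    return f"{query}-vs-{target}"


queries = ["q1", "q2"]
targets = ["t1", "t2", "t3"]

# Flat Cartesian product: two lists of length n*m, lined up by position.
pairs = list(product(queries, targets))
flat_queries = [q for q, _ in pairs]
flat_targets = [t for _, t in pairs]

# Ordinary positional mapping over the two new lists is now all-vs-all.
results = map_over(compare, flat_queries, flat_targets)
```

The mapping step itself is unchanged; only the inputs were expanded beforehand, so `results` holds one entry per query/target combination.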
The choice of which tool to use will depend on how you want to continue to process the all-against-all results after the next step in an analysis. My sense is the flat version is "easier" to think about and pick through manually and the nested version preserves more structure if additional collection operation tools will be used to filter or aggregate the results.
Some considerations:
Naming?
I have been calling them cross products because that is what CWL calls them - but they are Cartesian products not cross products. I guess SQL uses the terminology "Cross Join" which makes sense. I think part of the confusion is the related terminology and that mathematically they both are often represented with a big "X" symbol - but mathematically this operation is definitely not a cross product 😢.
I've called the tools cross products in this cut of the PR, but I think we should abandon the CWL naming and come up with more exact terminology.
Apply Rules?
I do not believe the Apply Rules tool semantics would allow these operations, but the Apply Rules tool could certainly be used to convert the result of the flat version to the nested version or vice versa - so no metadata is really lost per se between the two versions. I think it is still worth including both versions though - they both have utility (both, for instance, are baked into CWL's workflow semantics - https://docs.sevenbridges.com/docs/about-parallelizing-tool-executions#nested-cross-product) and avoiding complex Apply Rules programs for simple workflows is probably ideal.
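The claim that nothing is lost between the two shapes can be sketched in plain Python. This is not an Apply Rules program: `flat_to_nested` and `nested_to_flat` are hypothetical helpers, and the inner list length `m` is passed in explicitly where a real conversion would presumably recover the grouping from element identifiers.

```python
def flat_to_nested(flat, m):
    """Regroup a flat list of length n*m into n inner lists of length m."""
    if m <= 0 or len(flat) % m != 0:
        raise ValueError("flat list length must be a multiple of m")
    return [flat[i:i + m] for i in range(0, len(flat), m)]


def nested_to_flat(nested):
    """Flatten list(n):list(m) back into a single list of length n*m."""
    return [item for inner in nested for item in inner]


flat = ["a1", "a1", "a1", "a2", "a2", "a2"]
nested = flat_to_nested(flat, 3)
round_trip = nested_to_flat(nested)
```

The round trip returns the original flat list, which is the sense in which the two versions carry the same information.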
One Tool vs Two?
Marius and I agree that a few simpler tools for these kinds of operations are better. The tool help can be more focused, and avoiding the conditional and conditional outputs makes the static analysis done, for instance, by the workflow editor simpler.
Editor/Tool Options vs Collection Operation Options
TODO: we've not generally gone in this direction and probably should steer clear.
How to test the changes?
(Select all options that apply)
- [x] I've included appropriate automated tests.
License
- [x] I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.
I gave this a try and it looks nice, I'd be in favor of merging this. https://github.com/galaxyproject/galaxy/compare/dev...mvdbeek:galaxy:cross_product?expand=1 contains a test and fixes for running database operation tools within conditional steps, can you pull this in if it looks good to you @jmchilton ?
@jmchilton could you rebase and take it out of draft if there's nothing more to add ?
The tools have no help - but if you're willing to merge without that I'm willing to pull it out of draft 😅.
Ah, a minimal help text would probably be a good idea 😆, while we figure out something graphical for the output section ?