wdl
wdl copied to clipboard
Proposal: `chunk` function, to create subsets of an array
Array[Array[X]] split(Array[X], Int)
Create subsets of an array of a specified maximum size n
. All sub-arrays will have n
elements, except possibly the last sub-array which will have at most n
elements but may have fewer if the array size is not a multiple of n
.
Example:
Array[Int] a = [1, 2, 3, 4, 5]
Array[Array[Int]] = split(a, 2) # => [[1, 2], [3, 4], [5]]
There have been identified at least two use cases where this would be useful:
- Downsampling an array: if I want only the first
n
elements of an array:subset(a, n)[0]
- Breaking up a very large array for nested scatters:
Array a = [...] Array chunks = split(a, 100) scatter (chunk in chunks) { scatter (i in chunk) { call foo { input: i = i } } } Array[Int] results = flatten(foo.result)
Draft implementation: https://github.com/openwdl/wdl/tree/398-split-function
Workflow validation on a portion of the array prior to running the full workflow would also be valuable, though is a testing-based context. I also wonder about additional use cases if the subsetting process functioned in multiple ways: in order with the leftover at the end as the default, but possibly other ways such as random sampling of Int
items in the array.
~~Could we call it something like slice
? Better aligns with other programming terminology and avoids confusing the function with anything related to sets.~~
Edit: I realised I was actually mistaken what the PR was about. It's about batching a list into a number of sublists - i was a bit excited. Still think that subset
isn't a great term, but maybe something something like chunk
?
I agree actually. I had to try to not write slice. I was originally (in Slack) thinking slice
or select_n
and it would be given an Array[Int]
but just return a shorter Array[Int]
because I really wanted a more equivalent of select_first
, but for arbitrary n (and select_first
would be the same as select_n
where n=1). But @jdidion 's suggestion to make it slightly more broadly relevant by making it into an Array[Array[Int]]
means then when I want to select the first N values, I can take the first entry in the Array of Array's OR iterate over all of the entries, seems good too. Might work in more use cases too.
split is a good option. In Scala, this function is called "grouped".
@vortexing I think you make a great point! the select_n
is actually a special case of @jdidion's suggestion where the user woudl do something like subset(...,3)[0]
, making this a very flexible option indeed. Not that I have much love of scala, but I kind of like the sound of something like grouped
or group
. It defines the operation pretty exactly
So:
subset
and slice
and select
all imply that what is returned is only a portion of the original, IMHO, which if we are returning Array[Array[Int]
would be suboptimal. Also slice
in python is what my original thought was and won't return all the entries.
group
would work wrt returning the entire array, but it implies
chunk
seems nice, or split
(split
in Linux is basically the same as what is happening here... what is what in Scala doesn't impact me much b/c.... no Scala knowledge... ), divide
seems too literal/math, numpy has array_split
which is real specific but accurate.
Ok, let's go with split
. I've updated the title and proposal.
What happens to the remaining elements of the array if the length of the array is not a multiple of n?
- Is the MAX size of each split the value of n, and thus the last split will contain n or fewer elements?
Or - will the number of elements in each split be n except where the number of elements in the array is not a multiple of n, in which case the last split will contain: n+ remaining elements?
"All sub-arrays will have n elements, except possibly the last sub-array if the array size is not a multiple of n."
I will update this to make it clear that the last array will have at most n
elements. The example also illustrates this.
Good to be explicit. That's what I ASSumed you were aiming for re: example, but good to spell it out in words too.
To add my voice, I too would be interested in an array split/chunk feature.
I was looking for this for writing a hierarchical merge and was surprised that it didn't exist already.
It is possible to achieve what @jdidion is asking using the development version of WDL:
version development
workflow main {
Array[Int] a = [1, 2, 3, 4, 5]
Int n = 2
scatter (i in range(length(a))) { Int b = i / n }
Map[Int, Array[Int]] c = collect_by_key(zip(b, a))
scatter (i in range((length(a) + n - 1) / n)) { Array[Int] d = c[i] }
output { Array[Array[Int]] e = d }
}
This will indeed generate output:
{
"main.e": [[1, 2], [3, 4], [5]]
}
If the function unzip()
was supported by Cromwell, it would be even more simple:
version development
workflow main {
Array[Int] a = [1, 2, 3, 4, 5]
Int n = 2
scatter (i in range(length(a))) { Int b = i / n }
output { Array[Array[Int]] c = unzip(as_pairs(collect_by_key(zip(b, a)))).right }
}
I would actually rather reserve split()
for a function that is sort of the reverse for function sep(String, Array[String])
so to have a functionality similar to what python has with the String split()
method
Also, if you want to keep the size of each slice as similarly sized as possible, the following might be an even better solution:
version development
workflow main {
input {
Array[Int] a = range(22)
Int n = 7
}
Int m = (length(a) + n - 1) / n
scatter (i in range(length(a))) { Int b = i * m / length(a) }
Map[Int, Array[Int]] c = collect_by_key(zip(b, a))
scatter (i in range(m)) { Array[Int] d = c[i] }
output { Array[Array[Int]] e = d }
}
This will generate the output:
{
"main.e": [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21]]
}
Although it might be possible to do this (albeit in the development version) it certainly isn't as obvious what you are doing as having some kind of group
, split
, window
type function.
The above example of how to do this right now is a very good example of why this should be a baked in function.