wdl icon indicating copy to clipboard operation
wdl copied to clipboard

Proposal: `chunk` function, to create subsets of an array

Open jdidion opened this issue 3 years ago • 16 comments

Array[Array[X]] split(Array[X], Int)

Create subsets of an array of a specified maximum size n. All sub-arrays will have n elements, except possibly the last sub-array which will have at most n elements but may have fewer if the array size is not a multiple of n.

Example:

Array[Int] a = [1, 2, 3, 4, 5]
Array[Array[Int]] = split(a, 2)  # => [[1, 2], [3, 4], [5]]

There have been identified at least two use cases where this would be useful:

  • Downsampling an array: if I want only the first n elements of an array: subset(a, n)[0]
  • Breaking up a very large array for nested scatters:
    Array a = [...]
    Array chunks = split(a, 100)
    scatter (chunk in chunks) {
        scatter (i in chunk) {
            call foo { input: i = i }
        }
    }
    Array[Int] results = flatten(foo.result)
    

Draft implementation: https://github.com/openwdl/wdl/tree/398-split-function

jdidion avatar Aug 27 '20 18:08 jdidion

Workflow validation on a portion of the array prior to running the full workflow would also be valuable, though is a testing-based context. I also wonder about additional use cases if the subsetting process functioned in multiple ways: in order with the leftover at the end as the default, but possibly other ways such as random sampling of Int items in the array.

vortexing avatar Aug 27 '20 18:08 vortexing

~~Could we call it something like slice? Better aligns with other programming terminology and avoids confusing the function with anything related to sets.~~

Edit: I realised I was actually mistaken what the PR was about. It's about batching a list into a number of sublists - i was a bit excited. Still think that subset isn't a great term, but maybe something something like chunk?

illusional avatar Aug 27 '20 23:08 illusional

I agree actually. I had to try to not write slice. I was originally (in Slack) thinking slice or select_n and it would be given an Array[Int] but just return a shorter Array[Int] because I really wanted a more equivalent of select_first, but for arbitrary n (and select_first would be the same as select_n where n=1). But @jdidion 's suggestion to make it slightly more broadly relevant by making it into an Array[Array[Int]] means then when I want to select the first N values, I can take the first entry in the Array of Array's OR iterate over all of the entries, seems good too. Might work in more use cases too.

vortexing avatar Aug 28 '20 01:08 vortexing

split is a good option. In Scala, this function is called "grouped".

jdidion avatar Aug 28 '20 01:08 jdidion

@vortexing I think you make a great point! the select_n is actually a special case of @jdidion's suggestion where the user woudl do something like subset(...,3)[0], making this a very flexible option indeed. Not that I have much love of scala, but I kind of like the sound of something like grouped or group. It defines the operation pretty exactly

patmagee avatar Aug 28 '20 14:08 patmagee

So: subset and slice and select all imply that what is returned is only a portion of the original, IMHO, which if we are returning Array[Array[Int] would be suboptimal. Also slice in python is what my original thought was and won't return all the entries.

group would work wrt returning the entire array, but it implies that there is some feature we're grouping by, but in this case we're just grouping by the order of appearance.

chunk seems nice, or split (split in Linux is basically the same as what is happening here... what is what in Scala doesn't impact me much b/c.... no Scala knowledge... ), divide seems too literal/math, numpy has array_split which is real specific but accurate.

vortexing avatar Aug 28 '20 16:08 vortexing

Ok, let's go with split. I've updated the title and proposal.

jdidion avatar Aug 28 '20 16:08 jdidion

What happens to the remaining elements of the array if the length of the array is not a multiple of n?

  1. Is the MAX size of each split the value of n, and thus the last split will contain n or fewer elements?
    Or
  2. will the number of elements in each split be n except where the number of elements in the array is not a multiple of n, in which case the last split will contain: n+ remaining elements?

vortexing avatar Aug 28 '20 16:08 vortexing

"All sub-arrays will have n elements, except possibly the last sub-array if the array size is not a multiple of n."

I will update this to make it clear that the last array will have at most n elements. The example also illustrates this.

jdidion avatar Aug 28 '20 17:08 jdidion

Good to be explicit. That's what I ASSumed you were aiming for re: example, but good to spell it out in words too.

vortexing avatar Aug 28 '20 17:08 vortexing

To add my voice, I too would be interested in an array split/chunk feature.

microbioticajon avatar Dec 07 '21 16:12 microbioticajon

I was looking for this for writing a hierarchical merge and was surprised that it didn't exist already.

lbergelson avatar May 11 '22 20:05 lbergelson

It is possible to achieve what @jdidion is asking using the development version of WDL:

version development

workflow main {
  Array[Int] a = [1, 2, 3, 4, 5]
  Int n = 2
  scatter (i in range(length(a))) { Int b = i / n }
  Map[Int, Array[Int]] c = collect_by_key(zip(b, a))
  scatter (i in range((length(a) + n - 1) / n)) { Array[Int] d = c[i] }
  output { Array[Array[Int]] e = d }
}

This will indeed generate output:

{
  "main.e": [[1, 2], [3, 4], [5]]
}

If the function unzip() was supported by Cromwell, it would be even more simple:

version development

workflow main {
  Array[Int] a = [1, 2, 3, 4, 5]
  Int n = 2
  scatter (i in range(length(a))) { Int b = i / n }
  output { Array[Array[Int]] c = unzip(as_pairs(collect_by_key(zip(b, a)))).right }
}

I would actually rather reserve split() for a function that is sort of the reverse for function sep(String, Array[String]) so to have a functionality similar to what python has with the String split() method

freeseek avatar Sep 06 '22 01:09 freeseek

Also, if you want to keep the size of each slice as similarly sized as possible, the following might be an even better solution:

version development

workflow main {
  input {
    Array[Int] a = range(22)
    Int n = 7
  }
  Int m = (length(a) + n - 1) / n
  scatter (i in range(length(a))) { Int b = i * m / length(a) }
  Map[Int, Array[Int]] c = collect_by_key(zip(b, a))
  scatter (i in range(m)) { Array[Int] d = c[i] }
  output { Array[Array[Int]] e = d }
}

This will generate the output:

{
  "main.e": [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15, 16], [17, 18, 19, 20, 21]]
}

freeseek avatar Sep 06 '22 14:09 freeseek

Although it might be possible to do this (albeit in the development version) it certainly isn't as obvious what you are doing as having some kind of group, split, window type function.

markjschreiber avatar Dec 08 '22 16:12 markjschreiber

The above example of how to do this right now is a very good example of why this should be a baked in function.

lbergelson avatar Dec 16 '22 19:12 lbergelson