stdlib
stdlib copied to clipboard
Combinatorial procedures
Motivation
Functions or subroutines that take an array (or a string) and return an array (of higher rank, or of strings) with combinatorial compositions of the input values. Common facilities (e.g. Python's itertools) with example results given here in a non-Fortran nested format:
-
products([1, 2, 3], r=2)
returns[[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 2], [1, 3], [2, 3], [3, 3]]
-
permutations([1, 2, 3], r=2)
returns[[2, 1], [3, 1], [1, 2], [3, 2], [1, 3], [2, 3]]
-
combinations([1, 2, 3], r=2)
returns[[2, 1], [3, 1], [3, 2]]
-
combinations_with_replacement([1, 2, 3], r=2)
returns[[1, 1], [2, 1], [3, 1], [2, 2], [3, 2], [3, 3]]
These examples are in column-major order, which makes sense for indices but less so for other elements. If this behavior is consistent, for strings the result would not be lexically ordered:
-
products('ABC', r=2)
returns['AA', 'BA', 'CA', 'AB', 'BB', 'CB', 'AC', 'BC', 'CC']
-
permutations('ABC', r=2)
returns['BA', 'CA', 'AB', 'CB', 'AC', 'BC']
-
combinations('ABC', r=2)
returns['BA', 'CA', 'CB']
-
combinations_with_replacement('ABC', r=2)
returns['AA', 'BA', 'CA', 'BB', 'CB', 'CC']
The procedures can be extended to lists.
Questions
- What names/features should be used (e.g.
combinations_with_replacement
is quite long)? - Should the results be in column-major order?
- Should some be merged into a single procedure?
- Should be there alternative generator-like routines to get the next iteration from a given or saved state?
Prior Art
- Rosetta Code: Printing permutations
- https://link.springer.com/content/pdf/10.3758%2FBF03201295.pdf
Additional Information
- https://github.com/fortran-lang/stdlib/issues/1 mentions permutations on the side
- https://github.com/fortran-lang/stdlib/issues/404 mentions a single permutation step as a special case of a
sample
procedure
What names/features should be used (e.g. combinations_with_replacement is quite long)?
I like the names. For combinations_with_replacement
, what do you think about using an optional argument to combinations
to create this special case:
combinations(data, r, with_replacement=.true.)
or similar. By default, with_replacement
would be .false.
. This form is longer than the original name combinations_with_replacement
, but I think it makes sense to merge very similar functionality into one function.
Is with_replacement
a common word for this? Something like with_duplicates
makes a little more sense to me, but I don't know if it's better.
Should the results be in column-major order?
Yes, so the result of products([1, 2, 3], r=2)
would have the shape [2, 9]
. Likewise for other functions.
Should some be merged into a single procedure?
Yes, see above what I thought about combinations
. Others are okay as they are.
Should be there alternative generator-like routines to get the next iteration from a given or saved state?
I think it could be useful, let's get some feedback on it. If yes, I think it should be a separate PR so that we keep PRs as small as possible.
Does this belong in stdlib_math
or stdlib_stats
?
Thanks for your input!
I have thought about a with_replacement
optional argument. And if that can be merged, maybe all of them should be. The naming is not great to begin with and not consistent as far as I know. (By the way, usually it's either "replacement" or "repetition".)
Apparently "row-major order" is a confusing term. In all cases the number of items should be the first dimension and the number of compositions the second. The question is about the order of those compositions. What comes after "aa", is it "ba" or "ab"? Is it [1,2] or [2,1] after [1,1]? If the first entry increases, it mimics the array-logic of Fortran, but last-entry increases would lead to lexically ordered strings (and/or arrays). Should it be different for strings than for other items?
Feedback sounds great.
Another idea: How about functions that calculate just the size/shape of the results. Those are usually simple expressions like factorial or powers but would be quite convenient to first check if the result is small enough.
I am not sure where this belongs.