Random API
Discussions about the Random API proposal will be held here.
Pull request: https://github.com/Kotlin/KEEP/pull/132
Recently I came across a twitter thread on randomness by @colmmacc and wanted to share it here.
A short summary:
-
Implementations like
list[rand() % list.size]provide biased distributions and should be treated as bugs.- Example: say
rand()returns an int uniformly from 0 toMAX_INT, andlist.sizeis approximately2/3 * MAX_INT. Then the code above will return elements from the first half of the list twice as often as from the other half.
- Example: say
-
Similar solutions with floating point numbers
list[ (randDouble() * list.size).toInt() ]are also flawed because double numbers from0.0to1.0are not uniform. -
The correct solution for getting a random element of a list is:
fun random(size: Int): Int { val limit = RAND_MAX - RAND_MAX % size while(1) { val r = rand(): if (r < limit) { return r % size; } } } val e = list[random(list.size)] -
some more tweets on shuffling arrays...
I hope this will help us design Kotlin random APIs in a way that prevents such common misuses.
The original thread: https://twitter.com/colmmacc/status/1012719876706840578
On the kotlin forum there was just a post about adding shuffle for arrays as well. There is also an issue about this. https://discuss.kotlinlang.org/t/adding-shuffle-to-array-t-and-bytearray-intarray-et-al/8687 https://youtrack.jetbrains.com/issue/KT-25651
I think it might be worth to consider to add those functions to this proposal.
@voddan Thanks for the link.
We have taken care of the implementation and API, so that our users could avoid these and some other pitfalls of random number generation. For example, there are overloads of nextInt that generate an unbiased number in a given range, so one doesn't have to worry about rand() % n in their code.
If you like, you can review the implementation here.
I would like to pitch for some random sampling methods such as
fun <E> List<E>.sampleRandom(random: Random = DefaultRandom): E
fun <E> List<E>.sampleRandomList(size: Int, random: Random = DefaultRandom): List<E>
fun <E> List<E>.sampleRandomSet(size: Int, random: Random = DefaultRandom): Set<E>
fun <E> List<E>.sampleRandomSet(size: Int, weights: List<Int> = LazyList(this.size) {i -> 1}): Set<E>
The benefit of having such functions in the stdlib is preventing developers from implementing naive-but-incorrect implementations themselves. For example, for sampleRandomList it could be tempting to do something like the following, which is incorrect because the result has a different distribution than initial Random.nextInt()
fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> {
val selectedIndices = mutableSetOf<Int>()
repeat(sampleSize) {
do {
val i: Int = Random.nextInt(indices)
val wasNew = selectedIndices.add(i)
} while(!wasNew)
}
return this.slice(selectedIndices)
}
Another inefficient approach could be the implementation below, which is very easy to implement with the current proposal, and thus even more probable to be a source of problems.
fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> = shuffled().take(sampleSize)
As far as I understand, one of the correct and efficient algorithms for sampling is "Vose's Alias" algorithm, of which I am frankly totally ignorant, but you can read about it here http://www.keithschwarz.com/darts-dice-coins/
@voddan I think it's better to start another KEEP for this family of functions.
@ilya-g IMHO sampling and shuffling functions should go in the same KEEP. Partly because how similar they are, partly because how easy it is to (inefficiently) implement one through the other. That said, maybe it is better to move shuffling out of the scope of this proposal, and combine it with the sampling APIs.
@voddan shuffling is already supported in the standard library, just an overload that takes a particular RNG is missing in common. On the other hand, sampling is a completely new API with its own design questions.
There's an interesting article about the performance and correctness of selecting random numbers from a range here: http://www.pcg-random.org/posts/bounded-rands.html See the conclusion for what is probably the best overall algorithm (in C, might be different for the JVM).
@chrismiller Thanks for the article, it is very enlightening.
Currently we use the "Debiased Mod (x1)" algorithm to select numbers from a range. "Debiased Int Multiplication" looks enticing, however we don't have the efficient 32-bit int multiplication that returns the upper part of the product in Kotlin/JS, not speaking of 64-bit ints for which there is no such multiplication even in Kotlin/JVM.
We need to decide which one to use before releasing repeatable RNG, because different algorithms can produce different number sequences from the same source of random bits.
I don't know if this can be used here, but it seems like there is a better algorithm than the currently used xorwow (which is part of the Xorshift family): Xoroshiro128+ and related
xoroshiro128+ (named after its operations: XOR, rotate, shift, rotate) is a pseudorandom number generator intended as a successor to xorshift+. Instead of perpetuating Marsaglia's tradition of xorshift as a basic operation, xoroshiro128+ uses a shift/rotate-based linear transformation designed by Sebastiano Vigna in collaboration with David Blackman. The result is a significant improvement in speed (well below a nanosecond per integer) and a significant improvement in statistical quality.
Apparently xorwow also fails a few tests in the BigCrush test suite.
This is the authors site where they recommend xoshiro256** (or xoroshiro128** if only half the space needed) and present test results and (public domain) implementations.
Given everyone who participated in this discussion, I'm guessing that you'd all be interested in this proposal as well.
Add a CSPRNG (eg. SecureRandom) to the Kotlin StdLib #184
I'd love your feedback.
Might I suggest including a sampleWithReplacement in the sampling API
I would like to pitch for some random sampling methods such as
fun <E> List<E>.sampleRandom(random: Random = DefaultRandom): E fun <E> List<E>.sampleRandomList(size: Int, random: Random = DefaultRandom): List<E> fun <E> List<E>.sampleRandomSet(size: Int, random: Random = DefaultRandom): Set<E> fun <E> List<E>.sampleRandomSet(size: Int, weights: List<Int> = LazyList(this.size) {i -> 1}): Set<E>The benefit of having such functions in the stdlib is preventing developers from implementing naive-but-incorrect implementations themselves. For example, for
sampleRandomListit could be tempting to do something like the following, which is incorrect because the result has a different distribution than initialRandom.nextInt()fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> { val selectedIndices = mutableSetOf<Int>() repeat(sampleSize) { do { val i: Int = Random.nextInt(indices) val wasNew = selectedIndices.add(i) } while(!wasNew) } return this.slice(selectedIndices) }Another inefficient approach could be the implementation below, which is very easy to implement with the current proposal, and thus even more probable to be a source of problems.
fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> = shuffled().take(sampleSize)As far as I understand, one of the correct and efficient algorithms for sampling is "Vose's Alias" algorithm, of which I am frankly totally ignorant, but you can read about it here http://www.keithschwarz.com/darts-dice-coins/
Can you please explain, why the two methods you present for taking random samples are "naive-but-incorrect"? "Vose's Alias" which you cited is an algorithm for getting random numbers with a biased (non-uniform) distribution. I cannot see how this fits into your example.
The first one is incorrect because its slow and possibly non-terminating.
When sample size is close to list length (n) I believe the program has an expected(in the probability sense) time complexity of O(n log(n))
The second one is incorrect as it is very slow when list length is large and sample size is small.
The first one is incorrect because its slow and possibly non-terminating.
When sample size is close to list length
(n)I believe the program has an expected(in the probability sense) time complexity ofO(n log(n))The second one is incorrect as it is very slow when list length is large and sample size is small.
Of course, considering performance, the first is only recommendable for small subsets and the second one for large subsets. But they are both statistically sound. You indicated that they have a "different distribution than Random.nextInt" and I just wanted to point out that their results is just as good (or bad) as the built in RNG/nextInt. Oh, and complexity near ´n´ is actually ´O(n²)´.
@quickstep24, @voddan was the one to suggest that it's a different distribution. I have no idea on that matter.
Just looking for an update to see if this "bug" is gonna be fixed anytime soon. I am resorting to using java.util.Random or java.security.SecureRandom
just wondering.
stavares Just looking for an update to see if this "bug" is gonna be fixed anytime soon.
Can you clarify what exactly you are interested in?
Check out kotlin.random.Random. It's been there since Kotlin 1.3. There is no built-in support for secure random, though, but on JVM you can always do SecureRandom().asKotlinRandom() and use rich API of Kotlin's Random class in your code.
stavares Just looking for an update to see if this "bug" is gonna be fixed anytime soon.
Can you clarify what exactly you are interested in?
Check out
kotlin.random.Random. It's been there since Kotlin 1.3. There is no built-in support for secure random, though, but on JVM you can always doSecureRandom().asKotlinRandom()and use rich API of Kotlin'sRandomclass in your code.
I just wanted to know if it was ever going to be crypto secure but you answered my question. Thx.