KEEP icon indicating copy to clipboard operation
KEEP copied to clipboard

Random API

Open ilya-g opened this issue 7 years ago • 19 comments

Discussions about the Random API proposal will be held here.

Pull request: https://github.com/Kotlin/KEEP/pull/132

ilya-g avatar Jul 09 '18 13:07 ilya-g

Recently I came across a twitter thread on randomness by @colmmacc and wanted to share it here.

A short summary:

  • Implementations like list[rand() % list.size] provide biased distributions and should be treated as bugs.

    • Example: say rand() returns an int uniformly from 0 to MAX_INT, and list.size is approximately 2/3 * MAX_INT. Then the code above will return elements from the first half of the list twice as often as from the other half.
  • Similar solutions with floating point numbers list[ (randDouble() * list.size).toInt() ] are also flawed because double numbers from 0.0 to 1.0 are not uniform.

  • The correct solution for getting a random element of a list is:

      fun random(size: Int): Int {
          val limit = RAND_MAX - RAND_MAX % size
          while(1) {
               val r = rand():
    
               if (r < limit) {
                    return r % size;
               }
         }
      }
    
      val e = list[random(list.size)]
    
  • some more tweets on shuffling arrays...

I hope this will help us design Kotlin random APIs in a way that prevents such common misuses.

The original thread: https://twitter.com/colmmacc/status/1012719876706840578

voddan avatar Jul 22 '18 10:07 voddan

On the kotlin forum there was just a post about adding shuffle for arrays as well. There is also an issue about this. https://discuss.kotlinlang.org/t/adding-shuffle-to-array-t-and-bytearray-intarray-et-al/8687 https://youtrack.jetbrains.com/issue/KT-25651

I think it might be worth to consider to add those functions to this proposal.

Wasabi375 avatar Jul 22 '18 13:07 Wasabi375

@voddan Thanks for the link.

We have taken care of the implementation and API, so that our users could avoid these and some other pitfalls of random number generation. For example, there are overloads of nextInt that generate an unbiased number in a given range, so one doesn't have to worry about rand() % n in their code.

If you like, you can review the implementation here.

ilya-g avatar Jul 23 '18 00:07 ilya-g

I would like to pitch for some random sampling methods such as

fun <E> List<E>.sampleRandom(random: Random = DefaultRandom): E
fun <E> List<E>.sampleRandomList(size: Int, random: Random = DefaultRandom): List<E>
fun <E> List<E>.sampleRandomSet(size: Int, random: Random = DefaultRandom): Set<E>
fun <E> List<E>.sampleRandomSet(size: Int, weights: List<Int> = LazyList(this.size) {i -> 1}): Set<E>

The benefit of having such functions in the stdlib is preventing developers from implementing naive-but-incorrect implementations themselves. For example, for sampleRandomList it could be tempting to do something like the following, which is incorrect because the result has a different distribution than initial Random.nextInt()

fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> {
    val selectedIndices = mutableSetOf<Int>()

    repeat(sampleSize) {
        do {
            val i: Int = Random.nextInt(indices)
            val wasNew = selectedIndices.add(i)
        } while(!wasNew)
    }

    return this.slice(selectedIndices)
}

Another inefficient approach could be the implementation below, which is very easy to implement with the current proposal, and thus even more probable to be a source of problems.

fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> = shuffled().take(sampleSize)

As far as I understand, one of the correct and efficient algorithms for sampling is "Vose's Alias" algorithm, of which I am frankly totally ignorant, but you can read about it here http://www.keithschwarz.com/darts-dice-coins/

voddan avatar Jul 23 '18 11:07 voddan

@voddan I think it's better to start another KEEP for this family of functions.

ilya-g avatar Jul 23 '18 12:07 ilya-g

@ilya-g IMHO sampling and shuffling functions should go in the same KEEP. Partly because how similar they are, partly because how easy it is to (inefficiently) implement one through the other. That said, maybe it is better to move shuffling out of the scope of this proposal, and combine it with the sampling APIs.

voddan avatar Jul 23 '18 13:07 voddan

@voddan shuffling is already supported in the standard library, just an overload that takes a particular RNG is missing in common. On the other hand, sampling is a completely new API with its own design questions.

ilya-g avatar Jul 23 '18 14:07 ilya-g

There's an interesting article about the performance and correctness of selecting random numbers from a range here: http://www.pcg-random.org/posts/bounded-rands.html See the conclusion for what is probably the best overall algorithm (in C, might be different for the JVM).

chrismiller avatar Jul 30 '18 08:07 chrismiller

@chrismiller Thanks for the article, it is very enlightening.

Currently we use the "Debiased Mod (x1)" algorithm to select numbers from a range. "Debiased Int Multiplication" looks enticing, however we don't have the efficient 32-bit int multiplication that returns the upper part of the product in Kotlin/JS, not speaking of 64-bit ints for which there is no such multiplication even in Kotlin/JVM.

We need to decide which one to use before releasing repeatable RNG, because different algorithms can produce different number sequences from the same source of random bits.

ilya-g avatar Jul 31 '18 18:07 ilya-g

I don't know if this can be used here, but it seems like there is a better algorithm than the currently used xorwow (which is part of the Xorshift family): Xoroshiro128+ and related

xoroshiro128+ (named after its operations: XOR, rotate, shift, rotate) is a pseudorandom number generator intended as a successor to xorshift+. Instead of perpetuating Marsaglia's tradition of xorshift as a basic operation, xoroshiro128+ uses a shift/rotate-based linear transformation designed by Sebastiano Vigna in collaboration with David Blackman. The result is a significant improvement in speed (well below a nanosecond per integer) and a significant improvement in statistical quality.

Apparently xorwow also fails a few tests in the BigCrush test suite.

This is the authors site where they recommend xoshiro256** (or xoroshiro128** if only half the space needed) and present test results and (public domain) implementations.

jgandert avatar Nov 02 '18 12:11 jgandert

Given everyone who participated in this discussion, I'm guessing that you'd all be interested in this proposal as well.

Add a CSPRNG (eg. SecureRandom) to the Kotlin StdLib #184

I'd love your feedback.

JLLeitschuh avatar Apr 04 '19 19:04 JLLeitschuh

Might I suggest including a sampleWithReplacement in the sampling API

hXtreme avatar Mar 01 '20 06:03 hXtreme

I would like to pitch for some random sampling methods such as

fun <E> List<E>.sampleRandom(random: Random = DefaultRandom): E
fun <E> List<E>.sampleRandomList(size: Int, random: Random = DefaultRandom): List<E>
fun <E> List<E>.sampleRandomSet(size: Int, random: Random = DefaultRandom): Set<E>
fun <E> List<E>.sampleRandomSet(size: Int, weights: List<Int> = LazyList(this.size) {i -> 1}): Set<E>

The benefit of having such functions in the stdlib is preventing developers from implementing naive-but-incorrect implementations themselves. For example, for sampleRandomList it could be tempting to do something like the following, which is incorrect because the result has a different distribution than initial Random.nextInt()

fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> {
    val selectedIndices = mutableSetOf<Int>()

    repeat(sampleSize) {
        do {
            val i: Int = Random.nextInt(indices)
            val wasNew = selectedIndices.add(i)
        } while(!wasNew)
    }

    return this.slice(selectedIndices)
}

Another inefficient approach could be the implementation below, which is very easy to implement with the current proposal, and thus even more probable to be a source of problems.

fun <E> List<E>.sampleRandomList(sampleSize: Int): List<E> = shuffled().take(sampleSize)

As far as I understand, one of the correct and efficient algorithms for sampling is "Vose's Alias" algorithm, of which I am frankly totally ignorant, but you can read about it here http://www.keithschwarz.com/darts-dice-coins/

Can you please explain, why the two methods you present for taking random samples are "naive-but-incorrect"? "Vose's Alias" which you cited is an algorithm for getting random numbers with a biased (non-uniform) distribution. I cannot see how this fits into your example.

quickstep24 avatar Jan 20 '21 08:01 quickstep24

The first one is incorrect because its slow and possibly non-terminating.

When sample size is close to list length (n) I believe the program has an expected(in the probability sense) time complexity of O(n log(n))

The second one is incorrect as it is very slow when list length is large and sample size is small.

hXtreme avatar Jan 23 '21 16:01 hXtreme

The first one is incorrect because its slow and possibly non-terminating.

When sample size is close to list length (n) I believe the program has an expected(in the probability sense) time complexity of O(n log(n))

The second one is incorrect as it is very slow when list length is large and sample size is small.

Of course, considering performance, the first is only recommendable for small subsets and the second one for large subsets. But they are both statistically sound. You indicated that they have a "different distribution than Random.nextInt" and I just wanted to point out that their results is just as good (or bad) as the built in RNG/nextInt. Oh, and complexity near ´n´ is actually ´O(n²)´.

quickstep24 avatar Jan 24 '21 17:01 quickstep24

@quickstep24, @voddan was the one to suggest that it's a different distribution. I have no idea on that matter.

hXtreme avatar Jan 24 '21 22:01 hXtreme

Just looking for an update to see if this "bug" is gonna be fixed anytime soon. I am resorting to using java.util.Random or java.security.SecureRandom

just wondering.

stavares avatar Feb 25 '23 23:02 stavares

stavares Just looking for an update to see if this "bug" is gonna be fixed anytime soon.

Can you clarify what exactly you are interested in?

Check out kotlin.random.Random. It's been there since Kotlin 1.3. There is no built-in support for secure random, though, but on JVM you can always do SecureRandom().asKotlinRandom() and use rich API of Kotlin's Random class in your code.

elizarov avatar Feb 26 '23 20:02 elizarov

stavares Just looking for an update to see if this "bug" is gonna be fixed anytime soon.

Can you clarify what exactly you are interested in?

Check out kotlin.random.Random. It's been there since Kotlin 1.3. There is no built-in support for secure random, though, but on JVM you can always do SecureRandom().asKotlinRandom() and use rich API of Kotlin's Random class in your code.

I just wanted to know if it was ever going to be crypto secure but you answered my question. Thx.

stavares avatar Mar 05 '23 22:03 stavares