faker icon indicating copy to clipboard operation
faker copied to clipboard

faker.datatype.number often gives duplicates

Open frankandrobot opened this issue 10 months ago • 19 comments

Pre-Checks

Describe the bug

  • In unit tests running on our CI environment (and occasionally on our macs), the following code has been observed to throw a "duplicate error"---where it cannot find a unique integer
  • fakeId is called 50--100 times in total on the offending test runs.
  • the rate of failure on CI is around 1 in 1000 runs
  • haven't done the math but it is still very surprising because we are generating up the "max safe integer"... which is equal to 2^53 – 1... a very, very large number, so I don't expect 100 generations to render duplicates
export const fakeId = () =>
  faker.unique(() => faker.datatype.number({ max: Number.MAX_SAFE_INTEGER }), undefined, { maxTime: 100 }).toString();

Analysis

  • The mersenne implementation is very sus---why use this at all? For the most part, fakerjs seems to be a test helper, so IMO, this is over engineering. We can likely just use Math.random under the hood
  • Additionally, the implementation of unique is also sus on CI environments. Namely, this block: https://github.com/faker-js/faker/blob/next/src/modules/helpers/unique.ts#L114-L122
  • Most likely what's happening is the following---in our CI environments, we have a known issue with slowness. The mersenne implementation will unnecessarily return duplicates. And when both happen, now - startTime will be greater than maxTime, resulting in the thrown error

Suggestion:

  • A workaround we can do on our end is pass a very large maxTime.
  • but unique should rely on maxRetries only
  • faker should have an option to opt-out of using mersenne

Minimal reproduction code

  • see above

Additional Context

No response

Environment Info

OS: linux/macosx
Node: 16
faker: 7.3.0

Which module system do you use?

  • [ ] CJS
  • [ ] ESM

Used Package Manager

npm

frankandrobot avatar Aug 28 '23 15:08 frankandrobot

I used the following code:

import { faker } from "@faker-js/faker";

const set = new Set();
const limit = 10_000_000;

for (let i = 0; i < limit; i++) {
  set.add(faker.number.int());
  //if (i % 100_000 === 0) {
  //  console.log(i);
  //}
}


console.log("=====================================");
console.log("Runs:  ", limit);
console.log("Values:", set.size);
console.log("Ratio: ", set.size / limit);

Which returned the following values:

Runs: 10000000 Values: 9988326 Ratio: 0.9988326

Which is 12 duplicates per 10k generated values. This should not significantly slow down the generation process (unless you happen to run into a GC pause).


I used your code above to count the number of errors:

console.warn = () => {};

import { faker } from "@faker-js/faker";

const limit = 5_000_000;
let counter = 0;

for (let i = 0; i < limit; i++) {
  try {
    faker.helpers.unique(
      () => faker.number.int({ max: Number.MAX_SAFE_INTEGER }),
      undefined,
      { maxTime: 100 }
    );
  } catch (e) {
    counter++;
  }
  //if (i % 100_000 === 0) {
  //  console.log(i);
  //}
}

console.log("=====================================");
console.log("Runs:  ", limit);
console.log("Errors:", counter);
console.log("Ratio: ", counter / limit);

And I didn't get any in 5M executions, I didn't run it on a slim CI machine though.

So I assume it is caused by a slow CI pipeline host/GC pause.


  • A workaround we can do on our end is pass a very large maxTime.

This is the only workaround that I can currently think of except from implementing unique yourself. I'm not sure what you consider large but I assume it is dependent on the number of parallel processes/tests running on the pipeline host and it's computing power. We also have tests for the unique method implementation and while we do see some rare failures, these specifically seem to occur on the GitHub MacOs runners. So these are either especially slow or have a NodeJs bug/issue that wastes time. Please make sure you don't reset your seed to a static value causing the same values to be generated over and over again.

Have you tried using Faker v8?

  • but unique should rely on maxRetries only

We aren't happy with the current implementation either. We plan to remove/refactor the method (probably in v9).

The mersenne implementation is very sus---why use this at all? For the most part, fakerjs seems to be a test helper, so IMO, this is over engineering. We can likely just use Math.random under the hood

  • faker should have an option to opt-out of using mersenne

Faker uses mersenne to generate reproducible values, reproducible values are important for certain test cases and test debugging. We have an issue + PR to add support for custom generators:

  • #2195
  • #2284

ST-DDT avatar Aug 28 '23 21:08 ST-DDT

Please also read https://github.com/faker-js/faker/issues/1785 for more details and potentially you can already switch now to one of the new unique implementations out there, or even just copy-paste my suggestion and alter it to your own needs. The "good" thing is that unique is and never was coupled to Faker itself, because it does not depend on the mersenne/seed.

  • https://github.com/faker-js/faker/issues/1785

Shinigami92 avatar Aug 29 '23 06:08 Shinigami92

It does seem like faker.number.int returns more duplicates than expected for large max. If every int was equally likely to be selected, you would expect fewer duplicates as the max increased. However in fact for max above 10^10 there doesn't seem to be an increase in the runs needed before the first repeated value, it occurs after approx 80,000 values on average. Note than Number.MAX_SAFE_INTEGER is of the magnitude 10^15.

const {
  faker
} = require("@faker-js/faker");
const attempts = 100
for (let pow = 2; pow < 15; pow++) {
  const max = 10 ** pow;
  let total = 0
  for (let attempt = 0; attempt < attempts; attempt++) {
    total += timeToFirstDuplicate(max)
  }
  let average = Math.round(total / attempts)
  console.log("First duplicate for faker.number.int({max:10^" + pow + "}) occurs after avg " + average)
}

function timeToFirstDuplicate(max) {
  const set = new Set();
  let duplicates = false
  let count = 0

  do {
    const j = faker.number.int({
      max
    });
    if (set.has(j)) {
      duplicates = true
    } else {
      set.add(j)
      count++
    }
  } while (!duplicates)
  return count
}
First duplicate for faker.number.int({max:10^2}) occurs after avg 12
First duplicate for faker.number.int({max:10^3}) occurs after avg 38
First duplicate for faker.number.int({max:10^4}) occurs after avg 114
First duplicate for faker.number.int({max:10^5}) occurs after avg 394
First duplicate for faker.number.int({max:10^6}) occurs after avg 1250
First duplicate for faker.number.int({max:10^7}) occurs after avg 4205
First duplicate for faker.number.int({max:10^8}) occurs after avg 12820
First duplicate for faker.number.int({max:10^9}) occurs after avg 37629
First duplicate for faker.number.int({max:10^10}) occurs after avg 81947
First duplicate for faker.number.int({max:10^11}) occurs after avg 78749
First duplicate for faker.number.int({max:10^12}) occurs after avg 85969
First duplicate for faker.number.int({max:10^13}) occurs after avg 83313
First duplicate for faker.number.int({max:10^14}) occurs after avg 79924

Perhaps the way we have mersenne set up means the number of possible outputs is less than expected?

matthewmayer avatar Aug 29 '23 08:08 matthewmayer

This makes sense, as mersenne twister implementation returns around 2^32 seperate values which is approx 10^10, several orders of magnitude less than Number.MAX_SAFE_INTEGER

matthewmayer avatar Aug 29 '23 08:08 matthewmayer

If you uniformly draw a random number uniformly from [1…N] with replacement the expected number of draws before you get a repeat is approx sqrt(pi/2*N) so for a mersenne twister with 2^32 values we should expect a repeat approximately every sqrt(pi/2*2^32) = 82,137 trials.

Apologies for the mathematical diversion :D

matthewmayer avatar Aug 29 '23 08:08 matthewmayer

Send from my phone: what about multiplying faker.number.int() with faker.number.int(min: 1, max: 25000) 👀 This would increase the range but still is deterministic

// (2**32/2)/83000≈25000

We could move this multiplication also into our source to adjust it internally 🤔

But first let's hear the ideas from the others

Shinigami92 avatar Aug 29 '23 10:08 Shinigami92

A little loss in precision is to be expected in algorithmic randomness. IMO we should check whether there are certain values that are causing the duplicates or whether the duplicates are randomly distributed.

Also we could check if we could use a different access method from our twister. AFAICT there is one with higher precision or at least is says so.

ST-DDT avatar Aug 29 '23 10:08 ST-DDT

I tried switching from genrandReal2 to genrandRes53.

And the number of duplicates went from ~ $1 \over 10000$ to 0 (at 8kk runs).

Test-Code
import MersenneTwister19937 from "./twister";

const twister = new MersenneTwister19937();
twister.initGenrand(Math.random() * Number.MAX_SAFE_INTEGER);

const limit = 8_300_000; // JS object key limit!?
const data: Record<number, number> = {};

for (let i = 0; i < limit; i++) {
  const random = twister.genrandReal2();
  // const random = twister.genrandRes53();
  data[random] = (data[random] || 0) + 1;

  if (i % 100_000 === 0) {
    console.log("Runs:  ", i);
  }
}
console.log("Runs:  ", limit);
console.log("=====================================");
console.log("Calculating...");

const duplicateCount = Object.values(data).reduce((a, b) => a + b - 1, 0);
const duplicateMax = Object.values(data).reduce((a, b) => Math.max(a, b), 0) - 1;

console.log("=====================================");
console.log("Runs:      ", limit);
console.log("Duplicates:", duplicateCount);
console.log("Ratio:     ", duplicateCount / limit);
console.log("Max:       ", duplicateMax);
console.log("=====================================");

ST-DDT avatar Aug 29 '23 11:08 ST-DDT

I'll have a look at the bit magic in the implementation, when i have more time.

ST-DDT avatar Aug 29 '23 13:08 ST-DDT

I ran 500kkk invocations for faker.number.float() and it returns [0,1) wheres all the other methods return [0,max]. I also checked MersenneTwister19937.genrandInt32() and it returns [0,2^32) instead of [0,2^32] as described (checked with 225kkk invocations).

It looks like https://github.com/faker-js/faker/pull/1675 might have contributed to this.

  • https://github.com/faker-js/faker/pull/1675

ST-DDT avatar Aug 29 '23 16:08 ST-DDT

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?

IMO [0,1] is the correct version.

ST-DDT avatar Aug 29 '23 16:08 ST-DDT

faker.int.number() only returns even values when using very large max values. This is caused by the float() being truncated after 17 places + loosing a few due to math. Min ~ 0b00000000000000000000000000001000000000000000000000000 Max ~ 0b11111111111111111111111111111111000000000000000000000

ST-DDT avatar Aug 29 '23 20:08 ST-DDT

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?

IMO [0,1] is the correct version.

I think [0, 1) is the correct one. Math.random() also returns values in that range.

xDivisionByZerox avatar Aug 29 '23 20:08 xDivisionByZerox

I created https://github.com/faker-js/faker/pull/2357 to start working on a fix.

  • https://github.com/faker-js/faker/pull/2357

ST-DDT avatar Aug 29 '23 21:08 ST-DDT

Is [0,1) or [0,1] supposed to be the correct behavior for number.float()? IMO [0,1] is the correct version.

I think [0, 1) is the correct one. Math.random() also returns values in that range.

All our other methods return [min,max] though and AFAICT faker.number.float({precision}) does that as well (because it uses int() internally).

ST-DDT avatar Aug 29 '23 21:08 ST-DDT

With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.

ST-DDT avatar Aug 29 '23 21:08 ST-DDT

With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.

I don't like this. Just define what the function does and be done. String.prototype.substring doesn't have a option includeLastCharacter either.

xDivisionByZerox avatar Aug 29 '23 21:08 xDivisionByZerox

If someone needs that they should open a new feature request then.

ST-DDT avatar Aug 29 '23 21:08 ST-DDT

Team Decision

We will change mersenne to 53 bit precision to reduce the duplicates and create a separate PR for the float issues.

ST-DDT avatar Aug 31 '23 16:08 ST-DDT