faker.datatype.number often gives duplicates
Pre-Checks
- [X] Follow our Code of Conduct.
- [X] Read the Contributing Guidelines.
- [X] Read the docs.
- [X] Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
- [X] Make sure this is a Faker issue and not related to a combination with another package.
- [X] Check that this is a concrete bug. For Q&A open a GitHub Discussion or join our Discord Chat Server.
- [X] The provided reproduction is a minimal reproducible example of the bug.
- [X] I am willing to provide a PR.
Describe the bug
- In unit tests running on our CI environment (and occasionally on our Macs), the following code has been observed to throw a "duplicate error", i.e. it fails to find a unique integer.
- `fakeId` is called 50-100 times in total on the offending test runs.
- The rate of failure on CI is around 1 in 1000 runs.
- I haven't done the math, but it is still very surprising, because we are generating up to `Number.MAX_SAFE_INTEGER`, which is equal to 2^53 - 1, a very, very large number, so I don't expect 100 generations to produce duplicates.
export const fakeId = () =>
faker.unique(() => faker.datatype.number({ max: Number.MAX_SAFE_INTEGER }), undefined, { maxTime: 100 }).toString();
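For context, a quick birthday-problem estimate (my own back-of-envelope arithmetic, not from the faker source) confirms the intuition: collisions among ~100 draws from a 2^53-sized range should be essentially impossible.

```javascript
// Birthday-problem approximation (my own estimate, not faker code):
// P(at least one collision among n uniform draws from N values)
//   ≈ 1 - exp(-n*(n-1) / (2*N))
function collisionProbability(n, N) {
  return 1 - Math.exp((-n * (n - 1)) / (2 * N));
}

// 100 draws from a 2^53-sized range: collisions essentially never happen.
console.log(collisionProbability(100, 2 ** 53)); // ~5.5e-13
// Even a 2^32-sized space would only collide about once per million runs,
// so duplicates alone can't explain a 1-in-1000 failure rate.
console.log(collisionProbability(100, 2 ** 32)); // ~1.2e-6
```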
Analysis
- The mersenne implementation is very sus. Why use this at all? For the most part, fakerjs seems to be a test helper, so IMO this is over-engineering. We could likely just use `Math.random` under the hood.
- Additionally, the implementation of `unique` is also sus in CI environments. Namely, this block: https://github.com/faker-js/faker/blob/next/src/modules/helpers/unique.ts#L114-L122. Most likely what's happening is the following: our CI environments have a known issue with slowness, and the mersenne implementation unnecessarily returns duplicates. When both happen, `now - startTime` will be greater than `maxTime`, resulting in the thrown error.
Suggestions:
- A workaround we can do on our end is to pass a very large `maxTime`.
- But `unique` should rely on `maxRetries` only.
- Faker should have an option to opt out of using mersenne.
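As a sketch of the second suggestion, a retry-count-based helper (my own illustration, not faker's API; `uniqueByRetries` and its parameters are hypothetical names) would be immune to slow CI hosts because it never consults the wall clock:

```javascript
// Hypothetical retry-only unique helper: gives up after a fixed number of
// retries instead of a wall-clock budget, so a slow host can't cause
// spurious failures.
function uniqueByRetries(generate, store = new Set(), maxRetries = 50) {
  for (let retry = 0; retry <= maxRetries; retry++) {
    const value = generate();
    if (!store.has(value)) {
      store.add(value); // remember it so later calls stay unique
      return value;
    }
  }
  throw new Error(`No unique value found after ${maxRetries} retries`);
}

// Usage, mirroring the fakeId helper above (one shared store per test file):
const seenIds = new Set();
const fakeId = () =>
  uniqueByRetries(
    () => Math.floor(Math.random() * Number.MAX_SAFE_INTEGER),
    seenIds
  ).toString();
```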
Minimal reproduction code
- see above
Additional Context
No response
Environment Info
OS: Linux / macOS
Node: 16
faker: 7.3.0
Which module system do you use?
- [ ] CJS
- [ ] ESM
Used Package Manager
npm
I used the following code:
import { faker } from "@faker-js/faker";
const set = new Set();
const limit = 10_000_000;
for (let i = 0; i < limit; i++) {
set.add(faker.number.int());
//if (i % 100_000 === 0) {
// console.log(i);
//}
}
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Values:", set.size);
console.log("Ratio: ", set.size / limit);
Which returned the following values:
Runs:   10000000
Values: 9988326
Ratio:  0.9988326
That is about 12 duplicates per 10k generated values. This should not significantly slow down the generation process (unless you happen to run into a GC pause).
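For what it's worth, the observed duplicate count is close to the birthday-problem estimate for an effective output space of about 2^32 values (my own arithmetic; the 2^32 figure is a hypothesis, not taken from the faker source):

```javascript
// Birthday approximation: expected duplicates among n draws from N
// equally likely values is roughly n^2 / (2 * N).
const n = 10_000_000;
const N = 2 ** 32; // hypothesis: the twister yields only ~2^32 distinct values
console.log(Math.round((n * n) / (2 * N))); // 11642, close to the 11674 observed
```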
I used your code above to count the number of errors:
console.warn = () => {};
import { faker } from "@faker-js/faker";
const limit = 5_000_000;
let counter = 0;
for (let i = 0; i < limit; i++) {
try {
faker.helpers.unique(
() => faker.number.int({ max: Number.MAX_SAFE_INTEGER }),
undefined,
{ maxTime: 100 }
);
} catch (e) {
counter++;
}
//if (i % 100_000 === 0) {
// console.log(i);
//}
}
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Errors:", counter);
console.log("Ratio: ", counter / limit);
And I didn't get any errors in 5M executions, though I didn't run it on a slim CI machine.
So I assume it is caused by a slow CI pipeline host/GC pause.
- A workaround we can do on our end is pass a very large maxTime.
This is the only workaround that I can currently think of, other than implementing unique yourself. I'm not sure what you consider large, but I assume it depends on the number of parallel processes/tests running on the pipeline host and its computing power. We also have tests for the unique method implementation, and while we do see some rare failures, these specifically seem to occur on the GitHub macOS runners. So these are either especially slow or have a Node.js bug/issue that wastes time. Please make sure you don't reset your seed to a static value, causing the same values to be generated over and over again.
Have you tried using Faker v8?
- but unique should rely on maxRetries only
We aren't happy with the current implementation either. We plan to remove/refactor the method (probably in v9).
The mersenne implementation is very sus---why use this at all? For the most part, fakerjs seems to be a test helper, so IMO, this is over engineering. We can likely just use Math.random under the hood
- faker should have an option to opt-out of using mersenne
Faker uses mersenne to generate reproducible values, reproducible values are important for certain test cases and test debugging. We have an issue + PR to add support for custom generators:
- #2195
- #2284
Please also read https://github.com/faker-js/faker/issues/1785 for more details; potentially you can already switch to one of the new unique implementations out there, or even just copy-paste my suggestion and adapt it to your own needs. The "good" thing is that unique is not, and never was, coupled to Faker itself, because it does not depend on the mersenne twister/seed.
- https://github.com/faker-js/faker/issues/1785
It does seem like faker.number.int returns more duplicates than expected for large max. If every int were equally likely to be selected, you would expect fewer duplicates as the max increased. In fact, however, for max above 10^10 there doesn't seem to be an increase in the runs needed before the first repeated value; it occurs after approx 80,000 values on average. Note that Number.MAX_SAFE_INTEGER is of the magnitude 10^15.
const { faker } = require("@faker-js/faker");

const attempts = 100;
for (let pow = 2; pow < 15; pow++) {
  const max = 10 ** pow;
  let total = 0;
  for (let attempt = 0; attempt < attempts; attempt++) {
    total += timeToFirstDuplicate(max);
  }
  const average = Math.round(total / attempts);
  console.log("First duplicate for faker.number.int({max:10^" + pow + "}) occurs after avg " + average);
}

// Counts how many distinct values are generated before the first repeat.
function timeToFirstDuplicate(max) {
  const set = new Set();
  let count = 0;
  while (true) {
    const j = faker.number.int({ max });
    if (set.has(j)) {
      return count;
    }
    set.add(j);
    count++;
  }
}
First duplicate for faker.number.int({max:10^2}) occurs after avg 12
First duplicate for faker.number.int({max:10^3}) occurs after avg 38
First duplicate for faker.number.int({max:10^4}) occurs after avg 114
First duplicate for faker.number.int({max:10^5}) occurs after avg 394
First duplicate for faker.number.int({max:10^6}) occurs after avg 1250
First duplicate for faker.number.int({max:10^7}) occurs after avg 4205
First duplicate for faker.number.int({max:10^8}) occurs after avg 12820
First duplicate for faker.number.int({max:10^9}) occurs after avg 37629
First duplicate for faker.number.int({max:10^10}) occurs after avg 81947
First duplicate for faker.number.int({max:10^11}) occurs after avg 78749
First duplicate for faker.number.int({max:10^12}) occurs after avg 85969
First duplicate for faker.number.int({max:10^13}) occurs after avg 83313
First duplicate for faker.number.int({max:10^14}) occurs after avg 79924
Perhaps the way we have mersenne set up means the number of possible outputs is less than expected?
This makes sense, as the mersenne twister implementation returns around 2^32 separate values, which is approx 10^10, several orders of magnitude less than Number.MAX_SAFE_INTEGER. If you uniformly draw a random number from [1..N] with replacement, the expected number of draws before you get a repeat is approx sqrt(pi/2 * N), so for a mersenne twister with 2^32 values we should expect a repeat approximately every sqrt(pi/2 * 2^32) = 82,137 trials.
Apologies for the mathematical diversion :D
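The estimate is easy to verify numerically (a quick check of the arithmetic, nothing faker-specific):

```javascript
// Expected number of draws before the first repeat when sampling
// uniformly with replacement from N values: ~ sqrt(pi/2 * N).
const expectedFirstRepeat = (N) => Math.sqrt((Math.PI / 2) * N);
console.log(Math.round(expectedFirstRepeat(2 ** 32))); // 82137
```

This matches the ~80,000 averages measured empirically above.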
Sent from my phone: what about multiplying faker.number.int() with faker.number.int({ min: 1, max: 25000 })? This would increase the range but still be deterministic.

// (2**32 / 2) / 83000 ≈ 25000

We could also move this multiplication into our source to adjust the range internally. But first let's hear the ideas from the others.
A little loss in precision is to be expected in algorithmic randomness. IMO we should check whether there are certain values that are causing the duplicates or whether the duplicates are randomly distributed. Also, we could check whether we could use a different access method on our twister. AFAICT there is one with higher precision, or at least it says so.
I tried switching from genrandReal2 to genrandRes53, and the number of duplicates went from roughly 1 in 10,000 to 0 (at ~8.3M runs).
Test-Code
import MersenneTwister19937 from "./twister";
const twister = new MersenneTwister19937();
twister.initGenrand(Math.random() * Number.MAX_SAFE_INTEGER);
const limit = 8_300_000; // JS object key limit!?
const data: Record<number, number> = {};
for (let i = 0; i < limit; i++) {
const random = twister.genrandReal2();
// const random = twister.genrandRes53();
data[random] = (data[random] || 0) + 1;
if (i % 100_000 === 0) {
console.log("Runs: ", i);
}
}
console.log("Runs: ", limit);
console.log("=====================================");
console.log("Calculating...");
const duplicateCount = Object.values(data).reduce((a, b) => a + b - 1, 0);
const duplicateMax = Object.values(data).reduce((a, b) => Math.max(a, b), 0) - 1;
console.log("=====================================");
console.log("Runs: ", limit);
console.log("Duplicates:", duplicateCount);
console.log("Ratio: ", duplicateCount / limit);
console.log("Max: ", duplicateMax);
console.log("=====================================");
I'll have a look at the bit magic in the implementation, when i have more time.
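For reference, the genrand_res53 trick from Matsumoto and Nishimura's reference MT19937 implementation combines two 32-bit draws into one double with full 53-bit precision. A self-contained sketch (using a tiny xorshift32 as a stand-in for the twister's 32-bit output, NOT faker's actual generator):

```javascript
// Stand-in 32-bit source (xorshift32), not faker's mersenne twister;
// it only supplies uint32 values so the combination step can be shown.
function makeUint32Source(seed) {
  let state = seed >>> 0;
  return () => {
    state ^= state << 13; state >>>= 0;
    state ^= state >>> 17;
    state ^= state << 5; state >>>= 0;
    return state;
  };
}

// genrand_res53-style combination: the top 27 bits of one draw and the
// top 26 bits of another form a double with 53 significant bits in [0, 1).
function res53(nextUint32) {
  const a = nextUint32() >>> 5; // top 27 bits
  const b = nextUint32() >>> 6; // top 26 bits
  return (a * 67108864 + b) / 9007199254740992; // (a * 2^26 + b) / 2^53
}

const next = makeUint32Source(12345);
console.log(res53(next)); // some value in [0, 1)
```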
I ran 500 billion invocations of faker.number.float() and it returns [0,1), whereas all the other methods return [0,max]. I also checked MersenneTwister19937.genrandInt32() and it returns [0,2^32) instead of [0,2^32] as described (checked with 225 billion invocations).
It looks like https://github.com/faker-js/faker/pull/1675 might have contributed to this.
- https://github.com/faker-js/faker/pull/1675
Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?
IMO [0,1] is the correct version.
faker.number.int() only returns even values when using very large max values.
This is caused by the float being truncated after 17 places, plus losing a few more bits due to math.
Min ~ 0b00000000000000000000000000001000000000000000000000000
Max ~ 0b11111111111111111111111111111111000000000000000000000
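The even-values observation follows from the precision gap: a genrandReal2-style value is a 32-bit integer x scaled by 1/2^32, so mapping it onto a 2^53-wide range yields exactly x * 2^21, i.e. only multiples of 2^21, all even with 21 trailing zero bits. A small demonstration (my own, independent of faker's code):

```javascript
// (x / 2^32) * 2^53 === x * 2^21 exactly for any 32-bit integer x,
// because both scalings are by exact powers of two; every result is
// therefore a multiple of 2^21 (and in particular even).
for (const x of [1, 5, 1234567, 4294967295]) {
  const scaled = (x / 2 ** 32) * 2 ** 53;
  console.log(scaled === x * 2 ** 21, scaled % 2 === 0); // true true
}
```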
Is [0,1) or [0,1] supposed to be the correct behavior for number.float()?
IMO [0,1] is the correct version.
I think [0, 1) is the correct one. Math.random() also returns values in that range.
I created https://github.com/faker-js/faker/pull/2357 to start working on a fix.
- https://github.com/faker-js/faker/pull/2357
Is [0,1) or [0,1] supposed to be the correct behavior for number.float()? IMO [0,1] is the correct version.
I think [0, 1) is the correct one. Math.random() also returns values in that range.
All our other methods return [min,max] though, and AFAICT faker.number.float({ precision }) does that as well (because it uses int() internally). With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.
With the new PR we could introduce a new option maxExclusive: boolean = false or exclusiveMax though.
I don't like this. Just define what the function does and be done. String.prototype.substring doesn't have an option includeLastCharacter either.
If someone needs that they should open a new feature request then.
Team Decision
We will change mersenne to 53 bit precision to reduce the duplicates and create a separate PR for the float issues.