faker
faker copied to clipboard
feat!: high precision random number generator
Fixes #2355
- #2355
To Fix
- [x] fix/reduce duplicates returned from
faker.number.int()
andfaker.number.float()
- [x] fix faker.number.int() only returning even numbers (31bits out of 53)
- [x] fix faker.number.int(max) is incapable of returning max value for large maxes
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.57%. Comparing base (
0d4cba6
) to head (a4b4203
).
:exclamation: Current head a4b4203 differs from pull request most recent head 998611b. Consider uploading reports for the commit 998611b to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## next #2357 +/- ##
==========================================
- Coverage 99.57% 99.57% -0.01%
==========================================
Files 2846 2815 -31
Lines 249892 249870 -22
Branches 1055 1066 +11
==========================================
- Hits 248830 248804 -26
- Misses 1062 1066 +4
Files | Coverage Δ | |
---|---|---|
src/faker.ts | 100.00% <100.00%> (ø) |
|
src/index.ts | 100.00% <100.00%> (ø) |
|
src/internal/mersenne.ts | 97.58% <100.00%> (+0.14%) |
:arrow_up: |
src/simple-faker.ts | 100.00% <100.00%> (ø) |
Probably also worth checking if genrandRes53CC is significantly slower, or if there are any other downsides to using it.
It's twice as slow.
I think this should be probably considered a breaking change given it causes nearly every snapshot to change. That would also allow for beta testing of the change before 9.0 in case the slower algorithm causes performance problems.
The breaking of the snapshots is caused by next()
now effectively taking 2 values from the seed.
We could reduce this if we dynamically request either 32/53 bits of precision depending on the values in use.
e.g. an array with length 15 doesn't need 53 bits of precision for the index.
I'm not that deep into math, but for some reason using two values from the seed increases the variance of the values. Which increase likelihood of this test to fail: https://github.com/faker-js/faker/actions/runs/6025687970/job/16346941314?pr=2357
I'm not that deep into math, but for some reason using two values from the seed increases the variance of the values. Which increase likelihood of this test to fail: faker-js/faker/actions/runs/6025687970/job/16346941314?pr=2357
Nevermind, I found the error, currently the first and the last element aren't evenly distributed for number.int()
Team Decision
We will change mersenne to 53 bit precision to reduce the duplicates and create a separate PR for the float issues.
Reduced to only cover the most minimal required changes.
Could we still allow access to the original 32-bit Mersenne generator via the new #2195 feature? Then if someone prefers the original implementation (e.g. if speed is more importrant than occasional duplicates) they still have that option.
Could we still allow access to the original 32-bit Mersenne generator via the new #2195 feature? Then if someone prefers the original implementation (e.g. if speed is more importrant than occasional duplicates) they still have that option.
Could you please describe the usecase for/idea behind this proposal? We already have a guess why you suggest that, but we would like to know whether we might have forgotten something.
If we go that route, then this would be blocked by #2284 which is blocked by #2361.
Speed. You said this implementation is twice as slow? So if I'm generating millions of fake names I don't want it to be slower than before.
twice as slow
Why not backing the library by an official prng generator tailored for performance like: pure-rand? It would probably allow the faker team to rather focus on generators themselves while keeping the optimization of random number generators into other libraries (this one is used by Jest). I'm the author of this library
You can also choose to copy the implem from the lib I just shared but it sounds less sustainable. The implem is much faster than the one in faker (from my benchs)
Cc @ST-DDT
@dubzzz A disclaimer that you are a/the maintainer of that library would have been nice.
You can also choose to copy the implem from the lib I just shared but it sounds less sustainable.
Copying a well maintained lib doesn't sound like a good idea to me either.
The implem is much faster than the one in faker (from my benchs)
Does your implementation provide the full 53 bits of precision possible with JS?
A disclaimer that you are a/the maintainer of that library would have been nice.
Sorry, I forgot to underline it. There are many others if you want.
Does your implementation provide the full 53 bits of precision possible with JS?
Not really. The generator itself generates 32bits values in an uniform way. But I do have helpers to generate values in an uniform way on larger ranges (see unsafeUniformIntDistribution
for 53bits and the ones with ArrayInt in the name for above).
A disclaimer that you are a/the maintainer of that library would have been nice.
Sorry, I forgot to underline it. There are many others if you want.
No problem. We will have to discuss the choice of library internally anyway (if we go that route).
While it looks like a very nice library it's also maintained by one person. Given the history of the project Faker deliberately has many maintainers (and currently no dependencies).
@dubzzz Do you have a test checking that the default implementation can return 0x00000000
and 0xffffffff
respectively?
e.g. https://github.com/faker-js/faker/blob/a73653436dfbaf4a005dfc6913ecf0668b134c1d/test/internal/mersenne/twister.spec.ts#L19-L23
EDIT: I'm currently searching for a seed that results in 0 for our implementation.
Min: 3 - Seed: 994350336 Max: * - Seed: 726175548
I don't think so. But I have test checking that it outputs the exact same set of values as the implementation of Mersenne Twister of the C++ standard and other tests. I can check the min/max.
Speaking from my absolutely personal perspective: I've looked through your profile and I like what I see :slightly_smiling_face: pure-rand also do not have sub-dependencies (which is a good thing :+1:) It is purely naively written in TypeScript :sparkles: So the only "bad" thing is that you are a one-man army (https://github.com/dubzzz/pure-rand/graphs/contributors) and Faker and its history ... anyways ... :sweat_smile:
But I have one question: @dubzzz why did you have such a spike suddenly in your downloads?
why did you have such a spike suddenly in your downloads?
@Shinigami92 The spike is due to Jest, they started to use pure-rand internally. At some point in time they have been looking for a library to generate random numbers. So I suggested to rely on pure-rand, as I contributed several times to the project and as they have been using my other library fast-check for quite some time to test some of their internals they went for pure-rand. Actually when I suggested the library I also checked the alternatives in both maintenance level and download rates and the one I found were a bit below.
While it looks like a very nice library it's also maintained by one person. Given the history of the project Faker deliberately has many maintainers (and currently no dependencies).
Speaking from my absolutely personal perspective: I've looked through your profile and I like what I see 🙂
@matthewmayer @Shinigami92 Honestly, it was mostly an idea: choice is definitely fully yours.
For the record, my only query was to have a way to connect an external pnrg into faker as described into https://github.com/faker-js/faker/issues/2195. The initial query is still opened whatever your choice as it may unlock some usages I can have of faker: I want to provide a plugin fast-check (my other lib) to faker, which will be able to use faker, but to do so I'll have to be able to customize the instance of prng being used by faker.
That's said, let me know if I can help.
A disclaimer that you are a/the maintainer of that library would have been nice.
And once again, sorry @ST-DDT. By the way I updated my original message to make it clearer.
#2195 still seems like the best way to handle this. We could have a page in the docs explaining how to plug in pure-rand (or other popular libraries) to make it very easy to use without adding a dependency.
What is left to do here? AFAICT nothing. Is this going to be v8 or v9?
Changing the default random generator should definitely be v9. Allowing opt in to a 53 bit mersenne generator using #2195 could be in v8.
Changing the default random generator should definitely be v9. Allowing opt in to a 53 bit mersenne generator using #2195 could be in v8.
I slightly agree, and I would also prio it like that.
In last team meet we decided that the generator.next()
function should fulfill Math.random()
's [0.0,1.0)
, and IMO this should be documented and added first before moving on and fixing this bug, due to I assume we then would need additional tests here in this PR.
In last team meet we decided that the
generator.next()
function should fulfillMath.random()
's[0.0,1.0)
IIRC we didn't. At least the meeting notes dont contain a decision for that.
⚠️ something horrible came to my mind tonight ⚠️
I would like to kindly ask if we first could prioritize the feature to provide a custom seeder
/ generator
, because Faker is not only used for tests, but also generating e.g. seeding data to a database instance with e.g. faker.string.uuid()
while relying on a constant defined seed value!
So if we would just merge this PR and do not provide the functionality to switch manually back to v8.0 behavior, it would not be possible anymore at all to recreate that seeded data.
So either this PR lands in v9 as a communicated breaking feature or we first merge / prioritize the customizable seeder
/ generator
Sure, lets talk about the specifics in the next team meeting.
Please keep in mind that we currently do not provide a guarantee to generate the same values on updates.
Sure, lets talk about the specifics in the next team meeting.
Please keep in mind that we currently do not provide a guarantee to generate the same values on updates.
While this is true, it mostly just affected localized stuff and not fundamentally changed the ground.
I for example do not rely on generating localized values in production, but use faker.string.uuid
or faker.helpers.arrayElements
with own defined data for populate production database.
The more I think about it, the more I think that my approach of doing so is insecure and I should try to do backup and handmade migrations instead of relying on a seed value 🤔
Agreed. Changing the mersenne implementation causes every faker method to return different values for a fixed seed so it should definitely be considered a breaking change.