faker icon indicating copy to clipboard operation
faker copied to clipboard

feat!: high precision random number generator

Open ST-DDT opened this issue 1 year ago • 32 comments

Fixes #2355

  • #2355

To Fix

  • [x] fix/reduce duplicates returned from faker.number.int() and faker.number.float()
  • [x] fix faker.number.int() only returning even numbers (31bits out of 53)
  • [x] fix faker.number.int(max) is incapable of returning max value for large maxes

ST-DDT avatar Aug 29 '23 21:08 ST-DDT

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 99.57%. Comparing base (0d4cba6) to head (a4b4203).

:exclamation: Current head a4b4203 differs from pull request most recent head 998611b. Consider uploading reports for the commit 998611b to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             next    #2357      +/-   ##
==========================================
- Coverage   99.57%   99.57%   -0.01%     
==========================================
  Files        2846     2815      -31     
  Lines      249892   249870      -22     
  Branches     1055     1066      +11     
==========================================
- Hits       248830   248804      -26     
- Misses       1062     1066       +4     
Files Coverage Δ
src/faker.ts 100.00% <100.00%> (ø)
src/index.ts 100.00% <100.00%> (ø)
src/internal/mersenne.ts 97.58% <100.00%> (+0.14%) :arrow_up:
src/simple-faker.ts 100.00% <100.00%> (ø)

... and 83 files with indirect coverage changes

codecov[bot] avatar Aug 29 '23 21:08 codecov[bot]

Probably also worth checking if genrandRes53CC is significantly slower, or if there are any other downsides to using it.

matthewmayer avatar Aug 30 '23 05:08 matthewmayer

It's twice as slow.

ST-DDT avatar Aug 30 '23 06:08 ST-DDT

I think this should be probably considered a breaking change given it causes nearly every snapshot to change. That would also allow for beta testing of the change before 9.0 in case the slower algorithm causes performance problems.

matthewmayer avatar Aug 30 '23 13:08 matthewmayer

The breaking of the snapshots is caused by next() now effectively taking 2 values from the seed. We could reduce this if we dynamically request either 32/53 bits of precision depending on the values in use. e.g. an array with length 15 doesn't need 53 bits of precision for the index.

ST-DDT avatar Aug 30 '23 13:08 ST-DDT

I'm not that deep into math, but for some reason using two values from the seed increases the variance of the values. Which increase likelihood of this test to fail: https://github.com/faker-js/faker/actions/runs/6025687970/job/16346941314?pr=2357

ST-DDT avatar Aug 30 '23 14:08 ST-DDT

I'm not that deep into math, but for some reason using two values from the seed increases the variance of the values. Which increase likelihood of this test to fail: faker-js/faker/actions/runs/6025687970/job/16346941314?pr=2357

Nevermind, I found the error, currently the first and the last element aren't evenly distributed for number.int()

ST-DDT avatar Aug 30 '23 14:08 ST-DDT

Team Decision

We will change mersenne to 53 bit precision to reduce the duplicates and create a separate PR for the float issues.

ST-DDT avatar Aug 31 '23 16:08 ST-DDT

Reduced to only cover the most minimal required changes.

ST-DDT avatar Sep 01 '23 15:09 ST-DDT

Could we still allow access to the original 32-bit Mersenne generator via the new #2195 feature? Then if someone prefers the original implementation (e.g. if speed is more importrant than occasional duplicates) they still have that option.

matthewmayer avatar Sep 07 '23 01:09 matthewmayer

Could we still allow access to the original 32-bit Mersenne generator via the new #2195 feature? Then if someone prefers the original implementation (e.g. if speed is more importrant than occasional duplicates) they still have that option.

Could you please describe the usecase for/idea behind this proposal? We already have a guess why you suggest that, but we would like to know whether we might have forgotten something.

If we go that route, then this would be blocked by #2284 which is blocked by #2361.

ST-DDT avatar Sep 07 '23 16:09 ST-DDT

Speed. You said this implementation is twice as slow? So if I'm generating millions of fake names I don't want it to be slower than before.

matthewmayer avatar Sep 07 '23 17:09 matthewmayer

twice as slow

Why not backing the library by an official prng generator tailored for performance like: pure-rand? It would probably allow the faker team to rather focus on generators themselves while keeping the optimization of random number generators into other libraries (this one is used by Jest). I'm the author of this library

You can also choose to copy the implem from the lib I just shared but it sounds less sustainable. The implem is much faster than the one in faker (from my benchs)

Cc @ST-DDT

dubzzz avatar Sep 07 '23 17:09 dubzzz

@dubzzz A disclaimer that you are a/the maintainer of that library would have been nice.

You can also choose to copy the implem from the lib I just shared but it sounds less sustainable.

Copying a well maintained lib doesn't sound like a good idea to me either.

The implem is much faster than the one in faker (from my benchs)

Does your implementation provide the full 53 bits of precision possible with JS?

ST-DDT avatar Sep 07 '23 18:09 ST-DDT

A disclaimer that you are a/the maintainer of that library would have been nice.

Sorry, I forgot to underline it. There are many others if you want.

Does your implementation provide the full 53 bits of precision possible with JS?

Not really. The generator itself generates 32bits values in an uniform way. But I do have helpers to generate values in an uniform way on larger ranges (see unsafeUniformIntDistribution for 53bits and the ones with ArrayInt in the name for above).

dubzzz avatar Sep 07 '23 18:09 dubzzz

A disclaimer that you are a/the maintainer of that library would have been nice.

Sorry, I forgot to underline it. There are many others if you want.

No problem. We will have to discuss the choice of library internally anyway (if we go that route).

ST-DDT avatar Sep 07 '23 18:09 ST-DDT

While it looks like a very nice library it's also maintained by one person. Given the history of the project Faker deliberately has many maintainers (and currently no dependencies).

matthewmayer avatar Sep 07 '23 18:09 matthewmayer

@dubzzz Do you have a test checking that the default implementation can return 0x00000000 and 0xffffffffrespectively?

e.g. https://github.com/faker-js/faker/blob/a73653436dfbaf4a005dfc6913ecf0668b134c1d/test/internal/mersenne/twister.spec.ts#L19-L23

EDIT: I'm currently searching for a seed that results in 0 for our implementation.

Min: 3 - Seed: 994350336 Max: * - Seed: 726175548

ST-DDT avatar Sep 07 '23 18:09 ST-DDT

I don't think so. But I have test checking that it outputs the exact same set of values as the implementation of Mersenne Twister of the C++ standard and other tests. I can check the min/max.

dubzzz avatar Sep 07 '23 18:09 dubzzz

Speaking from my absolutely personal perspective: I've looked through your profile and I like what I see :slightly_smiling_face: pure-rand also do not have sub-dependencies (which is a good thing :+1:) It is purely naively written in TypeScript :sparkles: So the only "bad" thing is that you are a one-man army (https://github.com/dubzzz/pure-rand/graphs/contributors) and Faker and its history ... anyways ... :sweat_smile:

But I have one question: @dubzzz why did you have such a spike suddenly in your downloads?

Shinigami92 avatar Sep 07 '23 18:09 Shinigami92

why did you have such a spike suddenly in your downloads?

@Shinigami92 The spike is due to Jest, they started to use pure-rand internally. At some point in time they have been looking for a library to generate random numbers. So I suggested to rely on pure-rand, as I contributed several times to the project and as they have been using my other library fast-check for quite some time to test some of their internals they went for pure-rand. Actually when I suggested the library I also checked the alternatives in both maintenance level and download rates and the one I found were a bit below.

While it looks like a very nice library it's also maintained by one person. Given the history of the project Faker deliberately has many maintainers (and currently no dependencies).

Speaking from my absolutely personal perspective: I've looked through your profile and I like what I see 🙂

@matthewmayer @Shinigami92 Honestly, it was mostly an idea: choice is definitely fully yours.

For the record, my only query was to have a way to connect an external pnrg into faker as described into https://github.com/faker-js/faker/issues/2195. The initial query is still opened whatever your choice as it may unlock some usages I can have of faker: I want to provide a plugin fast-check (my other lib) to faker, which will be able to use faker, but to do so I'll have to be able to customize the instance of prng being used by faker.

That's said, let me know if I can help.

A disclaimer that you are a/the maintainer of that library would have been nice.

And once again, sorry @ST-DDT. By the way I updated my original message to make it clearer.

dubzzz avatar Sep 07 '23 21:09 dubzzz

#2195 still seems like the best way to handle this. We could have a page in the docs explaining how to plug in pure-rand (or other popular libraries) to make it very easy to use without adding a dependency.

matthewmayer avatar Sep 08 '23 10:09 matthewmayer

What is left to do here? AFAICT nothing. Is this going to be v8 or v9?

ST-DDT avatar Sep 13 '23 23:09 ST-DDT

Changing the default random generator should definitely be v9. Allowing opt in to a 53 bit mersenne generator using #2195 could be in v8.

matthewmayer avatar Sep 14 '23 02:09 matthewmayer

Changing the default random generator should definitely be v9. Allowing opt in to a 53 bit mersenne generator using #2195 could be in v8.

I slightly agree, and I would also prio it like that.

In last team meet we decided that the generator.next() function should fulfill Math.random()'s [0.0,1.0), and IMO this should be documented and added first before moving on and fixing this bug, due to I assume we then would need additional tests here in this PR.

Shinigami92 avatar Sep 14 '23 03:09 Shinigami92

In last team meet we decided that the generator.next() function should fulfill Math.random()'s [0.0,1.0)

IIRC we didn't. At least the meeting notes dont contain a decision for that.

ST-DDT avatar Sep 14 '23 06:09 ST-DDT

⚠️ something horrible came to my mind tonight ⚠️

I would like to kindly ask if we first could prioritize the feature to provide a custom seeder / generator, because Faker is not only used for tests, but also generating e.g. seeding data to a database instance with e.g. faker.string.uuid() while relying on a constant defined seed value! So if we would just merge this PR and do not provide the functionality to switch manually back to v8.0 behavior, it would not be possible anymore at all to recreate that seeded data.

So either this PR lands in v9 as a communicated breaking feature or we first merge / prioritize the customizable seeder / generator

Shinigami92 avatar Sep 15 '23 05:09 Shinigami92

Sure, lets talk about the specifics in the next team meeting.

Please keep in mind that we currently do not provide a guarantee to generate the same values on updates.

ST-DDT avatar Sep 15 '23 07:09 ST-DDT

Sure, lets talk about the specifics in the next team meeting.

Please keep in mind that we currently do not provide a guarantee to generate the same values on updates.

While this is true, it mostly just affected localized stuff and not fundamentally changed the ground. I for example do not rely on generating localized values in production, but use faker.string.uuid or faker.helpers.arrayElements with own defined data for populate production database. The more I think about it, the more I think that my approach of doing so is insecure and I should try to do backup and handmade migrations instead of relying on a seed value 🤔

Shinigami92 avatar Sep 15 '23 07:09 Shinigami92

Agreed. Changing the mersenne implementation causes every faker method to return different values for a fixed seed so it should definitely be considered a breaking change.

matthewmayer avatar Sep 15 '23 07:09 matthewmayer