Follow Benford's law for finance amounts.

Open RelatedTitle opened this issue 3 years ago • 16 comments

Clear and concise description of the problem

Currently, the amount's first digits are uniformly distributed (each digit having ~11.11% probability). In the real world, however, this is rarely the case, especially for financial data. This is what Benford's law demonstrates. The actual distribution is as follows.

Digit Probability
1 30.1%
2 17.6%
3 12.5%
4 9.7%
5 7.9%
6 6.7%
7 5.8%
8 5.1%
9 4.6%
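For reference, the percentages above come from the formula P(d) = log10(1 + 1/d). A minimal sketch that reproduces the table (the function name `benfordProbability` is only illustrative and not part of faker):

```typescript
// Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1..9.
// `benfordProbability` is a hypothetical name used only for this illustration.
function benfordProbability(digit: number): number {
  if (!Number.isInteger(digit) || digit < 1 || digit > 9) {
    throw new RangeError("digit must be an integer from 1 to 9");
  }
  return Math.log10(1 + 1 / digit);
}

// Print each digit with its probability as a percentage, one decimal place.
for (let d = 1; d <= 9; d++) {
  console.log(`${d} ${(benfordProbability(d) * 100).toFixed(1)}%`);
}
```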

The current implementation, although more random, is less realistic.

I am planning to submit a PR for this issue if accepted.

Suggested solution

Implementing an RNG function that follows Benford's law would provide more realistic data.
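One possible way to do this, as a rough sketch rather than a finished design: raise 10 to a uniform random exponent and keep the integer part. The name `sampleBenfordDigit` and the `rng` parameter are assumptions made for this example; in faker the seeded random source would presumably be passed in instead of `Math.random`.

```typescript
// Hypothetical helper, not part of faker: draws a leading digit (1..9)
// whose distribution follows Benford's law.
// `rng` must return a uniform value in [0, 1); Math.random is only a stand-in.
function sampleBenfordDigit(rng: () => number = Math.random): number {
  // 10^u is log-uniform on [1, 10), so its integer part equals d with
  // probability log10(d + 1) - log10(d) = log10(1 + 1/d).
  const digit = Math.floor(Math.pow(10, rng()));
  // Guard against floating-point rounding at the edges.
  return Math.min(9, Math.max(1, digit));
}

console.log(sampleBenfordDigit()); // a digit from 1 to 9, with 1 being the most likely
```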

Alternative

No response

Additional context

No response

RelatedTitle avatar Jun 20 '22 07:06 RelatedTitle

I somehow like this idea 🤔 it would also improve finance.amount, as now it's somewhat just a wrapper around datatype.number + a currency symbol. But this would also need good documentation.

Another team member should also accept this proposal and then you can start 🙂

Shinigami92 avatar Jun 20 '22 07:06 Shinigami92

I was wondering: if this were accepted, how should it be approached? Should it be a "datatype" like datatype.number(), a helper function, or something else?

RelatedTitle avatar Jun 20 '22 07:06 RelatedTitle

I also like the idea, but currently I'm not sure where to put it. On one hand I would like to have all numbers behave like this, on the other hand some methods just use the number method for index lookups, so any bias there would be detrimental.

I think datatype.benfordNumber or something similar might be useful. (We might later shift it - along with other methods - to a new and separate number module though). Let's wait a day or two for additional input, so we don't have to refactor the method/code later.

ST-DDT avatar Jun 20 '22 07:06 ST-DDT

I was also thinking, should it have min/max values? There could be an issue, for example, using 2000 as minimum and 5000 as maximum, because it couldn't possibly follow Benford's law that way. Should the params be limited to the number of digits to be generated?
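To illustrate the concern, here is a small sketch that lists which leading digits are reachable at all for a given integer range. The helper `possibleLeadingDigits` is hypothetical and exists only for this example.

```typescript
// Hypothetical helper: lists the leading digits that integers in [min, max] can have.
function possibleLeadingDigits(min: number, max: number): number[] {
  if (!Number.isInteger(min) || !Number.isInteger(max) || min < 1 || max < min) {
    throw new RangeError("expected integers with 1 <= min <= max");
  }
  const digits: number[] = [];
  for (let d = 1; d <= 9; d++) {
    let reachable = false;
    // Numbers with leading digit d at a given scale span [d * scale, (d + 1) * scale - 1].
    for (let scale = 1; scale <= max; scale *= 10) {
      if (d * scale <= max && (d + 1) * scale - 1 >= min) reachable = true;
    }
    if (reachable) digits.push(d);
  }
  return digits;
}

console.log(possibleLeadingDigits(2000, 5000)); // only the digits 2, 3, 4 and 5
console.log(possibleLeadingDigits(1000, 9999)); // all digits 1 through 9
```

With 2000 to 5000, the digits 1 and 6 through 9 never occur, and 5 occurs only for the single value 5000, so the first-digit distribution cannot match Benford's law.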

RelatedTitle avatar Jun 20 '22 07:06 RelatedTitle

I have to think about it some more. min, max, precision vs integerDigits, decimalDigits 🤔

I'm also torn between using an options object or not, especially for the latter variant.

ST-DDT avatar Jun 20 '22 08:06 ST-DDT

Oh I thought we just replace the implementation of finance.amount and change its behaviour 🤔

Shinigami92 avatar Jun 20 '22 11:06 Shinigami92

> I think datatype.benfordNumber or something similar might be useful. (We might later shift it - along with other methods - to a new and separate number module though). Let's wait a day or two for additional input, so we don't have to refactor the method/code later.

We decided/considered somewhere that the datatype module should be used for, well, JavaScript datatypes: number, string, symbol, bigint, object. So I would not put benfordNumber() in there.

In the long run I think a number module is inevitable, but for now I think it would fit best in the helpers module, since it is exactly that for the finance module: a helper.

Edit: I will update this comment when I find the discussion/issue/PR that proves the decision in block 1.

xDivisionByZerox avatar Jun 20 '22 12:06 xDivisionByZerox

> Oh I thought we just replace the implementation of finance.amount and change its behaviour 🤔

If you read the wiki article you will notice that it applies to all real numbers not just finance data. (It's statistics related not money related).

I'm fine with either location. Benford numbers are real numbers and the method would return a valid JS number.

ST-DDT avatar Jun 20 '22 15:06 ST-DDT

In that case, would this replace the current Mersenne Twister RNG? It's just a different approach to a problem we already solved. That would be even more reason to put it in the helpers module, or even in its own module.

xDivisionByZerox avatar Jun 20 '22 15:06 xDivisionByZerox

Just to be clear: I'm not discouraging or blocking this idea. I quite like it TBH. I just want to make sure that the project architecture is respected.

xDivisionByZerox avatar Jun 20 '22 15:06 xDivisionByZerox

No, this has nothing to do with our random data generator/twister. It is only a way to generate numbers. Benford numbers aren't suitable for anything other than being used 'as is'.

ST-DDT avatar Jun 20 '22 15:06 ST-DDT

I created a helper function in my fork (https://github.com/RelatedTitle/faker/commit/8a3392aaa71590838aa2415e6f9342cfbc0a67cf) called benfordNumber() that generates random integers with a given number of digits that follow Benford's law.

I think it's best to have it generate integers with a given # of digits and then have the calling function divide it accordingly instead of having integerDigits and decimalDigits. Especially if this is going to be reused in other places.
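Roughly, the split between the generator and the caller could look like the following. This is an illustrative sketch under assumptions, not the code from the commit linked above; `benfordInteger`, `toAmount`, and the `rng` parameter are made-up names.

```typescript
// Hypothetical sketch, not the code from the linked commit.
// Returns an integer with exactly `digits` digits whose first digit follows Benford's law.
function benfordInteger(digits: number, rng: () => number = Math.random): number {
  if (!Number.isInteger(digits) || digits < 1 || digits > 15) {
    throw new RangeError("digits must be an integer from 1 to 15");
  }
  const scale = Math.pow(10, digits - 1);
  // 10^u is log-uniform on [1, 10); scaling it keeps the leading digit.
  const value = Math.floor(Math.pow(10, rng()) * scale);
  // Clamp in case floating-point rounding reaches the next power of ten.
  return Math.min(value, scale * 10 - 1);
}

// The caller decides where the decimal point goes, e.g. for a finance amount.
function toAmount(value: number, decimalDigits: number): number {
  return value / Math.pow(10, decimalDigits);
}

const raw = benfordInteger(5); // an integer between 10000 and 99999
console.log(toAmount(raw, 2).toFixed(2)); // the same digits with two decimal places
```

Keeping the generator integer-only means it needs a single parameter, and each caller applies its own scaling.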

Here's a comparison between datatype.number() and helpers.benfordNumber() on the distribution of first digits.

10M Iterations

datatype.number():

Digit Probability Count
1 11.1% 1110411
2 11.09% 1108841
3 11.12% 1111828
4 11.11% 1110808
5 11.11% 1111486
6 11.11% 1111429
7 11.11% 1110752
8 11.11% 1111076
9 11.13% 1113260

helpers.benfordNumber():

Digit Probability Count
1 30.99% 3098875
2 17% 1700135
3 13.01% 1300570
4 9.01% 900650
5 8% 799577
6 7% 699698
7 6% 599856
8 5% 499896
9 4.01% 400743

As you can see, datatype.number() is uniformly distributed, while helpers.benfordNumber() follows Benford's law, with each digit deviating by less than 1.5 percentage points from the expected value.
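A comparison like this can be reproduced with a small tally script. The sketch below uses stand-in generators built on `Math.random` instead of the actual faker methods, so the exact figures will differ from the tables above; all names in it are illustrative.

```typescript
// Leading decimal digit of a positive integer.
function firstDigit(n: number): number {
  let value = Math.floor(Math.abs(n));
  while (value >= 10) value = Math.floor(value / 10);
  return value;
}

// Share of each leading digit over `iterations` draws (index 1..9; index 0 is unused).
function firstDigitShares(generate: () => number, iterations: number): number[] {
  const counts: number[] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
  for (let i = 0; i < iterations; i++) counts[firstDigit(generate())]++;
  return counts.map((c) => c / iterations);
}

// Stand-in generators (assumptions, not faker's methods):
const uniform = (): number => 1 + Math.floor(Math.random() * 99999); // uniform on 1..99999
const benford = (): number => Math.floor(Math.pow(10, 5 * Math.random())); // log-uniform on 1..99999

const uniformShares = firstDigitShares(uniform, 1_000_000);
const benfordShares = firstDigitShares(benford, 1_000_000);
for (let d = 1; d <= 9; d++) {
  console.log(`${d} ${(uniformShares[d] * 100).toFixed(2)}% ${(benfordShares[d] * 100).toFixed(2)}%`);
}
```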

I'm not sure if this is the best implementation/most efficient, so any feedback is appreciated. I'm also new to this so I'm not sure if this is the way it's supposed to be done.

RelatedTitle avatar Jun 21 '22 03:06 RelatedTitle

> I also like the idea, but currently I'm not sure where to put it. On one hand I would like to have all numbers behave like this, on the other hand some methods just use the number method for index lookups, so any bias there would be detrimental.
>
> I think datatype.benfordNumber or something similar might be useful. (We might later shift it - along with other methods - to a new and separate number module though). Let's wait a day or two for additional input, so we don't have to refactor the method/code later.

Yeah, I feel the same way. I really like it, just don't know where it would go.

ejcheng avatar Jun 21 '22 03:06 ejcheng

The current implementation looks good to me. Let's go for helpers.benfordNumber().

ST-DDT avatar Jun 21 '22 08:06 ST-DDT

Should I create a PR for it?

RelatedTitle avatar Jun 21 '22 08:06 RelatedTitle

> Should I create a PR for it?

Sure, if you're up for it. I'll assign you to this issue if you want to create a PR.

Edit: Nvm, just saw the PR. Anddd you're already assigned 🤦

ejcheng avatar Jun 22 '22 04:06 ejcheng