
Using 'faker.unique.xx' as provider doesn't ensure uniqueness of the values...

Open azazel75 opened this issue 3 years ago • 3 comments

... due to the use of raw parallelization.

To prove it, run the following:

```python
import parmap
from pganonymizer.providers import FakeProvider

provider = FakeProvider(name='fake.unique.user_name')


def gen_values(qty=100):
    base_values = [f'v{n}' for n in range(qty)]
    parallel_values = parmap.map(provider.alter_value, base_values)
    serial_values = parmap.map(provider.alter_value, base_values, pm_parallel=False)
    return parallel_values, serial_values


pvals, svals = gen_values()

# verify uniqueness
print(len(set(pvals)), len(set(svals)))
```

On my machine, running it gives the following output:

```console
$ python test.py
16 100
```

If uniqueness were preserved, the output would have been `100 100`.

azazel75 avatar Sep 30 '21 10:09 azazel75

Hi, thanks for reporting this issue. I haven't had the time yet to take a look into it.

Disabling the parallel evaluation by setting `pm_parallel` to False, as you did in your fork, would also disable parallelization for the whole import, if I understand the parmap documentation correctly?

hkage avatar Oct 14 '21 13:10 hkage

It seems to me that parmap is used only to parallelize the application of the anonymizations to each row, by splitting the work across multiple threads/processes, and nothing else. But maybe I'm missing something?
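For illustration, that per-row pattern can be sketched with the standard library alone. The `anonymize_row` function below is a hypothetical stand-in for the provider's `alter_value`; this is not pganonymize's actual code:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the same map() API


def anonymize_row(row):
    # Hypothetical per-row transform standing in for provider.alter_value.
    return {**row, "name": "anon_" + str(row["id"])}


rows = [{"id": i, "name": f"real{i}"} for i in range(4)]
with Pool(2) as pool:
    anonymized = pool.map(anonymize_row, rows)  # result order is preserved

print([r["name"] for r in anonymized])  # ['anon_0', 'anon_1', 'anon_2', 'anon_3']
```

Each row is transformed independently, which is exactly why any bookkeeping that lives *inside* the provider (like faker's unique tracking) becomes a problem once the workers are separate processes instead of threads.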

azazel75 avatar Oct 14 '21 13:10 azazel75

Same issue: `unique` is unusable with the parallelization, and after reading the code I come to the same conclusions as @azazel75 about the parmap usage.

How to move forward on this issue?

* add an option to disable parallelism (e.g. `--no-parallel`, `--no-optimization`, etc.)

* remove the parmap usage
  
  * do we need a benchmark to know the impact?

* remove or limit support for `unique` (e.g. a warning in the documentation/code)

fblackburn1 avatar May 01 '24 15:05 fblackburn1

> Same issue: `unique` is unusable with the parallelization, and after reading the code I come to the same conclusions as @azazel75 about the parmap usage.
>
> How to move forward on this issue?
>
> * add an option to disable parallelism (e.g. `--no-parallel`, `--no-optimization`, etc.)
> * remove the parmap usage
>   * do we need a benchmark to know the impact?
> * remove or limit support for `unique` (e.g. a warning in the documentation/code)

Sorry for the late response. I didn't have the time to check whether this is an issue of parmap itself or only of its combination with faker. The parallelization was added as a contribution in response to a feature request, so I guess removing it would not be an option for users who have to anonymize massive amounts of data.

If it turns out that uniqueness is not guaranteed with parmap, I would tend to disable it with an option flag or remove the support for `unique`.

But I am open to help (maybe there is a fix to get uniqueness with parmap?) and/or other suggestions.

hkage avatar May 31 '24 14:05 hkage

> Sorry for the late response. I didn't have the time to check whether this is an issue of parmap itself or only of its combination with faker. The parallelization was added as a contribution in response to a feature request, so I guess removing it would not be an option for users who have to anonymize massive amounts of data.

It's not an issue of parmap itself, but of its combination with faker. To keep uniqueness, Faker needs to track which values have already been used, and the same Faker instance needs to be used throughout. But when running multiple processes with parmap, the Faker instance is recreated in each process. Even if we pass a Faker instance as a parmap argument, the list of used values won't be updated across processes. To make `unique` work with multiple processes, we would need to share state between the processes, which is not trivial (and may impact performance).
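The effect can be simulated without faker or parmap at all. Below, a hypothetical `UniqueNameProvider` mimics faker's unique proxy with a per-instance counter; copying it per worker, as forked processes effectively do, destroys global uniqueness even though each copy is internally consistent:

```python
class UniqueNameProvider:
    """Mimics faker.unique: values never repeat within a single instance."""

    def __init__(self):
        self._counter = 0

    def user_name(self):
        self._counter += 1
        return f"user{self._counter}"


# Serial: one instance handles every row, so all values are unique.
serial = UniqueNameProvider()
serial_values = [serial.user_name() for _ in range(8)]

# Parallel (simulated): each worker process holds its own copy of the
# provider with its own independent counter, so values collide.
workers = [UniqueNameProvider() for _ in range(4)]
parallel_values = [w.user_name() for w in workers for _ in range(2)]

print(len(set(serial_values)), len(set(parallel_values)))  # 8 2
```

This is the same shape of result as the `16 100` output in the original report: the serial run stays unique, the per-worker copies do not.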

> If it turns out that uniqueness is not guaranteed with parmap, I would tend to disable it with an option flag or remove the support for `unique`.

+1 for adding an option to enable/disable multiprocessing. I know it is a breaking change, but what do you think about disabling multiprocessing by default and adding an option to enable it (with a warning about `unique`)? IMHO multiprocessing is a more specialized use case than the `unique` feature.

Tell me your preference and I'll implement the flag.
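A minimal sketch of what such an opt-in flag could look like; the flag name `--parallel` and its wiring are assumptions for illustration, not part of the current CLI:

```python
import argparse

parser = argparse.ArgumentParser(prog="pganonymize")
parser.add_argument(
    "--parallel",
    action="store_true",
    default=False,  # multiprocessing stays off unless explicitly requested
    help="enable multi-process anonymization "
         "(warning: 'unique' providers may produce duplicate values)",
)

args = parser.parse_args([])  # default invocation: parallelism disabled
print(args.parallel)          # False

args = parser.parse_args(["--parallel"])
print(args.parallel)          # True
```

The resulting `args.parallel` could then be forwarded as parmap's `pm_parallel` keyword, so the serial path becomes the default and `unique` works out of the box.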

fblackburn1 avatar Jun 04 '24 19:06 fblackburn1