wp-migrate-db-anonymization

Generated fake data not unique for larger databases

Open tyrann0us opened this issue 5 years ago • 1 comments

We anonymize MailPoet subscribers like this:

// Callback for `wpmdb_anonymization_config` filter.
$config['mailpoet_subscribers'] = [
  'first_name' => [
    'fake_data_type' => 'firstName',
  ],
  'last_name' => [
    'fake_data_type' => 'lastName',
  ],
  'email' => [
    'fake_data_type' => 'email',
  ],
  'subscribed_ip' => [
    'fake_data_type' => 'ipv4',
  ],
  'confirmed_ip' => [
    'fake_data_type' => 'ipv4',
  ],
  'unconfirmed_data' => [
    'fake_data_type' => 'randomLetter', // Random data type, could be anything.
    'post_process_function' => '__return_null',
  ],
];
return $config;

(Posted full snippet, but only email is relevant here.)

It works for small subscriber lists. For roughly 6,500 subscribers or more, however, Faker no longer seems to generate unique email addresses. Since MailPoet requires the email column to be UNIQUE, importing an anonymized database fails with the error Duplicate entry '[email protected]' for key 'email'. According to the error messages, Faker generates about 10 email addresses twice for the ~6,500 rows.

So I checked Faker's modifiers (https://github.com/fzaninotto/Faker/#modifiers) and, for testing, changed https://github.com/deliciousbrains/wp-migrate-db-anonymization/blob/cca4ad85ca9d73b5bb2ae76347a0b6d3f3c5913d/includes/Config/Rule.php#L136 to

if ($this->fake_data_type === 'email') {
  $data = $faker->unique()->{$this->fake_data_type}($args);
} else {
  $data = $faker->{$this->fake_data_type}($args);
}

(it could be simplified by always using unique() but I'm not sure if this might have unwanted side effects):

$data = $faker->unique()->{$this->fake_data_type}($args);

That reduced the number of duplicates, but there were still some. Even updating Faker to v1.8 (which introduced more German email providers, https://github.com/fzaninotto/Faker/pull/1320, see #25) did not solve it (and is no solution for other languages). And even if one export ran without creating duplicates, we couldn't be sure that it would still work for larger tables.

I'm not entirely sure why Faker still creates duplicates despite the unique() call, but I think it might be related to the fact that the plugin is bootstrapped every time admin-ajax.php is called, which also reinitializes Faker (and with it the values unique() has already seen) every time. If that's true, I have no idea how to deal with it. @polevaultweb, do you?
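To illustrate the hypothesis, here is a minimal sketch that does not use Faker itself. UniqueTracker is a hypothetical stand-in for any uniqueness filter that keeps its "already seen" list on the object, and simulate_request() is a hypothetical model of one admin-ajax.php call; the tiny pool of five addresses is only there to make the collisions visible.

```php
<?php
/**
 * Hypothetical stand-in for a per-instance uniqueness filter:
 * it only remembers the values handed out by this one object.
 */
final class UniqueTracker
{
    private $seen = [];
    private $generate;
    private $maxRetries;

    public function __construct(callable $generate, int $maxRetries = 10000)
    {
        $this->generate = $generate;
        $this->maxRetries = $maxRetries;
    }

    public function next(): string
    {
        for ($i = 0; $i < $this->maxRetries; $i++) {
            $value = (string) ($this->generate)();
            if (!isset($this->seen[$value])) {
                $this->seen[$value] = true;
                return $value;
            }
        }
        throw new OverflowException('Ran out of unique values.');
    }
}

/** Models one request: a fresh tracker with no memory of earlier requests. */
function simulate_request(int $rows, int $poolSize): array
{
    $tracker = new UniqueTracker(function () use ($poolSize) {
        return 'user' . mt_rand(1, $poolSize) . '@example.com';
    });

    $batch = [];
    for ($i = 0; $i < $rows; $i++) {
        $batch[] = $tracker->next();
    }
    return $batch;
}

$first = simulate_request(4, 5);
$second = simulate_request(4, 5);

// Each batch is unique on its own, but the two batches overlap.
$duplicates = array_intersect($first, $second);
echo count($duplicates) . " address(es) repeated across the two requests\n";
```

Each batch is free of duplicates, yet values repeat between batches because the second tracker starts empty. If the plugin really does process a table in several requests, the same effect would explain why unique() reduces the duplicates without removing them.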

Thanks!

tyrann0us, May 24 '19 10:05

The way this problem - in general, not just for emails - is handled by Anonimatron is to maintain a list of synonyms in a separate file. The synonyms file consists of a mapping from input production data to anonymized output data. This synonyms file should be treated as sensitive production data.

The big advantage of the synonyms file is that it allows consistency across tables, and it allows one to keep the same anonymized test names across multiple anonymization runs, which can be very helpful for a non-production QA team.

It'd be a simple check to see whether the generated email is already in the synonyms file. So an inelegant fix would be: if the generated email is in the synonyms file, rerun Faker until it comes up with a unique email that is not in the synonyms file.
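A rough sketch of that retry idea, under stated assumptions: SynonymStore, its JSON file layout and the retry limit are all hypothetical, and this is neither how Anonimatron stores its synonyms nor anything the plugin currently does. The generator is passed in as a plain callable, so the sketch runs without Faker.

```php
<?php
/** Hypothetical synonyms store: maps original values to anonymized ones. */
final class SynonymStore
{
    /** @var array<string,string> original value => anonymized value */
    private $map = [];
    /** @var array<string,bool> anonymized values already handed out */
    private $used = [];
    private $path;

    public function __construct(string $path)
    {
        $this->path = $path;
        if (is_file($path)) {
            $decoded = json_decode((string) file_get_contents($path), true);
            $this->map = is_array($decoded) ? $decoded : [];
            $this->used = array_fill_keys(array_values($this->map), true);
        }
    }

    public function anonymize(string $original, callable $generate, int $maxRetries = 1000): string
    {
        // Known input: reuse its synonym so results stay consistent across runs.
        if (isset($this->map[$original])) {
            return $this->map[$original];
        }
        // New input: regenerate until the candidate has not been used before.
        for ($i = 0; $i < $maxRetries; $i++) {
            $candidate = (string) $generate();
            if (!isset($this->used[$candidate])) {
                $this->map[$original] = $candidate;
                $this->used[$candidate] = true;
                return $candidate;
            }
        }
        throw new OverflowException("No unused value found after {$maxRetries} attempts.");
    }

    /** Persists the mapping; the file contains production data and is sensitive. */
    public function save(): void
    {
        file_put_contents($this->path, json_encode($this->map, JSON_PRETTY_PRINT));
    }
}
```

With Faker, the callable could be something like function () use ($faker) { return $faker->email; }. Because the mapping is written to disk, it would survive between requests as long as each request loads the file first and calls save() at the end.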

pajtai, Apr 13 '20 13:04