differential-privacy-library

Seedable models

Open wouterz opened this issue 4 years ago • 3 comments

Currently, self._rng = secrets.SystemRandom() is used to set the randomness for DPMechanisms.

https://github.com/IBM/differential-privacy-library/blob/a889ba0f8d19c77e2b0369451ebc392969fac685/diffprivlib/mechanisms/base.py#L77

I understand that you don't want to set a seed for real models, but would adding a seed parameter for reproducibility be an option?

wouterz avatar Dec 03 '20 12:12 wouterz

This is something we've been looking to implement, but there are a number of knock-on effects that we need to be wary of. Implementing seeding on mechanisms would be relatively straightforward, but special attention is required when seeding models or tools, particularly those that instantiate more than one mechanism.

Using the same seed to instantiate multiple mechanisms may result in correlated noise (e.g., for a sum() over a particular axis), which will break differential privacy. Enabling reproducibility at the cost of differential privacy is undesirable, especially given the potential for unwitting misuse.
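To make the correlated-noise concern concrete, here is a minimal sketch using only the standard library. It is not diffprivlib code: `laplace_noise`, the salary figures, and the scale are all made up for illustration. Two generators built from the same seed produce identical noise, so subtracting the two noisy outputs cancels the noise and reveals the exact difference between the true values.

```python
import random


def laplace_noise(rng, scale):
    # Hypothetical helper: the difference of two i.i.d. exponential draws
    # with mean `scale` follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


SEED = 1234  # the same seed is (wrongly) reused for both "mechanisms"
rng_a = random.Random(SEED)
rng_b = random.Random(SEED)

# Made-up private values and noise scale.
salary_alice, salary_bob = 52_000.0, 61_000.0
scale = 1_000.0

noise_a = laplace_noise(rng_a, scale)
noise_b = laplace_noise(rng_b, scale)

noisy_alice = salary_alice + noise_a
noisy_bob = salary_bob + noise_b

# The noise terms are identical, so they cancel in the difference.
leaked_gap = noisy_bob - noisy_alice
print(noise_a == noise_b)
print(leaked_gap)
```

Each noisy value on its own still looks protected; it is only the combination of outputs that leaks, which is why this kind of misuse is easy to miss.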

At present there is no easy way to communicate between different mechanisms initialised in a given function, so it may require a globally-accessible seed generator. If you have any ideas on this, your thoughts would be greatly appreciated.
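One possible shape for such a seed generator, sketched with the standard library only. `SeedDispenser` and `next_rng` are hypothetical names, not part of diffprivlib, and the hash-based derivation is just one assumed way of giving each mechanism its own seed from a single root seed. Note that `random.Random` is a Mersenne Twister and is not cryptographically secure, so this would only suit testing and reproduction, not real deployments.

```python
import hashlib
import random


class SeedDispenser:
    """Hypothetical helper that hands out a distinct, reproducible RNG
    to each mechanism, all derived from one root seed."""

    def __init__(self, root_seed):
        self._root = int(root_seed)
        self._count = 0

    def next_rng(self):
        # Hash the root seed together with a running counter so that every
        # call yields a different child seed, in a reproducible order.
        material = f"{self._root}:{self._count}".encode()
        self._count += 1
        child_seed = int.from_bytes(hashlib.sha256(material).digest(), "big")
        return random.Random(child_seed)


dispenser = SeedDispenser(1234)
rng_a = dispenser.next_rng()  # would be handed to the first mechanism
rng_b = dispenser.next_rng()  # would be handed to the second mechanism

draw_a = rng_a.random()
draw_b = rng_b.random()

# Re-creating the dispenser with the same root seed replays the same streams.
replay_a = SeedDispenser(1234).next_rng().random()
```

The reproducibility here depends on mechanisms requesting their generators in the same order on every run, which is one of the coordination problems a shared seed generator would have to handle.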

Stack Exchange has an interesting discussion on this.

naoise-h avatar Dec 03 '20 14:12 naoise-h

I'm not sure about the Python/NumPy internals, but I'd think that when no seed is set explicitly, a random number is already used as the seed. So effects such as correlated noise would also be an issue when not setting a seed, except that they would not be repeatable.

Finding a way to truly independently seed each DP-mechanism sounds like an improvement over the current seeding mechanism, but might not be comparable to the current implementation?

wouterz avatar Dec 04 '20 09:12 wouterz

Yes, it's also my understanding that RNGs are seeded at random when a seed isn't provided (for example, from operating-system entropy or, failing that, the system time). In that case we could only get correlation by chance, whereas seeding many different RNGs with the same seed would always result in correlation.

Additionally, it seems that the RNG given by the secrets module of Python does not support seeding (despite accepting a seed as a parameter). It looks like it will be necessary to fall back to a different RNG (NumPy perhaps) when seeding, which presents another difficulty: a seeded model would not be directly representative of a non-seeded model. Sadly, NumPy doesn't provide any cryptographically secure RNGs (as desired for DP).
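A quick way to see the seeding behaviour, using only the standard library: `secrets.SystemRandom` accepts a seed argument but ignores it, because it draws from the operating system's entropy source, while `random.Random` honours the seed. The seed value 42 is arbitrary.

```python
import random
import secrets

# random.Random is deterministic for a given seed.
seeded_a = random.Random(42).getrandbits(128)
seeded_b = random.Random(42).getrandbits(128)

# secrets.SystemRandom accepts the argument but ignores it.
system_a = secrets.SystemRandom(42).getrandbits(128)
system_b = secrets.SystemRandom(42).getrandbits(128)

print(seeded_a == seeded_b)  # True: same seed, same output
print(system_a == system_b)  # False, barring a 1-in-2**128 coincidence
```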

naoise-h avatar Dec 04 '20 16:12 naoise-h