differential-privacy-library
Handling of NaNs
Currently, the presence of `NaN`s in a dataset produces a distinguishing event, as for most functions (other than `nanmean`, `nanvar`, `nanstd`) the output will always be `NaN`. Is the best solution for all functions to just ignore `NaN`s (like `nanmean`, etc)?
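To illustrate the underlying behaviour, here is a minimal sketch using plain NumPy rather than the library's own functions (the differentially private versions add noise on top, so they are not shown). A single `NaN` forces the standard mean to `NaN`, while the `nan`-variant skips it:

```python
import numpy as np

data = np.array([1.0, np.nan, 3.0])

# A single NaN propagates through the sum, so the result is NaN
# regardless of the other values: this is the distinguishing event.
plain_mean = np.mean(data)

# The nan-variant ignores NaN entries and averages the remaining values.
nan_mean = np.nanmean(data)

print(plain_mean, nan_mean)
```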
For single-dimensional problems, there should be no issue, as removing NaNs is a simple deterministic pre-processing step. For multi-dimensional problems, removing data rows may prove problematic for utility. Is there justification here for doing something fancier, like mapping to a value within the range?
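A rough sketch of the two options for multi-dimensional data, in plain NumPy. The bounds and the choice of the midpoint as the fill value are assumptions made only for illustration, not a proposal for what the library should do:

```python
import numpy as np

X = np.array([
    [0.2, 0.5],
    [np.nan, 0.9],
    [0.7, np.nan],
    [0.4, 0.1],
])

# Assumed data-independent bounds for every feature.
lower, upper = 0.0, 1.0

# Option 1: drop every row that contains at least one NaN.
# Half of the rows are lost here, along with their valid entries.
keep = ~np.isnan(X).any(axis=1)
X_dropped = X[keep]

# Option 2: map each NaN to a fixed value inside the range
# (the midpoint is a hypothetical choice), keeping all rows.
fill_value = (lower + upper) / 2
X_mapped = np.where(np.isnan(X), fill_value, X)

print(X_dropped.shape, X_mapped.shape)
```

In both cases the pre-processing depends only on whether an entry is `NaN` and on the fixed bounds, not on the other rows in the dataset.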
`inf` may also need special consideration, although it can usually be handled when clipping the data, as `inf` will clip to the upper bound, and `-inf` to the lower bound. `NaN` has no obvious value to map to. When the algorithm requires the norm of a row to be clipped (like `LogisticRegression`), mapping from `inf` to a value is no longer trivial. Do we map `inf` to a value that ensures the row's norm matches the clip, or do we also scale the rest of the row?
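The difference between the two cases can be sketched in plain NumPy. The bounds, `max_norm`, and both candidate mappings are assumptions for illustration only, and the norm example assumes exactly one infinite entry in the row:

```python
import numpy as np

# Element-wise clipping: +/-inf land on the bounds, NaN stays NaN.
lower, upper = 0.0, 1.0
values = np.array([-np.inf, 0.3, np.inf, np.nan])
clipped = np.clip(values, lower, upper)

# Norm clipping: naive rescaling breaks when the norm is infinite.
row = np.array([0.5, np.inf, 0.5])
max_norm = 1.0
norm = np.linalg.norm(row)  # inf
with np.errstate(invalid="ignore"):
    naive = row * (max_norm / norm)  # inf * 0 gives NaN

is_inf = np.isinf(row)

# Candidate A: leave the finite entries unchanged and give the infinite
# entry whatever magnitude makes the row's norm equal to max_norm.
finite_part = np.where(is_inf, 0.0, row)
residual = np.sqrt(max(max_norm**2 - np.sum(finite_part**2), 0.0))
candidate_a = np.where(is_inf, np.sign(row) * residual, row)

# Candidate B: scale the rest of the row as well. In the limit of the
# entry growing without bound, the finite entries shrink to zero.
candidate_b = np.where(is_inf, np.sign(row) * max_norm, 0.0)

print(clipped, naive, candidate_a, candidate_b)
```

Candidate A is only well defined when the finite entries alone fit within `max_norm`; otherwise the residual is zero and the row still needs scaling.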