differential-privacy-library

Handling of NaNs

Open naoise-h opened this issue 1 year ago • 0 comments

Currently, a NaN in a dataset produces a distinguishing event: for most functions (other than nanmean, nanvar and nanstd), the output is always NaN whenever the input contains one. Is the best solution for every function to simply ignore NaNs, as nanmean etc. do?
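To make the problem concrete, here is a minimal sketch using plain NumPy rather than the library's own functions. The data and bounds are made up for illustration. It shows that clipping does not remove a NaN, so an ordinary mean still returns NaN, while the NaN-ignoring variant does not.

```python
import numpy as np

# Illustrative data with one missing value; the bounds are assumed to be public.
data = np.array([0.2, 0.5, np.nan, 0.9])
lower, upper = 0.0, 1.0

# Clipping leaves NaN untouched, so it reaches the aggregate.
clipped = np.clip(data, lower, upper)

plain_mean = np.mean(clipped)    # NaN: one bad record reveals its presence
nan_mean = np.nanmean(clipped)   # computed over the three finite values only

print(plain_mean, nan_mean)
```

Adding noise to `plain_mean` would not help, since NaN plus any noise is still NaN.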

For single-dimensional problems, there should be no issue, as removing NaNs is a simple deterministic pre-processing step. For multi-dimensional problems, removing whole rows may prove problematic for utility, since a single NaN discards every other feature in that row. Is there justification here for doing something fancier, like mapping NaN to a value within the range?
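The two options for multi-dimensional data could look roughly like this in NumPy. This is only a sketch: the per-feature bounds are assumed to be known in advance, and the choice of the midpoint as the replacement value is a hypothetical example of "a value within the range", not something the library does.

```python
import numpy as np

# Illustrative 2-feature dataset; two of the four rows contain a NaN.
X = np.array([
    [0.1, 0.4],
    [np.nan, 0.8],
    [0.7, np.nan],
    [0.3, 0.6],
])

# Assumed public, data-independent bounds for each feature.
lower = np.array([0.0, 0.0])
upper = np.array([1.0, 1.0])

# Option 1: drop every row that has at least one NaN.
keep = ~np.isnan(X).any(axis=1)
X_dropped = X[keep]              # half of the rows are lost here

# Option 2 (hypothetical): map each NaN to the midpoint of its feature's range.
midpoint = (lower + upper) / 2
X_mapped = np.where(np.isnan(X), midpoint, X)

print(X_dropped.shape, X_mapped.shape)
```

Because the midpoint depends only on the bounds and not on the data, the mapping is a per-row deterministic step, just like dropping rows.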

inf may also need special consideration, although it can usually be handled by clipping the data, as inf will clip to the upper bound and -inf to the lower bound. NaN has no obvious value to map to. When the algorithm requires the norm of a row to be clipped (as in LogisticRegression), mapping inf to a value is no longer trivial. Do we map inf to a value that ensures the row's norm matches the clip, or do we also scale the rest of the row?
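A rough NumPy sketch of why norm clipping is harder than per-feature clipping, followed by one possible reading of each of the two options. All three helper functions (`clip_norm`, `map_inf_keep_rest`, `map_inf_scale_rest`) are hypothetical names written for this illustration, not part of the library, and they assume the row contains no NaN.

```python
import numpy as np

# Per-feature clipping: infinities simply land on the bounds.
values = np.array([np.inf, -np.inf, 0.5])
clipped = np.clip(values, 0.0, 1.0)

def clip_norm(row, max_norm):
    """Naive norm clipping: rescale the row if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(row)
    if norm <= max_norm:
        return row
    with np.errstate(invalid="ignore"):
        return row * (max_norm / norm)

def map_inf_keep_rest(row, max_norm):
    """Reading 1: keep finite entries, give the inf entries the leftover norm."""
    inf_mask = np.isinf(row)
    rest_sq = np.sum(row[~inf_mask] ** 2)
    leftover = max(max_norm ** 2 - rest_sq, 0.0)
    out = row.copy()
    out[inf_mask] = np.sign(row[inf_mask]) * np.sqrt(leftover / inf_mask.sum())
    # If the finite part alone already exceeds max_norm, it still needs scaling.
    return clip_norm(out, max_norm)

def map_inf_scale_rest(row, max_norm):
    """Reading 2: take the limit of rescaling, so finite entries shrink to 0."""
    inf_mask = np.isinf(row)
    out = np.zeros_like(row)
    out[inf_mask] = np.sign(row[inf_mask]) * max_norm / np.sqrt(inf_mask.sum())
    return out

row = np.array([np.inf, 1.0])
naive = clip_norm(row, 2.0)            # inf * 0 gives NaN in the first entry
kept = map_inf_keep_rest(row, 2.0)     # second entry stays (close to) 1.0
scaled = map_inf_scale_rest(row, 2.0)  # second entry becomes 0.0

print(clipped, naive, kept, scaled)
```

Both readings give a row whose norm matches the clip, but they disagree on how much of the finite data survives, which is the trade-off the question is asking about.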

naoise-h · Mar 20 '23 12:03