smartnoise-core

Assumptions about what information is already public

Open Shoeboxam opened this issue 4 years ago • 1 comment

We still make the following assumptions about the data:

  • ~~the number of columns is known~~
  • ~~the number of columns is nonzero~~
  • the number of rows is nonzero
  • ~~column names~~
  • the csv is well-formed
  • the path to the csv is public
  • the existence of a file at the path is public
  • ~~the dimensionality (number of axes) of the original dataset is known~~
  • the user properly labels private data as private
  • the user properly indicates if the first row of the dataset is a header
  • for the rust runtime specifically: the data fits in memory (std lib panics upon failed alloc)
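
For the last item, one possible direction is fallible allocation, so that an oversized dataset surfaces as an error value instead of taking the process down. The sketch below is only an illustration using the standard library's `Vec::try_reserve_exact`; the helper names `try_copy_column` and `try_reserve_rows` are hypothetical and are not part of smartnoise-core.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper (not from smartnoise-core): copy a column into a
/// freshly allocated buffer, returning an error if the allocation fails.
fn try_copy_column(values: &[f64]) -> Result<Vec<f64>, TryReserveError> {
    let mut buffer: Vec<f64> = Vec::new();
    // try_reserve_exact reports a failed allocation as Err instead of aborting.
    buffer.try_reserve_exact(values.len())?;
    // Capacity is already reserved, so this does not allocate again.
    buffer.extend_from_slice(values);
    Ok(buffer)
}

/// Hypothetical helper: try to reserve room for `n` f64 values up front.
fn try_reserve_rows(n: usize) -> Result<Vec<f64>, TryReserveError> {
    let mut buffer: Vec<f64> = Vec::new();
    buffer.try_reserve_exact(n)?;
    Ok(buffer)
}
```

Whether the caller sees an error or a result is itself observable, so this would move the assumption rather than remove it.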

In-progress:

  • computation in the analysis won't overflow/underflow
  • elapsed computation time is not public
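
For the overflow item, a minimal sketch of what guarded integer arithmetic could look like in Rust, using the standard `checked_add` and `saturating_add` methods. The function names are hypothetical, and this is not necessarily how the library handles it.

```rust
/// Hypothetical helper: sum integers, returning None as soon as the
/// running total would overflow i64.
fn checked_sum(values: &[i64]) -> Option<i64> {
    values
        .iter()
        .try_fold(0i64, |acc, &v| acc.checked_add(v))
}

/// Hypothetical alternative: clamp the running total at the i64 bounds
/// instead of failing, so the function always returns a value.
fn saturating_sum(values: &[i64]) -> i64 {
    values
        .iter()
        .fold(0i64, |acc, &v| acc.saturating_add(v))
}
```

The saturating variant always returns a value, which avoids a data-dependent failure, but the clamped result depends on the order of the rows.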

While none of these are explicitly published, they can be learned through side channels.

~~The number of columns assumption can potentially be alleviated by generating empty string columns. When an empty string column is cast to numerics, the entire column will be imputed.~~ Done.

I'm not sure how to handle the nonzero number of rows assumption. I have adjusted the resize component to only accept N > 0, but not all statistics need to know N. Perhaps components that don't need to know N could take a default parameter that is always output with some probability? The problem can appear in subtle ways, for example when computing the median of data filtered by a very specific mask that may or may not match one individual. EDIT: there is now protection against this with the is_not_empty property, but we still assume that this property is true in materialize.
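
To make the zero-row case concrete, here is a rough sketch of a median that falls back to a caller-supplied default when the filtered data is empty. `median_or_default` is a hypothetical name, not a smartnoise-core component, and this simplified version returns the default deterministically rather than "with some probability". A real release would still need noise on top.

```rust
/// Hypothetical sketch: median of `data`, or `default` if no rows remain.
/// NaN values are dropped before sorting (an assumption made for this example).
fn median_or_default(data: &[f64], default: f64) -> f64 {
    let mut sorted: Vec<f64> = data.iter().copied().filter(|v| !v.is_nan()).collect();
    if sorted.is_empty() {
        // No rows matched the mask: return the public default instead of failing.
        return default;
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle values.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}
```

With a mask that matches nobody, the call takes the same code path as far as the caller can tell and yields the default, so an empty result does not show up as an error.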

Shoeboxam, Mar 29 '20 22:03

We can have assumptions. We just need to make them clear. The zero-row case could be an edge case we want to think about more.

tercer, Mar 30 '20 05:03