pg_diffix icon indicating copy to clipboard operation
pg_diffix copied to clipboard

Come up with a rename to clarify `aid instance` vs `aid`

Open pdobacz opened this issue 3 years ago • 6 comments

Related to discussions in the PR here and in Slack.

In the code, it is often not clear if aid means aid instance (so a column or expression) or a value ("instance" :wink:) of aid instance, being something derived from an AID value. (But it's not really the AID value anymore, because it may have been hashed!)

Here is one example of the former: https://github.com/diffix/pg_diffix/blob/master/src/aggregation/count.c#L59 Here is another: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/query/oid_cache.h#L14

Here is an example of the latter: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/aggregation/contribution_tracker.h#L40

pdobacz avatar Jan 05 '22 11:01 pdobacz

I think the term instance hinders more than it helps. We have AID columns and AID values. We should differentiate between them if the defining scope allows for both to exist at the same time. Otherwise, using aid generically and taking context into account is mostly fine IMO (we can't put all the relevant information into variable names anyway). As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.

cristianberneanu avatar Jan 05 '22 13:01 cristianberneanu

As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.

I have been thinking about this: If you're open to reconsider aid_hash, it would be a option which could potentially help solve this. Arguments:

  1. We do use it in reference, e.g.: (type AidCountState = { AidContributions: Dictionary<AidHash, float>; mutable UnaccountedFor: int64 })
  2. It is relevant insofar as we know "is safe to use in subsequent steps of anonymization, like seeding", as opposed to unhashed AID value.

pdobacz avatar Jan 06 '22 10:01 pdobacz

There is no way to get an unsafe AID value because it's literally impossible without the type adapter. A postgres Datum is an opaque value unless we have type info, which is handled by the AidSpec.

edongashi avatar Jan 06 '22 11:01 edongashi

There is no way to get an unsafe AID value

wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety

pdobacz avatar Jan 06 '22 11:01 pdobacz

ah, ok I see what you mean now. nvm

pdobacz avatar Jan 06 '22 11:01 pdobacz

wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety

We need to squash various types into an aid_t. For text we hash to get a number, for numbers we hash to mix distribution (IDs tend to be sequential, using only the lower end of bits). For debug we leave it off in case we want to investigate what entity is contributing to what.

edongashi avatar Jan 06 '22 11:01 edongashi