pg_diffix
pg_diffix copied to clipboard
Come up with a rename to clarify `aid instance` vs `aid`
Related to discussions in the PR here and in Slack.
In the code, it is often not clear if aid means aid instance (so a column or expression) or a value ("instance" :wink:) of aid instance, being something derived from an AID value. (But it's not really the AID value anymore, because it may have been hashed!)
Here is one example of the former: https://github.com/diffix/pg_diffix/blob/master/src/aggregation/count.c#L59 Here is another: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/query/oid_cache.h#L14
Here is an example of the latter: https://github.com/diffix/pg_diffix/blob/master/pg_diffix/aggregation/contribution_tracker.h#L40
I think the term instance hinders more than it helps. We have AID columns and AID values.
We should differentiate between them if the defining scope allows for both to exist at the same time.
Otherwise, using aid generically and taking context into account is mostly fine IMO (we can't put all the relevant information into variable names anyway).
As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.
As Edon said, the hash part is not relevant after initialization anymore, so we can leave it out of the name.
I have been thinking about this: If you're open to reconsider aid_hash, it would be a option which could potentially help solve this. Arguments:
- We do use it in
reference, e.g.: (type AidCountState = { AidContributions: Dictionary<AidHash, float>; mutable UnaccountedFor: int64 }) - It is relevant insofar as we know "is safe to use in subsequent steps of anonymization, like seeding", as opposed to unhashed AID value.
There is no way to get an unsafe AID value because it's literally impossible without the type adapter. A postgres Datum is an opaque value unless we have type info, which is handled by the AidSpec.
There is no way to get an unsafe AID value
wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety
ah, ok I see what you mean now. nvm
wait, so what is the hashing for? isn't it there to make the seed pseudo-random? we're turning of hashing only in DEBUG, so I assumed that it is there for some sort of safety
We need to squash various types into an aid_t. For text we hash to get a number, for numbers we hash to mix distribution (IDs tend to be sequential, using only the lower end of bits). For debug we leave it off in case we want to investigate what entity is contributing to what.