telemetry Handle category-filtered fields

This was originally mentioned in #59 and briefly discussed during jupyter-server meeting https://github.com/jupyter-server/team-compass/issues/4#issuecomment-830405683.

In summary, there are 2 decisions to make:

How exactly do we "filter" the fields? 2 options being discussed are

Set the property to null, equivalent to removing the data and keep the key
Delete the property, i.e. removing both the key and the data

Do we inform users of the fields being filtered?

How to filter properties

Currently fields that are filtered out due to not having all of their categories allowed are set to null instead of being deleted. The rationale was to preserve the "shape" of the data.

During the meeting, it was proposed by some to just delete the fields. The "shape" of the data is not totally preserved in the case of nested fields where the parent properties are filtered. For example

user:
  email: [email protected]
  name: someone

if the parent property user is to be filtered, the whole property would become

user: null

where the user.email and user.name keys are not preserved.

Listing filtered fields

It was suggested in #59 that we have another metadata field in the event to list out all the properties being filtered, e.g.

"__masked__": [
  ["user", "email"],
  ["user", "id"]
]

This would be useful if we decide to set the filtered properties to null. This helps differentiate whether a property is null because its value is actually null or because it is hidden due to categories.

This would alse be useful if we go with the second approach of deleting the field altogether too. Since there are json schemas that allow additional properties. This would inform whether the missing properties are not there due to being filtered by categories or not included in the original data.

May 03 '21 15:05 kiendang

How exactly do we "filter" the fields? 2 options being discussed are

Set the property to null, equivalent to removing the data and keep the key

Delete the property, i.e. removing both the key and the data

Do we inform users of the fields being filtered?

I don’t think it matters either way, because as you mentioned, preserving the data shape isn’t really a valid reason to keep the fields around. My preference, then, would be to drop the fields and add them to a masked field.
Adding a masked field seems like the right way to go here, no matter what we decide in (1). We should be explicit about what fields were masked, even in the case where we drop fields. This makes it easier for consumers of the emitted data to know exactly what happened to the data (and build logic around that).

This brings up another question. What about non-required properties? Do we need a missing field in the event capsule in the spirit of being explicit? Right now, it’s implied that missing, optional properties were not supplied by the event source, rather than being dropped by the filter. This can be inferred+verified by the checking the masked field, of course, so missing would be redundant but explicit.

Jun 01 '21 20:06 Zsailer

I don’t think it matters either way, because as you mentioned, preserving the data shape isn’t really a valid reason to keep the fields around. My preference, then, would be to drop the fields and add them to a masked field.

Adding a masked field seems like the right way to go here, no matter what we decide in (1). We should be explicit about what fields were masked, even in the case where we drop fields. This makes it easier for consumers of the emitted data to know exactly what happened to the data (and build logic around that).

This is what I think too.

This brings up another question. What about non-required properties? Do we need a missing field in the event capsule in the spirit of being explicit? Right now, it’s implied that missing, optional properties were not supplied by the event source, rather than being dropped by the filter. This can be inferred+verified by the checking the masked field, of course, so missing would be redundant but explicit.

I think it's better to do this post-hoc so we don't add more computation when recording events. For now we can provide a code example in the doc on how to extract missing, not filtered fields from the emitted events.

I would prefer the recording event logic to remain lean and only include things we have to do and can only do during event recording. To me this falls under operations/processing you can do on the emitted events after receiving them. There might be dedicated tooling for this in the future, by us and third parties, that arises from usage of telemetry.

Jun 02 '21 02:06 kiendang

Sounds good. We can always offer tooling (i.e. some helpful functions) inside jupyter/telemetry (in a later PR) to help process emitted events.

Jun 03 '21 18:06 Zsailer

telemetry telemetry copied to clipboard

Handle category-filtered fields

How to filter properties

Listing filtered fields

telemetry
telemetry copied to clipboard