enhancement-proposals Jupyter Telemetry Enhancement Proposal

Jupyter Telemetry Enhancement Proposal

Open jaipreet-s opened this issue 5 years ago • 17 comments

Contains two accompanying files

Press Release
Technical proposal

cc @yuvipanda @Zsailer

Jul 05 '19 23:07 jaipreet-s

A couple thoughts:

Pluggable persistence would likely eventually be an objective
Should folks use this event bus / messaging system for non-Jupyter application message persistence? Or "this is for logging structured metrics for Jupyter and extensions only"?

@choldgraf (@mybinder) and I were just talking about how to profile BinderHub container launches https://twitter.com/westurner/status/1142175356880900102 :

https://binderhub.readthedocs.io/en/latest/overview.html#a-diagram-of-the-binderhub-architecture

But there's nothing that can easily profile all of the layers of the distributed stack for a given container launch request (when the image is already cached)? Maybe @sysdig? https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/#sysdig

Sysdig pulls together data from system calls, Kubernetes events, Prometheus metrics, statsD, JMX, and more into a single pane that gives you a comprehensive picture of your environment.

JSON with a JSON Schema should be easy enough to integrate with a tool like sysdig, for example.

Presumably there'd be sinks for the supported persistence backends. Would there be a standard interface for reviewing telemetry events and quantitative metrics from within Notebook or JupyterLab; or would users be expected to also configure Grafana / ELK / Loki / Splunk / Sentry?

https://prometheus.io/docs/introduction/faq/#how-to-feed-logs-into-prometheus
https://grafana.com/loki#faq

I'm not at all familiar with Wikimedia or Mozilla telemetry systems; so, this is a JSON message store with input validation?

Jul 06 '19 00:07 westurner

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks. From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

Jul 06 '19 07:07 betatim

I wrote up https://github.com/jupyterlab/jupyterlab-telemetry/blob/master/design.md earlier which has informed a lot of choices in this, and has a ton of background material as well. Would recommend reading :)

Jul 06 '19 07:07 yuvipanda

I've read it previously and now but I don't think it answers my questions.

Jul 06 '19 08:07 betatim

Does GDPR apply to anonymous unique IDs? Is hash(IP, datetime) considered to be personally identifiable information? How could I look that up given your username? I shouldn't assume that there's only one user behind an IP (and so I shouldn't disclose everything for a given IP to whoever claims that's theirs). With a one-way hash of (IP, datetime, [entropy]) it's difficult to impossible to look up that information given just someone's IP address.
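A minimal sketch of what such a one-way identifier could look like, assuming the "entropy" is a secret salt held only on the server. The names SECRET_SALT and anonymous_id are hypothetical, and the IP below is a documentation-range address; nothing here is part of the proposal.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone

# Hypothetical per-deployment secret (the "entropy"); kept server-side, never logged.
SECRET_SALT = secrets.token_bytes(32)


def anonymous_id(ip: str, when: datetime, salt: bytes = SECRET_SALT) -> str:
    """Return a keyed one-way hash of (IP, timestamp) as a hex string."""
    message = f"{ip}|{when.isoformat()}".encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


launch_time = datetime(2019, 7, 6, 10, 7, tzinfo=timezone.utc)
event_id = anonymous_id("203.0.113.7", launch_time)
```

Without the salt, someone who knows only an IP address cannot recompute the identifier; with the salt and the exact timestamp, the server can.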

In order to profile BinderHub launches from initial request through to instance launch, do I need to include a username if I have a per-launch-request unique identifier?

AFAIU, log retention for lawful purposes supersedes; at least in the United States.

On Saturday, July 6, 2019, Tim Head wrote:

I've read it previously and now but I don't think it answers my questions.


Jul 06 '19 10:07 westurner

@betatim:

I've read it previously and now but I don't think it answers my questions.

Apologies, that wasn't directed at you - just a general comment to those who might not have seen it yet.

Jul 07 '19 05:07 yuvipanda

One more thing I forgot to write down last time: I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).

Jul 07 '19 06:07 betatim

I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).

Re: components self-identifying as "trusted"

Private key integrity may be the most challenging part of this. A JS app running in a browser (with the obfuscated or unobfuscated source available) does not have a secure enclave within which to store a cryptographic key to be used for signing messages. A JS or Python component would need to generate message signing keys which are then somehow approved as trusted.

CSRF mitigations like per-request token generation may negatively affect performance if the system runs short of randomness (entropy). https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Cross-Site_Request_Forgery_Prevention_Cheat_Sheet.md#csrf-defense-recommendations-summary

There's already the Jupyter auth token; though that's not per-component and AFAIU is not designed to be used as a message signing key.

Jul 09 '19 05:07 westurner

HMAC ("hash-based message authentication code") tokens are one way to mitigate the risk of CSRF (a different thing submitting a message as a trusted thing) https://en.wikipedia.org/wiki/HMAC

Because JSON message key orderings are not necessarily stable (the key order may be different if an attribute is deleted and then inserted again later, for example), the cryptographic hash or signature varies unless the message is canonicalized first. json.dumps(sort_keys=True) is basically a message canonicalization algorithm.
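As a rough illustration of the point about canonicalization, here is a sketch that computes an HMAC over a sorted-key JSON serialization. The function names, the key, and the sample event are made up for the example. Sorted keys with fixed separators is only a simple stand-in for a real canonicalization algorithm (it does not normalize number formatting, for instance).

```python
import hashlib
import hmac
import json


def canonicalize(message: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal dicts give equal bytes."""
    return json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(message: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical form of the message."""
    return hmac.new(key, canonicalize(message), hashlib.sha256).hexdigest()


def verify(message: dict, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(message, key), tag)


key = b"pre-shared-key-for-illustration-only"
a = {"event": "cell_executed", "seq": 1}
b = {"seq": 1, "event": "cell_executed"}  # same content, different key order
tag = sign(a, key)
```

Here verify(b, key, tag) succeeds even though the key order differs, while any change to a value makes verification fail.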

Linked Data Signatures have (URIs for) signature suites, message canonicalization algorithms, and message digest algorithms. This makes things future proof in that instead of saying this is jupyter_telemetry_message_format v2, you specify the proof type (which defines a canonicalizationAlgorithm, digestAlgorithm, and proofAlgorithm) https://w3c-dvcg.github.io/ld-signatures/#terminology

{
  "@context": "https://w3id.org/identity/v1",
  "title": "Hello World!",
  "proof": {
    "type": "RsaSignature2018",
    "creator": "https://example.com/i/pat/keys/5",
    "created": "2017-09-23T20:21:34Z",
    "domain": "example.org",
    "nonce": "2bbgh3dgjg2302d-d2b3gi423d42",
    "proofValue": "eyJ0eXAiOiJK...gFWFOEjXk"
  }
}

https://w3c-dvcg.github.io/ld-signatures/#signature-suites :

{
  "id": "https://w3id.org/security#RsaSignature2018",
  "type": "SignatureSuite",
  "canonicalizationAlgorithm": "https://w3id.org/security#GCA2015",
  "digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
  "proofAlgorithm": "https://www.ietf.org/assignments/jws-parameters#RSASSA-PSS"
}

https://web-payments.org/vocabs/security#LinkedDataSignature2015 :

{
  "@context": ["https://w3id.org/security/v1", "http://json-ld.org/contexts/person.jsonld"],
  "@type": "Person",
  "name": "Manu Sporny",
  "homepage": "http://manu.sporny.org/",
  "signature": {
    "@type": "LinkedDataSignature2015",
    "creator": "http://manu.sporny.org/keys/5",
    "created": "2015-09-23T20:21:34Z",
    "signatureValue": "OGQzNGVkMzVmMmQ3ODIyOWM32MzQzNmExMgoYzI4ZDY3NjI4NTIyZTk="
  }
}

"JSON-LD Signatures with JSON Web Signatures" https://github.com/WebOfTrustInfo/ld-signatures-python/blob/master/jld_signatures.py
"An implementation of the Linked Data Signatures specification for JSON-LD. Works in the browser and node.js." https://github.com/digitalbazaar/jsonld-signatures/#examples https://github.com/WebOfTrustInfo/ld-signatures-js
- Which signature suite is recommended changes over time and will change in the future. In order to future-proof, ld-signatures has URIs for standard signature suites: https://github.com/digitalbazaar/jsonld-signatures/tree/master/lib/suites
  - EcdsaKoblitzSignature2016.js
  - Ed25519Signature2018.js
  - GraphSignature2012.js
  - JwsLinkedDataSignature.js (JSON Web Signatures (JWS))
  - LinkedDataProof.js
  - LinkedDataSignature.js
  - LinkedDataSignature2015.js
  - RsaSignature2018

HMACs use symmetric keys (pre-shared key), cryptographic signatures use asymmetric keys (public and private keys). In either case, if a key is kept in code and/or RAM, it's really not that secret. https://gist.github.com/westurner/4345987bb29fca700f52163c339a270f#gistcomment-2822602

... What's a good way for a component to indicate that it's trusted?

Jul 09 '19 07:07 westurner

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks. From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Hi @betatim , Thanks for the feedback!

The router fundamentally decouples event publishers from event consumers. For example, without the router, if an event sink's interface is updated or an event sink is replaced, each event publisher needs to be updated to use the new interface. With the router, this is not an issue: publishers still talk to the router, and event sinks can be added or dropped via the telemetry_event_sinks configuration.

In addition, the router abstracts common functionality that would otherwise have to be implemented by each event sink, such as the items listed in the Core Event Router section:

Validating events against schemas
Providing a mechanism for adding metadata fields
Dropping events that are not whitelisted in a given deployment
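To make the decoupling concrete, here is a toy sketch of a router with those three responsibilities. It is not the proposal's actual interface: EventRouter, record, register_schema, and the "__schema__"/"__timestamp__" metadata keys are all hypothetical, and the required-field check is only a stand-in for real JSON Schema validation.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List, Set

Event = dict
Sink = Callable[[Event], None]


class EventRouter:
    """Hypothetical router: publishers call record(); sinks are plain callables."""

    def __init__(self, sinks: List[Sink], allowed_schemas: Set[str]):
        self.sinks = sinks
        self.allowed_schemas = allowed_schemas
        self.required_fields: Dict[str, Set[str]] = {}

    def register_schema(self, schema_id: str, required: Set[str]) -> None:
        self.required_fields[schema_id] = required

    def record(self, schema_id: str, event: Event) -> bool:
        # Drop events that are not allowed in this deployment.
        if schema_id not in self.allowed_schemas:
            return False
        # Stand-in for schema validation: only checks that required keys exist.
        missing = self.required_fields.get(schema_id, set()) - set(event)
        if missing:
            raise ValueError(f"{schema_id}: missing fields {sorted(missing)}")
        # Metadata is added by the router, not by the publisher.
        enriched = {
            **event,
            "__schema__": schema_id,
            "__timestamp__": datetime.now(timezone.utc).isoformat(),
        }
        for sink in self.sinks:
            sink(enriched)
        return True


received = []  # a list's append method serves as a trivial sink
router = EventRouter(sinks=[received.append],
                     allowed_schemas={"example.org/cell-executed"})
router.register_schema("example.org/cell-executed", {"cell_id"})
router.record("example.org/cell-executed", {"cell_id": "abc"})
```

Swapping or adding a sink only changes the sinks list passed to the router; the publisher's record() call stays the same.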

Jul 10 '19 20:07 jaipreet-s

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

Jul 10 '19 20:07 jaipreet-s

Hi @betatim and @westurner - sorry for being late to get back re: components self-identifying as "trusted"

These are all good points. The current implementation of the event publisher interface makes it possible for publishers to do this themselves, and also for consumers to validate trust/integrity on their end.

That said, we should consider offering ways to make this easier to do for publishers. https://github.com/jupyter/telemetry/issues/21 has a few ideas on how to provide this functionality

Aug 01 '19 23:08 jaipreet-s

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

@betatim and @jaipreet-s

Yes, this proposal is trying to communicate that we're injecting telemetry across various "layers" of the Jupyter stack (i.e. Kernel, Server, Lab, Hub, etc.). We want everyone to be aware of these changes without fear that "Jupyter is secretly collecting data about users". We'll provide tools for admins to inform users that data is being collected. And, like @jaipreet-s said, we'll likely provide UI in JupyterLab that allows users to have some control over event collection.

We could remove the technical design plans for "consent" from this proposal and make that a separate discussion if necessary, but I don't think we should remove the language that we care about user privacy and awareness.

Aug 07 '19 20:08 Zsailer

I think my main point was that I'd avoid talking about user choice and audit trails in the same part of the document because they have such different requirements. They can't be reconciled, but that is fine as they are two very different things :)

Aug 08 '19 05:08 betatim

I'd avoid talking about user choice and audit trails in the same part of the document

That makes sense—these are really two different experiences/environments. Maybe we should split that bit into two different paragraphs (assuming that you're talking about the press-release document right now).

One paragraph about environments where the user is offering consent for the admin or extension developer to collect data.
Another paragraph talking about strictly controlled environments where auditing is required. In this case, Jupyter provides tools that make it easy for the environment admin to inform users that auditing is happening.

In both cases, we're communicating that Jupyter's stance is that administrators should be transparent with users.

Aug 08 '19 17:08 Zsailer

This is an example of a potential use case:

Our telemetry project, ETC JupyterLab Telemetry Extension, captures user interactions and logs these messages to a specified handler. The ETC JupyterLab Telemetry Example repo gives an example of the service provided by the extension being consumed and the events being logged to console.log.

Presently, we are capturing several user interactions with the Notebook:

Active Cell Changed
Cell Added
Cell Executed
Cell Removed
Notebook Opened
Notebook Saved
Notebook Scrolled

For each event, a list of cells relevant to the event is captured as well. This is described here. The messages include the list of relevant cells and the present state of the Notebook. Cell contents that have been seen before get replaced with a cell 'ID' in order to save storage space, which still allows the state of the Notebook to be reconstructed at a later time. The reason I point that out is that there might be use cases where multiple schemas could be registered for a single event.
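The idea of replacing already-seen cell contents with an ID can be sketched roughly as follows. This is an illustration only, not the extension's code: compress_cells and restore_cells are hypothetical names, and it assumes a cell is resent in full whenever its source differs from what was last sent under that ID.

```python
def compress_cells(cells, seen):
    """Replace cells whose source was already sent under the same id with just the id."""
    out = []
    for cell in cells:
        key = cell["id"]
        if seen.get(key) == cell["source"]:
            out.append({"id": key})  # content already logged; send the id only
        else:
            seen[key] = cell["source"]
            out.append(dict(cell))  # first sighting or changed content; send in full
    return out


def restore_cells(cells, store):
    """Rebuild full cells from compressed messages processed in their original order."""
    out = []
    for cell in cells:
        if "source" in cell:
            store[cell["id"]] = dict(cell)
        out.append(dict(store[cell["id"]]))
    return out


seen, store = {}, {}
first = compress_cells([{"id": "c1", "source": "x = 1"}], seen)
second = compress_cells(
    [{"id": "c1", "source": "x = 1"}, {"id": "c2", "source": "x + 1"}], seen
)
```

Replaying first and then second through restore_cells with the same store yields the full cell list again, which is what makes later reconstruction possible.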

This JSON schema matches the event messages:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "event_name": {
      "type": "string"
    },
    "cells": {
      "type": "array",
      "items": [
        {
          "type": "object",
          "properties": {
            "id": {
              "type": "string"
            },
            "index": {
              "type": "integer"
            }
          },
          "required": [
            "id",
            "index"
          ]
        }
      ]
    },
    "notebook": {
      "type": "object",
      "properties": {
        "metadata": {
          "type": "object",
          "properties": {
            "kernelspec": {
              "type": "object",
              "properties": {
                "display_name": {
                  "type": "string"
                },
                "language": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                }
              },
              "required": [
                "display_name",
                "language",
                "name"
              ]
            },
            "language_info": {
              "type": "object",
              "properties": {
                "codemirror_mode": {
                  "type": "object",
                  "properties": {
                    "name": {
                      "type": "string"
                    },
                    "version": {
                      "type": "integer"
                    }
                  },
                  "required": [
                    "name",
                    "version"
                  ]
                },
                "file_extension": {
                  "type": "string"
                },
                "mimetype": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                },
                "nbconvert_exporter": {
                  "type": "string"
                },
                "pygments_lexer": {
                  "type": "string"
                },
                "version": {
                  "type": "string"
                }
              },
              "required": [
                "codemirror_mode",
                "file_extension",
                "mimetype",
                "name",
                "nbconvert_exporter",
                "pygments_lexer",
                "version"
              ]
            }
          },
          "required": [
            "kernelspec",
            "language_info"
          ]
        },
        "nbformat_minor": {
          "type": "integer"
        },
        "nbformat": {
          "type": "integer"
        },
        "cells": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "cell_type": {
                "type": "string"
              },
              "source": {
                "type": "string"
              },
              "metadata": {
                "type": "object",
                "properties": {
                  "trusted": {
                    "type": "boolean"
                  }
                },
                "required": [
                  "trusted"
                ]
              },
              "execution_count": {
                "type": "null"
              },
              "outputs": {
                "type": "array",
                "items": {}
              },
              "id": {
                "type": "string"
              }
            },
            "required": [
              "id"
            ]
          }
        }
      },
      "required": [
        "metadata",
        "nbformat_minor",
        "nbformat",
        "cells"
      ]
    },
    "seq": {
      "type": "integer"
    },
    "notebook_path": {
      "type": "string"
    },
    "user_id": {
      "type": "string"
    }
  },
  "required": [
    "event_name",
    "cells",
    "notebook",
    "seq",
    "notebook_path",
    "user_id"
  ]
}

Please let me know if anyone has any questions regarding our use case.

Jul 22 '21 12:07 adpatter

Hi @Zsailer - Do you think we can close this PR now? It hasn't had active discussion for a while now :) Thanks!

Dec 02 '21 18:12 jaipreet-s
