CDK CLI Telemetry
Description
The CDK CLI team currently has no visibility into how the CLI is performing for users. In order to detect accidental regressions more quickly, and make data-driven decisions on what to work on next, the CDK CLI will begin to collect anonymous telemetry on user commands.
Roles
| Role | User |
|---|---|
| Proposed by | @kaizencc |
| Author(s) | @kaizencc |
| API Bar Raiser | @iliapolo |
| Stakeholders | @rix0rrr @mrgrain |
See RFC Process for details
Workflow
- [x] Tracking issue created (label: `status/proposed`)
- [x] API bar raiser assigned (ping us at #aws-cdk-rfcs if needed)
- [x] Kick off meeting
- [x] RFC pull request submitted (label: `status/review`)
- [x] Community reach out (via Slack and/or Twitter)
- [x] API signed-off (label `status/api-approved` applied to pull request)
- [x] Final comments period (label: `status/final-comments-period`)
- [x] Approved and merged (label: `status/approved`)
- [ ] Execution plan submitted (label: `status/planning`)
- [x] Plan approved and merged (label: `status/implementing`)
- [ ] Implementation complete (label: `status/done`)
Author is responsible to progress the RFC according to this checklist, and apply the relevant labels to this issue so that the RFC table in README gets updated.
Generally in favor of this approach to data-driven improvement, but I have concerns about the redaction mechanisms of tracebacks and other error-type messages. While they are optional, it seems that there are too many edge cases to properly redact all customer information, and I hope that these optional sections would default to off.
I haven't investigated far enough to name specific edge cases, but I mention it as the section I find potentially objectionable.
While I understand 'more data' is desirable for AWS, I'm opposed to the current proposal for these reasons:
Making telemetry gathering opt-out rather than opt-in. An opt-out is only useful if you're aware of the option, which users may (probably will?) not be. This matters because of the risks of imperfect redaction, the possibility of deanonymization (which a cursory read through the gathered data suggests is a real risk), and the potential privacy risks of sending metadata (such as connecting IP addresses). This would be a security and privacy risk in our organization.
Besides that, I see two stated goals, but limited rationale for how this change would achieve them, or for whether they could be better met by other means.
For the first: I'm not sure whether AWS can detect regressions in other ways, for instance through failing deployments or an increase in opened issues (any serious regression seems to lead to a quickly opened, and resolved, issue on GH). I can't judge the viability of these options, but before the CLI starts phoning home by default, I'd like to see a serious analysis of this.
As for the second: You can't measure what isn't possible, so 'what to work on next' will always be skewed towards (partially) supported features by these measurements.
Thanks for the feedback thus far! A couple of points to address the feedback here -- everyone is also very welcome to comment on the RFC pull request itself.
We believe that client side telemetry is the right way to achieve our goals of making data-driven product decisions and resolving errors faster. We are trying to raise the bar for how we respond to customer pain points -- that means intentionally driving a solution where we would be able to proactively solve issues rather than rely on reactionary efforts based on [manual] user reports.
It sounds like there are concerns that the redaction mechanisms have the potential to leak customer data. We don't want that, and if it is a risky proposition we will hold off on it initially (or make it opt-in / specifically asked for). I will update the RFC with a concrete proposal of what the redaction would look like, and we'll go from there.
Opt-out vs. opt-in. We will be providing a CLI Notice and a 30-day buffer period for customers to proactively opt out if needed. The CLI Notice is the main mechanism for very clear communication. We typically use that communication mechanism for security vulnerabilities, behavior regressions, etc. It shows up on every CDK CLI command and will only disappear when explicitly acknowledged. Because we're using this channel, we expect that customers will be aware of this change well before it goes into effect.
I have added a far more detailed proposal on what and how we are sanitizing error messages, traces, and logs!