aws-cdk-rfcs icon indicating copy to clipboard operation
aws-cdk-rfcs copied to clipboard

CDK CLI Telemetry

Open kaizencc opened this issue 6 months ago • 4 comments

Description

The CDK CLI team currently has no visibility into how the CLI is performing for users. In order to detect accidental regressions more quickly, and make data-driven decisions on what to work on next, the CDK CLI will begin to collect anonymous telemetry on user commands.

Roles

Role User
Proposed by @kaizencc
Author(s) @kaizencc
API Bar Raiser @iliapolo
Stakeholders @rix0rrr @mrgrain

See RFC Process for details

Workflow

  • [x] Tracking issue created (label: status/proposed)
  • [x] API bar raiser assigned (ping us at #aws-cdk-rfcs if needed)
  • [x] Kick off meeting
  • [x] RFC pull request submitted (label: status/review)
  • [x] Community reach out (via Slack and/or Twitter)
  • [x] API signed-off (label status/api-approved applied to pull request)
  • [x] Final comments period (label: status/final-comments-period)
  • [x] Approved and merged (label: status/approved)
  • [ ] Execution plan submitted (label: status/planning)
  • [x] Plan approved and merged (label: status/implementing)
  • [ ] Implementation complete (label: status/done)

Author is responsible to progress the RFC according to this checklist, and apply the relevant labels to this issue so that the RFC table in README gets updated.

kaizencc avatar Jun 06 '25 16:06 kaizencc

Generally in favor of this approach to data-driven improvement, but I have concerns about the redaction mechanisms of tracebacks and other error-type messages. While they are optional, it seems that there are too many edge cases to properly redact all customer information, and I hope that these optional sections would default to off.

Without investigating in further detail I don't have specifics about which edge cases to examine in more detail, but mention it as the section I find potentially objectionable.

mikejgray avatar Jun 06 '25 18:06 mikejgray

While I understand 'more data' is desirable for AWS, I'm opposed to the current proposal for these reasons:

Making telemetry gathering opt-out rather than opt-in. Both because an opt-out is only useful is you're aware of the option, which users may (probably will?) not be. This because of the risks of imperfect redaction, the possibility of deanonymization (which a cursory read through the gathered data is a real risk), and the potential privacy risks of sending metadata (such as connected IP's). This would be a security and privacy risk in our organization.

Besides that, I see two stated goals, but limited rationale for how these goals would be reached by this change, and if these goals can't be met better via other means.

For the first: I'm not sure if AWS can detect this in other ways: Failing deployments, increase in opened issues (any serious regression seems to lead to a quickly opened - and resolved - issue on GH) for instance. I can't judge the viability of these options, but before the CLI starts phoning home by default, I'd like to see a serious analysis of this.

As for the second: You can't measure what isn't possible, so 'what to work on next' will always be skewed towards (partially) supported features by these measurements.

joostvanderborg avatar Jun 10 '25 08:06 joostvanderborg

Thanks for the feedback thus far! A couple of points to address the feedback here -- everyone is also very welcome to comment on the RFC pull request itself.

We believe that client side telemetry is the right way to achieve our goals of making data-driven product decisions and resolving errors faster. We are trying to raise the bar for how we respond to customer pain points -- that means intentionally driving a solution where we would be able to proactively solve issues rather than rely on reactionary efforts based on [manual] user reports.

Sounds like there are concerns around the redaction mechanism and that they have the potential to leak customer data. We don't want that, and if it is a risky proposition we will hold off on it initially (or make it opt-in / specifically asked for). I will update the RFC with a concrete proposal of what the redaction would look like, and we'll go from there.

Opt-out vs. opt-in. We will be providing a CLI Notice and a 30-day buffer period for customers to proactively opt-out if needed. The CLI Notice is the main mechanism for very clear communication. We typical use that communication mechanism for security vulnerabilities, behavior regressions, etc. It shows up on every CDK CLI command and will only disappear when explicitly acknowledged. Because we're using this channel, we expect that customers will be aware of this change well before it goes into effect.

kaizencc avatar Jun 10 '25 19:06 kaizencc

I have added a far more detailed proposal on what and how we are sanitizing error messages, traces, and logs!

kaizencc avatar Jun 12 '25 16:06 kaizencc