Conversation: Which stats should we be logging?

Open andrewosh opened this issue 5 years ago • 4 comments

@zootella mentioned in the last Dat meeting that we should have a conversation about:

  1. What kind of telemetry info should we collect?
  2. How do we make sure that the telemetry info is informative while not leaking any sensitive info?
  3. Should we make the collected data available, and if so, how?

We currently have very rudimentary telemetry in the daemon, but we haven't yet had a conversation about exactly what things would both be appropriate to collect and useful for future optimization.

Currently we're reporting:

  1. The hashed daemon token (to track anonymized identity across restarts).
  2. The total number of hypercores that the corestore has in memory.
  3. The total number of peers that the swarm networker is connected to.
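As a rough illustration of the three fields above, here is a minimal sketch of what such a report could look like. The function name, field names, and the choice of SHA-256 are assumptions made for this example; the thread does not show the daemon's actual report format or hash function.

```python
import hashlib
import json


def build_report(daemon_token: str, core_count: int, peer_count: int) -> dict:
    """Build a telemetry report (hypothetical shape, not the daemon's real format)."""
    # Hash the token so the same daemon can be recognized across restarts
    # without the raw token ever leaving the machine. SHA-256 is an assumption.
    token_hash = hashlib.sha256(daemon_token.encode("utf-8")).hexdigest()
    return {
        "id": token_hash,     # anonymized identity
        "cores": core_count,  # hypercores the corestore holds in memory
        "peers": peer_count,  # peers the swarm networker is connected to
    }


report = build_report("example-token", core_count=12, peer_count=5)
print(json.dumps(report))
```

The identifier is stable for a given token, so reports from one daemon can be correlated over time, which is exactly the property worth weighing in the privacy discussion.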

Ideally, we want to collect things like latency numbers as well, and perhaps other network-related stats.

What do y'all think? @mafintosh @pfrazee

andrewosh avatar Apr 01 '20 18:04 andrewosh

Lots to dig into here. The only observation I have right now is that anomaly stats might be useful: things like “an operation took long under X condition.”

pfrazee avatar Apr 01 '20 18:04 pfrazee

Seeing a histogram of how long operations take would be really interesting, @pfrazee.

Telemetry can be useful at every level of the stack, I think, including all the way on the top: At the product level where the user clicks to complete a task and then waits and gets (or doesn't get) the desired result. How long did they wait for the result to start? to complete? What percent of the time does the user cancel or exit before a success or failure?

We're not at all interested in the behavior of our users or details about their data (unlike all the big centralized platforms). To collect stats with minimal privacy impact, we could have nodes report:

  • hashes of IPs and other unique or potentially identifying information instead of the actual identifiers
  • large buckets of values (<100ms, <1s, <10s, <1m, <10m, longer) rather than exact values
  • aggregate counts of things rather than details of individual things
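The three ideas in the list above can be sketched together as follows. This is only an illustration: the function names, the salt handling, and the use of SHA-256 are assumptions for the example, not anything the daemon implements. The bucket boundaries are the ones proposed in the list.

```python
import bisect
import hashlib
from collections import Counter

# Upper bounds in milliseconds for the proposed buckets; anything at or
# above the last bound falls into "longer".
BUCKET_LIMITS_MS = [100, 1_000, 10_000, 60_000, 600_000]
BUCKET_LABELS = ["<100ms", "<1s", "<10s", "<1m", "<10m", "longer"]


def bucket_for(duration_ms: float) -> str:
    """Map an exact duration to a coarse bucket label."""
    return BUCKET_LABELS[bisect.bisect_right(BUCKET_LIMITS_MS, duration_ms)]


def anonymize(identifier: str, salt: bytes) -> str:
    """Hash an identifier such as an IP address (hypothetical helper).

    A secret salt matters here: the IPv4 space is small enough that
    unsalted hashes can be reversed by hashing every possible address.
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


def summarize(durations_ms) -> dict:
    """Report aggregate counts per bucket instead of individual timings."""
    return dict(Counter(bucket_for(d) for d in durations_ms))


summary = summarize([50, 250, 99, 5_000, 700_000])
print(summary)
```

Only the per-bucket counts would leave the node, so an individual operation's exact timing is never reported.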

A goal at this super high level would be to measure how well the technology works deployed, as real people and apps use it, with their real data and imperfect consumer hardware and internet connections. It would be great to be able to track user success version to version (Upgrade now: we're measuring dat: links load twice as reliably as before!). Changes beyond our control on the public internet will affect these metrics also, but that's not a bad thing: If some sudden or gradual external change (with a large ISP, with a Windows Update) makes things start working twice or half as well, that's something we should know about.

zootella avatar Apr 01 '20 20:04 zootella

  • Is the client reachable (to establish if hole-punching is working for everyone) – if not then find its public IP but include the IP’s ASN in the report not the IP itself. I’m thinking this will identify potential CG-NAT issues.
  • Does the client have a global-scoped IPv6 address (to establish if the future is now)
  • Uptime.
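For the IPv6 point, a minimal sketch of the check might look like this, assuming the node already has a list of its own interface addresses as strings (how that list is gathered is platform-specific and not shown). The function name is hypothetical. Only the resulting boolean would be reported, never the addresses themselves.

```python
import ipaddress


def has_global_ipv6(addresses) -> bool:
    """Return True if any address in the list is a globally routable IPv6 address."""
    for text in addresses:
        try:
            addr = ipaddress.ip_address(text)
        except ValueError:
            continue  # skip anything that is not a valid IP address
        # is_global excludes link-local, unique-local, loopback, and
        # documentation ranges.
        if addr.version == 6 and addr.is_global:
            return True
    return False


print(has_global_ipv6(["192.168.1.10", "fe80::1", "2606:4700:4700::1111"]))
print(has_global_ipv6(["192.168.1.10", "fe80::1", "fd00::1"]))
```

The ASN lookup suggested in the first bullet would need an external IP-to-ASN dataset and is left out of this sketch.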

da2x avatar Jun 07 '20 19:06 da2x

Thanks all -- at the end of the day, we decided to drop the telemetry a few weeks back when the daemon moved out of the "beta" stage. Ultimately we'd prefer it to be opt-in, but in that case collecting large-scale stats (like IPv6 addresses @da2x) wouldn't make much sense, as we wouldn't get a large-scale picture.

Instead we've opted to ensure we don't log any keys in ~/.hyperdrive/log.json, so that we can collect those as necessary to fix issues. We're also periodically logging perf-related stats now, which will be useful for debugging. We're hoping certain aggregate stats can be pulled directly from the DHT (like an approximate DHT size).
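One way to keep keys out of a log file is to scrub each entry before it is written. The sketch below assumes that keys show up as 64-character hex strings and that log entries are JSON-like structures; the `redact` helper and the placeholder text are hypothetical, and the thread does not describe how the daemon actually does this.

```python
import re

# Assumption: keys are logged as 64 hex characters (32 bytes).
KEY_PATTERN = re.compile(r"\b[0-9a-fA-F]{64}\b")


def redact(value):
    """Recursively replace anything that looks like a key in a log entry."""
    if isinstance(value, str):
        return KEY_PATTERN.sub("[redacted-key]", value)
    if isinstance(value, dict):
        # Only values are scrubbed; field names are left untouched.
        return {name: redact(item) for name, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans, None pass through unchanged


entry = {"msg": "mounted hyper://" + "ab" * 32, "peers": 3}
print(redact(entry))
```

A pattern-based scrub like this is a backstop; not logging the key in the first place is the more reliable fix.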

@zootella Strongly agree with your last point, that we do need a way to measure if sudden/gradual changes have negative effects. I think for now we'll just rely on users opening issues, while making sure that the log files give us the info we need. Given that, I'll update the issue title to be "Which stats should we be logging?" as the suggestions here are equally relevant.

andrewosh avatar Jun 08 '20 09:06 andrewosh