Batch exports UX improvements, e.g. error notifications and UI revamp

Open tiina303 opened this issue 1 year ago • 0 comments

Why: reduces confusion around how batch exports work, which cuts support load and also saves cost.

Plan:

  • [ ] Pipeline 3000 UI release
  • [ ] https://github.com/PostHog/posthog/issues/20367
  • [x] Error notifications - Auto-pause depends on us having sorted out emails: https://posthog.slack.com/archives/C0374DA782U/p1707999504020739
  • [x] Auto-pause on error thresholds
    • Pausing on repeated failures, with email notification. Slack: https://posthog.slack.com/archives/C0374DA782U/p1708006444370889?thread_ts=1708005246.718459&cid=C0374DA782U
  • [ ] UI quick wins
    • https://github.com/PostHog/posthog/issues/20450
    • Fill in and display the records_completed column in the UI
    • Fill in previously available values when editing an existing export.
    • Stretch: Deal with non-existent batch exports: Batch exports can be deleted if they ended and are past their retention period (7 days). The UI should display this by graying out the export, prompting the user to recreate it (potentially with the same configuration).
  • [ ] Redshift import via INSERT is far too slow; we need to document the S3 + Redshift flow -> https://github.com/PostHog/posthog.com/issues/8043
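
For discussion, the "auto-pause on error thresholds" item above could be sketched roughly as follows. This is a minimal illustration only, not the actual implementation: `ExportState`, `record_run`, and the default threshold of 3 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ExportState:
    """Hypothetical per-export bookkeeping; not PostHog's actual model."""

    consecutive_failures: int = 0
    paused: bool = False


def record_run(state: ExportState, succeeded: bool, threshold: int = 3) -> bool:
    """Update state after one run; return True if this run triggered a pause.

    The threshold default is an assumption made for illustration.
    """
    if state.paused:
        # Already paused: nothing to update and no new notification to send.
        return False
    if succeeded:
        # Any success resets the streak, so only repeated failures count.
        state.consecutive_failures = 0
        return False
    state.consecutive_failures += 1
    if state.consecutive_failures >= threshold:
        state.paused = True
        return True  # the caller would send the email notification here
    return False
```

Returning `True` only on the run that crosses the threshold means the email goes out once per pause rather than once per failed run.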

Q2 planning doc: https://docs.google.com/document/d/14vgMXToisseDFqAL-OaMXN1RfWSvh7w43Q8x6oheTaw/edit

Additional info / stretch goals:

  • Reliability and operations:
    • All backends need to heartbeat so they can resume long-running tasks after a failure or deploy
    • We need alerting on processing lag when workers are broken
    • Our logs are too raw. We should catch common user errors and log something more useful (e.g. a Snowflake user had an incorrect table name, but the Snowflake error was completely useless). We should also catch internal errors and log them as "Internal errors" without a ton of detail (for example, we currently log pyarrow or ClickHouse errors in full detail, often including the hostname and too much confusing information for a user).
  • It's possible for people to ingest events so large that exports fail (they hit memory limits). We can manually adjust CLICKHOUSE_MAX_BLOCK_SIZE_OVERRIDES for now, but that's a lot of support + handholding. Exports should "Just Work" for any events you can ingest.
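
To make the logging point above concrete, here is a rough sketch of mapping raw exceptions to user-facing log entries. Everything in it is an assumption for illustration: `UserFacingLog`, `USER_ERROR_PATTERNS`, `to_user_facing`, and the regex and hint text are hypothetical, and real destination errors would need their own patterns.

```python
import re
from dataclasses import dataclass


@dataclass
class UserFacingLog:
    kind: str  # "user_error" or "internal_error"
    message: str


# Hypothetical patterns: raw error text from a destination mapped to a hint
# the user can act on. The regex and wording are illustrative only.
USER_ERROR_PATTERNS = [
    (
        re.compile(r"table .* does not exist", re.IGNORECASE),
        "The destination table was not found. Check the table name in the export configuration.",
    ),
]


def to_user_facing(exc: Exception) -> UserFacingLog:
    """Translate a raw exception into what the user should see in export logs."""
    text = str(exc)
    for pattern, hint in USER_ERROR_PATTERNS:
        if pattern.search(text):
            return UserFacingLog("user_error", hint)
    # Unrecognised errors are treated as internal: the raw text (hostnames,
    # stack details) is deliberately left out of the user-facing message.
    return UserFacingLog("internal_error", "Internal error while running the export.")
```

The full exception would still go to internal logging; only the sanitized entry is meant for the user.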

tiina303 · Mar 25 '24 19:03