posthog
Replay "yes, and" Pipeline
How Pipeline team can help Replay team:
- [ ] Continue the work on making Blobby more resilient, if any remains:
- consumer lifecycle and offset management?
- Background on Blobby: https://docs.google.com/document/d/1Pj_Lpi3nGcFOARXQ_07yZPO4qFGvtj_qt7elxr4hQws/edit
- [ ] Help get error tracking off the ground
- Great opportunity for us to clarify how new event-based products should be spun up, see https://docs.google.com/document/d/17aOrHFk1iOSJHKWgk9RCu_Zqy5i2ZHhjV7XNAdOZO64/edit
- [ ] Replay without product analytics
- The specific problem: product analytics events get quota limited if the user isn't paying for product analytics, but persons should likely still be created
- [ ] replay 🤝 machine learning
- we want to generate embeddings for recordings
- right now we're generating them for a subset of recordings for our team using celery
- celery is pretty terrible at "please run this task over and over at a defined rate and keep your task queue full"
- since it prefers "you gave me 10,000 copies of a task I will try and run them all super fast and kill your dependencies"
- this is a very shared problem since it's a replay product need but affects and is affected by ingestion
- (ideally I'd have an RFC right now but we're still testing)
- one obvious thing for us to evaluate is whether we should use temporal rather than celery
- imagine the algorithm is roughly:
  - for all opted-in teams
    - for all recordings over some minimum duration that have not yet had embeddings generated (or have ingested significantly more data since last processing)
      - apply some filters to avoid processing every byte of every recording
      - run some ML that generates embeddings
      - store those embeddings
  - on another timer
    - run clustering to generate magic playlists if the embeddings for the team have changed
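To make the rough algorithm above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Recording` fields, the thresholds, and the `prepare` / `embed` / `cluster` callables are hypothetical names, not existing PostHog code. The one deliberate choice is that each timer tick processes a bounded batch, rather than enqueueing one task per recording.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Set

# Hypothetical thresholds, not values from the issue.
MIN_DURATION_SECONDS = 15   # "some minimum duration"
SIGNIFICANT_GROWTH = 1.5    # re-embed once a recording holds 50% more data


@dataclass
class Recording:
    session_id: str
    team_id: int
    duration_seconds: float
    size_bytes: int
    embedded_at_size_bytes: Optional[int] = None  # None means never embedded


def needs_embedding(rec: Recording) -> bool:
    """Long enough, and either never embedded or grown significantly since."""
    if rec.duration_seconds < MIN_DURATION_SECONDS:
        return False
    if rec.embedded_at_size_bytes is None:
        return True
    return rec.size_bytes >= rec.embedded_at_size_bytes * SIGNIFICANT_GROWTH


def embedding_pass(
    recordings: Iterable[Recording],
    opted_in_teams: Set[int],
    prepare: Callable[[Recording], object],   # filters down the recording data
    embed: Callable[[object], List[float]],   # the ML step
    store: Dict[str, List[float]],            # stand-in for real storage
    batch_size: int,
) -> Set[int]:
    """One timer tick: embed at most batch_size recordings.

    Returns the ids of teams whose embeddings changed during this tick.
    """
    changed: Set[int] = set()
    processed = 0
    for rec in recordings:
        if processed >= batch_size:
            break  # leave the rest of the backlog for the next tick
        if rec.team_id not in opted_in_teams or not needs_embedding(rec):
            continue
        store[rec.session_id] = embed(prepare(rec))
        rec.embedded_at_size_bytes = rec.size_bytes
        changed.add(rec.team_id)
        processed += 1
    return changed


def clustering_pass(changed_teams: Set[int], cluster: Callable[[int], None]) -> None:
    """The other timer: recluster only teams whose embeddings changed."""
    for team_id in sorted(changed_teams):
        cluster(team_id)
    changed_teams.clear()
```

Capping work per tick keeps the load on dependencies predictable however large the backlog is. Whatever scheduler ends up driving it would only need to call `embedding_pass` and `clustering_pass` on their own intervals.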
(added some info on the embeddings - it might not totally make sense 🙈)
cc @daibhin
things have moved on past this i think