Idempotency when sending events (avoid duplicate events)
Is your feature request related to a problem?
Duplicate events are possible when sending to PostHog. My exact use case: I'm listening to App Store subscription purchase/renew events on my server and I want to send those events to PostHog to keep track of sales and MRR. However, it is possible to end up sending the same event to PostHog twice:
- My server receives a notification about a subscription renewal
- My server sends the corresponding event to PostHog
- My server crashes
- The App Store attempts to send the same notification with the same "NotificationUUID" to my server again, since my server didn't respond with a 200 when processing the request
- My server ends up sending the same event to PostHog a second time, which is incorrect and messes up the subscription and sales data
Describe the solution you'd like
Some way to ensure idempotency when sending events to PostHog, so that there are no duplicates on the PostHog side, for example:
- Providing an idempotency key:
posthog.capture({
  idempotencyKey: 'MY_IDEMPOTENCY_KEY',
  distinctId: 'distinct_id_of_the_user',
  event: 'user paid for subscription',
})
- Or generating an event ID on the sender side:
posthog.capture({
  eventId: 'MY_GENERATED_EVENT_ID',
  distinctId: 'distinct_id_of_the_user',
  event: 'user paid for subscription',
})
Describe alternatives you've considered
You can persist those unique IDs in the DB before sending to PostHog, but then there's a danger of never sending the event to PostHog at all, if the server crashes after you persist to the DB but before you send the event.
Additional context
I'm sure this property would be useful not only for server-to-server events but also for client-to-server.
Thank you for your feature request – we love each and every one!
I'm chiming in here with another, perhaps more common, use case: when a user refreshes the page (either manually, or by closing and reopening their browser at a later time), we don't want to record multiple 'purchase' events.
We deduplicate based on the sort key [timestamp, distinct_id, event, uuid] <- if these fields are the same, then there will be eventual deduplication, so the solution here is to make sure you send the events with the same uuid.
Very useful feature. It would also be useful when importing old data while live data is already being reported, to avoid duplicate events. Maybe a "distinctEventId" key?
@tiina303 could you please provide more details, or perhaps some docs? What is the schema you mentioned in your reply? Is this how an event is supposed to be sent to PostHog? Maybe you have some examples at hand.
> We deduplicate based on the sort key [timestamp, distinct_id, event, uuid] <- if these fields are the same, then there will be eventual deduplication, so the solution here is to make sure you send the events with the same uuid.
@tiina303 just uuid or all the fields?
All of the fields need to match (event is the event name, e.g. pageview). distinct_id and event are required, so you're always sending them already. We send timestamp by default too, and I believe the JS library sends event UUIDs now as well; make sure you're on the latest version. If you send events from your backend library, make sure to always set uuid and timestamp.
The code is https://github.com/PostHog/posthog/blob/61cf6d8bd127be95d9d8a557c7ad049b743405e0/posthog/models/event/sql.py#L92 <- the sort key is what ClickHouse deduplicates by.
@tiina303 does that mean just explicitly setting uuid=some_id_here and timestamp=123123 when sending an event?
yes, and make sure the event and distinct_id aren't changing (which would be odd).
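Putting the advice above together for the App Store scenario from the original post, here is a rough sketch of what a Node backend could do. It assumes the `uuid` npm package for a deterministic UUIDv5, and that your posthog-node version accepts `uuid` and `timestamp` fields on `capture` (verify against your SDK version); the `notification` fields used below are simplified placeholders, not the real App Store payload shape.

const { PostHog } = require('posthog-node')
const { v5: uuidv5 } = require('uuid')

const posthog = new PostHog('YOUR_PROJECT_API_KEY', { host: 'https://app.posthog.com' })

// Any fixed, well-formed UUID works as a namespace; it only has to stay constant.
const EVENT_NAMESPACE = '9f2d4a60-1234-4bcd-8ef0-abcdef012345'

function captureRenewal(notification) {
  // Derive the event uuid deterministically from the App Store NotificationUUID,
  // so a retried webhook produces the same [timestamp, distinct_id, event, uuid]
  // sort key and is eventually collapsed into a single event.
  const eventUuid = uuidv5(notification.notificationUUID, EVENT_NAMESPACE)

  posthog.capture({
    distinctId: notification.userId,              // your stable user identifier (placeholder)
    event: 'user paid for subscription',
    uuid: eventUuid,
    timestamp: new Date(notification.signedDate), // pin to the source event's time, not "now"
    properties: { product_id: notification.productId },
  })
}

The point is that every field in the sort key is derived from the incoming notification, so a retried delivery sends exactly the same values.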
We need an option to force deduplication, for example by distinct_id.
@tiina303 Can you explain what "eventual deduplication" means? What happens if I generate an insight before that deduplication happens?
This is something we get from ClickHouse and there are no concrete guarantees here afaik. In practice we've seen this work well especially for events sent near the same time. If deduplication hasn't happened yet you'd see the count in insights include the dupe.
I see, my events are coming 12 hours apart. For context, I'm polling payments made every 12 hours, but in case there are ever scheduling hiccups, I'm getting them for the past 24 hours. Especially since the job has some jitter on it.
I know almost nothing about how PostHog works under the hood, so I have no idea how feasible this is, but if we could get real idempotency, i.e. events are never duplicated, or we at least knew how quickly they are deduplicated, that would be helpful.
As it stands, I'll need to store all this info in our own database and deduplicate it myself.
I did notice this item under the 2023 Q2 goals, any idea on the status of it? It sounds like exactly what I need: "Query time deduplication of events to cover until ClickHouse deduplication is complete"
@tiina303 you mention sending a UUID, but there is no concept of this in the Ruby SDK. Would `message_id` be what you're referring to? I've tried passing `message_id` as well as `uuid` into the `capture` method with no success.
After adding support to the Ruby library above, I was able to pass a UUID as long as it's formatted as a proper UUID, and it eventually deduplicated. I was wondering: if you fire the same event 5 times with different properties, what properties would be on the deduplicated event?
@jclusso this deduplication mechanism is designed to mitigate cases where the client submits the same event several times, either:
- due to unreliable connections (it is transparently implemented in our web and mobile SDKs)
- due to an event source that does not guarantee idempotency (webhook, kafka), and only if a valid UUID is available: not all 128-bit words are valid UUIDs.
To reply to your question about submitting the same UUID with different properties: we will not guarantee or document the behavior of our systems if that happens, as this use case is outside of our design parameters. Assuming that the timestamp, event name and UUID match, a single event will be kept after deduplication, but we won't guarantee which one will be chosen.
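A practical consequence of the "valid UUID" point above: an arbitrary idempotency key (an order ID, a raw hash, a ULID) can't be passed through as-is, it needs to be a well-formed RFC 4122 UUID before deduplication can apply. A minimal sketch of such a check, assuming Node:

const crypto = require('crypto')

// Rough check that a string is a well-formed RFC 4122 UUID; per the comment
// above, deduplication only applies when the uuid field actually is one.
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$/i

function isValidUuid(value) {
  return UUID_RE.test(value)
}

isValidUuid('d9428888-122b-11e1-b85c-61cd3cbb3210')                     // true
isValidUuid(crypto.createHash('md5').update('order-42').digest('hex')) // false: 32 hex chars, no dashes or version bits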
Thanks for that @xvello. I think it would be ideal if you improved your public documentation on this, rather than people having to dig through various issues to find out how deduplication works. There are many situations besides unreliable connections that can cause a system to trigger a duplicate event, and honestly it makes sense to always include a UUID from the client side if you care at all about idempotency.
Agreed @jclusso, we are currently discussing documentation changes to clarify the supported use cases, aligned with my latest comment. We also plan to provide automatic UUID generation on all SDKs that transparently queue and retry events, but cannot commit on a timeframe for this.
@xvello how would one specify the uuid using posthog-js (web)? Having a field like event_id that is used to automatically deduplicate events would be incredibly useful in either case. Otherwise we would have to create UUIDs solely so we can send them with each event. Something like Meta's Conversions API would be ideal and simple to implement.
Edit: Looks like it's not possible? In which case you can't really use it for any events where duplicates matter.
https://github.com/PostHog/posthog-js/blob/d3b17a2f436ecf49fd5addd3e4d6a3a6f54093c9/src/posthog-core.ts#L914
After testing, specifying the timestamp can solve the problem.
posthog.capture('xxx', params, { timestamp })
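Expanding that into a slightly fuller sketch: since the event uuid can't be set directly from posthog-js (per the link above), the idea is to pin the timestamp to the purchase's own completion time rather than to "now", so a page refresh replays identical values. `purchase` and its fields are hypothetical here, and note that per the earlier comments the sort key also includes the uuid the SDK attaches, so verify deduplication actually happens in your own setup:

// `purchase` is a hypothetical object holding the order data.
// Re-sending this event after a refresh replays the same event name,
// properties, and timestamp, instead of capturing a fresh Date.now().
posthog.capture(
  'user paid for subscription',
  { order_id: purchase.orderId, revenue: purchase.amount },
  { timestamp: new Date(purchase.completedAt) } // the same Date every time this purchase is sent
)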
I've provided the same uuid for duplicate events, but the events are not deduplicated and still appear in all analytics. I see in my data management tab that the uuid of the two events are the exact same. Is there anything else I need to do to enable deduplication?
Hey @hel-lo7, it might take a week or two for the de-duplication to kick in. We only de-duplicate events during non-peak hours such as weekends.