posthog icon indicating copy to clipboard operation
posthog copied to clipboard

Ingest events backwards

Open mariusandra opened this issue 3 years ago • 13 comments

Whenever we ingest events, we end up with entries in 3 tables: events, persons and person_distinct_ids.

Excluding plugins for now, if we ingest the same events again in the same order, we will (most likely) end up with the same entries in those tables.

However, I think we should also get exactly the same entries in those tables if we ingest events... backwards. If we do, it would make ingestion, retries and exports a lot simpler. For example, we could export multiple months in parallel from one posthog instance to another and know that all persons were transferred as they are.

This task touches two things:

  • Store person property updated_at timestamps... and then compare them before making any changes (in denormalized properties?). We need this anyway for $set_once issues, but this could also be added to $set (if a property has been set by an event in the future, don't reset it to something if this event is in the past).
  • Store some metadata on when users were merged and use that to prevent past changes from overriding future changes (e.g. somehow prevent renaming the new user to the old user)

The details and feasability of the above points need to be specced out.

mariusandra avatar May 10 '21 12:05 mariusandra

@tiina303

yakkomajuri avatar Sep 29 '21 16:09 yakkomajuri

A case for importance (being able to replay events later is already called out and we saw a need for that this week), but additionally network is unpredictable - sometimes a packet sent later will arrive earlier, sometimes an earlier packet fails and will be retried later.

For set we can just look at timestamps, however for set_once we can't as it shouldn't overwrite anything that was earlier, so imagine:

Timestamp | function | value
1         | set      |  1
2         | set_once |  2

Regardless of the order of arrival we want the value to be 1 in the end. If 2 is sent first, we need to know it was a set_once call that set it to know that we should overwrite it with the earlier value.

Furthermore this needs to also end up with val = 1 even if packet 1 is sent later.

Timestamp | function | value
1         | set_once |  1
2         | set_once |  2

I propose we keep an additional column set_once:bool that tells us if it was set by set_once, if it was and we see an earlier set or an earlier set_once, we need to overwrite it and update the timestamp to earlier.

Second tricky function is alias, currently we say we resolve conflicts by overwriting with the first distinct ID values if there's a conflict, however here if we get out of order events it gets tricky to resolve them, instead we could just rely on timestamps (latest one is used, well the same set/set_once applies as above). Then we can just look a the current values with timestamps & later coming values and we can just see this as one person and set/set_once already work in any order (I assume if we get a later set/set_once for someone who was aliased we can follow the redirect to the new person).

tiina303 avatar Sep 29 '21 16:09 tiina303

Hmm... how come we need to complicate this?

We give each property a timestamp. Then if $set comes in, we override if its timestamp is later than the property's timestamp. If so, we override. If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

Or what am I missing?

mariusandra avatar Sep 30 '21 15:09 mariusandra

If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

Imagine someone uses set_once and set both for the same key, we expect a value written by a later timestamped set not to get overwritten by an earlier timestamped set_once. A real world example of this could be we have some places, where we have weak evidence for user, where we want to only write those, if there wasn't a value before. For example imagine an App, where a user can set location in their profile but the app also uses the location based on where the user is visiting the site as the weak evidence. So Alice visited a site from London (which sends a set_once) and updated her location to be Dublin (which sends a set request), due to network problems the set_once('Alice', {'location':'London'}) request arrives later, but has an earlier timestamp (compared to the stored timestamp that the set request had). We don't want that to override the value, i.e. we want in the end to have 'location':'Dublin' regardless of which order the set & set_once requests arrive.

tiina303 avatar Sep 30 '21 17:09 tiina303

I'm still not following. If requests arrive later because of network issues, they should still contain the original timestamp, no? That's why we have all these now, sent_at and timestamp fields in the system.

mariusandra avatar Sep 30 '21 18:09 mariusandra

That's why we have all these now, sent_at and timestamp fields in the system.

I'm not actually sure how & where these three are set, I was making the assumption that we only have timestamp (which is based on what we want to either overwrite the value or not).

I'm still not following.

Sorry this can be confusing & maybe I'm misunderstanding something. You can check the table below, maybe that's more helpful.

To elaborate on my earlier example ... let me provide the exact calls I was thinking & what we'd want to happen. Also adding numbering to my statements, so it's easier to call out where it doesn't make sense anymore.

set_once('Alice',  {'location':'Dublin'}, timestamp="16:00")
set('Alice', {'location':'London'}, timestamp="16:01")

(1) For this in the end we want the outcome to be Alice.location == London (2) if the calls arrive in this order, it's simple we first set the location to be Dublin and later see the set call, which overrides it to London. (3) The above works based on your proposal

if $set comes in, we override if its timestamp is later than the property's timestamp.

(4) Now imagine network was bad and the requests arrive in the reverse order

set('Alice', {'location':'London'}, timestamp="16:01")
set_once('Alice',  {'location':'Dublin'}, timestamp="16:00")

(5) After having processed the first set call, we have Alice.location == London with the timestamp "16:01". (6) Now processing the set_once call. If we apply what you proposed ,

If $set_once comes in, we check if it has an earlier timestamp than the stored timestamp. If so, we override.

(7) then we see that set_once came in (8) we see that it does have an earlier timestamp compared to the stored one (9) so you say we should override (10) we override and end up with Alice.location == Dublin (11) but that's not what we wanted, we should get Alice.location = London despite that the requests arrived in this order.


Here's a table to explain when the override should happen. The example I provided above corresponds to row number 7. Based on rows 3&4 (also 7&8) differing depending on what the previous call that wrote the value was I can't think of a way to override correctly without that information.

  | call     | value exists  | call TS is ___ existing TS | previous fn | write/override 
 1| set      | no            | N/A                        | N/A         | yes
 2| set_once | no            | N/A                        | N/A         | yes
 3| set      | yes           | before                     | set         | no
 4| set      | yes           | before                     | set_once    | yes
 5| set      | yes           | after                      | set         | yes
 6| set      | yes           | after                      | set_once    | yes 
 7| set_once | yes           | before                     | set         | no 
 8| set_once | yes           | before                     | set_once    | yes
 9| set_once | yes           | after                      | set         | no
10| set_once | yes           | after                      | set_once    | no

tiina303 avatar Oct 01 '21 00:10 tiina303

Thanks for being patient and taking time to break this down. Now I get the issue and adding this column does make a lot of sense!

mariusandra avatar Oct 01 '21 08:10 mariusandra

This discussion (+ @tiina303's table particularly) was so useful

https://github.com/PostHog/posthog/pull/6259

yakkomajuri avatar Oct 05 '21 14:10 yakkomajuri

This issue has 1133 words. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

https://github.com/PostHog/posthog/issues/7454 Person Created At needs to also handle ingestion in any order

tiina303 avatar Dec 01 '21 21:12 tiina303

Curious what the status of this is? Anything I can to do help at all (e.g. manual testing, etc)?

marcushyett-ph avatar Dec 06 '21 13:12 marcushyett-ph

Potentially ... here's how I'm thinking about doing the testing: I have a DigitalOcean instance already (do-lon1-tiina-testing-2021-11-30-2054 with hostname/user/pass here: https://posthog.slack.com/archives/C01MM7VT7MG/p1638494244381100?thread_ts=1638494082.380800&cid=C01MM7VT7MG). The plan is:

  1. migrate the old way in-order into a new project from hogflix demo app - user props data should match hogflix, but wanted to use this as the base to avoid confusion.
  2. migrate the old way backwards into a new project (haven't fully figured this out, but thinking of edit/fork migrator plugin) - user props data shouldn't match, verify seeing some differences
  3. update the instance to use an image on top of https://github.com/PostHog/posthog/pull/7109 (needs some changes) on top of https://github.com/PostHog/posthog/pull/7472 & set NEW_PERSON_PROPERTIES_UPDATE_ENABLED_TEAMS to be all the appropriate project IDs, my network was too bad for image push 🙄
  4. migrate new way in-order into a new project - user props data should probably match with (1), but could not
  5. migrate new way reverse order into a new project - user props data should probably match with (3)

I suspect we'll see some differences that we'll need to ignore, e.g. maybe geoIP or maybe turning that off before is better.

tiina303 avatar Dec 06 '21 23:12 tiina303

Relevant recent conversation (tied to person properties updates): https://posthog.slack.com/archives/C0374DA782U/p1660840149534569

We have properties_last_operation and properties_last_updated_at but they aren't always updated properly. Given how our js lib sends browser etc every time we would increase the number of writes to postgres if we updated those fields (specifically timestamp for properties_last_updated_at) properly.

tiina303 avatar Sep 20 '22 12:09 tiina303