posthog icon indicating copy to clipboard operation
posthog copied to clipboard

Alias to not merge two identified persons

Open tiina303 opened this issue 1 year ago • 4 comments

Is your feature request related to a problem?

PostHog users sometimes accidentally call $alias with two identified persons (two persons who have multiple distinct ids tied to them already). For example when logging in as someone else isn't properly gated and they use $alias not $identify in their code. Similarly $identify is sometimes accidentally used with two identified persons, in this case we already have a safeguard in place and we won't merge those users. However our safeguard here is potentially too strict in some cases.

Current world

(1) Identify safeguard:

identify('Alice', anon_id = 'anon_alice_1')    # ok
identify('Alice', anon_id = 'anon_alice_2')    # ok
identify('Bob', anon_id = 'anon_bob_1')        # ok
identify('Alice', anon_id = 'anon_bob_1')      # not merged

(2) Identify restricts any calls after identify usage:

identify('Alice_backend', anon_id = null, properties = {'email' = '[email protected]'})   # ok
identify('Alice', anon_id = 'Alice_backend')                                            # not merged

^ note that this is because we mark the user as identified when identify is called and that's what the safeguard checks

(3) Alias not safeguarded (1):

identify('Alice', anon_id = 'anon_alice_1')    # ok
identify('Bob', anon_id = 'anon_bob_1')        # ok
alias('Alice', 'anon_bob_1')                   # ok

(4) Alias not restricted:

identify('Alice_backend', anon_id = null, properties = {'email' = '[email protected]'})   # ok
alias('Alice', 'Alice_backend')                                                         # ok

(5) Aliasing first doesn't restrict identifying later:

alias('Bob', 'Bob_backend')                            # ok
identify('Alice', anon_id = 'Bob_backend')             # ok

Usage of merging or trying to merge two already identified users on cloud (note that we don't have info about how many already merged users are being merged later yet as this is based on the current definition of identified): https://metrics.posthog.com/d/CD1kWjLnz/plugin-server-new?viewPanel=115

Describe the solution you'd like

identified means the id has already been merged (or more simply the person has >1 distinct_ids tied to it). $alias and $identify have the same safeguard in place against merging already merged persons.

Proposed future world

Looking at the same cases: (1) Identify safeguard - nothing changes (2) Identify without a merge now won't restricts future merges:

identify('Alice_backend', anon_id = null, properties = {'email' = '[email protected]'})   # ok
identify('Alice', anon_id = 'Alice_backend')                                            # ok

^ we will allow the merge: Alice_backend hasn't been merged into any users, so it can be merged into Alice (3) Alias is now safeguarded too:

identify('Alice', anon_id = 'anon_alice_1')    # ok
identify('Bob', anon_id = 'anon_bob_1')        # ok
alias('Alice', 'anon_bob_1')                   # not merged

(4) Alias not restricted - nothing changes (5) Aliasing first restricts identifying later:

alias('Bob', 'Bob_backend')                            # ok
identify('Alice', anon_id = 'Bob_backend')             # not merged

^ this is now safeguarded too, since Bob_backend was already merged into Bob we can't merge it into Alice.

Steps

  • [x] Update docs about the future of $alias
  • [ ] Update code to make identified mean the person has >1 distinct_id tied to it
    • [ ] Check that we don't have users expecting (5) to work the current way
  • [ ] Highlight events (& persons & on the project level?) in the UI where we didn't do the merge and where we wouldn't do a merge in the future with context & where to reach us
  • [ ] Async Migration to backfill identified based on the new meaning (the person has >1 distinct_id tied to it).
  • [ ] Update code for$alias to have the same safeguard in place as $identify against merging two identified persons
    • [ ] Check usage, reach out to users, ... hopefully they have already seen the highlights
  • [ ] Optional: Update code to always merge into the identified person for properties (that seems like a safer default, though if neither of them are identified we still need to use the order so 🤷‍♀️ ).
  • [ ] Clarify the merge persons UI option - it's just alias & we merge into the identified person for properties.
  • [ ] Update docs to clear up any leftovers from the past

Describe alternatives you've considered

Instead of reusing identified for this we could

  1. create a separate field - we don't use identified for anything else so it would just be a no-op afterwards (why I opted for re-purpose)
  2. just look up if there's more than 1 distinct_id tied to this person already - this would be slower, but could be safer.

We could keep identified = true for any identify calls, the main risk here is if events arrive out of order and cause case (2) above.

Additional context

We might need an ability to merge already identified persons in the future, e.g. when someone messed up and doesn't care about past events, but wants the future events to be tied to the right person. An alternative here is for them to delete the person who's past events they don't care about and they can then use the distinct_id freely again as it's not tied to any identified persons anymore.

Related:

  1. https://docs.google.com/document/d/1Pssujl2zbmAW-gvln5HpcFStraoQjIaYjEkqQ911j48/edit
  2. https://posthog.slack.com/archives/C0374DA782U/p1661950576073009?thread_ts=1661854584.261829&cid=C0374DA782U

Thank you for your feature request – we love each and every one!

tiina303 avatar Sep 19 '22 17:09 tiina303

Our docs look pretty good, with regards to alias, we don't really say much, later we'll likely want to update https://posthog.com/docs/integrate/identifying-users to explain the highlighting and that it applies to alias calls too.

tiina303 avatar Sep 20 '22 12:09 tiina303

Most of this makes sense.

Curious why we need an async migration to backfill this though? Can't we infer if the user is anonymous or not?

yakkomajuri avatar Sep 20 '22 15:09 yakkomajuri

Also RE the UI merge: how many people actually use this? And are they big teams? Seems like something hacky we built early on to cover some bugs we may have had and something that only a small team (with true oversight over their data in a granular way) would use it

yakkomajuri avatar Sep 20 '22 15:09 yakkomajuri

Curious why we need an async migration to backfill this though?

The problem is that alias usage didn't mark persons as identified, so we'd have lots of persons who have identified set as false but have many ids tied to them (i.e. we shouldn't allow merging/aliasing them with anyone of many ids)

Can't we infer if the user is anonymous or not?

Two problems:

  1. we don't know for mobile libraries if that was an anonymous user or not (why we turned the buffer off for mobile libs)
  2. shouldn't we be able to merge non-anonymous users (the ones who haven't been merged with any other users)?

Also RE the UI merge: how many people actually use this? And are they big teams? Seems like something hacky we built early on to cover some bugs we may have had and something that only a small team (with true oversight over their data in a granular way) would use it

Looks like it's not used a lot https://app.posthog.com/insights/7UDu0M8t but split person is used by some teams, there's also no other way of splitting persons that I know of. We can remove the merge person option from there.

tiina303 avatar Sep 20 '22 16:09 tiina303

identified means the id has already been merged (or more simply the person has >1 distinct_ids tied to it). $alias and $identify have the same safeguard in place against merging already merged persons.

I assume this is about the is_identified column. This column is not used anywhere in the analytics pipeline besides writing it and not used anywhere in the frontend logic - we should not be doing anything around it for this change.

macobo avatar Sep 23 '22 13:09 macobo

I assume this is about the is_identified column. This column is not used anywhere in the analytics pipeline besides writing it and not used anywhere in the frontend logic - we should not be doing anything around it for this change.

Yes I'm referring to that column and it's used only (afaik not anywhere else) in the code safeguarding not merging already identified people (code). My proposal is we extend that column's purpose to be able to safeguard with alias calls too.

tiina303 avatar Sep 23 '22 14:09 tiina303

More details about user communication side of the proposal (per @yakkomajuri 's request):

1. Highlighting in the UI when users didn't get merged

Currently we would show the identified case and there are already users who run into this, but we're not communicating this to the users in any way. Note that I updated the panel to include identify too https://metrics.posthog.com/d/CD1kWjLnz/plugin-server-new?viewPanel=115. As we implement the alias restriction we would show those warnings too (furthermore we can show the warning already for alias calls that it would not be merged in the future).

How this would be implemented - I propose we add a "$warnings = []" property to the event where we store sth like "We didn't merge the distinct IDs because ...", is this property is a list we can use it for other purposes too, e.g. someone is sending "$set: initial_referrer" from the posthog-js libary still because they are using an old version & we can suggest upgrading or other kinds of suboptimal PostHog usage.

In the UI I propose we'd show a ⚠️ icon next to those events and the warnings view if any warnings exist, sth like (events view on live events or person->events, etc) Screenshot 2022-09-23 at 16 12 23

Persons page and project level we could also show a warning somehow, but I'd initially just stick with events view only.

2. Reach out to users who are going to be impacted significantly

2.1. cloud: changing the meaning of is_identified will start to restrict identify merges after alias calls

There are only 87 teams who use both alias and identify over the past week (111 over past month) https://metabase.posthog.net/question/445-mixed-usage-of-identify-and-alias and hence are potentially impacted by this change. Note that our team 2 is an example of where we use both identify and alias for the same person. That's not that many.

  1. Roll-out the change via a feature-flag for everyone else
  2. Talk to the impacted teams and continue the roll-out.

2.2 cloud: restricting alias calls too

Before this we've highlighted in the UI and done the backfill on cloud. So impacted users would see that in the events view. Reach out to the users directly who would be impacted significantly based on https://metrics.posthog.com/d/CD1kWjLnz/plugin-server-new?viewPanel=115

3. Self-hosted

We'll have an async migration to backfill is_identified, where first step in that migration enables the roll-out (i.e. 2.1 and 2.2 above). The release email would explain what the async migration does and its consequences.

tiina303 avatar Sep 23 '22 15:09 tiina303

The proposal is slightly confusing since it's using the wrong names for things. $alias should be $create_alias, identified should be is_identified etc.

Reviewing the proposal:

1.

(2) Identify without a merge now won't restricts future merges:

While this sound reasonable why do this change? I believe it makes the project trickier to execute and the outcome more confusing to communicate.

We already have a bunch of values in posthog_person.is_identified column already that would be made wrong after this change. Your proposed async migration might take over a week to develop and test and to make reliable which seems like 90% of the complexity of this project suddently.

I propose we do the simple thing instead.

Conclusion

I feel we're over-complicating this by changing how identify behaves, but other than that the approach seems sensible.

Let's spend this sync arriving at a point where you feel comfortable starting to code this up. Ideally the scope would be to remove an if-branch, update UI and reach out/update users and docs.

macobo avatar Sep 26 '22 10:09 macobo

(2) Identify without a merge now won't restricts future merges:

This was a side-effect of my proposal and is just something I wanted to call out explicitly that it would change. The real goal for changing is_identified was so these would work properly:

(3) Alias is now safeguarded too: (5) Aliasing first restricts identifying later:

Happy to talk about this in the sync

tiina303 avatar Sep 26 '22 11:09 tiina303

Discussed in sync:

  1. Should we have do the async migration or not: how many persons this would impact (i.e. how many people have more than a single distinct_id and is_identified=false) - is that worth the work?
  2. the UI - send $$warning event and show that in the events table in a separate tab.

tiina303 avatar Sep 26 '22 14:09 tiina303

Should we have do the async migration or not: how many persons this would impact (i.e. how many people have more than a single distinct_id and is_identified=false) - is that worth the work?

There's quite a lot of teams with quite a lot of persons (limited the team_ids due to hitting the memory limit) https://metabase.posthog.net/question/446-is-identified-false-and-multiple-distinct-ids

Thoughts on where to draw the line it's worth the effort vs not?

tiina303 avatar Sep 26 '22 14:09 tiina303

Looked up impacted users from https://metrics.posthog.com/d/CD1kWjLnz/plugin-server-new?orgId=1&viewPanel=115&from=now-7d&to=now (anyone who had oldPerson.is_identified = true during alias calls).

  • Organization name lookup: https://metabase.posthog.net/question/457-alias-restriction-impacted-users
  • excel with everything together https://docs.google.com/spreadsheets/d/1Iwn7lzOayaG5mub8rlwxZ5azbMXknfNU0BEGyXJACQc

Draft for the email to send to impacted users https://docs.google.com/document/d/1qbd52Mmf8DgOPbFB4bs2vAHNi1gSPNw2kGxPD4Khewk

tiina303 avatar Oct 12 '22 13:10 tiina303

We need to review all libraries here - some examples:

  • https://github.com/PostHog/posthog.com/pull/4643 python
  • https://github.com/PostHog/posthog-ruby/pull/14 new-id is likely identified user id, so that should be provided as the distinct_id not alias

tiina303 avatar Nov 11 '22 17:11 tiina303