
Sync: Fix page by page query in sync, leading to duplicated entries

keymon opened this issue 1 year ago • 1 comment

Comes from https://github.com/common-fate/glide/pull/665

In this PR we:

  • Modify the DynamoDB queries in the sync lambda to iterate the query results page by page.

  • Provide commands in the dev-cli tool to get users and groups.

  • Provide a command line tool to delete duplicated users in DynamoDB.

Why?

Currently the sync process does not iterate the paginated responses from DynamoDB, so it reads at most around 2700 users (Active and Archived).

This can lead to a bug in the sync process where each sync creates duplicated entries in the DB with new IDs, and updates the groups to point to these new users. In the meantime, the affected users might still have access to Glide, but no longer be members of their groups.

In detail:

  1. Sync reads the first ~2700 user entries in DynamoDB, missing the rest
  2. Sync reads all IDP users
  3. Sync checks whether each IDP user already exists in the DB. Because the query was partial, users beyond the first page are treated as new.
  4. A new user entry is created, with a new ID and a new PK/SK
  5. Groups are updated to point to the new ID

Observed behaviour:

  • A random set of users loses access to any rule based on groups, although they can still be added individually.
  • The DynamoDB table grows without bound on each sync

How to trigger the issue:

  • Get more than 2700 users into DynamoDB, for instance by syncing >2700 users. Users can later be deactivated.

Workaround and remediation

In our case the table grew to >650k user entries. We were forced to create a CLI tool, included in this PR, to clean up DynamoDB.

This tool can run workers in parallel, but it does not handle failures/retries well. It is safe to rerun and might require multiple reruns.

How did you test it?

Tested manually. Unit tests mocking the DynamoDB library are still pending.

The CLI tool was also tested manually.

Potential risks

If somebody was experiencing this bug, they might have hundreds of thousands of user entries. That means the page-by-page query will take a long time to run, and it might cause the sync to exceed its allotted run time.

One workaround is to first run the tool to delete duplicates, but you might also want to extend the Lambda timeout.
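For reference, if the sync runs as a plain Lambda function, its timeout can be raised with the AWS CLI. The function name below is a placeholder, not Glide's actual resource name:

```shell
# Raise the sync Lambda's timeout to the 15-minute maximum (900 s).
# <your-sync-function-name> is a placeholder for the deployed function.
aws lambda update-function-configuration \
  --function-name <your-sync-function-name> \
  --timeout 900
```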

Is this a patch release candidate?

keymon, Jan 31 '24 11:01

We had to delete hundreds of thousands of entries using the CLI. Here is one example run with a single duplicate:

(screenshot omitted)

Fixed with:

./bin/commonfate ddb delete-duplicated-users --workers 20  --name Granted    | tee ~/tmp/delete-duplicated-users.report.$(date +%Y%m%d-%H%M%S).json > /dev/null 
Got 3041 users
Calculating duplicates...
Found 1 duplicated users
- [email protected]: 1
WARNING: Dry-run mode, skipping deletion...
./bin/commonfate ddb delete-duplicated-users --workers 20  --name Granted  --dry-run=false  | tee ~/tmp/delete-duplicated-users.report.$(date +%Y%m%d-%H%M%S).json > /dev/null 
Got 3041 users
Calculating duplicates...
Found 1 duplicated users
 - [email protected]: 1
Deleting duplicates at 2024-02-01T10:50:34Z using 20 workers...
Deleting 1 dups for [email protected] at 2024-02-01T10:50:34Z...
Done deleting dups for [email protected] Duration=173.408125ms...
Deletion complete at 2024-02-01T10:50:35Z. Duration 173.922ms...

keymon, Feb 01 '24 10:02