clients icon indicating copy to clipboard operation
clients copied to clipboard

[PM-24630] feature: de-duplication tool with test coverage

Open harr1424 opened this issue 5 months ago â€ĸ 21 comments

đŸŽŸī¸ Tracking

https://github.com/orgs/bitwarden/discussions/15942

https://community.bitwarden.com/t/duplicate-removal-tool-report-including-merge/648

https://bitwarden.atlassian.net/browse/PM-24630

📔 Objective

This PR includes a service to find duplicated ciphers in a single user's vault and Ui components to display these duplicates for review and allow the user to delete them. Duplicates are identified by identical item names (or matching canonical names) sharing the same username, folder, and organization. Such duplicates are common when importing credentials from multiple sources into a Bitwarden vault, and as the community issue above demonstrates, this is a feature that the user community has been asking for.

📸 Screenshots

de-duplicate-component duplicate-review-dialog duplicate-success-dialog second-run-duplicate-review-dialog second-run-duplicate-success-dialog

🌱 Next steps:

  • I will reach out to Crowdin project owners to determine the best way to integrate i18n and l10n in my contribution. I am a first-time contributor, and my current understanding after reading documentation is that the project owner will need to add keys for the translated strings to the web project. Please reach out to correct me if I am wrong regarding the workflow here.

⏰ Reminders before review

  • Contributor guidelines followed YES
  • All formatters and local linters executed and passed YES
  • Written new unit and / or integration tests where applicable YES, see below: clients/apps/web/src/app/vault/services/de-duplicate.service.spec.ts
  • Protected functional changes with optionality (feature flags) NO
  • Used internationalization (i18n) for all UI strings NO, I need to research this further
  • CI builds passed N/A
  • Communicated to DevOps any deployment requirements NO
  • Updated any necessary documentation (Confluence, contributing docs) or informed the documentation team NO

đŸĻŽ Reviewer guidelines

  • 👍 (:+1:) or similar for great changes
  • 📝 (:memo:) or â„šī¸ (:information_source:) for notes or general info
  • ❓ (:question:) for questions
  • 🤔 (:thinking:) or 💭 (:thought_balloon:) for more open inquiry that's not quite a confirmed issue and could potentially benefit from discussion
  • 🎨 (:art:) for suggestions / improvements
  • ❌ (:x:) or âš ī¸ (:warning:) for more significant problems or concerns needing attention
  • 🌱 (:seedling:) or â™ģī¸ (:recycle:) for future improvements or indications of technical debt
  • ⛏ (:pick:) for minor or nitpick changes

harr1424 avatar Aug 09 '25 14:08 harr1424

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Aug 09 '25 14:08 CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Aug 09 '25 14:08 CLAassistant

Thank you for your contribution! We've added this to our internal Community PR board for review. ID: PM-24630 Link: https://bitwarden.atlassian.net/browse/PM-24630

Details on our contribution process can be found here: https://contributing.bitwarden.com/contributing/pull-requests/community-pr-process.

bitwarden-bot avatar Aug 09 '25 14:08 bitwarden-bot

Hi @harr1424! Thank you for drafting support for a deduplication tool!

👍đŸģ I'm really excited to help you get your first PR merged! This PR seems to have required a substantial amount of work already, and it's going to take more to be ready to integrate into our systems. It's an exciting thing, making a contribution and then seeing your work in the live product. I promise that seeing your PR through is part of that process, and collaboration makes it all the more worthwhile!

📝 I'm going to switch it to a draft while you and I align this code with our current practices (i18n first!). It'll also give our project teams time to review your feature, assess its UX and accessibility, and determine whether this is a tool or part of @bitwarden/team-data-insights-and-reporting. (cc: @ttalty)

🌱 In the meantime, we can continue improving the code! You mentioned i18n as a primary concern. Here's an example of how you can add that support to your code:

There's no need to reach out to our i18n leads or translators. We'll handle that for you. 😄

🎨 You can also take a look at the following pages in our contrib guidelines and make sure that your code conforms to them:

📝 And these pages from our component guide may also help:

audreyality avatar Aug 11 '25 12:08 audreyality

Logo Checkmarx One – Scan Summary & Details – 1f6f0f87-936a-4108-ba5f-2a429f350763

Great job! No new security vulnerabilities introduced in this pull request

github-actions[bot] avatar Aug 11 '25 12:08 github-actions[bot]

@audreyality Thank you for the warm welcome! I've really enjoyed working on this so far.

I appreciate your detailed instructions regarding the next steps. I will plan on integrating i18n as you described, and ensure that my code adheres to project conventions and the referenced style guides.

I will comment back once I believe this has all been completed, and is present in a fresh commit.

Thanks again for your kindness and eagerness to provide some orientation 😀

harr1424 avatar Aug 11 '25 20:08 harr1424

Progress checkpoint

  • Feature is working well and has seen improvements in duplicate detection logic and UI rendering time (my initial approach at displaying duplicates inside of a Dialog Component was very slow for hundreds to thousands of duplicates)
  • Virtual scrolling, TableDataSource, and bitSortable have all been implemented on the Duplicate Review Dialog
  • Translation keys have been added to the EN locale and no hard-coded strings remain

Remaining work items:

  • Continue to evaluate my contribution against contributor guidelines
  • Explore alternative toasts for the situation where no duplicates are found, as the current toast seems out of place when shown top right corner - explore similar situations in the project and adhere to conventions
  • Decide if progress spinner should be adjusted from full-page overlay to being contained inside of the triggering button, similar to the import tool, for consistency
  • The current way of displaying duplicates is not at all visually pleasing, explore creating a card style view of each duplicate group, and displaying these cards inside of a scrolling dialog

harr1424 avatar Aug 14 '25 13:08 harr1424

Explore alternative toasts for the situation where no duplicates are found, as the current t oast seems out of place when shown top right corner...

That's the correct place for the toast. We used to have it at the bottom of the window, but that blocks navigation so we moved it.

Decide if progress spinner should be adjusted from full-page overlay to being contained inside of the triggering button, similar to the import tool, for consistency

The current way of displaying duplicates is not at all visually pleasing, explore creating a card style view of each duplicate group, and displaying these cards inside of a scrolling dialog

@sukhleenb - Could you comment on this?

@harr1424 - Feel free to refine your work, but be wary of overinvestment. We'll likely iterate the design after it's merged.

I will comment back once I believe this has all been completed, and is present in a fresh commit.

There's a re-review button in the upper-right, between my name and the review state. You can use that to request a review instead of commenting.

audreyality avatar Aug 15 '25 18:08 audreyality

Update: re-submission for review

✅ i18n fully integrated, no hard-coded strings remain ✅ Code style guidelines have been reviewed and adhered to ✅ UI uses elements from BW Component Library when available

👀 Some UI elements have been changed from my initial screenshots:

  • I've replaced the Toast usages with Callouts as this was more consistent with UIs in the Reports section. It would be very easy to swap back to using toasts, though, if this is preferred in the tool section.
  • I've also replaced the success dialog, used to describe how many logins were moved to trash, and how many were permanently deleted, with a Callout, again for consistency with the reporting tools. I've saved my success dialog files in case they are preferred, the import tool uses a dialog to convey similar information.

🎨 I am very open to making any suggested changes 🎨 Easy to switch back to using toasts and dialogs if preferred

🌱 The Duplicate Review Dialog displays rows much wider than the dialog box itself, and so considerable horizontal scrolling is necessary 🌱 This same dialog does not display consistent column widths 🌱 Explore formatting duplicate ciphers as cards, and then place these cards into a vertical scroll

Initial page state: initial_page_state

Progress spinner inside button, similar to reporting tools: progress_spinner

If a user decides to cancel the operation: cancellation_callout

The initial pass will detect duplicates (which in the example below are not currently in the trash, but could be): initial_detection

First time duplicate deletion moves items to trash: de_dup_complete

Duplicate Review Dialog (which in this example shows logins moved to trash in previous 'pass'): dup_review

Second 'pass' over duplicates will delete permanently: permanent_delete

If no duplicates are found (migrated from toast to callout): no_dups

harr1424 avatar Aug 15 '25 22:08 harr1424

Sorry, just a nosy customer... exciting development! I wanted to throw in two things:

  1. As I also was nosy enough to have a look into this discussion - and I don't see it in the screenshots - I wonder if the fact that login items can have multiple URIs is now taken into account here?

  2. One of your new screenshots @harr1424 -- namely the "Duplicate Review Dialog"-- shows selected items as "(in trash)", which probably shall indicate, that they get deleted, when you click "Delete". But I would raise the point, if it a) is enough instruction there at the beginning, if the user shall select items to DELETE or shall select items to KEEP - and b) I'm not sure, if it would be more "natural" to select the items the user wants to KEEP, and consequently to have a "Keep" button (for the selection)?!

And maybe a third point I didn't think of before:

  1. As it's a de-duplication tool - and not a merge of items - I hope one can a have a look into the complete item data, including notes, custom fields, TOTP/integrated authenticator, passkeys etc. to make sure, that no data gets deleted "inadvertently" with items that get deleted (when the user doesn't see, that there is more data in those items...).

pamperer562580892423 avatar Aug 15 '25 23:08 pamperer562580892423

👋 Hi @pamperer562580892423

Thanks for your feedback! I've responded to your questions below, please feel free to follow up with anything that might still be unclear.

Questions & Answers

I wonder if the fact that login items can have multiple URIs is now taken into account here?

For each login cipher, potential duplicates will determined by combining the login username with every URI associated with that cipher. This means that if a cipher has three URIs, three username+uri combinations will be added to a duplicate "bucket." As the bucket grows, each addition to the bucket will be checked to determine if that username+uri combination is already in the bucket. If it does exist, that is a duplicate, and that cipher will be added to a list of duplicate ciphers.[^1]

For example, consider the following login ciphers:

"1_TEST": {"username": "tester", "uris": ["test.domain.org", "forum.test.domain.org", "forum.test.domain.org/login"] }

"2_TEST": {"username": "tester", "uris": ["test.domain.org"] }

"3_TEST": {"username": "tester", "uris": ["forum.test.domain.org"] }

The example above is instructive to better understand how URI normalization has been used in this feature: URIs are normalized such that only subdomain, domain, and top-level domain are retained. So the above set of ciphers (5 URIs total, 3 of which are unique), only yeilds 2 normalized ciphers.[^2]

The screenshot below demonstrates that the duplicate review dialog presents "sets" of duplicate logins, which correspond to the buckets described earlier. Since the username+uri combinations resulting from URI normalization yielded 2 buckets, two sets will be presented to the user.

Because 3 ciphers were present in these 2 sets, the user is informed that 3 duplicates were found.

uri_logic

In situations like this, 1_TEST should collapse to one row in the UI presentation, since it is the same cipher, this has been noted.

In summary, multiple URIs are considered here, but they do complicate the deletion process, and highlight the need for detailed information presentation to the user. What do you think of this approach? Would you prefer that ciphers having certain URI characteriztics (more normalized versus more specific) be preferred for retention or deletion? Any input is appreciated.

One of your new screenshots @harr1424 -- namely the "Duplicate Review Dialog"-- shows selected items as "(in trash)", which probably shall indicate, that they get deleted, when you click "Delete". But I would raise the point, if it a) is enough instruction there at the beginning, if the user shall select items to DELETE or shall select items to KEEP - and b) I'm not sure, if it would be more "natural" to select the items the user wants to KEEP, and consequently to have a "Keep" button (for the selection)?!

My apologies, but for brevity I only included the duplicate review dialog being displayed after an initial 'pass,' which moved the same duoplicates to the trash. On the initial pass, ciphers would not be indicated as being present in the vault trash unless the user had already moved them there.

I think that the dialog presenting ciphers that, when checked will be deleted, is reasonable here. It seems that reversing the logic from delete to keep and changing the UI elements accordingly might be more of a personal preference, and one that could vary widely among the application's audience.

As it's a de-duplication tool - and not a merge of items - I hope one can a have a look into the complete item data, including notes, custom fields, TOTP/integrated authenticator, passkeys etc. to make sure, that no data gets deleted "inadvertently" with items that get deleted (when the user doesn't see, that there is more data in those items...).

I agree wholeheartedly, and will be exploring ways to display ciphers using a card view, with much more information. The duplicate review dialog currently shows the following columns: Name, Username, Website (URI), Folder, and Organization. When moving to a card view, I think it is reasonable to add in more complete item data, or at a minimum indicate that it is present. This would allow users to ensure they retain the login having TOTP or other desired fields.

Again, the initial use or 'pass' of this tool moves ciphers selected for deletion to the trash. From there they can be easily recovered. The user can decide to empty their vault trash, or else run the tool a second time to permanently delete the ciphers.

Summary

In the future, a merge tool may be more appropriate in order to address several of your concerns. For now, this tool works very well with the most common use case: ciphers sharing a single username and single URI. I had hundreds of these after importing credentials from my browser. Care has been taken to support other common causes of duplicates I encountered in my personal vault:

  • login ciphers sharing normalized names (described in the original discussion you linked) where one example could be names myspace.com and myspace.com (Tom) which both normalize to myspace.com (Important this detection logic is only used when usernames match as well)
  • login ciphers lacking a username but having identical Name values. In my case these were erroneously created by my browser when using certain authentication services (Okta), they had no passwords, and each had a different URI. I found it useful for this tool to detect and delete these.

My personal vault will not generalize perfectly to every user and every organization, which is why discussions like this one are so important. Thank you for taking the time to review my proposed changes and submit your feedback. If you have any suggestions, they will be warmly received.

[^1]: This has an important effect which may need to be considered, the ordering of duplicates in the duplicate review dialog is determined by the order in which the duplicates were encountered in the vault. This in turn determines which ciphers are automatically selected/suggested for deletion. As you noted, thorough presentation of cipher data is essential so that the user can retain or delete the ciphers based on their preference. Notice in my screenshot that the cipher with the normalized URI is suggested for deletion, which may be undesirable.

[^2]: forum.test.domain.org/login > forum.test.domain.org and test.domain.org

harr1424 avatar Aug 16 '25 03:08 harr1424

@harr1424 Wow, thanks for that detailed response! I'm not sure if I got everything right (I'm not a developer), but here are some thoughts from my side:

Would you prefer that ciphers having certain URI characteriztics (more normalized versus more specific) be preferred for retention or deletion?

If I understand that question correctly: I think, since we even have a URI match detection "exact", some people do rely on the exact URI, so I'm a bit sceptic about using "normalized" versions (and therefore potentially losing an important part of the data?). But I may be misunderstanding your approach.

Questions to the screenshot in your response to me: your first "bucket" shows the 1_TEST item two times. Is that an error or did I miss a point? 😅 And second question: if you chose 1_TEST in the first bucket for deletion, then 1_TEST would be auto-selected/marked in your second bucket (and possible other buckets) as well? --> And maybe follow-up question to that: if that happened further down the list... when you select or de-select an item for deletion - could it happen, that "out of sight" in another place of the whole list, the selection of the identical item would change without the user being aware of that?

--> If it is not already in place, I think there should be an info box, if a selection possibly influences other "buckets"

I think that the dialog presenting ciphers that, when checked will be deleted, is reasonable here. It seems that reversing the logic from delete to keep and changing the UI elements accordingly might be more of a personal preference, and one that could vary widely among the application's audience.

Yeah, I understand. Just three things to that: yesterday, I initially thought, it might be more "intuitive" to select the items you want to keep - but I don't know... So, second point: yeah, if there was a "convention" on doing that, I would recommend to follow that. And third point: I think my main point even was, that at the beginning of your de-duplication process, there should be absolute clear instructions what the user should do, which items they should select (the ones they want to keep v. the ones that should be deleted), what the selection "means" in the end (keeping v. deletion) etc. Or put in other words: It would be very frustrating for users, to work through the whole list, just to discover they selected the items wrong and/or just click and delete the wrong ones.

--> Idea: maybe an "invert all selections" button (besides the final Keep / Delete / Cancel buttons) could be a solution for that unfortunate case

I agree wholeheartedly, and will be exploring ways to display ciphers using a card view, with much more information. The duplicate review dialog currently shows the following columns: Name, Username, Website (URI), Folder, and Organization. When moving to a card view, I think it is reasonable to add in more complete item data, or at a minimum indicate that it is present. This would allow users to ensure they retain the login having TOTP or other desired fields.

Sounds good!

I had an idea to that too: instead of listing everything of every item, another possibility could be to just list the differences of the items (besides the "core" like item name, username, URIs) ?!

Speaking of URIs: in Bitwarden, the URIs can have a customized order. I think, to compare different items in your "buckets", if it was possible, it would be easier, if the URIs of an item would be shown in some kind of alphabetical order, instead of the stored customized order. (just an idea - if it isn't possible, don't pursue it)

And BTW, speaking of URIs, usernames etc., a general question: does your de-duplication tool "only" process login items? Or does it also include the other Bitwarden item types:

  • cards
  • identities
  • (secure) notes
  • SSH keys

?

Again, the initial use or 'pass' of this tool moves ciphers selected for deletion to the trash. From there they can be easily recovered. The user can decide to empty their vault trash, or else run the tool a second time to permanently delete the ciphers.

Speaking of the trash... In your screenshot, selected items are marked with:

(In Trash)

That sounds a bit like "they are already in trash" and may be a bit ambivalent. I would suggest to reflect more the action that will happen / result, so like:

(Move to Trash) or (will be moved to trash)

Finally, another thought that came to me: it certainly is not your intended use of your toll, but I can imagine some "scared" users, who a sceptic of an automatic deletion of some of their vault items - and who maybe want to do it manually. Your tool can detect duplicates - and it would be a nice additional feature of your tool, if it (also) just could mark duplicates (instead of processing them). Unfortunately, we don't have tags in Bitwarden, which would marking duplicate items very easy, I guess. Possible workarounds could be, to add a "DUPLICATE" to the item name, or maybe create a new custom field in each item with "DUPLICATE" (or whatever else) in it, and what would make it possible to find and process the duplicate items manually (and way more easy than now, where one has to even search manually for duplicates).

Also just an idea...

My personal vault will not generalize perfectly to every user and every organization, which is why discussions like this one are so important.

👍 Very good! Thanks for that discussion! - And I don't want to open a whole new and large can, but if you don't know it already, here is the corresponding feature request on the Community Forum: https://community.bitwarden.com/t/duplicate-removal-tool-report-including-merge/648 If you look through it, there might be other ideas and user feedback. (but maybe better use the time, to develope your tool further 😅 - those large threads are absolutely valuable, but there might be so much ideas and requests in it, it can overwhelm you and paralyze you for the next months đŸ¤Ŗ)

pamperer562580892423 avatar Aug 16 '25 14:08 pamperer562580892423

👋 Hello @pamperer562580892423 !

Questions & Answers (pt. 2)

If I understand that question correctly: I think, since we even have a URI match detection "exact", some people do rely on the exact URI, so I'm a bit sceptic about using "normalized" versions (and therefore potentially losing an important part of the data?). But I may be misunderstanding your approach.

Yes I was thinking the same, which is why normalization is only used for detection logic. At first I was considering adding logic to automatically select ciphers for deletion based upon URI characteristics, but this seemed out of scope for an initial de-duplication tool. It would take a lot of thought and careful testing to ensure logic like this respected existing the existing functionality you mentioned.

Questions to the screenshot in your response to me: your first "bucket" shows the 1_TEST item two times. Is that an error or did I miss a point? 😅

This was absolutely an error, and I've just patched it. Now, duplicates won't be displayed redundantly. Your question allowed this to surface earlier than it might have otherwise, thank you!

And second question: if you chose 1_TEST in the first bucket for deletion, then 1_TEST would be auto-selected/marked in your second bucket (and possible other buckets) as well? -->

I understand your concern: unselecting 1_TEST in one grouping will unselect it from all groupings, and vice-versa.

And maybe follow-up question to that: if that happened further down the list... when you select or de-select an item for deletion - could it happen, that "out of sight" in another place of the whole list, the selection of the identical item would change without the user being aware of that? f it is not already in place, I think there should be an info box, if a selection possibly influences other "buckets"

Yes it would happen that if 1_TEST appeared in multiple groupings, some of which were not in view, that selecting or unselecting this cipher in one grouping would update its retention in the other groupings.

In another point, you suggested providing very clear, detailed instructions to the user when the tool page first appears. I think that describing this behavior at that point would allow for a cleaner UX than adding an info box here. Please see this footnote describing my opinion on making design changes at this stage.[^1]

does your de-duplication tool "only" process login items? Or does it also include the other Bitwarden item types

Currently only login types are included. Cards, identities, notes, and SSH keys have a different structure, and would require different matching logic. It is very possible to add these items into a future iteration of this feature. For now, I wanted to focus my efforts on developing a solid tool to find and remove duplicated login items, since these seem to be the most common and numerous duplicated cipher type after importing data into a vault.

Speaking of the trash... In your screenshot, selected items are marked with: (In Trash)

I think there is still a misunderstanding here, and will edit my earlier comment to display the entire workflow. My apologies for not doing this originally. The descriptor '(In Trash)' is only used to indicate to the user that a cipher is in the trash. It could be that they previously used this tool which moved the cipher(s) to the trash, or they could have deleted that cipher themselves.

It would be a nice additional feature of your tool, if it (also) just could mark duplicates (instead of processing them)

For now, I think this is outside of the scope of this feature, but any thoughts or suggestions are always appreciated. If any of the design changes[^1] or any other functionality you have suggested is supported by a large part of the application audience, I would be happy to explore ways of making these improvements.

[^1]: Please understand I am not a repository owner, and have contributed this feature in order that it's core functionality may be introduced into the application. Repository owners have cautioned against over-investing time and effort at this stage, particularly with respect to design elements, since these will likely be evaluated and changed throughout the the release process. That being said, I appreciate your suggestions and your participation in reviewing my contribution. I am very excited to see the community shape this feature!

harr1424 avatar Aug 16 '25 14:08 harr1424

@harr1424 - please refrain from force-pushing changes unless absolutely necessary. Altering the commit history makes it difficult to verify the changes you've made since the last review. We squash merge, so you don't need to worry about your PR history appearing in the main commit log.

Please also leave it a draft until Product has cleared the ticket for merge.

@pamperer562580892423 - Thank you for your feedback! Our UX and Product teams will be reviewing this feature, and feedback like yours helps them decide how features like this evolve.

PRs, however, are not an appropriate place for feature feedback. Please keep conversation about this feature on the forum thread you linked.

audreyality avatar Aug 18 '25 14:08 audreyality

@pamperer562580892423 regarding your feedback above, here is the related Github discussion post for the proposed PR: https://github.com/orgs/bitwarden/discussions/15942

dwbit avatar Aug 22 '25 10:08 dwbit

Hi, any updates about this PR ?

gaspachoo avatar Nov 04 '25 12:11 gaspachoo

Hi @gaspachoo 👋

Bitwarden internal engineering and product teams are still reviewing this contribution. Our product team prioritizes work items in response to community feedback, so if this is a feature you would really like to see, please consider promoting it on our community forums.

harr1424 avatar Nov 04 '25 12:11 harr1424

Waiting for this as well! :)

stefan1983 avatar Nov 28 '25 18:11 stefan1983

Codecov Report

:x: Patch coverage is 43.89671% with 239 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 41.83%. Comparing base (8b5cb48) to head (7573426). :white_check_mark: All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...web/src/app/vault/services/de-duplicate.service.ts 66.05% 78 Missing and 15 partials :warning:
...c/app/tools/de-duplicate/de-duplicate.component.ts 0.00% 73 Missing :warning:
.../de-duplicate/duplicate-review-dialog.component.ts 8.95% 61 Missing :warning:
...uplicate/de-duplicate-warnings-dialog.component.ts 0.00% 10 Missing :warning:
apps/web/src/app/oss-routing.module.ts 0.00% 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15967      +/-   ##
==========================================
+ Coverage   41.81%   41.83%   +0.01%     
==========================================
  Files        3589     3593       +4     
  Lines      104160   104586     +426     
  Branches    15707    15827     +120     
==========================================
+ Hits        43556    43751     +195     
- Misses      58759    58976     +217     
- Partials     1845     1859      +14     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov[bot] avatar Dec 10 '25 16:12 codecov[bot]