dbatools icon indicating copy to clipboard operation
dbatools copied to clipboard

Invoke-DbaDbDataMasking - Improve performance

Open AdamJKoehler opened this issue 7 months ago • 2 comments

Switch from column-based masking to row-based masking.

Type of Change

  • [x] Bug fix (non-breaking change, fixes #7940 )
  • [ ] New feature (non-breaking change, adds functionality, fixes # )
  • [ ] Breaking change (affects multiple commands or functionality, fixes # )
  • [x] Ran manual Pester test and has passed (Invoke-ManualPester)
  • [ ] Adding code coverage to existing functionality
  • [ ] Pester test is included
  • [ ] If new file reference added for test, has is been added to github.com/dataplat/appveyor-lab ?
  • [ ] Unit test is included
  • [ ] Documentation
  • [ ] Build system

Purpose

To improve the speed at which Invoke-DbaDbDataMasking masks data in large tables

Approach

This change switches from Column by column masking to row by row masking, allowing for a batch of rows to be done at a single time, instead of having to iterate over each column to be masked in the table for each row.

AdamJKoehler avatar May 16 '25 17:05 AdamJKoehler

This is a big change and not easy to review. Best would be to let @sanderstad have a look at this and approve it. Sander, do you have the time?

andreasjordan avatar May 31 '25 15:05 andreasjordan

As #9664 is merged, you should update your branch to resolve the conflict. Then we can continue to review this PR.

andreasjordan avatar Jun 11 '25 12:06 andreasjordan

Sorry, I didn't see that you tagged me until now. I could've done the check, but it's already merged

sanderstad avatar Jul 04 '25 11:07 sanderstad

@sanderstad this perf change isn't merged in - was another fix. Let us know what you think 😄

jpomfret avatar Jul 04 '25 11:07 jpomfret

I looked at the code and I know there was a reason why I did it column by column for a single table. I was thinking about this approach from a column perspective, not a row perspective. The reason I went for this approach is because I wanted the entire column to be updated for a table at once, and not per row (RBAR).

@AdamJKoehler Your approach could work, but I'm very much interested in your results from testing. Do you have an example of the original way of doing it and the new way?

The code looks good for the rest of it, but I would like to see the results from a run with that amount of records (200.000 +).

sanderstad avatar Jul 10 '25 13:07 sanderstad

@sanderstad, I do have some old ones. I'll get a fresh set in the next few days and send it over.

AdamJKoehler avatar Jul 11 '25 19:07 AdamJKoehler

How're we looking, all?

potatoqualitee avatar Jul 29 '25 12:07 potatoqualitee

@potatoqualitee I got pulled into some emergency issues that took me away from this. I should be able to get back to this within the next week.

AdamJKoehler avatar Jul 30 '25 04:07 AdamJKoehler

Thank you @AdamJKoehler 👍🏼

potatoqualitee avatar Jul 31 '25 13:07 potatoqualitee

Hello, closing for now but I encourage you to reopen once it's been tested, I'll be happy to merge 👍🏼

potatoqualitee avatar Aug 10 '25 22:08 potatoqualitee

@sanderstad, @potatoqualitee Here's my recent test. My changes are the top result, and the bottom are the original implementation. scramble_performance_test

I was going to re-open the PR, but it appears I don't have the ability to do so.

AdamJKoehler avatar Aug 22 '25 16:08 AdamJKoehler

Awesome, thank you @AdamJKoehler ! Reopened and will merge once tests pass 👍🏼

potatoqualitee avatar Aug 23 '25 10:08 potatoqualitee

boom, thank you! Merrrrrrrrrged!

potatoqualitee avatar Aug 23 '25 11:08 potatoqualitee