Dataset Edit Performance Improvements
What this PR does / why we need it: This PR includes multiple changes to the UpdateDatasetVersionCommand to improve performance and scalability when editing datasets with large numbers of files. Key changes include:
- Adding a feature flag to allow disabling the edit-draft logging (separate log files that report changes being made by the current user)
- Changing functionality to not update the lastmodifieddate on existing files (since they do not change)
- The DatasetVersionDifference optimizations from #10818 (only improves time when edit-draft reporting is still enabled)
- Doing an initial merge of the dataset and avoiding subsequent merge/flush operations
Which issue(s) this PR closes:
Closes #10138
Special notes for your reviewer: In my testing on a dataset with 10K files, the UpdateDatasetVersionCommand call in the DatasetPage.save() method (timed by logging in the save method) averaged ~30 seconds to complete when a one-character change was made to the description. With all the changes in the PR, it now takes ~12-13 seconds. In general, verifying the impact of individual changes is hard:
- I see variations of ~2 seconds between repeat runs
- The first run after deployment can be ~3-4 seconds longer
- Simply logging the time a statement takes can be misleading: in one iteration, I saw that calculating the md5 hash of the :CVocConf setting was taking 2 seconds! Moving the retrieval of that setting as in the PR reduced that time to ~1 ms and produced an overall improvement, but the overall change was much smaller than 2 seconds. It looks like parallel operations were just slowing that step.
- Similarly, while #10818 reduced the differencing time from ~12 seconds to < 1 second when run after the other operations, trying to do it early led to a ~4-5 second run time. My guess is that some of the time goes to lazy loading elements used in the differencing, but I'm not sure.
That said, I would estimate that the first two changes contribute reductions of ~4 seconds each (the feature flag alone would save ~12 seconds, but the differencing PR already saves ~8 of those seconds, leaving ~4 seconds attributable to the flag).
Suggestions on how to test this: All the automated tests should pass, any/all variants of making changes to a dataset should work as before, and there should be no changes w.r.t. the db-level updates except for the change to not update datafile lastmodified dates. Performance should be improved overall and scaling should be improved. The simplest way to test that might be to turn on fine logging for the DatasetPage, where I've added logging of the time to run the update command. (Note that the overall time seen in the UI includes both the time to save the changes and the time to reload the page. The latter, with 10K files, is still many seconds and hasn't been improved in this PR.)
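For reviewers who want to time other steps in the same way, here is a minimal sketch of FINE-level timing using java.util.logging. It is not the code added to DatasetPage in this PR: the class name TimedCommandSketch, the timed helper, and the label string are all hypothetical, and the Supplier stands in for whatever call is being measured.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not part of this PR.
public class TimedCommandSketch {

    private static final Logger logger =
            Logger.getLogger(TimedCommandSketch.class.getCanonicalName());

    /**
     * Runs the given work, logs the elapsed wall-clock time at FINE,
     * and returns whatever the work returned.
     */
    public static <T> T timed(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            // Logged even if the work throws, so failed runs are timed too.
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            logger.log(Level.FINE, "{0} took {1} ms", new Object[]{label, elapsedMs});
        }
    }

    public static void main(String[] args) {
        // The lambda is a stand-in for the real call being measured.
        String result = timed("hypothetical update command", () -> "saved");
        System.out.println(result);
    }
}
```

As noted above, a single measurement like this can mislead when other work runs in parallel, so it is worth comparing several repeat runs rather than trusting one number.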
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: Probably one for any/all performance updates going into 6.5 along with announcing the feature flag and change to file last modified behavior.
Additional documentation: to be added