Improve Snowflake Generate Performance
Closes PROD-1639
Description Of Changes
The Snowflake connector has some known out of the box issues, and the suggested workaround has some performance hit. This PR introduces the ability to surface the schema in parallel to reduce the time required. No data or funactionality change, just the option to surface the schema in multiple threads.
This was found as a problem for larger datasets that caused us to hit the FastAPI timeout (60s by default)
Code Changes
- [x] Include
joblib - [x] separate column discovery to new function
- [x] run the discovery in parallel
- [x] use a scale for the number of threads to be used
- [x] ensure test coverage
Steps to Confirm
- [x] Tested manually with our POCDB in Snowflake, where it took the time down from ~8 seconds to under 3 when using 4 threads
Pre-Merge Checklist
- [ ] All CI Pipelines Succeeded
- Documentation:
- [ ] documentation complete, PR opened in fidesdocs
- [ ] documentation issue created in fidesdocs
- [ ] Issue Requirements are Met
- [ ] Relevant Follow-Up Issues Created
- [ ] Update
CHANGELOG.md - [ ] For API changes, the Postman collection has been updated
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Ignored Deployment
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| fides-plus-nightly | ⬜️ Ignored (Inspect) | Visit Preview | Apr 22, 2024 6:52am |
Passing run #7354 ↗︎
| ⚠️ You've recorded test results over your free plan limit. Upgrade your plan to view test results. | ||||
Details:
| Merge 7708cd3a0d7b9f1696f7e9ca91b15260676970db into b6a774b0b15eac9a71ce29ccb0cb... | |||
| Project: fides | Commit: 77e56efcc0 ℹ️ |
||
| Status: Passed | Duration: 00:34 💡 | ||
| Started: Apr 19, 2024 9:17 PM | Ended: Apr 19, 2024 9:17 PM | ||
Review all test suite changes for PR #4587 ↗︎
Codecov Report
Attention: Patch coverage is 25.00000% with 6 lines in your changes are missing coverage. Please review.
Project coverage is 86.59%. Comparing base (
b6a774b) to head (7708cd3).
| Files | Patch % | Lines |
|---|---|---|
| src/fides/core/dataset.py | 25.00% | 6 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #4587 +/- ##
==========================================
- Coverage 86.60% 86.59% -0.02%
==========================================
Files 339 339
Lines 20100 20105 +5
Branches 2587 2588 +1
==========================================
+ Hits 17407 17409 +2
- Misses 2220 2223 +3
Partials 473 473
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I think I may need to add some test coverage but was hoping to get an early review to try and get this on an alpha release if one of you has any time @pattisdr or @adamsachs
It adds a new dependency (joblib) but one we already have on Fidesplus so I don't see it as a huge trade (but could be wrong!)
i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?
I believe we do, in that the same output should match the results after this change - in the end, I think the only functional change is taking the parallelized output and outputting it correctly so I was going to look at testing that distinctly but it does feel covered... The current testing path should be redirected to hit both of these steps with Snowflake and we should match the same output (it's how I tested it manually as well!)