fides icon indicating copy to clipboard operation
fides copied to clipboard

Improve Snowflake Generate Performance

Open SteveDMurphy opened this issue 1 year ago • 5 comments

Closes PROD-1639

Description Of Changes

The Snowflake connector has some known out of the box issues, and the suggested workaround has some performance hit. This PR introduces the ability to surface the schema in parallel to reduce the time required. No data or funactionality change, just the option to surface the schema in multiple threads.

This was found as a problem for larger datasets that caused us to hit the FastAPI timeout (60s by default)

Code Changes

  • [x] Include joblib
  • [x] separate column discovery to new function
  • [x] run the discovery in parallel
  • [x] use a scale for the number of threads to be used
  • [x] ensure test coverage

Steps to Confirm

  • [x] Tested manually with our POCDB in Snowflake, where it took the time down from ~8 seconds to under 3 when using 4 threads

Pre-Merge Checklist

SteveDMurphy avatar Feb 01 '24 00:02 SteveDMurphy

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
fides-plus-nightly ⬜️ Ignored (Inspect) Visit Preview Apr 22, 2024 6:52am

vercel[bot] avatar Feb 01 '24 00:02 vercel[bot]

Passing run #7354 ↗︎

0 4 0 0 Flakiness 0
⚠️ You've recorded test results over your free plan limit.
Upgrade your plan to view test results.

Details:

Merge 7708cd3a0d7b9f1696f7e9ca91b15260676970db into b6a774b0b15eac9a71ce29ccb0cb...
Project: fides Commit: 77e56efcc0 ℹ️
Status: Passed Duration: 00:34 💡
Started: Apr 19, 2024 9:17 PM Ended: Apr 19, 2024 9:17 PM

Review all test suite changes for PR #4587 ↗︎

cypress[bot] avatar Feb 01 '24 00:02 cypress[bot]

Codecov Report

Attention: Patch coverage is 25.00000% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 86.59%. Comparing base (b6a774b) to head (7708cd3).

Files Patch % Lines
src/fides/core/dataset.py 25.00% 6 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4587      +/-   ##
==========================================
- Coverage   86.60%   86.59%   -0.02%     
==========================================
  Files         339      339              
  Lines       20100    20105       +5     
  Branches     2587     2588       +1     
==========================================
+ Hits        17407    17409       +2     
- Misses       2220     2223       +3     
  Partials      473      473              

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 01 '24 00:02 codecov[bot]

I think I may need to add some test coverage but was hoping to get an early review to try and get this on an alpha release if one of you has any time @pattisdr or @adamsachs

It adds a new dependency (joblib) but one we already have on Fidesplus so I don't see it as a huge trade (but could be wrong!)

SteveDMurphy avatar Feb 01 '24 21:02 SteveDMurphy

i agree that some test coverage here would be very nice, though it may be a bit hard to get the parallelization functionality itself covered in automated tests. do we at least have the overall functionality of this codepath covered in existing tests, i.e. enough to ensure that this change doesn't cause regressions?

I believe we do, in that the same output should match the results after this change - in the end, I think the only functional change is taking the parallelized output and outputting it correctly so I was going to look at testing that distinctly but it does feel covered... The current testing path should be redirected to hit both of these steps with Snowflake and we should match the same output (it's how I tested it manually as well!)

SteveDMurphy avatar Feb 02 '24 00:02 SteveDMurphy