
Memory Leak in GE Running in Databricks

Open MLGalloudec opened this issue 2 years ago • 3 comments

Describe the bug

We have a data validation that runs every five minutes as part of a Databricks job. It runs from a batch request that uses an in-memory pandas dataframe as the batch data. After the job completes, the library does not release all of the memory it used. We isolated the problem by removing portions of our code one at a time, and GE is what causes the memory leak. Below is a graph of memory usage over time. When we add the validation back in, usage climbs incrementally until the cluster crashes.

[Image: export_for_ge (graph of memory usage over time)]

To Reproduce

Steps to reproduce the behavior:

  1. Create a base data context with the root directory in the Databricks file system.
  2. Create the data config and add it to the data context. The execution engine's module_name is great_expectations.execution_engine and its class_name is PandasExecutionEngine. We are using a RuntimeDataConnector.
  3. Create a batch request on the dataframe using a RuntimeBatchRequest.
  4. Run it against the expectation suite (we have 11 basic expectations, checking that columns exist and that they contain the correct data types).

Expected behavior

All memory should be released after the process has finished.
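One way to check this expectation outside the Databricks job is to call the validation step repeatedly and record how much Python-level memory is still held after each call. The following is a minimal sketch using only the standard library, not the reporter's code. `measure_retained_memory`, `leaky_step`, and `clean_step` are hypothetical names introduced for illustration; the two step functions are stand-ins for a wrapper around the real validation call. Note that `tracemalloc` only traces allocations made through Python's memory allocators, so memory held by native extensions may not appear here, and process-level metrics would be a useful complement.

```python
import gc
import tracemalloc
from typing import Callable, List


def measure_retained_memory(run_once: Callable[[], None], iterations: int = 5) -> List[int]:
    """Return the traced Python heap size (in bytes) after each call to run_once."""
    tracemalloc.start()
    retained = []
    try:
        for _ in range(iterations):
            run_once()
            gc.collect()  # free objects that are only kept alive by reference cycles
            current, _peak = tracemalloc.get_traced_memory()
            retained.append(current)
    finally:
        tracemalloc.stop()
    return retained


# Hypothetical stand-ins for the validation step (assumptions, not GE code).
_leaky_cache = []


def leaky_step() -> None:
    # Keeps a reference to ~1 MB per call, so retained memory grows every run.
    _leaky_cache.append(bytearray(1_000_000))


def clean_step() -> None:
    # Allocates ~1 MB but drops the reference before returning.
    data = bytearray(1_000_000)
    del data


if __name__ == "__main__":
    print("leaky:", measure_retained_memory(leaky_step))
    print("clean:", measure_retained_memory(clean_step))
```

If the numbers keep climbing from run to run when `run_once` wraps the validation, but stay flat when it wraps the rest of the job, that would support the isolation result described above.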

Environment (please complete the following information):

  • Operating System: Databricks Runtime Version 10.4 LTS
  • Great Expectations Version: 0.15.2

I'm very happy to supply code samples of how we've implemented GE in Python if required. We followed this guide to write the code: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/

MLGalloudec avatar Apr 26 '22 16:04 MLGalloudec

Thank you for raising this, @MLGalloudec - we will review and be in touch.

talagluck avatar Apr 29 '22 19:04 talagluck

Hi @MLGalloudec - thank you for your patience here. We are working on internal reprioritization this week. Is this still an issue? Are you able to share the code samples you mentioned for how you are implementing Great Expectations? Thanks!

talagluck avatar Aug 09 '22 16:08 talagluck

Hi @talagluck, yes - this is still an issue. I will get the code samples for how we're implementing GE and post them below. We're connecting and running expectations on an in-memory pandas dataframe.

MLGalloudec avatar Aug 12 '22 10:08 MLGalloudec

Hey @MLGalloudec, it's been a minute, heh. Is this still an issue? If so, we would appreciate the code samples mentioned above. Thanks!

rdodev avatar Mar 08 '23 16:03 rdodev