great_expectations
Job aborted due to stage failure: Total size of serialized results bigger than spark.driver.maxResultSize
First, thanks for the great work on Great Expectations. We are using it to validate/profile some large datasets in our project, and we have started to run into the following problem:
Job aborted due to stage failure: Total size of serialized results of 12 tasks (4.2 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
As can be seen in the attached screenshot, the failure happens at line 238 in core/util.py, where GE calls data.collect() to pull ALL of the data back into the driver node. We could, of course, increase the driver node's spec to get past this, but that approach is not really sustainable or scalable, and it defeats the purpose of using a distributed data system.
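For reference, the blunt mitigation mentioned above is a single Spark setting; this is a sketch, and the right value depends on your cluster (the 8g figure is an assumption, not a recommendation):

```properties
# spark-defaults.conf, or pass via --conf on spark-submit
# Raises the cap on serialized results collected to the driver
# (the error above shows a 4.0 GiB limit being exceeded).
spark.driver.maxResultSize  8g
# Setting it to 0 removes the limit entirely, but then a large
# collect() can OOM the driver instead of failing fast.
```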
Are there any configurations/workarounds we can use to mitigate the problem? We really like the framework.
Thank you very much.
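Until the collect() path is optimized, one workaround on the user side is to validate/profile a sample of the DataFrame rather than the full dataset. The helper below is a hypothetical sketch (not part of GE or Spark) for picking a sample fraction that keeps the collected result comfortably under the driver's spark.driver.maxResultSize budget:

```python
def safe_sample_fraction(dataset_bytes: int, max_result_bytes: int,
                         safety_margin: float = 0.5) -> float:
    """Return a fraction in (0, 1] such that the sampled data is expected
    to stay below safety_margin * max_result_bytes when collected.

    Hypothetical helper for illustration; dataset_bytes would come from
    your own size estimate of the DataFrame.
    """
    budget = safety_margin * max_result_bytes
    if dataset_bytes <= budget:
        return 1.0          # whole dataset already fits the budget
    return budget / dataset_bytes

# Example matching the error above: ~4.2 GiB of results vs a 4.0 GiB limit.
GIB = 1024 ** 3
fraction = safe_sample_fraction(int(4.2 * GIB), 4 * GIB)
# The fraction would then feed pyspark's DataFrame.sample, e.g.:
#   sampled_df = df.sample(fraction=fraction, seed=42)
# before handing sampled_df to Great Expectations.
```

Sampling changes the semantics of profiling (statistics become estimates), so this is a stopgap rather than a fix for the underlying collect().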
Howdy @bingwenhe, thank you for reaching out and informing us 🎉 We'll bring this up with the team.
We've added this optimization request to our product board. Unfortunately it hasn't been prioritized yet, so we can't provide a time estimate at this point. Thanks.