
Job aborted due to stage failure: Total size of serialized results bigger than spark.driver.maxResultSize

Open · bingwenhe opened this issue 2 years ago

First, thanks for the great work on Great Expectations. We are using it to validate and profile some large datasets in our project, and we have started to run into the following problem:

Job aborted due to stage failure: Total size of serialized results of 12 tasks (4.2 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.

As can be seen in the attached screenshot, the failure happens at line 238 in core/util.py, where GE calls data.collect() to pull ALL of the data back into the driver node. We could, of course, increase the driver node's spec to get past this, but that approach is not really sustainable or scalable, and it defeats the purpose of using a distributed data system.
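The blunt mitigation is to raise the driver's result-size cap. Here is a minimal sketch, assuming a standard PySpark session; the values are illustrative, and this only postpones the problem rather than fixing the underlying collect():

```python
from pyspark.sql import SparkSession

# Sketch of the blunt workaround: raise the driver's result-size cap when
# creating the session. The default for spark.driver.maxResultSize is 1g;
# "0" disables the cap entirely (at the risk of OOM-ing the driver).
spark = (
    SparkSession.builder
    .appName("ge-validation")  # illustrative app name
    .config("spark.driver.maxResultSize", "8g")
    .getOrCreate()
)

# The equivalent at submit time:
#   spark-submit --conf spark.driver.maxResultSize=8g my_job.py
```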

Are there any configurations/workarounds we can use to mitigate the problem? We really like the framework.
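Short of raising the cap, one way to bound what collect() brings back is to validate/profile a reproducible sample of the DataFrame rather than the full dataset. The helper below is a hypothetical sketch (the name, fraction, and seed are all illustrative); it trades exact profiled statistics for bounded driver memory:

```python
from pyspark.sql import DataFrame

def sample_for_profiling(df: DataFrame, fraction: float = 0.01, seed: int = 42) -> DataFrame:
    """Return a reproducible sample so any driver-side collect() stays small.

    Hypothetical helper: profiling a 1% sample keeps serialized results well
    under spark.driver.maxResultSize, at the cost of approximate statistics.
    """
    return df.sample(withReplacement=False, fraction=fraction, seed=seed)

# profiled_df = sample_for_profiling(raw_df)
# ...then hand profiled_df to GE instead of the full DataFrame.
```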

Thank you very much.

[Screenshot: Spark stack trace showing the failure at data.collect() in core/util.py, line 238]

bingwenhe · Feb 23 '22

Howdy @bingwenhe, thank you for reaching out and letting us know 🎉 We'll bring this up with the team.

AFineDayFor · Feb 25 '22

We've added this optimization request to our product board. Unfortunately, it hasn't been prioritized yet, so we can't provide a time estimate at this point. Thanks.

rdodev · Mar 07 '23