great_expectations
Job aborted due to stage failure: Total size of serialized results bigger than spark.driver.maxResultSize
First, thanks for the great work on Great Expectations. We are using it to validate/profile some large datasets in our project, and we have started to run into the following problem:
Job aborted due to stage failure: Total size of serialized results of 12 tasks (4.2 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
As can be seen in the attached screenshot, the failure happens at line 238 in core/util.py, where GE calls data.collect() to pull ALL of the data back into the driver node. We could, of course, increase the driver node's spec to get past this, but that approach is not really sustainable or scalable, and it defeats the purpose of using a distributed data system.
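For reference, the blunt mitigation mentioned above is a single Spark setting; this is a sketch, and the right value depends on your cluster (the 8g figure is an assumption, not a recommendation):

```properties
# spark-defaults.conf, or pass via --conf on spark-submit
# Raises the cap on serialized results collected to the driver
# (the error above shows a 4.0 GiB limit being exceeded).
spark.driver.maxResultSize  8g
# Setting it to 0 removes the limit entirely, but then a large
# collect() can OOM the driver instead of failing fast.
```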
Are there any configurations/workarounds we can use to mitigate the problem? We really like the framework.
Thank you very much.
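Until the collect() path is optimized, one workaround on the user side is to validate/profile a sample of the DataFrame rather than the full dataset. The helper below is a hypothetical sketch (not part of GE or Spark) for picking a sample fraction that keeps the collected result comfortably under the driver's spark.driver.maxResultSize budget:

```python
def safe_sample_fraction(dataset_bytes: int, max_result_bytes: int,
                         safety_margin: float = 0.5) -> float:
    """Return a fraction in (0, 1] such that the sampled data is expected
    to stay below safety_margin * max_result_bytes when collected.

    Hypothetical helper for illustration; dataset_bytes would come from
    your own size estimate of the DataFrame.
    """
    budget = safety_margin * max_result_bytes
    if dataset_bytes <= budget:
        return 1.0          # whole dataset already fits the budget
    return budget / dataset_bytes

# Example matching the error above: ~4.2 GiB of results vs a 4.0 GiB limit.
GIB = 1024 ** 3
fraction = safe_sample_fraction(int(4.2 * GIB), 4 * GIB)
# The fraction would then feed pyspark's DataFrame.sample, e.g.:
#   sampled_df = df.sample(fraction=fraction, seed=42)
# before handing sampled_df to Great Expectations.
```

Sampling changes the semantics of profiling (statistics become estimates), so this is a stopgap rather than a fix for the underlying collect().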
Howdy @bingwenhe, thank you for reaching out and informing us 🎉 We'll bring this up with the team.
We've added this optimization request to our product board. Unfortunately it hasn't been prioritized yet, so we can't provide a time estimate at this point. Thanks.