
Include S3 based Shuffle storage plugin

Open gauravtanwar03 opened this issue 1 year ago • 1 comments

AWS Lambda has a limit of 1024 open file descriptors, which leads to task-result-loss failures when you merge data into target tables to build idempotent ingestion pipelines.

Solution: use an S3-based shuffle storage plugin as a replacement for local disk in AWS Lambda.
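A minimal sketch of the shuffle-related settings such a setup might start from, assuming PySpark. The values, the `shuffle_conf` dict, and the `apply_shuffle_conf` helper are illustrative only, not a tested plugin configuration; the actual plugin class and its settings depend on whichever S3 shuffle plugin gets deployed.

```python
# Sketch: shuffle settings aimed at staying under Lambda's 1024-descriptor
# limit. All keys are standard Spark properties; the values are examples.
shuffle_conf = {
    "spark.shuffle.compress": "true",        # compress shuffle output files
    "spark.shuffle.spill.compress": "true",  # compress data spilled during shuffles
    "spark.shuffle.file.buffer": "1m",       # bigger write buffer, fewer flushes
    "spark.sql.shuffle.partitions": "32",    # fewer partitions, fewer shuffle files
}

def apply_shuffle_conf(builder, conf):
    """Apply each key/value pair to a SparkSession-style builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With pyspark available this would be used as `apply_shuffle_conf(SparkSession.builder, shuffle_conf).getOrCreate()`.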

gauravtanwar03 avatar Feb 04 '24 10:02 gauravtanwar03

@gauravtanwar03 Good thinking. Do you have a prototype ready?

JohnChe88 avatar Mar 04 '24 15:03 JohnChe88

@gauravtanwar03 do you have working code?

JohnChe88 avatar Aug 08 '24 19:08 JohnChe88

We can include:

.config("spark.shuffle.spill", True) \
.config("spark.shuffle.spill.compress", True) \
.config("spark.shuffle.compress", True) \
.config("spark.shuffle.file.buffer", "64k") \
.config("spark.local.dir", "s3a://your-bucket-name/spark/shuffle/") \

Two caveats: spark.shuffle.spill has been ignored since Spark 1.6 (the sort-based shuffle always spills), and stock Spark expects spark.local.dir to be a local filesystem path, so pointing it at s3a:// only works if a shuffle storage plugin that understands S3 is installed.
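For context, the options above assembled into a complete builder expression. This is a sketch only: it assumes pyspark plus the hadoop-aws/S3A connector on the classpath, the app name and local path are placeholders, and (as noted in this thread) the s3a:// local-dir value would still require an S3-aware shuffle plugin.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lambda-shuffle-demo")           # placeholder app name
    .config("spark.shuffle.spill.compress", "true")
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.file.buffer", "64k")
    # Lambda's writable scratch space; swap for the S3 plugin's
    # location once such a plugin is wired in.
    .config("spark.local.dir", "/tmp/spark")
    .getOrCreate()
)
```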

JohnChe88 avatar Aug 09 '24 17:08 JohnChe88