Include an S3-based shuffle storage plugin
AWS Lambda has a limit of 1024 open file descriptors, which leads to task-result loss failures when you merge data into target tables to build idempotent ingestion pipelines.
Solution: use an S3-based shuffle storage plugin that can act as a replacement for local disk in AWS Lambda.
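For context, Spark 3.x already exposes an extension point for pluggable shuffle storage via `spark.shuffle.sort.io.plugin.class` (the default is the local-disk implementation, `org.apache.spark.shuffle.sort.io.LocalDiskShuffleDataIO`). A minimal sketch of how an S3-backed plugin could be wired in — note that `com.example.shuffle.S3ShuffleDataIO` is a hypothetical class name, since Spark does not ship an S3 implementation out of the box:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Spark 3.x extension point for pluggable shuffle storage.
    # "com.example.shuffle.S3ShuffleDataIO" is a placeholder for a class
    # implementing Spark's ShuffleDataIO interface; it would have to be
    # provided on the classpath (e.g. via spark.jars).
    .config("spark.shuffle.sort.io.plugin.class",
            "com.example.shuffle.S3ShuffleDataIO")
    .getOrCreate()
)
```

This is the plugin route the issue title proposes, as opposed to redirecting `spark.local.dir` itself.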
@gauravtanwar03 Good thinking. Do you have a prototype ready?
@gauravtanwar03 do you have working code?
We can include:

```python
# spark.shuffle.spill is a no-op since Spark 1.6 (spilling is always on)
.config("spark.shuffle.spill", "true")
.config("spark.shuffle.spill.compress", "true")
.config("spark.shuffle.compress", "true")
.config("spark.shuffle.file.buffer", "64k")
# NOTE: spark.local.dir normally expects a local filesystem path;
# pointing it at s3a:// only works with a shuffle plugin that supports it
.config("spark.local.dir", "s3a://your-bucket-name/spark/shuffle/")
```