spark-sas7bdat

Decompression of sas7bdat.bz2 file is not distributed across worker nodes

Open • yivanova88 opened this issue 5 years ago • 1 comment

Hello,

I have been experimenting with the bz2 decompression functionality in the repo's master branch, which isn't part of your last release yet. When a bz2-compressed file is read, the decompression seems to happen on one worker node only. Is it possible to parallelise the decompression of externally compressed files?

Thanks in advance for your response.

yivanova88 • Mar 17 '20 11:03

Based on https://github.com/saurfang/spark-sas7bdat/pull/50, this seems to be expected. bz2 is indeed splittable, but we need to locate page boundaries within SAS files. The easiest workaround is probably to decompress and parse as separate steps; both should be parallelizable.

saurfang • Sep 14 '20 04:09
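
For anyone landing here, a minimal sketch of the "decompress first, parse second" workaround in Python. It is not part of the library and makes a few assumptions: the `.sas7bdat.bz2` files sit on a filesystem the driver can read and write (local or shared), decompression is parallelised per file rather than within a single archive, and the helper names `decompress_bz2`, `decompress_all` and `read_sas` are hypothetical. The data source name `com.github.saurfang.sas.spark` is the one from the project README.

```python
import bz2
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def decompress_bz2(src: Path) -> Path:
    """Stream-decompress one archive (e.g. data.sas7bdat.bz2) next to itself."""
    dst = src.with_suffix("")  # drops only the trailing ".bz2"
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in 1 MiB chunks so the whole file is never held in memory.
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dst


def decompress_all(paths, max_workers=4):
    """Step 1: decompress several archives concurrently, one task per file."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decompress_bz2, [Path(p) for p in paths]))


def read_sas(spark, path):
    """Step 2: parse the uncompressed file with spark-sas7bdat.

    `spark` is an existing SparkSession with the spark-sas7bdat package
    on its classpath (assumed here, not set up by this sketch).
    """
    return spark.read.format("com.github.saurfang.sas.spark").load(str(path))
```

Usage would look roughly like `plain = decompress_all(["a.sas7bdat.bz2", "b.sas7bdat.bz2"])` followed by `df = read_sas(spark, plain[0])`. Once the file is uncompressed, the reader can split it at page boundaries, so the parsing step is the part that gets distributed across workers. For a single very large archive, an external parallel bz2 tool may be a better fit for step 1 than this per-file approach.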