spark-sas7bdat
Decompression of sas7bdat.bz2 file is not distributed across worker nodes
Hello,
I have been experimenting with the bz2 decompression functionality in the repo's master branch, which isn't part of your latest release. When a bz2-compressed file is read, the decompression seems to happen on one worker node only. Is it possible to parallelise the decompression of externally compressed files?
Thanks in advance for your response.
Based on https://github.com/saurfang/spark-sas7bdat/pull/50, this seems to be expected. bz2 is indeed splittable, but we also need to look for page boundaries within the SAS file. The easiest workaround is probably to decompress and parse as two separate steps; both should be parallelizable.
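As a rough illustration of the "decompress first, parse second" workaround, here is a minimal sketch using only the Python standard library. It is not part of spark-sas7bdat; the helper names `decompress_bz2` and `decompress_all` are made up for this example, and it assumes the files are named `*.sas7bdat.bz2` on a filesystem the driver can reach. Note that it only parallelises across files (one task per file), not within a single file.

```python
import bz2
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def decompress_bz2(src: Path) -> Path:
    """Stream-decompress name.sas7bdat.bz2 to name.sas7bdat in the same directory."""
    dst = src.with_suffix("")  # drops only the trailing ".bz2"
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in 1 MiB chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dst


def decompress_all(paths, max_workers=4):
    """Decompress several .bz2 files concurrently, one task per file.

    Results are returned in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decompress_bz2, [Path(p) for p in paths]))
```

Once the plain `.sas7bdat` files exist, they can be read as usual, e.g. `spark.read.format("com.github.saurfang.sas.spark").load(path)`, and the parsing step should then be split across workers.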