Sandip Agarwala
> cc @sandip-db WDYT?

It looks ok to me, although I have not come across anyone asking for it.
> Even if you don't want to touch the Hadoop code, this PR approach looks like overkill. Hadoop provides the `io.compression.codecs` configuration for plugging in custom `CompressionCodec` implementations, so implementing an org.apache.spark.xxx.SparkZstdCompressionCodec and configuring...
> does Hadoop's LineRecordReader allow us to specify the compression at the session level without forking the code?

@cloud-fan It's possible to pass different codecs via the `io.compression.codecs` Hadoop conf, but...
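For context, this is roughly what that registration looks like from Spark; a minimal sketch, assuming a hypothetical codec class (`io.compression.codecs` is the standard Hadoop property, reachable through Spark's `spark.hadoop.` prefix):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: registering a hypothetical custom codec with Hadoop's codec
// factory so LineRecordReader can discover it by file extension. The class
// must implement org.apache.hadoop.io.compress.CompressionCodec.
val spark = SparkSession.builder()
  .appName("custom-codec-registration")
  .config("spark.hadoop.io.compression.codecs",
    "com.example.SparkZstdCompressionCodec") // hypothetical implementation
  .getOrCreate()
```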
> for "no extensions" compressed text files, I'm not sure if this is a valid use case(see my last comment) @pan3793 While uncommon, we come across users who have compressed...
> also, please be careful: a Hadoop codec may behave differently from the Spark/Unix tool codec, for example HADOOP-12990 (lz4)

Thanks for bringing this to my attention. We are not...
@pan3793 Thanks for your input.

> how do you define the behavior of "specify the compression at the session level"? always respect session conf and ignore filename suffix? or fallback...
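To make the question concrete, a hedged sketch of what an explicit session-level override could look like; the `compression` read option name here is an assumption for illustration, not a confirmed API of this PR:

```scala
// Hypothetical usage sketch of the "session level" behavior under discussion:
// an explicit option that forces zstd decoding even when the file name
// carries no codec suffix. The option name is assumed, not confirmed.
val df = spark.read
  .option("compression", "zstd")
  .text("/data/logs/part-00000") // compressed, but no .zst extension
```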
> I think you should at least test reading a zstd text file written by Hadoop

Added a test scenario with a file compressed using the Hadoop native ZSTD codec.
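For reference, one way such a fixture can be produced with Hadoop's own codec; this is a sketch, not the PR's actual test code, and it requires Hadoop native libraries built with zstd support:

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.ZStandardCodec
import org.apache.hadoop.util.ReflectionUtils

// Write a small text file through Hadoop's native ZSTD codec so the reader
// side can be exercised against Hadoop-produced frames, not just zstd-jni ones.
val conf = new Configuration()
val codec = ReflectionUtils.newInstance(classOf[ZStandardCodec], conf)
val fs = FileSystem.getLocal(conf)
val out = codec.createOutputStream(fs.create(new Path("/tmp/sample.txt.zst")))
try out.write("hello zstd\n".getBytes(StandardCharsets.UTF_8))
finally out.close()
```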
@pan3793 Thanks for pointing it out. If a native Hadoop or user-provided zstd codec is available, we will [use that](https://github.com/apache/spark/pull/51182/files#diff-8bf2be281511318ebeb1e5d306a9f266d78fa0f78f2e8134760acf4ef084eafcR84-R91) instead of Spark's zstd-JNI based decompression.
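An illustrative sketch of that resolution order (not the PR's exact code; the helper name and the lookup by codec class name are assumptions):

```scala
import java.io.InputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.CompressionCodecFactory

import com.github.luben.zstd.ZstdInputStream

// Illustrative fallback: prefer a codec Hadoop can resolve (native or
// registered via io.compression.codecs); otherwise fall back to the
// zstd-jni stream that Spark bundles. The helper name is hypothetical.
def openZstd(conf: Configuration, raw: InputStream): InputStream = {
  val codec = new CompressionCodecFactory(conf).getCodecByName("ZStandardCodec")
  if (codec != null) codec.createInputStream(raw)
  else new ZstdInputStream(raw)
}
```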