Parth Chandra
This was taken from Spark, which has since corrected it.
> * The Parquet scan of lineitem seems to take ~10% longer than Spark and 60%+ of the time is spent in native decoding, so perhaps we should add criterion...
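As a stopgap while criterion benchmarks on the native decoder land, a crude JVM-side timing harness can at least bound the end-to-end gap. This is only a sketch, not the benchmark suggested above: the file path, warm-up count, and use of `foreach` to force a full scan are all assumptions, and it presumes a Spark shell where `spark` is in scope.

```scala
// Sketch only: rough end-to-end timing of the lineitem scan from the
// Spark shell. A criterion benchmark on the native decoder would be
// the proper measurement; the file path here is an assumption.
val path = "/data/tpch/lineitem.parquet"

def timeMs[T](body: => T): Long = {
  val t0 = System.nanoTime()
  body
  (System.nanoTime() - t0) / 1000000
}

// Warm up the JIT and page cache before measuring.
(1 to 3).foreach(_ => spark.read.parquet(path).foreach(_ => ()))

println(s"lineitem scan: ${timeMs(spark.read.parquet(path).foreach(_ => ()))} ms")
```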
Seems like a perennial issue. This signature changes in every release, it appears (it is private, after all). https://github.com/apache/datafusion-comet/issues/1576
There are a couple of considerations here: 1) What version of Spark are users likely to be on (and therefore likely to want to use Comet with)? 2) What...
Spark produces the worst possible query plan for q72, which amplifies the difference in performance. The columnar-to-row (C2R) conversion overhead for Comet is amplified because the conversion happens on a dataset that...
I'll look into this, @comphead.
Update on this: the Spark vectorized reader also throws the same error. Users have to turn off vectorized reading to read such files. It is also pretty near impossible to...
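For anyone hitting this, the switch referred to above is Spark's standard config for the vectorized Parquet reader; a minimal session-level example follows (the file path is a placeholder, and the setting can also go in spark-defaults.conf):

```scala
// Fall back to the row-based parquet-mr reader, which can read files
// that the vectorized path rejects.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
val df = spark.read.parquet("/path/to/affected/files") // placeholder path
```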
Yes, let's close this. We can revisit this if more people report it.
IIRC, there were differences in output between Spark 3.2 and Spark 3.4 for the timestamp_ntz type. Taking a closer look, the definition of timestamp_ntz (in Spark) essentially means that the...
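To illustrate the distinction (a sketch assuming Spark 3.4+, where the type is available): TIMESTAMP_NTZ is a fixed wall-clock datetime, while plain TIMESTAMP (LTZ) is an instant that gets re-rendered in the session time zone.

```scala
// LTZ is an instant, displayed in the session time zone;
// NTZ is a wall-clock datetime with no zone attached.
spark.conf.set("spark.sql.session.timeZone", "UTC")
val df = spark.sql(
  "SELECT TIMESTAMP '2020-01-01 00:00:00' AS ltz, " +
  "TIMESTAMP_NTZ '2020-01-01 00:00:00' AS ntz")
df.show() // both columns render as 2020-01-01 00:00:00

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
df.show() // ltz shifts to 2019-12-31 16:00:00; ntz does not change
```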
IIRC, the vectorized versions of these encodings in Spark did not improve performance much over the row-based implementation in the parquet library.