Jason Lowe comments

Results: 59 comments by Jason Lowe

[FEA] Force use PERFILE scan in low shuffle merge.

Thanks for clarifying what this is doing. Note that an alternative to this approach is to fully implement a row ID metadata column. If we had that, we wouldn't need...

[FEA] Force use PERFILE scan in low shuffle merge.

> I think as first step, we can limit PERFILE scan to target files only

I agree this would be a valuable, incremental step.

Delta Lake MERGE/UPDATE/DELETE on Databricks should trigger optimized write and auto compaction

Note that this also should remove the repartition by partition key for partitioned tables when writing a MERGE because we're going to turn around and repartition for the optimized write...

Delta Lake MERGE/UPDATE/DELETE on Databricks should trigger optimized write and auto compaction

Note that for MERGE the user can specify `spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false` to avoid repartitioning by the partition key when merging into a small number of partitions, to avoid sending all...
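
For illustration, the setting named in the comment above could be passed at job submission time as a `--conf key=value` argument. The sketch below is only a minimal example under that assumption: `conf_args` and `MERGE_REPARTITION_KEY` are hypothetical names introduced here, not part of Spark, Delta Lake, or the RAPIDS Accelerator. Only the configuration key itself comes from the comment.

```python
# Hypothetical helper (not part of Spark or Delta Lake): renders settings
# as spark-submit style "--conf key=value" arguments.

# Configuration key quoted in the comment above.
MERGE_REPARTITION_KEY = (
    "spark.databricks.delta.merge.repartitionBeforeWrite.enabled"
)


def conf_args(settings):
    """Return a flat list of "--conf", "key=value" arguments."""
    args = []
    for key, value in settings.items():
        # Render booleans as lowercase true/false, the form used in the comment.
        if isinstance(value, bool):
            rendered = str(value).lower()
        else:
            rendered = str(value)
        args += ["--conf", f"{key}={rendered}"]
    return args


if __name__ == "__main__":
    # Disable the repartition-before-write step for MERGE.
    print(" ".join(conf_args({MERGE_REPARTITION_KEY: False})))
```

Whether disabling the repartition helps depends on how many partitions the MERGE actually writes to, so this is probably best treated as a per-job setting rather than a cluster-wide default.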

Delta Lake MERGE/UPDATE/DELETE on Databricks should trigger optimized write and auto compaction

This is a Databricks-specific behavior per the doc linked above, not a behavior in OSS Delta Lake, at least for the versions of OSS Delta Lake that we support. There's...

[FEA] Support bloom filter joins on Databricks

@mattahrens significant speedup is expected with just that setting, since that's comparing non-Bloom filter joins vs. Bloom filter joins, with no GPU fallbacks in either case. For the purposes of...

[BUG] Unnecessary stream synchronization in cudf::is_valid

It looks like the stream synchronization is triggered by the use of `rmm::exec_policy` instead of `rmm::exec_policy_nosync` in `cudf::detail::true_if` at https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/detail/unary.hpp#L62.

Investigate CoalescedHashPartitioning

See https://github.com/apache/spark/commit/81639090622 for changes that were needed to the CPU BroadcastHashJoinExec that are probably relevant to the changes likely needed for the GPU version.

Host Memory OOM handling for RowToColumnarIterator

> It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

That seems like a scary error. How could we be pulling in such...