Jason Lowe

59 comments by Jason Lowe

Thanks for clarifying what this is doing. Note that an alternative to this approach is to fully implement a row ID metadata column. If we had that, we wouldn't need...

> I think as first step, we can limit PERFILE scan to target files only

I agree this would be a valuable, incremental step.

Note that this also should remove the repartition by partition key for partitioned tables when writing a MERGE because we're going to turn around and repartition for the optimized write...

Note that for MERGE the user can specify `spark.databricks.delta.merge.repartitionBeforeWrite.enabled=false` to avoid repartitioning by the partition key when doing a merge into a small number of partitions to avoid sending all...
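As a rough illustration of passing the setting named in the comment above, here is a minimal Python sketch that renders it as a `spark-submit` `--conf` argument. The helper name `spark_submit_conf_args` is hypothetical and not part of any library; only the configuration key and its `false` value come from the comment.

```python
# Configuration key taken verbatim from the comment above.
MERGE_REPARTITION_KEY = "spark.databricks.delta.merge.repartitionBeforeWrite.enabled"


def spark_submit_conf_args(settings):
    """Render a dict of Spark settings as spark-submit --conf arguments.

    Hypothetical helper for illustration only.
    """
    args = []
    for key, value in settings.items():
        # Write booleans in lowercase, matching the form used in the comment.
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += ["--conf", f"{key}={value}"]
    return args


# Disable the repartition by partition key before the MERGE write.
conf_args = spark_submit_conf_args({MERGE_REPARTITION_KEY: False})
print(" ".join(conf_args))
```

The same key can presumably also be set on a live session with `spark.conf.set(key, "false")` before running the MERGE, assuming the deployment honors runtime changes to this setting.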

This is a Databricks-specific behavior per the doc linked above, not a behavior in OSS Delta Lake, at least for the versions of OSS Delta Lake that we support. There's...

@mattahrens significant speedup is expected with just that setting, since that's comparing non-Bloom filter joins vs. Bloom filter joins, with no GPU fallbacks in either case. For the purposes of...

It looks like the stream synchronization is triggered by the use of `rmm::exec_policy` instead of `rmm::exec_policy_nosync` in `cudf::detail::true_if` at https://github.com/rapidsai/cudf/blob/branch-24.12/cpp/include/cudf/detail/unary.hpp#L62.

See https://github.com/apache/spark/commit/81639090622 for changes that were needed to the CPU BroadcastHashJoinExec that are probably relevant to the changes likely needed for the GPU version.

> It looks like a build issue where spark-rapids-jni failed to pull in the correct nvcomp version.

That seems like a scary error. How could we be pulling in such...