Robert (Bobby) Evans
@asddfl We should support it mostly, but we don't officially test it. `pyspark.pandas` generally is translated into dataframe operations that are common with the SQL back end. If an operation...
I ran this locally and was not able to reproduce it. I think it is the same problem as https://github.com/NVIDIA/spark-rapids/issues/9822 and https://github.com/NVIDIA/spark-rapids/issues/10026 because average really is a `SUM(x)/COUNT(x)` and if...
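As an aside, the `SUM(x)/COUNT(x)` decomposition mentioned above can be sketched in a few lines of plain Python. This is only an illustration of how a SQL engine typically rewrites AVG, not the plugin's actual code; the function name is made up:

```python
# Illustrative sketch: an AVG aggregation is commonly rewritten by the
# engine as SUM(x) / COUNT(x). If the SUM step overflows or loses
# precision, the resulting average is wrong too, which is why issues in
# SUM can surface as wrong answers from AVG.

def average_via_sum_count(values):
    """Compute AVG the way a SQL engine decomposes it: SUM(x) / COUNT(x)."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count if count else None  # SQL AVG of no rows is NULL

print(average_via_sum_count([1.0, 2.0, 3.0, 4.0]))  # 2.5
```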
I think I only tried 2.12
So I think this really comes down to a limitation that we have with targetBatchSize. All of our code and optimizations assume that target batch size correlates directly to the...
When I looked at how we calculate the target merge size I think I found the problem. https://github.com/NVIDIA/spark-rapids/blob/925ef96e5e303e495469aa0a98eb90d681b81a5e/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L108 We are trying to avoid re-partitioning the data, and end up trying...
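The general shape of the calculation being described, dividing the data size by a target batch size to decide how many pieces to split into, and skipping the split entirely when everything already fits, can be sketched as follows. This is a hypothetical Python sketch of the idea only, not the real logic in `GpuAggregateExec.scala`, and the function name is invented:

```python
import math

# Hypothetical sketch: pick how many partitions to split aggregation data
# into so that each merge batch stays near a target batch size. This is
# NOT the spark-rapids implementation, just the general idea it is based on.

def pick_num_partitions(total_bytes, target_batch_bytes):
    """Return the number of partitions needed so each is <= the target size."""
    if total_bytes <= target_batch_bytes:
        return 1  # everything fits in one batch; avoid re-partitioning
    return math.ceil(total_bytes / target_batch_bytes)

print(pick_num_partitions(10 * 1024, 4 * 1024))  # 3
print(pick_num_partitions(100, 4 * 1024))        # 1
```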
The C++ for the JSON parser returns a table_with_metadata. https://github.com/rapidsai/cudf/blob/29556a2514f4d274164a27a80539410da7e132d6/cpp/include/cudf/io/types.hpp#L231 We strip off much of the metadata to try and make the API consistent with the other reader APIs that...
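A rough analogy of what "stripping off the metadata" means: the C++ reader returns the parsed columns paired with extra reader metadata, and the wrapper keeps only the columns so the return shape matches the other readers. The types and names below are illustrative Python stand-ins, not cudf's actual bindings:

```python
from dataclasses import dataclass, field

# Illustrative analogy of cudf's C++ table_with_metadata: the parsed
# columns plus extra metadata produced by the reader. Not the real cudf API.

@dataclass
class TableWithMetadata:
    columns: dict                                  # column name -> values
    metadata: dict = field(default_factory=dict)   # extra info from the reader

def strip_metadata(result: TableWithMetadata) -> dict:
    """Drop the reader metadata so the result matches the other reader APIs."""
    return result.columns

parsed = TableWithMetadata({"a": [1, 2]}, {"schema_info": ["a"]})
print(strip_metadata(parsed))  # {'a': [1, 2]}
```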
With the most recent changes (including https://github.com/NVIDIA/spark-rapids/pull/10575) merged in, we are now getting an exception instead of the wrong data. With `spark.rapids.sql.json.read.mixedTypesAsString.enabled` set to either true or false we get back ```...
@andygrove do you still plan on trying to fix this?
CI failed because of https://github.com/NVIDIA/spark-rapids/issues/14009, but our CI currently has no way to skip the Databricks tests when you touch something even remotely related to Databricks.