Robert (Bobby) Evans
@asddfl We should support it mostly, but we don't officially test it. `pyspark.pandas` generally is translated into dataframe operations that are common with the SQL back end. If an operation...
I ran this locally and was not able to reproduce it. I think it is the same problem as https://github.com/NVIDIA/spark-rapids/issues/9822 and https://github.com/NVIDIA/spark-rapids/issues/10026 because average really is a `SUM(x)/COUNT(x)` and if...
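As an aside, the `SUM(x)/COUNT(x)` decomposition mentioned above can be sketched in a few lines of plain Python. This is only an illustration of how a SQL engine typically rewrites AVG, not the plugin's actual code; the function name is made up:

```python
# Illustrative sketch: an AVG aggregation is commonly rewritten by the
# engine as SUM(x) / COUNT(x). If the SUM step overflows or loses
# precision, the resulting average is wrong too, which is why issues in
# SUM can surface as wrong answers from AVG.

def average_via_sum_count(values):
    """Compute AVG the way a SQL engine decomposes it: SUM(x) / COUNT(x)."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count if count else None  # SQL AVG of no rows is NULL

print(average_via_sum_count([1.0, 2.0, 3.0, 4.0]))  # 2.5
```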
I think I only tried 2.12
So I think this really comes down to a limitation that we have with targetBatchSize. All of our code and optimizations assume that target batch size correlates directly to the...
When I looked at how we calculate the target merge size I think I found the problem. https://github.com/NVIDIA/spark-rapids/blob/925ef96e5e303e495469aa0a98eb90d681b81a5e/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuAggregateExec.scala#L108 We are trying to avoid re-partitioning the data, and end up trying...
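The general shape of the calculation being described, dividing the data size by a target batch size to decide how many pieces to split into, and skipping the split entirely when everything already fits, can be sketched as follows. This is a hypothetical Python sketch of the idea only, not the real logic in `GpuAggregateExec.scala`, and the function name is invented:

```python
import math

# Hypothetical sketch: pick how many partitions to split aggregation data
# into so that each merge batch stays near a target batch size. This is
# NOT the spark-rapids implementation, just the general idea it is based on.

def pick_num_partitions(total_bytes, target_batch_bytes):
    """Return the number of partitions needed so each is <= the target size."""
    if total_bytes <= target_batch_bytes:
        return 1  # everything fits in one batch; avoid re-partitioning
    return math.ceil(total_bytes / target_batch_bytes)

print(pick_num_partitions(10 * 1024, 4 * 1024))  # 3
print(pick_num_partitions(100, 4 * 1024))        # 1
```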
The C++ for the JSON parser returns a table_with_metadata. https://github.com/rapidsai/cudf/blob/29556a2514f4d274164a27a80539410da7e132d6/cpp/include/cudf/io/types.hpp#L231 We strip off much of the metadata to try and make the API consistent with the other reader APIs that...
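A rough analogy of what "stripping off the metadata" means: the C++ reader returns the parsed columns paired with extra reader metadata, and the wrapper keeps only the columns so the return shape matches the other readers. The types and names below are illustrative Python stand-ins, not cudf's actual bindings:

```python
from dataclasses import dataclass, field

# Illustrative analogy of cudf's C++ table_with_metadata: the parsed
# columns plus extra metadata produced by the reader. Not the real cudf API.

@dataclass
class TableWithMetadata:
    columns: dict                                  # column name -> values
    metadata: dict = field(default_factory=dict)   # extra info from the reader

def strip_metadata(result: TableWithMetadata) -> dict:
    """Drop the reader metadata so the result matches the other reader APIs."""
    return result.columns

parsed = TableWithMetadata({"a": [1, 2]}, {"schema_info": ["a"]})
print(strip_metadata(parsed))  # {'a': [1, 2]}
```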
With the most recent changes (including https://github.com/NVIDIA/spark-rapids/pull/10575) merged in, we are now getting an exception instead of the wrong data. With `spark.rapids.sql.json.read.mixedTypesAsString.enabled` set to either true or false we get back ```...
@andygrove do you still plan on trying to fix this?
CI failed because of https://github.com/NVIDIA/spark-rapids/issues/14009, but our CI currently has no way to skip the Databricks tests when you touch something even remotely related to Databricks.