Andy Grove

657 comments by Andy Grove

> It's correctly caching now() so it returns the same value when used multiple times in the same query, but it shouldn't be caching the value past statement execution, correct?...
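For illustration, the behavior described in the quote (one value of `now()` per statement, never reused by a later statement) can be sketched as below. This is a minimal sketch, not the engine's actual implementation: `StatementContext` and `execute` are hypothetical names introduced here, and the only assumption is that each statement gets its own fresh context.

```python
from datetime import datetime, timezone


class StatementContext:
    """Hypothetical per-statement state; not any engine's real API."""

    def __init__(self):
        self._now = None

    def now(self):
        # Read the clock once, then reuse that value for the rest of
        # this statement.
        if self._now is None:
            self._now = datetime.now(timezone.utc)
        return self._now


def execute(statement):
    # A fresh context per statement means the cached timestamp cannot
    # leak into the next statement.
    ctx = StatementContext()
    return statement(ctx)


# Within one statement, both calls return the same cached timestamp.
first, second = execute(lambda ctx: (ctx.now(), ctx.now()))
assert first is second
```

A second call to `execute` builds a new `StatementContext`, so it reads the clock again instead of returning the earlier value.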

Maybe `file_extension` could be made an `Option`, defaulting to `None`?

This is one expensive workaround that we could remove with additional work in cuDF:

```scala
// if the last entry in a column is incomplete or invalid,...
```

I added some debug logging to show the size of the inputs being passed to `readJSON` in my perf test and see two tasks both trying to allocate ~500 MB...

The earlier OOM was happening when running on a workstation with an RTX 3080, which only has 10 GB of RAM, so I am not convinced that this is really an issue....

@revans2 I could use a sanity check on my conclusions here before closing this issue. Also, let me know if there are other benchmarks that you would like to see.

The output shows that an input of `{"a": {"b":"md"} }` produces the same results between CPU and GPU:

```
[2024-01-31T20:47:58.663Z] Row(a='{"a": {"b":"md"} }', from_json(a)=Row(a='{"b":"md"}'))
```

But an almost identical input...

Also, my manual test is using `show`; if I run `collect`, then I do see the same results. I think the `show` issue is already tracked in https://github.com/NVIDIA/spark-rapids/issues/8558

> Note that this failure was from a distributed cluster setup, so the nature of the failure may have something to do with how the input data is partitioned across...

Substrait support is now in DataFusion, so I plan on working on this soon.