Andy Grove comments

Results: 657 comments of Andy Grove

once now() is called in a statement, it forever returns the same value

> It's correctly caching now() so it returns the same value when used multiple times in the same query, but it shouldn't be caching the value past statement execution, correct?...

Files without `.parquet`, `.csv` extension inferred as having no schema

Maybe `file_extension` could be made an `Option`, defaulting to `None`?

[FEA] Improve memory efficiency of from_json

This is one expensive workaround that we have, that we could remove with additional work in cuDF: ```scala // if the last entry in a column is incomplete or invalid,...

[FEA] Improve memory efficiency of from_json

I added some debug logging to show the size of the inputs being passed to `readJSON` in my perf test and see two tasks both trying to allocate ~500 MB...

[FEA] Improve memory efficiency of from_json

The earlier OOM was happening when running on a workstation with an RTX 3080 which only has 10GB RAM so I am not convinced that this is really an issue....

[FEA] Improve memory efficiency of from_json

@revans2 I could use a sanity check on my conclusions here before closing this issue. Also, let me know if there are other benchmarks that you would like to see.

[BUG] test_from_json_mixed_types_list_struct failed

The output shows that an input of `{"a": {"b":"md"} }` produces the same results between CPU and GPU:

```
[2024-01-31T20:47:58.663Z] Row(a='{"a": {"b":"md"} }', from_json(a)=Row(a='{"b":"md"}'))
```

But an almost identical input...

[BUG] test_from_json_mixed_types_list_struct failed

Also, my manual test is using `show` ... if I run `collect` then I do see the same results. I think the `show` issue is already known under issue https://github.com/NVIDIA/spark-rapids/issues/8558

[BUG] test_from_json_mixed_types_list_struct failed

> Note that this failure was from a distributed cluster setup, so the nature of the failure may have something to do with how the input data is partitioned across...

[EPIC] Add support for Substrait

Substrait support is now in DataFusion, so I plan on working on this soon