
Enable `View` types

Open milenkovicm opened this issue 4 months ago • 7 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

View types were disabled in #1185 because of Arrow IPC view type serialisation issues (#1182).

Describe the solution you'd like

As pointed out by @mbutrovich in https://discord.com/channels/885562378132000778/1408110811637088286/1408118010161401958, view type arrays can be garbage collected before the Arrow IPC writer is called. His implementation can be seen at https://github.com/mbutrovich/datafusion-comet/commit/a8cbfb7a7c6071b61f3bf0ab28e5d87fdf8a5639
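To illustrate what "garbage collecting" a view array means here, the following is a minimal sketch using a simplified stand-in type, not the arrow-rs API. It assumes the relevant effect is that views can keep a large shared buffer alive while referencing only a small part of it, so a serializer that writes whole buffers emits far more bytes than needed. The `ViewArray` struct, its fields, and `serialized_size` are hypothetical names introduced only for this example. Real Arrow view layouts also inline short values, which this model ignores.

```rust
/// Hypothetical, simplified model of a view array (NOT the arrow-rs type).
/// Each view is an (offset, length) pair pointing into one shared buffer.
#[derive(Debug, Clone)]
struct ViewArray {
    /// Shared data buffer; may contain bytes that no view references.
    buffer: Vec<u8>,
    /// (offset, length) of each value inside `buffer`.
    views: Vec<(usize, usize)>,
}

impl ViewArray {
    /// Returns the bytes of the i-th value.
    fn value(&self, i: usize) -> &[u8] {
        let (off, len) = self.views[i];
        &self.buffer[off..off + len]
    }

    /// Bytes a serializer would emit if it writes the whole buffer as-is.
    fn serialized_size(&self) -> usize {
        self.buffer.len()
    }

    /// Compacts the array: copies only the referenced bytes into a fresh
    /// buffer and rewrites the views to point into it.
    fn gc(&self) -> ViewArray {
        let total: usize = self.views.iter().map(|&(_, len)| len).sum();
        let mut buffer = Vec::with_capacity(total);
        let mut views = Vec::with_capacity(self.views.len());
        for &(off, len) in &self.views {
            let new_off = buffer.len();
            buffer.extend_from_slice(&self.buffer[off..off + len]);
            views.push((new_off, len));
        }
        ViewArray { buffer, views }
    }
}
```

In this model, calling `gc()` before writing keeps every value readable while shrinking the buffer to exactly the referenced bytes. In arrow-rs the comparable operation is the `gc` method on the byte view array types.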

Describe alternatives you've considered

Additional context

  • https://github.com/apache/arrow-rs/issues/7185

milenkovicm avatar Aug 21 '25 20:08 milenkovicm

@Huy1Ng would you be interested in giving it a try?

milenkovicm avatar Aug 24 '25 09:08 milenkovicm

Sure, I can take this issue.

Huy1Ng avatar Aug 24 '25 10:08 Huy1Ng

Thanks @Huy1Ng. View types have been disabled at https://github.com/milenkovicm/datafusion-ballista/blob/85411550f92c2f8c6e21848619420cef61622b74/ballista/core/src/extension.rs#L393-L394

I believe ShuffleWriter should be changed to accommodate this. I'm not sure how hard it would be to test this change (if it is possible at all).

milenkovicm avatar Aug 24 '25 13:08 milenkovicm

I tried to replicate the issue in #1182 with schema_force_view_types=true, but no error was raised. @andygrove, can you share your configuration? Mine is:

podman run -v `pwd`/data:/data -it --rm --platform linux/amd64 ghcr.io/scalytics/tpch-docker:main -vf -s 100

podman run -v `pwd`/data:/data -it --entrypoint /bin/bash --rm --platform linux/amd64 ghcr.io/scalytics/tpch-docker:main -c "cp /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"

cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 2
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 10
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 16
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 20

Huy1Ng avatar Sep 06 '25 00:09 Huy1Ng

This issue might be a corner case that is hard to test.

milenkovicm avatar Sep 06 '25 05:09 milenkovicm

I was able to reproduce the problem with TPC-H at -s 1, one scheduler and two executors, query 10, and the Parquet format:

# cargo install tpchgen-cli 
tpchgen-cli -s 1 --format=parquet --output-dir data 
cargo run --release --bin tpch -- benchmark ballista --host localhost --port 50050 --path $(pwd)/data --format parquet --query 10

As for the solution, I'm not sure how best to approach it. Should gc be called by DataFusion or by Arrow? Also, should we call gc every time? Polars runs a heuristic to decide whether gc is appropriate: https://github.com/orlp/polars/blob/bec54ddba1cad1caef2f74afb1d0d3f283b1391d/crates/polars-arrow/src/array/binview/mod.rs#L371

I'm tempted to go with that. Does anyone have a second opinion?
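For discussion, a heuristic of this kind could look roughly like the sketch below. It is not a copy of the Polars logic linked above: the function name `should_gc`, both thresholds, and the exact conditions are assumptions chosen for illustration. The idea is to compact only when the absolute savings are large enough to matter and the buffers are several times larger than the bytes the views actually reference.

```rust
/// Hypothetical thresholds, chosen for illustration only.
/// Minimum number of bytes that compaction must be able to free.
const MIN_SAVINGS_BYTES: usize = 16 * 1024;
/// Buffers must be at least this many times larger than the referenced bytes.
const MIN_WASTE_RATIO: usize = 4;

/// Decides whether compacting a view array is likely worthwhile.
///
/// `buffer_bytes` is the total size of the data buffers held by the array;
/// `referenced_bytes` is the number of bytes the views actually point at.
fn should_gc(buffer_bytes: usize, referenced_bytes: usize) -> bool {
    // Upper bound on what compaction could free.
    let savings = buffer_bytes.saturating_sub(referenced_bytes);
    savings >= MIN_SAVINGS_BYTES
        && buffer_bytes >= MIN_WASTE_RATIO.saturating_mul(referenced_bytes)
}
```

With these example thresholds, a batch holding 1 MB of buffers but referencing only 10 KB would be compacted, while one referencing 900 KB of the same buffers would be written as-is. The cost of computing `referenced_bytes` is a pass over the views, which is cheap compared with copying the data.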

Huy1Ng avatar Sep 29 '25 10:09 Huy1Ng

If I'm not mistaken, this should be part of the shuffle writer, since that is where we get the batches to write to the shuffle file. If I remember correctly, that is where the issue was (correct me if I'm wrong). I don't mind adding heuristics if you believe they are needed. Also, gc can be skipped when view types are disabled via datafusion.execution.parquet.schema_force_view_types, as they are at the moment.

milenkovicm avatar Sep 29 '25 13:09 milenkovicm