Enable `View` types
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
View types were disabled in #1185 due to Arrow IPC view type serialisation issues (#1182).
Describe the solution you'd like
As pointed out by @mbutrovich in https://discord.com/channels/885562378132000778/1408110811637088286/1408118010161401958, view type arrays can be garbage collected before the Arrow IPC writer is called. His implementation can be seen at https://github.com/mbutrovich/datafusion-comet/commit/a8cbfb7a7c6071b61f3bf0ab28e5d87fdf8a5639
Describe alternatives you've considered
Additional context
- https://github.com/apache/arrow-rs/issues/7185
@Huy1Ng would you be interested in giving it a try?
Sure, I can take this issue.
Thanks @Huy1Ng. View types have been disabled at https://github.com/milenkovicm/datafusion-ballista/blob/85411550f92c2f8c6e21848619420cef61622b74/ballista/core/src/extension.rs#L393-L394
I believe ShuffleWriter should be changed to accommodate this. I'm not sure how hard it would be to test this change (if it is possible at all).
I tried to replicate the issue in #1182 with schema_force_view_types=true, but no error was raised. @andygrove, can you share your configuration? Mine is:
podman run -v `pwd`/data:/data -it --rm --platform linux/amd64 ghcr.io/scalytics/tpch-docker:main -vf -s 100
podman run -v `pwd`/data:/data -it --entrypoint /bin/bash --rm --platform linux/amd64 ghcr.io/scalytics/tpch-docker:main -c "cp /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/"
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 2
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 10
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 16
cargo run --release --bin tpch -- benchmark datafusion --iterations 1 --path ./data --format tbl --query 20
This issue might be a corner case that is hard to test.
I was able to reproduce the problem with TPC-H at scale factor 1 (-s 1), 1 scheduler + 2 executors, query 10, and the parquet format:
# cargo install tpchgen-cli
tpchgen-cli -s 1 --format=parquet --output-dir data
cargo run --release --bin tpch -- benchmark ballista --host localhost --port 50050 --path $(pwd)/data --format parquet --query 10
As for the solution, I'm not sure how best to approach the problem. Should gc be called by datafusion, or by arrow? Also, should we call gc every time? Polars runs a heuristic to determine whether gc is appropriate:
https://github.com/orlp/polars/blob/bec54ddba1cad1caef2f74afb1d0d3f283b1391d/crates/polars-arrow/src/array/binview/mod.rs#L371
I'm tempted to go with that. Is there any second opinion?
If I'm not mistaken, this should be part of the shuffle writer, since that is where we get the batches to write to the shuffle file. If I remember correctly, that is where the issue was (correct me if I'm wrong). I don't mind adding heuristics if you believe they're needed. Also, gc can be skipped when view types are disabled via `datafusion.execution.parquet.schema_force_view_types`, as they are at the moment.
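To make the heuristic question concrete, here is a minimal std-only sketch of a "should we gc this view column?" check that a shuffle writer could consult before handing a batch to Arrow IPC. It is loosely inspired by the idea behind the Polars link above, not a copy of it. `ViewColumnStats`, `should_gc`, `MIN_SAVINGS` and `MIN_BLOAT_FACTOR` are hypothetical names and thresholds chosen for illustration; none of them are arrow-rs, DataFusion or Ballista APIs. The only facts assumed from the Arrow format are that each view occupies 16 bytes and that values of up to 12 bytes are stored inline.

```rust
/// Size of one view in an Arrow view array (16 bytes per element).
const VIEW_SIZE: usize = 16;
/// Values of at most 12 bytes are stored inline in the view itself.
const MAX_INLINE_LEN: usize = 12;
/// Assumed threshold: skip gc unless at least 16 KiB could be freed.
const MIN_SAVINGS: usize = 16 * 1024;
/// Assumed threshold: only gc if the array is at least 4x larger than needed.
const MIN_BLOAT_FACTOR: usize = 4;

/// Summary statistics for one view column (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct ViewColumnStats {
    /// Number of elements (views) in the array.
    pub len: usize,
    /// Sum of the lengths of all values referenced by the views.
    pub total_value_bytes: usize,
    /// Sum of the sizes of all data buffers the array keeps alive.
    pub total_buffer_bytes: usize,
}

/// Returns true if compacting the column looks worthwhile.
/// `view_types_enabled` stands in for the schema_force_view_types setting:
/// when view types are off there is nothing to compact.
pub fn should_gc(stats: ViewColumnStats, view_types_enabled: bool) -> bool {
    if !view_types_enabled || stats.total_buffer_bytes <= MIN_SAVINGS {
        return false;
    }
    // Lower bound on buffer bytes still needed after compaction: pretend
    // every value has as many bytes as possible stored inline.
    let needed_lower_bound = stats
        .total_value_bytes
        .saturating_sub(stats.len * MAX_INLINE_LEN);
    let views_bytes = stats.len * VIEW_SIZE;
    let after_gc_lower_bound = views_bytes + needed_lower_bound;
    let current = views_bytes + stats.total_buffer_bytes;
    // Upper bound on what compaction could free.
    let savings_upper_bound = current.saturating_sub(after_gc_lower_bound);
    savings_upper_bound >= MIN_SAVINGS
        && current >= MIN_BLOAT_FACTOR * after_gc_lower_bound
}
```

With these assumed thresholds, a 100-element slice whose values total 3,200 bytes but which still pins a 1 MiB buffer would be compacted, while a dense 10,000-element column whose buffers hold exactly its 320,000 bytes of values would be left alone. Whether these thresholds suit shuffle workloads would need to be measured.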