
Ballista does not support external file systems

Open ZhangqyTJ opened this issue 4 years ago • 6 comments

I added an s3 (minio_store) module in datafusion/src/datasource/object_store and registered the minio_store in benchmarks/tpch.rs through the register_object_store() method of ExecutionContext. But when I start the Scheduler and Executor and then run `cargo run --bin tpch --release`, the data in minio cannot be read. After checking the code, I found that LocalFileSystem is used directly at ballista/rust/core/src/serde/physical_plan/from_proto.rs(789) and ballista/rust/core/src/serde/logical_plan/from_proto.rs(201). I changed these two places to use minio_store and it ran successfully. How can Ballista support external file systems?
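The fix described above amounts to choosing the object store from the table path instead of always constructing a LocalFileSystem during deserialization. A minimal, self-contained sketch of that idea (the names `ObjectStoreKind` and `resolve_store_kind` are illustrative, not part of the DataFusion/Ballista API):

```rust
// Hypothetical sketch: pick an object store from the path's URI scheme
// instead of hardcoding LocalFileSystem in from_proto.rs.

#[derive(Debug, PartialEq)]
enum ObjectStoreKind {
    LocalFileSystem,
    S3, // e.g. a MinIO-backed store
}

fn resolve_store_kind(path: &str) -> ObjectStoreKind {
    // Split off the URI scheme, defaulting to the local file system
    // when no scheme is present (e.g. "/data/tpch/lineitem.tbl").
    match path.split_once("://").map(|(scheme, _)| scheme) {
        Some("s3") => ObjectStoreKind::S3,
        _ => ObjectStoreKind::LocalFileSystem,
    }
}

fn main() {
    assert_eq!(
        resolve_store_kind("s3://test1/tpch_tbl/cutdata"),
        ObjectStoreKind::S3
    );
    assert_eq!(
        resolve_store_kind("/tmp/data.tbl"),
        ObjectStoreKind::LocalFileSystem
    );
}
```

In the real codebase the resolved kind would map to a registered `ObjectStore` implementation rather than an enum, but the dispatch point is the same.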

The project with minio_store added: https://github.com/ZhangqyTJ/arrow-datafusion.git

Before running, modify the code at ballista/rust/core/src/serde/physical_plan/from_proto.rs(789) and ballista/rust/core/src/serde/logical_plan/from_proto.rs(201).

To run the scheduler from source:

cd $ARROW_HOME/ballista/rust/scheduler
RUST_LOG=info cargo run --release

By default the scheduler will bind to 0.0.0.0 and listen on port 50050.

To run the executor from source:

cd $ARROW_HOME/ballista/rust/executor
RUST_LOG=info cargo run --release

To run the benchmarks:

    cargo run --bin tpch --release benchmark ballista --host localhost --port 50050 --query 1 --partitions 1 --path s3://test1/tpch_tbl/cutdata --format tbl --storage-type minio --endpoint 192.168.75.81:9091 --username minioadmin --password minioadmin --bucket test1 

ZhangqyTJ avatar Dec 08 '21 03:12 ZhangqyTJ

Hi @ZhangqyTJ, you are right that Ballista is currently hardcoded to use the local file system object store. Supporting remote file systems is on our roadmap, and thanks to the object store abstraction we already have all the infrastructure in place to execute on this. As you mentioned, the main change that needs to be introduced is in the plan ser/de layer. You are more than welcome to work on this if you would like; I don't think anyone is actively working on it at the moment.

houqp avatar Dec 08 '21 06:12 houqp

We had some prior discussions around this subject in https://github.com/apache/arrow-datafusion/pull/1072

houqp avatar Dec 08 '21 06:12 houqp

I will try to solve this issue

ZhangqyTJ avatar Dec 08 '21 08:12 ZhangqyTJ

I think I have resolved this issue:
(1) Add an 'object_store_str' field to ListingTableScanNode to store the serialized object store.
(2) Serialize the object store in to_proto.rs, deserialize it in from_proto.rs, and perform type matching through ObjectStoreEnum.

But I found a new issue. When I tested reading minio's .tbl data it succeeded, but reading a .parquet file gave an error: ParquetError("Parquet error: underlying IO error: failed to fill whole buffer"). I don't know the cause of the error.
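The approach described in this comment can be sketched as a string round-trip: encode the store's identity into the new protobuf field on serialization, then match it back to a concrete store on deserialization. A self-contained sketch, where the variant names and the `minio:<endpoint>` encoding are assumptions for illustration only:

```rust
// Hypothetical sketch of carrying object store identity through the plan
// protobuf as a string field, matching the 'object_store_str' idea above.

#[derive(Debug, PartialEq)]
enum ObjectStoreEnum {
    Local,
    Minio { endpoint: String },
}

// to_proto side: encode the store for the ListingTableScanNode field.
fn store_to_proto(store: &ObjectStoreEnum) -> String {
    match store {
        ObjectStoreEnum::Local => "local".to_string(),
        ObjectStoreEnum::Minio { endpoint } => format!("minio:{}", endpoint),
    }
}

// from_proto side: match the string back to a concrete store.
fn store_from_proto(s: &str) -> ObjectStoreEnum {
    match s.split_once(':') {
        Some(("minio", endpoint)) => ObjectStoreEnum::Minio {
            endpoint: endpoint.to_string(),
        },
        _ => ObjectStoreEnum::Local,
    }
}

fn main() {
    let store = ObjectStoreEnum::Minio {
        endpoint: "192.168.75.81:9091".to_string(),
    };
    // The round trip must be lossless so the executor rebuilds the same store.
    assert_eq!(store_from_proto(&store_to_proto(&store)), store);
    assert_eq!(store_from_proto("local"), ObjectStoreEnum::Local);
}
```

In practice the deserialized value would be looked up in the context's object store registry rather than constructed directly, but the encode/match structure is the same.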

ZhangqyTJ avatar Dec 14 '21 03:12 ZhangqyTJ

Could you share a reproducible code example with us? It would be really hard for us to locate the problem with a generic error message :)

houqp avatar Dec 14 '21 05:12 houqp

Project address: https://github.com/ZhangqyTJ/arrow-datafusion.git
Test code file: datafusion/src/physical_plan/file_format/parquet.rs
Test function: test_minio_parquet()
Steps:
(1) Set up a minio service: https://docs.min.io/docs/minio-quickstart-guide.html
(2) Create a bucket in minio: open http://ip:port in a browser (default user name and password: minioadmin), then click 'Object Browser' -> 'Create bucket'
(3) Modify the parameters in test_minio_parquet(): s3_options, data_rows
(4) Execute test_minio_parquet()
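One possible lead on the error itself: "failed to fill whole buffer" is the message `std::io::Read::read_exact` produces when the underlying reader yields fewer bytes than requested. Since Parquet readers issue exact-length ranged reads (unlike the sequential reads used for .tbl files), this often points at the object store's ranged read returning a short or truncated range. A minimal reproduction of the same error, independent of minio:

```rust
use std::io::{Cursor, Read};

fn main() {
    // Simulate a range read that came back short: 4 bytes available...
    let mut reader = Cursor::new(vec![0u8; 4]);
    // ...while the Parquet reader asks for 8 bytes exactly.
    let mut buf = [0u8; 8];
    let err = reader.read_exact(&mut buf).unwrap_err();
    // This is the same underlying IO error surfaced by ParquetError above.
    assert!(err.to_string().contains("failed to fill whole buffer"));
}
```

If this is the cause, checking that the minio_store's ranged-read implementation honors the requested start offset and length would be a good first step.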

ZhangqyTJ avatar Dec 14 '21 07:12 ZhangqyTJ