
Support more file formats

Open kettlelinna opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe. Blaze only supports the Parquet file format so far, through a customized implementation, but DataFusion already implements a Parquet source.

Describe the solution you'd like Can we use DataFusion's reader interface? I think it would be easier to extend. By the way, DataFusion already provides multiple readers.

kettlelinna avatar Nov 09 '23 03:11 kettlelinna

The customized ParquetExec is designed for reading data directly from HDFS via JNI (we don't use object-store or libhdfs because they are too hard to use in a production environment). I don't think DataFusion's Reader interface outperforms the current ExecutionPlan/SendableRecordBatchStream implementation, and I'm not attracted to DataFusion's built-in formats (like CSV and JSON), as they are not widely used in Spark.

richox avatar Nov 09 '23 11:11 richox

Yeah, I see. But it is hard to extend the data sources now, because there is no extension interface for that. If we used DataFusion's reader, we could easily add data sources such as Delta Lake, Avro, etc.

kettlelinna avatar Nov 09 '23 12:11 kettlelinna

Yeah, I see. But it is hard to extend the data sources now, because there is no extension interface for that. If we used DataFusion's reader, we could easily add data sources such as Delta Lake, Avro, etc.

It would be hard. Different formats have a lot of specialized logic for reading data, such as pruning, data type conversion, delimiting, and so on. I don't have any idea yet of how to design an input format interface.
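
For discussion only, one possible shape of such an input format interface is sketched below in Rust. Everything in it (`InputFormat`, `ScanOptions`, `Batch`, `DelimitedText`, `FormatRegistry`) is a hypothetical name and not part of Blaze or DataFusion. `Batch` is only a stand-in for an Arrow record batch, and of the format-specific concerns mentioned above, only delimiting and column projection are modeled; pruning and data type conversion are left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow record batch: column names plus string rows.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<String>>,
}

/// Hypothetical scan options shared by all formats (projection only).
pub struct ScanOptions {
    pub projection: Option<Vec<String>>,
}

/// Hypothetical extension point: one implementation per file format.
pub trait InputFormat {
    fn name(&self) -> &'static str;
    fn scan(&self, data: &[u8], opts: &ScanOptions) -> Result<Batch, String>;
}

/// Toy delimited-text format with a header line, used only to exercise the trait.
pub struct DelimitedText {
    pub delimiter: char,
}

impl InputFormat for DelimitedText {
    fn name(&self) -> &'static str {
        "delimited"
    }

    fn scan(&self, data: &[u8], opts: &ScanOptions) -> Result<Batch, String> {
        let text = std::str::from_utf8(data).map_err(|e| e.to_string())?;
        let mut lines = text.lines().filter(|l| !l.is_empty());
        let header: Vec<String> = lines
            .next()
            .ok_or_else(|| "empty input".to_string())?
            .split(self.delimiter)
            .map(str::to_string)
            .collect();

        // Resolve the projected column names to header indices.
        let indices: Vec<usize> = match &opts.projection {
            None => (0..header.len()).collect(),
            Some(cols) => cols
                .iter()
                .map(|c| {
                    header
                        .iter()
                        .position(|h| h == c)
                        .ok_or_else(|| format!("unknown column: {}", c))
                })
                .collect::<Result<Vec<usize>, String>>()?,
        };

        let mut rows: Vec<Vec<String>> = Vec::new();
        for line in lines {
            let fields: Vec<&str> = line.split(self.delimiter).collect();
            if fields.len() != header.len() {
                return Err(format!(
                    "expected {} fields, got {}",
                    header.len(),
                    fields.len()
                ));
            }
            rows.push(indices.iter().map(|&i| fields[i].to_string()).collect());
        }

        Ok(Batch {
            columns: indices.iter().map(|&i| header[i].clone()).collect(),
            rows,
        })
    }
}

/// Hypothetical registry that dispatches a scan to a format by name.
#[derive(Default)]
pub struct FormatRegistry {
    formats: HashMap<String, Box<dyn InputFormat>>,
}

impl FormatRegistry {
    pub fn register(&mut self, format: Box<dyn InputFormat>) {
        self.formats.insert(format.name().to_string(), format);
    }

    pub fn scan(&self, name: &str, data: &[u8], opts: &ScanOptions) -> Result<Batch, String> {
        self.formats
            .get(name)
            .ok_or_else(|| format!("unsupported format: {}", name))?
            .scan(data, opts)
    }
}
```

A caller would register `DelimitedText { delimiter: ',' }` once and then call `registry.scan("delimited", bytes, &opts)`. The sketch keeps the trait small on purpose; the hard part described above (format-specific pruning and type handling) would still have to live inside each implementation.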

richox avatar Nov 10 '23 13:11 richox

related to #498

richox avatar Jul 04 '24 09:07 richox