
Add support for reading columns as Apache Arrow arrays

Open andygrove opened this issue 6 years ago • 14 comments

This is going to be easy. I have some ugly prototype code working already.

    // Assumes `use std::{fs::File, mem, path::Path, slice};` plus the
    // parquet reader imports and arrow's memory/buffer/array modules.
    let path = Path::new(&args[1]);
    let file = File::open(&path).unwrap();
    let parquet_reader = SerializedFileReader::new(file).unwrap();

    let row_group_reader = parquet_reader.get_row_group(0).unwrap();

    for i in 0..row_group_reader.num_columns() {
        match row_group_reader.get_column_reader(i) {
            Ok(ColumnReader::Int32ColumnReader(ref mut r)) => {
                // Allocate an aligned buffer that Arrow can take over, and
                // view it as a mutable i32 slice for read_batch to fill.
                let batch_size = 1024;
                let sz = mem::size_of::<i32>();
                let p = memory::allocate_aligned((batch_size * sz) as i64).unwrap();
                let ptr_i32 = unsafe { mem::transmute::<*const u8, *mut i32>(p) };
                let buf = unsafe { slice::from_raw_parts_mut(ptr_i32, batch_size) };

                match r.read_batch(batch_size, None, None, buf) {
                    Ok((count, _)) => {
                        // Wrap the filled buffer as an Arrow array of `count` values.
                        let arrow_buffer = Buffer::from_raw_parts(ptr_i32, count as i32);
                        let arrow_array = Array::from(arrow_buffer);

                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }
            }
            _ => println!("column type not supported"),
        }
    }

Now we need to come up with a real design and I probably need to add more helper methods to Arrow to make this easier.
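For example, a helper along these lines could hide the unsafe buffer handling behind one call (hypothetical name and signature, with module paths from memory; not a committed design):

    use arrow::array::Array;
    use arrow::builder::Builder;
    use parquet::column::reader::ColumnReader;
    use parquet::errors::ParquetError;
    use parquet::file::reader::RowGroupReader;

    /// Hypothetical helper: read one column chunk into an Arrow array.
    fn read_column_as_array(
        row_group: &RowGroupReader,
        column_index: usize,
        batch_size: usize,
    ) -> Result<Array, ParquetError> {
        match row_group.get_column_reader(column_index)? {
            ColumnReader::Int32ColumnReader(mut r) => {
                // Let the builder own the buffer; no raw pointers needed.
                let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
                let (count, _) =
                    r.read_batch(batch_size, None, None, builder.slice_mut(0, batch_size))?;
                builder.set_len(count);
                Ok(Array::from(builder.finish()))
            }
            // ... one arm per physical type (Int64, Float, Double, ByteArray, ...)
            _ => unimplemented!("column type not supported yet"),
        }
    }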

andygrove · Apr 06 '18 13:04

Here is a cleaned-up version:

                let mut builder: Builder<i32> = Builder::with_capacity(batch_size);
                match r.read_batch(batch_size, None, None, builder.slice_mut(0, batch_size)) {
                    Ok((count, _)) => {
                        // Truncate the builder to the values actually read, then
                        // hand its buffer to Arrow with no raw pointer juggling.
                        builder.set_len(count);
                        let arrow_array = Array::from(builder.finish());
                        match arrow_array.data() {
                            ArrayData::Int32(b) => {
                                println!("len: {}", b.len());
                                println!("data: {:?}", b.iter().collect::<Vec<i32>>());
                            }
                            _ => println!("wrong type"),
                        }
                    }
                    _ => println!("error"),
                }

andygrove · Apr 06 '18 13:04

I've made good progress integrating parquet-rs with DataFusion. I have examples like this working:

    let mut ctx = ExecutionContext::local();

    let df = ctx.load_parquet("test/data/uk_cities.parquet/part-00000-bdf0c245-d300-4b28-bfdd-0f1f9cb898c4-c000.snappy.parquet").unwrap();

    ctx.register("uk_cities", df);

    // define the SQL statement
    let sql = "SELECT lat, lng FROM uk_cities";

    // create a data frame
    let df = ctx.sql(&sql).unwrap();

    df.show(10);

It only works for int32 and f32 columns so far, though.
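Supporting more types is just a matter of extending the per-physical-type dispatch in the reader; roughly this shape (illustrative only, not the actual DataFusion code):

    // Each Parquet physical type needs an arm that reads into the matching
    // Arrow builder. Only the first two exist today.
    match row_group_reader.get_column_reader(i) {
        Ok(ColumnReader::Int32ColumnReader(_)) => { /* Builder<i32> -> ArrayData::Int32 */ }
        Ok(ColumnReader::FloatColumnReader(_)) => { /* Builder<f32> -> ArrayData::Float32 */ }
        // still to do: Bool, Int64, Double, ByteArray (strings), ...
        _ => println!("column type not supported"),
    }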

andygrove · Apr 07 '18 17:04

Nice progress @andygrove! Really glad you can read Parquet using SQL now.

sunchao · Apr 07 '18 19:04

Looks great! I am curious how Arrow handles/maps optional or repeated fields (when definition and/or repetition levels exist).
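For an optional column, for instance, I'd expect the definition levels from read_batch to have to become an Arrow null bitmap, something like this (a rough sketch of what I mean, reusing the Int32 reader `r` from the snippets above):

    // Optional (nullable) INT32 column: max definition level is 1.
    let batch_size = 1024;
    let mut def_levels = vec![0i16; batch_size];
    let mut values = vec![0i32; batch_size];
    let (values_read, levels_read) = r
        .read_batch(batch_size, Some(&mut def_levels[..]), None, &mut values)
        .unwrap();
    // A slot is null where its definition level is below the max; Arrow
    // would record that in a validity bitmap alongside the values.
    let validity: Vec<bool> = def_levels[..levels_read].iter().map(|&l| l == 1).collect();
    // Note: `values` holds only `values_read` non-null entries, so they still
    // have to be spaced out according to `validity` before building the array.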

sadikovi · Apr 08 '18 01:04

I have merged the current Parquet support to master in DataFusion, and I have added examples for both the DataFrame and SQL APIs.

https://github.com/datafusion-rs/datafusion-rs/tree/master/examples

This is very rough code, and I won't be promoting the Parquet support until it is a little more complete and tested.

andygrove · Apr 08 '18 16:04

@andygrove @sunchao Are you guys working on this? I'm working on an implementation that uses the C++ version as a reference.

liurenjie1024 · Oct 10 '18 08:10

@liurenjie1024 are you working on the arrow part or the parquet-rs part? Some more work needs to be done in the arrow repo so that parquet-rs can read into the Arrow format. I'm working (slowly) on that part.

sunchao · Oct 10 '18 09:10

I'm working on the arrow part. Yes, only part of the C++ version can be implemented for now. I'll work out an early version.

liurenjie1024 · Oct 10 '18 11:10

This is unrelated, but I've seen that CSV readers are being implemented in Arrow (I think Python and C++ have them, and there's a Go one in an open PR).

BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV-to-Arrow reader, which I think would also extend nicely to Parquet.

@sunchao @andygrove, do you know if there's been discussion of something like this? Also, would some codegen be required? It's something I'd love to try to contribute to in the coming weeks.

nevi-me · Oct 31 '18 20:10

> BurntSushi's rust-csv got me wondering whether it'd be possible to implement a native CSV-to-Arrow reader, which I think would also extend nicely to Parquet.

Yes, it is certainly doable, but it needs to be implemented in the arrow repo. Some pieces may still be missing, and you're welcome to work on them there :)

Also, I'm not sure how this extends to Parquet. Can you elaborate?

sunchao · Oct 31 '18 20:10

Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (I didn't see one).

The extension to Parquet was more conceptual than anything: if I can read a CSV into Arrow, I'd be able to get csv -> parquet working. I understand that CSV is a "simpler" format due to its flat nature.

Also, I haven't checked whether datafusion-rs supports csv and parquet as sinks, so that one could run something like `create table my_table<parquet> as select * from input<csv>`. A lot of my curiosity comes from having to watch paint dry at work while I use Spark, so I've been thinking about what I'd need in order to replace some of my workflow with Rust.
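For the first leg of that pipeline, I imagine something like this (a very rough sketch using BurntSushi's csv crate; the Arrow and Parquet sides are hand-waved since those APIs are still settling):

    extern crate csv;

    use std::error::Error;

    // Pull one f64 column out of a CSV file. From here the Vec would go
    // into an Arrow builder, and an Arrow -> Parquet writer would complete
    // the csv -> parquet path.
    fn read_f64_column(path: &str, col: usize) -> Result<Vec<f64>, Box<Error>> {
        let mut rdr = csv::Reader::from_path(path)?;
        let mut values = Vec::new();
        for record in rdr.records() {
            let record = record?;
            values.push(record[col].parse::<f64>()?);
        }
        Ok(values)
    }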

nevi-me · Oct 31 '18 20:10

> Yes, I've been following the Rust impl in Arrow. When I'm ready, I'll ask about it on the mailing list before opening a JIRA (I didn't see one).

Yes feel free to create a JIRA :)

> The extension to Parquet was more conceptual than anything: if I can read a CSV into Arrow, I'd be able to get csv -> parquet working. I understand that CSV is a "simpler" format due to its flat nature.

I see. Yes, I think we've discussed creating a CSV -> parquet converter multiple times (cc @sadikovi). It would be convenient; we should certainly do it.

sunchao · Oct 31 '18 20:10

I'm working on an Arrow reader implementation and have finished the first step, converting the Parquet schema to an Arrow schema, in this PR. Please help review it.
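The core of that conversion is a mapping from Parquet physical/logical types to Arrow data types, roughly like this (illustrative only; see the PR for the real code, which also has to deal with nested groups and repetition):

    use arrow::datatypes::DataType;
    use parquet::basic::{LogicalType, Type as PhysicalType};

    // Illustrative mapping for a few primitive types.
    fn to_arrow_type(physical: PhysicalType, logical: LogicalType) -> DataType {
        match (physical, logical) {
            (PhysicalType::BOOLEAN, _) => DataType::Boolean,
            (PhysicalType::INT32, _) => DataType::Int32,
            (PhysicalType::INT64, _) => DataType::Int64,
            (PhysicalType::FLOAT, _) => DataType::Float32,
            (PhysicalType::DOUBLE, _) => DataType::Float64,
            (PhysicalType::BYTE_ARRAY, LogicalType::UTF8) => DataType::Utf8,
            _ => unimplemented!("type not mapped yet"),
        }
    }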

liurenjie1024 · Nov 06 '18 01:11

The umbrella task is #186. Please watch that issue for progress on this matter.

sunchao · Nov 07 '18 18:11