
Plan to support recursive data structures?

Open MichaelChirico opened this issue 4 years ago • 6 comments

Many of my common use cases store map and array data types. It would be great if miniparquet could read Parquet files that contain such columns.

Is this out of scope?

MichaelChirico avatar Sep 24 '19 10:09 MichaelChirico

Are they stored as nested tables or more complex values? Also, can you provide some sample files please?

hannes avatar Sep 24 '19 12:09 hannes

I'm not sure how to answer about their storage, but the Hive type is array and/or map. Though those types are potentially recursive (and hence highly complex), I've only used one-level complexity (e.g. array(int) or map(int, varchar)).
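On the storage question, here is a rough sketch of how a one-level array column is typically flattened in Parquet. It assumes the standard three-level LIST layout from the Parquet format spec (`optional group arr (LIST) { repeated group list { optional double element; } }`), where each stored value carries a repetition level and a definition level. The function names `encode_list_column` and `decode_list_column` are made up for illustration and are not part of miniparquet or any Parquet library; the files Spark actually writes may differ in details such as whether elements are nullable.

```python
from typing import List, Optional, Tuple

# One row of an array(double) column: a list of nullable doubles, or NULL.
Row = Optional[List[Optional[float]]]
# (repetition level, definition level, value) for one stored entry.
Triple = Tuple[int, int, Optional[float]]


def encode_list_column(rows: List[Row]) -> List[Triple]:
    """Hypothetical helper: flatten rows into Dremel-style level triples.

    Definition levels under the assumed schema:
      0 = the list itself is NULL
      1 = the list exists but is empty
      2 = an element slot exists but the element is NULL
      3 = the element is present
    Repetition level is 0 for the first entry of a row, 1 for later entries.
    """
    out: List[Triple] = []
    for row in rows:
        if row is None:
            out.append((0, 0, None))
        elif len(row) == 0:
            out.append((0, 1, None))
        else:
            for i, item in enumerate(row):
                rep = 0 if i == 0 else 1
                if item is None:
                    out.append((rep, 2, None))
                else:
                    out.append((rep, 3, item))
    return out


def decode_list_column(triples: List[Triple]) -> List[Row]:
    """Hypothetical helper: rebuild the rows from the level triples."""
    rows: List[Row] = []
    for rep, definition, value in triples:
        if rep == 0:
            # A repetition level of 0 always starts a new row.
            if definition == 0:
                rows.append(None)
                continue
            rows.append([])
            if definition == 1:
                continue
        rows[-1].append(value if definition == 3 else None)
    return rows


if __name__ == "__main__":
    sample = [[3.5], [3.0, 3.2], [], None, [None]]
    levels = encode_list_column(sample)
    for triple in levels:
        print(triple)
    print(decode_list_column(levels) == sample)
```

A map column follows the same idea with a repeated key/value group in place of the repeated element, so a reader that handles one level of repetition would cover both array(int) and map(int, varchar).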

Will try and create something & pass along. Any preferred medium?

MichaelChirico avatar Sep 25 '19 06:09 MichaelChirico

medium, e.g. wetransfer?

hannes avatar Sep 25 '19 07:09 hannes

yes, or dropbox, i could try gist...

MichaelChirico avatar Sep 25 '19 07:09 MichaelChirico

parquet_test.tar.gz

Seems I can upload a tar.gz here! I ran the following in SparkR; the compressed output is attached:

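# Spark start boilerplate (assumed here: attach SparkR, magrittr for %>%, start a session)
library(SparkR)
library(magrittr)
sparkR.session()

# Copy the built-in dataset and replace dots in the column names,
# since Spark SQL would read '.' as a field accessor
iris = iris
names(iris) = gsub('.', '_', names(iris), fixed = TRUE)
irisSDF = createDataFrame(iris)
irisSDF %>% createOrReplaceTempView('iris')

# One column per primitive type, plus a one-level map and a one-level array
sql("
select 1 as int, 'a' as str, 1.1 as dbl,
       timestamp('2019-09-20T12:34:56Z') as ts,
       true as bool, date('2019-09-21') as dt,
       map(Species, Sepal_Length) as mp,
       array(Sepal_Width) as arr
from iris
") %>% write.parquet('/path/to/output')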

MichaelChirico avatar Sep 27 '19 07:09 MichaelChirico

thanks, will see what i can do

hannes avatar Sep 27 '19 09:09 hannes