parquet4s

Read Schema of parquet file

Open ChrisMuki opened this issue 2 years ago • 2 comments

First, I want to thank you for this great library!

I need to merge hundreds of small parquet files into bigger ones. Sadly, they do not all share the same schema (e.g. some are missing columns), nor is the schema known at compile time.

I am wondering what the most efficient way would be to get only the schema of a parquet file. Currently I am looking into the first RowParquetRecord, but that is unreliable as there might be NullValues.

Further, I would like to know whether there is a complete list of how to map Scala types properly to fields, like this: Types.primitive(INT32, OPTIONAL).as(LogicalTypeAnnotation.dateType()).named(Birthday)

Thanks

ChrisMuki, Sep 30 '22 09:09

Hi Chris!

Parquet4s doesn't expose the file schema in its own API (it is a thing that could be added). However, you can easily access it by calling the original Java API that Parquet4s uses under the hood. Check org.apache.parquet.hadoop.ParquetFileReader, e.g.:

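import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.schema.MessageType

// inputFile: org.apache.parquet.io.InputFile, readerOptions: org.apache.parquet.ParquetReadOptions
val reader = ParquetFileReader.open(inputFile, readerOptions)
try {
  // The schema is stored in the file footer, so no records need to be read.
  val schema: MessageType = reader.getFileMetaData.getSchema
  ...
} finally reader.close()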

mjakubowski84, Oct 02 '22 11:10
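
A fuller sketch of the approach described above, for readers who want something self-contained. It assumes the files are reachable through Hadoop's FileSystem API and uses HadoopInputFile to build the InputFile. The helper names readSchema and unionSchemas are hypothetical and not part of Parquet4s. The second helper relies on MessageType.union from the Parquet Java API to combine the schemas of files that are missing columns; it is one possible way to derive a common schema, not something the thread prescribes.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.parquet.schema.MessageType

// Hypothetical helper: reads only the footer metadata of a single file.
def readSchema(path: String, conf: Configuration = new Configuration()): MessageType = {
  val inputFile = HadoopInputFile.fromPath(new Path(path), conf)
  val reader    = ParquetFileReader.open(inputFile)
  try reader.getFileMetaData.getSchema
  finally reader.close()
}

// Hypothetical helper: merges schemas field by field.
// Assumption: columns with the same name have compatible types;
// otherwise union throws an exception.
def unionSchemas(schemas: Seq[MessageType]): MessageType = {
  require(schemas.nonEmpty, "at least one schema is required")
  schemas.reduce((left, right) => left.union(right))
}
```

With these in place, something like unionSchemas(paths.map(readSchema(_))) would give a candidate schema for the merged output, where paths is a collection of file locations supplied by the caller.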

Regarding "a complete list of how to map scala types properly to fields": check the content of TypedSchemaDef.

I mean: use this type class, implicitly or explicitly, to obtain the type mapping. Also check the quite rich API of RowParquetRecord.

mjakubowski84, Oct 02 '22 12:10
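
For illustration only, here is how a few common Scala types could be described with the Types builder from the question. The pairings follow the standard Parquet logical type conventions; whether Parquet4s uses exactly these definitions is an assumption, and TypedSchemaDef remains the authoritative source. The message name and field names are hypothetical.

```scala
import org.apache.parquet.schema.{LogicalTypeAnnotation, MessageType, Types}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.{BINARY, BOOLEAN, DOUBLE, INT32, INT64}
import org.apache.parquet.schema.Type.Repetition.{OPTIONAL, REQUIRED}

// Assumed pairings (standard Parquet conventions, not confirmed against Parquet4s):
//   Long               -> INT64 with a signed 64-bit integer annotation
//   String             -> BINARY annotated as a string
//   Double             -> DOUBLE
//   Boolean            -> BOOLEAN
//   java.time.LocalDate -> INT32 annotated as a date (days since epoch)
val personSchema: MessageType = Types.buildMessage()
  .addField(Types.primitive(INT64, REQUIRED).as(LogicalTypeAnnotation.intType(64, true)).named("Id"))
  .addField(Types.primitive(BINARY, OPTIONAL).as(LogicalTypeAnnotation.stringType()).named("Name"))
  .addField(Types.primitive(DOUBLE, OPTIONAL).named("Score"))
  .addField(Types.primitive(BOOLEAN, OPTIONAL).named("Active"))
  .addField(Types.primitive(INT32, OPTIONAL).as(LogicalTypeAnnotation.dateType()).named("Birthday"))
  .named("Person")
```

Marking fields as OPTIONAL is what allows a merged file to hold rows that originate from inputs where the column was missing.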