
Inferring Unischema from Spark DataFrame/Schema

Open seranotannason opened this issue 6 years ago • 3 comments

Is it possible to infer the Unischema from a Spark DataFrame or its schema? There is a method to convert a Unischema into a Spark schema (as_spark_schema), but I'm wondering whether the reverse method exists; it would be useful.

seranotannason avatar Jun 26 '19 15:06 seranotannason

Can you please describe the use case for this request?

We have infer_or_load_unischema, which returns a Unischema, but it was not intended to be a public method.

Do you need a Unischema instance specifically, or would you be OK working with a pyarrow schema directly?

selitvin avatar Jun 26 '19 16:06 selitvin
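For the scalar-column case the requested reverse direction is at least conceivable, since Spark's simple types map cleanly onto numpy dtypes. The sketch below shows what such a mapping could look like; the helper name `fields_from_spark_schema`, the tuple representation of `StructField` entries, and the (partial) type table are all illustrative, not part of petastorm's public API:

```python
# Spark simple type name -> numpy dtype name (partial, illustrative mapping)
_SPARK_TO_NUMPY = {
    'integer': 'int32',
    'long': 'int64',
    'float': 'float32',
    'double': 'float64',
    'string': 'str',
    'boolean': 'bool',
}

def fields_from_spark_schema(spark_fields):
    """Sketch of inferring Unischema-style field descriptions from a Spark schema.

    spark_fields: iterable of (name, spark_type_name, nullable) tuples,
    mimicking the StructField entries of df.schema. Returns, per scalar
    column, the (name, dtype, shape, nullable) arguments one would feed
    to a UnischemaField.
    """
    fields = []
    for name, spark_type, nullable in spark_fields:
        if spark_type not in _SPARK_TO_NUMPY:
            raise ValueError('no scalar mapping for %r' % spark_type)
        # shape=() because a Spark scalar column carries no tensor shape
        fields.append((name, _SPARK_TO_NUMPY[spark_type], (), nullable))
    return fields

print(fields_from_spark_schema([('id', 'long', False), ('name', 'string', True)]))
```

Note that this only works for plain scalar columns; as discussed below, tensor columns cannot be recovered this way.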

In the example for materializing a dataset, a hand-crafted Unischema is required. Can't the Unischema be derived from the DataFrame's schema?

ahutterTA avatar Jan 09 '20 19:01 ahutterTA

@ahutterTA, if you are working with a Petastorm Parquet store, you get the ability to store multidimensional arrays (tensors) in Parquet files. A Spark DataFrame does not support tensor types, so we introduced Unischema to carry the additional information about a tensor's type and shape (while storing the tensors serialized in byte arrays). Since a Spark DataFrame does not carry this information, we cannot really derive it from the DataFrame.
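The information loss described above is easy to see in miniature. In the schematic below (not petastorm's actual codec machinery), fields are (name, dtype, shape) tuples: two different tensor fields collapse onto the same serialized byte-array column, so the DataFrame schema alone cannot determine which Unischema produced it:

```python
def to_spark_column(field):
    """Schematic forward mapping from a Unischema-style field to a Spark column."""
    name, dtype, shape = field
    if shape == ():               # scalar -> a native Spark type survives
        return (name, dtype)
    return (name, 'binary')       # tensor -> serialized bytes; dtype and shape are lost

image_a = ('image', 'uint8', (128, 128, 3))
image_b = ('image', 'float32', (64, 64))

# Both tensor fields become the same Spark column description...
print(to_spark_column(image_a))   # ('image', 'binary')
print(to_spark_column(image_b))   # ('image', 'binary')
# ...so no inverse mapping can recover the original tensor type or shape.
```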

If you are working with a non-Petastorm Parquet store, then petastorm simply lets you stream regular Parquet files into TF or PyTorch. In that case you do not need to recover a Unischema from the Parquet store at all; you can work with Spark tools that query the regular Parquet schema (since it contains no non-standard types).
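For the non-Petastorm case, petastorm's `make_batch_reader` is the entry point for streaming a plain Parquet store. A minimal sketch, with the import guarded so the snippet also loads where petastorm is not installed, and with a hypothetical dataset path:

```python
# Sketch: streaming a plain (non-Petastorm) Parquet store with petastorm.
try:
    from petastorm import make_batch_reader
    HAVE_PETASTORM = True
except ImportError:               # petastorm not installed in this environment
    HAVE_PETASTORM = False

def read_first_batch(dataset_url):
    """Return the first batch of rows from a plain Parquet store."""
    with make_batch_reader(dataset_url) as reader:
        for batch in reader:      # batches are namedtuples of column arrays
            return batch

if HAVE_PETASTORM:
    batch = read_first_batch('file:///tmp/plain_parquet')  # hypothetical path
```

From here the reader can be wrapped for TF or PyTorch consumption without ever constructing a Unischema by hand.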

Hope this answers your question...

selitvin avatar Jan 10 '20 05:01 selitvin