`parquet file size 0 bytes` when materializing dataset
I'm trying out petastorm on a Google Dataproc cluster. When I try to materialize a dataset like the one below:
```python
import numpy as np

from petastorm.codecs import NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

Schema = Unischema('Schema', [
    UnischemaField('features', np.float32, (310,), NdarrayCodec(), False)
])

def make_dataset(output_uri):
    rowgroup_size_mb = 256
    # Write the rows as a petastorm dataset with summary metadata enabled
    with materialize_dataset(spark, output_uri, Schema, rowgroup_size_mb, use_summary_metadata=True):
        rows_rdd = blah.select('features').limit(1000) \
            .rdd \
            .map(lambda x: {'features': np.array(x['features'].toArray(), dtype=np.float32)}) \
            .map(lambda x: dict_to_spark_row(Schema, x))

        spark.createDataFrame(rows_rdd, Schema.as_spark_schema()) \
            .write \
            .mode('overwrite') \
            .parquet(output_uri)
```
I get pyarrow errors like `ArrowInvalid: Parquet file size is 0 bytes` when executing the above against a Google Storage URI like `gs://bucket/path/petastorm`. Can anybody tell whether this is a petastorm issue, a pyarrow issue, or something else?
library versions:

```
fsspec==0.9.0
gcsfs==0.8.0
petastorm==0.10.0
pyarrow==0.17.1
pyspark==3.1.1  # editable install with no version control
```
The files are created and appear to be valid upon inspection with the regular Spark parquet reader. Trying to make a reader on the files via

```python
with make_reader("gs://bucket/petastorm") as reader:
    pass
```

causes the same `ArrowInvalid: Parquet file size is 0 bytes` error, and I assume it's the same root cause, whatever it is.
```python
raw = spark.read.parquet('gs://bucket/petastorm')
```

yields

```
[Row(features=bytearray(b"\x93NUMPY\x01\x00v\x00{'descr': '<f4', 'fortran_order': False, 'shape': (310,), } \n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 <...>
```

which looks like the NdarrayCodec worked.
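As a sanity check, that payload can also be decoded outside of petastorm: the `\x93NUMPY` magic indicates a standard `.npy` serialization, so something like the sketch below should recover the array (`raw` is the DataFrame read above; the decode path is my assumption, not necessarily what `NdarrayCodec` does internally):

```python
# Sketch: decode the first row's 'features' bytes back into an ndarray.
# Assumes the bytearray is a plain .npy payload, which the \x93NUMPY magic suggests.
import io
import numpy as np

first_row = raw.limit(1).collect()[0]
features = np.load(io.BytesIO(bytes(first_row['features'])))
print(features.shape, features.dtype)  # expected: (310,) float32
```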
Does running this locally and writing either to the local fs or to gs work with the same code?
When using an in-memory local cluster:
- writing to / reading from a `file:///<blah>/temp/` URI works
- writing to / reading from `gs://bucket/petastorm` as above does not
This feels like it's related to https://stackoverflow.com/questions/58646728/pyarrow-lib-arrowioerror-invalid-parquet-file-size-is-0-bytes, but I get the same 0-byte-size issue even if I disable _SUCCESS files and don't write metadata.
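For reference, this is roughly how those can be switched off through Spark's Hadoop configuration (a sketch; the option names are as I recall them and not verified against this exact Spark/Hadoop setup):

```python
# Sketch: suppress _SUCCESS marker files and Parquet summary metadata.
# Option names are assumptions based on common Hadoop/Parquet settings.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('mapreduce.fileoutputcommitter.marksuccessfuljobs', 'false')
hadoop_conf.set('parquet.enable.summary-metadata', 'false')
```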
More strange details
```python
with make_reader('gs://bucket/blah/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet') as reader:
    pass
```

yields `OSError: Passed non-file path: bucket/blah/petastorm/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet`

```python
with make_reader('gs://churn_dev/v4/predict/2021-04-28/petastorm/') as reader:
    pass
```

yields `ArrowInvalid: Parquet file size is 0 bytes`
Tried reproducing your issue using the examples/hello_world/petastorm_dataset/ scripts against gs storage. I was not able to reproduce the ArrowInvalid exception. I did observe a misleading `OSError: Passed non-file path` exception when in fact there were some permission issues accessing the bucket.
Can you try opening that parquet store using pyarrow (without petastorm)? That way, we would know whether the problem stems from the software layers under petastorm or from petastorm itself.
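Something along these lines would exercise roughly the same layers petastorm sits on top of, without petastorm itself (a sketch; the gcsfs wiring and the scheme-less path are my assumptions and may need adjusting for your pyarrow/gcsfs versions):

```python
# Sketch: read the dataset with pyarrow + gcsfs directly, bypassing petastorm.
# Path and filesystem handling are assumptions; adjust for your versions.
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem()
# gcsfs expects bucket-relative paths without the gs:// scheme
dataset = pq.ParquetDataset('bucket/petastorm', filesystem=fs)
table = dataset.read()
print(table.num_rows)
print(table.schema)
```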