
`parquet file size 0 bytes` when materializing dataset

Open ckchow opened this issue 3 years ago • 4 comments

I'm trying out petastorm on a Google Dataproc cluster, and when I try to materialize a dataset like the one below:

import numpy as np
from petastorm.codecs import NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

Schema = Unischema('Schema', [
    UnischemaField('features', np.float32, (310,), NdarrayCodec(), False)
])

def make_dataset(output_uri):
    rowgroup_size_mb = 256
    with materialize_dataset(spark, output_uri, Schema, rowgroup_size_mb, use_summary_metadata=True):
        # 'blah' is an existing Spark DataFrame with a vector-valued 'features' column
        rows_rdd = blah.select('features').limit(1000) \
            .rdd \
            .map(lambda x: {'features': np.array(x['features'].toArray(), dtype=np.float32)}) \
            .map(lambda x: dict_to_spark_row(Schema, x))

        spark.createDataFrame(rows_rdd, Schema.as_spark_schema()) \
            .write \
            .mode('overwrite') \
            .parquet(output_uri)

I get pyarrow errors like "ArrowInvalid: Parquet file size is 0 bytes" when executing the above against a Google Storage URI like "gs://bucket/path/petastorm". Can anybody tell whether this is a petastorm issue, a pyarrow issue, or something else?

library versions:

fsspec==0.9.0
gcsfs==0.8.0
petastorm==0.10.0
pyarrow==0.17.1
pyspark==3.1.1 (editable install with no version control)

The files are created and appear to be valid when inspected with the regular Spark parquet reader. Trying to make a reader on the files via

from petastorm import make_reader

with make_reader("gs://bucket/petastorm") as reader:
    pass

causes the same error ArrowInvalid: Parquet file size is 0 bytes, and I assume it's the same root cause, whatever it is.

ckchow · May 01 '21 02:05

raw = spark.read.parquet('gs://bucket/petastorm') yields

[Row(features=bytearray(b"\x93NUMPY\x01\x00v\x00{\'descr\': \'<f4\', \'fortran_order\': False, \'shape\': (310,), }                                                          \n\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x <...>

which suggests the NdarrayCodec encoding worked.
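As a sanity check (an editor's sketch, not from the original thread): petastorm's NdarrayCodec serializes each array in numpy's .npy format, so the raw bytes above should round-trip through np.load. Assuming raw is the DataFrame read back above:

import io
import numpy as np

# Decode the first encoded 'features' cell the same way NdarrayCodec does.
encoded = raw.first()['features']
features = np.load(io.BytesIO(bytes(encoded)))
print(features.shape, features.dtype)  # expected: (310,) float32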

ckchow · May 01 '21 03:05

Would running this locally and writing either to the local fs or to gs work with the same code?

selitvin · May 01 '21 14:05

When using an in-memory local cluster:

  • writing to / reading from a file:///<blah>/temp/ URI works
  • writing to / reading from gs://bucket/petastorm as above does not.

This feels related to https://stackoverflow.com/questions/58646728/pyarrow-lib-arrowioerror-invalid-parquet-file-size-is-0-bytes, but I get the same size-0 issue even if I disable _SUCCESS files and don't write metadata.
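For reference, an editor's sketch of how the _SUCCESS marker and Parquet summary metadata are typically disabled, assuming a live SparkSession named spark:

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
# Stop Spark/Hadoop from writing the _SUCCESS marker file.
hadoop_conf.set('mapreduce.fileoutputcommitter.marksuccessfuljobs', 'false')
# Stop the Parquet writer from emitting _metadata/_common_metadata summary files.
hadoop_conf.set('parquet.enable.summary-metadata', 'false')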


More strange details

with make_reader('gs://bucket/blah/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet') as reader:
    pass

yields OSError: Passed non-file path: bucket/blah/petastorm/part-00000-629b3fcc-6ee4-40b0-8b84-31513272952f-c000.snappy.parquet


with make_reader('gs://churn_dev/v4/predict/2021-04-28/petastorm/') as reader:
    pass

yields ArrowInvalid: Parquet file size is 0 bytes
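One way to narrow this down (an editor's sketch; the bucket path is a placeholder): list the store directly with gcsfs and check for zero-byte objects, such as _SUCCESS markers or GCS "directory" placeholders, that pyarrow might be trying to read as parquet:

import gcsfs

# Print every object under the dataset prefix with its size; any 0-byte
# entry would explain "Parquet file size is 0 bytes".
fs = gcsfs.GCSFileSystem()
for info in fs.ls('gs://bucket/petastorm', detail=True):
    print(info['name'], info['size'])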

ckchow · May 02 '21 06:05

Tried reproducing your issue using the examples/hello_world/petastorm_dataset/ scripts against gs storage. I was not able to reproduce the ArrowInvalid exception. I did observe a misleading OSError: Passed non-file path exception when in fact there were permission issues accessing the bucket.

Can you try opening that parquet store using pyarrow (without petastorm)? That way we would know whether the problem stems from the software layers under petastorm or from petastorm itself.
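For example, something like this editor's sketch (hedged for the legacy pyarrow 0.17 API; the bucket path is a placeholder):

import gcsfs
import pyarrow.parquet as pq

# Read the same store through pyarrow + gcsfs, bypassing petastorm entirely.
fs = gcsfs.GCSFileSystem()
dataset = pq.ParquetDataset('bucket/path/petastorm', filesystem=fs)
table = dataset.read()
print(table.num_rows)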

selitvin · May 03 '21 04:05