
ArrowIOError: Corrupted file, smaller than file footer

Open · balajib5497 opened this issue on Dec 22 '19 · 1 comment

import numpy as np

from petastorm.codecs import NdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# sequence_length, char_vec_length, num_labels, rowgroup_size_mb, output_url,
# train_data and row2dict are defined elsewhere in the notebook.
NewsSchema = Unischema('NewsSchema', [
    UnischemaField('headline_vec', np.float16, (sequence_length, char_vec_length), NdarrayCodec(), False),
    UnischemaField('category_vec', np.float16, (num_labels,), NdarrayCodec(), False),
])

with materialize_dataset(spark, output_url, NewsSchema, rowgroup_size_mb):
    rows_rdd = train_data.select('headline_vec', 'category_vec').rdd \
        .map(row2dict) \
        .map(lambda x: dict_to_spark_row(NewsSchema, x))
    spark.createDataFrame(rows_rdd, NewsSchema.as_spark_schema()) \
        .coalesce(4) \
        .write.mode('overwrite') \
        .parquet(output_url)

The code above throws the following error. Please help me.


ArrowIOError                              Traceback (most recent call last)
in <module>
      3
      4 with materialize_dataset(spark, output_url, NewsSchema, rowgroup_size_mb) as m:
----> 5     train_data.coalesce(4).write.mode('overwrite').parquet(output_url)
      6
      7 # rows_rdd = train_data.select('headline_vec', 'category_vec').rdd.map(row2dict).map(lambda x: dict_to_spark_row(NewsSchema, x))

/usr/lib/python3.7/contextlib.py in __exit__(self, type, value, traceback)
    117         if type is None:
    118             try:
--> 119                 next(self.gen)
    120             except StopIteration:
    121                 return False

/local_disk0/pythonVirtualEnvDirs/virtualEnv-b7f3809e-9296-4821-a705-fe6d82c95cd8/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in materialize_dataset(spark, dataset_url, schema, row_group_size_mb, use_summary_metadata, filesystem_factory)
    110                                      validate_schema=False)
    111
--> 112         _generate_unischema_metadata(dataset, schema)
    113         if not use_summary_metadata:
    114             _generate_num_row_groups_per_file(dataset, spark.sparkContext, filesystem_factory)

/local_disk0/pythonVirtualEnvDirs/virtualEnv-b7f3809e-9296-4821-a705-fe6d82c95cd8/lib/python3.7/site-packages/petastorm/etl/dataset_metadata.py in _generate_unischema_metadata(dataset, schema)
    190     assert schema
    191     serialized_schema = pickle.dumps(schema)
--> 192     utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
    193
    194

/local_disk0/pythonVirtualEnvDirs/virtualEnv-b7f3809e-9296-4821-a705-fe6d82c95cd8/lib/python3.7/site-packages/petastorm/utils.py in add_to_dataset_metadata(dataset, key, value)
    113             arrow_metadata = pyarrow.parquet.read_metadata(f)
    114     else:
--> 115         arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
    116
    117     base_schema = arrow_metadata.schema.to_arrow_schema()

/local_disk0/pythonVirtualEnvDirs/virtualEnv-b7f3809e-9296-4821-a705-fe6d82c95cd8/lib/python3.7/site-packages/petastorm/compat.py in compat_get_metadata(piece, open_func)
     29         arrow_metadata = piece.get_metadata(open_func)
     30     else:
---> 31         arrow_metadata = piece.get_metadata()
     32     return arrow_metadata
     33

/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py in get_metadata(self, open_file_func)
    500             f = self._open(open_file_func)
    501         else:
--> 502             f = self.open()
    503         return f.metadata
    504

/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py in open(self)
    518         Returns instance of ParquetFile
    519         """
--> 520         reader = self.open_file_func(self.path)
    521         if not isinstance(reader, ParquetFile):
    522             reader = ParquetFile(reader)

/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py in open_file(path, meta)
   1060                                    memory_map=self.memory_map,
   1061                                    metadata=meta,
-> 1062                                    common_metadata=self.common_metadata)
   1063         return open_file
   1064

/databricks/python/lib/python3.7/site-packages/pyarrow/parquet.py in __init__(self, source, metadata, common_metadata, memory_map)
    128                  memory_map=True):
    129         self.reader = ParquetReader()
--> 130         self.reader.open(source, use_memory_map=memory_map, metadata=metadata)
    131         self.common_metadata = common_metadata
    132         self._nested_paths_by_prefix = self._build_nested_paths()

/databricks/python/lib/python3.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()

/databricks/python/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Corrupted file, smaller than file footer

— balajib5497 · Dec 22 '19 14:12

Is it possible that some of the parquet files in the parquet dataset directory have zero records? Repartitioning your dataset so that the number of partitions is no larger than the number of records could help; see the sketches below. I have to guess here, since the example in the issue is partial and not directly runnable.
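One way to check that guess is to scan the dataset directory and report the row count of each piece. This is a minimal diagnostic sketch, not a petastorm API: it assumes output_url from your snippet points at a local file:// directory with the parquet pieces sitting directly in it.

import os

import pyarrow.parquet as pq

# Report the size and row count of every parquet piece in the dataset
# directory. A zero-byte piece is exactly what makes pyarrow raise
# "Corrupted file, smaller than file footer" when petastorm reads its
# metadata; a non-zero but truncated piece would raise the same
# ArrowIOError from ParquetFile below.
dataset_dir = output_url.replace('file://', '')  # assumes a local file:// URL
for name in sorted(os.listdir(dataset_dir)):
    if not name.endswith('.parquet'):
        continue
    path = os.path.join(dataset_dir, name)
    size = os.path.getsize(path)
    if size == 0:
        print('%s: empty file (0 bytes)' % name)
    else:
        print('%s: %d rows (%d bytes)' % (name, pq.ParquetFile(path).metadata.num_rows, size))

If empty pieces do turn up, capping the partition count by the row count before the write avoids producing them. Again a sketch, reusing the names from the issue:

# Never request more output partitions than there are rows, so no
# partition (and hence no parquet piece) ends up empty.
with materialize_dataset(spark, output_url, NewsSchema, rowgroup_size_mb):
    num_partitions = min(4, max(1, train_data.count()))
    train_data.repartition(num_partitions).write.mode('overwrite').parquet(output_url)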

— selitvin · Jan 09 '20 23:01