fastparquet
ValueError when loading partitioned dataset with None values
What happened: Fastparquet raises a ValueError when attempting to load a partitioned Parquet dataset whose partition column contains None values:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_9855/558527708.py in <cell line: 2>()
1 from fastparquet import ParquetFile
----> 2 ParquetFile('people-partitioned.parquet').to_pandas()
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, row_filter)
768 else v[start:start + thislen])
769 for (name, v) in views.items()}
--> 770 self.read_row_group_file(rg, columns, categories, index,
771 assign=parts, partition_meta=self.partition_meta,
772 row_filter=sel, infile=infile)
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in read_row_group_file(self, rg, columns, categories, index, assign, partition_meta, row_filter, infile)
372 f = infile or self.open(fn, mode='rb')
373
--> 374 core.read_row_group(
375 f, rg, columns, categories, self.schema, self.cats,
376 selfmade=self.selfmade, index=index,
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, scheme, partition_meta, row_filter)
620 key, val = [p for p in partitions if p[0] == cat][0]
621 val = val_to_num(val, meta=partition_meta.get(key))
--> 622 assign[cat][:] = cats[cat].index(val)
ValueError: 25 is not in list
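The failing line is a plain Python list lookup: the value parsed from the partition folder name must appear verbatim in the list of category values collected for that column. A minimal sketch of the same failure mode (the list contents here are hypothetical, only to illustrate how a type mismatch between the converted folder value and the stored categories would trigger this error):

# Hypothetical contents of cats['Age']: if a non-numeric sentinel keeps the
# stored categories as strings while val_to_num converts the folder value
# '25' to the int 25, list.index() cannot find it and raises.
cats = {'Age': ['20', '25', '__HIVE_DEFAULT_PARTITION__']}
cats['Age'].index(25)  # ValueError: 25 is not in list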
What you expected to happen: Fastparquet loads the partitioned dataset with the same result as the unpartitioned dataset:
   Age  Name
0   20  John
1   25   Joe
2  <NA>  Jane
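For comparison, the unpartitioned copy reads back cleanly (a minimal sketch, assuming people.parquet has been written by the example below):

from fastparquet import ParquetFile

# The unpartitioned file round-trips: the missing Age value comes back
# as a nullable <NA> instead of raising.
print(ParquetFile('people.parquet').to_pandas())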
Minimal Complete Verifiable Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20, 'John'), (25, 'Joe'), (None, 'Jane')], ['Age', 'Name'])

# Write the same frame twice: once plain, once partitioned on Age.
df.write.parquet('people.parquet')
df.write.parquet('people-partitioned.parquet', partitionBy='Age')

# Reading the partitioned copy raises the ValueError shown above.
from fastparquet import ParquetFile
ParquetFile('people-partitioned.parquet').to_pandas()
Anything else we need to know?:
Environment:
- Dask version: 2022.8.1
- Fastparquet version: 0.8.2
- Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0]
- Operating System: Linux
- Install method (conda, pip, source): conda
Unfortunately, this loads fine for me:
Name Age
0 John 20
1 Joe 25
2 Jane __HIVE_DEFAULT_PARTITION__
except that we do not recognise the very special value that Spark has used to represent None.
What version of Spark do you have, and did it make a folder structure with names different from
Age=20/ Age=25/ Age=__HIVE_DEFAULT_PARTITION__/
(I should have said, my successful run was with pyspark 3.1.2)
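Until that sentinel is recognised natively, a possible workaround for setups where the load succeeds but the sentinel survives, as in the output above (a sketch, not fastparquet API; the column name and dtypes follow the example):

import pandas as pd
from fastparquet import ParquetFile

df = ParquetFile('people-partitioned.parquet').to_pandas()

# Spark names the null partition folder 'Age=__HIVE_DEFAULT_PARTITION__';
# map that sentinel back to a missing value and restore a nullable
# integer dtype.
df['Age'] = (
    df['Age']
    .astype('string')
    .replace('__HIVE_DEFAULT_PARTITION__', pd.NA)
    .astype('Int64')
)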