
ValueError when loading partitioned dataset with None values

Andreas5739738 opened this issue 2 years ago • 2 comments

What happened: Fastparquet raises a ValueError when attempting to load a Parquet file with None values in a partitioned column:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_9855/558527708.py in <cell line: 2>()
      1 from fastparquet import ParquetFile
----> 2 ParquetFile('people-partitioned.parquet').to_pandas()

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index, row_filter)
    768                             else v[start:start + thislen])
    769                      for (name, v) in views.items()}
--> 770             self.read_row_group_file(rg, columns, categories, index,
    771                                      assign=parts, partition_meta=self.partition_meta,
    772                                      row_filter=sel, infile=infile)

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/api.py in read_row_group_file(self, rg, columns, categories, index, assign, partition_meta, row_filter, infile)
    372         f = infile or self.open(fn, mode='rb')
    373 
--> 374         core.read_row_group(
    375             f, rg, columns, categories, self.schema, self.cats,
    376             selfmade=self.selfmade, index=index,

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign, scheme, partition_meta, row_filter)
    620         key, val = [p for p in partitions if p[0] == cat][0]
    621         val = val_to_num(val, meta=partition_meta.get(key))
--> 622         assign[cat][:] = cats[cat].index(val)

ValueError: 25 is not in list
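The failing frame is the category lookup: cats[cat].index(val) raises as soon as the converted partition value is missing from the list of parsed directory values. A minimal sketch of that mismatch, with assumed contents (what actually ends up in cats['Age'] depends on how fastparquet parsed the directory names):

# Hypothetical reconstruction of the failing lookup, not fastparquet code:
# if val_to_num turns the directory value '25' into the integer 25 while
# the category list holds something else, list.index raises this error.
cats = {'Age': ['20', '__HIVE_DEFAULT_PARTITION__']}  # assumed, for illustration
val = 25  # numeric result of val_to_num('25')
cats['Age'].index(val)  # ValueError: 25 is not in list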

What you expected to happen: Fastparquet loads the partitioned dataset with the same result as the unpartitioned dataset:

    Age  Name
0    20  John
1    25   Joe
2  <NA>  Jane

Minimal Complete Verifiable Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(20, 'John'), (25, 'Joe'), (None, 'Jane')], ['Age', 'Name'])
df.write.parquet('people.parquet')  # unpartitioned copy: loads fine
df.write.parquet('people-partitioned.parquet', partitionBy='Age')  # partitioned on the column containing a None

from fastparquet import ParquetFile
ParquetFile('people-partitioned.parquet').to_pandas()  # raises ValueError: 25 is not in list
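For reference, the partitioned write creates one sub-directory per distinct Age value; a quick way to confirm the layout Spark produced (the listing shown matches what the discussion below reports for pyspark 3.x, including the placeholder directory for the None row):

import os

# List the partition directories Spark wrote; a null Age is expected to
# appear as the special folder Age=__HIVE_DEFAULT_PARTITION__
print(sorted(os.listdir('people-partitioned.parquet')))
# e.g. ['Age=20', 'Age=25', 'Age=__HIVE_DEFAULT_PARTITION__', '_SUCCESS']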

Anything else we need to know?:

Environment:

  • Dask version: 2022.8.1
  • Fastparquet version: 0.8.2
  • Python version: 3.8.12 (conda-forge, GCC 9.4.0)
  • Operating System: Linux
  • Install method (conda, pip, source): conda

Andreas5739738 · Aug 26 '22

Unfortunately, this loads fine for me:

   Name                         Age
0  John                          20
1   Joe                          25
2  Jane  __HIVE_DEFAULT_PARTITION__

except that we do not recognise the very special value that spark has used to represent None.
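When the load does succeed as above, the placeholder can be mapped back to a real missing value after reading; a sketch (post-processing done by hand, not something fastparquet does):

from fastparquet import ParquetFile

df = ParquetFile('people-partitioned.parquet').to_pandas()
# Partition columns come back as strings/categories; go via object dtype
# so the placeholder can be masked out as a missing value (NaN).
age = df['Age'].astype(object)
df['Age'] = age.mask(age == '__HIVE_DEFAULT_PARTITION__')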

What version of Spark do you have, and did it make a folder structure with names different from

Age=20/  Age=25/  Age=__HIVE_DEFAULT_PARTITION__/

martindurant · Aug 26 '22

(I should have said, my successful run was with pyspark 3.1.2)

martindurant · Aug 26 '22