parquet-python
Structures cause error
Hi Joe and others,

I am trying to use your module to read a Parquet file, and I ran into a problem at schema.py, line 21:

assert len(self.schema_elements) == len(self.schema_elements_by_name)

Apparently the init method assumes that no two fields share a name, and my structure has several fields with the same name. The module works correctly if you comment out this line, though.

These files were originally used by Hive, and here is the list of fields in the table:
fileid bigint,
version bigint,
ip_geocode struct<countrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string>,
timestamp bigint,
region bigint,
pixel bigint,
uuid bigint,
uuid_exists boolean,
referingurl string,
useragent string,
ip string,
querystring string,
campaignsinfo array<struct<campaign_id:bigint,media_types:array
Here is the list of fields as the module sees it:
name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None
name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
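For anyone reading the dump, the numeric codes correspond to enums in the Parquet format's Thrift definition. The sketch below decodes them into readable labels. The dictionaries and the `describe` helper are written here for illustration only and are not part of parquet-python; the converted-type table lists only the first few values.

```python
# Enum values as defined in the Parquet format's Thrift file (parquet.thrift).
PHYSICAL_TYPE = {
    0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
    4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY", 7: "FIXED_LEN_BYTE_ARRAY",
}
REPETITION = {0: "REQUIRED", 1: "OPTIONAL", 2: "REPEATED"}
CONVERTED_TYPE = {0: "UTF8", 1: "MAP", 2: "MAP_KEY_VALUE", 3: "LIST"}  # partial


def describe(name, repetition_type=None, physical_type=None,
             converted_type=None, num_children=None):
    """Hypothetical helper: render one schema element as a readable string."""
    parts = []
    if repetition_type is not None:
        parts.append(REPETITION.get(repetition_type, str(repetition_type)))
    if physical_type is not None:
        parts.append(PHYSICAL_TYPE.get(physical_type, str(physical_type)))
    else:
        # No physical type means the element is a group with children.
        parts.append("group[%d]" % (num_children or 0))
    if converted_type is not None:
        parts.append("(%s)" % CONVERTED_TYPE.get(converted_type, str(converted_type)))
    return "%s: %s" % (name, " ".join(parts))


print(describe("fileid", repetition_type=1, physical_type=2))
print(describe("campaignsinfo", repetition_type=1, converted_type=3, num_children=1))
print(describe("bag", repetition_type=2, num_children=1))
```

Read this way, `campaignsinfo` is an optional LIST group whose single child `bag` is a repeated group, which matches the Hive `array<...>` column.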
Apparently two elements are named 'array_element' and two are named 'bag'. I assume these fields just come with the nested structures.
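Parquet stores the schema as a flattened, depth-first list in which `num_children` tells how many of the following elements belong to a group, so a name only has to be unique among its siblings. A minimal sketch of how the elements could be indexed by full dotted path instead of bare name follows. The `Element` tuple and `dotted_paths` function are hypothetical and not parquet-python's API, and the sample is a trimmed subset of the dump above (the root and the first `array_element` have fewer children here than in the real file).

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a schema element; only the two fields needed here.
Element = namedtuple("Element", ["name", "num_children"])


def dotted_paths(elements):
    """Return the dotted path of every element below the root, in file order."""
    it = iter(elements)
    root = next(it)  # the first element is the schema root
    paths = []

    def walk(prefix, count):
        # Consume `count` children; each child consumes its own subtree first.
        for _ in range(count):
            el = next(it)
            path = prefix + (el.name,)
            paths.append(".".join(path))
            walk(path, el.num_children or 0)

    walk((), root.num_children or 0)
    return paths


# Trimmed subset of the schema from this report.
sample = [
    Element("hive_schema", 2),
    Element("fileid", None),
    Element("campaignsinfo", 1),
    Element("bag", 1),
    Element("array_element", 2),
    Element("campaign_id", None),
    Element("media_types", 1),
    Element("bag", 1),
    Element("array_element", None),
]

name_counts = Counter(el.name for el in sample[1:])
duplicates = sorted(n for n, c in name_counts.items() if c > 1)
paths = dotted_paths(sample)

print(duplicates)                      # bare names that collide
print(len(paths) == len(set(paths)))   # dotted paths do not collide
```

With this sample, the bare names `bag` and `array_element` each appear twice, while the dotted paths (for example `campaignsinfo.bag` and `campaignsinfo.bag.array_element.media_types.bag`) are all distinct, so a uniqueness check keyed on paths would pass.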
@SergeNov thanks for the report. I'll attempt to reproduce and fix the issue.
@SergeNov I've started to work on support for schemas like these. The first step is in #45, if you want to give it a try. Unfortunately, I don't think your schema is fully supported yet because it includes an array.
Still experiencing this issue in version 1.3.1.