parquet-python icon indicating copy to clipboard operation
parquet-python copied to clipboard

Structures cause error

Open SergeNov opened this issue 8 years ago • 3 comments

Hi Joe and others I am trying to use your module to read a parquet file, and i ran into a problem here: schema.py, line 21: assert len(self.schema_elements) == len(self.schema_elements_by_name) Apparently the init method assumes that my structure has multiple fields with the same name. Module works correctly if you comment out this line though Originally these files were used by Hive, and here is the list of fields in the table:

fileid bigint, version bigint, ip_geocode structcountrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string, timestamp bigint, region bigint, pixel bigint, uuid bigint, uuid_exists boolean, referingurl string, useragent string, ip string, querystring string, campaignsinfo array<struct<campaign_id:bigint,media_types:array,advertiser_id:bigint,funnel_step_id:bigint,funnel_step_value:bigint,track_conversion:boolean>>, opted_out boolean, event_id string

Here is how the list of fields that the module sees:

name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0 name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2 name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0 name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0 name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6 name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1 name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1

Apparently there are 2 elements named 'array_element' and 'bag' - i assume these fields just come with structures

SergeNov avatar Aug 23 '16 23:08 SergeNov

@SergeNov thanks for the report. I'll attempt to reproduce and fix the issue.

jcrobak avatar Aug 28 '16 18:08 jcrobak

@SergeNov I've started to work on support for schemas like these. The first step is in #45, if you want to give it a try. Unfortunately, I don't think your schema is fully supported yet because it includes an array.

jcrobak avatar Oct 01 '16 17:10 jcrobak

Still experiencing this issue in version 1.3.1.

halfak avatar Jan 10 '22 23:01 halfak