
"Table schema does not match schema used to create file" when incrementally writing parquet with batches from fetcharrowbatches if using strings_as_dictionary=True

Status: Open. Opened by chriscomeau79 (2 comments)

This worked with the previous versions (Arrow 0.9, Turbodbc 2.7) but fails with the current versions (Arrow 0.11, Turbodbc 3.0).

Workaround: I used strings_as_dictionary=False instead.

import pyarrow.parquet as pq

schema = None
writer = None
for batch in cursor.fetcharrowbatches(strings_as_dictionary=True):
  if schema is None:
    schema = batch.schema
    writer = pq.ParquetWriter(local_output_path, schema, compression='gzip')
  writer.write_table(batch)
writer.close()

I checked the schema comparison in the output, and the only difference I can see is that the dictionaries have different contents. The columns, types, and dictionary index sizes are all the same. For example:

Table schema does not match schema used to create file: 
table:
...
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3121.0",
      "3136.0",
      ...
      "3170.0"
    ]
vs
file:
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3183.0",
      "3125.0",
      ...
      "3199.0"
    ]

chriscomeau79 commented on Nov 13 '18

@xhochy Does this look familiar?

MathMagique commented on Nov 14 '18

Yes, this looks familiar. We have not yet implemented the functionality to merge dictionary-encoded data into a unified dictionary type: https://issues.apache.org/jira/browse/ARROW-554

xhochy commented on Dec 01 '18