
"Table schema does not match schema used to create file" when incrementally writing parquet with batches from fetcharrowbatches if using strings_as_dictionary=True

Status: Open. Opened by chriscomeau79 (2 comments)

This worked with the previous versions (Arrow 0.9, Turbodbc 2.7) but fails with the current versions (Arrow 0.11, Turbodbc 3.0).

Workaround: I used strings_as_dictionary=False instead.

import pyarrow.parquet as pq

schema = None
writer = None
for batch in cursor.fetcharrowbatches(strings_as_dictionary=True):
  if schema is None:
    schema = batch.schema
    writer = pq.ParquetWriter(local_output_path, schema, compression='gzip')
  writer.write_table(batch)
writer.close()

I checked the schema comparison in the output, and the only difference I can see is that the dictionaries have different contents. The columns, types, and dictionary index sizes are all the same. For example:

Table schema does not match schema used to create file: 
table:
...
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3121.0",
      "3136.0",
      ...
      "3170.0"
    ]
vs
file:
value: dictionary<values=string, indices=int16, ordered=0>
  dictionary:
    [
      "3183.0",
      "3125.0",
      ...
      "3199.0"
    ]

chriscomeau79 commented on Nov 13 '18

@xhochy Does this look familiar?

MathMagique commented on Nov 14 '18

Yes, this looks familiar. We have not yet implemented the functionality to merge dictionary-encoded data into a unified dictionary type: https://issues.apache.org/jira/browse/ARROW-554

xhochy commented on Dec 01 '18