turbodbc
"Table schema does not match schema used to create file" when incrementally writing parquet with batches from fetcharrowbatches if using strings_as_dictionary=True
This worked with the previous versions: Arrow 0.9, Turbodbc 2.7.
Current versions: Arrow 0.11, Turbodbc 3.0.
Workaround: I used strings_as_dictionary=False instead.
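import pyarrow.parquet as pq

schema = None
for batch in cursor.fetcharrowbatches(strings_as_dictionary=True):
    if schema is None:
        # Create the writer once, from the schema of the first batch
        schema = batch.schema
        writer = pq.ParquetWriter(local_output_path, schema, compression='gzip')
    writer.write_table(batch)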
I checked the schema comparison in the output and the only difference I can see is that the dictionaries have different contents. The columns, types and dictionary index sizes are all the same. For example:
Table schema does not match schema used to create file:
table:
...
value: dictionary<values=string, indices=int16, ordered=0>
dictionary:
[
"3121.0",
"3136.0",
...
"3170.0"
]
vs
file:
value: dictionary<values=string, indices=int16, ordered=0>
dictionary:
[
"3183.0",
"3125.0",
...
"3199.0"
]
@xhochy Does this look familiar?
Yes, this looks familiar. We have not yet implemented functionality to merge dictionary-encoded data into a unionized type: https://issues.apache.org/jira/browse/ARROW-554
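Until that is available, the workaround from the report (fetching with strings_as_dictionary=False) can be wrapped in a small helper, sketched below. The helper name write_batches and its open_writer parameter are hypothetical and not part of turbodbc or pyarrow. It only assumes that each batch exposes a .schema attribute and that the writer has write_table() and close(), as pq.ParquetWriter does. The pyarrow/turbodbc usage is shown only as a comment, since it needs a live cursor.

```python
def write_batches(batches, open_writer):
    """Write every batch through one writer built from the first batch's schema.

    batches:     any iterable of objects with a `.schema` attribute
    open_writer: callable taking a schema and returning an object with
                 write_table() and close() (hypothetical parameter name)
    Returns the number of batches written.
    """
    writer = None
    count = 0
    try:
        for batch in batches:
            if writer is None:
                # Create the writer lazily, once the first schema is known
                writer = open_writer(batch.schema)
            writer.write_table(batch)
            count += 1
    finally:
        # Close the file even if a later batch fails to write
        if writer is not None:
            writer.close()
    return count


# Intended use with the names from the report (assumes an open turbodbc
# cursor and a local_output_path; not executed here):
#
#   import pyarrow.parquet as pq
#   batches = cursor.fetcharrowbatches(strings_as_dictionary=False)
#   write_batches(
#       batches,
#       lambda schema: pq.ParquetWriter(local_output_path, schema,
#                                       compression='gzip'),
#   )
```

With strings_as_dictionary=False the string columns arrive as plain strings, so there are no per-batch dictionaries for the schema comparison to trip over. The trade-off is that the dictionary encoding is lost on the Arrow side.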