arrow-julia
Invalid arrow file produced when writing partitioned with a dictionary-encoded column?
I was testing writing partitioned data from Julia using Arrow.jl. When a column is dictionary-encoded, it seems the resulting arrow file cannot be read by PyArrow:
using Arrow
using Tables

# Build a 10-row partition; each call produces a different dictionary for `rsid`
function gendata(rsid, negate)
    rsids = String[]
    values = Int32[]
    for i in 1:10
        push!(rsids, rsid)
        push!(values, negate ? -i : i)
    end
    return (rsid=Arrow.DictEncode(rsids), value=values)
end

t1 = gendata("rsid123456789", false)
t2 = gendata("rsid000000001", true)
t = Tables.partitioner([t1, t2])
Arrow.write("partitions.arrow", t)
Reading the file back with PyArrow:
Python 3.8.2 (default, Jul 31 2020, 22:06:31)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> t = pyarrow.ipc.open_file(open('partitions.arrow','rb'))
>>> t.read_pandas()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/ipc.pxi", line 421, in pyarrow.lib._ReadPandasMixin.read_pandas
File "pyarrow/ipc.pxi", line 663, in pyarrow.lib._RecordBatchFileReader.read_all
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file
Trying to read one batch at a time fails with the same error:
>>> import pyarrow
>>> t = pyarrow.ipc.open_file(open('partitions.arrow','rb'))
>>> t.num_record_batches
2
>>> b = t.get_batch(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/ipc.pxi", line 641, in pyarrow.lib._RecordBatchFileReader.get_batch
File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file
I can't tell if this is an issue with Arrow.jl writing or with PyArrow reading. Is there a tool somewhere that can dump a .arrow file and check it for correctness?
For completeness' sake, here is proof that PyArrow can read the file when the rsid column is not dictionary-encoded:
>>> r = pyarrow.ipc.open_file(open('partitions2.arrow','rb'))
>>> r.num_record_batches
2
>>> b = r.get_batch(0)
>>> b
pyarrow.RecordBatch
rsid: string not null
value: int32 not null
>>> b.to_pandas()
rsid value
0 rsid123456789 1
1 rsid123456789 2
2 rsid123456789 3
3 rsid123456789 4
4 rsid123456789 5
5 rsid123456789 6
6 rsid123456789 7
7 rsid123456789 8
8 rsid123456789 9
9 rsid123456789 10
Ah, trying to write the same set of dictionary-encoded batches with PyArrow results in the following:
$ cat p2.py
import pyarrow as pa

def mkbatch(rsid, negate):
    # 10 rows that all reference index 0 of a one-entry dictionary
    d = pa.array([rsid])
    i = pa.array([0] * 10)
    da = pa.DictionaryArray.from_arrays(i, d)
    values = [(-i if negate else i) for i in range(1, 11)]
    data = [da, pa.array(values)]
    return pa.RecordBatch.from_arrays(data, ['rsid', 'value'])

batch1 = mkbatch('rsid123456789', False)
batch2 = mkbatch('rsid000000001', True)

f = open('p2.arrow', 'wb')
writer = pa.ipc.new_file(f, batch1.schema)
writer.write_batch(batch1)
writer.write_batch(batch2)  # fails: second batch carries a different dictionary
writer.close()
f.close()
$ python p2.py
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/ipc.pxi", line 314, in pyarrow.lib._CRecordBatchWriter.write_batch
File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field accross all batches.
Sorry for the slow response here. I'm actually surprised that pyarrow wouldn't support replacement dictionaries like this. From the implementation status page, pyarrow tracks the C++ implementation, which does support replacement dictionaries. I've asked on the mailing list here if you want to watch there for an answer.
The short answer from the mailing list is that the IPC file format doesn't support dictionaries the same way we do in the Julia implementation (i.e. Julia supports dictionary replacement in files, C++ doesn't), so you'll want to ensure you use the stream format instead of the file format when doing things with dictionaries.