
Invalid arrow file produced when writing partitioned with a dictionary-encoded column?

Open paulmelis opened this issue 4 years ago • 3 comments

I was testing writing partitioned data from Julia using Arrow.jl. When a column is dictionary-encoded, it seems the resulting arrow file cannot be read by PyArrow:

using Arrow
using Tables

function gendata(rsid, negate)

    rsids = String[]
    values = Int32[]
    for i in 1:10
        push!(rsids, rsid)
        push!(values, negate ? -i : i)
    end

    return (rsid=Arrow.DictEncode(rsids), value=values)

end

t1 = gendata("rsid123456789", false)
t2 = gendata("rsid000000001", true)
t = Tables.partitioner([t1, t2])

Arrow.write("partitions.arrow", t)
Python 3.8.2 (default, Jul 31 2020, 22:06:31) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> t = pyarrow.ipc.open_file(open('partitions.arrow','rb'))
>>> t.read_pandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 421, in pyarrow.lib._ReadPandasMixin.read_pandas
  File "pyarrow/ipc.pxi", line 663, in pyarrow.lib._RecordBatchFileReader.read_all
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file

Trying to read one batch at a time fails with the same error:

>>> import pyarrow
>>> t = pyarrow.ipc.open_file(open('partitions.arrow','rb'))
>>> t.num_record_batches
2
>>> b = t.get_batch(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 641, in pyarrow.lib._RecordBatchFileReader.get_batch
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file

I can't tell if this is an issue with Arrow.jl writing or with PyArrow reading. Is there a tool somewhere that can dump a .arrow file and check it for correctness?

paulmelis avatar Feb 11 '21 10:02 paulmelis

For completeness' sake, here is proof that PyArrow can read the file when the rsid column is not dictionary-encoded:

>>> r = pyarrow.ipc.open_file(open('partitions2.arrow','rb'))
>>> r.num_record_batches
2
>>> b = r.get_batch(0)
>>> b
pyarrow.RecordBatch
rsid: string not null
value: int32 not null
>>> b.to_pandas()
            rsid  value
0  rsid123456789      1
1  rsid123456789      2
2  rsid123456789      3
3  rsid123456789      4
4  rsid123456789      5
5  rsid123456789      6
6  rsid123456789      7
7  rsid123456789      8
8  rsid123456789      9
9  rsid123456789     10

paulmelis avatar Feb 11 '21 10:02 paulmelis

Ah, trying to write the same set of dictionary-encoded batches with PyArrow results in the following:

$ cat p2.py
import pyarrow as pa

def mkbatch(rsid, negate):
    d = pa.array([rsid])
    idx = pa.array([0]*10)
    da = pa.DictionaryArray.from_arrays(idx, d)
    values = [(-i if negate else i) for i in range(10)]
    data = [da, pa.array(values)]
    return pa.RecordBatch.from_arrays(data, ['rsid', 'value'])

batch1 = mkbatch('rsid123456789', False)
batch2 = mkbatch('rsid000000001', True)

f = open('p2.arrow', 'wb')
writer = pa.ipc.new_file(f, batch1.schema)
writer.write_batch(batch1)
writer.write_batch(batch2)
writer.close()
f.close()

$ python p2.py
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/ipc.pxi", line 314, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field accross all batches.

paulmelis avatar Feb 11 '21 10:02 paulmelis

Sorry for the slow response here. I'm actually surprised that pyarrow wouldn't support replacement dictionaries like this. The implementation status page says pyarrow tracks the C++ implementation, which does support replacement dictionaries. I've asked on the mailing list here if you want to watch there for an answer.

quinnj avatar Mar 18 '21 06:03 quinnj

The short answer from the mailing list is that the file format doesn't support dictionary replacement the way the Julia implementation does (i.e. Julia supports it, C++ doesn't), so you'll want to make sure you use the stream format instead of the file format when working with dictionaries.

quinnj avatar Jun 09 '23 03:06 quinnj