Fix readback of embedding/tensor types from Parquet
**Describe the bug**
Sometimes when writing embedding or tensor types to Parquet and reading them back, the values come back misaligned or the read fails outright.
There are more details in the fix (#2586), but for future reference: this issue with embedding and tensor types (and nested types in general) was caused by a top-level column whose last row spans more than one data page. Here's a minimal reproducer:
```python
import random

import daft
import pyarrow as pa

# A list-of-struct column; its nested values are what end up spanning
# multiple data pages.
struct_type = pa.list_(pa.struct([("field1", pa.string()), ("field2", pa.string())]))
data = [[{"field1": "a", "field2": "b"}]]
n = 512
str_size = 3000
for i in range(0, n):
    s = ""
    r = ""
    for _ in range(0, str_size):
        s += chr(random.randint(32, 126))
        r += chr(random.randint(32, 126))
    # Each subsequent row holds two large (~3 kB) struct values, so the
    # last row's values can straddle a data page boundary.
    data += [[{"field1": s, "field2": "b"}, {"field1": r, "field2": "b"}]]

table = daft.from_pydict({"nested": pa.array(data, type=struct_type)})
table.write_parquet("adverse.parquet")
```
Running the script produces a parquet folder `adverse.parquet/` with the following column layout:
```
Column: nested.list.item.field1
--------------------------------------------------------------------------------
  page   type  enc  count  avg size  size      rows  nulls  min / max
  0-D    dict  S _  1024   2.931 kB  2.931 MB
  0-1    data  S R  1024   1.39 B    1.393 kB        0      " *7$5R/6PL'M4OaR8r}&S%<rP..." / "~ubE)P:n&)Jwexz,|B\X:*qZ6..."
  0-2    data  S _  1      2.945 kB  2.945 kB        0      "~$[[email protected]#zo97K#b'%.J{4m..." / "~$[[email protected]#zo97K#b'%.J{4m..."

Column: nested.list.item.field2
--------------------------------------------------------------------------------
  page   type  enc  count  avg size  size   rows  nulls  min / max
  0-D    dict  S _  1      5.00 B    5 B
  0-1    data  S R  1025   0.14 B    147 B        0      "b" / "b"
```
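The per-page breakdown above comes from an external page-inspection tool. For a quick sanity check with pyarrow alone (pyarrow only exposes column-chunk granularity, not per-page details), something like the following works; the part-file glob is an assumption about how the written folder is laid out, not something specified in this issue:

```python
import glob

import pyarrow.parquet as pq

# The write produces one or more part files inside adverse.parquet/;
# the naming scheme is assumed here, so we just glob for the first one.
path = glob.glob("adverse.parquet/*.parquet")[0]

meta = pq.ParquetFile(path).metadata
for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        # Expect 1025 leaf values per column: 1 from the first row plus
        # 2 x 512 from the generated rows.
        print(col.path_in_schema, "num_values =", col.num_values)
```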
In the layout above, the last row (which holds two nested values) spans pages 0-1 and 0-2. Trying to read this parquet file back results in an error.
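For completeness, the failing read is just a plain round trip (the exact error message varies by version, so it is omitted here):

```python
import daft

# On affected versions, collecting this read triggers the
# page-boundary bug described below.
df = daft.read_parquet("adverse.parquet")
df.collect()
```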
The underlying problem was an incorrect implicit assumption that all of a row's nested values live within the same data page. This caused us to stop reading pages as soon as we had seen a value from the last requested row, so we missed any subsequent data pages that the last row overflowed into.
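To make the failure mode concrete, here is a minimal sketch of the flawed stopping condition (illustrative Python, not Daft's actual reader code; the page and value bookkeeping are simplified):

```python
from dataclasses import dataclass


@dataclass
class Page:
    values: list   # flattened leaf values stored in this data page
    num_rows: int  # top-level rows that *start* in this data page


def read_values_buggy(pages: list[Page], rows_needed: int) -> list:
    """Stop once we've seen a value from the last requested row."""
    values: list = []
    rows_seen = 0
    for page in pages:
        values.extend(page.values)
        rows_seen += page.num_rows
        if rows_seen >= rows_needed:
            # BUG: the last row's remaining nested values may continue
            # into the next page, but we never read it.
            break
    return values


# In the reproducer: page 0-1 holds 1024 values and all 513 row starts,
# while page 0-2 holds the last row's second value.
pages = [Page(values=list(range(1024)), num_rows=513),
         Page(values=[1024], num_rows=0)]
assert len(read_values_buggy(pages, rows_needed=513)) == 1024  # one value short
```

Roughly speaking, the fix keeps consuming data pages until every value belonging to the requested rows has been read, rather than stopping at the first value of the last row.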