Fix readback of embedding/tensor types from Parquet
**Describe the bug**
Sometimes when writing embedding or tensor types to Parquet and reading them back, the values come back misaligned or the read fails outright.
There are more details in the fix (#2586), but for future reference: this issue with embedding and tensor types (and nested types in general) was caused by a top-level column whose last row spans more than one data page. Here's a minimal reproducer:
```python
import random

import daft
import pyarrow as pa

# A list-of-struct column; its nested values are what end up spanning
# multiple data pages.
struct_type = pa.list_(pa.struct([("field1", pa.string()), ("field2", pa.string())]))
data = [[{"field1": "a", "field2": "b"}]]
n = 512
str_size = 3000
for i in range(0, n):
    s = ""
    r = ""
    for _ in range(0, str_size):
        s += chr(random.randint(32, 126))
        r += chr(random.randint(32, 126))
    # Each subsequent row holds two large (~3 kB) struct values, so the
    # last row's values can straddle a data page boundary.
    data += [[{"field1": s, "field2": "b"}, {"field1": r, "field2": "b"}]]

table = daft.from_pydict({"nested": pa.array(data, type=struct_type)})
table.write_parquet("adverse.parquet")
```
Running the script produces a parquet folder `adverse.parquet/` with the following column layout:
```
Column: nested.list.item.field1
--------------------------------------------------------------------------------
  page   type  enc  count  avg size  size      rows  nulls  min / max
  0-D    dict  S _  1024   2.931 kB  2.931 MB
  0-1    data  S R  1024   1.39 B    1.393 kB        0      " *7$5R/6PL'M4OaR8r}&S%<rP..." / "~ubE)P:n&)Jwexz,|B\X:*qZ6..."
  0-2    data  S _  1      2.945 kB  2.945 kB        0      "~$[[email protected]#zo97K#b'%.J{4m..." / "~$[[email protected]#zo97K#b'%.J{4m..."

Column: nested.list.item.field2
--------------------------------------------------------------------------------
  page   type  enc  count  avg size  size   rows  nulls  min / max
  0-D    dict  S _  1      5.00 B    5 B
  0-1    data  S R  1025   0.14 B    147 B        0      "b" / "b"
```
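The per-page breakdown above comes from an external page-inspection tool. For a quick sanity check with pyarrow alone (pyarrow only exposes column-chunk granularity, not per-page details), something like the following works; the part-file glob is an assumption about how the written folder is laid out, not something specified in this issue:

```python
import glob

import pyarrow.parquet as pq

# The write produces one or more part files inside adverse.parquet/;
# the naming scheme is assumed here, so we just glob for the first one.
path = glob.glob("adverse.parquet/*.parquet")[0]

meta = pq.ParquetFile(path).metadata
for rg_idx in range(meta.num_row_groups):
    rg = meta.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        # Expect 1025 leaf values per column: 1 from the first row plus
        # 2 x 512 from the generated rows.
        print(col.path_in_schema, "num_values =", col.num_values)
```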
In the layout above, the last row (which holds two nested values) spans pages 0-1 and 0-2. Trying to read this parquet file back results in an error.
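For completeness, the failing read is just a plain round trip (the exact error message varies by version, so it is omitted here):

```python
import daft

# On affected versions, collecting this read triggers the
# page-boundary bug described below.
df = daft.read_parquet("adverse.parquet")
df.collect()
```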
The underlying problem was an incorrect implicit assumption that all of a row's nested values live within the same data page. This caused us to stop reading pages as soon as we had seen a value from the last requested row, so we missed any subsequent data pages that the last row overflowed into.
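To make the failure mode concrete, here is a minimal sketch of the flawed stopping condition (illustrative Python, not Daft's actual reader code; the page and value bookkeeping are simplified):

```python
from dataclasses import dataclass


@dataclass
class Page:
    values: list   # flattened leaf values stored in this data page
    num_rows: int  # top-level rows that *start* in this data page


def read_values_buggy(pages: list[Page], rows_needed: int) -> list:
    """Stop once we've seen a value from the last requested row."""
    values: list = []
    rows_seen = 0
    for page in pages:
        values.extend(page.values)
        rows_seen += page.num_rows
        if rows_seen >= rows_needed:
            # BUG: the last row's remaining nested values may continue
            # into the next page, but we never read it.
            break
    return values


# In the reproducer: page 0-1 holds 1024 values and all 513 row starts,
# while page 0-2 holds the last row's second value.
pages = [Page(values=list(range(1024)), num_rows=513),
         Page(values=[1024], num_rows=0)]
assert len(read_values_buggy(pages, rows_needed=513)) == 1024  # one value short
```

Roughly speaking, the fix keeps consuming data pages until every value belonging to the requested rows has been read, rather than stopping at the first value of the last row.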