Write Dataset with file_visitor core dump

Open • wingerted opened this issue 6 months ago • 1 comment

Describe the bug, including details regarding any error messages, version, and platform.

Description

Calling pyarrow.dataset.write_dataset with a file_visitor callback causes a core dump. If file_visitor is not passed, write_dataset completes successfully.

Reproduction Code

import pyarrow as pa
import pyarrow.dataset as ds
import uuid, pathlib, time, os
import json, pyarrow.parquet as pq



table_uri = pathlib.Path("data/my_ds")
data_dir   = table_uri
data_dir.mkdir(parents=True, exist_ok=True)


tbl = pa.table({
    "id":    pa.array([1, 2, 3], pa.int64()),
    "value": pa.array([10, 20, 30], pa.int64()),
    "ds":    pa.array(["2025-06-12"]*3, pa.string())
})


f_list = []
def f_visit(f):
    f_list.append(f.path)

ds.write_dataset(
    tbl,
    base_dir=data_dir,
    format="parquet",
    basename_template=str(uuid.uuid4()) + "-{i}.parquet",
    existing_data_behavior="delete_matching",
    file_visitor=f_visit
)
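
Since the write is reported to succeed when file_visitor is omitted, one possible workaround is to skip the callback and find the written files afterwards by their unique basename prefix. The following is only a sketch under that assumption: list_written_files and write_and_list are hypothetical helper names, not part of pyarrow, and this approach recovers only file paths, not any other information the visitor callback would receive.

```python
import pathlib
import uuid


def list_written_files(base_dir, prefix):
    """Return sorted paths of files under base_dir named '<prefix>-*.parquet'.

    Hypothetical helper, not a pyarrow API.
    """
    base = pathlib.Path(base_dir)
    # rglob also searches subdirectories, in case partitioning is used.
    return sorted(
        str(p) for p in base.rglob(f"{prefix}-*.parquet") if p.is_file()
    )


def write_and_list(table, base_dir):
    """Write the dataset without file_visitor, then look up the new files."""
    import pyarrow.dataset as ds  # imported lazily; requires pyarrow

    # Same naming scheme as the reproducer: a fresh UUID as the file prefix.
    prefix = str(uuid.uuid4())
    ds.write_dataset(
        table,
        base_dir=base_dir,
        format="parquet",
        basename_template=prefix + "-{i}.parquet",
        existing_data_behavior="delete_matching",
    )
    return list_written_files(base_dir, prefix)
```

With the reproducer above, write_and_list(tbl, data_dir) would stand in for the ds.write_dataset call and the f_list bookkeeping. It relies on the UUID prefix being unique within the output directory.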

Environment

  • Python: 3.12
  • Arrow version: 19, 20
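
The list above does not name the operating system or CPU architecture, which can matter when comparing crash reports across machines. A minimal sketch for collecting those details with the standard library follows; collect_env is a hypothetical helper name, and the set of fields is an assumption about what is useful here.

```python
import platform
from importlib import metadata


def collect_env(package="pyarrow"):
    """Gather version and platform details for a bug report."""
    try:
        pkg_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        pkg_version = "not installed"
    return {
        "python": platform.python_version(),
        "package": package,
        "package_version": pkg_version,
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
    }


# Print one "key: value" line per field, ready to paste into the issue.
for key, value in collect_env().items():
    print(f"{key}: {value}")
```

Running the reproducer with python -X faulthandler may also print a Python-level traceback when the process crashes, which could help locate the failing call.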

Component(s)

C++, Python

wingerted avatar Jun 13 '25 02:06 wingerted

Hello, I tried to reproduce your issue, but it works fine for me. My configuration is the following:

  • Arrow: pyarrow 20.0.0
  • Python: 3.12.9
  • macOS Sequoia (arm64, Apple M1 Pro)

lukland avatar Jun 24 '25 07:06 lukland