Huggingface Integration doesn't work when `streaming=True`
The Hugging Face integration for converting an HF dataset to Lance doesn't work when using streaming mode. The snippet below reproduces the error.
import lance
import pyarrow as pa
from transformers import AutoTokenizer
from datasets import load_dataset
ds = load_dataset('wikitext', name='wikitext-103-raw-v1', split='validation', streaming=True)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# Some rows in the dataset have empty strings
def tokenize(text):
    if len(text['text']) == 0:
        return tokenizer("<UNK>")
    return tokenizer(text['text'])
ds = ds.map(tokenize, batched=False, remove_columns=['text'])
# When streaming=True, lance.write_dataset expects an explicit schema
schema = pa.schema([
pa.field("input_ids", pa.list_(pa.int64(), -1)),
pa.field("attention_mask", pa.list_(pa.int64(), -1)),
])
lance.write_dataset(ds, "dataset.lance", schema=schema)
When executed, the above snippet causes the following error:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-51-dce90d892700> in <cell line: 25>()
23 ])
24
---> 25 lance.write_dataset(ds, "dataset.lance", schema=schema)
/usr/local/lib/python3.10/dist-packages/lance/dataset.py in write_dataset(data_obj, uri, schema, mode, max_rows_per_file, max_rows_per_group, max_bytes_per_file, commit_lock, progress, storage_options)
2402
2403 uri = os.fspath(uri) if isinstance(uri, Path) else uri
-> 2404 inner_ds = _write_dataset(reader, uri, params)
2405
2406 ds = LanceDataset.__new__(LanceDataset)
OSError: LanceError(Arrow): C Data interface error: Type error: Expected RecordBatch, got <class 'dict'>. Detail: Python exception: TypeError, /home/runner/work/lance/lance/rust/lance-datafusion/src/utils.rs:41:28
However, when the same snippet is run with streaming=False, it finishes successfully without any error.
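A possible workaround until the integration handles streaming (iterable) datasets directly is to batch the streamed rows into pyarrow RecordBatches and hand lance.write_dataset a RecordBatchReader instead of the IterableDataset. This is only a sketch: the to_batches helper and the batch size are illustrative and not part of the lance or datasets API.

import itertools

# Hypothetical helper (not part of lance or datasets): group streamed row dicts
# into fixed-size chunks and convert each chunk into a pyarrow RecordBatch.
def to_batches(iterable_ds, schema, batch_size=1024):
    rows = iter(iterable_ds)
    while True:
        chunk = list(itertools.islice(rows, batch_size))
        if not chunk:
            break
        yield pa.RecordBatch.from_pylist(chunk, schema=schema)

reader = pa.RecordBatchReader.from_batches(schema, to_batches(ds, schema))
lance.write_dataset(reader, "dataset.lance", schema=schema)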