Huggingface Integration doesn't work when `streaming=True`

Open tanaymeh opened this issue 10 months ago • 0 comments

The Hugging Face integration for converting an HF dataset to Lance fails when the dataset is loaded in streaming mode (`streaming=True`). The snippet below reproduces the error.

import lance
import pyarrow as pa

from transformers import AutoTokenizer
from datasets import load_dataset

ds = load_dataset('wikitext', name='wikitext-103-raw-v1', split='validation', streaming=True)
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Some rows in the dataset contain empty strings; tokenize a placeholder for those
def tokenize(text):
    if len(text['text']) == 0:
        return tokenizer("<UNK>")
    return tokenizer(text['text'])

ds = ds.map(tokenize, batched=False, remove_columns=['text'])

# Explicit schema: lance.write_dataset expects one when streaming is set to True
schema = pa.schema([
    pa.field("input_ids", pa.list_(pa.int64(), -1)),
    pa.field("attention_mask", pa.list_(pa.int64(), -1)),
])

lance.write_dataset(ds, "dataset.lance", schema=schema)

When executed, the above snippet causes the following error:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-51-dce90d892700> in <cell line: 25>()
     23 ])
     24 
---> 25 lance.write_dataset(ds, "dataset.lance", schema=schema)

/usr/local/lib/python3.10/dist-packages/lance/dataset.py in write_dataset(data_obj, uri, schema, mode, max_rows_per_file, max_rows_per_group, max_bytes_per_file, commit_lock, progress, storage_options)
   2402 
   2403     uri = os.fspath(uri) if isinstance(uri, Path) else uri
-> 2404     inner_ds = _write_dataset(reader, uri, params)
   2405 
   2406     ds = LanceDataset.__new__(LanceDataset)

OSError: LanceError(Arrow): C Data interface error: Type error: Expected RecordBatch, got <class 'dict'>. Detail: Python exception: TypeError, /home/runner/work/lance/lance/rust/lance-datafusion/src/utils.rs:41:28

However, the same snippet runs to completion without any error when `streaming=False`.
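
The error message says the writer received a plain `dict` where it expected a `RecordBatch`, so one possible workaround is to convert the streamed rows into Arrow record batches before handing them to Lance. The sketch below is an assumption, not a confirmed fix: `chunk_rows` and `record_batches` are hypothetical helper names, and it has not been verified against this Lance version.

```python
from itertools import islice


def chunk_rows(rows, batch_size=1024):
    """Group an iterable of row dicts into lists of at most batch_size rows."""
    it = iter(rows)
    while chunk := list(islice(it, batch_size)):
        yield chunk


def record_batches(rows, schema, batch_size=1024):
    """Yield one pyarrow.RecordBatch per chunk of row dicts."""
    import pyarrow as pa  # imported here so chunk_rows stays dependency-free

    for chunk in chunk_rows(rows, batch_size):
        # Keep only the columns named in the schema, then build the batch.
        projected = [{name: row[name] for name in schema.names} for row in chunk]
        yield pa.RecordBatch.from_pylist(projected, schema=schema)


# Hypothetical usage with the `ds` and `schema` from the snippet above:
# reader = pa.RecordBatchReader.from_batches(schema, record_batches(ds, schema))
# lance.write_dataset(reader, "dataset.lance", schema=schema)
```

Batching the rows keeps memory bounded while still streaming, since only `batch_size` rows are materialized at a time.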

tanaymeh · Apr 17 '24 08:04