feat(Improve Datasets)

Open tcapelle opened this issue 1 year ago • 1 comments

The Dataset class is still very lightweight but has a lot of potential, even more so now that we will have feedback and ways to annotate data. Let's aim for some feature parity with hf-datasets.

  • Add some convenience __str__, __len__, and __iter__ methods:
>> print(ds)
Dataset({
    name: 'cape_dev',
    features: ['id', 'text', 'length'],
    num_rows: 10
})

>> len(ds)
10

for row in ds:
    print(row)
{'id': 1, 'text': 'The quick brown fox jumps over the lazy dog.', 'length': 43, 'other_text': 'The '}
{'id': 2, 'text': 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.', 'length': 56, 'other_text': 'Lore'}
{'id': 3, 'text': 'To be or not to be, that is the question.', 'length': 41, 'other_text': 'To b'}
{'id': 4, 'text': 'All that glitters is not gold.', 'length': 30, 'other_text': 'All '}
{'id': 5, 'text': 'A journey of a thousand miles begins with a single step.', 'length': 58, 'other_text': 'A jo'}
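A minimal sketch of how these three methods could look, written against a hypothetical stand-in class (`SimpleDataset` is not weave's `Dataset`, and the `features` property is an assumption made for illustration). It only shows the shape of the dunder methods, not how they would be wired into the real class.

```python
from typing import Any, Dict, Iterator, List


class SimpleDataset:
    """Hypothetical stand-in for illustration; not weave.Dataset."""

    def __init__(self, name: str, rows: List[Dict[str, Any]]) -> None:
        self.name = name
        self.rows = list(rows)

    @property
    def features(self) -> List[str]:
        # Collect column names across all rows, in first-seen order.
        seen: Dict[str, None] = {}
        for row in self.rows:
            seen.update(dict.fromkeys(row))
        return list(seen)

    def __len__(self) -> int:
        return len(self.rows)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        return iter(self.rows)

    def __str__(self) -> str:
        # Mirrors the hf-datasets-like summary shown in this issue.
        return (
            "Dataset({\n"
            f"    name: {self.name!r},\n"
            f"    features: {self.features},\n"
            f"    num_rows: {len(self)}\n"
            "})"
        )
```

With these in place, `print(ds)`, `len(ds)`, and `for row in ds` work on the stand-in the same way as in the examples above.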

  • Add a map method. It currently bakes the asyncio.run call inside, which may not be a good idea?
from weave import Dataset


rows = [
    {"id": 1, "text": "The quick brown fox jumps over the lazy dog.", "length": 43},
    {"id": 2, "text": "Lorem ipsum dolor sit amet, consectetur adipiscing elit.", "length": 56},
    {"id": 3, "text": "To be or not to be, that is the question.", "length": 41},
    {"id": 4, "text": "All that glitters is not gold.", "length": 30},
    {"id": 5, "text": "A journey of a thousand miles begins with a single step.", "length": 58},
]


ds = Dataset(name="cape_dev", rows=rows)

def f(text: str):
    return {"other_text": text[0:4], "text_length": len(text)}

mapped_ds = ds.map(f)
print(mapped_ds)

Mapped 5 of 5 examples in 0.00 seconds
Dataset({
    name: 'cape_dev',
    features: ['id', 'text', 'text_length', 'other_text'],
    num_rows: 5
})
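On the asyncio.run question: calling `asyncio.run` from code that already runs inside an event loop (a notebook, an async web handler) raises `RuntimeError`, so one option is to expose an async variant and keep the sync one as a thin wrapper. The sketch below works on plain lists of dicts. `amap_rows` and `map_rows` are hypothetical names, and matching function arguments to columns by parameter name is an assumption based on the `f(text: str)` example above. It keeps every original column and adds the returned ones; whether `map` should drop or rename columns is left open.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


async def amap_rows(rows: List[Row], fn: Callable[..., Any]) -> List[Row]:
    """Apply fn to each row, passing only the columns named in fn's signature."""
    params = list(inspect.signature(fn).parameters)

    async def apply(row: Row) -> Row:
        result = fn(**{name: row[name] for name in params})
        if inspect.isawaitable(result):  # also accept `async def` functions
            result = await result
        return {**row, **result}  # original columns plus the new ones

    # gather() returns results in the same order as the input rows.
    return list(await asyncio.gather(*(apply(row) for row in rows)))


def map_rows(rows: List[Row], fn: Callable[..., Any]) -> List[Row]:
    """Sync convenience wrapper; refuses to run inside an active event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running, so it is safe to start one here.
        return asyncio.run(amap_rows(rows, fn))
    raise RuntimeError(
        "map_rows() was called from a running event loop; "
        "use `await amap_rows(...)` instead"
    )
```

Scripts can call `map_rows(rows, f)` directly, while async callers use `await amap_rows(rows, f)` and never hit a nested `asyncio.run`.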

tcapelle • Jun 27 '24 13:06