How to assign new values to Dataset?

Hi, if I want to change some values in a dataset, or add new columns to it, how can I do that?
For example, I want to change all the labels of the SST2 dataset to 0:
from datasets import load_dataset
data = load_dataset('glue','sst2')
data['train']['label'] = [0]*len(data['train'])
I will get the error:
TypeError: 'Dataset' object does not support item assignment
Hi! One option is to use map with a function that overwrites the labels (dset = dset.map(lambda _: {"label": 0}, features=dset.features)). Or you can use the remove_columns + add_column combination (dset = dset.remove_columns("label").add_column("label", [0]*len(dset)).cast(dset.features)), but note that this approach creates an in-memory table for the added column instead of writing to disk, which could be problematic for large datasets.
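Put together, a minimal end-to-end sketch of both approaches with the SST-2 setup from your question (assuming the goal is to zero out the train labels):

from datasets import load_dataset

data = load_dataset('glue', 'sst2')
train = data['train']

# Option 1: map overwrites the column row by row; passing the original
# features keeps the ClassLabel type instead of letting it be re-inferred.
train_mapped = train.map(lambda _: {"label": 0}, features=train.features)

# Option 2: drop the column, add a fresh one, and cast back to the
# original features (the new column lives in memory, not on disk).
train_replaced = (
    train.remove_columns("label")
         .add_column("label", [0] * len(train))
         .cast(train.features)
)

print(train_mapped[0]['label'], train_replaced[0]['label'])  # 0 0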
Hi! I tried your proposed solution, but unfortunately it does not solve my problem. I am working with a set of protein sequences tokenized with ESM; some sequences are longer than max_length, so they were truncated during tokenization. Now I want to truncate my labels accordingly, but I cannot get that to work with a mapping (e.g. dset.map as you suggested). Specifically, what I did was the following:
def postprocess_tokenize(tokenized_data):
    """
    Adjust label lengths if they don't match.
    """
    if len(tokenized_data['input_ids']) < len(tokenized_data['labels']):
        new_labels = tokenized_data['labels'][:len(tokenized_data['input_ids'])]
        tokenized_data["labels"] = new_labels
    return tokenized_data

tokenized_data = tokenized_data.map(postprocess_tokenize, batched=True)  # this does not adjust the labels...
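One thing I suspect (but have not confirmed): with batched=True, each column arrives as a list of lists, so len(tokenized_data['input_ids']) is just the batch size for both columns and the condition is essentially never true. A batched version would presumably have to slice each example individually, something like this (postprocess_tokenize_batched is just a placeholder name):

def postprocess_tokenize_batched(batch):
    # With batched=True, batch['input_ids'] is a list of token-id lists,
    # so truncate each label sequence to the length of its own input.
    batch['labels'] = [
        labels[:len(input_ids)]
        for input_ids, labels in zip(batch['input_ids'], batch['labels'])
    ]
    return batch

tokenized_data = tokenized_data.map(postprocess_tokenize_batched, batched=True)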
Any tips on how to do this properly?
More generally, why does the DataCollator support padding but not truncation? That seems odd to me.
Thanks in advance!