How to assign new values to Dataset?

Hi, if I want to change some values in a dataset, or add new columns to it, how can I do that?
For example, I want to change all the labels of the SST2 dataset to 0:
from datasets import load_dataset
data = load_dataset('glue','sst2')
data['train']['label'] = [0]*len(data['train'])
I will get the error:
TypeError: 'Dataset' object does not support item assignment
Hi! One option is to use map with a function that overwrites the labels (dset = dset.map(lambda _: {"label": 0}, features=dset.features)). Or you can use the remove_columns + add_column combination (dset = dset.remove_columns("label").add_column("label", [0]*len(dset)).cast(dset.features)), but note that this approach creates an in-memory table for the added column instead of writing to disk, which could be problematic for large datasets.
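Put together, a minimal end-to-end sketch of both approaches with the SST-2 setup from your question (assuming the goal is to zero out the train labels):

from datasets import load_dataset

data = load_dataset('glue', 'sst2')
train = data['train']

# Option 1: map overwrites the column row by row; passing the original
# features keeps the ClassLabel type instead of letting it be re-inferred.
train_mapped = train.map(lambda _: {"label": 0}, features=train.features)

# Option 2: drop the column, add a fresh one, and cast back to the
# original features (the new column lives in memory, not on disk).
train_replaced = (
    train.remove_columns("label")
         .add_column("label", [0] * len(train))
         .cast(train.features)
)

print(train_mapped[0]['label'], train_replaced[0]['label'])  # 0 0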
Hi! I tried your proposed solution, but unfortunately it does not solve my problem. I am working with a set of protein sequences tokenized with ESM; some sequences are longer than max_length, so they were truncated during tokenization. Now I want to truncate my labels accordingly, but I cannot get that to work with a mapping (e.g. dset.map as you suggested). Specifically, what I did was the following:
def postprocess_tokenize(tokenized_data):
    """
    Adjust label lengths if they don't match.
    """
    if len(tokenized_data['input_ids']) < len(tokenized_data['labels']):
        new_labels = tokenized_data['labels'][:len(tokenized_data['input_ids'])]
        tokenized_data["labels"] = new_labels
    return tokenized_data

tokenized_data = tokenized_data.map(postprocess_tokenize, batched=True)  # this does not adjust the labels...
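One thing I suspect (but have not confirmed): with batched=True, each column arrives as a list of lists, so len(tokenized_data['input_ids']) is just the batch size for both columns and the condition is essentially never true. A batched version would presumably have to slice each example individually, something like this (postprocess_tokenize_batched is just a placeholder name):

def postprocess_tokenize_batched(batch):
    # With batched=True, batch['input_ids'] is a list of token-id lists,
    # so truncate each label sequence to the length of its own input.
    batch['labels'] = [
        labels[:len(input_ids)]
        for input_ids, labels in zip(batch['input_ids'], batch['labels'])
    ]
    return batch

tokenized_data = tokenized_data.map(postprocess_tokenize_batched, batched=True)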
Any tips on how to do this properly?
More generally, why does the DataCollator support padding but not truncation? That seems odd to me.
Thanks in advance!