Dataset.map causes incorrect overflow_to_sample_mapping when used with tokenizers and small batch size
Describe the bug
When using a tokenizer, we can retrieve the overflow_to_sample_mapping field, since long samples overflow into multiple token sequences.
However, when tokenization is done via Dataset.map with num_proc > 1, the overflow_to_sample_mapping field is wrong. This seems to be because each worker's tokenizer only sees its own share of the samples and maps back to indices within that share, and Dataset.map then concatenates the results.
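A quick illustration of that point (my own snippet, not from the report): the mapping a fast tokenizer returns with return_overflowing_tokens=True is always relative to the batch it receives in that call, so a worker that only sees its own shard can only ever emit shard-local indices.

```python
import transformers

# Illustration only: the checkpoint is the one used in the minimal example further down.
tokenizer = transformers.AutoTokenizer.from_pretrained('deepset/tinyroberta-squad2')
out = tokenizer(text=['What time is it?'],
                text_pair=['Another paragraph goes here'],
                truncation='only_second',
                return_overflowing_tokens=True)
# [0] -- indices are relative to this call's batch, regardless of the sample's
# position in the full dataset.
print(out['overflow_to_sample_mapping'])
```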
Steps to reproduce the bug
- Make a dataset of 3 strings.
- Tokenize via Dataset.map with num_proc = 8
- Inspect the overflow_to_sample_mapping field (a sketch of these steps follows the list)
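A minimal sketch of those steps (the checkpoint and the three strings are placeholders of my choosing, not the reporter's data):

```python
import datasets
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('roberta-base')  # placeholder checkpoint
ds = datasets.Dataset.from_dict({'text': ['first string', 'second string', 'third string']})

def tok(batch):
    return tokenizer(batch['text'], truncation=True, return_overflowing_tokens=True)

# With only 3 rows, num_proc=8 means each worker sees at most one sample
# (datasets may reduce num_proc to the number of rows and log a warning).
tokens = ds.map(tok, batched=True, num_proc=8)
print(tokens['overflow_to_sample_mapping'])
```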
Expected results
[0, 1, 2]
Actual results
[0, 0, 0]
Notes:
- I have not yet extracted a minimal example, but the above reproduces reliably
- If the dataset is large, I have yet to determine whether this bug (a) does not happen at all, (b) always happens, or (c) happens only on the small leftover batch at the end.
I've built a minimal example that shows this bug without num_proc. It seems to be a problem with any combination of tokenizers, overflow_to_sample_mapping, and Dataset.map when the batch size is small:
```python
import datasets
import transformers

pretrained = 'deepset/tinyroberta-squad2'
tokenizer = transformers.AutoTokenizer.from_pretrained(pretrained)

questions = ['Can you tell me why?', 'What time is it?']
contexts = ['This is context zero', 'Another paragraph goes here']

def tok(questions, contexts):
    return tokenizer(text=questions,
                     text_pair=contexts,
                     truncation='only_second',
                     return_overflowing_tokens=True,
                     )

print(tok(questions, contexts)['overflow_to_sample_mapping'])
assert tok(questions, contexts)['overflow_to_sample_mapping'] == [0, 1]  # PASSES

def tok2(d):
    return tok(d['question'], d['context'])

ds = datasets.Dataset.from_dict({'question': questions, 'context': contexts})
tokens = ds.map(tok2, batched=True, batch_size=1)
print(tokens['overflow_to_sample_mapping'])
assert tokens['overflow_to_sample_mapping'] == [0, 1]  # FAILS, produces [0, 0]
```
Note that even if the batch size is larger, there will be situations where we do not have much data and end up with small batches. This can happen, for example, when num_proc leaves each worker with an underfilled batch; I imagine it can also happen in other ways, e.g. on the final leftover batch at the end.
A larger batch size does not have this behavior:
```python
def tok2(d):
    return tok(d['question'], d['context'])

ds = datasets.Dataset.from_dict({'question': questions, 'context': contexts})
tokens = ds.map(tok2, batched=True, batch_size=2)
print(tokens['overflow_to_sample_mapping'])
assert tokens['overflow_to_sample_mapping'] == [0, 1]  # PASSES
```
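For completeness, here is a sketch of the leftover-batch case mentioned above, built on the snippet just shown (the third question/context pair is my own invention), followed by a possible workaround that rewrites the batch-local mapping into absolute row indices using map's with_indices=True:

```python
questions3 = questions + ['Where are we going?']
contexts3 = contexts + ['A third piece of context']
ds3 = datasets.Dataset.from_dict({'question': questions3, 'context': contexts3})

# The third pair lands alone in a leftover batch, so its mapping restarts at 0.
tokens3 = ds3.map(tok2, batched=True, batch_size=2)
print(tokens3['overflow_to_sample_mapping'])  # [0, 1, 0] rather than [0, 1, 2]

# Possible workaround: with_indices=True passes each batch its global row indices,
# which can be used to turn the batch-local mapping into absolute sample indices.
def tok3(d, indices):
    out = tok(d['question'], d['context'])
    out['overflow_to_sample_mapping'] = [indices[i] for i in out['overflow_to_sample_mapping']]
    return out

tokens3 = ds3.map(tok3, with_indices=True, batched=True, batch_size=2)
print(tokens3['overflow_to_sample_mapping'])  # [0, 1, 2]
```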
I was trying the question answering tutorial on Hugging Face when I ran into the same problem. The preprocessing step is here. I have changed max_length=200, stride=50:
```python
validation_dataset = raw_datasets['validation'].select(range(8)).map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
    num_proc=1,
)
print(validation_dataset['overflow_to_sample_mapping'])
print(validation_dataset['example_id'])
```
Result with num_proc=1:
[0, 1, 2, 3, 4, 5, 6, 7]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
Result with num_proc=2:
[0, 1, 2, 3, 0, 1, 2, 3]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
Result with num_proc=3:
[0, 1, 2, 0, 1, 2, 0, 1]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
The overflow_to_sample_mapping changes with num_proc, but the example_id field remains the same. It seems that each process in map has its own counter for overflow_to_sample_mapping. If you use overflow_to_sample_mapping inside the preprocess_validation_examples function, then there is no issue.
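For reference, a sketch of that pattern, loosely following the course's preprocess_validation_examples with the max_length=200 / stride=50 values mentioned above (the SQuAD-style id, question, and context columns are assumptions): the per-batch mapping is consumed immediately, where its indices are still meaningful, to copy each parent example's id onto every overflowed feature.

```python
def preprocess_validation_examples(examples):
    inputs = tokenizer(
        examples['question'],
        examples['context'],
        max_length=200,
        stride=50,
        truncation='only_second',
        return_overflowing_tokens=True,
    )
    # Use the mapping here, while its indices still refer to the current batch.
    sample_map = inputs['overflow_to_sample_mapping']
    inputs['example_id'] = [examples['id'][i] for i in sample_map]
    return inputs
```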