Dataset.map causes incorrect overflow_to_sample_mapping when used with tokenizers and small batch size
Describe the bug
When using a tokenizer, we can retrieve the overflow_to_sample_mapping field, since long samples overflow into multiple token sequences.
However, when tokenization is done via Dataset.map with num_proc > 1, the overflow_to_sample_mapping field is wrong. This seems to be because each worker's tokenizer only sees its own share of the samples and maps back to indices within that share, and Dataset.map then concatenates the results.
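A quick illustration of that point (my own snippet, not from the report): the mapping a fast tokenizer returns with return_overflowing_tokens=True is always relative to the batch it receives in that call, so a worker that only sees its own shard can only ever emit shard-local indices.

```python
import transformers

# Illustration only: the checkpoint is the one used in the minimal example further down.
tokenizer = transformers.AutoTokenizer.from_pretrained('deepset/tinyroberta-squad2')
out = tokenizer(text=['What time is it?'],
                text_pair=['Another paragraph goes here'],
                truncation='only_second',
                return_overflowing_tokens=True)
# [0] -- indices are relative to this call's batch, regardless of the sample's
# position in the full dataset.
print(out['overflow_to_sample_mapping'])
```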
Steps to reproduce the bug
- Make a dataset of 3 strings.
- Tokenize via Dataset.map with num_proc = 8
- Inspect the overflow_to_sample_mapping field (a sketch of these steps follows the list)
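A minimal sketch of those steps (the checkpoint and the three strings are placeholders of my choosing, not the reporter's data):

```python
import datasets
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('roberta-base')  # placeholder checkpoint
ds = datasets.Dataset.from_dict({'text': ['first string', 'second string', 'third string']})

def tok(batch):
    return tokenizer(batch['text'], truncation=True, return_overflowing_tokens=True)

# With only 3 rows, num_proc=8 means each worker sees at most one sample
# (datasets may reduce num_proc to the number of rows and log a warning).
tokens = ds.map(tok, batched=True, num_proc=8)
print(tokens['overflow_to_sample_mapping'])
```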
Expected results
[0, 1, 2]
Actual results
[0, 0, 0]
Notes:
- I have not yet extracted a minimal example, but the above reproduces reliably
- If the dataset is large, I have yet to determine whether this bug (a) does not happen at all, (b) always happens, or (c) happens only on the small leftover batch at the end.
I've built a minimal example that shows this bug without num_proc. It seems to be a problem with any combination of tokenizers, overflow_to_sample_mapping, and Dataset.map when the batch size is small:
```python
import datasets
import transformers

pretrained = 'deepset/tinyroberta-squad2'
tokenizer = transformers.AutoTokenizer.from_pretrained(pretrained)

questions = ['Can you tell me why?', 'What time is it?']
contexts = ['This is context zero', 'Another paragraph goes here']

def tok(questions, contexts):
    return tokenizer(text=questions,
                     text_pair=contexts,
                     truncation='only_second',
                     return_overflowing_tokens=True,
                     )

print(tok(questions, contexts)['overflow_to_sample_mapping'])
assert tok(questions, contexts)['overflow_to_sample_mapping'] == [0, 1]  # PASSES

def tok2(d):
    return tok(d['question'], d['context'])

ds = datasets.Dataset.from_dict({'question': questions, 'context': contexts})
tokens = ds.map(tok2, batched=True, batch_size=1)
print(tokens['overflow_to_sample_mapping'])
assert tokens['overflow_to_sample_mapping'] == [0, 1]  # FAILS, produces [0, 0]
```
Note that even if the batch size is larger, there will be situations where we do not have much data and end up with small batches. This can happen, for example, when num_proc leaves each worker with an underfilled batch; I imagine it can also happen in other ways, e.g. on the final leftover batch at the end.
A larger batch size does not have this behavior:
```python
def tok2(d):
    return tok(d['question'], d['context'])

ds = datasets.Dataset.from_dict({'question': questions, 'context': contexts})
tokens = ds.map(tok2, batched=True, batch_size=2)
print(tokens['overflow_to_sample_mapping'])
assert tokens['overflow_to_sample_mapping'] == [0, 1]  # PASSES
```
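For completeness, here is a sketch of the leftover-batch case mentioned above, built on the snippet just shown (the third question/context pair is my own invention), followed by a possible workaround that rewrites the batch-local mapping into absolute row indices using map's with_indices=True:

```python
questions3 = questions + ['Where are we going?']
contexts3 = contexts + ['A third piece of context']
ds3 = datasets.Dataset.from_dict({'question': questions3, 'context': contexts3})

# The third pair lands alone in a leftover batch, so its mapping restarts at 0.
tokens3 = ds3.map(tok2, batched=True, batch_size=2)
print(tokens3['overflow_to_sample_mapping'])  # [0, 1, 0] rather than [0, 1, 2]

# Possible workaround: with_indices=True passes each batch its global row indices,
# which can be used to turn the batch-local mapping into absolute sample indices.
def tok3(d, indices):
    out = tok(d['question'], d['context'])
    out['overflow_to_sample_mapping'] = [indices[i] for i in out['overflow_to_sample_mapping']]
    return out

tokens3 = ds3.map(tok3, with_indices=True, batched=True, batch_size=2)
print(tokens3['overflow_to_sample_mapping'])  # [0, 1, 2]
```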
I was trying the question answering tutorial on Hugging Face when I ran into the same problem. The preprocessing step is here. I have changed max_length=200, stride=50:
```python
validation_dataset = raw_datasets['validation'].select(range(8)).map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
    num_proc=1,
)
print(validation_dataset['overflow_to_sample_mapping'])
print(validation_dataset['example_id'])
```
Result with num_proc=1:
[0, 1, 2, 3, 4, 5, 6, 7]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
Result with num_proc=2:
[0, 1, 2, 3, 0, 1, 2, 3]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
Result with num_proc=3:
[0, 1, 2, 0, 1, 2, 0, 1]
['56be4db0acb8001400a502ec', '56be4db0acb8001400a502ed', '56be4db0acb8001400a502ee',
'56be4db0acb8001400a502ef', '56be4db0acb8001400a502f0', '56be8e613aeaaa14008c90d1',
'56be8e613aeaaa14008c90d2', '56be8e613aeaaa14008c90d3']
The overflow_to_sample_mapping changes with num_proc, but the example_id field remains the same. It seems that each process in map has its own counter for overflow_to_sample_mapping. If you use overflow_to_sample_mapping inside the preprocess_validation_examples function, then there is no issue.
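For reference, a sketch of that pattern, loosely following the course's preprocess_validation_examples with the max_length=200 / stride=50 values mentioned above (the SQuAD-style id, question, and context columns are assumptions): the per-batch mapping is consumed immediately, where its indices are still meaningful, to copy each parent example's id onto every overflowed feature.

```python
def preprocess_validation_examples(examples):
    inputs = tokenizer(
        examples['question'],
        examples['context'],
        max_length=200,
        stride=50,
        truncation='only_second',
        return_overflowing_tokens=True,
    )
    # Use the mapping here, while its indices still refer to the current batch.
    sample_map = inputs['overflow_to_sample_mapping']
    inputs['example_id'] = [examples['id'][i] for i in sample_map]
    return inputs
```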