# `Dataset.map` gets stuck on `_cast_to_python_objects`
**Describe the bug**
`Dataset.map`, when fed a Hugging Face tokenizer as its map function, can sometimes spend huge amounts of time doing casts. A minimal example follows.
Not all usages suffer from this. For example, I profiled the preprocessor at https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb , and it did not have this problem. However, I'm at a loss to figure out how it avoids it, as the example below is simple and minimal and still has this problem.
Where it occurs, this casting causes `Dataset.map` to run approximately 7x slower than code which does not trigger it.
This may be related to https://github.com/huggingface/datasets/issues/1046. However, the tokenizer here is not set to return tensors.
**Steps to reproduce the bug**
A minimal, self-contained example to reproduce is below:
```python
import transformers
from transformers import AutoTokenizer
from datasets import load_dataset
import torch
import cProfile

pretrained = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained)

squad = load_dataset('squad')
squad_train = squad['train']
squad_tiny = squad_train.select(range(5000))

assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

def tokenize(ds):
    tokens = tokenizer(text=ds['question'],
                       text_pair=ds['context'],
                       add_special_tokens=True,
                       padding='max_length',
                       truncation='only_second',
                       max_length=160,
                       stride=32,
                       return_overflowing_tokens=True,
                       return_offsets_mapping=True,
                       )
    return tokens

cmd = 'squad_tiny.map(tokenize, batched=True, remove_columns=squad_tiny.column_names)'
cProfile.run(cmd, sort='tottime')
```
**Actual results**
The code works, but takes 10-25 seconds per batch (about 7x slower than non-casting code), with the following profile. Note that `_cast_to_python_objects` is the culprit:
```
63524075 function calls (58206482 primitive calls) in 121.836 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
5274034/40 68.751 0.000 111.060 2.776 features.py:262(_cast_to_python_objects)
42223832 24.077 0.000 33.310 0.000 {built-in method builtins.isinstance}
16338/20 5.121 0.000 111.053 5.553 features.py:361(<listcomp>)
5274135 4.747 0.000 4.749 0.000 {built-in method _abc._abc_instancecheck}
80/40 4.731 0.059 116.292 2.907 {pyarrow.lib.array}
5274135 4.485 0.000 9.234 0.000 abc.py:96(__instancecheck__)
2661564/2645196 2.959 0.000 4.298 0.000 features.py:1081(_check_non_null_non_empty_recursive)
5 2.786 0.557 2.786 0.557 {method 'encode_batch' of 'tokenizers.Tokenizer' objects}
2668052 0.930 0.000 0.930 0.000 {built-in method builtins.len}
5000 0.930 0.000 0.938 0.000 tokenization_utils_fast.py:187(_convert_encoding)
5 0.750 0.150 0.808 0.162 {method 'to_pydict' of 'pyarrow.lib.Table' objects}
1 0.444 0.444 121.749 121.749 arrow_dataset.py:2501(_map_single)
40 0.375 0.009 116.291 2.907 arrow_writer.py:151(__arrow_array__)
10 0.066 0.007 0.066 0.007 {method 'write_batch' of 'pyarrow.lib._CRecordBatchWriter' objects}
1 0.060 0.060 121.835 121.835 fingerprint.py:409(wrapper)
11387/5715 0.049 0.000 0.175 0.000 {built-in method builtins.getattr}
36 0.049 0.001 0.049 0.001 {pyarrow._compute.call_function}
15000 0.040 0.000 0.040 0.000 _collections_abc.py:719(__iter__)
3 0.023 0.008 0.023 0.008 {built-in method _imp.create_dynamic}
77 0.020 0.000 0.020 0.000 {built-in method builtins.dir}
37 0.019 0.001 0.019 0.001 socket.py:543(send)
15 0.017 0.001 0.017 0.001 tokenization_utils_fast.py:460(<listcomp>)
432/421 0.015 0.000 0.024 0.000 traitlets.py:1388(_notify_observers)
5000 0.015 0.000 0.018 0.000 _collections_abc.py:672(keys)
51 0.014 0.000 0.042 0.001 traitlets.py:276(getmembers)
5 0.014 0.003 3.775 0.755 tokenization_utils_fast.py:392(_batch_encode_plus)
3/1 0.014 0.005 0.035 0.035 {built-in method _imp.exec_dynamic}
5 0.012 0.002 0.950 0.190 tokenization_utils_fast.py:438(<listcomp>)
31626 0.012 0.000 0.012 0.000 {method 'append' of 'list' objects}
1532/1001 0.011 0.000 0.189 0.000 traitlets.py:643(get)
5 0.009 0.002 3.796 0.759 arrow_dataset.py:2631(apply_function_on_filtered_inputs)
51 0.009 0.000 0.062 0.001 traitlets.py:1766(traits)
5 0.008 0.002 3.784 0.757 tokenization_utils_base.py:2632(batch_encode_plus)
368 0.007 0.000 0.044 0.000 traitlets.py:1715(_get_trait_default_generator)
26 0.007 0.000 0.022 0.001 traitlets.py:1186(setup_instance)
51 0.006 0.000 0.010 0.000 traitlets.py:1781(<listcomp>)
80/32 0.006 0.000 0.052 0.002 table.py:1758(cast_array_to_feature)
684 0.006 0.000 0.007 0.000 {method 'items' of 'dict' objects}
4344/1794 0.006 0.000 0.192 0.000 traitlets.py:675(__get__)
...
```
**Environment info**
I observed this on both Google Colab and my local workstation:
**Google Colab**
- `datasets` version: 2.3.2
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- PyArrow version: 6.0.1
- Pandas version: 1.3.5
**Local**
- `datasets` version: 2.3.2
- Platform: Windows-7-6.1.7601-SP1
- Python version: 3.8.10
- PyArrow version: 8.0.0
- Pandas version: 1.4.3
Are you able to reproduce this? My example is small enough that it should be easy to try.
Hi! Thanks for reporting and providing a reproducible example. Indeed, by default, `datasets` performs an expensive cast on the values returned by `map` to convert them to one of the types supported by PyArrow (the underlying storage format used by `datasets`). This cast is not needed on NumPy arrays, as PyArrow supports them natively, so one way to make this transform faster is to add `return_tensors="np"` to the tokenizer call.
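For the repro above, that change would look something like this (a sketch; the `return_tensors` argument is the only difference from the original `tokenize`):

```python
def tokenize(ds):
    return tokenizer(text=ds['question'],
                     text_pair=ds['context'],
                     add_special_tokens=True,
                     padding='max_length',
                     truncation='only_second',
                     max_length=160,
                     stride=32,
                     return_overflowing_tokens=True,
                     return_offsets_mapping=True,
                     return_tensors='np',  # NumPy output: PyArrow ingests it without the Python-level cast
                     )
```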
I think we should mention this in the docs (cc @stevhliu)
I tested this tokenize function and indeed noticed a casting. However, it seems to only concern the `offset_mapping` field, which contains a list of tuples that is converted to a list of lists. Since `pyarrow` also supports tuples, we actually don't need to convert the tuples to lists.
I think this can be changed here:
https://github.com/huggingface/datasets/blob/ede72d3f9796339701ec59899c7c31d2427046fb/src/datasets/features/features.py#L382-L383
```diff
- if isinstance(obj, list):
+ if isinstance(obj, (list, tuple)):
```
and here:
https://github.com/huggingface/datasets/blob/ede72d3f9796339701ec59899c7c31d2427046fb/src/datasets/features/features.py#L386-L387
```diff
- return obj if isinstance(obj, list) else [], isinstance(obj, tuple)
+ return obj, False
```
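As a quick sanity check that `pyarrow` accepts tuples wherever it accepts lists (a minimal illustration, separate from the patch itself):

```python
import pyarrow as pa

# Tuples are treated like lists during Python-to-Arrow conversion, so an
# offset_mapping value such as [(0, 0), (0, 4)] needs no list conversion.
arr = pa.array([[(0, 0), (0, 4)], [(0, 0)]])
print(arr.type)  # list<item: list<item: int64>>
```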
@srobertjames can you try applying these changes and let us know if it helps? If so, feel free to open a Pull Request to contribute this improvement if you want :)
Wow, adding `return_tensors="np"` sped up my example by a factor of 17x and completely eliminated the casting! I'd recommend not only documenting it, but making it the default.
The code at https://github.com/huggingface/notebooks/blob/main/examples/question_answering.ipynb does not specify `return_tensors="np"` and yet avoids the casting penalty. How does it do that? (The notebook does seem to use `return_overflowing_tokens=True, return_offsets_mapping=True`.)
Also, surprisingly enough, using `return_tensors="pt"` (which is my eventual application) yields this error:

```
TypeError: Provided `function` which is applied to all elements of table returns a `dict` of types
[<class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>, <class 'torch.Tensor'>].
When using `batched=True`, make sure provided `function` returns a `dict` of types like
`(<class 'list'>, <class 'numpy.ndarray'>)`.
```
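For my eventual PyTorch use case, a workaround that appears to apply: keep the `map` output in NumPy and convert on access with `with_format` (a sketch, assuming the `return_tensors='np'` version of `tokenize` above):

```python
# Map with NumPy output (fast path, no casting), then let the dataset
# hand back torch tensors at access time instead of inside `map`.
tokenized = squad_tiny.map(tokenize, batched=True,
                           remove_columns=squad_tiny.column_names)
tokenized = tokenized.with_format('torch')
print(type(tokenized[0]['input_ids']))  # <class 'torch.Tensor'>
```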
Setting the output to `"np"` makes the whole pipeline fast because it moves the data buffers from Rust to Python to Arrow using zero-copy, and also because it does eliminate the casting completely ;)
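For illustration, the zero-copy path in isolation (a minimal sketch, not code from `datasets` itself):

```python
import numpy as np
import pyarrow as pa

ids = np.arange(160, dtype=np.int64)
# 1-D numeric NumPy arrays without nulls are typically wrapped by Arrow
# without copying the underlying buffer, and no per-value Python cast runs.
arr = pa.array(ids)
```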
Have you had a chance to try eliminating the tuple casting using the trick above?
@lhoestq I just benchmarked the two edits to `features.py` above, and they appear to solve the problem, bringing my original example to within 20% of the speed of the `return_tensors="np"` example. Nice!
For a pull request, do you suggest simply following https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md?
Cool! Sure, feel free to follow these instructions to open a PR :) Thanks!
#take
Resolved via https://github.com/huggingface/datasets/pull/4993.