
How to pretrain T5

Open TristanThrush opened this issue 2 years ago • 0 comments

Hi there, I'm wondering if there is an example of how to use this repo to pretrain T5?

I saw this file and thought it could serve as a starting point for an example. But when I run it, I get this error:

(benchmarking) tristan_huggingface_co@tristan-olm-training-a100-80:~/oslo/tests/transformers/models/mt5$ python test_training.py
Downloading builder script: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.8k/28.8k [00:00<00:00, 351kB/s]
Downloading metadata: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 9.87MB/s]
Downloading and preparing dataset glue/sst2 (download: 7.09 MiB, generated: 4.81 MiB, post-processed: Unknown size, total: 11.90 MiB) to /home/tristan_huggingface_co/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad...
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.44M/7.44M [00:01<00:00, 5.51MB/s]
Dataset glue downloaded and prepared to /home/tristan_huggingface_co/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 698.12it/s]
  0%|                                                                                                                                                                                          | 0/68 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "test_training.py", line 60, in <module>
    processed_dataset = dataset.map(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/dataset_dict.py", line 771, in map
    {
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/dataset_dict.py", line 772, in <dictcomp>
    k: dataset.map(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2449, in map
    return self._map_single(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 577, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 544, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2849, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2729, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/tristan_huggingface_co/anaconda3/envs/benchmarking/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2409, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "/home/tristan_huggingface_co/oslo/oslo/transformers/tasks/data_t5_pretraining.py", line 57, in __call__
    list_of_input_ids: List[List[int]] = self._tokenizer(
TypeError: 'str' object is not callable
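
The last frame is consistent with `self._tokenizer` holding a plain string (for example a model name) rather than a tokenizer object at the point of the call, though this has not been checked against the oslo source. A minimal sketch of that failure mode follows. `ToyProcessor`, `whitespace_tokenizer`, and the `"t5-small"` string are hypothetical stand-ins for illustration, not oslo's or transformers' actual code:

```python
from typing import Callable, Dict, List, Union

Tokenizer = Callable[[List[str]], List[List[int]]]


class ToyProcessor:
    """Hypothetical stand-in for a preprocessing class; not oslo's code."""

    def __init__(self, tokenizer: Union[str, Tokenizer]):
        # Stored as-is, with no check that the argument is callable.
        self._tokenizer = tokenizer

    def __call__(self, batch: Dict[str, List[str]]) -> List[List[int]]:
        # Raises TypeError here if self._tokenizer is a string.
        return self._tokenizer(batch["sentence"])


def whitespace_tokenizer(texts: List[str]) -> List[List[int]]:
    """Toy tokenizer: maps each word to its length, just to return integers."""
    return [[len(word) for word in text.split()] for text in texts]


batch = {"sentence": ["a short example"]}

# Passing a name (assumed here to be "t5-small") instead of an object
# reproduces the same error message as in the traceback.
try:
    ToyProcessor("t5-small")(batch)
except TypeError as err:
    error_message = str(err)

# Passing a callable works as expected.
token_ids = ToyProcessor(whitespace_tokenizer)(batch)
```

If this is what is happening, constructing the tokenizer object first and passing that object in, rather than its name, would be the thing to try.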

Separately, I had to downgrade my version of datasets to get this far.

Thanks for any help that anyone can give! TL;DR: I'm wondering if there is a working example of T5 pretraining.

TristanThrush, Dec 30 '22 03:12