transformers
Datasets in run_translation.py
System Info
Hello there! 👋 I'm following along with the run_translation.py example. Thanks for making it! It builds nicely on the translation docs tutorial.
Context
I managed to configure the flags for training. When launching from the CLI:
python train_model.py --model_name_or_path '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/models/t5-small' --output_dir '/en-ru-model' --dataset_name '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/data/opus_books' --dataset_config_name en-ru --do_train --source_lang en --target_lang ru --num_train_epochs 1 --overwrite_output_dir
the following error appears:
raise TypeError("Dataset argument should be a datasets.Dataset!")
TypeError: Dataset argument should be a datasets.Dataset!
Then I read a forum recommendation, tried commenting out the tf_eval_dataset creation, and launched training. The model trained without the eval_dataset.
When I passed the --do_eval flag, it raised the error flagged here.
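Rather than commenting the tf_eval_dataset creation out, a more robust workaround might be to guard eval-dataset creation on the split actually existing. This is only a sketch, not the script's actual code; the helper name and the error message are my own:

```python
def maybe_build_eval_dataset(raw_datasets, do_eval):
    """Guard eval-dataset creation instead of commenting it out.

    raw_datasets is a split-name -> examples mapping (like a DatasetDict).
    Returns None when evaluation was not requested.
    """
    if not do_eval:
        return None
    if "validation" not in raw_datasets:
        # Fail early with a clear message instead of a cryptic TypeError later
        raise ValueError("--do_eval was passed but the dataset has no 'validation' split")
    return raw_datasets["validation"]

data = {"train": [1, 2, 3]}  # opus_books en-ru ships only a "train" split
print(maybe_build_eval_dataset(data, do_eval=False))  # None
```

With this shape, training without --do_eval proceeds normally, and passing --do_eval on a train-only dataset fails with an actionable message instead of a TypeError deep inside the data pipeline.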
I downloaded the opus_books dataset and saw in its README.md that it doesn't have a validation split:
- config_name: en-ru
features:
- name: id
dtype: string
- name: translation
dtype:
translation:
languages:
- en
- ru
splits:
- name: train
num_bytes: 5190880
num_examples: 15496
download_size: 1613419
dataset_size: 5190880
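Given a train-only dataset card like this, one general fix is to carve a validation split out of the train split; with the `datasets` library that would be `raw_datasets["train"].train_test_split(test_size=0.1)`. A minimal plain-Python sketch of the same idea (the function name and the toy rows are mine, just shaped like opus_books en-ru examples):

```python
import random

def make_validation_split(examples, test_size=0.1, seed=42):
    """Shuffle and carve a validation set out of a train-only dataset.

    Mirrors what Dataset.train_test_split(test_size=0.1) does in the
    datasets library, which is the usual fix when a Hub dataset ships
    only a "train" split.
    """
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_val = max(1, int(len(examples) * test_size))
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    validation = [ex for i, ex in enumerate(examples) if i in val_idx]
    return {"train": train, "validation": validation}

# toy rows shaped like opus_books en-ru examples
rows = [{"id": str(i), "translation": {"en": f"sentence {i}", "ru": f"предложение {i}"}}
        for i in range(20)]
splits = make_validation_split(rows, test_size=0.1)
print(len(splits["train"]), len(splits["validation"]))  # 18 2
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing eval metrics.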
Issue 1. Reproducibility coming from the tutorial
- Can you please confirm that this example runs straightforwardly with WMT19, and that I might avoid this issue by using that dataset instead of opus_books?
- Would you be willing to accept a PR adding a comment in the example, either pointing to the README table or making it more explicit that this example targets a specific dataset, with a link to it? Is there a way you think I could help users coming from the docs tutorial to the script example?
Am I missing something? I think it's dataset related, but I'm not sure anymore...
Issue 2. Broken link
Found a broken link; if you're OK with it, I'll fix it with this:
Dependencies
transformers==4.31.0.dev0
tensorflow-macos==2.10.0
Tangential and mental model
I'm actually following this script, which is a copy that came recommended in #24254. Please let me know if something has changed: looking at the history, the last commit seems to be from Jun 7 and mine is from Jun 13. I grouped the broken link with the dataset issue as they might be covered by one reproducibility PR, but let me know if you prefer them filed separately.
Thanks so so much for your help 🙏 & thanks for the library!
Who can help?
No response
Information
- [X] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
- run the script
- download opus books dataset
- config flags
- run script with and without eval_dataset logic
Expected behavior
- A pointer to the expected dataset, either linked in the README.md or noted in a script comment?
- A working link to replace the broken one
Tagging @sgugger
cc @Rocketknight1 since this is a TensorFlow example.
Hey @Rocketknight1 👋 I think we crossed paths in #24341. Thanks for the Notebook repository discovery, it was a nice quick fix!
I've given another try to some of the thoughts posted in this issue.
- Regarding reproducibility: I read the script again and tried to step back and analyze it as an isolated example from the library. The script is well structured and the documentation comments are well suited; it generalizes really well. Hard-coding the dataset name here wouldn't really work, and if the dataset associated with the example ever changed, it would require another edit. At this point maybe a small sentence recommending a read through the README.md would help, so the example remains general/scalable across datasets? But that's minor in retrospect. Does that make sense to you?
- I sent a PR to fix the broken link.
Thanks for the script structure and the guidance! 🙏
Comments on Issue 1
Currently the run_translation.py script works well with the wmt16 dataset, as it provides train, validation, and test splits.
I'm closing this issue, as the dataset for running the script has been found and the broken link was fixed through PR #24594.