
Datasets in run_translation.py

Open SoyGema opened this issue 1 year ago • 2 comments

System Info

Hello there! 👋 I'm following along with the run_translation.py example. Thanks for making it! It's a great extension of the translation docs tutorial.

Context

I managed to configure the flags for training. When launching from the CLI with

python train_model.py --model_name_or_path '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/models/t5-small' --output_dir '/en-ru-model' --dataset_name '/Users/.../The-Lord-of-The-Words-The-two-frameworks/src/data/opus_books' --dataset_config_name en-ru --do_train --source_lang en --target_lang ru --num_train_epochs 1 --overwrite_output_dir

the following error appears

raise TypeError("Dataset argument should be a datasets.Dataset!")
TypeError: Dataset argument should be a datasets.Dataset!

Then, following a forum recommendation, I commented out the tf_eval_dataset creation and launched training again. The model trained without the eval_dataset. When I passed the --do_eval flag, it raised the error flagged here.
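The workaround of commenting out the eval-dataset creation could also be expressed as a guard. A minimal sketch with hypothetical variable names (not the script's exact code), using plain dicts in place of the loaded dataset:

```python
# Hypothetical sketch: only build the eval dataset when a "validation"
# split actually exists, instead of crashing on datasets that lack one.
raw_datasets = {"train": ["..."]}  # opus_books en-ru style: no "validation" key

eval_dataset = raw_datasets.get("validation")
if eval_dataset is None:
    # Skip evaluation rather than raising a TypeError downstream.
    print("No validation split found; skipping evaluation.")
```

This only sidesteps the error; providing a real validation split is the proper fix.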

I downloaded the opus_books dataset and saw in its README.md that it doesn't have a validation split:

- config_name: en-ru
  features:
  - name: id
    dtype: string
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ru
  splits:
  - name: train
    num_bytes: 5190880
    num_examples: 15496
  download_size: 1613419
  dataset_size: 5190880

Issue 1. Reproducibility coming from tutorial

  • Can you please confirm that this example runs out of the box with WMT19, and that I wouldn't hit this issue if I used that dataset instead of opus_books?

  • Would you be willing to accept a PR adding a comment to the example, either pointing to the README table or making it more explicit that this example targets a specific dataset, with a link near here? Is there any way you think I could help users coming from the docs tutorial to the script example?

Am I missing something? I think it's dataset-related, but I'm not sure anymore...

Issue 2. Broken link

I found a broken link; if you're OK with it, I'll fix it with this.

Dependencies

transformers==4.31.0.dev0
tensorflow-macos==2.10.0

Tangential and mental model

I'm actually following this script, which is a copy that came recommended in #24254. Please let me know if something has changed; looking at the history, the last commit seems to be from Jun 7 and mine is from Jun 13. I grouped the broken link with the dataset question in one issue, since they might both be addressed in a single reproducibility PR, but let me know if you prefer them separately.

Thanks so so much for your help 🙏 & thanks for the library!

Who can help?

No response

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

  1. run the script
  2. download opus books dataset
  3. config flags
  4. run script with and without eval_dataset logic

Expected behavior

  • The expected dataset made explicit, either with a link in the README.md or as a comment in the script
  • A correct link for

Tagging @sgugger

SoyGema avatar Jun 29 '23 14:06 SoyGema

cc @Rocketknight1 since this is a TensorFlow example.

sgugger avatar Jun 29 '23 15:06 sgugger

Hey @Rocketknight1 👋 I think we crossed paths in #24341. Thanks for the Notebook repository discovery. It was a nice quick fix!

I've given another pass to some of the thoughts posted in this issue.

  • Regarding reproducibility: I read the script again and tried to step back and analyze it as a standalone example from the library. The script is well structured and its documentation comments are well suited; it generalizes really well. Hard-coding a dataset name here wouldn't really work, and if the dataset associated with the example changed, it would require another edit. At this point, maybe a small sentence recommending a look at the README.md would be enough, so the example stays general across datasets? Minor in retrospect. Does that make sense to you?
  • I sent a PR to fix the broken link

Thanks for the script structure and the guidance! 🙏

SoyGema avatar Jun 30 '23 12:06 SoyGema

Comments on Issue 1

Currently the run_translation.py script works well with the wmt16 dataset, as it provides train, validation and test splits.

I'm closing this issue, as a dataset for running the script has been found, and the broken link was fixed in PR #24594.

SoyGema avatar Jul 03 '23 17:07 SoyGema