
AssertionError: Sentences lengths should not exceed max_tokens=X

Open · shuvohishab opened this issue 2 years ago · 2 comments

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I'm fine-tuning a pre-trained model, and the AssertionError occurs every time validation runs. I set max_tokens: 300000 (along with the other options below), but to no avail. Debugging showed that the oversized inputs in the validation set are not being skipped, even though I kept skip_invalid_size_inputs_valid_test set to true: with max_tokens at 300000, a validation sample of length 4751360 is still being batched.

So, how can I skip these oversized samples in the validation set?
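One stopgap that sidesteps the assertion entirely is to pre-filter the validation manifest offline. Below is a minimal sketch, assuming the standard wav2vec 2.0 manifest layout (valid.tsv with the data root on the first line and relative_path<TAB>num_frames per sample, plus a line-aligned valid.ltr label file); the file names and threshold here are placeholders:

MAX_FRAMES = 300_000  # keep in sync with dataset.max_tokens

# valid.tsv: data root on line 1, then "relative_path<TAB>num_frames" per sample
with open("valid.tsv") as f:
    root, *rows = f.read().splitlines()
# valid.ltr: one label line per tsv row, in the same order
with open("valid.ltr") as f:
    labels = f.read().splitlines()

kept_rows, kept_labels = [], []
for row, label in zip(rows, labels):
    _path, n_frames = row.rsplit("\t", 1)
    if int(n_frames) <= MAX_FRAMES:  # drop oversized samples
        kept_rows.append(row)
        kept_labels.append(label)

with open("valid.tsv", "w") as f:
    f.write("\n".join([root] + kept_rows) + "\n")
with open("valid.ltr", "w") as f:
    f.write("\n".join(kept_labels) + "\n")

print(f"kept {len(kept_rows)} of {len(rows)} validation samples")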

Code

I'm fine-tuning my wav2vec 2.0 model following the README. This is the config file I'm modifying:

common:
  fp16: true
  log_format: json
  log_interval: 200

checkpoint:
  save_interval: 50
  save_interval_updates: 1000
  keep_interval_updates: 1
  no_epoch_checkpoints: true
  best_checkpoint_metric: wer

task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr

dataset:
  num_workers: 6
  max_tokens: 300000
  skip_invalid_size_inputs_valid_test: true
  validate_after_updates: 1000
  validate_interval: 50
  valid_subset: valid

distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 2

criterion:
  _name: ctc
  zero_infinity: true

optimization:
  max_update: 35000
  lr: [0.00005]
  sentence_avg: true
  update_freq: [4]

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-08

lr_scheduler:
  _name: tri_stage
  phase_ratio: [0.1, 0.4, 0.5]
  final_lr_scale: 0.05

model:
  _name: wav2vec_ctc
  w2v_path: ???
  apply_mask: true
  mask_prob: 0.65
  mask_channel_prob: 0.5
  mask_channel_length: 64
  layerdrop: 0.05
  activation_dropout: 0.1
  feature_grad_mult: 0.0
  freeze_finetune_updates: 10000
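For reference, a config like this is launched with fairseq-hydra-train, as in the wav2vec 2.0 fine-tuning README; the paths and config name below are placeholders:

fairseq-hydra-train \
    task.data=/path/to/manifest/dir \
    model.w2v_path=/path/to/pretrained/wav2vec_small.pt \
    --config-dir /path/to/config/dir \
    --config-name my_finetune_config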

What have you tried?

What's your environment?

  • fairseq Version (e.g., 1.0 or main): 0.12.2
  • PyTorch Version (e.g., 1.0): 1.12.1
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): pip install --editable ./
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: 1xTesla T4
  • Any other relevant information:

shuvohishab · Oct 05 '22

Is there any progress on this question?

AlexNLP · Nov 02 '22

Here is the PR I sent to fix this: https://github.com/facebookresearch/fairseq/pull/4800. If it can be merged, validation samples with more tokens than the configured max_tokens will be discarded.

shuvohishab · Nov 02 '22

Hi~

If you just want to set max_tokens and filter out the invalid validation samples, you can modify trainer.py line 749 from:

max_positions=utils.resolve_max_positions(
    self.task.max_positions(),
    self.model.max_positions(),
),

to:

max_positions=utils.resolve_max_positions(
    self.task.max_positions(),
    self.model.max_positions(),
    self.cfg.dataset.max_tokens,
),

These are the same settings already used for the train batch iterator (line 716).
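To see why this works: utils.resolve_max_positions keeps the smallest non-None constraint it is given, so adding max_tokens to the list caps the allowed sample length. My understanding (worth verifying) is that the audio task and wav2vec model report effectively unbounded max positions, which is why nothing was being filtered before. A small illustration with made-up values:

from fairseq import utils

# resolve_max_positions returns the minimum of all non-None constraints,
# so the added max_tokens becomes the effective length limit here.
print(utils.resolve_max_positions(None, 4_800_000, 300_000))  # -> 300000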

Hope this helps!

yuzijiano · Jan 03 '23