AssertionError: Sentences lengths should not exceed max_tokens=X
❓ Questions and Help
Before asking:
- search the issues.
- search the docs.
What is your question?
I'm fine-tuning a pre-trained model, and the AssertionError above is raised every time validation starts.
I set `max_tokens: 300000` and tried other configurations as well, but to no avail.
By debugging I found that the invalid-size inputs in the validation set are not being skipped, even though `skip_invalid_size_inputs_valid_test` is set to true: for example, with `max_tokens` set to 300000, the validation set still contains a sample of length 4,751,360.
So, how can I skip these over-long samples in the validation set?
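For context, here is a minimal sketch of how size-based filtering should drop over-long samples before batching (illustrative code, not fairseq's actual internals; the names and values are made up). It also shows why the flag can appear to do nothing: if the resolved `max_positions` limit is effectively unbounded, no sample is ever filtered, and the `max_tokens` assertion fires later during batching.

```python
# Illustrative sketch, not fairseq's real implementation: size-based
# filtering keeps only samples whose length fits under max_positions.
import sys

def filter_indices_by_size(indices, sizes, max_positions):
    kept, ignored = [], []
    for idx in indices:
        (kept if sizes[idx] <= max_positions else ignored).append(idx)
    return kept, ignored

sizes = {0: 250_000, 1: 4_751_360}  # sample 1 is the offending one

# With an unbounded limit nothing is filtered, so sample 1 survives
# and later violates max_tokens=300000 during batching:
print(filter_indices_by_size([0, 1], sizes, sys.maxsize))  # ([0, 1], [])

# With the limit capped at max_tokens, sample 1 would be skipped:
print(filter_indices_by_size([0, 1], sizes, 300_000))      # ([0], [1])
```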
Code
I'm fine-tuning my wav2vec 2.0 model following the README. This is the config file I'm modifying:
```yaml
common:
  fp16: true
  log_format: json
  log_interval: 200

checkpoint:
  save_interval: 50
  save_interval_updates: 1000
  keep_interval_updates: 1
  no_epoch_checkpoints: true
  best_checkpoint_metric: wer

task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr

dataset:
  num_workers: 6
  max_tokens: 300000
  skip_invalid_size_inputs_valid_test: true
  validate_after_updates: 1000
  validate_interval: 50
  valid_subset: valid

distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 2

criterion:
  _name: ctc
  zero_infinity: true

optimization:
  max_update: 35000
  lr: [0.00005]
  sentence_avg: true
  update_freq: [4]

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-08

lr_scheduler:
  _name: tri_stage
  phase_ratio: [0.1, 0.4, 0.5]
  final_lr_scale: 0.05

model:
  _name: wav2vec_ctc
  w2v_path: ???
  apply_mask: true
  mask_prob: 0.65
  mask_channel_prob: 0.5
  mask_channel_length: 64
  layerdrop: 0.05
  activation_dropout: 0.1
  feature_grad_mult: 0.0
  freeze_finetune_updates: 10000
```
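As a data-side workaround, the validation manifest can be scanned for over-long clips before training. A rough sketch, assuming the standard wav2vec manifest layout (first line is the root directory, then one `path<TAB>num_frames` entry per sample); the file name and threshold are illustrative:

```python
# Hypothetical pre-flight check (not part of fairseq): list validation
# samples whose frame count exceeds max_tokens, so they can be removed
# from valid.tsv before fine-tuning.
MAX_TOKENS = 300_000  # must match dataset.max_tokens in the config

def over_long_samples(manifest_path, max_tokens=MAX_TOKENS):
    with open(manifest_path) as f:
        f.readline()  # first line of a wav2vec manifest is the root dir
        for line in f:
            path, n_frames = line.rstrip("\n").split("\t")
            if int(n_frames) > max_tokens:
                yield path, int(n_frames)

for path, n_frames in over_long_samples("valid.tsv"):
    print(f"{path}\t{n_frames} frames > {MAX_TOKENS}")
```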
What have you tried?
What's your environment?
- fairseq Version (e.g., 1.0 or main): 0.12.2
- PyTorch Version (e.g., 1.0): 1.12.1
- OS (e.g., Linux): Linux
- How you installed fairseq (pip, source): pip install --editable ./
- Build command you used (if compiling from source):
- Python version: 3.8
- CUDA/cuDNN version: 11.3
- GPU models and configuration: 1xTesla T4
- Any other relevant information:
Is there any progress on this question?
Here is the PR I sent to address this: https://github.com/facebookresearch/fairseq/pull/4800
If it is merged, validation samples that have more tokens than the configured `max_tokens` will be discarded.
Hi~
If you just want to set `max_tokens` and filter invalid validation-set samples, you can modify `trainer.py` line 749
from:

```python
max_positions=utils.resolve_max_positions(
    self.task.max_positions(),
    self.model.max_positions(),
),
```

to:

```python
max_positions=utils.resolve_max_positions(
    self.task.max_positions(),
    self.model.max_positions(),
    self.cfg.dataset.max_tokens,
),
```
These are the same settings used for the train batch iterator (line 716).
Hope this helps!
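Why this works: `utils.resolve_max_positions` picks the most restrictive of the limits it is given, so passing `cfg.dataset.max_tokens` as an extra argument caps `max_positions` at `max_tokens`, and the validation iterator can then filter the over-long sample. A simplified sketch (ints only; the real helper also handles tuples and `None`, and the example limits below are made up for illustration):

```python
# Simplified sketch of utils.resolve_max_positions: take the minimum
# over all non-None limits (the real helper also handles tuples).
def resolve_max_positions(*limits):
    limits = [l for l in limits if l is not None]
    return min(limits) if limits else None

task_max = None       # hypothetical: the audio task imposes no limit
model_max = 10 ** 15  # hypothetical, effectively unbounded model limit

# Without max_tokens the resolved limit is huge, so nothing is filtered:
print(resolve_max_positions(task_max, model_max))            # 10**15

# With max_tokens added, the 4,751,360-frame sample exceeds the limit
# and is skipped (given skip_invalid_size_inputs_valid_test: true):
print(resolve_max_positions(task_max, model_max, 300_000))   # 300000
```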