transformers
[ASR Examples] Only filter training samples by audio length criterion
What does this PR do?
We should only filter training samples by our audio length criterion when fine-tuning ASR systems. The eval and test sets should not be filtered. Doing so leads to partial datasets, and thus invalid results.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@patrickvonplaten @anton-l
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
I'm reluctant to add a cut-off for eval: in filtering the dataset by an audio length criterion, one removes the longest samples in the eval set. Evaluating on a partial eval set yields WER metrics that are not valid for academic purposes. We need the full eval set in order to compare WER figures between models. You can't compare a WER for a partial subset A with the WER for a different partial subset B!
Take the extreme example where you have a model that performs well on short samples <= 5s, and terribly on longer samples > 5s. Filtering with a cut-off of 5s would hide the fact that the model breaks down for longer samples. You couldn't then compare this WER metric with another model where you've not filtered the eval samples, and thus included all the samples (including the 'harder' ones > 5s).
IMO, if not filtering eval samples poses an OOM threat, it has to be addressed by a reduction in batch size (chosen by the user depending on their model/hardware/dataset). WDYT @anton-l @patrickvonplaten?
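To illustrate the point, here's a minimal sketch of filtering only the training split while leaving eval/test untouched. The function and argument names (`filter_train_only`, `min_duration`, `max_duration`, the `"duration"` key) are illustrative, not the script's actual API:

```python
def in_length_range(duration_s, min_duration=1.0, max_duration=20.0):
    """Keep a sample only if its duration (seconds) is inside the allowed range."""
    return min_duration <= duration_s <= max_duration

def filter_train_only(splits, min_duration=1.0, max_duration=20.0):
    """`splits` maps split name -> list of sample dicts with a 'duration' key.

    Only the 'train' split is filtered by audio length; 'validation' and
    'test' are passed through untouched so WER stays comparable across models.
    """
    out = {}
    for name, samples in splits.items():
        if name == "train":
            out[name] = [
                s for s in samples
                if in_length_range(s["duration"], min_duration, max_duration)
            ]
        else:
            out[name] = samples  # never drop eval/test samples
    return out
```

With 🤗 Datasets, the same idea would be a `.filter(...)` applied only to the train split, not to the `DatasetDict` as a whole.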
More or less agree here - just in terms of backwards compatibility, I think this might lead to a couple of unexpected errors. Max length can prevent OOM, and min length can prevent weird "Conv layer doesn't have enough input values" errors.
So I'd maybe prefer to instead add a `test_min_duration_in_seconds` and `test_max_duration_in_seconds` here, let them default to `min_duration_in_seconds` and `max_duration_in_seconds` (e.g. set to None and if None, set to `min_duration_in_seconds`), and add a warning that this will be changed in the next versions.
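The fallback-plus-warning behaviour being proposed could look roughly like this. The helper and its signature are hypothetical; only the argument names mirror the ones suggested above:

```python
import warnings

def resolve_test_durations(min_duration, max_duration,
                           test_min_duration=None, test_max_duration=None):
    """Resolve the eval/test duration bounds (hypothetical helper).

    If the test-specific bounds are unset (None), fall back to the training
    bounds for backwards compatibility, and warn that this default will
    change in a future version.
    """
    if test_min_duration is None:
        warnings.warn(
            "test_min_duration_in_seconds defaults to min_duration_in_seconds; "
            "this behaviour will change in a future version.",
            FutureWarning,
        )
        test_min_duration = min_duration
    if test_max_duration is None:
        warnings.warn(
            "test_max_duration_in_seconds defaults to max_duration_in_seconds; "
            "this behaviour will change in a future version.",
            FutureWarning,
        )
        test_max_duration = max_duration
    return test_min_duration, test_max_duration
```

Users who want the full, unfiltered eval set would then pass the test-specific bounds explicitly (e.g. set them wide enough to keep every sample), while existing invocations keep today's behaviour.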
This fine-tuning script is actually used quite a bit, and usually the cutting doesn't make a huge difference (also note that in the examples, we're not looking for the "perfect" script, but rather an example of how CTC training could be done).
But overall ok with the PR, just think we should make it more backwards compatible
@sanchit-gandhi, let me know once you want to merge this :-)
Failing test unrelated - rebasing should fix this. Will request a review when the CI is green 🟢
Gently ping here @sanchit-gandhi :-)
Thanks for the ping, will address in the coming days!
Think this is worth fixing still, shouldn't take too long
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.