[ASR Examples] Only filter training samples by audio length criterion

Open sanchit-gandhi opened this issue 1 year ago • 9 comments

What does this PR do?

We should only filter training samples by our audio length criterion when fine-tuning ASR systems. The eval and test sets should not be filtered. Doing so leads to partial datasets, and thus in-valid results.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@patrickvonplaten @anton-l

Aug 25 '22 16:08 sanchit-gandhi

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Aug 25 '22 16:08 HuggingFaceDocBuilderDev

I'm reluctant to add a cut-off for eval: in filtering the dataset by an audio length criterion, one removes the longest samples in the eval set. Evaluating on a partial eval set yields WER metrics that are not valid for academic purposes. We need to have the full eval set in order to compare WER figures between models. You can't compare a WER for a partial sub-set A with the WER for a different partial sub-set B!

Take the extreme example where you have a model that performs well on short samples <= 5s, and terribly for longer samples > 5s. Filtering with a cut-off of 5s would hide the fact the model breaks down for longer samples. You couldn't then compare this WER metric with another model where you've not filtered the eval samples, and thus included all the samples (including the 'harder' ones > 5s).

IMO, if not filtering eval samples poses an OOM threat, it has to be addressed by a reduction in batch size (chosen by the user depending on their model/hardware/dataset). WDYT @anton-l @patrickvonplaten?

Aug 26 '22 10:08 sanchit-gandhi

More or less agree here - just in terms of backwards compatibility, I think this might lead to a couple of unexpected errors. Max length can prevent OOM and min length can prevent weird Conv layer doesn't have enough input values.

So I'd maybe prefer to instead add a test_min_duration_in_seconds and test_max_duration_in_seconds here, let it default to min_duration_in_seconds and max_duration_in_seconds (e.g. set to None and if None, set to min_duration_in_seconds) and add a warning that this will be changed in the next versions.

Aug 26 '22 11:08 patrickvonplaten

This fine-tuning script is actually used quite a bit and usually the cutting doesn't make a huge difference (also note in the examples, we're not looking for the "perfect" script, but rather an example of how ctc training could be done)

But overall ok with the PR, just think we should make it more backwards compatible

Aug 26 '22 11:08 patrickvonplaten

@sanchit-gandhi, let me know once you want to merge this :-)

Oct 10 '22 16:10 patrickvonplaten

Failing test unrelated - rebasing should fix this. Will request a review when the CI is green 🟢

Oct 10 '22 16:10 sanchit-gandhi

Gently ping here @sanchit-gandhi :-)

Nov 07 '22 20:11 patrickvonplaten

Thanks for the ping, will address in the coming days!

Dec 02 '22 16:12 sanchit-gandhi

Think this is worth fixing still, shouldn't take too long

Mar 01 '23 15:03 patrickvonplaten

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Jun 17 '23 15:06 github-actions[bot]

transformers transformers copied to clipboard

[ASR Examples] Only filter training samples by audio length criterion

What does this PR do?

Who can review?

transformers
transformers copied to clipboard