Make smaller subsets of the data

Open galv opened this issue 4 years ago • 0 comments

We would like to make smaller subsets of the data ourselves for the sake of downstream users.

The most obvious way to do this is to sort the aligned segments by the character error rate (CER) between the alignment model's output and the transcript. Segments with low CERs would go into a small-ish "clean" training set. This is what LibriSpeech did.
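
As a rough sketch of the CER-based selection (not an existing script in this repo): the function names, the `(segment_id, transcript, model_output)` tuple layout, and the `max_cer` threshold are all assumptions made for illustration. The transcript is treated as the reference when normalizing the edit distance.

```python
from typing import List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(transcript: str, model_output: str) -> float:
    """Edit distance divided by transcript length (transcript = reference)."""
    if not transcript:
        return 0.0 if not model_output else float("inf")
    return edit_distance(transcript, model_output) / len(transcript)


def select_clean(segments: List[Tuple[str, str, str]],
                 max_cer: float = 0.1) -> List[str]:
    """Return segment ids sorted by CER, keeping those at or below max_cer.

    Each segment is a hypothetical (segment_id, transcript, model_output) tuple.
    """
    scored = sorted((cer(t, h), seg_id) for seg_id, t, h in segments)
    return [seg_id for score, seg_id in scored if score <= max_cer]
```

The threshold is a placeholder; a fixed number of hours taken from the top of the sorted list would work just as well as a cutoff.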

Ideally, we would like to also guarantee some degree of "diversity" in the data. This means doing some sort of data deduplication.
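
One simple form this could take, purely as an assumption since the kind of deduplication is not decided here: collapse segments whose transcripts are identical after normalization, and keep the one with the lowest CER. The `(segment_id, transcript, cer)` layout and the normalization rules are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def dedup_by_transcript(segments: List[Tuple[str, str, float]]) -> List[str]:
    """Keep one segment id per normalized transcript: the lowest-CER one.

    Each segment is a hypothetical (segment_id, transcript, cer) tuple.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for seg_id, transcript, score in segments:
        key = normalize(transcript)
        if key not in best or score < best[key][0]:
            best[key] = (score, seg_id)
    return sorted(seg_id for _, seg_id in best.values())
```

This only catches exact repeats of the same text. Near-duplicate detection or balancing across sources would need something more involved.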

@keithachorn-intel

Jun 23 '21 06:06 galv