IPED Create an internal audio->transcription data set to evaluate transcription models

Open lfcnassif opened this issue 2 years ago • 2 comments

This is important for evaluating different transcription algorithms/models on a data set that we know for sure was not used for training. Many public data sets were used to train many of the models evaluated in #1214, and the data sets used by the Microsoft and Google implementations are unknown.

Sep 11 '22 19:09 lfcnassif

Guys, do you think that a dataset created specifically for checking how well audio transcription models perform in the task of understanding the "custom" audio most commonly used by criminals would be enough? By this, I mean that we can start with a small dataset. Best practices recommend that it should be 10% of the size of the training dataset. In our case it seems that the larger the better, but maybe we can start with a small goal. Any ideas about what the minimum size should be?

Could we start collecting files and storing them like this:

Hash-of-Audio.audio-extension (original file, might be any extension)
Hash-of-Audio.txt (UTF-8 file with human-verified transcription)

Another requirement would be to restrict access to the dataset to investigators only, since voice patterns might be disclosed.
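The naming scheme proposed above could be sketched roughly as follows. This is only an illustration: the hash algorithm is not specified in the proposal, so SHA-256 is an assumption here, and `file_hash` and `add_sample` are hypothetical helper names, not part of IPED.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Tuple


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large audio files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_sample(audio: Path, transcription: str, dataset_dir: Path) -> Tuple[Path, Path]:
    """Store <hash><original extension> and <hash>.txt (UTF-8) in dataset_dir."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    name = file_hash(audio)
    audio_dst = dataset_dir / f"{name}{audio.suffix}"  # keep the original extension
    text_dst = dataset_dir / f"{name}.txt"
    shutil.copyfile(audio, audio_dst)
    text_dst.write_text(transcription.strip() + "\n", encoding="utf-8")
    return audio_dst, text_dst
```

Naming by content hash means the same audio submitted twice lands on the same file names, so duplicates are easy to spot. Access restriction would still have to be handled by wherever the directory is hosted.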

Jul 20 '23 14:07 leosol

Guys, do you think that a dataset created specifically for checking how well audio transcription models perform in the task of understanding the "custom" audio most commonly used by criminals would be enough?

Sure! The goal here isn't to create a training set or a cross-validation set, which are usually large, but just a small test set to evaluate models trained on other sets. I think variability in this test set is much more important than size; maybe 1 hour would be a good starting duration.

But thinking about variability, maybe audios just from the Federal District Police wouldn't represent the variability in our country. So I'll try to get in touch with the guys in my agency who manage the system that stores transcriptions made by officers, to see if I can get some samples from different investigations and states. If you can help with this, 10 min transcribed by your agency, from different investigations and with different speakers, would help a lot!
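As a rough idea of how such a small test set could be used, here is a minimal sketch that scores a model over a directory of `Hash-of-Audio.*` / `Hash-of-Audio.txt` pairs. It assumes word error rate (WER) as the metric, which the thread does not specify, and `transcribe` is a hypothetical callable standing in for whatever model is being evaluated. Text normalization is deliberately naive (lowercase plus whitespace split).

```python
from pathlib import Path
from typing import Callable, List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> List[str]:
    """Naive normalization: lowercase and split on whitespace."""
    return text.lower().split()


def dataset_wer(dataset_dir: Path, transcribe: Callable[[Path], str]) -> float:
    """Aggregate WER: total word errors divided by total reference words."""
    errors = words = 0
    for txt in sorted(dataset_dir.glob("*.txt")):
        # The audio file shares the hash stem but has some other extension.
        audio = [p for p in dataset_dir.glob(txt.stem + ".*") if p.suffix != ".txt"]
        if not audio:
            continue  # transcription without a matching audio file
        ref = normalize(txt.read_text(encoding="utf-8"))
        hyp = normalize(transcribe(audio[0]))
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return errors / words if words else 0.0
```

Aggregating errors over the whole set, rather than averaging per-file WER, keeps very short audios from dominating the score. Reporting per-state or per-investigation figures alongside it would fit the variability concern.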

Jul 20 '23 20:07 lfcnassif
