Amount of Clean Non-Overlapped data

Open picheny-nyu opened this issue 3 years ago • 2 comments

It looks like the amount of non-overlapped data is much smaller than the overall corpus. I am seeing less than 20 hours. Is this correct?

Thanks Michael Picheny

picheny-nyu, Apr 09 '22 21:04

Thank you for your interest. Maybe our overlap calculation methods are different? Our method is: overlap length / total speech length. For example, if there are 2 speakers, each speaks for 10 s, and they overlap for 5 s, the ratio is (5+5)/20.
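To make the possible mismatch concrete, here is a minimal sketch that computes both the per-speaker ratio described above and a wall-clock view of the same toy example. It is not taken from the AISHELL-4 tooling or from the linked repository. The segment format `(speaker, start_sec, end_sec)` and the function name `overlap_stats` are hypothetical, and the sketch assumes a speaker never overlaps with themselves.

```python
from typing import Dict, List, Tuple

# Hypothetical segment format: (speaker, start_sec, end_sec)
Segment = Tuple[str, float, float]


def overlap_stats(segments: List[Segment]) -> Dict[str, float]:
    """Sweep over segment boundaries and accumulate overlap statistics."""
    events = []
    for _, start, end in segments:
        events.append((start, 1))   # a speaker starts
        events.append((end, -1))    # a speaker stops
    # At equal timestamps, -1 sorts before +1, so segments that merely
    # touch are not counted as overlapping.
    events.sort()

    active = 0                       # number of speakers currently talking
    prev = None
    speech_time = 0.0                # wall-clock time with >= 1 speaker
    overlap_time = 0.0               # wall-clock time with >= 2 speakers
    overlapped_speaker_time = 0.0    # overlap counted once per active speaker
    for t, delta in events:
        if prev is not None and t > prev:
            dur = t - prev
            if active >= 1:
                speech_time += dur
            if active >= 2:
                overlap_time += dur
                overlapped_speaker_time += dur * active
        active += delta
        prev = t

    total_speaker_time = sum(end - start for _, start, end in segments)
    return {
        # Ratio as described in the comment above: (5+5)/20 in the example
        "per_speaker_ratio": overlapped_speaker_time / total_speaker_time,
        # Alternative: share of the speech timeline that is overlapped
        "wall_clock_ratio": overlap_time / speech_time,
        # Wall-clock seconds in which exactly one speaker is talking
        "non_overlapped_sec": speech_time - overlap_time,
    }


# Toy example from the comment: 2 speakers, 10 s each, 5 s of overlap
example = [("A", 0.0, 10.0), ("B", 5.0, 15.0)]
stats = overlap_stats(example)
print(stats)
```

On the toy example the per-speaker ratio is (5+5)/20, while the wall-clock ratio is 5/15 and the non-overlapped time is 10 s. Two tools that each report an "overlap ratio" can therefore disagree on the same annotations, which may account for part of the difference in hours.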

felixfuyihui, Apr 24 '22 08:04

I am using the methodology described in https://github.com/DanBerrebbi/AISHELL-4.git, which I thought was based on your original work. Is that not the case?

picheny-nyu, Apr 25 '22 18:04