peoples-speech
Things to do before NeurIPS
- [ ] Create two separate datasets to distribute, one CC-BY, one CC-BY-SA.
- [ ] Rerun YAMNet on the entire dataset. This means we need to make it more performant. See #40.
- [ ] Send data to be hand-transcribed.
- [ ] Optionally, do audio-based deduplication first.
- [ ] Add text data deduplication to the data creation pipeline.
- [ ] Train Kaldi and/or NeMo models on the dataset. Provide fixes to the dataset based on this work.
Adding more as time goes on...
Poster + 3 minute talk due: Oct 18th
Camera-ready paper due: November 6th
NeurIPS (dataset release): Early December