peoples-speech
Create two datasets to distribute
Now that we support creating tar files of our data, we want to split the files into two datasets by license:
- CC-BY-SA files
- CC-BY and public domain files
We can do this by reusing the logic in https://github.com/mlcommons/peoples-speech/blob/main/galvasr2/dump_cc_by_licenses.py, specifically this regex: https://github.com/mlcommons/peoples-speech/blob/12b5b79d2919b0cdac5367dc60eae0b268b7adf8/galvasr2/dump_cc_by_licenses.py#L79
This is to support users who cannot use CC-BY-SA data.
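
A rough sketch of what the split could look like, assuming each file comes with a Creative Commons license URL. The pattern `LICENSE_RE`, the helpers `classify_license` and `split_by_license`, and the bucket names are all hypothetical and written for illustration; this is not the regex from `dump_cc_by_licenses.py`, which should be reused in the real implementation.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical pattern for illustration only. It is NOT the regex from
# galvasr2/dump_cc_by_licenses.py; the real change should reuse that one.
LICENSE_RE = re.compile(
    r"^https?://creativecommons\.org/"
    r"(?:licenses/(?P<cc>by-sa|by)/\d+\.\d+"
    r"|publicdomain/(?P<pd>zero|mark)/\d+\.\d+)",
    re.IGNORECASE,
)


def classify_license(url: str) -> Optional[str]:
    """Return the dataset bucket for a license URL, or None if unsupported."""
    match = LICENSE_RE.match(url.strip())
    if match is None:
        # Anything else (e.g. by-nc, by-nd) is left out of both datasets.
        return None
    cc = match.group("cc")
    if cc is not None and cc.lower() == "by-sa":
        return "cc_by_sa"
    # Plain CC-BY and public domain (CC0 or Public Domain Mark) share a bucket.
    return "cc_by_and_public_domain"


def split_by_license(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group (identifier, license_url) pairs into the two dataset buckets."""
    buckets: Dict[str, List[str]] = {
        "cc_by_sa": [],
        "cc_by_and_public_domain": [],
    }
    for identifier, license_url in records:
        bucket = classify_license(license_url)
        if bucket is not None:
            buckets[bucket].append(identifier)
    return buckets
```

Each bucket's identifier list could then be passed to the existing tar-creation step, producing one archive set per dataset.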