Create two datasets to distribute

Open galv opened this issue 4 years ago • 0 comments

Now that we support creating tar files of our data, we want to split the files into two datasets by license:

- CC-BY-SA files
- CC-BY and public domain files

We can do this by reusing the logic in https://github.com/mlcommons/peoples-speech/blob/main/galvasr2/dump_cc_by_licenses.py, specifically this regex: https://github.com/mlcommons/peoples-speech/blob/12b5b79d2919b0cdac5367dc60eae0b268b7adf8/galvasr2/dump_cc_by_licenses.py#L79

Distributing the two datasets separately supports users who cannot use CC-BY-SA data.
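
A rough sketch of what the split could look like, assuming each file comes with a license URL. The patterns and the names `classify_license` and `split_by_license` are hypothetical and for illustration only; the actual regex in `dump_cc_by_licenses.py` may differ and should be reused instead.

```python
import re

# Hypothetical patterns, NOT the regex from dump_cc_by_licenses.py.
CC_BY_SA = re.compile(r"creativecommons\.org/licenses/by-sa/", re.IGNORECASE)
CC_BY_OR_PD = re.compile(
    r"creativecommons\.org/(licenses/by/|publicdomain/)", re.IGNORECASE
)


def classify_license(license_url):
    """Return the dataset name for a license URL, or None if it fits neither."""
    if CC_BY_SA.search(license_url):
        return "cc_by_sa"
    if CC_BY_OR_PD.search(license_url):
        return "cc_by_or_public_domain"
    return None


def split_by_license(records):
    """Group (identifier, license_url) pairs into the two datasets.

    Records whose license matches neither pattern are left out of both.
    """
    buckets = {"cc_by_sa": [], "cc_by_or_public_domain": []}
    for identifier, license_url in records:
        bucket = classify_license(license_url)
        if bucket is not None:
            buckets[bucket].append(identifier)
    return buckets
```

Each bucket could then be passed to the existing tar-creation step to produce one archive per dataset.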

Aug 23 '21 18:08 galv