NeMo-Curator
NeMo-Curator copied to clipboard
[FEA] Add license detector for code repositories
Sometimes, it is needed to detect the code license of the corpus being curated.
Some code repositories, such as GitHub, provides an API to identify the repository license, but other code repositories, such as Software Heritage, does not provide that capability.
There is a nice tool, scancode-toolkit that could be used for that purpose.
My understanding is that its API support extracting licenses from disk locations, I think that the solution might fake an in-memory filesystem. I would suggest having a look at https://github.com/pytest-dev/pyfakefs for that purpose.
As an optimization, we might identify the files that might contain a license based on a regex pattern, as detailed in Appendix A.3. from StarCoder 2 and The Stack v2: The Next Generation paper.
Hope it helps! Miguel