RedPajama-Data
fix clean copyright
I think there are two main problems with the current clean_copyright_comments
function: https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27.
First, it fails to remove the copyright comment in the following C-style code because of the early return at https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L37:
// Copyright
int main() {
return 0;
/* comment */
}
Second, in my experiments the regex can take a long time on large files. I think we only need to search for the copyright in the first 100 lines.
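To make the second point concrete, here is a rough sketch of restricting the block-comment scan to the head of the file. It is not the code in the repository: the function name `strip_copyright_blocks_in_head`, the `max_lines` parameter, and the simplified regex are all assumptions made for illustration.

```python
import re

# Simplified pattern for C-style block comments (an assumption, not the
# pattern used in the repository). Non-greedy, so each comment matches alone.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", flags=re.DOTALL)


def strip_copyright_blocks_in_head(content: str, max_lines: int = 100) -> str:
    """Remove /* ... */ comments that mention 'copyright', scanning only
    the first `max_lines` lines; the rest of the file is left untouched."""
    lines = content.split("\n")
    head = "\n".join(lines[:max_lines])
    tail = lines[max_lines:]

    def drop_if_copyright(match: re.Match) -> str:
        text = match.group(0)
        # Keep ordinary comments, drop only those mentioning copyright.
        return "" if "copyright" in text.lower() else text

    head = BLOCK_COMMENT.sub(drop_if_copyright, head)
    return "\n".join([head] + tail)
```

One consequence of this design is that a block comment straddling the `max_lines` boundary is not matched at all, so the cutoff trades completeness for speed.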
Hi @hust-nj ! Thanks for bringing this to our attention! I will review your PR asap.
Hi @hust-nj , I had a look at your PR. Here's some feedback:
- I would prefer not to limit the copyright search to the first 100 lines. What is the choice of 100 lines based on?
- Your current implementation also removes any comments at the beginning of a file, which we would like to keep. For example, this:
// A comment
int main() {
return 0;
/* comment */
}
yields
int main() {
return 0;
/* comment */
}
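One possible way to reconcile the two examples in this thread is to drop a leading run of line comments only when it actually mentions copyright. The sketch below is an assumption about how that could look, not the implementation in the PR or the repository; `strip_leading_copyright_lines` and its `prefixes` parameter are hypothetical names.

```python
def strip_leading_copyright_lines(content: str, prefixes=("//",)) -> str:
    """Drop the leading run of line comments (and blank lines) only if
    that run mentions 'copyright'; otherwise return content unchanged."""
    lines = content.split("\n")
    end = 0
    for line in lines:
        stripped = line.strip()
        # Stop at the first line that is neither blank nor a line comment.
        if stripped and not stripped.startswith(tuple(prefixes)):
            break
        end += 1
    header = lines[:end]
    if any("copyright" in line.lower() for line in header):
        return "\n".join(lines[end:])
    return content


# The copyright header is removed, independent of any block comments
# later in the file.
print(strip_leading_copyright_lines(
    "// Copyright\nint main() {\nreturn 0;\n/* comment */\n}"))

# An ordinary leading comment is kept.
print(strip_leading_copyright_lines(
    "// A comment\nint main() {\nreturn 0;\n/* comment */\n}"))
```

Note that this removes the whole leading comment run as soon as any line in it mentions copyright, and the default prefix is limited to `//` because a `#` prefix would also match C preprocessor lines such as `#include`.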