RedPajama-Data

fix clean copyright

Open hust-nj opened this issue 1 year ago • 2 comments

I think there are two main problems with the current clean_copyright_comments function (https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27).

First, it fails to remove the copyright comment in the following C-style code because of the early return at https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L37:

// Copyright

int main() {
    return 0;
    
    /* comment */
}

Second, in my experiments the regex is sometimes very slow on large files. I think we only need to search for the copyright notice in the first 100 lines.
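
To make the second point concrete, here is a minimal sketch of bounding the regex search to the head of the file. It is not the repository's code: the pattern BLOCK_COMMENT, the helper names split_head and find_copyright_block_in_head, and the max_lines default of 100 are all assumptions chosen for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for a C-style block comment; not taken from the repository.
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def split_head(content: str, max_lines: int = 100) -> Tuple[str, str]:
    """Split content into its first max_lines lines and the remainder."""
    lines = content.splitlines(keepends=True)
    return "".join(lines[:max_lines]), "".join(lines[max_lines:])


def find_copyright_block_in_head(
    content: str, max_lines: int = 100
) -> Optional[Tuple[int, int]]:
    """Return the (start, end) span of the first block comment that mentions
    "copyright" within the first max_lines lines, or None if there is none.

    The regex only ever sees the head, so its cost is bounded by the size of
    the head rather than by the size of the whole file.
    """
    head, _ = split_head(content, max_lines)
    for match in BLOCK_COMMENT.finditer(head):
        if "copyright" in match.group(0).lower():
            return match.span()
    return None
```

One trade-off of this approach: a block comment that opens inside the head but closes after the cutoff is not matched, so the cutoff has to be larger than the longest license header you expect.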

hust-nj avatar Apr 27 '23 16:04 hust-nj

Hi @hust-nj ! Thanks for bringing this to our attention! I will review your PR asap.

mauriceweber avatar May 02 '23 15:05 mauriceweber

Hi @hust-nj , I had a look at your PR. Here's some feedback:

  • I would prefer not to limit the copyright search to the first 100 lines. What is the 100-line cutoff based on?
  • Your current implementation also removes comments at the beginning of any file, which we would like to keep. For example, this:
// A comment

int main() {
    return 0;
    
    /* comment */
}

yields

int main() {
    return 0;
    
    /* comment */
}
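
One way to get the behavior described here is to drop the leading run of // lines only when that run mentions "copyright". The following is a rough sketch under that assumption, not the repository's implementation; the function name strip_leading_copyright is hypothetical, and it handles only // line comments.

```python
def strip_leading_copyright(content: str) -> str:
    """Remove a leading run of // comment lines only if it mentions copyright.

    Leading comments that do not mention copyright are kept unchanged.
    """
    lines = content.split("\n")

    # Find the end of the leading run of // comment lines.
    end = 0
    while end < len(lines) and lines[end].lstrip().startswith("//"):
        end += 1

    header = "\n".join(lines[:end])
    if end == 0 or "copyright" not in header.lower():
        # No leading comment, or not a copyright notice: keep the file as is.
        return content

    # Also drop the blank lines that separated the notice from the code.
    while end < len(lines) and not lines[end].strip():
        end += 1
    return "\n".join(lines[end:])
```

With this sketch, the file starting with "// A comment" is returned unchanged, while the file starting with "// Copyright" loses only that header and the blank line after it; the /* comment */ inside main is untouched in both cases.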

mauriceweber avatar May 09 '23 06:05 mauriceweber