RedPajama-Data fix clean copyright


Open hust-nj opened this issue 1 year ago • 2 comments

I think there are two main problems in the current clean_copyright_comments function: https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L27

First, it fails to remove the copyright line in the following C-style code because of the early return at https://github.com/togethercomputer/RedPajama-Data/blob/567ac9a0927c6dd3a2bf7e880de191239acfc308/data_prep/github/github_clean_dedup_local.py#L37

// Copyright

int main() {
    return 0;
    
    /* comment */
}

Second, in my experiments the regex sometimes takes a long time on large files. I think we only need to search for the copyright in the first 100 lines.
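
The second point can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name strip_copyright_block_in_head, the BLOCK_COMMENT pattern, and the max_lines parameter are all hypothetical, and the 100-line default is simply the figure proposed above.

```python
import re

# Hypothetical pattern: a C-style block comment, non-greedy so it stops
# at the first closing "*/".
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_copyright_block_in_head(content: str, max_lines: int = 100) -> str:
    """Remove the first block comment mentioning copyright, searching
    only the first `max_lines` lines (assumed cutoff, not a verified one)."""
    lines = content.splitlines(keepends=True)
    head = "".join(lines[:max_lines])
    tail = "".join(lines[max_lines:])

    # The regex only ever sees `head`, so its cost is bounded by the
    # size of the first `max_lines` lines rather than the whole file.
    match = BLOCK_COMMENT.search(head)
    if match and "copyright" in match.group(0).lower():
        head = head[:match.start()] + head[match.end():]
    return head + tail
```

One limitation of this sketch: a block comment that opens within the first max_lines lines but closes after them is not matched, so it would be left in place.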

Apr 27 '23 16:04 hust-nj

Hi @hust-nj! Thanks for bringing this to our attention! I will review your PR as soon as possible.

May 02 '23 15:05 mauriceweber

Hi @hust-nj, I had a look at your PR. Here's some feedback:

I would prefer not to limit the search for copyright to the first 100 lines. What is the 100-line figure based on?
Your current implementation also removes comments at the beginning of any file, which we would like to keep. For example, this:

// A comment

int main() {
    return 0;
    
    /* comment */
}

yields

int main() {
    return 0;
    
    /* comment */
}
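
One way to keep ordinary leading comments while still dropping copyright headers is to remove the leading comment lines only when they mention copyright. The following is a minimal sketch under that assumption; strip_leading_copyright_lines and its prefixes parameter are hypothetical names, not part of the repository or the PR.

```python
def strip_leading_copyright_lines(content: str, prefixes=("//",)) -> str:
    """Drop the leading run of line comments and blank lines, but only
    if that run mentions copyright; otherwise return content unchanged.

    `prefixes` is an assumed parameter. Only "//" is used by default,
    because treating "#" as a comment marker would also match C
    preprocessor lines such as #include.
    """
    lines = content.split("\n")

    # Count the leading lines that are blank or start with a comment prefix.
    end = 0
    for line in lines:
        stripped = line.strip()
        if stripped == "" or stripped.startswith(prefixes):
            end += 1
        else:
            break

    header = lines[:end]
    if any("copyright" in line.lower() for line in header):
        return "\n".join(lines[end:])
    return content
```

With this sketch, the "// Copyright" example from the first comment loses its header, while the "// A comment" example above is returned unchanged. The trailing /* comment */ inside main is untouched in both cases, since only the leading lines are inspected.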

May 09 '23 06:05 mauriceweber
