dolma

Clarification Needed on "C4 NoPunc" in Data Processing

Open codefly13 opened this issue 1 month ago • 0 comments

I am currently working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to clarify what exactly this term refers to. Specifically, does "C4 NoPunc" mean:

  1. Quality filters are applied except for the "lines_with_no_ending_punctuation" rule. This means all other C4 quality filters are applied, but lines are not removed based solely on the absence of ending punctuation.

  2. Only the "lines_with_no_ending_punctuation" rule is used in quality filtering. This means that the sole criterion for removing lines is the absence of ending punctuation, and no other C4 quality filters are applied.
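
To make the two readings concrete, here is a minimal Python sketch of how I picture each one. This is not the actual Dolma or C4 implementation: the helper functions, the `ALL_RULES` registry, and the `lines_with_too_few_words` rule are hypothetical stand-ins for "the other C4 quality filters", and the punctuation check is only a rough approximation of the real rule.

```python
from typing import Callable, Dict, List

# A rule returns True when the line should be removed.
LineRule = Callable[[str], bool]

NO_PUNC = "lines_with_no_ending_punctuation"


def no_ending_punctuation(line: str) -> bool:
    # Rough approximation: flag lines that do not end in terminal punctuation.
    return not line.rstrip().endswith((".", "!", "?", '"'))


def too_few_words(line: str) -> bool:
    # Hypothetical stand-in for "some other C4 quality rule".
    return len(line.split()) < 3


# Hypothetical registry of rules; the real filter set is not reproduced here.
ALL_RULES: Dict[str, LineRule] = {
    NO_PUNC: no_ending_punctuation,
    "lines_with_too_few_words": too_few_words,
}


def filter_lines(lines: List[str], rules: Dict[str, LineRule]) -> List[str]:
    # Keep a line only if no active rule flags it.
    return [line for line in lines if not any(rule(line) for rule in rules.values())]


def interpretation_1(lines: List[str]) -> List[str]:
    # Reading 1: every rule EXCEPT the ending-punctuation rule is applied.
    rules = {name: rule for name, rule in ALL_RULES.items() if name != NO_PUNC}
    return filter_lines(lines, rules)


def interpretation_2(lines: List[str]) -> List[str]:
    # Reading 2: ONLY the ending-punctuation rule is applied.
    return filter_lines(lines, {NO_PUNC: ALL_RULES[NO_PUNC]})
```

Under these assumptions, a short line such as `Short.` survives only under reading 2, while an unpunctuated line such as `A heading without a final period` survives only under reading 1, so the two readings produce noticeably different data.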

Could you please provide some insight into which of these interpretations is correct, or if there's another meaning entirely?

codefly13 · May 16 '24 02:05