Need help customizing python/dolma/taggers/c4.py
Dear authors, I tried to implement the rule on page 57 of your Dolma paper: 'Remove documents with more than half of their line not ending in...'. To do so, I modified lines 107-130 of python/dolma/taggers/c4.py as follows:
    start = count = 0
    line_no_pending_punc_count = 0
    for sent in text.split("\n"):
        end = start + len(sent)
        if end != len(text):
            # account for the newline
            end += 1
        # strip any trailing whitespace
        sent = sent.strip()
        if not sent.endswith((".", "?", "!", '"')):
            spans.append(Span(start, end, type="lines_with_no_ending_punctuation"))
            line_no_pending_punc_count += 1
        if len(sent.split()) < MIN_WORDS_PER_LINE:
            spans.append(Span(start, end, type="lines_with_too_few_words"))
        count += 1
        start = end
    spans.append(Span(0, len(doc.text), type="line_count", score=count))
    spans.append(Span(0, len(doc.text), type="lines_with_no_ending_punctuation_ratio", score=line_no_pending_punc_count / count))
    return DocResult(doc=doc, spans=spans)
However, the 'lines_with_no_ending_punctuation_ratio' span never shows up: the output of the c4_v2 tagger does not contain this field.
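To make the intended rule concrete, here is a minimal standalone sketch of the computation I am after, with no dependency on dolma. The names `END_PUNCTUATION`, `no_ending_punctuation_ratio`, and `should_remove` are hypothetical and exist only for this illustration. The punctuation set is the one used in the code above, and the 0.5 threshold is my reading of "more than half", so please treat both as assumptions rather than the paper's exact definition.

```python
from typing import Tuple

# Assumption: same terminal punctuation set as in the modified tagger code.
END_PUNCTUATION: Tuple[str, ...] = (".", "?", "!", '"')


def no_ending_punctuation_ratio(text: str) -> float:
    """Fraction of lines that do not end in terminal punctuation."""
    lines = text.split("\n")
    # str.split always returns at least one element, so len(lines) >= 1
    # and the division below cannot fail.
    missing = sum(
        1 for line in lines if not line.strip().endswith(END_PUNCTUATION)
    )
    return missing / len(lines)


def should_remove(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical filter: flag documents whose ratio exceeds the threshold."""
    return no_ending_punctuation_ratio(text) > threshold


if __name__ == "__main__":
    sample = "A full sentence.\nmenu item\nanother fragment"
    print(no_ending_punctuation_ratio(sample))
    print(should_remove(sample))
```

In the tagger, the same ratio would be attached as the score of a document-level span, which is what the last `spans.append(...)` call above is meant to do.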
Could you please help me with this C4 rule?
Many thanks! :)
Best regards, Xinlin Zhuang