dolma
Need clarification of Gopher in Step 2
Dear authors, I am trying to reimplement Dolma-Web as described in your paper. In Step 2, which uses the dolma toolkit, I found that the Gopher implementation in this repo differs from the original Gopher rules at http://arxiv.org/abs/2112.11446. Specifically, the current code at /python/dolma/taggers.py does not compute 'Duplicate paragraph fraction' or 'Duplicate paragraph character fraction', both of which are listed in Table A1 of the Gopher paper.
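To make the question concrete, here is a minimal sketch of how I understand the two missing metrics. It is only my reading of Table A1, not code from this repo or from the Gopher authors. The function name `duplicate_paragraph_fractions` is hypothetical, and I am assuming that paragraphs are separated by blank lines and that every occurrence of a paragraph after its first counts as a duplicate.

```python
import re
from collections import Counter
from typing import Tuple


def duplicate_paragraph_fractions(text: str) -> Tuple[float, float]:
    """Return (duplicate paragraph fraction, duplicate paragraph character fraction).

    Assumptions (mine, not confirmed by the paper or this repo):
    - paragraphs are separated by one or more blank lines;
    - each occurrence of a paragraph after the first is a duplicate.
    """
    # Split on blank lines and drop empty or whitespace-only paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n{2,}", text) if p.strip()]
    if not paragraphs:
        return 0.0, 0.0

    counts = Counter(paragraphs)

    # Number of repeated occurrences, and the characters they contain.
    dup_paragraphs = sum(c - 1 for c in counts.values())
    dup_chars = sum(len(p) * (c - 1) for p, c in counts.items())

    # Non-zero because every kept paragraph is non-empty.
    total_chars = sum(len(p) for p in paragraphs)

    return dup_paragraphs / len(paragraphs), dup_chars / total_chars


if __name__ == "__main__":
    sample = "aa\n\nbb\n\naa\n\ncccc"
    print(duplicate_paragraph_fractions(sample))
```

If the intended definition counts all copies of a repeated paragraph (including the first) as duplicates, the two sums would use `c` instead of `c - 1` for paragraphs with `c > 1`.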
Is this a bug, or is there no need to compute these metrics? I look forward to your reply.
Best regards, Xinlin Zhuang