
Need clarification of Gopher in Step 2

Open mihara-bot opened this issue 8 months ago • 0 comments

Dear authors, I am trying to reimplement Dolma-Web as described in your paper. In Step 2, using the dolma toolkit, I found that the Gopher implementation in this repo differs from the original Gopher paper (http://arxiv.org/abs/2112.11446). Specifically, the current code at /python/dolma/taggers.py does not compute 'Duplicate paragraph fraction' or 'Duplicate paragraph character fraction', both of which are listed in Table A1 of the Gopher paper.
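For reference, here is roughly what I would expect the missing computation to look like. This is only a minimal sketch and not code from dolma or from the Gopher authors. The function name `duplicate_paragraph_fractions` is hypothetical, and two points are my own assumptions: that paragraphs are separated by blank lines, and that only the second and later occurrences of a paragraph count as duplicates.

```python
import re
from typing import Tuple


def duplicate_paragraph_fractions(text: str) -> Tuple[float, float]:
    """Return (duplicate paragraph fraction, duplicate paragraph character fraction).

    Hypothetical helper, not part of dolma.
    """
    # Assumption: a paragraph boundary is one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0, 0.0

    seen = set()
    dup_count = 0  # number of paragraphs that repeat an earlier paragraph
    dup_chars = 0  # characters contained in those repeated paragraphs
    for paragraph in paragraphs:
        if paragraph in seen:
            # Assumption: the first occurrence is not counted as a duplicate.
            dup_count += 1
            dup_chars += len(paragraph)
        else:
            seen.add(paragraph)

    # Character totals are over stripped paragraph text, excluding separators.
    total_chars = sum(len(p) for p in paragraphs)
    return dup_count / len(paragraphs), dup_chars / total_chars
```

A document would then be filtered by comparing these two values against the corresponding thresholds in Table A1.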

Is this a bug, or is there no need to compute these metrics? I look forward to your reply.

Best regards, Xinlin Zhuang

mihara-bot, Jun 18 '24 11:06