clang8

The size of the datasets

Open DarlingJOJO opened this issue 2 years ago • 2 comments

I'd like to know why cLang-8 is larger than the original Lang-8. cLang-8 contains 2,372,119 English sent-pairs, while Lang-8 contains only 1,037,561 English sent-pairs.

DarlingJOJO, Jun 06 '22 14:06

I was wondering the same. If the authors of clang8 could clarify this, it would be really helpful.

cc @ekQ

ashokrajab, Jul 14 '22 10:07

We use the raw Lang-8 dataset with 237,843 English entries (each consisting of multiple sentences) while the dataset with 1,037,561 English sent-pairs that you're referring to probably corresponds to the cleaned English v1.0 corpus with 100,051 entries.

ekQ, Jul 18 '22 13:07
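
As a rough sanity check on the numbers quoted in this thread, the sketch below computes the average number of sentence pairs per entry for both corpora. It assumes (this is not confirmed above) that all 2,372,119 cLang-8 pairs derive from the 237,843 raw entries; the constant and function names are made up for illustration.

```python
# Figures quoted in this thread.
CLANG8_PAIRS = 2_372_119   # cLang-8 English sentence pairs
RAW_ENTRIES = 237_843      # raw Lang-8 English entries
V1_PAIRS = 1_037_561       # cleaned English v1.0 sentence pairs
V1_ENTRIES = 100_051       # cleaned English v1.0 entries


def pairs_per_entry(pairs: int, entries: int) -> float:
    """Average number of sentence pairs contributed by one entry."""
    return pairs / entries


# Assumption: every cLang-8 pair comes from one of the raw entries.
raw_ratio = pairs_per_entry(CLANG8_PAIRS, RAW_ENTRIES)
v1_ratio = pairs_per_entry(V1_PAIRS, V1_ENTRIES)

# How much larger the raw source is, by entries and by sentence pairs.
entry_growth = RAW_ENTRIES / V1_ENTRIES
pair_growth = CLANG8_PAIRS / V1_PAIRS

print(f"pairs per entry (raw / cLang-8): {raw_ratio:.2f}")
print(f"pairs per entry (cleaned v1.0):  {v1_ratio:.2f}")
print(f"entry growth: {entry_growth:.2f}x, pair growth: {pair_growth:.2f}x")
```

Under that assumption, both corpora average roughly ten sentence pairs per entry, so the larger size of cLang-8 is in line with the larger number of source entries.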