clang8
The size of the datasets
I'd like to know why cLang-8 is larger than the original Lang-8. cLang-8 contains 2,372,119 English sentence pairs, while Lang-8 contains only 1,037,561 English sentence pairs.
I was wondering the same. If the authors of cLang-8 could clarify this, it would be really helpful.
cc @ekQ
We use the raw Lang-8 dataset with 237,843 English entries (each consisting of multiple sentences), while the dataset with 1,037,561 English sentence pairs that you're referring to probably corresponds to the cleaned English v1.0 corpus, which has 100,051 entries.
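
As a rough sanity check on the numbers quoted in this thread, the sketch below divides each sentence-pair count by the matching entry count. The variable names are illustrative only, and it assumes the pair counts can be compared directly with the entry counts, which the thread does not confirm.

```python
# Counts quoted in this thread (variable names are hypothetical).
clang8_sentence_pairs = 2_372_119   # cLang-8 English sentence pairs
raw_lang8_entries = 237_843         # raw Lang-8 English entries

lang8_v1_sentence_pairs = 1_037_561  # sentence pairs attributed to the cleaned v1.0 corpus
lang8_v1_entries = 100_051           # cleaned English v1.0 entries

# Average number of sentence pairs per entry for each source.
clang8_per_entry = clang8_sentence_pairs / raw_lang8_entries
lang8_v1_per_entry = lang8_v1_sentence_pairs / lang8_v1_entries

print(f"cLang-8 vs raw Lang-8: {clang8_per_entry:.2f} pairs per entry")
print(f"Lang-8 v1.0:           {lang8_v1_per_entry:.2f} pairs per entry")
```

Both ratios come out at roughly ten sentence pairs per entry, which fits the explanation that the size gap comes mainly from the number of source entries and not from a difference in sentences per entry.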