Dataset encoding format

Open · foreverlove944 opened this issue 11 months ago · 1 comment

What encoding is used for the dataset you provided? I opened it as UTF-8. English characters display correctly, but text in Russian and other languages is garbled. [Screenshot 2024-04-03 205121]

foreverlove944 · Apr 03 '24 12:04

The contexts/paragraphs were taken directly from the original source datasets. However, I did apply ftfy at runtime to clean up the text; see commaqa/inference/dataset_readers.py for an example. You might want to give it a try.
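For anyone trying this, here is a minimal sketch of cleaning a paragraph after loading it. It is not the code in commaqa/inference/dataset_readers.py. `ftfy.fix_text` is ftfy's real entry point, but the helper names `undo_cp1252_mojibake` and `repair` are hypothetical. The stdlib fallback assumes the garbling is the common case of UTF-8 bytes decoded as Windows-1252, which may not be what the dataset files actually contain.

```python
try:
    import ftfy  # third-party; install with `pip install ftfy`
except ImportError:
    ftfy = None


def undo_cp1252_mojibake(text: str) -> str:
    """Stdlib-only fallback (assumption): reverse UTF-8 read as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave the text untouched


def repair(text: str) -> str:
    """Prefer ftfy when it is installed; otherwise use the fallback above."""
    if ftfy is not None:
        return ftfy.fix_text(text)
    return undo_cp1252_mojibake(text)


# Build a garbled sample by deliberately mis-decoding Russian text.
original = "Привет"
garbled = original.encode("utf-8").decode("cp1252")

restored = repair(garbled)
print(restored == original)
```

In a dataset reader, a call like `repair(paragraph_text)` would be applied to each context string right after it is parsed, so the files on disk stay unchanged.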

HarshTrivedi · Jun 12 '24 01:06