
Codec error when converting MovieLens dataset

Open guedes-joaofelipe opened this issue 3 years ago • 3 comments

I followed the instructions in Readme.md to download and convert the MovieLens dataset, but I got the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2892: invalid continuation byte

I changed the pd.read_csv call in conversion_tools/src/extended_dataset.py (line 52) to include an encoding argument, which fixed the problem:

pd.read_csv(self.item_file, delimiter=self.item_sep, header=None, engine='python', encoding = "ISO-8859-1")
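
For context, byte 0xe9 is "é" in ISO-8859-1, and on its own it is not valid UTF-8, which is what the error message is complaining about. Below is a minimal sketch of choosing an encoding before calling pd.read_csv. The helper name pick_encoding and the sample line are hypothetical and not part of the repository.

```python
def pick_encoding(raw: bytes, candidates=("utf-8", "ISO-8859-1")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up item line containing "é", stored as the single byte 0xe9.
sample = "1|Café Society (1995)".encode("ISO-8859-1")
print(pick_encoding(sample))  # ISO-8859-1
```

The returned name could then be passed as the encoding argument to pd.read_csv. Note that ISO-8859-1 accepts every byte value, so it never raises and should stay last in the candidate list.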

guedes-joaofelipe avatar Jul 24 '21 19:07 guedes-joaofelipe

Hi, @guedes-joaofelipe! Thank you for your issue, but we can't reproduce the problem here. So could you please check your dataset and your environment again?

EliverQ avatar Jul 26 '21 13:07 EliverQ

I had the same problem.

ZZZZZZZZeng avatar Nov 28 '22 05:11 ZZZZZZZZeng

@EliverQ I had the same problem when converting the Yelp dataset on Windows.

Traceback (most recent call last):
  File "run.py", line 40, in <module>
    datasets.convert_inter()
  File "D:\学业\研究生\数据集\数据集转换程序\RecSysDatasets-master\conversion_tools\src\extended_dataset.py", line 4581, in convert_inter
    for _ in fin:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8b in position 1909: illegal multibyte sequence
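
The 'gbk' in the message suggests the file was opened without an explicit encoding, so Python fell back to the Windows locale default. A minimal sketch of the likely workaround follows, assuming the Yelp files are UTF-8. The count_lines helper and the temporary file are hypothetical and only stand in for the loop in convert_inter.

```python
import os
import tempfile


def count_lines(path: str, encoding: str = "utf-8") -> int:
    """Iterate over a text file with an explicit encoding instead of the locale default."""
    with open(path, "r", encoding=encoding) as fin:
        return sum(1 for _ in fin)


# Write a small UTF-8 file with non-ASCII text, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "review.json")
    with open(path, "w", encoding="utf-8") as fout:
        fout.write('{"text": "crème brûlée"}\n{"text": "ok"}\n')
    n_lines = count_lines(path)

print(n_lines)  # 2
```

Passing encoding="utf-8" to the open call that creates fin should behave the same on Windows and Linux, since it no longer depends on the system locale.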

ZZZZZZZZeng avatar Nov 28 '22 05:11 ZZZZZZZZeng