CialloCorpus
Error when downloading from Hugging Face
Code used for the download:
from datasets import load_dataset

dataset_name = "Papersnake/people_daily_news"
dataset = load_dataset(dataset_name, cache_dir=r'xxx/')
Error message:
An error occurred while generating the dataset
All the data files must have the same columns, but at some point there are 2 missing columns ({'author', 'page'})
This happened while the json dataset builder was generating data using
..\downloads\d434406d0e80132d996bc6796817699b81390d86744e10acda0ec2ea71fead71
Please either edit the data files to have matching columns, or separate them into different configurations (see docs at https://hf.co/docs/hub/datasets-manual-configuration#multiple-configurations)
Traceback (most recent call last):
File "_pydevd_bundle/pydevd_cython.pyx", line 546, in _pydevd_bundle.pydevd_cython.PyDBFrame._handle_exception
File "C:\Program Files\Python39\lib\linecache.py", line 26, in getline
def getline(filename, lineno, module_globals=None):
File "C:\Program Files\Python39\lib\linecache.py", line 36, in getlines
def getlines(filename, module_globals=None):
File "C:\Program Files\Python39\lib\linecache.py", line 80, in updatecache
def updatecache(filename, module_globals=None):
File "C:\Program Files\Python39\lib\codecs.py", line 319, in decode
def decode(self, input, final=False):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 41: invalid start byte
I opened the corresponding file; its contents are:
{"url": "hf://datasets/Papersnake/people_daily_news@e61323bc7692312d907fc2d154b4ffc4290ce496/2004.jsonl.gz", "etag": null}
The jsonl files for different years are not guaranteed to share the same schema, so we suggest downloading the data and processing it manually. For example, after installing git lfs, run git clone https://huggingface.co/datasets/Papersnake/people_daily_news to download the data.
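A minimal sketch of the manual-processing step, assuming the repo has been cloned locally so the per-year files sit under a people_daily_news/ directory as *.jsonl.gz (the directory name and glob pattern are assumptions; adjust to your clone path). pandas.concat aligns frames with differing columns automatically, so years missing 'author' or 'page' just get NaN instead of raising the error above:

```python
import glob
import gzip
import json

import pandas as pd

def load_jsonl_gz(path):
    # Read one gzipped JSON-lines file into a DataFrame.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return pd.DataFrame(json.loads(line) for line in f)

def load_all(pattern="people_daily_news/*.jsonl.gz"):
    # concat aligns frames whose columns differ, filling missing
    # columns (e.g. 'author', 'page' in some years) with NaN.
    frames = [load_jsonl_gz(p) for p in sorted(glob.glob(pattern))]
    return pd.concat(frames, ignore_index=True)
```

From there you can clean or drop the sparse columns and, if you still want a datasets.Dataset, build one from the unified DataFrame.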
OK, I'll give it a try.