
[Question] How to solve [datasets.builder.DatasetGenerationError: An error occurred while generating the dataset]

Open Shawnzheng011019 opened this issue 1 year ago • 8 comments

What is your question?

Traceback (most recent call last):
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1618, in _prepare_split_single
    writer = writer_class(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\arrow_writer.py", line 334, in __init__
    self.stream = self._fs.open(fs_token_paths[2][0], "wb")
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\spec.py", line 1309, in open
    f = self._open(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 180, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 298, in __init__
    self._open()
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\fsspec\implementations\local.py", line 303, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/shawn/.cache/huggingface/datasets/named_entity_recognition_dataset_builder/default-c270794ce0d23d06/0.0.0/db737b9bb893f20fb03d04403a30bf7c033256c212b7e9f0ebc6e9c958535c51.incomplete/named_entity_recognition_dataset_builder-train-00000-00000-of-NNNNN.arrow'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\shawn\anaconda3\envs\pytorch\Scripts\adaseq.exe\__main__.py", line 7, in <module>
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\main.py", line 13, in run
    main(prog='adaseq')
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\__init__.py", line 29, in main
    args.func(args)
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 84, in train_model_from_args
    train_model(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 156, in train_model
    trainer = build_trainer_from_partial_objects(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\commands\train.py", line 185, in build_trainer_from_partial_objects
    dm = DatasetManager.from_config(task=config.task, **config.dataset)
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\adaseq\data\dataset_manager.py", line 182, in from_config
    hfdataset = hf_load_dataset(path, name=name, **kwargs)
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 909, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1670, in _download_and_prepare
    super()._download_and_prepare(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1004, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1508, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "C:\Users\shawn\anaconda3\envs\pytorch\lib\site-packages\datasets\builder.py", line 1665, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
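Editorial note: the root error is a FileNotFoundError raised while opening a cache file for writing, and the path it names is very long. On Windows, the classic MAX_PATH limit is 260 characters (including the terminating NUL), and exceeding it can surface as exactly this kind of "No such file or directory" error. Nothing in this thread confirms that this is the cause here, so treat the following as a quick check to rule it out rather than a diagnosis. The helper name `exceeds_windows_max_path` is made up for illustration; the path is copied from the traceback with the terminal line-wrap spaces removed.

```python
# Classic Windows limit: MAX_PATH is 260 characters including the
# terminating NUL, so a path of 260 or more visible characters is too long
# unless long-path support is enabled.
WINDOWS_MAX_PATH = 260


def exceeds_windows_max_path(path: str) -> bool:
    """Return True if `path` is too long for the classic MAX_PATH limit."""
    return len(path) >= WINDOWS_MAX_PATH


# Path taken from the FileNotFoundError in the traceback
# (assumption: the stray spaces in the pasted log are line-wrap artifacts).
cache_file = (
    "C:/Users/shawn/.cache/huggingface/datasets/"
    "named_entity_recognition_dataset_builder/default-c270794ce0d23d06/0.0.0/"
    "db737b9bb893f20fb03d04403a30bf7c033256c212b7e9f0ebc6e9c958535c51.incomplete/"
    "named_entity_recognition_dataset_builder-train-00000-00000-of-NNNNN.arrow"
)

if __name__ == "__main__":
    print("path length:", len(cache_file))
    print("over MAX_PATH:", exceeds_windows_max_path(cache_file))
```

If the check reports that the path is over the limit, one thing to try is pointing the Hugging Face datasets cache at a shorter directory (for example through the `HF_DATASETS_CACHE` environment variable) or enabling long-path support in Windows.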

What have you tried?

I set an HTTP proxy and confirmed that I could connect to YouTube.

Code (if necessary)

No response

What's your environment?

  • AdaSeq Version (e.g., 1.0 or master):
  • ModelScope Version (e.g., 1.0 or master):
  • PyTorch Version (e.g., 1.12.1):
  • OS (e.g., Ubuntu 20.04):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

Shawnzheng011019 avatar Oct 23 '23 02:10 Shawnzheng011019

The environment was set up automatically from the requirements.txt file.

Shawnzheng011019 avatar Oct 23 '23 02:10 Shawnzheng011019

I ran into the same problem. It looks like AdaSeq's dataset-loading logic may be at fault. The format I am loading the dataset with,

data_type: json_spans

may be the problem.

ykallan avatar Dec 16 '23 17:12 ykallan

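This happens because the dataset cannot be found or is not in a standard format that can be parsed. You can rewrite the data loading by following the loading code of the toy msra example.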

PPPP-kaqiu avatar Mar 12 '24 13:03 PPPP-kaqiu

@PPPP-kaqiu Did you rewrite it? Could you share it?

houyuchao avatar Mar 19 '24 09:03 houyuchao

@Shawnzheng011019 Did you manage to solve this?

lichen146 avatar Apr 26 '24 09:04 lichen146

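Write the data loading script strictly in the HF datasets format. In the YAML, the data loading entry then only needs to point to the folder that contains the data.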

PPPP-kaqiu avatar Apr 26 '24 09:04 PPPP-kaqiu
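Editorial note: the thread never includes the rewritten script, so the following is only a rough sketch of the part of such a script that produces examples. In a Hugging Face `datasets` loading script, the builder's `_generate_examples` method yields `(key, example)` pairs; the standalone function below mimics that contract using only the standard library. The function name `generate_examples`, the two-column "token tag" file layout with blank lines between sentences, and the field names `tokens` and `ner_tags` are all assumptions for illustration, not AdaSeq's confirmed format.

```python
from typing import Dict, Iterator, List, Tuple

Example = Dict[str, List[str]]


def generate_examples(filepath: str) -> Iterator[Tuple[int, Example]]:
    """Yield (key, example) pairs from a CoNLL-style file.

    Assumed layout: one "token tag" pair per line, with a blank line
    separating sentences.
    """
    tokens: List[str] = []
    tags: List[str] = []
    key = 0
    with open(filepath, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                # Blank line: the current sentence is complete.
                if tokens:
                    yield key, {"tokens": tokens, "ner_tags": tags}
                    key += 1
                    tokens, tags = [], []
                continue
            tokens.append(parts[0])   # first column is the token
            tags.append(parts[-1])    # last column is the tag
    # Flush the final sentence if the file does not end with a blank line.
    if tokens:
        yield key, {"tokens": tokens, "ner_tags": tags}
```

In a real loading script this logic would sit inside the builder class, and the YAML would reference the folder holding that script and the data files, as the comment above describes.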

@PPPP-kaqiu Could we connect on WeChat? I would really appreciate your help. WX: Xugeyuan923

lichen146 avatar Apr 26 '24 09:04 lichen146

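"Write the data loading script strictly in the HF datasets format. In the YAML, the data loading entry then only needs to point to the folder that contains the data."

Hello, could you share how you solved it?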

houyuchao avatar Jul 21 '24 09:07 houyuchao