GLM-4
Is there a limit on the number of samples in the dataset for text-to-text fine-tuning?
System Info / 系統信息
absl-py 2.0.0 accelerate 0.33.0 addict 2.4.0 aiofiles 23.2.1 aiohttp 3.9.5 aiosignal 1.3.1 aliyun-python-sdk-core 2.15.1 aliyun-python-sdk-kms 2.16.3 altair 5.3.0 annotated-types 0.7.0 anyio 3.7.1 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 asttokens 2.4.1 async-lru 2.0.4 async-timeout 4.0.3 attrs 23.2.0 Babel 2.14.0 beautifulsoup4 4.12.2 bitsandbytes 0.43.3 bleach 6.1.0 blinker 1.8.2 brotlipy 0.7.0 cachetools 5.3.2 certifi 2022.12.7 cffi 1.15.1 chardet 4.0.0 charset-normalizer 2.0.4 click 8.1.7 comm 0.2.1 conda 22.11.1 conda-content-trust 0.1.3 conda-package-handling 1.9.0 contourpy 1.2.0 crcmod 1.7 cryptography 38.0.1 cycler 0.12.1 datasets 2.20.0 debugpy 1.8.0 decorator 5.1.1 deepspeed 0.14.4 defusedxml 0.7.1 dill 0.3.6 distro 1.9.0 einops 0.8.0 exceptiongroup 1.2.0 executing 2.0.1 fastapi 0.104.1 fastjsonschema 2.19.1 ffmpy 0.4.0 filelock 3.13.1 fonttools 4.47.0 fqdn 1.5.1 frozenlist 1.4.1 fsspec 2023.12.2 gast 0.5.4 gitdb 4.0.11 GitPython 3.1.43 google-auth 2.26.1 google-auth-oauthlib 1.2.0 gradio 4.42.0 gradio_client 1.3.0 greenlet 3.0.3 grpcio 1.60.0 h11 0.14.0 hjson 3.1.0 httpcore 1.0.5 httpx 0.27.2 huggingface-hub 0.24.6 idna 2.10 importlib-metadata 6.11.0 importlib_resources 6.4.4 ipykernel 6.28.0 ipython 8.20.0 ipywidgets 8.1.1 isoduration 20.11.0 jedi 0.19.1 jieba 0.42.1 Jinja2 3.1.4 jiter 0.5.0 jmespath 0.10.0 joblib 1.4.2 json5 0.9.14 jsonpatch 1.33 jsonpointer 2.4 jsonschema 4.20.0 jsonschema-specifications 2023.12.1 jupyter_client 8.6.0 jupyter_core 5.7.1 jupyter-events 0.9.0 jupyter-lsp 2.2.1 jupyter_server 2.12.2 jupyter_server_terminals 0.5.1 jupyterlab 4.0.10 jupyterlab-language-pack-zh-CN 4.0.post6 jupyterlab_pygments 0.3.0 jupyterlab_server 2.25.2 jupyterlab-widgets 3.0.9 kiwisolver 1.4.5 langchain 0.2.1 langchain-core 0.2.3 langchain-text-splitters 0.2.0 langsmith 0.1.69 Markdown 3.5.1 markdown-it-py 3.0.0 MarkupSafe 2.1.3 matplotlib 3.8.2 matplotlib-inline 0.1.6 mdurl 0.1.2 mistune 3.0.2 modelscope 1.9.5 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.14 nbclient 0.9.0 nbconvert 7.14.0 nbformat 5.9.2 nest-asyncio 1.5.8 networkx 3.2.1 ninja 1.11.1.1 nltk 3.8.1 notebook_shim 0.2.3 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-ml-py 12.560.30 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.6.68 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.2 openai 1.43.0 orjson 3.10.3 oss2 2.18.5 overrides 7.4.0 packaging 23.2 pandas 2.2.2 pandocfilters 1.5.0 parso 0.8.3 peft 0.12.0 pexpect 4.9.0 pillow 10.4.0 pip 24.2 platformdirs 4.1.0 pluggy 1.0.0 prometheus-client 0.19.0 prompt-toolkit 3.0.43 protobuf 4.23.4 psutil 5.9.7 ptyprocess 0.7.0 pure-eval 0.2.2 py-cpuinfo 9.0.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pyasn1 0.5.1 pyasn1-modules 0.3.0 pycosat 0.6.4 pycparser 2.21 pycryptodome 3.20.0 pydantic 2.8.2 pydantic_core 2.20.1 pydeck 0.9.1 pydub 0.25.1 Pygments 2.17.2 Pympler 1.0.1 pyOpenSSL 22.0.0 pyparsing 3.1.1 PySocks 1.7.1 python-dateutil 2.8.2 python-json-logger 2.0.7 python-multipart 0.0.9 pytz 2024.1 pytz-deprecation-shim 0.1.0.post0 PyYAML 6.0.1 pyzmq 25.1.2 referencing 0.32.1 regex 2024.5.15 requests 2.32.3 requests-oauthlib 1.3.1 responses 0.18.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rich 13.7.1 rouge-chinese 1.0.3 rpds-py 0.16.2 rsa 4.9 ruamel.yaml 0.18.6 ruamel.yaml.clib 0.2.8 ruff 0.6.3 safetensors 0.4.3 
scikit-learn 1.5.1 scipy 1.13.1 semantic-version 2.10.0 Send2Trash 1.8.2 sentence-transformers 3.0.1 sentencepiece 0.2.0 setuptools 65.5.0 shellingham 1.5.4 simplejson 3.19.2 six 1.16.0 smmap 5.0.1 sniffio 1.3.0 sortedcontainers 2.4.0 soupsieve 2.5 SQLAlchemy 2.0.30 sse-starlette 2.1.3 stack-data 0.6.3 starlette 0.27.0 streamlit 1.38.0 supervisor 4.2.5 sympy 1.12 tenacity 8.3.0 tensorboard 2.15.1 tensorboard-data-server 0.7.2 terminado 0.18.0 threadpoolctl 3.5.0 tiktoken 0.7.0 timm 1.0.9 tinycss2 1.2.1 tokenizers 0.19.1 toml 0.10.2 tomli 2.0.1 tomlkit 0.12.0 toolz 0.12.0 torch 2.4.0 torchvision 0.19.0 tornado 6.4 tqdm 4.66.5 traitlets 5.14.1 transformers 4.44.0 transformers-stream-generator 0.0.4 triton 3.0.0 typer 0.12.5 types-python-dateutil 2.8.19.20240106 typing_extensions 4.12.2 tzdata 2024.1 tzlocal 4.3.1 uri-template 1.3.0 urllib3 2.2.2 uvicorn 0.24.0.post1 validators 0.28.3 watchdog 4.0.1 wcwidth 0.2.13 webcolors 1.13 webencodings 0.5.1 websocket-client 1.7.0 websockets 12.0 Werkzeug 3.0.1 wheel 0.37.1 widgetsnbextension 4.0.9 xxhash 3.4.1 yapf 0.40.2 yarl 1.9.4 zipp 3.19.1
Who can help? / 谁可以帮助到您?
No response
Information / 问题信息
- [ ] The official example scripts / 官方的示例脚本
- [X] My own modified scripts / 我自己修改的脚本和任务
Reproduction / 复现过程
1. Run the fine-tuning script:

```
python GLM-4/finetune_demo/finetune.py dataset/ ZhipuAI/glm-4-9b-chat GLM-4/finetune_demo/configs/lora.yaml
```

2. The run fails while generating the train split:

```
Loading checkpoint shards: 100%|██████████| 10/10 [00:01<00:00, 5.25it/s]
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
trainable params: 2,785,280 || all params: 9,402,736,640 || trainable%: 0.0296
Generating train split: 0 examples [00:00, ? examples/s]
Failed to load JSON from file '/root/autodl-tmp/dataset/train.jsonl' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a name for object member. in row 648

Traceback (most recent call last):
  /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:153 in _generate_tables
    df = pd.read_json(f, dtype_backend="pyarrow")
  /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:815 in read_json
    return json_reader.read()
  /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1025 in read
    obj = self._get_object_parser(self.data)
  /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1051 in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1187 in parse
    self._parse()
  /root/miniconda3/lib/python3.10/site-packages/pandas/io/json/_json.py:1403 in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Trailing data
```
During handling of the above exception, another exception occurred:
```
Traceback (most recent call last):
  /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1997 in _prepare_split_single
    for _, table in generator:
  /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:156 in _generate_tables
    raise e
  /root/miniconda3/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py:130 in _generate_tables
    pa_table = paj.read_json(
        io.BytesIO(batch), read_options=paj.ReadOptions(
  pyarrow._json.read_json:308
  pyarrow.lib.pyarrow_internal_check_status:154
  pyarrow.lib.check_status:91
ArrowInvalid: JSON parse error: Missing a name for object member. in row 648
```
The above exception was the direct cause of the following exception:
```
Traceback (most recent call last):
  /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:406 in main
    data_manager = DataManager(data_dir, ft_config.data_config)
  /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:204 in __init__
    self._dataset_dct = _load_datasets(
  /root/autodl-tmp/GLM-4/finetune_demo/finetune.py:189 in _load_datasets
    dataset_dct = load_dataset(
  /root/miniconda3/lib/python3.10/site-packages/datasets/load.py:2616 in load_dataset
    builder_instance.download_and_prepare(
  /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1029 in download_and_prepare
    self._download_and_prepare(
  /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1124 in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:1884 in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  /root/miniconda3/lib/python3.10/site-packages/datasets/builder.py:2040 in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset
DatasetGenerationError: An error occurred while generating the dataset
```

3. With 648 samples in the dataset, LoRA fine-tuning works; after duplicating the last sample so the file has 649 lines, the error above is raised.
Expected behavior / 期待表现
I hope someone can help me resolve this problem.