
[BUG]: Chat train_sft.py SupervisedDataset: TypeError: __init__() got an unexpected keyword argument 'max_length'

Open mikeda100 opened this issue 2 years ago • 2 comments

Error Message:

ninja: no work to do.
Loading extension module fused_optim...
Traceback (most recent call last):
  File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 187, in <module>
    train(args)
  File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 107, in train
    train_dataset = SupervisedDataset(tokenizer=tokenizer,
TypeError: __init__() got an unexpected keyword argument 'max_length'
False
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
False
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1160 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1162 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1159) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_sft.py FAILED

Environment

I used the latest official Docker image.

NVIDIA-SMI 470.129.06    Driver Version: 470.129.06    CUDA Version: 11.4

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

Python 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] :: Anaconda, Inc. on linux

>>> torch.__version__
'1.12.1'

mikeda100 avatar Apr 07 '23 02:04 mikeda100

You can manually remove this parameter from the call. The parameter itself looks fine, but passing it is what triggers this error.

allendred avatar Apr 07 '23 06:04 allendred
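To illustrate why removing the argument works: Python raises this TypeError whenever a constructor receives a keyword it does not declare. The snippet below is a minimal stand-in, not the actual ColossalAI class; `data_path` and the signature are hypothetical, chosen only to mirror the mismatch between the installed `SupervisedDataset.__init__()` and the `max_length=` keyword that `train_sft.py` passes.

```python
# Minimal stand-in for the installed SupervisedDataset (hypothetical signature):
# its __init__ does not declare `max_length`, but train_sft.py passes it.
class SupervisedDataset:
    def __init__(self, tokenizer=None, data_path=None):  # no max_length here
        self.tokenizer = tokenizer
        self.data_path = data_path

try:
    # What the failing call in train_sft.py effectively does:
    ds = SupervisedDataset(tokenizer=None, data_path="data.json", max_length=512)
except TypeError as e:
    # e.g. "__init__() got an unexpected keyword argument 'max_length'"
    print(e)

# The workaround from the comment above: drop the keyword when calling.
ds = SupervisedDataset(tokenizer=None, data_path="data.json")
```

The real fix (landed later upstream) is to keep the script and the installed `coati` package at the same version so the call site and the constructor signature agree.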

Hi @mikeda100, could you share the command you ran when you hit this error?

Camille7777 avatar Apr 17 '23 07:04 Camille7777

I encountered this issue while running stage 1: train_sft.sh: line 1: /home/coati/bin/torchrun: Permission denied

SixGoodX avatar May 17 '23 10:05 SixGoodX
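A "Permission denied" on a launcher binary usually means its execute bit is missing (common after copying files into a container). A minimal sketch of the symptom and fix, using a throwaway script at a hypothetical path instead of the real `/home/coati/bin/torchrun`:

```shell
# Create a throwaway launcher script standing in for torchrun.
cat > /tmp/fake_torchrun <<'EOF'
#!/bin/sh
echo "launcher ok"
EOF

chmod -x /tmp/fake_torchrun          # simulate the missing execute bit
/tmp/fake_torchrun 2>/dev/null || echo "Permission denied (as in the report)"

chmod +x /tmp/fake_torchrun          # the fix: restore the execute bit
/tmp/fake_torchrun                   # now runs normally
```

For the real path, `chmod +x /home/coati/bin/torchrun` should do it; alternatively, `python -m torch.distributed.run` invokes the same entry point without going through the launcher script.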

We have updated our code. Try out the latest version. Please reopen this issue if the problem persists.

cwher avatar Jul 19 '23 10:07 cwher