ColossalAI
[BUG]: Chat train_sft.py SupervisedDataset: TypeError: __init__() got an unexpected keyword argument 'max_length'
Error Message:
ninja: no work to do.
Loading extension module fused_optim...
Traceback (most recent call last):
File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 187, in
train(args)
File "/workspace/ColossalAI/applications/Chat/examples/train_sft.py", line 107, in train
train_dataset = SupervisedDataset(tokenizer=tokenizer,
TypeError: __init__() got an unexpected keyword argument 'max_length'
False
Emitting ninja build file /root/.cache/colossalai/torch_extensions/torch1.12_cu11.3/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
False
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1160 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1162 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1159) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_sft.py FAILED
Environment
I used the latest official Docker image: NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
Python 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0] :: Anaconda, Inc. on linux
torch.__version__ '1.12.1'
You can manually remove this parameter. It looks valid, but passing it is what triggers this error.
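As a generic workaround (a sketch, not ColossalAI's actual API: the stand-in `SupervisedDataset` and the helper `filter_supported_kwargs` below are illustrative), you can drop any keyword argument that the installed class's `__init__` does not accept before constructing it:

```python
import inspect


def filter_supported_kwargs(cls, **kwargs):
    """Drop keyword arguments that cls.__init__ does not accept."""
    params = inspect.signature(cls.__init__).parameters
    # If __init__ accepts **kwargs, everything is supported.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return kwargs
    return {k: v for k, v in kwargs.items() if k in params}


class SupervisedDataset:
    # Stand-in for an older class whose __init__ lacks max_length.
    def __init__(self, tokenizer=None, data_path=None):
        self.tokenizer = tokenizer
        self.data_path = data_path


kwargs = filter_supported_kwargs(SupervisedDataset, tokenizer="tok",
                                 data_path="data.json", max_length=512)
ds = SupervisedDataset(**kwargs)  # no TypeError: max_length was dropped
```

This sidesteps the mismatch between the example script and an older installed copy of the dataset class; updating ColossalAI so the script and library match is the proper fix.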
hi @mikeda100 could you share your run command when facing this error?
I encountered this issue while running the stage 1: train_sft.sh: line 1: /home/coati/bin/torchrun: Permission denied
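That "Permission denied" usually means the `torchrun` entry script lost its execute bit (the path below is the one from the report; whether chmod is the right fix for your setup is an assumption). Demonstrated here on a temporary file so the commands are safe to try:

```shell
# Real fix would be (path from the report, run only if it applies to you):
#   chmod +x /home/coati/bin/torchrun
# Alternatively, bypass the entry script entirely:
#   python -m torch.distributed.run train_sft.py ...
# Demonstration of restoring an execute bit on a scratch file:
tmp=$(mktemp)
chmod +x "$tmp"
test -x "$tmp" && echo "executable"
rm -f "$tmp"
```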
We have updated our code. Try out the latest version. Please reopen this issue if the problem persists.