ColossalAI

[BUG]: running examples in pipeline mode fails

imhuim982 opened this issue · 11 comments

🐛 Describe the bug

I tried examples/language/gpt/experiments/pipeline_parallel/run.sh and examples/language/gpt/titans/run.sh, but neither works. The error messages follow.

For pipeline_parallel/run.sh I got:

```
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 230, in sync_global_worker_rrefs
    self._initialize_partition()
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 185, in _initialize_partition
    self.module_partition: nn.Module = partition_fn(*partition_args).to(device)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 74, in partition
    module = create_partition_module(pp_rank, stage_num, model, data_kwargs)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 61, in create_partition_module
    graph = tracer.trace(root=model, meta_args=meta_args)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 397, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 739, in trace
    (self.create_arg(fn(*args)),),
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/model_zoo.py", line 29, in forward
    return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
  [... torch.fx module_call_wrapper / colossalai call_module / nn.Module._call_impl frames repeat for each traced submodule ...]
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
RuntimeError: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: [... the same message repeated many times ...] maximum recursion depth exceeded while calling a Python object
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 230, in sync_global_worker_rrefs
    self._initialize_partition()
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 185, in _initialize_partition
    self.module_partition: nn.Module = partition_fn(*partition_args).to(device)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 74, in partition
    module = create_partition_module(pp_rank, stage_num, model, data_kwargs)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 61, in create_partition_module
    graph = tracer.trace(root=model, meta_args=meta_args)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 397, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 739, in trace
    (self.create_arg(fn(*args)),),
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/model_zoo.py", line 29, in forward
    return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
  [... torch.fx module_call_wrapper / colossalai call_module / nn.Module._call_impl frames repeat for each traced submodule ...]
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
    attn_outputs = self.attn(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 310, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/pytorch_utils.py", line 115, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 705, in module_getattr_wrapper
    return self.getattr(attr, attr_val, parameter_proxy_cache)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 484, in getattr
    maybe_parameter_proxy = maybe_get_proxy_for_attr(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 478, in maybe_get_proxy_for_attr
    val_proxy = self.create_proxy("get_attr", n, (), {}, **kwargs)  # type: ignore[arg-type]
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 142, in create_proxy
    meta_out = self._meta_data_computing(
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 308, in _meta_data_computing
    raise RuntimeError(f"Could not compute metadata for {kind} target {target}: {e}")
RuntimeError: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: [... the same message repeated many times ...] maximum recursion depth exceeded while calling a Python object
```
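For clarity, this is a distilled version of what the failing call does: the example symbolically traces the HuggingFace GPT-2 model with ColossalAI's FX tracer and meta-tensor inputs, and the trace fails on the get_attr for c_attn.bias. A minimal sketch (assuming ColoTracer is importable from colossalai.fx as in the 0.2.x releases; the small config and sequence length are illustrative, not the example's actual settings):

```python
# Minimal repro sketch of the tracer failure, under the assumptions above.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

from colossalai.fx import ColoTracer  # assumed import path for 0.2.x

# A tiny GPT-2 is enough to exercise the attention block being traced.
model = GPT2LMHeadModel(GPT2Config(n_layer=2))
tracer = ColoTracer()

# Meta tensors stand in for real inputs during symbolic tracing.
meta_args = {
    "input_ids": torch.zeros(1, 64, dtype=torch.int64, device="meta"),
    "attention_mask": torch.zeros(1, 64, dtype=torch.int64, device="meta"),
}

# With torch 1.13 + transformers 4.26.1 this reportedly raises:
#   RuntimeError: Could not compute metadata for get_attr target ...c_attn.bias
graph = tracer.trace(root=model, meta_args=meta_args)
```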

For titans/run.sh:

```
sh run.sh
/root/miniconda3/lib/python3.8/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:219
  (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
  self.m.impl(name, dispatch_key, fn)
/bin/bash: line 0: fg: no job control
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
```

```
Command: 'cd /ossfs/workspace/ColossalAi/examples/language/gpt/titans && export [several hundred environment variables elided: conda toolchain settings, cluster/pod metadata, credentials, plus the training settings GPUNUM="2" TPDEGREE="2" BATCH_SIZE="32" DATA="/data/scratch/gpt_data/small-gpt-dataset.json"] && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset'
```

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes ===== 127.0.0.1: failure

====== Stopping All Nodes ===== 127.0.0.1: finish

Could you help with these issues?

Environment

2 × V100 32G
PyTorch 1.13 + cu117
Python 3.8.13

imhuim982 avatar Feb 28 '23 06:02 imhuim982

May I know your transformers version?

Wesley-Jzy avatar Mar 03 '23 10:03 Wesley-Jzy

transformers==4.26.1

imhuim982 avatar Mar 03 '23 13:03 imhuim982

Would you mind trying another version of transformers, below 4.25.1 (e.g. pip install "transformers<4.25.1")? I guess the experimental feature hasn't been updated for the new operators introduced in 4.25.1.
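(A quick way to confirm which versions the run will actually pick up — a minimal sketch, nothing ColossalAI-specific:)

```python
# Print the versions visible to the Python that launches the example.
import torch
import transformers

print("torch:", torch.__version__)                # issue reported on 1.13
print("transformers:", transformers.__version__)  # suggestion above: < 4.25.1
```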

Wesley-Jzy avatar Mar 07 '23 00:03 Wesley-Jzy

The titan pipeline takes too long to start. On a single machine with 8 × 3090 GPUs, pp=4, tp=2, it takes more than 10 minutes before training starts; for a 30B model on A100 80G it takes more than 30 minutes, while Megatron starts quickly.

joan126 avatar Mar 07 '23 09:03 joan126


> The titan pipeline takes too long to start. On a single machine with 8 × 3090 GPUs, pp=4, tp=2, it takes more than 10 minutes before training starts; for a 30B model on A100 80G it takes more than 30 minutes, while Megatron starts quickly.

Hi @joan126 Could you open a new issue and provide details so that we can reproduce the problem? Thanks. https://github.com/hpcaitech/ColossalAI/issues/new/choose

binmakeswell avatar Mar 07 '23 09:03 binmakeswell


This experimental feature works well with PyTorch 1.12.1; we will improve its generality in the future.

YuliangLiu0306 avatar Mar 15 '23 07:03 YuliangLiu0306

I got this error too when using PyTorch 1.13.0 and ColossalAI 0.2.8, with transformers 4.28.1 and 4.24.0.

wenqf11 avatar Apr 19 '23 07:04 wenqf11

> I got this error too when using PyTorch 1.13.0 and ColossalAI 0.2.8, with transformers 4.28.1 and 4.24.0.

How about PyTorch 1.12.1?

binmakeswell avatar Apr 19 '23 08:04 binmakeswell

@binmakeswell it works fine with the official Docker image colossalai:0.2.7, which uses PyTorch 1.12.1.

wenqf11 avatar Apr 19 '23 12:04 wenqf11

Glad to hear it was resolved. Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell