ColossalAI

[BUG]: running examples in pipeline mode fails

imhuim982 opened this issue · 11 comments

🐛 Describe the bug

I tried examples/language/gpt/experiments/pipeline_parallel/run.sh and examples/language/gpt/titans/run.sh, but neither works. The error messages follow.

For pipeline_parallel/run.sh I got:

```
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 230, in sync_global_worker_rrefs
    self._initialize_partition()
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 185, in _initialize_partition
    self.module_partition: nn.Module = partition_fn(*partition_args).to(device)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 74, in partition
    module = create_partition_module(pp_rank, stage_num, model, data_kwargs)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 61, in create_partition_module
    graph = tracer.trace(root=model, meta_args=meta_args)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 397, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 739, in trace
    (self.create_arg(fn(*args)),),
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/model_zoo.py", line 29, in forward
    return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
  [... torch.fx module_call_wrapper / colossalai call_module / nn.Module._call_impl frames repeat for each traced submodule ...]
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
RuntimeError: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: [... the same message repeated many times ...] maximum recursion depth exceeded while calling a Python object
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/rpc/rref_proxy.py", line 11, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 230, in sync_global_worker_rrefs
    self._initialize_partition()
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/pipeline/rpc/_pipeline_base.py", line 185, in _initialize_partition
    self.module_partition: nn.Module = partition_fn(*partition_args).to(device)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 74, in partition
    module = create_partition_module(pp_rank, stage_num, model, data_kwargs)
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/train_gpt_pp.py", line 61, in create_partition_module
    graph = tracer.trace(root=model, meta_args=meta_args)
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 397, in trace
    self.graph = super().trace(root, concrete_args=concrete_args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 739, in trace
    (self.create_arg(fn(*args)),),
  File "/ossfs/workspace/ColossalAi/examples/language/gpt/experiments/pipeline_parallel/model_zoo.py", line 29, in forward
    return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]
  [... torch.fx module_call_wrapper / colossalai call_module / nn.Module._call_impl frames repeat for each traced submodule ...]
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
    attn_outputs = self.attn(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 310, in forward
    query, key, value = self.c_attn(hidden_states).split(self.split_size, dim=2)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/pytorch_utils.py", line 115, in forward
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 705, in module_getattr_wrapper
    return self.getattr(attr, attr_val, parameter_proxy_cache)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 484, in getattr
    maybe_parameter_proxy = maybe_get_proxy_for_attr(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/fx/_symbolic_trace.py", line 478, in maybe_get_proxy_for_attr
    val_proxy = self.create_proxy("get_attr", n, (), {}, **kwargs)  # type: ignore[arg-type]
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 142, in create_proxy
    meta_out = self._meta_data_computing(
  File "/root/miniconda3/lib/python3.8/site-packages/colossalai/fx/tracer/tracer.py", line 308, in _meta_data_computing
    raise RuntimeError(f"Could not compute metadata for {kind} target {target}: {e}")
RuntimeError: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: Could not compute metadata for get_attr target model.transformer.h.0.attn.c_attn.bias: [... the same message repeated many times ...] maximum recursion depth exceeded while calling a Python object
```
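For clarity, this is a distilled version of what the failing call does: the example symbolically traces the HuggingFace GPT-2 model with ColossalAI's FX tracer and meta-tensor inputs, and the trace fails on the get_attr for c_attn.bias. A minimal sketch (assuming ColoTracer is importable from colossalai.fx as in the 0.2.x releases; the small config and sequence length are illustrative, not the example's actual settings):

```python
# Minimal repro sketch of the tracer failure, under the assumptions above.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

from colossalai.fx import ColoTracer  # assumed import path for 0.2.x

# A tiny GPT-2 is enough to exercise the attention block being traced.
model = GPT2LMHeadModel(GPT2Config(n_layer=2))
tracer = ColoTracer()

# Meta tensors stand in for real inputs during symbolic tracing.
meta_args = {
    "input_ids": torch.zeros(1, 64, dtype=torch.int64, device="meta"),
    "attention_mask": torch.zeros(1, 64, dtype=torch.int64, device="meta"),
}

# With torch 1.13 + transformers 4.26.1 this reportedly raises:
#   RuntimeError: Could not compute metadata for get_attr target ...c_attn.bias
graph = tracer.trace(root=model, meta_args=meta_args)
```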

For titans/run.sh:

```
sh run.sh
/root/miniconda3/lib/python3.8/site-packages/torch/library.py:130: UserWarning: Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::index.Tensor(Tensor self, Tensor?[] indices) -> Tensor
    registered at aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: Meta
  previous kernel: registered at ../aten/src/ATen/functorch/BatchRulesScatterOps.cpp:1053
       new kernel: registered at /dev/null:219
  (Triggered internally at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:150.)
  self.m.impl(name, dispatch_key, fn)
/bin/bash: line 0: fg: no job control
Error: failed to run torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset on 127.0.0.1, is localhost: True, exception: Encountered a bad command exit code!
```

```
Command: 'cd /ossfs/workspace/ColossalAi/examples/language/gpt/titans && export [several hundred environment variables elided: conda toolchain settings, cluster/pod metadata, credentials, plus the training settings GPUNUM="2" TPDEGREE="2" BATCH_SIZE="32" DATA="/data/scratch/gpt_data/small-gpt-dataset.json"] && torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=127.0.0.1:29500 --rdzv_id=colossalai-default-job train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch --use_dummy_dataset'
```

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes ===== 127.0.0.1: failure

====== Stopping All Nodes ===== 127.0.0.1: finish

Could you help with these issues?

Environment

2 × V100 32G
PyTorch 1.13 + cu117
Python 3.8.13

imhuim982 avatar Feb 28 '23 06:02 imhuim982

May I know your transformers version?

Wesley-Jzy avatar Mar 03 '23 10:03 Wesley-Jzy

transformers==4.26.1

imhuim982 avatar Mar 03 '23 13:03 imhuim982

Would you mind trying another version of transformers, below 4.25.1 (e.g. pip install "transformers<4.25.1")? I guess the experimental feature hasn't been updated for the new operators introduced in 4.25.1.
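(A quick way to confirm which versions the run will actually pick up — a minimal sketch, nothing ColossalAI-specific:)

```python
# Print the versions visible to the Python that launches the example.
import torch
import transformers

print("torch:", torch.__version__)                # issue reported on 1.13
print("transformers:", transformers.__version__)  # suggestion above: < 4.25.1
```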

Wesley-Jzy avatar Mar 07 '23 00:03 Wesley-Jzy

The titan pipeline takes too long to start. On a single machine with 8 × 3090 GPUs, pp=4, tp=2, it takes more than 10 minutes before training starts; for a 30B model on A100 80G it takes more than 30 minutes, while Megatron starts quickly.

joan126 avatar Mar 07 '23 09:03 joan126


> The titan pipeline takes too long to start. On a single machine with 8 × 3090 GPUs, pp=4, tp=2, it takes more than 10 minutes before training starts; for a 30B model on A100 80G it takes more than 30 minutes, while Megatron starts quickly.

Hi @joan126 Could you open a new issue and provide details so that we can reproduce the problem? Thanks. https://github.com/hpcaitech/ColossalAI/issues/new/choose

binmakeswell avatar Mar 07 '23 09:03 binmakeswell


This experimental feature works well with PyTorch 1.12.1; we will improve its generality in the future.

YuliangLiu0306 avatar Mar 15 '23 07:03 YuliangLiu0306

I got this error too when using PyTorch 1.13.0 and ColossalAI 0.2.8, with transformers 4.28.1 and 4.24.0.

wenqf11 avatar Apr 19 '23 07:04 wenqf11

> I got this error too when using PyTorch 1.13.0 and ColossalAI 0.2.8, with transformers 4.28.1 and 4.24.0.

How about PyTorch 1.12.1?

binmakeswell avatar Apr 19 '23 08:04 binmakeswell

@binmakeswell it works fine with the official Docker image colossalai:0.2.7, which uses PyTorch 1.12.1.

wenqf11 avatar Apr 19 '23 12:04 wenqf11

Glad to hear it was resolved. Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell