[Bug] Aborted (core dumped) after completing single inference
Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
After each multi-image inference completes, the process aborts with a core dump. Streaming output is also unavailable.
Reproduction
```python
from lmdeploy import pipeline, TurbomindEngineConfig, GenerationConfig, VisionConfig
from lmdeploy.vl import load_image

import os
import json
import ast
import pandas as pd
from tqdm import tqdm

import setproctitle
setproctitle.setproctitle("TJ-InternVL_Multi_Images")

os.environ['CUDA_VISIBLE_DEVICES'] = '0'

model = '/data2/zhangxin/model_zoo/OpenGVLab/InternVL2-40B'
pipe = pipeline(model, backend_config=TurbomindEngineConfig(session_len=8192))

folder_path = "/data1/ouyangtianjian/ALL_DATA_MIXED_NEW"

for task in os.listdir(folder_path):
    if "Socioeconomic" not in task and "Retrieval" not in task:
        continue
    task_name = task.split("_formatted")[0]
    done_len = 0
    if os.path.exists(f"/data1/ouyangtianjian/ALL_DATA_MIXED_NEW/results_new_multi_images/InternVL2-40B/{task_name}_test_result.jsonl"):
        with open(f"/data1/ouyangtianjian/ALL_DATA_MIXED_NEW/results_new_multi_images/InternVL2-40B/{task_name}_test_result.jsonl", "r") as f:
            done_lines = f.readlines()
            done_len = len(done_lines)
    with open(f"/data1/ouyangtianjian/ALL_DATA_MIXED_NEW/{task_name}_formatted/{task_name}_data_test_1000.json", "r") as f:
        test_data_1000 = json.load(f)
    for cell in tqdm(test_data_1000[done_len:]):
        try:
            question = cell["conversations"][0]["value"].replace("<image>", "{IMAGE_TOKEN}")
            image_paths = cell["image"]
            images = [load_image(image_path) for image_path in image_paths]
            response = pipe((question, images), use_tqdm=False, gen_config=GenerationConfig(temperature=0.3))
            with open(f"/data1/ouyangtianjian/ALL_DATA_MIXED_NEW/results_new_multi_images/InternVL2-40B/{task_name}_test_result.jsonl", "a") as fout:
                fout.write(json.dumps({
                    "id": cell["sample_id"],
                    "true_label": cell["conversations"][1]["value"],
                    "response": response.text
                }) + "\n")
        except Exception as e:
            print("ERROR:", e)
```
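For clarity, the resume logic in the script above (skipping samples that were already written to the results JSONL before the previous crash) can be isolated into a small standard-library sketch. The function names here are mine, not part of the original script:

```python
import json
import os


def count_done(result_path: str) -> int:
    """Count how many result records (one JSON object per line) already exist,
    so a restarted run can resume from that offset."""
    if not os.path.exists(result_path):
        return 0
    with open(result_path, "r") as f:
        return sum(1 for _ in f)


def append_result(result_path: str, record: dict) -> None:
    """Append one result record as a single JSON line, immediately persisted
    so progress survives a core dump."""
    with open(result_path, "a") as fout:
        fout.write(json.dumps(record) + "\n")
```

Writing each record as soon as it is produced is what makes the auto-restart wrapper workable here: after every `Aborted (core dumped)`, the relaunched script slices `test_data_1000[done_len:]` and continues from the next sample.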
Environment
absl-py==2.1.0
accelerate==0.33.0
addict==2.4.0
aiofiles==24.1.0
aiohappyeyeballs==2.4.0
aiohttp==3.10.5
aiosignal==1.3.1
altair==5.4.1
annotated-types==0.7.0
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work
async-lru==2.0.4
async-timeout==4.0.3
attrs==24.2.0
autocommand==2.2.2
babel==2.16.0
backports.tarfile==1.2.0
beautifulsoup4==4.12.3
bitsandbytes==0.41.0
bleach==6.1.0
blinker==1.8.2
cachetools==5.5.0
certifi==2024.8.30
cffi==1.17.0
charset-normalizer==3.3.2
click==8.1.7
cmake==3.25.0
colorama==0.4.6
comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1710320294760/work
contourpy==1.3.0
cycler==0.12.1
debugpy @ file:///croot/debugpy_1690905042057/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
decord==0.6.0
deepspeed==0.13.5
defusedxml==0.7.1
einops==0.6.1
einops-exts==0.0.4
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1720869315914/work
executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1725214404607/work
fastapi==0.112.2
fastjsonschema==2.20.0
ffmpy==0.4.0
filelock==3.15.4
fire==0.6.0
flash_attn==2.3.6
fonttools==4.53.1
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.6.1
future==1.0.0
gdown==5.2.0
gitdb==4.0.11
GitPython==3.1.43
gradio==3.35.2
gradio_client==0.2.9
grpcio==1.66.1
h11==0.14.0
hjson==3.1.0
httpcore==1.0.5
httpx==0.27.2
huggingface-hub==0.24.6
idna==3.8
imageio==2.35.1
importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1724187233579/work
importlib_resources==6.4.4
inflect==7.3.1
ipdb==0.13.13
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1719845459717/work
ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1701831663892/work
ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1724334859652/work
isoduration==20.11.0
jaraco.context==5.3.0
jaraco.functools==4.0.1
jaraco.text==3.12.1
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work
Jinja2==3.1.4
joblib==1.4.2
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1716472197302/work
jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1710257447442/work
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1724331334887/work
kiwisolver==1.4.5
latex2mathml==3.77.0
linkify-it-py==2.0.3
lit==15.0.7
lmdeploy==0.5.3
Markdown==3.7
markdown-it-py==2.2.0
markdown2==2.5.0
MarkupSafe==2.1.5
matplotlib==3.9.2
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work
mdit-py-plugins==0.3.3
mdurl==0.1.2
mistune==3.0.2
mmcls==0.25.0
mmcv==2.2.0
mmcv-full==1.6.2
mmengine==0.10.5
mmengine-lite==0.10.4
mmsegmentation==0.30.0
model-index==0.1.11
more-itertools==10.3.0
mpmath==1.3.0
multidict==6.0.5
narwhals==1.6.0
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work
networkx==3.2.1
ninja==1.11.1.1
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
opencv-python==4.10.0.84
opencv-python-headless==4.10.0.84
opendatalab==0.0.10
openmim==0.3.9
openxlab==0.0.11
ordered-set==4.1.0
orjson==3.10.7
overrides==7.7.0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1718189413536/work
pandas==2.2.2
pandocfilters==1.5.1
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work
peft==0.11.1
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
pillow==10.4.0
platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1715777629804/work
prettytable==3.11.0
prometheus_client==0.20.0
prompt_toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1718047967974/work
protobuf==5.28.0
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1719274564771/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure_eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1721585709575/work
py-cpuinfo==9.0.0
pyarrow==17.0.0
pycocoevalcap==1.2
pycocotools==2.0.8
pycparser==2.22
pycryptodome==3.20.0
pydantic==2.8.2
pydantic_core==2.20.1
pydeck==0.9.1
pydub==0.25.1
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work
pynvml==11.5.3
pyparsing==3.1.4
PySocks==1.7.1
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
python-json-logger==2.0.7
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.2
pyzmq @ file:///croot/pyzmq_1705605076900/work
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.8.0
rpds-py==0.20.0
safetensors==0.4.4
scikit-learn==1.5.1
scipy==1.13.1
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.1.99
setproctitle==1.3.3
shortuuid==1.0.13
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.6
stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work
starlette==0.38.3
streamlit==1.38.0
streamlit-image-select==0.6.0
svgwrite==1.4.3
sympy==1.13.2
tabulate==0.9.0
tenacity==8.5.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6.2.2
termcolor==2.4.0
terminado==0.18.1
terminaltables==3.1.10
threadpoolctl==3.5.0
tiktoken==0.7.0
timm==0.9.12
tinycss2==1.3.0
tokenizers==0.15.1
toml==0.10.2
tomli==2.0.1
torch==2.3.1
torchaudio==2.3.1+cu121
torchvision==0.18.1
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1724955920300/work
tqdm==4.66.5
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work
transformers==4.37.2
triton==2.3.1
typeguard==4.3.0
types-python-dateutil==2.9.0.20240821
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1717802530399/work
tzdata==2024.1
uc-micro-py==1.0.3
uri-template==1.3.0
urllib3==2.2.2
uvicorn==0.30.6
watchdog==4.0.2
wavedrom==2.0.3.post3
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
websockets==13.0.1
Werkzeug==3.0.4
widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1724331337528/work
yacs==0.1.8
yapf==0.40.1
yarl==1.9.6
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1724730934107/work
Error traceback
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
0%|▏ | 1/721 [00:09<1:59:20, 9.95s/it]
Aborted (core dumped)
GPU 0 has no active processes or is corrupted. Restarting associated process.
Restarted process on GPU 0: bash -c 'source /usr/local/anaconda3/etc/profile.d/conda.sh && conda activate internvl && cd /data1/ouyangtianjian/InternVL && python /data1/ouyangtianjian/InternVL/InternVL_Multi_Images.py'
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
0%|▏ | 1/720 [00:04<57:50, 4.83s/it]
Aborted (core dumped)
GPU 0 has no active processes or is corrupted. Restarting associated process.
Restarted process on GPU 0: bash -c 'source /usr/local/anaconda3/etc/profile.d/conda.sh && conda activate internvl && cd /data1/ouyangtianjian/InternVL && python /data1/ouyangtianjian/InternVL/InternVL_Multi_Images.py'
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING] gemm_config.in is not found; using default GEMM algo
0%|▏ | 1/719 [00:09<1:50:01, 9.19s/it]
Aborted (core dumped)
GPU 0 has no active processes or is corrupted. Restarting associated process.
Restarted process on GPU 0: bash -c 'source /usr/local/anaconda3/etc/profile.d/conda.sh && conda activate internvl && cd /data1/ouyangtianjian/InternVL && python /data1/ouyangtianjian/InternVL/InternVL_Multi_Images.py'
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. No dtype was provided, you should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator.
^CTraceback (most recent call last):
File "/data1/ouyangtianjian/InternVL/InternVL_Multi_Images_AutoRecover.py", line 58, in <module>
monitor_gpus(gpu_processes)
File "/data1/ouyangtianjian/InternVL/InternVL_Multi_Images_AutoRecover.py", line 55, in monitor_gpus
time.sleep(check_interval)