
SIGABRT - Fatal Python error: Aborted when running vllm on llama2-7b with --tensor-parallel-size 2

Open dhritiman opened this issue 2 years ago • 6 comments

In my setup, vLLM works fine when running llama2-7b on 1 GPU, but it hits a fatal error every time I run it with multiple GPUs. Sharing the traces below. This is persistent: there is not a single instance where I have been able to run vLLM with multiple GPUs. Can you please share thoughts on what the issue could be and how to debug it?

-- Environment --
CentOS 7.9
CUDA 11.8, V11.8.89
NVIDIA driver 530.30.2
A100 host with 8 GPU cards
Python 3.10
vLLM 0.1.3
/dev/shm 60G
ulimit -u 30000

--- Error Trace 1 ---

python3.10 -m vllm.entrypoints.api_server --model "models/llama-2-7b-hf" --swap-space 1 --disable-log-requests --disable-log-stats --tensor-parallel-size 2

2023-08-25 08:22:59,206 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-25 08:23:00 llm_engine.py:70] Initializing an LLM engine with config: model='models/llama-2-7b-hf', tokenizer='models/llama-2-7b-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 08-25 08:23:00 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
[2023-08-25 08:23:04,248 E 32338 1301] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2023-08-25 08:23:04,279 E 32338 1301] logging.cc:104: Stack trace:
/site-packages/ray/_raylet.so(+0xdaf01a) [0x7f732773d01a] ray::operator<<()
/site-packages/ray/_raylet.so(+0xdb17d8) [0x7f732773f7d8] ray::TerminateHandler()
/lib64/libstdc++.so.6(+0x5ea06) [0x7f7397676a06]
/lib64/libstdc++.so.6(+0x5ea33) [0x7f7397676a33]
/lib64/libstdc++.so.6(+0x5ec53) [0x7f7397676c53]
/site-packages/ray/_raylet.so(+0x4a3ef4) [0x7f7326e31ef4] boost::throw_exception<>()
/site-packages/ray/_raylet.so(+0xdc33db) [0x7f73277513db] boost::asio::detail::do_throw_error()
/site-packages/ray/_raylet.so(+0xdc3dfb) [0x7f7327751dfb] boost::asio::detail::posix_thread::start_thread()
/site-packages/ray/_raylet.so(+0xdc425c) [0x7f732775225c] boost::asio::thread_pool::thread_pool()
/site-packages/ray/_raylet.so(+0x8c5b74) [0x7f7327253b74] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f7327253c09] ray::rpc::GetServerCallExecutor()
/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyEE17HandleRequestImplEvEUlS1_S4_S4_E_E9_M_invokeERKSt9_Any_dataOS1_OS4_SI+0xe2) [0x7f7326fe5172] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x8d1) [0x7f732701abf1] ray::core::CoreWorker::HandleGetCoreWorkerStats()
/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc14ServerCallImplINS2_24CoreWorkerServiceHandlerENS2_25GetCoreWorkerStatsRequestENS2_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E9_M_invokeERKSt9_Any_data+0x116) [0x7f7327011636] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(+0x9683b6) [0x7f73272f63b6] EventTracker::RecordExecution()
/site-packages/ray/_raylet.so(+0x90580e) [0x7f732729380e] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(+0x905d66) [0x7f7327293d66] boost::asio::detail::completion_handler<>::do_complete()
/site-packages/ray/_raylet.so(+0xdc0b6b) [0x7f732774eb6b] boost::asio::detail::scheduler::do_run_one()
/site-packages/ray/_raylet.so(+0xdc2639) [0x7f7327750639] boost::asio::detail::scheduler::run()
/site-packages/ray/_raylet.so(+0xdc2af2) [0x7f7327750af2] boost::asio::io_context::run()
/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f732702745d] ray::core::CoreWorker::RunIOService()
/site-packages/ray/_raylet.so(+0xee9270) [0x7f7327877270] execute_native_thread_routine
/lib64/libpthread.so.0(+0x7ea5) [0x7f73d59e1ea5] start_thread
/lib64/libc.so.6(clone+0x6d) [0x7f73d5001b0d] clone

*** SIGABRT received at time=1692951784 on cpu 44 ***
PC: @ 0x7f73d4f39387 (unknown) raise
    @ 0x7f73d59e9630 3520 (unknown)
    @ 0x7f7397676a06 1791331480 (unknown)
    @ 0x7f73279b1640 1875585256 (unknown)
    @ 0x7f7397677fb0 (unknown) (unknown)
    @ 0x3de907894810c083 (unknown) (unknown)
[2023-08-25 08:23:04,282 E 32338 1301] logging.cc:361: *** SIGABRT received at time=1692951784 on cpu 44 ***
[2023-08-25 08:23:04,282 E 32338 1301] logging.cc:361: PC: @ 0x7f73d4f39387 (unknown) raise
[2023-08-25 08:23:04,284 E 32338 1301] logging.cc:361:     @ 0x7f73d59e9630 3520 (unknown)
[2023-08-25 08:23:04,284 E 32338 1301] logging.cc:361:     @ 0x7f7397676a06 1791331480 (unknown)
[2023-08-25 08:23:04,286 E 32338 1301] logging.cc:361:     @ 0x7f73279b1640 1875585256 (unknown)
[2023-08-25 08:23:04,286 E 32338 1301] logging.cc:361:     @ 0x7f7397677fb0 (unknown) (unknown)
[2023-08-25 08:23:04,287 E 32338 1301] logging.cc:361:     @ 0x3de907894810c083 (unknown) (unknown)
Fatal Python error: Aborted

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, simplejson._speedups, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, setproctitle, grpc._cython.cygrpc, ray._raylet, pvectorc, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, 
pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, _cffi_backend, frozenlist._frozenlist, pyarrow._json, PIL._imaging (total: 111) Aborted

--- Error Trace 2 ---

python3.10 -m vllm.entrypoints.api_server --model "models/llama-2-7b-hf" --swap-space 1 --disable-log-requests --disable-log-stats --tensor-parallel-size 2

2023-08-25 00:42:35,237 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-25 00:42:36 llm_engine.py:70] Initializing an LLM engine with config: model='models/llama-2-7b-hf', tokenizer='models/llama-2-7b-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 08-25 00:42:36 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(pid=5623) [2023-08-25 00:42:40,265 E 5623 7733] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(pid=5623) [2023-08-25 00:42:40,287 E 5623 7733] logging.cc:104: Stack trace:
(pid=5623) /site-packages/ray/_raylet.so(+0xdaf01a) [0x7f2ebfbd801a] ray::operator<<()
(pid=5623) /site-packages/ray/_raylet.so(+0xdb17d8) [0x7f2ebfbda7d8] ray::TerminateHandler()
(pid=5623) /lib64/libstdc++.so.6(+0x5ea06) [0x7f2ebe977a06]
(pid=5623) /lib64/libstdc++.so.6(+0x5ea33) [0x7f2ebe977a33]
(pid=5623) /lib64/libstdc++.so.6(+0x5ec53) [0x7f2ebe977c53]
(pid=5623) /site-packages/ray/_raylet.so(+0x4a3ef4) [0x7f2ebf2ccef4] boost::throw_exception<>()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc33db) [0x7f2ebfbec3db] boost::asio::detail::do_throw_error()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc3dfb) [0x7f2ebfbecdfb] boost::asio::detail::posix_thread::start_thread()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc425c) [0x7f2ebfbed25c] boost::asio::thread_pool::thread_pool()
(pid=5623) /site-packages/ray/_raylet.so(+0x8c5b74) [0x7f2ebf6eeb74] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f2ebf6eec09] ray::rpc::GetServerCallExecutor()
(pid=5623) /site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyEE17HandleRequestImplEvEUlS1_S4_S4_E_E9_M_invokeERKSt9_Any_dataOS1_OS4_SI+0xe2) [0x7f2ebf480172] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x8d1) [0x7f2ebf4b5bf1] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(pid=5623) /site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc14ServerCallImplINS2_24CoreWorkerServiceHandlerENS2_25GetCoreWorkerStatsRequestENS2_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E9_M_invokeERKSt9_Any_data+0x116) [0x7f2ebf4ac636] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(+0x9683b6) [0x7f2ebf7913b6] EventTracker::RecordExecution()
(pid=5623) /site-packages/ray/_raylet.so(+0x90580e) [0x7f2ebf72e80e] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(+0x905d66) [0x7f2ebf72ed66] boost::asio::detail::completion_handler<>::do_complete()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc0b6b) [0x7f2ebfbe9b6b] boost::asio::detail::scheduler::do_run_one()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc2639) [0x7f2ebfbeb639] boost::asio::detail::scheduler::run()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc2af2) [0x7f2ebfbebaf2] boost::asio::io_context::run()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f2ebf4c245d] ray::core::CoreWorker::RunIOService()
(pid=5623) /site-packages/ray/_raylet.so(+0xee9270) [0x7f2ebfd12270] execute_native_thread_routine
(pid=5623) /lib64/libpthread.so.0(+0x7ea5) [0x7f2ec8afaea5] start_thread
(pid=5623) /lib64/libc.so.6(clone+0x6d) [0x7f2ec811ab0d] clone
(pid=5623)
(pid=5623) *** SIGABRT received at time=1692924160 on cpu 59 ***
(pid=5623) PC: @ 0x7f2ec8052387 (unknown) raise
(pid=5623)     @ 0x7f2ec8b02630 3520 (unknown)
(pid=5623)     @ 0x7f2ebe977a06 286194840 (unknown)
(pid=5623)     @ 0x7f2ebfe4c640 (unknown) (unknown)
(pid=5623)     @ 0x7f2ebe978fb0 (unknown) (unknown)
(pid=5623)     @ 0x3de907894810c083 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: *** SIGABRT received at time=1692924160 on cpu 59 ***
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: PC: @ 0x7f2ec8052387 (unknown) raise
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361:     @ 0x7f2ec8b02630 3520 (unknown)
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361:     @ 0x7f2ebe977a06 286194840 (unknown)
(pid=5623) [2023-08-25 00:42:40,290 E 5623 7733] logging.cc:361:     @ 0x7f2ebfe4c640 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,290 E 5623 7733] logging.cc:361:     @ 0x7f2ebe978fb0 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,291 E 5623 7733] logging.cc:361:     @ 0x3de907894810c083 (unknown) (unknown)
(pid=5623) Fatal Python error: Aborted
(pid=5623) Extension modules: msgpack._cmsgpack, setproctitle, psutil._psutil_linux, psutil._psutil_posix, yaml._yaml, grpc._cython.cygrpc, ray._raylet, pvectorc, simplejson._speedups (total: 9)
(pid=5643) E0825 00:42:41.394104305 8973 thd.cc:157] pthread_create failed: Resource temporarily unavailable
Exception in thread ray_print_logs:
Traceback (most recent call last):
  File "python/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "python/3.10/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/site-packages/ray/_private/worker.py", line 885, in print_logs
    data = subscriber.poll()
  File "/site-packages/ray/_private/gcs_pubsub.py", line 335, in poll
    self._poll_locked(timeout=timeout)
  File "/site-packages/ray/_private/gcs_pubsub.py", line 208, in _poll_locked
    fut = self._stub.GcsSubscriberPoll.future(
  File "/site-packages/grpc/_channel.py", line 1060, in future
    call = self._managed_call(
  File "/site-packages/grpc/_channel.py", line 1443, in create
    _run_channel_spin_thread(state)
  File "/site-packages/grpc/_channel.py", line 1404, in _run_channel_spin_thread
    channel_spin_thread.start()
  File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 120, in grpc._cython.cygrpc.ForkManagedThread.start
  File "python/3.10/lib/python3.10/threading.py", line 928, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
E0825 00:42:41.832129666 19378 thd.cc:157] pthread_create failed: Resource temporarily unavailable
Traceback (most recent call last):
  File "python/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "python/3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/site-packages/vllm/entrypoints/api_server.py", line 78, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/site-packages/vllm/engine/async_llm_engine.py", line 232, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/site-packages/vllm/engine/async_llm_engine.py", line 55, in __init__
    self.engine = engine_class(*args, **kwargs)
  File "/site-packages/vllm/engine/llm_engine.py", line 99, in __init__
    self._init_workers_ray(placement_group)
  File "/site-packages/vllm/engine/llm_engine.py", line 170, in _init_workers_ray
    self._run_workers(
  File "/site-packages/vllm/engine/llm_engine.py", line 474, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/site-packages/ray/_private/worker.py", line 2540, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorker.execute_method() (pid=19512, ip=10.123.87.13, actor_id=a56bb9a6f4852bf5f1d4123d01000000, repr=<vllm.engine.ray_utils.RayWorker object at 0x7ef9f09317b0>)
  File "/site-packages/vllm/engine/ray_utils.py", line 25, in execute_method
    return executor(*args, **kwargs)
  File "/site-packages/vllm/worker/worker.py", line 62, in init_model
    _init_distributed_environment(self.parallel_config, self.rank,
  File "/site-packages/vllm/worker/worker.py", line 329, in _init_distributed_environment
    torch.distributed.all_reduce(torch.zeros(1).cuda())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
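For anyone triaging this: the `thread: Resource temporarily unavailable [system:11]` and `pthread_create failed` lines above are EAGAIN from `pthread_create`, i.e. the process ran out of thread/process headroom. A quick sketch (standard Linux paths; the values will differ per host) to see the relevant ceilings:

```shell
# On Linux, every thread counts against the "max user processes" limit,
# so Ray's thread pools can exhaust it long before pid_max is reached.
ulimit -u                          # max user processes (soft limit)
ulimit -n                          # max open files (soft limit)
cat /proc/sys/kernel/threads-max   # system-wide thread ceiling
cat /proc/sys/kernel/pid_max       # system-wide PID ceiling
```

Even with `ulimit -u 30000` in the shell, a lower system-wide or per-user ceiling elsewhere can still be the binding constraint.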

Thanks

dhritiman avatar Aug 25 '23 20:08 dhritiman

I just re-ran this with the latest vLLM 0.1.4 wheels for Python 3.10 and it failed with the same error. I would highly appreciate any input here.

dhritiman avatar Aug 26 '23 00:08 dhritiman

Hi @dhritiman, thanks for trying out vLLM. Could you try --tensor_parallel_size 1 and see if it works?

WoosukKwon avatar Aug 31 '23 08:08 WoosukKwon

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

lonngxiang avatar Sep 02 '23 01:09 lonngxiang

Getting the same error when running Yarn-llama2-70B-32k on vLLM 0.2.4

CUDA 11.8, V11.8.89
NVIDIA driver 515.105.01
A100 host with 8 GPU cards
Python 3.9
vLLM 0.2.4

islam-nassar avatar Dec 19 '23 21:12 islam-nassar

Getting the same error while trying to serve Mixtral-8x7B-Instruct-v0.1 on vLLM 0.2.6 with --tensor_parallel_size 2

CUDA 12.2
NVIDIA driver 535.104.12
A100 host with 8 GPU cards
Python 3.11.5
vLLM 0.2.6

data-panda avatar Dec 25 '23 03:12 data-panda

Any resolution on this?

Gaurav141199 avatar Jan 04 '24 15:01 Gaurav141199

Getting the same error while trying to serve Mixtral-8x7B-Instruct-v0.1 on vLLM 0.2.7 with --tensor_parallel_size 8 in a Kubernetes deployment

-- Environment --
CUDA version: 12.2
Driver version: 535.129.03
Kubernetes pod running on an A10G host with 8 GPU cards
Python 3.10.13
vLLM 0.2.7

--- Error Trace --- /usr/local/bin/python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --revision 125c431e2ff41a156b9f9076f744d2f35dd6e67a --max-model-len 8191 --download-dir /data --tensor-parallel-size 8

INFO 01-10 14:36:51 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision='125c431e2ff41a156b9f9076f744d2f35dd6e67a', tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/data', load_format='auto', dtype='auto', max_model_len=8191, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|██████████| 720/720 [00:00<00:00, 5.99MB/s]
2024-01-10 14:36:53,774 INFO worker.py:1724 -- Started a local Ray instance.
E0110 14:36:57.034505970 73 thd.cc:157] pthread_create failed: Resource temporarily unavailable
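Worth noting for the Kubernetes case: inside a pod the binding limit is often the cgroup pids controller, which neither `ulimit` nor `/etc/security/limits.conf` governs. A sketch to inspect it from inside the container; the paths below are the usual cgroup v2/v1 locations and may differ on a given cluster:

```shell
# cgroup v2 (unified hierarchy); "max" means unlimited:
cat /sys/fs/cgroup/pids.max 2>/dev/null || true
# cgroup v1 pids controller:
cat /sys/fs/cgroup/pids/pids.max 2>/dev/null || true
# The ordinary per-process limit still applies on top:
ulimit -u
```

If `pids.max` is a small number, raising it means changing the pod spec or node-level pids limit, not the in-container ulimit.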

smghasempour avatar Jan 10 '24 15:01 smghasempour

We also see this on a separate Kubernetes cluster, but can't reproduce it on our own Kubernetes setup on the same 4×A100 system.

pseudotensor avatar Feb 10 '24 19:02 pseudotensor

Has anyone solved this problem?

JinpilChoi avatar Feb 23 '24 09:02 JinpilChoi

same problem, SOS

panxnan avatar Mar 20 '24 08:03 panxnan

Same problem; any update on the solution here, @dhritiman?

suryanshbhar avatar Mar 27 '24 07:03 suryanshbhar

I had the same error:

logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]

After making some changes to /etc/security/limits.conf (open file count, user process count, and so on), the error is gone.

Before modification: (screenshot of the old limits)

After modification: (screenshot of the new limits)

Hope this's helpful

hejxiang avatar Apr 12 '24 11:04 hejxiang

I had the same error:

logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]

After making some changes to /etc/security/limits.conf (open file count, user process count, and so on), the error is gone.

Before modification: (screenshot of the old limits)

After modification: (screenshot of the new limits)

Hope this's helpful

Hi @hejxiang, I used the ulimit command to change the limits, but it still didn't work. Is there any difference between using ulimit and modifying /etc/security/limits.conf to change the limit values? Can you show how you modified limits.conf?

guangzlu avatar May 22 '24 02:05 guangzlu

Hi @hejxiang, I used the ulimit command to change the limits, but it still didn't work. Is there any difference between using ulimit and modifying /etc/security/limits.conf to change the limit values? Can you show how you modified limits.conf?

@guangzlu Sorry for the late reply. Changes made by ulimit apply only to the current process (the current shell). Ray creates multiple processes outside the current shell, so the change needs to be permanent, which means editing /etc/security/limits.conf. Just edit the config file with vim or another editor, follow the comment instructions in the file, and add what you want at the end, e.g. to raise the max open file count and the max process count to 65535 for all users (you can pick the specific values yourself):

* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535

Then open a new shell and verify with ulimit -a.
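To confirm the new values actually reached the serving process (and not just an interactive shell), you can also read the limits of the live process from /proc; `$$` below is just the current shell standing in for the api_server's PID:

```shell
# Replace $$ with the PID of the running vLLM api_server process:
grep -E '^Max (processes|open files)' /proc/$$/limits
```

If those lines still show the old values, the process was started before the limits change (or under a service/container that sets its own limits).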

hejxiang avatar May 24 '24 10:05 hejxiang


Thank you very much! It is very detailed and helpful!

guangzlu avatar May 27 '24 02:05 guangzlu