SIGABRT - Fatal Python error: Aborted when running vllm on llama2-7b with --tensor-parallel-size 2
In my setup, vLLM runs llama2-7b fine on a single GPU, but it hits a fatal error every time I run it with multiple GPUs. Traces are below. The failure is fully reproducible: there has not been a single instance where I was able to run vLLM with multiple GPUs. Could you please share your thoughts on what the issue might be and how to debug it?
-- Environment --
CentOS 7.9
CUDA 11.8, V11.8.89
NVIDIA driver 530.30.2
A100 host with 8 GPU cards
Python 3.10
vLLM 0.1.3
/dev/shm 60G
ulimit -u 30000
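As a first sanity check (a minimal sketch, not part of the original report), the limits Ray's worker threads run into can be read from Python's standard `resource` module. `RLIMIT_NPROC` is what `ulimit -u` reports, and on Linux it also caps threads, since each thread counts as a schedulable task:

```python
import resource

# Snapshot the per-process resource limits relevant to this crash.
# RLIMIT_NPROC is the `ulimit -u` value; RLIMIT_NOFILE is `ulimit -n`.
for name in ("RLIMIT_NPROC", "RLIMIT_NOFILE"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={soft} hard={hard}")
```

If the soft `RLIMIT_NPROC` here is much lower than expected (e.g. lower than the `ulimit -u 30000` set in the shell), the Ray workers were likely spawned under a different limit than the shell's.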
--- Error Trace 1 ---
python3.10 -m vllm.entrypoints.api_server --model "models/llama-2-7b-hf" --swap-space 1 --disable-log-requests --disable-log-stats --tensor-parallel-size 2
2023-08-25 08:22:59,206 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-25 08:23:00 llm_engine.py:70] Initializing an LLM engine with config: model='models/llama-2-7b-hf', tokenizer='models/llama-2-7b-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 08-25 08:23:00 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
[2023-08-25 08:23:04,248 E 32338 1301] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
[2023-08-25 08:23:04,279 E 32338 1301] logging.cc:104: Stack trace:
/site-packages/ray/_raylet.so(+0xdaf01a) [0x7f732773d01a] ray::operator<<()
/site-packages/ray/_raylet.so(+0xdb17d8) [0x7f732773f7d8] ray::TerminateHandler()
/lib64/libstdc++.so.6(+0x5ea06) [0x7f7397676a06]
/lib64/libstdc++.so.6(+0x5ea33) [0x7f7397676a33]
/lib64/libstdc++.so.6(+0x5ec53) [0x7f7397676c53]
/site-packages/ray/_raylet.so(+0x4a3ef4) [0x7f7326e31ef4] boost::throw_exception<>()
/site-packages/ray/_raylet.so(+0xdc33db) [0x7f73277513db] boost::asio::detail::do_throw_error()
/site-packages/ray/_raylet.so(+0xdc3dfb) [0x7f7327751dfb] boost::asio::detail::posix_thread::start_thread()
/site-packages/ray/_raylet.so(+0xdc425c) [0x7f732775225c] boost::asio::thread_pool::thread_pool()
/site-packages/ray/_raylet.so(+0x8c5b74) [0x7f7327253b74] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
/site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f7327253c09] ray::rpc::GetServerCallExecutor()
/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyEE17HandleRequestImplEvEUlS1_S4_S4_E_E9_M_invokeERKSt9_Any_dataOS1_OS4_SI+0xe2) [0x7f7326fe5172] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x8d1) [0x7f732701abf1] ray::core::CoreWorker::HandleGetCoreWorkerStats()
/site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc14ServerCallImplINS2_24CoreWorkerServiceHandlerENS2_25GetCoreWorkerStatsRequestENS2_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E9_M_invokeERKSt9_Any_data+0x116) [0x7f7327011636] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(+0x9683b6) [0x7f73272f63b6] EventTracker::RecordExecution()
/site-packages/ray/_raylet.so(+0x90580e) [0x7f732729380e] std::_Function_handler<>::_M_invoke()
/site-packages/ray/_raylet.so(+0x905d66) [0x7f7327293d66] boost::asio::detail::completion_handler<>::do_complete()
/site-packages/ray/_raylet.so(+0xdc0b6b) [0x7f732774eb6b] boost::asio::detail::scheduler::do_run_one()
/site-packages/ray/_raylet.so(+0xdc2639) [0x7f7327750639] boost::asio::detail::scheduler::run()
/site-packages/ray/_raylet.so(+0xdc2af2) [0x7f7327750af2] boost::asio::io_context::run()
/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f732702745d] ray::core::CoreWorker::RunIOService()
/site-packages/ray/_raylet.so(+0xee9270) [0x7f7327877270] execute_native_thread_routine
/lib64/libpthread.so.0(+0x7ea5) [0x7f73d59e1ea5] start_thread
/lib64/libc.so.6(clone+0x6d) [0x7f73d5001b0d] clone
*** SIGABRT received at time=1692951784 on cpu 44 ***
PC: @ 0x7f73d4f39387 (unknown) raise
    @ 0x7f73d59e9630 3520 (unknown)
    @ 0x7f7397676a06 1791331480 (unknown)
    @ 0x7f73279b1640 1875585256 (unknown)
    @ 0x7f7397677fb0 (unknown) (unknown)
    @ 0x3de907894810c083 (unknown) (unknown)
[2023-08-25 08:23:04,282 E 32338 1301] logging.cc:361: *** SIGABRT received at time=1692951784 on cpu 44 ***
[2023-08-25 08:23:04,282 E 32338 1301] logging.cc:361: PC: @ 0x7f73d4f39387 (unknown) raise
[2023-08-25 08:23:04,284 E 32338 1301] logging.cc:361: @ 0x7f73d59e9630 3520 (unknown)
[2023-08-25 08:23:04,284 E 32338 1301] logging.cc:361: @ 0x7f7397676a06 1791331480 (unknown)
[2023-08-25 08:23:04,286 E 32338 1301] logging.cc:361: @ 0x7f73279b1640 1875585256 (unknown)
[2023-08-25 08:23:04,286 E 32338 1301] logging.cc:361: @ 0x7f7397677fb0 (unknown) (unknown)
[2023-08-25 08:23:04,287 E 32338 1301] logging.cc:361: @ 0x3de907894810c083 (unknown) (unknown)
Fatal Python error: Aborted
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, simplejson._speedups, yaml._yaml, sentencepiece._sentencepiece, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, setproctitle, grpc._cython.cygrpc, ray._raylet, pvectorc, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.tslib, pandas._libs.lib, pandas._libs.hashing, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.index, pandas._libs.join, pandas._libs.sparse, pandas._libs.reduction, pandas._libs.indexing, pandas._libs.internals, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.tslibs.strptime, pandas._libs.groupby, pandas._libs.testing, pandas._libs.parsers, pandas._libs.json, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, 
pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, _cffi_backend, frozenlist._frozenlist, pyarrow._json, PIL._imaging (total: 111) Aborted
--- Error Trace 2 ---
python3.10 -m vllm.entrypoints.api_server --model "models/llama-2-7b-hf" --swap-space 1 --disable-log-requests --disable-log-stats --tensor-parallel-size 2
2023-08-25 00:42:35,237 INFO worker.py:1627 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 08-25 00:42:36 llm_engine.py:70] Initializing an LLM engine with config: model='models/llama-2-7b-hf', tokenizer='models/llama-2-7b-hf', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=2, seed=0)
INFO 08-25 00:42:36 tokenizer.py:29] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(pid=5623) [2023-08-25 00:42:40,265 E 5623 7733] logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
(pid=5623) [2023-08-25 00:42:40,287 E 5623 7733] logging.cc:104: Stack trace:
(pid=5623) /site-packages/ray/_raylet.so(+0xdaf01a) [0x7f2ebfbd801a] ray::operator<<()
(pid=5623) /site-packages/ray/_raylet.so(+0xdb17d8) [0x7f2ebfbda7d8] ray::TerminateHandler()
(pid=5623) /lib64/libstdc++.so.6(+0x5ea06) [0x7f2ebe977a06]
(pid=5623) /lib64/libstdc++.so.6(+0x5ea33) [0x7f2ebe977a33]
(pid=5623) /lib64/libstdc++.so.6(+0x5ec53) [0x7f2ebe977c53]
(pid=5623) /site-packages/ray/_raylet.so(+0x4a3ef4) [0x7f2ebf2ccef4] boost::throw_exception<>()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc33db) [0x7f2ebfbec3db] boost::asio::detail::do_throw_error()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc3dfb) [0x7f2ebfbecdfb] boost::asio::detail::posix_thread::start_thread()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc425c) [0x7f2ebfbed25c] boost::asio::thread_pool::thread_pool()
(pid=5623) /site-packages/ray/_raylet.so(+0x8c5b74) [0x7f2ebf6eeb74] ray::rpc::(anonymous namespace)::_GetServerCallExecutor()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray3rpc21GetServerCallExecutorEv+0x9) [0x7f2ebf6eec09] ray::rpc::GetServerCallExecutor()
(pid=5623) /site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFvN3ray6StatusESt8functionIFvvEES4_EZNS0_3rpc14ServerCallImplINS6_24CoreWorkerServiceHandlerENS6_25GetCoreWorkerStatsRequestENS6_23GetCoreWorkerStatsReplyEE17HandleRequestImplEvEUlS1_S4_S4_E_E9_M_invokeERKSt9_Any_dataOS1_OS4_SI+0xe2) [0x7f2ebf480172] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker24HandleGetCoreWorkerStatsENS_3rpc25GetCoreWorkerStatsRequestEPNS2_23GetCoreWorkerStatsReplyESt8functionIFvNS_6StatusES6_IFvvEES9_EE+0x8d1) [0x7f2ebf4b5bf1] ray::core::CoreWorker::HandleGetCoreWorkerStats()
(pid=5623) /site-packages/ray/_raylet.so(_ZNSt17_Function_handlerIFvvEZN3ray3rpc14ServerCallImplINS2_24CoreWorkerServiceHandlerENS2_25GetCoreWorkerStatsRequestENS2_23GetCoreWorkerStatsReplyEE13HandleRequestEvEUlvE_E9_M_invokeERKSt9_Any_data+0x116) [0x7f2ebf4ac636] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(+0x9683b6) [0x7f2ebf7913b6] EventTracker::RecordExecution()
(pid=5623) /site-packages/ray/_raylet.so(+0x90580e) [0x7f2ebf72e80e] std::_Function_handler<>::_M_invoke()
(pid=5623) /site-packages/ray/_raylet.so(+0x905d66) [0x7f2ebf72ed66] boost::asio::detail::completion_handler<>::do_complete()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc0b6b) [0x7f2ebfbe9b6b] boost::asio::detail::scheduler::do_run_one()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc2639) [0x7f2ebfbeb639] boost::asio::detail::scheduler::run()
(pid=5623) /site-packages/ray/_raylet.so(+0xdc2af2) [0x7f2ebfbebaf2] boost::asio::io_context::run()
(pid=5623) /site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker12RunIOServiceEv+0xcd) [0x7f2ebf4c245d] ray::core::CoreWorker::RunIOService()
(pid=5623) /site-packages/ray/_raylet.so(+0xee9270) [0x7f2ebfd12270] execute_native_thread_routine
(pid=5623) /lib64/libpthread.so.0(+0x7ea5) [0x7f2ec8afaea5] start_thread
(pid=5623) /lib64/libc.so.6(clone+0x6d) [0x7f2ec811ab0d] clone
(pid=5623)
(pid=5623) *** SIGABRT received at time=1692924160 on cpu 59 ***
(pid=5623) PC: @ 0x7f2ec8052387 (unknown) raise
(pid=5623) @ 0x7f2ec8b02630 3520 (unknown)
(pid=5623) @ 0x7f2ebe977a06 286194840 (unknown)
(pid=5623) @ 0x7f2ebfe4c640 (unknown) (unknown)
(pid=5623) @ 0x7f2ebe978fb0 (unknown) (unknown)
(pid=5623) @ 0x3de907894810c083 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: *** SIGABRT received at time=1692924160 on cpu 59 ***
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: PC: @ 0x7f2ec8052387 (unknown) raise
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: @ 0x7f2ec8b02630 3520 (unknown)
(pid=5623) [2023-08-25 00:42:40,289 E 5623 7733] logging.cc:361: @ 0x7f2ebe977a06 286194840 (unknown)
(pid=5623) [2023-08-25 00:42:40,290 E 5623 7733] logging.cc:361: @ 0x7f2ebfe4c640 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,290 E 5623 7733] logging.cc:361: @ 0x7f2ebe978fb0 (unknown) (unknown)
(pid=5623) [2023-08-25 00:42:40,291 E 5623 7733] logging.cc:361: @ 0x3de907894810c083 (unknown) (unknown)
(pid=5623) Fatal Python error: Aborted
(pid=5623)
(pid=5623)
(pid=5623) Extension modules: msgpack._cmsgpack, setproctitle, psutil._psutil_linux, psutil._psutil_posix, yaml._yaml, grpc._cython.cygrpc, ray._raylet, pvectorc, simplejson._speedups (total: 9)
(pid=5643) E0825 00:42:41.394104305 8973 thd.cc:157] pthread_create failed: Resource temporarily unavailable
Exception in thread ray_print_logs:
Traceback (most recent call last):
File "python/3.10/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "python/3.10/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/site-packages/ray/_private/worker.py", line 885, in print_logs
data = subscriber.poll()
File "/site-packages/ray/_private/gcs_pubsub.py", line 335, in poll
self._poll_locked(timeout=timeout)
File "/site-packages/ray/_private/gcs_pubsub.py", line 208, in _poll_locked
fut = self._stub.GcsSubscriberPoll.future(
File "/site-packages/grpc/_channel.py", line 1060, in future
call = self._managed_call(
File "/site-packages/grpc/_channel.py", line 1443, in create
_run_channel_spin_thread(state)
File "/site-packages/grpc/_channel.py", line 1404, in _run_channel_spin_thread
channel_spin_thread.start()
File "src/python/grpcio/grpc/_cython/_cygrpc/fork_posix.pyx.pxi", line 120, in grpc._cython.cygrpc.ForkManagedThread.start
File "python/3.10/lib/python3.10/threading.py", line 928, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
E0825 00:42:41.832129666 19378 thd.cc:157] pthread_create failed: Resource temporarily unavailable
Traceback (most recent call last):
File "python/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "python/3.10/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/site-packages/vllm/entrypoints/api_server.py", line 78, in TORCH_USE_CUDA_DSA to enable device-side assertions.
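For what it's worth, the `RuntimeError: can't start new thread` in the trace above is the Python-level face of the same failure: `pthread_create` returns EAGAIN once the user's task count reaches `ulimit -u`. A rough way to see how close the user already is to that cap (a Linux-only sketch; walking `/proc` is an assumption about the platform, and the Kubernetes reports later in this thread are exactly that environment):

```python
import os

def user_thread_count(uid=None):
    """Count threads owned by `uid` by walking /proc (Linux-only).

    Each entry in /proc/<pid>/task is one thread. Compare the total
    against `ulimit -u`; pthread_create fails with EAGAIN once the
    two meet. Returns 0 when /proc is unavailable.
    """
    if not os.path.isdir("/proc"):
        return 0
    uid = os.getuid() if uid is None else uid
    total = 0
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            if os.stat(f"/proc/{pid}").st_uid != uid:
                continue
            total += len(os.listdir(f"/proc/{pid}/task"))
        except OSError:  # process exited or is not readable
            continue
    return total

print(user_thread_count())
```

Ray plus torch plus gRPC can easily spawn thousands of threads per GPU worker, so a seemingly generous `ulimit -u` can still be exhausted with `--tensor-parallel-size` > 1.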
Thanks
I just re-ran this with the latest vLLM 0.1.4 wheels for Python 3.10, and it failed with the same error. I would highly appreciate any input here.
Hi @dhritiman, thanks for trying out vLLM. Could you try --tensor_parallel_size 1 and see if it works?
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Getting the same error when running Yarn-llama2-70B-32k on vLLM 0.2.4
CUDA 11.8, V11.8.89
NVIDIA driver 515.105.01
A100 host with 8 GPU cards
Python 3.9
vLLM 0.2.4
Getting the same error while trying to serve Mixtral-8x7B-Instruct-v0.1 on vLLM 0.2.6 with --tensor_parallel_size 2
CUDA 12.2
NVIDIA driver 535.104.12
A100 host with 8 GPU cards
Python 3.11.5
vLLM 0.2.6
Any resolution on this?
Getting the same error while trying to serve Mixtral-8x7B-Instruct-v0.1 on vLLM 0.2.7 with --tensor_parallel_size 8 on Kubernetes Deployment
-- Environment --
CUDA Version: 12.2
Driver Version: 535.129.03
Kubernetes pod running on an A10G host with 8 GPU cards
Python 3.10.13
vLLM 0.2.7
--- Error Trace ---
/usr/local/bin/python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --revision 125c431e2ff41a156b9f9076f744d2f35dd6e67a --max-model-len 8191 --download-dir /data --tensor-parallel-size 8
INFO 01-10 14:36:51 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision='125c431e2ff41a156b9f9076f744d2f35dd6e67a', tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir='/data', load_format='auto', dtype='auto', max_model_len=8191, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
config.json: 100%|██████████| 720/720 [00:00<00:00, 5.99MB/s]
2024-01-10 14:36:53,774 INFO worker.py:1724 -- Started a local Ray instance.
E0110 14:36:57.034505970 73 thd.cc:157] pthread_create failed: Resource temporarily unavailable
We also see this on a separate Kubernetes cluster, but can't reproduce it on our own Kubernetes setup on the same 4×A100 system.
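One Kubernetes-specific angle worth ruling out (a hedged sketch; the cgroup file paths below are assumptions covering both the v2 unified and v1 layouts): besides the per-user `ulimit -u`, a container can hit its pids-cgroup limit (for example one set via the kubelet's pod pids limit), which also makes `pthread_create` fail with "Resource temporarily unavailable". This reads the container's effective pids limit:

```python
from pathlib import Path

def pod_pids_limit():
    """Best-effort read of the container's pids-cgroup limit.

    Tries the cgroup v2 unified path first, then the v1 pids
    controller. Returns the limit as an int, or None when it is
    unlimited or the files are unavailable (e.g. not in a container).
    """
    for path in ("/sys/fs/cgroup/pids.max",        # cgroup v2
                 "/sys/fs/cgroup/pids/pids.max"):  # cgroup v1
        try:
            text = Path(path).read_text().strip()
        except OSError:
            continue
        return None if text == "max" else int(text)
    return None

print(pod_pids_limit())
```

If this prints a small number (a few thousand), the pod's pids limit, not the user's ulimit, is the cap being exhausted, which would explain why the same workload runs fine on a bare-metal host.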
Has anyone solved this problem?
Same problem here, SOS.
Same problem. Any update on the solution here, @dhritiman?
I had the same error:
logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): thread: Resource temporarily unavailable [system:11]
After making some changes to /etc/security/limits.conf (the open file count, the user process count, and so on), the error was gone.
Before modification: (screenshot omitted)
After modification: (screenshot omitted)
Hope this is helpful.
Hi @hejxiang, I used the ulimit command to change the limits, but it still didn't work. Is there any difference between using ulimit and modifying /etc/security/limits.conf to change the limit values? Can you show how you modified limits.conf?
@guangzlu Sorry for the late reply.
Changes made with ulimit apply only to the current process (the current shell) and its children. Ray creates processes outside the current shell, so the changes need to be permanent: edit /etc/security/limits.conf.
Open the config file with vim (or any other editor), follow the comment instructions in the file, and add what you want at the end. For example, to raise the max open file count and the max process count to 65535 for all users (you can decide the specific values yourself):
* soft nofile 65535
* hard nofile 65535
* soft nproc 65535
* hard nproc 65535
Then open a new shell (a fresh login session) and verify with ulimit -a.
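When editing /etc/security/limits.conf is not an option (for example inside a container), a process can at least raise its own soft limits up to the hard ceiling before starting vLLM or Ray. A minimal sketch using Python's standard `resource` module; note that going beyond the hard limit still requires root or the limits.conf edit:

```python
import resource

def raise_soft_limits():
    """Lift soft limits to their hard ceilings for this process and
    any children it spawns. An unprivileged process may raise its
    soft limit up to, but not past, the hard limit.
    """
    for rlim in (resource.RLIMIT_NOFILE, resource.RLIMIT_NPROC):
        soft, hard = resource.getrlimit(rlim)
        if soft != hard:
            resource.setrlimit(rlim, (hard, hard))

raise_soft_limits()  # call early, before Ray workers are spawned
```

This only helps when the hard limit is already high enough; if `ulimit -Hu` is itself too low, the limits.conf route (or the container's pids limit) is still the real fix.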
Thank you very much! It is very detailed and helpful!

