
Onnxruntime-directml 1.18.0: broken multithreaded inference sessions

Djdefrag opened this issue 1 year ago

Describe the issue

With the new version 1.18, when using separate InferenceSession objects on the same DirectML device from different threads, all threads stall without raising any exception or error.

To reproduce

Thread 1

from onnx import load as onnx_load
from onnxruntime import InferenceSession as onnxruntime_inferenceSession

AI_model_loaded = onnx_load(AI_model_path)

AI_model = onnxruntime_inferenceSession(
    path_or_bytes = AI_model_loaded.SerializeToString(),
    providers = [('DmlExecutionProvider', {"device_id": "0"})]
)

onnx_input  = {AI_model.get_inputs()[0].name: image}
onnx_output = AI_model.run(None, onnx_input)[0]

Thread n (where n can be any number)

# identical code, run concurrently in thread n
AI_model_loaded = onnx_load(AI_model_path)

AI_model = onnxruntime_inferenceSession(
    path_or_bytes = AI_model_loaded.SerializeToString(),
    providers = [('DmlExecutionProvider', {"device_id": "0"})]
)

onnx_input  = {AI_model.get_inputs()[0].name: image}
onnx_output = AI_model.run(None, onnx_input)[0]
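
For completeness, here is a minimal self-contained version of the repro (model path, input image, and thread count are placeholders):

import threading
import numpy
import onnx
import onnxruntime

AI_model_path = "model.onnx"                                 # placeholder path
image = numpy.zeros((1, 3, 256, 256), dtype=numpy.float32)   # placeholder input

def worker() -> None:
    # Each thread builds its own InferenceSession on the same DirectML device.
    AI_model_loaded = onnx.load(AI_model_path)
    AI_model = onnxruntime.InferenceSession(
        path_or_bytes = AI_model_loaded.SerializeToString(),
        providers = [('DmlExecutionProvider', {"device_id": "0"})]
    )
    onnx_input  = {AI_model.get_inputs()[0].name: image}
    onnx_output = AI_model.run(None, onnx_input)[0]

threads = [threading.Thread(target=worker) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()   # with 1.18.0 these joins never return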

Urgency

No response

Platform

Windows

OS Version

10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

1.18.0

Djdefrag avatar May 17 '24 19:05 Djdefrag

Tagging @PatriceVignola @smk2007 @fdwr for visibility.

sophies927 avatar May 19 '24 19:05 sophies927

Same here on Windows. Versions 1.16.0 to 1.17.3 work fine over multiple threads; however, 1.18.0 gives "Windows fatal exception: access violation" with the following stack trace, produced by my own Windows SEH handler:

-----------
Caught unhandled exception...
-----------

Terminating from thread id 10152

Non-C++ exception:
  Error: EXCEPTION_ACCESS_VIOLATION
  Type: Read
  Addr: 0x0

Trace:
 40:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 39:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 38:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 37:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 36:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 35:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 34:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 33:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 32:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 31:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 30:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 29:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 28:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 27:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 26:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 25:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 24:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 23:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 22:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 21:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 20:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 19:  ?: PyInit_onnxruntime_pybind11_state  (onnxruntime_pybind11_state.pyd)
 18:  ?: pybind11::error_already_set::discard_as_unraisable  (onnxruntime_pybind11_state.pyd)
 17:  ?: PyObject_MakeTpCall  (python311.dll)
 16:  ?: PyObject_Vectorcall  (python311.dll)
 15:  ?: PyEval_EvalFrameDefault  (python311.dll)
 14:  ?: PyFunction_Vectorcall  (python311.dll)
 13:  ?: PyFunction_Vectorcall  (python311.dll)
 12:  ?: PyObject_CallObject  (python311.dll)
 11:  ?: PyEval_EvalFrameDefault  (python311.dll)
 10:  ?: PyFunction_Vectorcall  (python311.dll)
  9:  ?: PyObject_CallObject  (python311.dll)
  8:  ?: PyEval_EvalFrameDefault  (python311.dll)
  7:  ?: PyFunction_Vectorcall  (python311.dll)
  6:  ?: PyFunction_Vectorcall  (python311.dll)
  5:  ?: PyObject_Call  (python311.dll)
  4:  ?: PyInterpreterState_Delete  (python311.dll)
  3:  ?: PyInterpreterState_Delete  (python311.dll)
  2:  ?: recalloc  (ucrtbase.dll)
  1:  ?: BaseThreadInitThunk  (KERNEL32.DLL)
  0:  ?: RtlUserThreadStart  (ntdll.dll)

saulthu avatar Jun 03 '24 00:06 saulthu

We’ve noted the issue with GPU resource contention due to multiple threads. This usage pattern is not recommended, as it makes multiple threads request all of the GPU resources and can cause contention. Also, the allocator in the Python API (both CUDA and DML) is explicitly not thread-safe, because it initializes the allocator as a global singleton due to it living outside of the session.

We’re investigating the recent failure and will address it. Meanwhile, please avoid this pattern to prevent GPU contention.

liuyunms avatar Jun 07 '24 00:06 liuyunms

Hi @liuyunms

Sorry to bother you, I'm currently using one InferenceSession per thread, but you say it shouldn't be used this way.

4 threads -> 4 inference sessions with the same GPU

Do you mean to use the same InferenceSession in multiple threads? Is that possible?

4 threads -> 1 inference session with the same GPU
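
Something like this, just as a sketch of what I mean (model path and input are placeholders)?

import threading
import numpy
import onnxruntime

# One session created once, shared by every worker thread.
shared_session = onnxruntime.InferenceSession(
    "model.onnx",   # placeholder path
    providers = [('DmlExecutionProvider', {"device_id": "0"})]
)

def worker(image: numpy.ndarray) -> None:
    onnx_input  = {shared_session.get_inputs()[0].name: image}
    onnx_output = shared_session.run(None, onnx_input)[0]

image = numpy.zeros((1, 3, 256, 256), dtype=numpy.float32)   # placeholder input
threads = [threading.Thread(target=worker, args=(image,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()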

Djdefrag avatar Jun 20 '24 10:06 Djdefrag

@PatriceVignola @smk2007 @fdwr

Hi, sorry to bother you, but is there any news on this problem? I'm currently testing 1.18.1 and the problem is still present :(

Thank you

Djdefrag avatar Jun 28 '24 06:06 Djdefrag

https://github.com/microsoft/onnxruntime/pull/21566 This PR might be related. @Djdefrag Could you help verify if the problem is fixed?

zhangxiang1993 avatar Aug 15 '24 02:08 zhangxiang1993

@zhangxiang1993 It still crashes using multiple threads in my application. I just tried the nightly build of 1.19 from here (I used the Python 3.11 build for Windows). I've reverted back to 1.17.3 which still works.

saulthu avatar Aug 15 '24 03:08 saulthu

@zhangxiang1993

Hi, I can confirm that the problem is also present on the 1.19 nightly (Python 3.11):

  • ORT 1.19 [NOT working]
  • ORT 1.18.1 [NOT working]
  • ORT 1.18 [NOT working]
  • ORT 1.17.3 [working]

Djdefrag avatar Aug 15 '24 05:08 Djdefrag

Not sure if this helps, but I have this method to work around it.

import threading
from contextlib import nullcontext
from typing import ContextManager, Union

THREAD_SEMAPHORE : threading.Semaphore = threading.Semaphore()
NULL_CONTEXT : ContextManager[None] = nullcontext()

def conditional_thread_semaphore() -> Union[threading.Semaphore, ContextManager[None]]:
	# Serialize inference only when a provider affected by this issue is in use.
	if has_execution_provider('directml') or has_execution_provider('rocm'):
		return THREAD_SEMAPHORE
	return NULL_CONTEXT

with conditional_thread_semaphore():
	onnxruntime.run()  # placeholder for your actual InferenceSession.run(...) call

Sorry, but implement has_execution_provider yourself :)
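
If it helps as a starting point, here is one possible sketch (not exactly what I use; it assumes the keyword only needs to match one of ONNX Runtime's provider names):

import onnxruntime

# Map the short keyword to the ONNX Runtime provider name.
PROVIDER_NAMES = {
	'directml': 'DmlExecutionProvider',
	'rocm': 'ROCMExecutionProvider',
	'cuda': 'CUDAExecutionProvider',
}

def has_execution_provider(keyword : str) -> bool:
	# True if the matching provider is available in this onnxruntime build.
	return PROVIDER_NAMES.get(keyword) in onnxruntime.get_available_providers()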

henryruhs avatar Aug 15 '24 21:08 henryruhs

@henryruhs

with conditional_thread_semaphore():
	onnxruntime.run()

This works; however, it defeats the purpose of running in multiple threads. In older versions that do not crash, I can have the GPU running at 100%, but this workaround causes a very large performance hit.

saulthu avatar Aug 15 '24 22:08 saulthu

Yeah, the performance hit is something I am aware of.

henryruhs avatar Aug 16 '24 05:08 henryruhs

Hi @henryruhs

Thank you. I tried the Semaphore solution and it works, but the performance is the same as using only one thread.

Hopefully they will fix the problem with the next release.

Djdefrag avatar Aug 16 '24 08:08 Djdefrag

This problem is significant, so most of us will remain on version 1.17.3. Please fix it.

linyu0219 avatar Aug 31 '24 08:08 linyu0219

We’ve noted the issue with GPU resource contention due to multiple threads. This usage pattern is not recommended, as it makes multiple threads request all of the GPU resources and can cause contention. Also, the allocator in the Python API (both CUDA and DML) is explicitly not thread-safe, because it initializes the allocator as a global singleton due to it living outside of the session.

We’re investigating the recent failure and will address it. Meanwhile, please avoid this pattern to prevent GPU contention.

@linyu0219 It's unfortunate that the Python API is broken like this. The official docs for the DirectML provider say, "Multiple threads are permitted to call Run simultaneously if they operate on different inference session objects", yet this is apparently not true if you use Python. 😟

Can we please get a fix to the Python API?

saulthu avatar Oct 09 '24 02:10 saulthu

When unloading an inference session in a multithreading scenario, it crashes the whole application. I assume various threads try to access None while still expecting an inference session.

This is a DirectML only issue, we had to downgrade to 1.17.3 as well.

henryruhs avatar Oct 09 '24 06:10 henryruhs

I'm forced to use my own C++ Pybind11 wrapper since this official Python wrapper is broken for multithreading. 😟

Edit: Here's the rough bit of code that does the job for me, until the official python wrapper is fixed: https://gist.github.com/saulthu/c60a8f1f10352e98a986e57205cedd49

saulthu avatar Oct 10 '24 01:10 saulthu

Tested the RC directml 1.20.0.dev20241022005.

The problem has not been solved; in fact, it has gotten worse, and now with 1.20 the GPU driver crashes.

Djdefrag avatar Oct 27 '24 14:10 Djdefrag

@saulthu it sounds like you understood the underlying issue... is this just a binding issue? can you send a pull request?

henryruhs avatar Oct 30 '24 08:10 henryruhs

@saulthu it sounds like you understood the underlying issue... is this just a binding issue? can you send a pull request?

@henryruhs Sorry, I don't have a patch to provide. I have been a user of the python bindings for a while. The only real info I have to go on for the true cause is the comment by @liuyunms:

We’ve noted the issue with GPU resource contention due to multiple threads. This usage pattern is not recommended, as it makes multiple threads request all of the GPU resources and can cause contention. Also, the allocator in the Python API (both CUDA and DML) is explicitly not thread-safe, because it initializes the allocator as a global singleton due to it living outside of the session.

We’re investigating the recent failure and will address it. Meanwhile, please avoid this pattern to prevent GPU contention.

I have written my own pybind11 wrapper using the C++ API, which appears to run without issue, and also seems to run with better parallelism -- I guess I'm releasing the GIL for longer?

Here's the rough bit of code that does the job for me, until the official python wrapper is fixed: https://gist.github.com/saulthu/c60a8f1f10352e98a986e57205cedd49

saulthu avatar Oct 30 '24 10:10 saulthu

Hi @PatriceVignola @smk2007 @fdwr @liuyunms

Is there any news about this issue?

Djdefrag avatar Nov 15 '24 09:11 Djdefrag

Hello, happy new year!

Is there any news for this issue? @PatriceVignola @smk2007 @fdwr @liuyunms

Djdefrag avatar Jan 02 '25 11:01 Djdefrag

I am also encountering this problem

SSFRPA avatar Feb 07 '25 08:02 SSFRPA

@skottmckay Could you share your thoughts on this? It's been preventing projects from being updated for over a year.

henryruhs avatar Feb 07 '25 09:02 henryruhs

@skottmckay Could you share your thoughts on this? It's been preventing projects from being updated for over a year.

@skottmckay We really need a fix; this bug has been present for almost a year now, and we can't even access the Python 3.13 improvements because of it.

Djdefrag avatar Feb 16 '25 08:02 Djdefrag

Sorry - I'm not a DML expert. @fdwr any ideas what changed between 1.17.3 and 1.18 that might cause this?

skottmckay avatar Feb 17 '25 06:02 skottmckay

Hope this issue can be resolved.

SSFRPA avatar Feb 18 '25 09:02 SSFRPA

Hi, tested the 1.21.0 Release Candidate (#23885).

The problem is still present. Please fix it; it has been almost a year :(

Djdefrag avatar Mar 07 '25 16:03 Djdefrag