rocm_smi_lib
rocm_smi_lib copied to clipboard
init() and shut_down() needs reference counting
Hello
I am reviewing patches that add rsmi support to hwloc. The major complain is that rsmi init()/shutdown() do not have reference counting.
Any application using both rsmi and hwloc will call rsmi_init() and hwloc_topology_load() which also calls rsmi_init(). hwloc doesn't know when the application actually uses rsmi. Hence hwloc cannot ever call rsmi_shut_down() because it may always break somebody else using rsmi.
The issue basically prevents us from ever calling rsmi_shut_down(), which is ugly and makes valgrind complain etc.
Note that this is not specific to hwloc. It's a generic problem with libraries that may be used by multiple layers in a software stack. MPI is fixing similar issues (nobody cared 25 years ago, today it's a big issue).
Note also that I am not even talking about thread safety here. Reference counting can be thread-unsafe for now if the rest of the lib isn't thread-safe. But you must make rsmi_shut_down() a noop unless it's the last invocation with respect to number of rsmi_init() calls earlier.
Also it'd be good to know which version fixes this so that hwloc can check the rsmi version before deciding whether it may safely call rsmi_shut_down() or not.
thanks Brice
Thanks Brice
I'm working on this now. I'll push something probably next week.
Chris
On Tue, May 5, 2020 at 4:03 AM Brice Goglin [email protected] wrote:
Hello
I am reviewing patches that add rsmi support to hwloc. The major complain is that rsmi init()/shutdown() do not have reference counting.
Any application using both rsmi and hwloc will call rsmi_init() and hwloc_topology_load() which also calls rsmi_init(). hwloc doesn't know when the application actually uses rsmi. Hence hwloc cannot ever call rsmi_shut_down() because it may always break somebody else using rsmi.
The issue basically prevents us from ever calling rsmi_shut_down(), which is ugly and makes valgrind complain etc.
Note that this is not specific to hwloc. It's a generic problem with libraries that may be used by multiple layers in a software stack. MPI is fixing similar issues (nobody cared 25 years ago, today it's a big issue).
Note also that I am not even talking about thread safety here. Reference counting can be thread-unsafe for now if the rest of the lib isn't thread-safe. But you must make rsmi_shut_down() a noop unless it's the last invocation with respect to number of rsmi_init() calls earlier.
thanks Brice
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/62, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADDTF2NU6JSZKLAKKCGYK6TRP7I7JANCNFSM4MZMWX4Q .
Thanks, I just saw the commit. Can you confirm that refcounting will be enabled when rsmi_version_get() return > 3.3.x ? I see your last release in March is 3.3.0 but I can't find the next version number in git master.
That's correct. 3.5 (there is no 3.4) should be the first release with the ref counting.
Chris
Chris
On Thu, May 28, 2020 at 3:50 AM Brice Goglin [email protected] wrote:
Thanks, I just saw the commit. Can you confirm that refcounting will be enabled when rsmi_version_get() return > 3.3.x ? I see your last release in March is 3.3.0 but I can't find the next version number in git master.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/62#issuecomment-635209261, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADDTF2OGSR4I4Z56V3FN4KDRTYQVHANCNFSM4MZMWX4Q .
--
Perfect, thanks.
Ref count exists as of 3 years ago :)