rocm_smi_lib icon indicating copy to clipboard operation
rocm_smi_lib copied to clipboard

init() and shut_down() needs reference counting

Open bgoglin opened this issue 5 years ago • 4 comments

Hello

I am reviewing patches that add rsmi support to hwloc. The major complain is that rsmi init()/shutdown() do not have reference counting.

Any application using both rsmi and hwloc will call rsmi_init() and hwloc_topology_load() which also calls rsmi_init(). hwloc doesn't know when the application actually uses rsmi. Hence hwloc cannot ever call rsmi_shut_down() because it may always break somebody else using rsmi.

The issue basically prevents us from ever calling rsmi_shut_down(), which is ugly and makes valgrind complain etc.

Note that this is not specific to hwloc. It's a generic problem with libraries that may be used by multiple layers in a software stack. MPI is fixing similar issues (nobody cared 25 years ago, today it's a big issue).

Note also that I am not even talking about thread safety here. Reference counting can be thread-unsafe for now if the rest of the lib isn't thread-safe. But you must make rsmi_shut_down() a noop unless it's the last invocation with respect to number of rsmi_init() calls earlier.

Also it'd be good to know which version fixes this so that hwloc can check the rsmi version before deciding whether it may safely call rsmi_shut_down() or not.

thanks Brice

bgoglin avatar May 05 '20 09:05 bgoglin

Thanks Brice

I'm working on this now. I'll push something probably next week.

Chris

On Tue, May 5, 2020 at 4:03 AM Brice Goglin [email protected] wrote:

Hello

I am reviewing patches that add rsmi support to hwloc. The major complain is that rsmi init()/shutdown() do not have reference counting.

Any application using both rsmi and hwloc will call rsmi_init() and hwloc_topology_load() which also calls rsmi_init(). hwloc doesn't know when the application actually uses rsmi. Hence hwloc cannot ever call rsmi_shut_down() because it may always break somebody else using rsmi.

The issue basically prevents us from ever calling rsmi_shut_down(), which is ugly and makes valgrind complain etc.

Note that this is not specific to hwloc. It's a generic problem with libraries that may be used by multiple layers in a software stack. MPI is fixing similar issues (nobody cared 25 years ago, today it's a big issue).

Note also that I am not even talking about thread safety here. Reference counting can be thread-unsafe for now if the rest of the lib isn't thread-safe. But you must make rsmi_shut_down() a noop unless it's the last invocation with respect to number of rsmi_init() calls earlier.

thanks Brice

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/62, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADDTF2NU6JSZKLAKKCGYK6TRP7I7JANCNFSM4MZMWX4Q .

cfreehill avatar May 08 '20 20:05 cfreehill

Thanks, I just saw the commit. Can you confirm that refcounting will be enabled when rsmi_version_get() return > 3.3.x ? I see your last release in March is 3.3.0 but I can't find the next version number in git master.

bgoglin avatar May 28 '20 08:05 bgoglin

That's correct. 3.5 (there is no 3.4) should be the first release with the ref counting.

Chris

Chris

On Thu, May 28, 2020 at 3:50 AM Brice Goglin [email protected] wrote:

Thanks, I just saw the commit. Can you confirm that refcounting will be enabled when rsmi_version_get() return > 3.3.x ? I see your last release in March is 3.3.0 but I can't find the next version number in git master.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/62#issuecomment-635209261, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADDTF2OGSR4I4Z56V3FN4KDRTYQVHANCNFSM4MZMWX4Q .

--

cfreehill avatar May 28 '20 16:05 cfreehill

Perfect, thanks.

bgoglin avatar May 28 '20 16:05 bgoglin

Ref count exists as of 3 years ago :)

dmitrii-galantsev avatar Dec 12 '23 21:12 dmitrii-galantsev