ovis
ovis copied to clipboard
Variorum LDMS Plugin for vendor-neutral power management
This pull request introduces a vendor-neutral power monitoring plugin that reports node, CPU (socket), memory and GPU power using LLNL's Variorum library (https://variorum.readthedocs.io/).
Hi @morrone Thank you so much for getting the process started!
Jessica and I will start working on the comments for the name, Makefile and autoconf soon (early next week), and update this PR accordingly.
We can also demo the plugin to you internally if you are interested in looking at the functionality across different architectures.
Please mark this PR draft while major revisions are on-going.
It would be helpful if the interaction/conflict between the library underlying this samplers (indirect) usage of MSR and the syspapi sampler use of MSRs were documented. In particular since in many cases MSRs are scarce and the kernel may attempt to time-slice them if they are oversubscribed, inaccurate data for both samplers might result.
It appears jansson is not directly supported in RHEL7. It would be fantastic if variorum had a build switch to select between jansson and ovis_json (which has all the needed json functionality afaict).
I'm not sure that it makes sense to have a build switch to select between jansson or ovis_json. If ovis_json meets their needs, then they should probably just switch to using that. Keeping the extra build complexity around for jansson when it is no longer needed doesn't seem desirable to me.
@baallan The library uses more than just MSRs (MSRs are only available on Intel systems and AMD systems, for IBM we use OPAL firmware, and for Nvidia and AMD GPUs we use NVML and Rocm). This is well documented here: https://variorum.readthedocs.io/en/latest/HWArchitectures.html as what Variorum uses internally will be architecture-specific.
What other documentation would you like to see here?
Regarding Jansson vs OVIS-JSON -- we don't support that in Variorum yet, and we can in the future in a few months based on our prioritization, but I think that'll be beyond the scope of this PR. I am hoping that can be an enhancement we can work on later. Let me know your thoughts on that.
Oh, if that is what baallan was requesting, then I agree. The external variorum package should not use ovis-json at all. There are plenty of json libraries out there in the world. We don't need add ovis's to the pile. Moreover, doing so in variorum would introduce a circular dependency at build time, and we can't have that.
I had assumed though that baallan meant that the variorum sampler in this PR should use ovis-json. That at least would be reasonable.
@morrone Sure, updating just the sampler to use your OVIS-JSON library should be possible easily. I probably misunderstood what @baallan was suggesting there. Can you point me to the documentation for that and equivalent of the following or an example that does something similar in your codebase?
json_loads(string ...)
json_real_value(json_object_get(<obj-name>, "key"))
I would like to eliminate (at least from the ldms config/build) direct dependence on libjansson. If linking libvariorum.so automatically pulls in libjansson that is in theory ok. If using ovis_json to parse data will still leave us with the explicit link dependency on jansson, then there's no point in using the ovis_json.h code.
I actually was suggesting the radical course of making variorum json handling pluggable; on previous projects with plugins 'json version hell' has ensued when everyone likes their own version of their own third-party favorite json lib. But I see there's resistance... understandable.
@morrone Sure, updating just the sampler to use your OVIS-JSON library should be possible easily. I probably misunderstood what @baallan was suggesting there. Can you point me to the documentation for that and equivalent of the following or an example that does something similar in your codebase?
https://github.com/ovis-hpc/ovis/blob/OVIS-4/lib/src/ovis_json/ovis_json.h
json_loads(string ...)
json_parse_buffer
json_real_value(json_object_get(<obj-name>, "key"))
This is some combination of json_value_float and json_value_find
I'm not entirely sure if updating just the LDMS sampler code to parse with OVIS-JSON removes the need for linking in libjansson for the LDMS part. You will still need libjansson for building variorum, of course.
Pullling @slabasan into this discussion so she can comment here too.
Libjansson is a more common C library for doing JSON stuff, so I don't think we want to move to OVIS-JSON altogether in Variorum. We export the JSON object externally as a string, so I believe users of Variorum (eg LDMS) can parse it the way they like.
May be we can at some point support more JSON libraries, but that's not on the roadmap at least at the moment for a few months, and that's a lot of internal code changes across multiple architectures that we can't make time for until we meet other priorities for Variorum. This isn't because of resistance necessarily due to having a favorite JSON library, but because of limited staff (2 main developers) and several other important things to support instead. That being said, I do hope that we can revisit that bit of supporting other JSON libraries in Variorum in the future when we have more time to devote to adding more JSON-related support.
Is there a current norm for other LDMS plugins that use their own JSON libraries?
@morrone Thank you so much for all your suggestions! I apologize for being inattentive; I did not realize my pushes would go straight through to this pull request. I will work my way through your comments, but I know very little about autoconf, so it might take me a bit. Thank you for your patience.
Pushed some more changes for the weekend; I'll be back to work on this more next week. Thank you again for your patience.
@morrone We tried to switch to using AC_LIB_HAVE_LINKFLAGS, but despite passing in "--with-libvariorum-prefix=<variorum/install/directory>", I am seeing this:
checking for libvariorum... no
checking for libvariorum... (cached) no
checking for libjansson... yes
checking how to link with libjansson... -ljansson
Libjansson is in a standard location; libvariorum is not. Do you have any advice?
@morrone We tried to switch to using AC_LIB_HAVE_LINKFLAGS, but despite passing in "--with-libvariorum-prefix=<variorum/install/directory>"
I believe that the option is named just "--with-libvariorum". You can run "./configure --help" to verify that. (I assume that we at not literally including the greater-than and less-than symbols.)
EDIT: NOPE! I was wrong, I checked and it probably is "--with-libvariorum-prefix". So, OK I think the next step is to show me the exact configure line that you used instead of a pseudo option. You can always email me if you aren't comfortable showing that here.
Also please share the layout of the libvariorum install directory (the output of "find /path/to/variorum/install/directory" would likely be sufficient).
Assuming there is no simple problem that we can see there, the next step in debugging is to take a look at the config.log file after the configure fails. That will likely show us the command line used to compile the test program, the error message that the compiler printed, and the test program itself. That will help a great deal in determining where the problem lies.
But I think we're on the right track this time. :)
Good news! I think the problem was in the AC_LIB_HAVE_LINKFLAGS macro itself. I updated the macros to the latest from gnulib, and I was able to configure and build ldms against libvariorum. PR #1008 has the updated m4 files.
I am not positive, but I believe these are the last of the changes that needed to be made.
@morrone We have addressed all your comments and tested this thoroughly, so I marked this PR ready for review.
Hi @morrone, I fixed the typo around enable_variorum
. Please re-run tests when you get a chance, hopefully they will pass correctly this time :)
@tom95858, @narategithub, could one of you please approve the workflows to run on this PR?
@tom95858, @narategithub: I've removed the trailing whitespace, which should fix the tests passing. When you get a chance, could you approve the workflows (hopefully last time)?
@narategithub @morrone @tom95858: I believe this PR ready to go now. Do let us know if there's anything else we need to do for this PR.
Hi everyone, is this ready to merge?