ovis icon indicating copy to clipboard operation
ovis copied to clipboard

Multi Instance Plugins

Open narategithub opened this issue 1 year ago • 12 comments

This series of commits include:

  • multi-instance plugin support in ldmsd
  • apply ovis_ref to cfgobj
  • Making the following plugins multi-instances:
    • store_sos (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_store_sos_test)
    • store_csv (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_store_csv_test)
    • store_avro_kafka (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_store_avro_kafka_test)
    • json_stream_sampler (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_json_stream_sampler_test)
    • test_sampler (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_test_sampler_test)
    • procnetdev2 (test: https://github.com/ovis-hpc/ldms-test/blob/stage/multi_json_stream_sampler_test)

narategithub avatar Jan 15 '24 16:01 narategithub

I would really, really like to see some changes to the approach taken here.

First of all, I would love to see plugins only have to have implement known API functions, rather than bundling function pointers into an ldmsd_plugin struct. The ldmsd_plugin struct, and how it tracks the functions pointers from a plugin, are ldmsd implementation details and should not be exposed to the plugin author.

Next, I would really, really, really want to see us do away with this practice of forcing plugin authors to play tricks of wrapping the ldmsd_plugin struct with hidden state data after the end. The plugin instance's state data should be passed back explicitly when the intance is initialized (for instance, it could be the return value from config()). The state data for an instance should be explicitly passed in to every subsequent call. This is pretty standard in most C code that I see. Moving things from hidden-and-implicit to being explicit in the API will make things much easier to understand for new plugin authors.

We need a new plugin API call to allow the plugin to express whether or not it is multi-instance capable, and ldmsd should guarantee that if the answer is "no", that it will not allow attempts to configure more than one instance.

I think it may be in scope, but maybe this is more debateable: we need the API calls fully documented in the header file, including their semantics. As things stand, the semantics are muddy at best. For insance, it should absolutely be enforced by ldmsd that it will never call deconstructor functions on an instance that has not had its constructor called. The sample() functon should never becalled on something that has not had its constructor called, or after the destructor has run. We shouldn't have to have every single plugin author reinvent the wheel when we can handle these semantics once in ldmsd, and document the semantics.

morrone avatar Jan 17 '24 23:01 morrone

I don't know how much further review I should continue to do. If my high level requests are addressed, I think a number of things will naturally fall out of that and get better.

For instance, the lack of symmetry and clarity in just allocating and freeing the memory for plugin instances needs to be addressed. The plugin author (at least for a store that I am looking at right now) needs to allocate the instance state in get_plugin_instance(). But there is no matching put_plugin_instance(). Instead the author must free the memory in ptr->store.base.term().

We need to call "ldmsd_store_alloc()" in get_plugin_instance(). We would then expect to call ldmsd_store_free() in term(), right? But instead we need to call ldmsd_store_cleanup() and then free().

morrone avatar Jan 18 '24 18:01 morrone

What are the symantics of sample() in the new paradigm? Does sample() need to be reentrant, or will ldmsd ensure that it is not required?

morrone avatar Jan 18 '24 18:01 morrone

@morrone Thank you for a detailed review :)

I think the structure extension is quite standard practice. And I'm not sure if we should force the names of the functions. I think @tom95858 can elaborate on this matter better than me.

After plugin creation, all calls from ldmsd to the plugin always supply the plugin structure that was created by the plugin. It is the same as self in Python. The plugin can track its state in the extended part of the structure.

I agree with you that the plugin creation is quite cumbersome. I'll change the ldmsd_sampler_alloc() and ldmsd_store_alloc() to also receive the necessary options and all common initialization can be done in there.

The existing plugins that does not have get_plugin_instance() is guaranteed to have only 1 instance. If the plugin implements get_plugin_instance(), then it must support multi instances.

sample() is thread safe by plugin instance. What I mean is that sample() function of a plugin (e.g. procnetdev2) can be called simultaneously but for different instances of that function.

There is also a lack of developer documentation that shall be addressed. The asymmetric that you mention should be either addressed or explained in the next update.

Again, thank you for the review :)

narategithub avatar Jan 19 '24 15:01 narategithub

I think the structure extension is quite standard practice.

I can't speak to whether it is common in some circles, but it certainly isn't standard. I think you would be hard pressed to find that called out as a best practice in text books. In fact in most uses, it is not the best approach. There is no good reason to make the plugin's state data a hidden data at the end of ldmsd's internal pointer structure, when we can easily make them separate and explicit.

I think there are various reasons why this is wrong from a code readability and maintainability standpoint, but those would take a long discussion of softrware development pracitices to make a good case for. Here is a clear technical reason why this is a bad approach:

This approach virtually guarantees that plugin ABI compatibility is impossible. Any time we add a function to the API, or tweak the structure in any way, all plugins must be recompiled and tracked against a specific build of LDMS.

So one big reason that I believe that my proposed approach to plugin API design is that it entirely allows for making backwards-compatible API improvements in a clean and straight-forward way. New optional API functions do not break the ABI.

morrone avatar Jan 19 '24 18:01 morrone

Temporarily, to move the conversation forward, I'll set aside the idea of exporting all the of the API functions in a plugin module (rather than exporting one function that returns the struct of API function pointers). I think we can address my other major concern pretty easily.

To try to expand on why I think it is bad practice, in this case, to hide an instance's data at the end of the API function pointer structure, I think basically we are failing to have good separation of concerns. We are creating a mishmash of different concepts that don't really go together. In this case, I think there should be a clear division between:

  1. Loading and discovering a plugin's exported functions (and by "plugin", here I mean the shared-library module file)
  2. Launching and interacting with instances (of a sampler, store, etc) using the plugin's exported functions

In the proposed change, these are combined. But we should only load the module/plugin and learn its' exported functions once. On the other hand, instances and their state need to be capable of multiple interactions.

This PR proposes that we replace get_plugin() with get_plugin_instance(). Instead of that, how about we keep get_plugin(), and make its job to just return the struct of API function pointers. And we do not hide any plugin state hanging off the end of that structure.

Next we introduce a new pair of functions to the API struct:

  /* Construct a new instance
   * RETURN: Opaque pointer to the new instance's internal state, or NULL is unable to do so
   */
  void *instance_init([whatever parameters]);

  /* Destroy an instance 
    * IN self: Opaque pointer to the instance's internal state
    */
  void instance_fini(void *self);

By doing this, we eliminate any need to combine the API pointer struct (which is something that is only used internally in ldmsd), with the instance state (which is only something used internally in the plugin).

Since we are changing the API, I'd like to take the opportunity to improve naming a bit. Instead of calling it "get_plugin", maybe the name could be:

ldmsdplug_import

And perhaps we could then at least export a second function named something like:

ldmsdplug_export

I think these suggestions untangle the the two concepts (a module and an instance), improves encapsulation, and helps to introduce a stronger degree of degree of symmetry to the API.

morrone avatar Jan 19 '24 19:01 morrone

Hi @narategithub, we have mixed the ldmsd_noun_verb_ function naming and ldmsd_verb_noun_ in some places. We should make these consistent in the API. I think most of the API is ldmsd_noun_verb().

tom95858 avatar Jan 22 '24 23:01 tom95858

@morrone, @narategithub let's get a bit more specific:

Requirements

  • Support a single plugin with multiple configurations
  • Support existing plugins that do not support multiple configurations
  • Make it easier, not harder to write plugins
  • Better documentation

Currently Non Requirements (but may be desirable)

  • Support running OVIS-4.3.x plugins in a OVIS-4.4.2 ldmsd
    • This creates an impossibly complicated test matrix that I don't believe that we want to undertake
  • Completely hide a plugin's context data inside a structure understood only by the plugin
    • It is currently not the case that a plugin's data is entirely private to the plugin, the most obvious example is the sample interval and offset. These are known to the updater in order to schedule calls to sample()
    • We could however, have plugin functions that export the data that needs to be shared, e.g. interval/offset.

Other Thoughts

Having a base structure that is overloaded to contain additional data for those plugins/libraries/clients that understand the extensions to the structure is quite common in my experience; especially in the Linux kernel. I don't have any numbers on how often and/or if this paradigm is frowned upon in academic circles. That said we could trivially avoid this with a context pointer in the self pointer that is handed around.

Let's maybe plan a white-board meeting on Zoom to toss around ideas?

tom95858 avatar Jan 22 '24 23:01 tom95858

I don't think anyone is suggesting the "Currently Non Requirements", so I think we're all good there.

Having a base structure that is overloaded to contain additional data for those plugins/libraries/clients that understand the extensions to the structure is quite common in my experience; especially in the Linux kernel.

First, the kernel is not always the best example for user-space programming decisions. The kernel needs to deal with lower level details that often explain what they are doing. And often the kernel has very good reasons for the niche techniques that it employs that would not also apply here. And of course historically some of the kernel's C usage has been gcc-specific rather than generally C-compliant. And I think if we look closer at those use cases, we'll find that they don't apply here.

I don't think it is only academics that would frown. :)

This isn't really "subclassing". We're in C, not C++, and there are probably quite a few more things that would need to be done to make this qualify as subclassing. And more importantly, we are talking about designing an API, and we are not setting up proper abstraction across that API.

But let's consider for a moment that this is subclassing. If that were the case, then its an example of bad subclassing. It is a mistake in code abstraction. The plugin never calls a single method in the class. It doesn't even employ any of the data in the base classes. The only data it uses is its own internal data, which is hidden at the end of the class. Combining unrelated classes in a memory management trick would be less than desirable coding practice. So there is a clear division in use of data and methods, but for some reason, this code smashes them together.

This is going to be strange and unexpected for any developer new to this code base. I can tell you it already was when I started on LDMS, and this moves in the directions of even more unusual. It violates the Principle of Least Surprise. And like you said, one or our goals is "Make it easier, not harder to write plugins".

And like I said, and I do think this is important for any API design, we are designing an API in which backwards compatible evolution of the API (retaining ABI compatibility) is just about impossible.

I agree that putting a context pointer in "self" would be better, but far better still would be for "self" to be the context pointer.

I'm up for a Zoom meeting!

morrone avatar Jan 23 '24 04:01 morrone

Setting aside my thoughts about the instance model mechanics, where is the discussion of the content (methods and their lifecycle semantics for multinstance plugins) of the APIs for samplers, stores (and if we're ambitious, transforms?) I didn't see it in the headers and at least one thing I did see in a header (get_set) rather disturbs me. Maybe the cart is before the horse?

for example, it would strongly simplify sampler and stream plugins (eliminate the coder needing to enforce data safety with explicit use of pthreads) if the "what will be called when and what will not" was fully documented (as has come up in among other discussions, the dcgm plugin).

for example, taking a sampler which does not work by subscribing to a stream, we should have the framework guarantee an order of operations something like:

load
config[,config*]
start
sample
stop
[config*; start; stop]
term

meaning that stuff in [] is possible/optional and that the ordering above is strict (config and term are not called while between start and stop; config is called at least once; stop is called before term; nothing else is called while a sample is in progress). The central problem we currently have is that config/stop/term can all happen due to the command line while the sample loop is on a timer.

baallan avatar Jan 30 '24 21:01 baallan

@baallan I do generally agree with all of that. I was alluding to the same thing when I questioned the semantics earlier.

It looks like get_set() is something that is being brought forward into the updated API, when it really should just be eliminated. I'm not even sure of its existing semantics, let alone what they might be in the new API. Plenty of the new samplers just return NULL to get_set(), so it can't be very important?

morrone avatar Jan 31 '24 02:01 morrone

Re get_set: If it exists, something like it should be a utility function provided by the framework (which is managing underlying lists anyway) and not put on the plugin author to implement.

I can picture the utility of 3 functions (pseudocode, not derived from any current code) managing lists of objects:

// let people find the plugins in which they may be interested
ldmsd_plugin_class_iterator ldmsd_plugin_class_iterator_get()

// let people find the instances of a specific factory plugin in which they may be interested
ldmsd_plugin_class_instance_iterator ldmsd_plugin_class_instance_iterator_get(ldmsd_plugin_class c)

// let people find the entire list of sets registered by a single plugin instance (whether factory or singleton)
ldmsd_plugin_instance_sets_iterator_get(ldmsd_plugin_instance i)

yes of course this is c and not c++, so maybe we should introduce the word factory (singleton plugins and factory plugins) to clarify the behavior patterns for new developers. For some samplers and stores, singleton really is the obvious implementation and it should be easy, while for others the factory pattern may be the most appropriate. And of course in the // comments by 'people' I really mean generalized plugins which do things like transform data as it flows through the pipeline.

I don't see the value in trying to make a single set of function names which encapsulates both singleton and factory plugins. That just introduces ambiguities and conditional behavior points in the documentation of api functions, which makes the documentation hard to write and maintain.

baallan avatar Jan 31 '24 16:01 baallan