ovis
ldms plugin option "schema" issues
ldms plugins are currently required to parse an option named "schema". There are a few ways in which this does not seem to be a good requirement.
First of all, how are plugins that use multiple schemas supposed to handle a single schema name configuration option?
For plugins with a single statically coded schema, it would seem like the schema name should be chosen by the person writing the plugin. Allowing users to give different names to the same exact schema seems like a recipe for unnecessary confusion.
We might also consider either explicitly introducing a schema version field in metric sets, or perhaps developing guidelines for plugin developers on how to include versioning in the schema string. Then, when changes need to be made to a statically coded schema, the programmer has a standard way to communicate to users that the schema has changed.
This is in essence a duplicate of #45, and could have similar solutions. Personally, I would like to see the use of explicit schema names at the sampler config level disappear in most cases. This could be accomplished by the schema name always being derived from a sampler-author-defined base name (meminfo) and a suffix generated by our API from a hash of the schema details. If the plugin author knows a better suffix generation method, they could use it instead.
This could be a backward-compatible API extension for v4.
@tom95858 mentioned at LDMSCON the possibility of including a hash of the schema so that set/schema handling code can know with reasonable certainty that two schemas are identical in every detail. Since cryptography is not at issue but uniqueness is, I would like to suggest that we add to schemas the CityHash64 value of the schema. https://github.com/google/cityhash
- It gets computed once at schema creation.
- It's easily represented (uint64).
- It's already in our code base in lib/src/third/city.h
- It's easily converted to a string (hex digits or uuencode) that can be uniquely appended to schema base names like 'meminfo_' for cases where there is no obvious uniquification such as adding the number of cores to the schema name procstat_%d. This would let us get admins out of the business of thinking about schema names except in rare cases.
I'm not clear whether this is a protocol change relative to v4.3 or just a pure plugin API extension. The hash could be stored as a conventional metadata item or baked into the struct (and which of the two it is may be an implementation detail that varies from v4 to v5).
I was referring to a quick way for a client of a set to determine whether or not two sets have identical schema given that the schema name is not reliable. Computing the schema name does not seem workable to me given that this name will be used to define measurement names for influxdb and sos; and is a configuration option that maps store policies to metric sets. If schema name were computed, it's effectively impossible to use it as a parameter to a storage policy.
-- Thomas Tucker, President, Open Grid Computing, Inc.
Please share an example of how a schema name becomes a measurement name.
As a separate issue, why would a computed schema name preclude configuring storage policies?
We already have computed schema names (procstat_16, procstat_72) that we have to deal with. The fact that currently the suffix has to be computed mentally by an administrator instead of being a documented feature of the particular plugin just makes the plugin harder to use.
Going down a separate v5 design wormhole, independent of the config file syntax issue:
v5 storage policy specification shouldn't be as inflexible as it is in v4, and in particular it should perhaps allow defining a policy based on set instance name expressions and/or schema wildcard expressions rather than (or in addition to) unique schema names. Binding such a specification down to C-level instances of storage policies isn't something an administrator should ever have to deal with. Given that in v5 we will have properly independent storage plugin instances, I'm not clear why we still need separate storage policies at all.
I don't consider the genders-generated object naming convention a valid use case.
If you are proposing a schema-less design, then that's fine, we can have that discussion, but having a nonsense string for a schema name is not a compelling design choice.
Automatically (at the sampler writer's opt-in behest) disambiguating schema names has nothing to do with genders or any other configuration syntax. I'm not suggesting a schema-less design, but I am suggesting schema names (rather than schema families) should be a lot less in the face of both administrators and analysts. This is a larger design discussion that goes well beyond the scope of the original issue 67, and this is almost certainly not the correct forum or list of participants.
Whatever treatment of schemas we end up with, it looks like TOML or a de-quoted TOML would be a sound path for composing readable, cluster-friendly specifications of collection, aggregation, and storage.
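For the sake of discussion, a TOML-based storage specification might look something like the fragment below. Every key name and the overall structure are purely illustrative, not a proposed concrete syntax:

```toml
# Hypothetical sketch of a TOML v5 storage specification.
# All key names here are invented for illustration.

[[store]]
plugin = "store_sos"
path = "/var/ldms/sos"
# bind by schema wildcard instead of an exact, hand-computed schema name
schema_match = "procstat_*"

[[store]]
plugin = "store_influx"
schema_match = "meminfo*"
# optionally restrict by set instance name as well
instance_match = "*/compute-*"
```

The point is that wildcard binding plus independent storage plugin instances could subsume what v4 storage policies do today, while staying readable to a cluster administrator.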