spec: tags for defaults in arrays/tables/maps
Status Quo
Note: if you already know about the problems with the _ and # wildcards in specifications, skip to the next section
When writing a specification, there are 2 basic kinds of keys you can specify:
- Fixed Name Keys

  For keys that have a fixed name (e.g. server/port) you can just define the specification directly.

  # ni format
  [server/port]
  type = unsigned_short
  default = 8080

- Dynamic Name Keys

  Sometimes there are cases where you don't know the full key name when writing the specification. Some parts of the name are chosen by the user. Currently we have _ (single non-array part) and # (single array part) as wildcards for these cases.

  # ni format
  [subdomains/_/port]
  type = unsigned_short
  description = "port to be used for a subdomain, i.e. to route foo.example.com to port 8081, set subdomains/foo/port = 8080"

  [locales/#/country]
  type = string

  [locales/#/language]
  type = string
However, the current implementation has some serious limitations (especially for _) when it comes to default values. As long as you don't use default, both _ and # work fine. The spec metadata is copied over to the existing keys and those will be checked by various plugins. But if there is no key matching the wildcard and a key should be generated from the default spec, things get tricky. The _ wildcard cannot be used at all for generating keys (because we don't know which keys to generate in advance). The # wildcard has some limitations as well. If a default array size was specified, then we can generate that number of keys. But if there is no default, or we want something other than the default size, then the desired array size has to be stored somewhere. That means storage plugins need to be able to store the array metadata. Another issue arises if we want the default for most array elements, but a few of them should be overridden. That would require storing "arrays with holes".
All of this leads to some awkward specifications where the # is used in very specific ways to circumvent all the problems.
Differentiating use cases
I see a few different use cases for the # and _ wildcards:
- A simple array of values (foo/#): Sometimes you need a simple array of strings/numbers/etc. For example, this could be a list of files that should be read.
- A simple map of values (foo/_): An array is not always sufficient, sometimes you need a set of key-value pairs, or mappings. For example, in an embedded application you may need to map keys on a keypad to functionality.
- A list of complex values (foo/#/bar, foo/#/baz, ...): If a single value is not enough, you may need something like an array of objects. In Elektra this means the spec keys have a common prefix, the array wildcard, and a fixed suffix. An example is shown above with the locales spec, where we want to store the language (e.g. en) and country (e.g. GB) part of the locale separately. An advanced case of these lists is when not all "objects" in the array have the same schema. Currently, this is not easily possible with Elektra's spec system, because it is very rarely actually needed. You can just use two separate arrays, unless you actually need to maintain an order between different types. If an order needs to be maintained, you can use an array of references.
- A map of complex values (foo/_/bar, foo/_/baz, ...): These are essentially the same as "list of complex values", but instead of using an array index to refer to the individual "objects", we use a name. Because of the many limitations of _, currently this is often replaced by a foo/#/bar type spec, where one of the keys represents the name. This can be quite awkward, because while elements can have names, referring to an element by name is hard (you basically need to loop over the array indices).
As stated above, all of this works fine until you want to use defaults. But there might be a solution.
A possible solution: Tags
For the simple arrays/maps the solution is easy: There cannot be defaults for individual elements. It simply doesn't make sense to say "I don't know the keys of this map, but the values default to X" or "I don't know how big this array should be, but its elements default to X". There may be a use case for defining a default for the entire map/array, e.g. specifying that the default for foo/# is [1,2,3] (JSON) or the default for foo/_ is {"left": "open", "right": "close"} (JSON).
For the more complex cases, I propose a kind of restriction as a solution. To define a default on a key that contains _ or # wildcards, you must define a "tag" key for every wildcard. For example, the specification (derived from LCDproc) could look like this:
[drivers/#]
array =
array/tag = type
[drivers/#/type]
check/enum = #3
check/enum/#0 = hd44780
check/enum/#1 = curses
check/enum/#2 = text
check/enum/#3 = xosd
[drivers/#/size]
type = string
check/validation = ([1-9]+[0-9]*)x([1-9]+[0-9]*)
check/validation/match = LINE
check/validation/message = Not a valid size declaration. Examples: 20x4, 19x3, 40x150
default = 20x4
The trick here is that the default for drivers/#8/size will only be applied if e.g. drivers/#8/type is defined. That way, we don't actually rely on e.g. array = #9 being stored as metadata on drivers. This allows us to use e.g. this TOML file
[[drivers]]
type = "curses"
[[drivers]]
type = "xosd"
size = "200x40"
to define a curses driver with the default 20x4 size and an xosd driver with a custom size.
Note: the above TOML file defines these keys:

drivers/#0/type = curses
drivers/#1/type = xosd
drivers/#1/size = 200x40
Another advantage of this tag system is that it works for the _ wildcard as well. If we replace the # with _ in the example spec above, we can give names to the driver instances and write:
[drivers.xosd1]
type = "xosd"
[drivers.xosd2]
type = "xosd"
size = "200x40"
Note: the above TOML file defines these keys:

drivers/xosd1/type = xosd
drivers/xosd2/type = xosd
drivers/xosd2/size = 200x40
Furthermore, we could extend the system to allow for conditional specs. I'm not certain how exactly we would specify this, but in principle it should be possible to mark a part of the spec so that it is only used for certain values of the tag key.
Conclusion
There is more to discuss here and I have already thought about some of it (I have some ideas for the implementation already). But this issue is already long enough, so I will wait for some responses to the general concept, before explaining further.
To me the biggest advantage of this concept is that AFAICT it means we don't need to store metadata outside of spec:/ anymore.
Basic idea of implementation
We see that e.g. drivers/# has array/tag = type, so we look for existing keys matching drivers/#/type. For every matching key we find, we remove the base name to get e.g. drivers/#9. We then apply the spec below drivers/# to the keys below drivers/#9. This should work for _ too, as well as for keys with multiple wildcards, e.g. my/_/weird/#/wildcard/key/_, as long as there aren't two wildcard levels in a row.
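To make this a bit more concrete, here is a minimal, self-contained sketch in plain C (not the actual spec plugin code) of the matching step: checking whether a concrete key name matches a spec key name that contains # and _ wildcards. It works on simple slash-separated names and ignores namespaces, escaping and Elektra's padded array indices (#_10, #__100, ...):

/* Sketch only: match a concrete key name like "drivers/#9/type"
   against a spec key name like "drivers/#/type". */
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool partMatches(const char* specPart, const char* keyPart)
{
    if (strcmp(specPart, "_") == 0) return keyPart[0] != '#'; // _ matches any single non-array part
    if (strcmp(specPart, "#") == 0) return keyPart[0] == '#'; // # matches any single array part
    return strcmp(specPart, keyPart) == 0;                    // everything else must match literally
}

static bool nameMatchesSpec(const char* specName, const char* keyName)
{
    char spec[256], key[256];
    snprintf(spec, sizeof(spec), "%s", specName);
    snprintf(key, sizeof(key), "%s", keyName);

    char* sSave;
    char* kSave;
    char* s = strtok_r(spec, "/", &sSave);
    char* k = strtok_r(key, "/", &kSave);
    while (s != NULL && k != NULL)
    {
        if (!partMatches(s, k)) return false;
        s = strtok_r(NULL, "/", &sSave);
        k = strtok_r(NULL, "/", &kSave);
    }
    return s == NULL && k == NULL; // both names must have the same number of parts
}

int main(void)
{
    printf("%d\n", nameMatchesSpec("drivers/#/type", "drivers/#9/type"));                           // 1
    printf("%d\n", nameMatchesSpec("my/_/weird/#/wildcard/key/_", "my/a/weird/#0/wildcard/key/x")); // 1
    printf("%d\n", nameMatchesSpec("drivers/#/type", "drivers/#9/size"));                           // 0
    return 0;
}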
I don't understand the proposal. Some questions which might help me understand:
- You say that you want to define tags, but suddenly array/tag gets specified? What is a tag, what is an array/tag? What is the difference? Are there other tags?
- "Note: the above TOML file defines these keys" doesn't give a hint what the tags do, they seem to be ignored?
- How do the tags make metadata outside of spec:/ unnecessary?
To understand what tags do, ignore all the examples, ignore array/tag and even ignore the word "tag" (it's just a name I used).
Now, the problem with the combination of default and _/# is essentially that we do not know what to replace _/# with to get all the keys we need to generate. In contrast, if we don't need to generate a default key, because the keys exist already, there is no problem.
For example, this spec works totally fine (*), because we don't define a default:
[foo/_/bar]
type = long
The following spec also works fine, but only if the default never comes into play. For example, because apple/granny_smith/color = green is explicitly set and we don't access any other keys.
[apple/_/color]
default = red
Now before we go further, I'll introduce a bit of simple terminology. In the following spec, I'll call foo an "array of objects" (**) or a "complex array" and foo2 a "map of objects" or a "complex map":
# array of objects
[foo/#]
array =
[foo/#/bar]
type = string
default = def
[foo/#/baz]
type = string
default = ault
# map of objects
[foo2/_]
map =
[foo2/_/bar]
type = string
default = def
[foo2/_/baz]
type = string
default = ault
Note: The map metakey to mark foo2 as a map is new. It simply mimics the array = metakey, which marks foo as an array.
With that out of the way, we can get one step closer to the actual proposal. To re-iterate, the problem is that in the above spec, we don't really know whether we need to generate the keys foo/#23/bar and foo2/fancyname/baz or not. But we already know that if these keys exist, everything is fine and we just apply the spec.
Now the trick is to think about "arrays of objects"/"maps of objects". So we say e.g. foo/#23/bar and foo/#23/baz form one object. We can call that object foo/#23. Suddenly, the problem is different. The question now is no longer: "Should we generate foo/#23/bar?". Instead it now is: "Should there be an object foo/#23?".
This new question isn't any easier to answer in general. If the object foo/#23 doesn't exist, we still have no idea if we should generate it. But what if foo/#23 partially exists already? For example, what if foo/#23/bar = "clearly defined value" is set, but foo/#23/baz is missing? Well, in that case it would only make sense to generate a default key for foo/#23/baz to complete the object.
While in theory it would be possible to build the functionality for always "completing partial objects", it's not that easy, and I don't think you can do it efficiently. Also, sometimes you don't actually want a spec to be treated as an "array of objects" or a "map of objects". But the whole thing becomes a lot easier if we always look at the same key to determine whether a partial object should be completed or not.
So in our example, maybe we say: If foo/#23/bar is set, we generate foo/#23/baz, but not the other way around. Then we just need to find all keys that match foo/#/bar and we'll know all the keys we need to generate. We just replace bar with baz in all the keys we found, check if the new baz keys exist and add those that don't exist yet. This clearly works, but we still need to tell the spec plugin somehow that bar is the key that says "complete this object". That's where array/tag comes in.
[foo/#]
array =
array/tag = bar
[foo/#/bar]
type = string
# default doesn't make sense here
[foo/#/baz]
type = string
default = ault
In the spec above, we define that:

- foo is an array of objects
- the "tag key" in foo is called bar
- the bar keys within foo are strings
- the baz keys within foo are strings and their default value is ault
Note: "tag key" is just a name, I will explain later why I call it a "tag"
This means, if I set foo/#19/bar = "something" in my config, the spec plugin will generate foo/#19/baz = "ault" during kdbGet(). Similarly, we could use _, map and map/tag instead of #, array and array/tag and then set e.g. foo/key/bar = "something" in the config to trigger the default for foo/key/baz.
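To illustrate the name rewriting this implies, here is a small sketch in plain C (again not real spec plugin code, just the idea on strings): given a concrete tag key such as foo/#19/bar and the base name of a sibling spec key (baz), it derives the name of the key that would receive the default; the plugin would then only generate that key if it does not exist yet:

/* Sketch only: derive "foo/#19/baz" from the tag key "foo/#19/bar". */
#include <stdio.h>
#include <string.h>

static int siblingName(const char* tagKeyName, const char* siblingBase, char* out, size_t outSize)
{
    const char* lastSlash = strrchr(tagKeyName, '/');
    if (lastSlash == NULL) return -1; // no parent "object" to complete

    int prefixLen = (int)(lastSlash - tagKeyName); // length of "foo/#19"
    snprintf(out, outSize, "%.*s/%s", prefixLen, tagKeyName, siblingBase);
    return 0;
}

int main(void)
{
    char name[256];
    if (siblingName("foo/#19/bar", "baz", name, sizeof(name)) == 0)
    {
        printf("%s\n", name); // prints: foo/#19/baz
    }
    return 0;
}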
Note: I ignore here that arrays should be continuous and that spec should probably enforce this. For this proposal, # and _ can be treated exactly the same. They only differ in which key names they match.
I hope this explanation was a bit more understandable to you.
(*) Without a default we can't really use the highlevel API, but let's ignore that for now, it won't matter in the end.
(**) I use the term objects, because often the keys .../bar, .../baz and their "siblings" are meant to mimic some data type or structure similar to struct { char * bar; char * baz; };
So, why do I call this key a "tag"? That comes from tagged union. In theory the tag concept could be used to introduce limited polymorphism into the specification.
[fruit/_]
map =
map/tag = kind
[fruit/_/kind]
type = enum
check/enum = #1
check/enum/#0 = berry
check/enum/#1 = stonefruit
[fruit/_/color]
type = string
default = "greenish purple"
[fruit/_/stonesize]
onlyif/tag = stonefruit # hypothetical (!)
type = enum
check/enum = #2
check/enum/#0 = small
check/enum/#1 = medium
check/enum/#2 = large
default = medium
A specification like this could mean that there are two kinds of fruit (berry and stonefruit), that both have a color, but only stonefruit have a stonesize, as well as some defaults and other specification. Now we could go ahead and write a config like this (TOML):
[fruit.strawberry]
kind = "berry"
color = "red"
[fruit.cherry]
kind = "stonefruit"
color = "dark-red"
stonesize = "small"
[fruit.magicstone]
kind = "stonefruit"
[fruit.magicberry]
kind = "berry"
This would say:
- Strawberries are red berries.
- Cherries are dark-red stonefruits with small stones.
- "Magicstone" is also a kind of stonefruit. It is greenish purple in color and has a medium stone.
- "Magicberry" is a kind of berry and also greenish purple in color.
In this case kind is the "tag key", so an actual tagged union version of this could look something like this in C:
enum fruit_tag {
BERRY = 0,
STONEFRUIT = 1
};
struct berry {
char* color;
};
struct stonefruit {
char* color;
int size; // I was too lazy to type the enum
};
struct fruit {
enum fruit_tag kind;
union {
struct berry b;
struct stonefruit s;
} data;
};
A type for the map of fruit is a bit hard to define in C, but an array of fruit could just be struct fruit *.
My previous answer leaves one question open:
3. How do the tags make metadata outside of spec:/ unnecessary?
AFAIK there is only one reason why we actually need to store metadata in config files outside of spec:/: to override a part of the specification. The most common case currently is to override the default size of an array. In theory there could be other types of override (e.g. restrict from type = long to type = short), but those are hard to do safely, so I don't think they are all that relevant. I will, however, still address them at the end.
Note: If you already know why array needs to be stored outside of spec:/, skip to the next Note.
For the case of array sizes, we need to consider two cases:
- The storage plugin doesn't support arrays natively.
- The storage plugin has native array support.
In the first case, the array metakey is currently the intended way to keep track of the array size. Storing the array metakey explicitly works fine, but most plugins have poor or no support for metadata. We could of course infer the array size from existing array elements, but that means we can't store arrays with holes at the end.
In the second case, the storage plugin is supposed to generate the array metakey based on its native format. But there is also a problem: most formats want arrays to be continuous. So we can't store any arrays with holes at all.
Why are arrays with holes important? They are needed to define a config that says: I want the default value on all array elements, but element N should be set explicitly. So why is this important? It seems a bit like a weird requirement. And it is. But with the current setup, this is sometimes required. Take for example LCDproc. It knows many types of driver and you can define a list of active drivers. You can also activate a driver multiple times with different configs. You could give a name to each driver instance and use a _ wildcard. But that doesn't work right now. So you need to use some kind of array. And then you will run into the problem of: I want one instance with the default config and another with a modified config. So the first instance should be entirely generated from the spec. But for that to work, we need to have holes in our array.
Note: skip to here if you already know about the need for arrays with holes
So how does the new tag stuff help here? Well, first of all, you can now use _ wildcards for the LCDproc driver example. This is shown in the original issue description at the top. (Sidenote: I prefer the _ version over the #, because then the driver instances actually have names. However, the TOML syntax lends itself better to the # version.)
And in general, you can use _ where you need non-continuous sets of config. This makes it possible to say: All arrays must be continuous. Specifically, if there exists any key below e.g. foo/#1, then there must also be at least one key below foo/#0. The same goes for foo/#2 and foo/#1 and so on. Storage plugins may put further restrictions on that, e.g. some plugins might not support heterogeneous arrays or maybe some kinds of nesting are not allowed.
If arrays must be continuous, we can simply infer the array = metakey inside the spec plugin. Storage plugins (even those with array support) wouldn't need to generate anything, although it may make sense as a form of validation, i.e. if array metadata exists, spec checks that it is correct.
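As an illustration of that inference, here is a minimal sketch in plain C (not the real plugin logic) that derives the value of the array metakey from the indices of the existing elements and rejects arrays with holes; it assumes plain numeric indices and ignores the underscore-padded index syntax (#_10, #__100, ...):

/* Sketch only: infer "array = #<max>" from existing element indices. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static bool inferArrayMeta(const int* indices, size_t count, char* meta, size_t metaSize)
{
    if (count == 0) return false; // empty array: nothing to infer

    bool* seen = calloc(count, sizeof(bool));
    if (seen == NULL) return false;

    for (size_t i = 0; i < count; ++i)
    {
        // an index outside 0..count-1 or a duplicate index means the array has holes
        if (indices[i] < 0 || (size_t)indices[i] >= count || seen[indices[i]])
        {
            free(seen);
            return false;
        }
        seen[indices[i]] = true;
    }
    free(seen);

    snprintf(meta, metaSize, "#%zu", count - 1); // value of the inferred array metakey
    return true;
}

int main(void)
{
    char meta[32];
    int continuous[] = { 0, 1, 2 };
    int withHole[] = { 0, 2 };

    if (inferArrayMeta(continuous, 3, meta, sizeof(meta))) printf("array = %s\n", meta); // array = #2
    if (!inferArrayMeta(withHole, 2, meta, sizeof(meta))) printf("array has holes, rejected\n");
    return 0;
}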
I also promised to address other kinds of "overriding specification", so here we go. I don't have a way to do it safely, and I certainly don't have a way of overriding specification without actually storing the override metadata. However, I think we could just store the override specification as a separate file next to the actual config. I know this breaks the atomicity of reading a mountpoint. But I don't think this would matter. The specification is already stored in a different file, so a third file wouldn't change much. A mismatch between spec:/ and e.g. user:/ is already possible. Let's say we change the type in spec:/foo from string to long, and in the same kdbSet() we also change user:/foo from abc to 123. There will be a moment when spec:/foo was already written to disk, but user:/foo was not. If at that moment a kdbGet() call reads both spec:/foo and user:/foo, we get a mismatch between spec and config.
I like the general idea to avoid specifications in other namespaces and the examples look useful to me.
I wonder if we can simply assume tags on the parents, i.e., the level below where array or map (which we'll probably need anyway to define what the tag for _ is) is specified, e.g., given the specification:
[foo/#]
array:=
[foo/#/bar]
default:=20
and the config:
foo/#0
foo/#0/bar = 10
foo/#1
In the case of foo/#0 there is already a bar, so there is nothing to do, but in foo/#1 we have the "tag" but we do not have foo/#1/bar, so we create it with the default.
Imho "tag" is a bit misleading. As you write yourself the "tag" (without further "tag/onlyif" or similar) is not a tagged union. But if we can put the semantics you had in mind directly to array and map, we will not need the name "tag" anyway, will we?
I'm not 100% sure and I didn't check, but I believe that if foo/#1 has a value, we already generate a default key for e.g. foo/#1/bar. If we don't, it would be easy to add this, yes. However, the problem here is that this requires non-leaf values, which many formats (including TOML) don't support natively.
But if we can put the semantics you had in mind directly to array and map, we will not need the name "tag" anyway, will we?
For map this works, but we already use array to define the default array size, e.g. array = #3.
In theory it would be possible to treat any specified child as a "tag". So with the fruit spec from above, I could define either color or stonesize and the other one would be generated automatically. You couldn't generate an element with just default values, but that could be fixed in the spec by adding a dummy key that is not used by the application. Then you can just set the dummy key and get the defaults for all the other keys.
The problem with this approach is that it is harder to implement and much more computationally intensive. We need to look for many keys instead of just one. Limiting the "tags" to only direct children helps, but it's still not as easy as an explicit tag key.
Maybe simply using a different name is the solution, e.g. marker. It's pretty generic, so it doesn't give any ideas and it still fits because the key "marks" the element as existing.
However, the problem here is that this requires non-leaf values which many formats (including TOML) don't support natively.
Please explain, why would it be required? I mean that foo/#1/bar without foo/#1 should be an error (invalid array).
The problem with this approach is that it is harder to implement and much more computationally intensive.
That is why I suggested that only the elements directly below array or map are the markers. (Btw. would be interesting how much more computationally intensive the different variants actually are. This would be something for a master thesis.)
Limiting the "tags" to only direct children helps, but it's still not as easy as an explicit tag key.
You mean as easy for the spec plugin? I think for the person writing the specification, it is easier if no markers are needed.
Maybe simply using a different name is the solution, e.g. marker.
Yes, e.g. array/marker already sounds better. I see that there can be use cases for this (also for tagged unions) but I would prefer if standard arrays and maps are as simple as possible.
I mean that foo/#1/bar without foo/#1 should be an error (invalid array).
What would the value of foo/#1 be? Empty string? NULL? If the parent is required like this, how would you define that in TOML?
For foo/#0 (where foo/#0/bar is explicitly set), the toml plugin can generate foo/#0, but for foo/#1 I don't really see a way to do it.
That is why I suggested that only the elements directly below array or map are the markers.
Depending on the spec this could still be a lot more work.
would be interesting how much more computationally intensive the different variants actually are
The real world impact would be very hard to measure, as it very much depends on the specification. However, the theory tells us that an explicit marker is definitely better (by some non-zero amount). Essentially, we go from O(1) for an explicit marker to O(n) (where n is the number of spec keys directly below the array parent). For small n this wouldn't be bad, but this is per key, so we actually go from O(N+A) to O(N+A*n) (A is the number of array/map parents, N is the number of other spec:/ keys).
Note: This assumes that processing a single spec key is constant time. I also ignored the fact that arrays may be nested, which makes the whole thing even more expensive.
You mean as easy for the spec plugin? I think for the person writing the specification, it is easier if no markers are needed.
For writing the spec I don't think it makes a big difference. Writing the config could be annoying, if you need markers everywhere.
I would prefer if standard arrays and maps are as simple as possible.
I will think about it some more, maybe we can find some restricted cases where the marker is not required to achieve close to O(1) performance.
ping
https://tiss.tuwien.ac.at/thesis/thesisDetails.xhtml?thesisId=104004