Merging sequences (like Merge keys for mappings)
In #35 a syntax to merge sequences was proposed, but that syntax isn't going to happen, although I can understand that it would be useful to have a standard way of doing this.
There are plans to add programmatic features in YAML 1.3, as @ingydotnet mentioned, but they simply don't exist yet.
The question is, can we come up with something simple, that doesn't introduce new syntax?
Here are some suggestions:
- @ingydotnet 's suggestion was:
array1: &my_array_alias
- foo
- bar
array2:
- <: *my_array_alias
- baz
- My alternative suggestion would be:
array1: &my_array_alias
- foo
- bar
array2:
- <<
- *my_array_alias
- baz
This approach would be very close to the merge key feature, also in terms of implementation (I have just implemented this in my YAML processor).
- Another variant would be almost equal to @ingydotnet's suggestion:
array1: &my_array_alias
- foo
- bar
array2:
- <<: *my_array_alias
- baz
I can't think of a reason not to use the same << as for merge keys.
Thinking about implementation, IMHO suggestions 1 and 3 will be much harder, because one would have to add additional handling for this on several levels. Suggestion 2 can be more verbose, especially when you want to merge more than one sequence, but it should be comparatively easy to add to an already existing merge key feature.
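For concreteness, here is a rough sketch of how suggestion 2 could be bolted onto PyYAML today. The SeqMergeLoader name and the "splice the next item" semantics are my assumptions about the proposal, not anything standardized; it relies on the fact that PyYAML's resolver already gives a plain << scalar the !!merge tag:

```python
import yaml

class SeqMergeLoader(yaml.SafeLoader):
    """Sketch loader: a plain '<<' item splices the next sequence item."""

def construct_seq(loader, node):
    out, splice_next = [], False
    for child in node.value:
        # PyYAML's resolver tags a plain '<<' scalar as !!merge
        if child.tag == 'tag:yaml.org,2002:merge':
            splice_next = True
            continue
        value = loader.construct_object(child, deep=True)
        if splice_next and isinstance(value, list):
            out.extend(value)  # splice the aliased sequence in place
        else:
            out.append(value)
        splice_next = False
    return out

SeqMergeLoader.add_constructor('tag:yaml.org,2002:seq', construct_seq)
```

Loading the suggestion 2 example with yaml.load(doc, Loader=SeqMergeLoader) would then yield array2 == ['foo', 'bar', 'baz'].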
More a question than a suggestion, as I might be overlooking something obvious. Why not just provide a "flattened entry" key?
array1: &my_array_alias
- foo
- bar
array2:
+ *my_array_alias
- baz
this loads + with the "flatten" meaning.
Possibly, this could work regardless of the list syntax (JSON style or multiline). All these entries should have the same result
flat: [a, b, c, 1, 2, 3]
classic:
- a
- b
- c
+
- 1
- 2
- 3
mixed:
- a
- b
- c
+ [1, 2, 3]
json:
[a, b, c, + [1, 2, 3]]
@DanySK this looks nice, I agree. But it would mean a new syntax element, so more work for parsers, and another thing (+) which would be disallowed in plain scalars.
I think this should be avoided like explained in #35.
If we get a new syntax element (or more), then it should be something general which will be able to add more programmatic features, and the basic syntax should not get more complex.
I know this feature would be useful and I could use it myself, but I can't think of a simple solution.
Thanks @perlpunk
Would overloading a currently disallowed symbol in plain scalars (e.g., |) cause backward-compatibility issues? As said, the idea came off the top of my mind; I did not consider all the consequences.
Yes, there will be some more work for parsers, and yes, I do understand the YAML specification is already large. However, let me for a moment look at this from a higher perspective: why does YAML exist? Why don't we just use JSON or TOML? I can see two main reasons:
- readability is higher than JSON, and than TOML for complex documents; but more importantly
- support for reuse via anchors, enabling DRY strategies.
From this point of view, I advocate for further (non-simplifying) syntax changes to be considered, provided they lead to clear, unambiguous, YAML-style forms of reuse.
What about syntax like this:
array1: &my_array_alias
- foo
- bar
array2:
-*my_array_alias
- baz
I'm not going to argue at length for this, just wanted to propose this syntax. I'm sure this suggestion requires parser and language changes, but it doesn't introduce any new reserved symbols and sort of fits in with the feel of a list by starting with a dash.
For sure there is a real need for this feature. We only need to be careful about how we describe its behaviour, because lists are ordered and can allow duplicate entries.
AFAIK, I do not see a need to allow duplicates (in fact preventing them may count as an advantage). Still, the order of entries may be critical: do we want to merge at the top, bottom, or middle? Which duplicate entry takes priority, the one from inside the default or the override?
@ssbarnea switching from lists to sets is madness, you'd lose also JSON compatibility. Let's keep the discussion clean, the only needed feature is a "flatten-inside" operator. But I can't see any way of introducing it without tinkering with the grammar
I did not propose any divergence from JSON; it would be insane to do so. I only wanted to state that we need to define the merging logic very well, so implementations would not have different behaviours.
I have had lots of cases where I had a default list and wanted to add new entries to it, at the top or bottom. Based on experience, I have encountered cases where the tool loading the list would choke if it finds duplicates (a list of packages to install). This is why I mentioned that set-like behaviour when doing the merge could be desirable.
Don't mix your specific use cases with the general framework. Performing a set union would for instance be irrelevant for most of my uses and deleterious (making the feature useless) for others. I think the only meaningful merging behaviour for a list is list insertion.
What'd be the output of merging a [2] into a [1, 1] list? [1, 2]? That'd be very surprising to me.
@bbrouwer
but it doesn't introduce any new reserved symbols
Well, it does introduce new syntax. The plain scalar -*foo is perfectly fine currently.
If I rewrite your example:
array2:
-*my_array_alias
then something which is valid YAML right now (a plain scalar) changes its meaning. Also, this would look odd in flow style:
array2: [ -*my_array_alias ]
I still think that introducing new syntax for implementing just one specific programmatic function is wrong. And I bet many other people who actually implemented a YAML parser would agree.
It would also be totally different from merging mapping keys, which happens in the constructing level, while a new syntax element introduces a new type of parsing event.
Don't mix your specific use cases with the general framework
I think this is also a good example why introducing a specific programmatic element is not a good idea.
Merge keys are already not trivial to implement (look into pyyaml and try to figure out how to implement forbidding duplicate mapping keys while keeping the merge key behaviour).
And some might rather want a deep merge, which is not what merge keys do. Would you introduce another merge key (e.g. <<<) just to implement deep merging?
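As a concrete reminder of the current behaviour: PyYAML's merge keys are shallow, so a key present in both mappings is replaced wholesale rather than merged recursively (the document and key names below are made up for illustration):

```python
import yaml

doc = """
base: &base
  a: 1
  nested:
    x: 1
child:
  <<: *base
  nested:
    y: 2
"""

data = yaml.safe_load(doc)
# The explicit 'nested' in child wins; base's nested.x is NOT
# deep-merged in.
print(data['child'])  # {'a': 1, 'nested': {'y': 2}}
```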
The solution is a more generic programmatic syntax which allows functions and parameters.
This is already possible with local tags; for example, in AWS CloudFormation files you can use the !Join tag to concatenate list items. The disadvantage of that is that you cannot give the result of such a tag-function another tag.
@ssbarnea If I added a merge-list feature similar to merge keys, I would just concatenate the lists.
Anything more complicated needs something like a templating (jinja for example) or a more generic programmatic syntax.
@perlpunk "merging list" is not necessarily programmatic. The way I see it, it is purely declarative. The problem with local tags is that they are not a standard solution, hence many use cases, the majority actually (e.g. CI configuration), won't benefit from them. For my own software, I implement workarounds manually, but I am still convinced that an equivalent of the merge keys for lists would be useful.
Also re-reading your initial post, I believe:
array2:
- <<: *my_array_alias
- baz
looks good, and does not introduce any new syntax element.
I've been thinking generically about the future on this one and thought I might add a suggestion or two....
Since YAML is a human-friendly data serialization standard and not a markup language, I'm less worried about the << syntax than the !!merge syntax. When editing the file there isn't really a visible difference between the two. There is, however, a grammar difference between them which seems relevant here.
With a merge I do want to "serialize" the data into this location. I view merges as a data storage/retrieval issue.
Vague thoughts that aren't well fleshed out follow:
Things I really want:
- simple merge maps
- merge sequence
- deep merge
Sequence uniqueness feels like it manipulates the data rather than translating the result of multiple sequences put together. Or phrased differently, if you want the unique values of a sequence after a merge - then you don't want the node as is merged over here - you want to manipulate rather than store the data.
The !! name space is reserved, so can we use it for something fancy? Could we extend the !! name space a tiny bit by adding a !!! space for serialization-specific functions? I'd want to limit those functions right up front to ONLY YAML node (scalar, sequence, or mapping) specific serialization.
pseudo YAML:
---
&SEQ_A:
- 1
- 2
- 3
&SEQ_B:
- 4
- 3
- 2
- 1
&DEFAULT_MAP:
a: "value a set by DEFAULT_MAP"
b: "value b set by DEFAULT_MAP"
sub_map:
- 1
- 3
&EXTRA_MAP:
c: "value c set by EXTRA_MAP"
d: "value d set by EXTRA_MAP"
sub_map:
- 2
&REPLACE_MAP:
b: "value b set by REPLACE_MAP"
d: "value d set by REPLACE_MAP"
sub_map:
a_map: 3
##
## possible simple map merge syntax where the top
## level map key is just replaced if it exists
##
merge: !!!merge_map_replace([*DEFAULT_MAP, *EXTRA_MAP])
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by DEFAULT_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set by EXTRA_MAP"
## sub_map:
## - 2
merge: !!!merge_map_replace([*DEFAULT_MAP, *REPLACE_MAP])
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## d: "value d set by REPLACE_MAP"
## sub_map:
## a_map: 3
merge: !!!merge_map_replace([*DEFAULT_MAP, *EXTRA_MAP, *REPLACE_MAP])
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set by REPLACE_MAP"
## sub_map:
## a_map: 3
merge:
!!!merge_map_replace([*DEFAULT_MAP, *EXTRA_MAP, *REPLACE_MAP])
d: "value d set locally"
e: "value e set locally"
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set locally"
## e: "value e set locally"
## sub_map:
## a_map: 3
##
## possible simple seq merge syntax
##
merge: !!!join_seq([*SEQ_A, *SEQ_B])
## which would become
## merge:
## - 1
## - 2
## - 3
## - 4
## - 3
## - 2
## - 1
merge: !!!join_seq([*SEQ_A, *SEQ_B, 5, 6, 7])
## which would become
## merge:
## - 1
## - 2
## - 3
## - 4
## - 3
## - 2
## - 1
## - 5
## - 6
## - 7
##
## possible deep merge syntax
## logically I'd build it from the merges above.
## if the types don't match, replace the node
## aka, merge_map_replace if I can't put the data together
##
merge: !!!merge_nodes([*DEFAULT_MAP, *EXTRA_MAP])
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by DEFAULT_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set by EXTRA_MAP"
## sub_map:
## - 1
## - 3
## - 2
merge: !!!merge_nodes([*DEFAULT_MAP, *EXTRA_MAP, *REPLACE_MAP])
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set by REPLACE_MAP"
## sub_map:
## a_map: 3
merge:
!!!merge_nodes([*DEFAULT_MAP, *EXTRA_MAP, *REPLACE_MAP])
d: "value d set locally"
e: "value e set locally"
sub_map:
b_map: 4
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set locally"
## e: "value e set locally"
## sub_map:
## a_map: 3
## b_map: 4
merge:
!!!merge_nodes([*DEFAULT_MAP, *EXTRA_MAP, *REPLACE_MAP])
d: "value d set locally"
e: "value e set locally"
sub_map:
- 8
## which would become
## merge:
## a: "value a set by DEFAULT_MAP"
## b: "value b set by REPLACE_MAP"
## c: "value c set by EXTRA_MAP"
## d: "value d set locally"
## e: "value e set locally"
## sub_map:
## - 8
merge: !!!merge_nodes([*SEQ_A, *SEQ_B])
## which would become
## merge:
## - 1
## - 2
## - 3
## - 4
## - 3
## - 2
## - 1
merge:
!!!merge_nodes([*SEQ_A, *SEQ_B])
- 5
- 6
- 7
## which would become
## merge:
## - 1
## - 2
## - 3
## - 4
## - 3
## - 2
## - 1
## - 5
## - 6
## - 7
merge:
!!!merge_nodes([*SEQ_A, *SEQ_B])
d: "value d set locally"
e: "value e set locally"
## which would become
## merge:
## d: "value d set locally"
## e: "value e set locally"
I'm not sure a "short syntax" (<<) would add anything or make this more readable.
These thoughts aren't fully baked, but hopefully they are interesting....
+1 :) My suggestion is the below which does not introduce new syntax and just uses the existing asterisk for merging.
a: &a
- 1
- 2
b: *a
- 1
- 3
Look, if you suggest a new syntax (and both @jcpunk and @muuvmuuv did that), please implement it in one of the existing YAML parsers first. It's a useless discussion if you think about how it should look when you have no idea how it is actually implemented.
@muuvmuuv
which does not introduce new syntax
That would mean there are no necessary changes to a YAML parser. But that's wrong.
@perlpunk "merging list" it's not necessarily programmatic. The way I see it, it is purely declarative.
Well, whatever this means for you in this context - it is a transformation that has to happen in one of the stages of YAML loading.
Also re-reading your initial post, I believe:
array2:
- <<: *my_array_alias
- baz
looks good, and does not introduce any new syntax element.
This <<: *my_array_alias is simply a mapping with exactly one key, a merge key. This mapping will get transformed in the constructor stage of the loading process. In fact, since it is only one merge key and nothing else, it can be written shorter as:
array2:
- *my_array_alias
- baz
If you intended to show this as an example of a merge sequence, then please explain how the constructor is supposed to know that it is. Please implement it in PyYAML or a constructor of your choice.
@perlpunk
- *arr_alias is already valid syntax for adding the aliased object as an element, at least per the Ruby parser, so it can't be overloaded for merging.
require 'yaml'
pp YAML.load <<~YAML
anchors:
- &arr1
- a
- b
- &arr2
- c
- d
arr_merge:
- *arr1
- *arr2
- e
- f
YAML
{"anchors"=>[["a", "b"], ["c", "d"]],
"arr_merge"=>[["a", "b"], ["c", "d"], "e", "f"]}
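For what it's worth, PyYAML behaves the same way as the Ruby parser here; an alias used as a sequence item inserts the whole aliased list as a single nested element:

```python
import yaml

doc = """
anchors:
  - &arr1 [a, b]
  - &arr2 [c, d]
arr_merge:
  - *arr1
  - *arr2
  - e
  - f
"""

data = yaml.safe_load(doc)
# Each alias becomes one nested list element, not a splice.
print(data['arr_merge'])  # [['a', 'b'], ['c', 'd'], 'e', 'f']
```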
@bughit I know
Not wanting to intrude on the conversation, but I'd like to suggest a possible refinement of @DanySK's array flattening idea.
If we assume that data in most real-world arrays will be homogeneously typed, then is there any scope for adding a "block sequence style" indicator, similar to the scalar style indicators? This doesn't allow for "per-item" array flattening (like the syntax above), nor does it change the syntax of scalar values at all; rather, it applies a schema-based transformation to every sequence in a YAML document, based on the first item in the sequence.
For example,
x-list: &list
- bar1
- bar2
mylist:
- <<
- foo
- *list
- baz
Would be parsed (using the current 1.2 syntax) as ['<<', 'foo', ['bar1', 'bar2'], 'baz'] (and similarly for the JSON-array syntax).
The schema transformation rule would essentially be "if '<<' is the first element of an array, then it is removed from the result and, if any item in the list is an array, it is flattened into the result (nested arrays are left untouched)". This is analogous to the transformation rule about '<<' if it appears in an object.
This transformation would produce the array ['foo', 'bar1', 'bar2', 'baz'], which is I think what [the subset of users who use anchors/aliases] would expect.
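This rule is easy to prototype as a post-load transformation in plain Python, without touching the parser at all (the function name is mine, and I'm assuming the rule is applied recursively, which also covers nested lists):

```python
def flatten_marked(node):
    """Proposed rule: if a list starts with '<<', drop the marker and
    splice any list items one level deep; deeper nesting is untouched."""
    if isinstance(node, dict):
        return {k: flatten_marked(v) for k, v in node.items()}
    if isinstance(node, list):
        # Apply the rule to children first, then to this list.
        items = [flatten_marked(e) for e in node]
        if items and items[0] == '<<':
            out = []
            for e in items[1:]:
                if isinstance(e, list):
                    out.extend(e)
                else:
                    out.append(e)
            return out
        return items
    return node
```

For example, flatten_marked(['<<', 'foo', ['bar1', 'bar2'], 'baz']) returns ['foo', 'bar1', 'bar2', 'baz'].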
The question is how many unsuspecting users and real-world YAML files would be affected by '<<' getting special treatment if it occurs as the first element of an array? There are two ways people could get stung by this: legacy files which are being converted over to the new format, and people who don't expect ['<<'] == []. But it should be similar enough to the idea of a "header" token at the start of a scalar value?
Any concerns in this area would be mitigated by choosing a longer (and therefore less likely to collide or be entered unexpectedly) token as the array header, e.g. '<<flatten'? Essentially any change made in this area is going to break some theoretically existing YAML files and/or make the contents of the file less literal, so if it is worth doing (and I would err on the side of "yes"; making anchors/aliases more expressive and consistent is worth it) then it becomes a question of "what is the least invasive change we could make to the grammar?"
So yeah, after all that, the original option 2 gets my vote, although with slightly different semantics than I assume you initially intended.
Some edge-ish cases, nowhere near complete:
x-seq1: &list1
- value1
- value3
x-seq2: &list2
- <<
- *list1
- value2
x-seq3: &list3
- *list1
- value4
# "Escaping" '<<' as the first element of the list
l1: ['<<', '<<'] # Expected ['<<']
l2: ['<<', ['a', 'b'], ['a', ['b', 'c']]] # Expected ['a', 'b', 'a', ['b', 'c']]
# More than one reference in the list
l3: [*list1, *list2] # Expected [['value1', 'value3'], ['value1', 'value3', 'value2']]
l4: ['<<', *list1, *list3] # Expected ['value1', 'value3', ['value1', 'value3'], 'value4']
# Schema rule applied to nested lists?
l5: ['<<', ['<<', *list1], *list3] # Expected [['value1', 'value3'], ['value1', 'value3'], 'value4']
Update 1/10:
- "what people would expect" is too generic and vague
- Expand thoughts about version compatibility issues
- Fixed keys in example
- Added self-referential "update" section outlining edits to original post.
The essence of the desire is so basic, I find it hard to believe the language has yet to provide a way to merge sequences. This needs stronger consideration.
Without this feature, DRY is impossible in many, many config files in many, many projects that use YAML.
@ovangle I recommend you take a step back and consider how you're communicating -- we understand you're passionate about this issue but you don't need to use inflammatory language to get your point across. please be nice -- there are humans on the other side of the cable
@gsmethells My sincerest apologies -- as you probably noticed I deleted my message immediately after dispatching it (although that doesn't stop it being delivered to anyone subscribed to this). I am not typically that rude, nor am I even particularly passionate about what I was saying. It started out as a simple "you probably should think about why this isn't the easiest thing to change", but yeah, I was a bit overcaffeinated and hit send before taking the time to reflect on what I was saying.
@ovangle no hard feelings. Thank you for your hard work on this issue.
I'm going to chime in here with a few successive comments.
The right way to do all functional transformation in YAML is with tags. Every mapping, sequence, and scalar has a tag assigned to it, either explicitly or implicitly, during the load() process. The tag is associated with a function that controls how the data is processed into a native (here, Python) data structure.
In YAML 1.1 the << key is implicitly tagged with a !!merge tag that triggers a merging transformation.
Here's a first pass solution with real code using PyYAML:
#!/usr/bin/env python3
from yaml import *
def join(ldr, node):
l = []
for e in ldr.construct_sequence(node, deep=True):
if type(e) is list:
l.extend(e)
else:
l.append(e)
return l
add_constructor('!++', join)
yaml = """\
data:
- &seq1
- aaa
- bbb
joined: !++
- foo
- *seq1
- bar
"""
print(load(yaml, Loader))
Which produces:
{'data': [['aaa', 'bbb']], 'joined': ['foo', 'aaa', 'bbb', 'bar']}
This solution uses a local tag !++ to flatten any lists in a container list. It uses an alias to another list that must be stored somewhere in the YAML document.
As you can see, we created our own tag and used the punctuation tag characters ++ instead of something like !join. This is just a personal style choice that you can decide on yourself.
Then we associated a transformation function with the tag. This whole scenario assumes you have a reasonable YAML framework. PyYAML is a pretty decent YAML framework overall.
A problem with the previous solution is that you can't have lists in the container list that don't get flattened.
Here's a modification where we explicitly mark the list elements that we want to splat.
#!/usr/bin/env python3
from yaml import *
def join(ldr, node):
v = ldr.construct_sequence(node, deep=True)
l = []
for e in v:
if type(e) is tuple and e[0] == 'splat':
l.extend(e[1][0])
else:
l.append(e)
return l
add_constructor('!++', join)
def splat(ldr, node):
return ('splat', ldr.construct_sequence(node, deep=True))
add_constructor('!*', splat)
yaml = """\
data:
- &seq1
- aaa
- bbb
joined: !++
- [foo, yoo: hoo]
- !* [*seq1]
- bar
"""
print(load(yaml, Loader))
which prints:
{'data': [['aaa', 'bbb']], 'joined': [['foo', {'yoo': 'hoo'}], 'aaa', 'bbb', 'bar']}
Here the container list contains one list that we don't flatten and one that we do.
We tag the list we want to splat with a custom local !* tag attached to a splat function.
Since we can't tag an alias in YAML 1.2, we need to wrap it in a sequence.
Not ideal, but not terrible.
Let's see if we can do better...
Here we have almost the same thing, but notice we don't have to specify a !++ tag anymore.
#!/usr/bin/env python3
from yaml import *
def join(ldr, node):
v = ldr.construct_sequence(node, deep=True)
l = []
for e in v:
if type(e) is tuple and e[0] == 'splat':
l.extend(e[1][0])
else:
l.append(e)
return l
add_constructor('tag:yaml.org,2002:seq', join)
def splat(ldr, node):
return ('splat', ldr.construct_sequence(node, deep=True))
add_constructor('!*', splat)
yaml = """\
data:
- &seq1
- aaa
- bbb
joined:
- [foo, yoo: hoo]
- !* [*seq1]
- bar
"""
print(load(yaml, Loader))
It prints:
{'data': [['aaa', 'bbb']], 'joined': [['foo', {'yoo': 'hoo'}], 'aaa', 'bbb', 'bar']}
same as before.
Note that even without the !++ tag we still have the join function. We just attached it to !!seq, so every sequence that has a !* splat element works.
In this final rendition, we add a couple cool things:
#!/usr/bin/env python3
from yaml import *
data = {
'seq1': ['aaa', 'bbb'],
'seq2': ['ccc', 'ddd'],
}
def get_data(ldr, node):
k = ldr.construct_scalar(node)
return data.get(k, [])
add_constructor('!$', get_data)
def join(ldr, node):
l = []
for e in ldr.construct_sequence(node, deep=True):
if type(e) is tuple and e[0] == 'splat':
l.extend(e[1][0])
elif type(e) is dict and e.get('<', None) is not None:
l.extend(e['<'])
else:
l.append(e)
return l
add_constructor('tag:yaml.org,2002:seq', join)
def splat(ldr, node):
return ('splat', ldr.construct_sequence(node, deep=True))
add_constructor('!*', splat)
yaml = """\
seq3: &seq3
- xxx
- yyy
joined:
- !* [ !$ seq1 ]
- [ foo, yoo: hoo ]
- <: !$ seq2
- bar
- <: *seq3
"""
print(load(yaml, Loader))
First off, we made a !$ tag to import data from outside of the YAML. It could have been from a file or a database or anything, but here it's just a Python dict. As you can see, it was trivial to do.
For the seq1 list we splatted it with !* as before. For seq2 we instead introduce a special < key, with the value that we want to use. This lets us not have to wrap the value in a sequence as before.
Note that << could not be used here because PyYAML already wants to use that for merging maps.
So now we kind of can merge lists today in YAML 1.2 with a slightly customized PyYAML like:
- string
- <: *list
- a: thing
- [ with, things ] # not flattened
- <:
- hello
- world
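To make that concrete, a minimal self-contained version of just this < convention could look like the following (the ListMergeLoader name and the restriction to single-key {'<': <list>} items are my own choices; this is a sketch, not an official PyYAML feature):

```python
import yaml

class ListMergeLoader(yaml.SafeLoader):
    """Sketch loader: a {'<': <list>} item is spliced into its sequence."""

def construct_seq(loader, node):
    out = []
    for child in node.value:
        value = loader.construct_object(child, deep=True)
        # Splice single-key {'<': <list>} items; everything else,
        # including plain nested lists, is kept as-is.
        if isinstance(value, dict) and set(value) == {'<'} \
                and isinstance(value['<'], list):
            out.extend(value['<'])
        else:
            out.append(value)
    return out

ListMergeLoader.add_constructor('tag:yaml.org,2002:seq', construct_seq)
```

With this loader, the < items are flattened while an untagged nested list like [ with, things ] stays nested.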
Now that the YAML language development team has just released the YAML specification revision 1.2.2, we are actively working on the next specification and reference implementations.
I can't say for sure what YAML 1.3 will look like exactly, but I'm pretty certain that a whole host of loading transformations will be specified, including merging sequences. That means you'll be able to do these transformations in the same manner from framework to compliant framework.
We can do most of this without any syntax modifications, but we've been working on dozens of back compat ideas that will make these functional things super slick, while keeping today's YAML working as-is.
If you want to engage directly with us, stop by https://matrix.to/#/#chat:yaml.io