Duplicate keys are not handled properly
YAML spec in version 1.2 says that in a valid YAML file, mapping keys are unique.
This is not the case in pyyaml, as can be seen by loading this sample file:
a:
- b
- c
a:
q: a
q: b
The correct result from loading this file is an error. pyyaml instead naively overwrites results with the last key, resulting in this dict: {'a': {'q': 'b'}}.
some discussion of workarounds here: https://gist.github.com/pypt/94d747fe5180851196eb
I am also getting same issue but its slight different click here: https://github.com/gunthercox/ChatterBot/issues/1640
could any one find the solution
I just wanted to comment that in my case, the overwrite is a feature and not a bug. In my model, I have several "sites", each of them with a yaml parameter file. And I have also a parameter file which apply to all sites. So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry. So it would be helpful to at least keep an option for the overwrite feature.
This seems to duplicate #41.
So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry.
@parrenin I think this is misguided. Since this violates the spec, you don't know whether the original or the overridden value would be active (without looking at the implementation, which could change at any time, since it is not part of the spec.) Simply reading the "base" and "overriding" settings as two separate files, and combining as you see fit would be more correct.
The documentation, specifically "Deviations from the specification" in https://pyyaml.org/wiki/PyYAMLDocumentation should be updated.
I looked in the code, and fixing this is pretty simple (famous last words). I was going to submit a pull request, but there are actually tests validating that duplicate keys ARE ALLOWED, so obviously the maintainers have some different opinions about this. The patch at https://gist.github.com/marvingreenberg/ba2f67f740400f722680fc7c538e94a0 does prevent duplicate keys, and seems correct. But, there are then 4 test failures. Three of them seem simple to simply disable (or change to verify that an exception is raised). The fourth I don't quite understand, and may have something to do with places where YAML becomes "not actually so simple" like references and other subtle things. If pyyaml simply intends to keep this behavior (since 2006) I'd say closing this case and updating the documentation is warranted.
And not surprisingly, the patch I include is incomplete since there is a separate constructor for a yaml "ordered mapping" which specifically has a comment that "# Note: we do not check for duplicate keys, because it's too CPU-expensive." (which again seems misguided to me).
One problem is that this logic would break the functionality of merge keys:
https://yaml.org/type/merge.html
See def flatten_mapping.
To fix this, we would need a seperate handling of the merge keys and the actual keys in the mapping.
I don't know which tests were failing (haven't tested your patch yet), but I guess the ones with merge keys were among them.
If you want I can take the trouble of making a pull request, and try my best at updating the tests to be complete. I am a little unclear about what the goals, according to the folks in a better position to decide (the maintainers), are with respect to this particular issue. I think it's important to follow the spec -- silently and without documentation ignoring the spec seems wrong to me. But, YAML is definitely more complicated than "oh, its just a simple little thing".
@marvingreenberg yeah, I think forbidding duplicate keys should be the default. Adding an option for disabling it might be nice.
But merge key behaviour should stay the same.
@ingydotnet is the one to decide about this.
@parrenin
So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry
I think this sounds like a use case for merge keys, actually.
The YAML 1.1 spec also says that keys should be unique: https://yaml.org/spec/1.1/#id861060
The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique
@perlpunk Does that include the merge keys themselves, or only the keys after a resolved merge?
e.g.
- k0: v0
vars: &splat0
VAR0: potato # vars from blob0
- k1: v1
vars: &splat1
VAR1: spud # vars from blob1
- k: v
extra: # merged vars from blob0 + blob1
<<: *splat0
<<: *splat1
@wimglenn the correct way to use multiple merge keys is actually:
- k: v
extra: # merged vars from blob0 + blob1
<<: [ *splat0, *splat1 ]
That way duplicate merge keys can be avoided.
Other than that, the spec is actually not very specific about equality of nodes.
For example, 0xa and 10 are different before resolving, but equal after.
Then what about
- k: v
extra:
VAR0: before
<<: [ *splat0, *splat1 ]
VAR1: after
I know merge spec says that splat0 wins in case of duplicate keys in splat0 & splat1, but what does it say about dupes outside of the merge caused by resolving the merge?
@wimglenn Duplicates should only be considered in the actual mapping definition itself. Merge keys are explicitly there for merging, so they should override keys.
By the way, the spec https://yaml.org/type/merge.html is not very detailed about this, but from how I read the spec and the example, it does not matter where the merge key appers - as the first key, in the middle or at the end. The order is always the same:
---
map:
<<: [ *alias2, *alias3, *alias4 ]
key: value1
# key: something # would be an error
value1 overrides *alias2, *alias2 overrides *alias3, etc.
PyYAML should reject duplicate keys per the YAML spec. These are silent errors in users' code. Do what ruamel.yaml does, and add a yaml.allow_duplicate_keys = True property for those who want to opt-in.
@alexchandel yes, but it's not that trivial to implement, if we want to keep merge key functionality as well. Like I said earlier https://github.com/yaml/pyyaml/issues/165#issuecomment-525047100
The workaround that anz-rfc posted works well enough for me, though a more canonical solution would be preferable.
bump