psych
psych copied to clipboard
Psych does not complain about duplicate keys
Psych does not complain or raise an error when it encounters a duplicate key. Instead, it happily overwrites the previous value of the key. This is in contrast to the YAML spec, which maintains that the keys must be unique.
Agree/disagree that this is a problem?
Well, Psych maintains the behavior that Syck had for this. Until now, nobody has had a problem with the way it works. I don't mind changing it, but it will not be backwards compatible.
Got it. I don't suppose Psych already has a place for configuration options, does it? I don't see one. Would that be a desirable approach? Then, for example, a Rails app in development mode could be more strict/fragile about the YAML than a production deployment.
It's been about 2 years since this was brought up. Let's do it.
So according to this tweet[0], we should add an option to enforce this behaviour (maybe the option is called 'strict').
In my opinion, with this option enabled we should raise a Psych::SyntaxError
if we encounter the duplicate keys. This error is already a known error that can be raised on Psych.parse
[1].
Psych has a number of interfaces that we would want this behaviour added to (cribbed from the docs[2]):
-
(Object) Pysch.load(yaml, filename = nil)
-
(Object) Psych.load_documents(yaml, &block)
-
(Object) Psych.load_file(filename)
-
(Object) Psych.load_stream(yaml, filename = nil)
-
(Object) Psych.parse(yaml, filename = nil)
-
(Object) Psych.parse_file(filename)
-
(Object) Psych.parse_stream(yaml, filename = nil, &block)
Right now none of these methods take an options hash. We can expand the signature with an optional options hash. I think this the right course of action, does anybody have any other suggestion?
Also, consider that the generation of YAML already has an options hash (in Psych.dump
, but not Psych.dump_stream
), so this isn't an unexpected behaviour, in my opinion.
[0] - https://twitter.com/tenderlove/status/468781805940641793 [1] - http://rubydoc.info/stdlib/psych/1.9.3/Psych#parse-class_method [2] - http://rubydoc.info/stdlib/psych/1.9.3/Psych
Basically I'm proposing changing from this:
Psych.parse(yaml, filename = nil)
to this:
Psych.parse(yaml, filename = nil, options = {})
(and on all the other methods listed above)
@Nitrodist :+1: I think it's good. For a major release, I think we should change the filename to be an option like Psych.load(yaml, filename: name, strict: true)
(but that's for a breaking major release). For now, lets add the options hash.
We're having a hacknight tomorrow in Toronto. I'm hoping to work on a solution with @nwjsmith and have it ready by Thursday. :open_hands:
@Nitrodist What was the outcome of your hacking night? :smile:
@coin3d I was able to get it to 'work' but its efficiency left much to be desired. This is what I had:
diff --git lib/psych/handlers/document_stream.rb lib/psych/handlers/document_stream.rb
index e429993..bf04896 100644
--- lib/psych/handlers/document_stream.rb
+++ lib/psych/handlers/document_stream.rb
@@ -17,6 +17,18 @@ module Psych
@last.implicit_end = implicit_end
@block.call pop
end
+
+ def end_mapping
+ mapping = pop
+ keys = []
+ mapping.children.each_slice(2) do |(key_scalar, _)|
+ next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+ key = key_scalar.value
+ raise Psych::Exception, "Same key exists on this level" if keys.include? key
+ keys << key
+ end
+ mapping
+ end
end
end
end
diff --git test/psych/test_hash.rb test/psych/test_hash.rb
index dac7f8d..b3ecbc0 100644
--- test/psych/test_hash.rb
+++ test/psych/test_hash.rb
@@ -21,6 +21,16 @@ module Psych
assert_equal X, x.class
end
+ def test_error_on_same_key
+ assert_raises(Psych::Exception) do
+ Psych.load <<-EOF
+ -
+ same_key: 'value'
+ same_key: 'value'
+ EOF
+ end
+ end
+
def test_self_referential
@hash['self'] = @hash
assert_cycle(@hash)
Hey i'll take a look at this :smile:
I would have liked a warning or error message like this... I now used https://github.com/adrienverge/yamllint for checking for duplicate keys.
What is the state of this ? I don't really mind or care about the default but there should be a way to raise an error on duplicate keys, the consequences of a duplicate key are rather harsh since the first one will simply be entirely overwritten.
If you use a hash instead of a list then the performance shouldn't suffer nearly as much:
+ def end_mapping
+ mapping = pop
+ keys = {}
+ mapping.children.each_slice(2) do |(key_scalar, _)|
+ next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+ key = key_scalar.value
+ raise Psych::Exception, "Same key exists on this level" if keys.key? key
+ keys[key] = nil
+ end
+ mapping
+ end
any takers for testing and a PR?
Is this likely to be implemented any time soon?
@tenderlove you mentioned backwards compatibility... Instead raising an error, just (optionally) printing out a warning wouldn't brake backwards compatibility but help many people manage their yaml files. What do you think?
Just wasted 2 days investigating this very issue after a developer accidentally forgot to rename a job that was copy/pasted. Any thoughts on just printing a message as @wteuber suggests? Would save a lot of time - especially when multiple developers are working on automation.
I helped programmatically avoiding this issue by writing and introducing a normalized YAML format in the projects I am involved in. Maybe you find this helpful, too: https://github.com/Sage/yaml_normalizer @imaginejosephishak
Travis CI user has had a problem because of the current behavior: https://travis-ci.community/t/build-no-longer-installing-apt-packages-ignoring-config/4306
http://www.yamllint.com/ also has the same problem: it doesn't report duplicate keys and instead silently deletes all but the last one and reports YAML as "valid".
Since multiple keys will likely be supported by default, permanently, I think there should be a guarantee about the order they will be processed. It seems to use a "first come, first served" model rather than an override model, i.e. the first key encountered is the one that's used, with subsequent duplicates being discarded rather than acting as overrides.
Would it make sense to document this and add a test to ensure it remains stable?
While I realise it's not strictly YAML-compliant, it's actually useful for merging YAMLs from multilpe sources, so if it's in the library anyway, may as well make it work deterministically.
That this is not an error is problematic, especially in the context of merge keys. Gitlab (working with Ruby for YAML) seems to allow this because of this bug, and the effect when mulitple mapped in merges have the same key (but different values) was allowed but impossible for me to trace what it means. I.e. is
<<: *a
<<: *b
the same as <<: [*a, *b]
or as <<: [*b, *a]
or as <<: *b
or as <<: *a
? These all have different results depending on the keys in the mappings &a
and &b
and the mapping in which things are merged.
FWIW, PyYAML overwrites a key's value with any value with the same that occur later in the document text in a mapping. https://github.com/yaml/pyyaml/blob/8cdff2c80573b8be8e8ad28929264a913a63aa33/lib/yaml/constructor.py#L132
I strongly recommend to implement this issue because I encountered issue related to this implementation.
Related issue
I encountered following issue.
- I use openapi3_parser gem to validate Open API 3 Specification syntax. The gem depends on
psych
. - There was a multiple key definition in the Open API 3 Specification file written by YAML but it's not detected.
- As as point of view from Open API 3 Specification, multiple key definition leads to confuse users' understanding.
- example
- https://github.com/tanomimaster/tanomimaster-openapi/compare/v2.2.8...v2.2.10
Reasons to implement
- The multiple same key is not allowed according to the RFC
- https://yaml.org/spec/1.2.2/#3213-node-comparison
- If the multiple same key is allowed, OpenAPI3 syntax checker would need to validate YAML to assure the multiple key is not definied in the YAML file.