psych icon indicating copy to clipboard operation
psych copied to clipboard

Psych does not complain about duplicate keys

Open rehevkor5 opened this issue 12 years ago • 23 comments

Psych does not complain or raise an error when it encounters a duplicate key. Instead, it happily overwrites the previous value of the key. This is in contrast to the YAML spec, which maintains that the keys must be unique.

Agree/disagree that this is a problem?

rehevkor5 avatar Aug 24 '12 16:08 rehevkor5

Well, Psych maintains the behavior that Syck had for this. Until now, nobody has had a problem with the way it works. I don't mind changing it, but it will not be backwards compatible.

tenderlove avatar Aug 24 '12 17:08 tenderlove

Got it. I don't suppose Psych already has a place for configuration options, does it? I don't see one. Would that be a desirable approach? Then, for example, a Rails app in development mode could be more strict/fragile about the YAML than a production deployment.

rehevkor5 avatar Aug 24 '12 20:08 rehevkor5

It's been about 2 years since this was brought up. Let's do it.

Nitrodist avatar May 20 '14 15:05 Nitrodist

So according to this tweet[0], we should add an option to enforce this behaviour (maybe the option is called 'strict').

In my opinion, with this option enabled we should raise a Psych::SyntaxError if we encounter the duplicate keys. This error is already a known error that can be raised on Psych.parse[1].

Psych has a number of interfaces that we would want this behaviour added to (cribbed from the docs[2]):

  • (Object) Pysch.load(yaml, filename = nil)
  • (Object) Psych.load_documents(yaml, &block)
  • (Object) Psych.load_file(filename)
  • (Object) Psych.load_stream(yaml, filename = nil)
  • (Object) Psych.parse(yaml, filename = nil)
  • (Object) Psych.parse_file(filename)
  • (Object) Psych.parse_stream(yaml, filename = nil, &block)

Right now none of these methods take an options hash. We can expand the signature with an optional options hash. I think this the right course of action, does anybody have any other suggestion?

Also, consider that the generation of YAML already has an options hash (in Psych.dump, but not Psych.dump_stream), so this isn't an unexpected behaviour, in my opinion.

[0] - https://twitter.com/tenderlove/status/468781805940641793 [1] - http://rubydoc.info/stdlib/psych/1.9.3/Psych#parse-class_method [2] - http://rubydoc.info/stdlib/psych/1.9.3/Psych

Nitrodist avatar May 20 '14 16:05 Nitrodist

Basically I'm proposing changing from this:

Psych.parse(yaml, filename = nil)

to this:

Psych.parse(yaml, filename = nil, options = {})

(and on all the other methods listed above)

Nitrodist avatar May 20 '14 16:05 Nitrodist

@Nitrodist :+1: I think it's good. For a major release, I think we should change the filename to be an option like Psych.load(yaml, filename: name, strict: true) (but that's for a breaking major release). For now, lets add the options hash.

tenderlove avatar May 20 '14 17:05 tenderlove

We're having a hacknight tomorrow in Toronto. I'm hoping to work on a solution with @nwjsmith and have it ready by Thursday. :open_hands:

Nitrodist avatar May 20 '14 17:05 Nitrodist

@Nitrodist What was the outcome of your hacking night? :smile:

perryventas avatar Sep 01 '15 10:09 perryventas

@coin3d I was able to get it to 'work' but its efficiency left much to be desired. This is what I had:

diff --git lib/psych/handlers/document_stream.rb lib/psych/handlers/document_stream.rb
index e429993..bf04896 100644
--- lib/psych/handlers/document_stream.rb
+++ lib/psych/handlers/document_stream.rb
@@ -17,6 +17,18 @@ module Psych
         @last.implicit_end = implicit_end
         @block.call pop
       end
+
+      def end_mapping
+        mapping = pop
+        keys = []
+        mapping.children.each_slice(2) do |(key_scalar, _)|
+          next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+          key = key_scalar.value
+          raise Psych::Exception, "Same key exists on this level" if keys.include? key
+          keys << key
+        end
+        mapping
+      end
     end
   end
 end
diff --git test/psych/test_hash.rb test/psych/test_hash.rb
index dac7f8d..b3ecbc0 100644
--- test/psych/test_hash.rb
+++ test/psych/test_hash.rb
@@ -21,6 +21,16 @@ module Psych
       assert_equal X, x.class
     end

+    def test_error_on_same_key
+      assert_raises(Psych::Exception) do
+        Psych.load <<-EOF
+        -
+          same_key: 'value'
+          same_key: 'value'
+        EOF
+      end
+    end
+
     def test_self_referential
       @hash['self'] = @hash
       assert_cycle(@hash)

Nitrodist avatar Sep 17 '15 14:09 Nitrodist

Hey i'll take a look at this :smile:

DarthCharles avatar Oct 09 '15 14:10 DarthCharles

I would have liked a warning or error message like this... I now used https://github.com/adrienverge/yamllint for checking for duplicate keys.

wteuber avatar Jun 24 '16 22:06 wteuber

What is the state of this ? I don't really mind or care about the default but there should be a way to raise an error on duplicate keys, the consequences of a duplicate key are rather harsh since the first one will simply be entirely overwritten.

schmurfy avatar Sep 05 '17 08:09 schmurfy

If you use a hash instead of a list then the performance shouldn't suffer nearly as much:

+      def end_mapping
+        mapping = pop
+        keys = {}
+        mapping.children.each_slice(2) do |(key_scalar, _)|
+          next if key_scalar.is_a?(Psych::Nodes::Sequence) or key_scalar.is_a?(Psych::Nodes::Alias) or key_scalar.is_a?(Psych::Nodes::Mapping)
+          key = key_scalar.value
+          raise Psych::Exception, "Same key exists on this level" if keys.key? key
+          keys[key] = nil
+        end
+        mapping
+      end

any takers for testing and a PR?

imoverclocked avatar Oct 26 '17 21:10 imoverclocked

Is this likely to be implemented any time soon?

alex-harvey-z3q avatar Mar 23 '18 12:03 alex-harvey-z3q

@tenderlove you mentioned backwards compatibility... Instead raising an error, just (optionally) printing out a warning wouldn't brake backwards compatibility but help many people manage their yaml files. What do you think?

wteuber avatar Mar 25 '18 12:03 wteuber

Just wasted 2 days investigating this very issue after a developer accidentally forgot to rename a job that was copy/pasted. Any thoughts on just printing a message as @wteuber suggests? Would save a lot of time - especially when multiple developers are working on automation.

imaginejosephishak avatar Apr 23 '19 18:04 imaginejosephishak

I helped programmatically avoiding this issue by writing and introducing a normalized YAML format in the projects I am involved in. Maybe you find this helpful, too: https://github.com/Sage/yaml_normalizer @imaginejosephishak

wteuber avatar Apr 23 '19 19:04 wteuber

Travis CI user has had a problem because of the current behavior: https://travis-ci.community/t/build-no-longer-installing-apt-packages-ignoring-config/4306

native-api avatar Jul 23 '19 02:07 native-api

http://www.yamllint.com/ also has the same problem: it doesn't report duplicate keys and instead silently deletes all but the last one and reports YAML as "valid".

native-api avatar Jul 23 '19 03:07 native-api

Since multiple keys will likely be supported by default, permanently, I think there should be a guarantee about the order they will be processed. It seems to use a "first come, first served" model rather than an override model, i.e. the first key encountered is the one that's used, with subsequent duplicates being discarded rather than acting as overrides.

Would it make sense to document this and add a test to ensure it remains stable?

While I realise it's not strictly YAML-compliant, it's actually useful for merging YAMLs from multilpe sources, so if it's in the library anyway, may as well make it work deterministically.

mahemoff avatar Sep 15 '19 10:09 mahemoff

That this is not an error is problematic, especially in the context of merge keys. Gitlab (working with Ruby for YAML) seems to allow this because of this bug, and the effect when mulitple mapped in merges have the same key (but different values) was allowed but impossible for me to trace what it means. I.e. is

   <<: *a
   <<: *b

the same as <<: [*a, *b] or as <<: [*b, *a] or as <<: *b or as <<: *a ? These all have different results depending on the keys in the mappings &a and &b and the mapping in which things are merged.

FWIW, PyYAML overwrites a key's value with any value with the same that occur later in the document text in a mapping. https://github.com/yaml/pyyaml/blob/8cdff2c80573b8be8e8ad28929264a913a63aa33/lib/yaml/constructor.py#L132

AvdN avatar Dec 02 '21 17:12 AvdN

I strongly recommend to implement this issue because I encountered issue related to this implementation.

Related issue

I encountered following issue.

  • I use openapi3_parser gem to validate Open API 3 Specification syntax. The gem depends on psych.
  • There was a multiple key definition in the Open API 3 Specification file written by YAML but it's not detected.
  • As as point of view from Open API 3 Specification, multiple key definition leads to confuse users' understanding.
  • example
    • https://github.com/tanomimaster/tanomimaster-openapi/compare/v2.2.8...v2.2.10

Reasons to implement

  • The multiple same key is not allowed according to the RFC
    • https://yaml.org/spec/1.2.2/#3213-node-comparison
  • If the multiple same key is allowed, OpenAPI3 syntax checker would need to validate YAML to assure the multiple key is not definied in the YAML file.

matsubo avatar Jun 02 '22 01:06 matsubo