psych icon indicating copy to clipboard operation
psych copied to clipboard

Raise exception on duplicate keys

Open getaaron opened this issue 5 years ago • 11 comments

The YAML spec is extremely clear that duplicate keys are forbidden. A few examples:

an unordered association of unique keys to values

and:

keys are unordered and must be unique

and:

The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique

and:

YAML mappings require key uniqueness

For these reasons, psych should raise an exception when trying to parse YAML with duplicate keys. It should be considered a bug that this ever worked, so I don't think we need to worry about backwards compatibility to support out-of-spec YAML parsing.

This PR adds functionality to raise an exception on duplicates, and a test for it. Closes #79.

getaaron avatar Dec 18 '19 01:12 getaaron

@yuki24 could we please merge this?

getaaron avatar Mar 25 '20 04:03 getaaron

We should warn it before raising an exception. This change breaks the user application.

hsbt avatar Mar 25 '20 09:03 hsbt

I agree with @hsbt. The current functionality has been like this forever.

Besides that, I don't think this is the correct place to fix it. These stream handlers are used for building ASTs. I think it would be more appropriate to fix this where the AST is converted to a Ruby object. Would you mind fixing it in that place?

Here's a starting point:

diff --git a/lib/psych/visitors/to_ruby.rb b/lib/psych/visitors/to_ruby.rb
index a922f90..9ca2d9a 100644
--- a/lib/psych/visitors/to_ruby.rb
+++ b/lib/psych/visitors/to_ruby.rb
@@ -348,6 +348,8 @@ module Psych
           end
           val = accept(v)
 
+          raise Psych::Exception if hash.key? key
+
           if key == SHOVEL && k.tag != "tag:yaml.org,2002:str"
             case v
             when Nodes::Alias, Nodes::Mapping

The above patch makes the test in this PR pass, but there is another test in the suite that fails and I'm not sure why.

tenderlove avatar May 08 '20 00:05 tenderlove

Btw, adding this check is going to have a performance impact on loading hashes. It might be nice if there is a way we can let people who care more about performance and less about being true to the spec. Maybe introduce a strict mode?

tenderlove avatar May 08 '20 00:05 tenderlove

Shouldn't this be enforced at the libyaml level?

For the JRuby extension, we could enable this strict behavior by setting options to the SnakeYAML library we use:

https://www.javadoc.io/doc/org.yaml/snakeyaml/1.20/org/yaml/snakeyaml/LoaderOptions.html

It seems odd to enforce this at the Ruby level when it's part of the YAML spec.

headius avatar May 08 '20 02:05 headius

@headius does that actually impact the AST, or just the loader? I don't know SnakeYaml very well, but it doesn't look like we're using LoaderOptions at all.

We could also do this while the AST is being created, but this PR seems to do post-processing on the AST and that is definitely not the right place to do it regardless.

tenderlove avatar May 08 '20 15:05 tenderlove

@tenderlove That's a good point... it appears the LoaderOptions are applied when loading YAML into an in-memory graph of Java objects (maps, lists, etc). From what I can read in the code, it's applied at the point the MappingNode gets processed into a Map.

https://bitbucket.org/asomov/snakeyaml/src/70ff9d8b181f7e1aff9eb7349b2ca288cf308496/src/main/java/org/yaml/snakeyaml/constructor/SafeConstructor.java#lines-83:118

They don't do a separate scan of the node's contents. Instead they check each key inserted into the resulting map to see if something else was there already.

headius avatar May 08 '20 20:05 headius

They don't do a separate scan of the node's contents. Instead they check each key inserted into the resulting map to see if something else was there already.

Ok. So they're doing basically what I suggested to do. IOW the events and AST don't care that there are duplicate keys, but the thing that creates objects does.

tenderlove avatar May 08 '20 20:05 tenderlove

@tenderlove so are the next steps…

  • Keep my test
  • Replace my implementation with your patch
  • investigate the other failing test you mentioned

?

What about @hsbt’s concern? Strict mode seems reasonable, but might expand the scope of this PR a lot.

I’m happy to do some more work on this but I could use some more design direction since it’s my first time working in this repo.

getaaron avatar May 08 '20 23:05 getaaron

@getaaron yep, I think those are the next steps. The initialize method for the ToRuby visitor already takes some options. I think we could use that to configure the strategy we need. I pushed a commit here that implements it (but I haven't tested it)

tenderlove avatar May 08 '20 23:05 tenderlove