psych icon indicating copy to clipboard operation
psych copied to clipboard

Different behaviour between psych and libyaml on indentation

Open dominicj opened this issue 10 years ago • 8 comments

Hi,

When I try to parse the following YAML snippet:

  - indentation 2 spaces
  - indentation 2 spaces
- indentation 0 space

It fails with libyaml with error code = 4 (i.e. YAML_PARSER_ERROR) and description = did not find expected <document start>.

With psych, it does not complain at all.

Any idea why?

Thanks!

dominicj avatar Jun 30 '15 20:06 dominicj

I don't really understand. Can you provide some code as an example? Psych uses libyaml, so it should be the same.

tenderlove avatar Jun 30 '15 21:06 tenderlove

Hi,

Thanks for your fast reply.

So I downloaded and compiled libyaml from the Github repo here.

Then I used the run-load.c test to load a sample file with the YAML snippet above and it fails (i.e. once compiled, I run the test like this: ./run-load test.yaml).

But when I use psych.load_file() in pure Ruby to load the same file, it succeeds.

I would expect the same behaviour i.e. both failing to load the file.

Let me know if you need more details.

Thanks!

dominicj avatar Jul 01 '15 02:07 dominicj

I have the same issue:

test code:

#!/usr/bin/env ruby
require 'psych'
require 'pp'
pp Psych.load_file(ARGV.first)

bad.yaml contains (note the single space before abc):

 abc: 123
xyz: 456

and produces a unsatisfying result. where's xyz?: {"abc"=>123}


good.yaml contains:

abc: 123
xyz: 456

and produces the expected result: {"abc"=>123, "xyz"=>456}


This mal-feature is such that if you start a file/text-stream with N spaces, then after it encounters a line without the N leading-spaces, the remainder of the text is ignored.

  abc: def
  biz: buzz
 buzz: biz
  axe: hatchet
birds: fowl
  xyz: zyx

produces: {"abc"=>"def", "biz"=>"buzz"}


Switching gears, i tried the same files with PERL's yaml reader using this script.

#!/usr/bin/env perl
use YAML;
use Data::Dumper;
my $config = YAML::LoadFile(@ARGV[0]);
print Dumper($config);

from above examples, bad.yaml produces:

   Code: YAML_PARSE_ERR_INCONSISTENT_INDENTATION
   Line: 2
   Document: 1
 at /usr/local/share/perl5/YAML/Loader.pm line 736.

and good.yaml produces:

$VAR1 = {
          'abc' => '123',
          'xyz' => '456'
        };

I feel like Psych should raise an exception for this broken input instead of silently ignoring the remainder of the file.

nanobowers avatar Sep 08 '17 20:09 nanobowers

Further diagnosis:

#parse_stream correctly throws a syntax error:

2.1.5 :257 > Psych.parse_stream(" a: 1\nb: 2")
Psych::SyntaxError: (<unknown>): did not find expected <document start> at line 1 column 1
	from /home/ben/.rvm/rubies/ruby-2.1.5/lib/ruby/2.1.0/psych.rb:373:in `parse'
	from /home/ben/.rvm/rubies/ruby-2.1.5/lib/ruby/2.1.0/psych.rb:373:in `parse_stream'
	from (irb):257
	from /home/ben/.rvm/rubies/ruby-2.1.5/bin/irb:11:in `<main>'

however #parse does not.

2.1.5 :258 > Psych.parse(" a: 1\nb: 2")
 => #<Psych::Nodes::Document:0x00000000a3fb10 @children=[#<Psych::Nodes::Mapping:0x00000000a3fa98 @children=[#<Psych::Nodes::Scalar:0x00000000a3f9d0 @value="a", @anchor=nil, @tag=nil, @plain=true, @quoted=false, @style=1>, #<Psych::Nodes::Scalar:0x00000000a3f8b8 @value="1", @anchor=nil, @tag=nil, @plain=true, @quoted=false, @style=1>], @anchor=nil, @tag=nil, @implicit=true, @style=1>], @version=[], @tag_directives=[], @implicit=true, @implicit_end=true> 

from my layman's perspective, something funky seems to be happening with the return in Psych.parse which appears to eat the exception? Using a return inside a block seems a little odd to me, and in most cases i can't figure out how the fallback value is used here, unless somehow the block never gets called.

def self.parse yaml, filename = nil, fallback = false
    parse_stream(yaml, filename) do |node|
      return node
    end
    fallback
end

nanobowers avatar Sep 09 '17 03:09 nanobowers

@tenderlove maybe someone else can take a look at this? I isolated the problem, but am not sure how to fix it without breaking other stuff. I feel like the current behavior (die silently) is a serious flaw though.

nanobowers avatar Oct 06 '17 01:10 nanobowers

Strings can contain multiple documents, and it looks like libyaml thinks this particular thing is two documents. If I try like this:

ruby -I lib -r psych -e'Psych.parse_stream(" a: 1\nb: 2") { |x| p x }'

The output is this:

#<Psych::Nodes::Document:0x00007fac1d026888 @children=[#<Psych::Nodes::Mapping:0x00007fac1d026608 @children=[#<Psych::Nodes::Scalar:0x00007fac1d026248 @value="a", @anchor=nil, @tag=nil, @plain=true, @quoted=false, @style=1, @start_line=0, @start_column=1, @end_line=0, @end_column=2>, #<Psych::Nodes::Scalar:0x00007fac1d026068 @value="1", @anchor=nil, @tag=nil, @plain=true, @quoted=false, @style=1, @start_line=0, @start_column=4, @end_line=0, @end_column=5>], @anchor=nil, @tag=nil, @implicit=true, @style=1, @start_line=0, @start_column=1, @end_line=1, @end_column=0>], @version=[], @tag_directives=[], @implicit=true, @implicit_end=true>
Traceback (most recent call last):
	2: from -e:1:in `<main>'
	1: from /Users/aaron/git/psych/lib/psych.rb:380:in `parse_stream'
/Users/aaron/git/psych/lib/psych.rb:380:in `parse': (<unknown>): did not find expected <document start> at line 1 column 1 (Psych::SyntaxError)

The first document parses, the second one does not.

Psych.load only streams and loads the first document as it doesn't know how to merge documents. I could have Psych.load raise an exception if there are multiple documents, but I'm not really sure what to do.

tenderlove avatar Oct 06 '17 04:10 tenderlove

I checked with PyYaml and it also detects and errors on faulty indentation - isn't it also using libyaml? Perhaps worth looking at how it arrives at the indentation being an error. (Here is our ticket https://tickets.puppetlabs.com/browse/PUP-8547 raised by puppet users).

hlindberg avatar Mar 14 '18 13:03 hlindberg

Confirm. The problem turned into painful consequences.

Below is the example which gets silently parsed into { a:1, b: 2} what is an obvious bug.

--
#someobj:
  a: 1
  b: 2
somevar: 1

My suggestion:

  1. Fail if multiple documents are detected on #load.
  2. Add additional #load_all API to mimic the original Yaml::load() and Yaml::loadAll().

andvgal avatar Mar 14 '18 14:03 andvgal