yaml icon indicating copy to clipboard operation
yaml copied to clipboard

Single carriage return (`\r`) not treated as line break

Open jackwhelpton opened this issue 1 year ago • 2 comments

Describe the bug Single carriage return line endings (\r) are treated as non-breaking characters. This is contrary to the YAML spec, which explicitly calls out CR, LF and CRLF as the (only) possible line breaks.

A similar problem afflicts other parsers, for example https://github.com/jbeder/yaml-cpp/issues/986

To Reproduce

const lfTokens = Array.from((new Lexer()).lex("text: \"a\n\n\n\n b\""));
console.log(lfTokens.length);

const crTokens = Array.from((new Lexer()).lex("text: \"a\n\r\r\r b\""));
console.log(crTokens.length);

Note the output:

6
9

Expected behaviour

6
6

The extra tokens in the second case are a result of \r being treated like any other (non-breaking) character, meaning the tokens are split only at the first \n and not reassembled:

(..., a, \n, \x1F, \r\r\r b)

Versions (please complete the following information):

  • Environment: [e.g. Node.js 14.7.0 or Chrome 87.0] Node.js 20.17.0

  • yaml: [e.g. 1.10.0 or 2.0.0-2] 2.6.1

Additional context There are several places this seems to manifest, for example:

https://github.com/eemeli/yaml/blob/main/src/parse/lexer.ts#L223 https://github.com/eemeli/yaml/blob/main/src/parse/lexer.ts#L191

jackwhelpton avatar Dec 16 '24 06:12 jackwhelpton

While recognising that this is a divergence from the spec, does it affect any real-world use for you? As in, do you have some content that is actually using a lone \r as a line break?

eemeli avatar Dec 16 '24 08:12 eemeli

No real-world impact at the moment, beyond the time spent debugging it.

For full context: I'm working on a component that extracts YAML fields from a larger document for passing to a third-party tool, and needs to be able to correlate line and column number from that extracted content with the original document.

I'd hoped I could reuse much of the existing logic, but given the need to know about the effects on column position of expanding, for instance, escape codes, for now I'm effectively replacing the whole "compose" layer. In the rewritten version, I keep track of positions, so I can emit a list of coordinates with the composed text that lets me align it to the original source doc.

My set of unit tests/edge cases for this include a mixture of line endings, and those are failing, which is why I had to track the problem back through the layers (right to the lexer).

If there's no intent to fix this, for now I just need to ensure all my test cases use only \n or \r\n, and that I document that as a restriction. I think mixing those two in a file would also be supported, as most/all of the problems I've encountered so far seem to revolve around \r.

jackwhelpton avatar Dec 16 '24 20:12 jackwhelpton