yamllint icon indicating copy to clipboard operation
yamllint copied to clipboard

Add a multi-line string rule

Open robherring opened this issue 1 year ago • 5 comments

YAML supports 3 styles of multi-line string values: plain, folded, and block. The folded ('>') and block ('|') modifiers define how formatting and some special characters are interpreted. Common problems are using the modifiers when not necessary or not using them when necessary. Add a rule to check for these conditions.

This is a constant source of manual review comments for us (Devicetree schema files in Linux kernel tree) as many users are completely clueless about multi-line strings in YAML.

I'm looking for feedback on the general approach. Will add more doc examples and tests if the approach is okay.

robherring avatar Mar 03 '25 18:03 robherring

Coverage Status

coverage: 98.738% (-1.1%) from 99.824% when pulling 521d25f5ec4fbd0eabbfd4422004931f94bc05b6 on robherring:multi-line-strings into e427005b528f5a4cd59387791b6f9d87ae7dd4ce on adrienverge:master.

coveralls avatar Mar 03 '25 18:03 coveralls

Hello Rob,

The idea is interesting! To understand all the ins and outs, could you give concrete examples of how it would be useful for Device Tree schema files?

In YAML, all the following representations are equivalent (in JSON they would be written as "This is a sentence.\nThis is another sentence.\n\nThis is another paragraph."):

plain flow scalar:
  This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph.

double-quoted flow scalar:
  "This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph."

folded block scalar: >-
  This is a
  sentence.

  This is another
  sentence.


  This is another
  paragraph.

literal block scalar: |-
  This is a sentence.
  This is another sentence.

  This is another paragraph.

So if I understand your goal: - missing-block-token: true could be named forbid-empty-lines-in-flow-scalars: true? - unnecessary-block-token: true could be named forbid-block-scalars-without-empty-lines: true?

(About option names, I believe it's better to use "forbid-…" or "require-…" for clarity and consistency with other rules. And stay close to the YAML terminology (see flow scalars and block scalars) for precision and future-proofness.)

adrienverge avatar Mar 07 '25 11:03 adrienverge

Hello Rob,

The idea is interesting! To understand all the ins and outs, could you give concrete examples of how it would be useful for Device Tree schema files?

It's pretty straight-forward as we have cases when to use or not use literal/folded block which are the source of frequent manual review comments. The only place this comes into play are 'description' values as those are the only free form data. None of this really matters until we feed this into something else like some json-schema to docs or a pass thru rtyaml.

These 3 examples are pretty exhaustive for the patterns we have. We don't ever use quotes.

description:
  This is a 
  paragraph.
  This is another
  paragraph.

description:
  This has formatting that
    - needs
    - to
    - be
    - preserved

description: |
  Formatting is not
  important in this case.

So if I understand your goal: - missing-block-token: true could be named forbid-empty-lines-in-flow-scalars: true? - unnecessary-block-token: true could be named forbid-block-scalars-without-empty-lines: true?

I suppose that's better as it encodes exactly what we are checking for. I was a bit worried about corner cases of what's missing or unnecessary. I also thought about making this a regex as essentially we're looking for '\n[\n ]+', but I don't have any use case for other regex's.

There's also the cases of indented lines and special character and character sequences (e.g. ': '), so empty lines doesn't fully capture it. Perhaps forbid-formatted-whitespace-in-flow-scalars?

Another potential check I'm thinking about is line wrapping. Some reason when the rule is wrap at 80 chars, folks wrap at something like 60. Even the line length check doesn't work to enforce <80 because that's set to 110 for exceptions (URLs typically). Maybe we could exclude lines with only leading whitespace. But to fully check wrapping, I think that would have to be rewrap the text at some length and complain if it is different than the input. I bring this up because something like that would catch all the forbid-formatted-whitespace-in-flow-scalars cases.

(About option names, I believe it's better to use "forbid-…" or "require-…" for clarity and consistency with other rules. And stay close to the YAML terminology (see flow scalars and block scalars) for precision and future-proofness.)

FWIW, I was using https://yaml.info which used "block token".

robherring avatar Mar 07 '25 15:03 robherring

Now that I understand the use case better, I'm questioning myself about whether a yamllint rule would be suited for this.

  • This is about controlling that the content contains / doesn't contain blank lines, whereas yamllint normally only lints the form of YAML. Although here, it's debatable because this would be checked to determine whether the use of block tokens (e.g. > or |) is needed.

  • Forbidding empty lines in flow scalars (= forbidding newline characters in the *resulting string in their content) is very specific to this use case, isn't it?

    There are many cases where it makes sense to have multiline contents, which needs blank lines to be correctly written in a YAML plain flow scalar. For example:

    script:
      GIT_TRACE=1
      git add
      src/dir/dir/dir/dir/dir/up/to/max/allowed/line/length
      --ignore-missing
    
      git
      commit
      --allow-empty
    
      exit
    
  • Similarly, forbidding block scalars without empty lines inside them is not that common. Block scalars without blank lines are often encountered in YAML files:

    script: |
      GIT_TRACE=1 git add src/… --ignore-missing
      git commit --allow-empty
      exit
    

    In this code from your commit, I haven't understood why you want to forbid the former, but not the latter? Why would the first snippet be more valid?

    #. With ``multi-line-strings: {unnecessary-block-token: true}``
    
       the following code snippet would **PASS**:
       ::
        foo: |
          bar
    
          baz
    
       the following code snippet would **FAIL**:
       ::
        foo: |
          bar
          baz
    
  • As you just described, what you're trying to enforce is beyond blank lines: it's also about whitespaces and special markers (-, :…)

adrienverge avatar Mar 21 '25 18:03 adrienverge

Now that I understand the use case better, I'm questioning myself about whether a yamllint rule would be suited for this.

Honestly, I'd rather have a plug-in than have this built-in because I'm not sure there's wider appeal and getting something that's going to work universally is hard. But I saw you pushed back on the plug-in idea. Currently (for me), it doesn't really matter how things are formatted. It's just prose for human consumption. But in the future, I'd like to generate documentation out of it and then how the YAML is parsed and whether the formatting is maintained will matter.

I've looked at coding up something using ruamel, but it's slow in RT mode and doesn't give me the original line breaks in plain flow scalars, so I can't check if flow/block modifier is missing.

* This is about controlling that **the content** contains / doesn't contain blank lines, whereas yamllint normally only lints **the form** of YAML. Although here, it's debatable because this would be checked to determine whether the use of block tokens (e.g. `>` or `|`) is needed.

I'd say it is about checking whether a YAML parser is going to parse the content as intended or reformat it. Certainly there may be cases where the intent isn't clear. My only answer for that is don't turn on this check...

* Forbidding empty lines in flow scalars (= forbidding newline characters in the *resulting string in their content) is very specific to this use case, isn't it?
  There are many cases where it makes sense to have multiline contents, which needs blank lines to be correctly written in a YAML plain flow scalar. For example:
  ```yaml
  script:
    GIT_TRACE=1
    git add
    src/dir/dir/dir/dir/dir/up/to/max/allowed/line/length
    --ignore-missing
  
    git
    commit
    --allow-empty
   
    exit
  ```

While this works, I wouldn't recommend anyone use it because many users don't know YAML well enough to know this and below are equivalent. Even if you do know, it is easy enough to forget to add a '|'. For something like this (a script), I would want to require it to be a block scalar and disallow the others. So maybe this needs to be only for specific keys. That would work for me as it is only 'description' that I need to check. But I guess something applied per key would be a bit of a departure from normal yamllint rules.

* Similarly, forbidding block scalars without empty lines inside them is not that common.
  Block scalars without blank lines are often encountered in YAML files:
  ```yaml
  script: |
    GIT_TRACE=1 git add src/… --ignore-missing
    git commit --allow-empty
    exit
  ```
      
    
  In this code from your commit, I haven't understood why you want to forbid the former, but not the latter? Why would the first snippet be more valid?

Huh? I want to forbid the latter (the FAIL one).

I've run this thru our files. The only kind of false-positive has been something like the latter where the intent was clearly to be a list. So that would require some rework to make lists clearly a list. Given it's just prose, not a big deal for us.

  ```
  #. With ``multi-line-strings: {unnecessary-block-token: true}``
  
     the following code snippet would **PASS**:
     ::
      foo: |
        bar
  
        baz
  
     the following code snippet would **FAIL**:
     ::
      foo: |
        bar
        baz
  ```

* As you just described, what you're trying to enforce is beyond blank lines: it's also about whitespaces and special markers (`-`, `:`…)

Yes, as any of those require either quoting (we also minimize quoting) or a block/flow modifier. At least a ':' will trigger a parsing error, but a '# ' will just silently drop the content when parsed.

robherring avatar Mar 21 '25 20:03 robherring