wdl icon indicating copy to clipboard operation
wdl copied to clipboard

Multi-line strings

Open jdidion opened this issue 3 years ago • 23 comments

The ability to have multi-line strings has been requested several times. It would also make easier several of the other proposals under consideration, such as in-line Dockerfile and evaluation of arbitrary expressions. I went with Python/Scala-style rather than HEREDOC style as that seems most idiomatic for WDL.

This is implemented in the v2 ANTLR4 grammar in a wdlTools branch: https://github.com/dnanexus-rnd/wdlTools/tree/multiline-strings/src/main/antlr4/v2

Checklist

  • [ ] Pull request details were added to CHANGELOG.md

jdidion avatar Nov 04 '20 16:11 jdidion

@jdidion This is amazing. I believe this will be considered a highly welcome addition to the WDL specification!

patmagee avatar Nov 04 '20 16:11 patmagee

  • IMO the "least surprising" behavior would be that all whitespace in the multi-line string are preserved by default.
  • I like the idea of starting a line with | to indicate "strip margin" but in practice, people might try to use this to generate bash commands, in which case erasing |s and > feels dangerous. Could that be a function (eg like strip_margin(multiline_string) rather than default behavior)?

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

We could also have a remove_newlines(multiline_string) to replace the "we remove all newlines by default" behavior currently proposed

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

My feeling was that the most common usage for multi-line strings will be to break up long strings. If you want to write bash, you put that in the command block, no?

I don't think we need a special remove_newlines function - just use sub(str, "\n", " ").

jdidion avatar Nov 04 '20 17:11 jdidion

Coming from scala might be distorting my sense of "what people expect" w.r.t. newlines in multi-line strings.

Usually the "what would people expect" tie-breaker in cases of different aesthetic preferences is "what would python do", which seems appropriate here too.

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

As far as I can tell, python also preserves newlines in multi-line string literals (https://www.techbeamers.com/python-multiline-string/). Unless there's a really strong case against it, I think that should be what we do.

Yes, I think bash commands usually go in command blocks, but I still think having a syntax that erases very common bash-style syntax by default in multi-line string literals feels dangerous to me

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

Personally, I find it very annoying to have to litter my Scala code with .stripMargin.replaceAll("\n", " ").

I agree that "least surprising" is a good approach, but I also feel it can be over-ridden by the desire to simplify the most common use case. We can add more examples to make it clear that if you want to write a multi-line bash string (or whatever) that preserves newlines you have to use | or > prefixes.

That said, I admit I may be wrong about what will be the most common use-case.

jdidion avatar Nov 04 '20 17:11 jdidion

I'd argue with your premise there though that the most common use case is "I want to split a single line string across many lines". I think the most common usage would be "I want a to write a multi-line string directly in my WDL". In which case the simplest way to provide that is to make it the default.

But we're arguing our own personal preferences here - which will never have a definitive answer. I think "choose the options that's most similar to python" is the most objective possible way to resolve these discussions.

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

Fair enough. Let's wait to get a bit more feedback from others. I'm happy to update the spec and grammar if the consensus is to leave in whitespace by default.

jdidion avatar Nov 04 '20 17:11 jdidion

Agreed! An n > 2 is always a good idea...

cjllanwarne avatar Nov 04 '20 17:11 cjllanwarne

Personally, I find the least surprising thing to be represent the string "as-is" with new lines intact. for me that is the main point of a multiline string, is to write across multiple lines. The tricky part I see is around preserving indents or removing them. I believe in python (since its tab/space delimeted) this is easy, since you simply strip out leading tabs corresponding to the current indent level. In WDL this is a bit trickier. we could

  1. Not touch anything
  2. Strip the least number of common tabs
  3. Define a syntax (ie | ) for retaining those indents

patmagee avatar Nov 04 '20 17:11 patmagee

Ok, how about:

  • By default, leave newlines and strings alone
  • If a line is prefixed by |, automatically strip all the leading whitespace before it (as well as the | itself)
    • The | and whitespace is left in tact if it is escaped
  • If the user wants to remove newlines they have to sub(str, "\n", " ")

jdidion avatar Nov 04 '20 18:11 jdidion

I really worry about strings like

String command_prefix = """echo hi \
  | grep foo \
  | wc -l"""

cjllanwarne avatar Nov 04 '20 18:11 cjllanwarne

Oh, good point.

In that case I guess we may need a trim_leading_whitespace function. By default this would strip the least number of common spaces/tabs after each newline. We could have an optional second argument that would do any/all of the following:

  • If it is an integer, strip (up to) that many leading whitespace characters
  • If it is a whitespace-only string, strip exactly that whitespace from the beginning of each line (and leave in-tact lines that don't begin with exactly that whitespace)
  • If it is a single non-whitespace character, strip all leading whitespace up to and including that character any time that character is the first non-whitespace character in the line

jdidion avatar Nov 04 '20 18:11 jdidion

A simpler alternative would be a dedent function (akin to the one in python textwrap https://docs.python.org/3/library/textwrap.html#textwrap.dedent) that just removes leading common whitespace, and not try to have a syntax for margin. It could take a second boolean parameter which says whether to also strip out newlines.

jdidion avatar Nov 04 '20 18:11 jdidion

Those ideas all sound reasonable to me.

cjllanwarne avatar Nov 04 '20 18:11 cjllanwarne

Have you looked at the spec for Java text blocks? The java spec supports both automatically stripping newlines and preserving relative indent and doesn't require a lot of extra syntax. https://openjdk.java.net/jeps/378

pshapiro4broad avatar Nov 04 '20 19:11 pshapiro4broad

I do like that java text block spec's interpretation of "intrinsic whitespace", which matches very closely the command block behavior.

cjllanwarne avatar Nov 04 '20 20:11 cjllanwarne

Ive just started using Java's implementaiton of text blocks and they are very intuitive, so I give a big +1 to that!

patmagee avatar Nov 04 '20 21:11 patmagee

is the discussion on how to handle whitespace still live or do we have general consensus?

patmagee avatar Apr 14 '21 16:04 patmagee

This would be really helpful, we have a large CWL pipeline we want to convert to WDL which uses YAML text blocks extensively for SQL queries. Not having an equivalent in WDL has been frustrating, as the SQL needs to be squashed into a single line. This feature would be greatly appreciated.

jmesterh avatar Dec 13 '21 18:12 jmesterh

@jdidion any update on the status of this? It would be great to get this to voting

patmagee avatar Jul 21 '22 14:07 patmagee

FWIW I'd vote for the Java text block syntax. Since my last post we've resorted to a yaml → wdl converter which isn't fun.

jmesterh avatar Jul 21 '22 20:07 jmesterh

:+1: to the usefulness of this feature!

hkeward avatar Oct 20 '22 20:10 hkeward

I am good with Java text block syntax. I haven't read the spec in detail but it looks to me like the important things are:

  • Strings start and end with three matching quote ("""...""" or '''...''')
  • The minimum indentation in all non-empty lines is stripped from all lines
  • Newlines are preserved unless they are escaped by a \ as the last character in the line

The only thing that is not clear to me is how to handle mixed indentation. For example, is "<space><tab><space>" longer or shorter than "<space><space><space><space>"? I think the two most reasonable options for mixed indentation are:

  1. Raise an error
  2. Treat each whitespace character as equal (i.e. a tab and a space both count as 1).

Preference?

jdidion avatar Oct 20 '22 20:10 jdidion

For now I went with option 2. I also updated the section on stripping leading whitespace from commands to make it consistent.

It looks like my editor also reformatted all the tables in the spec. If that's a problem I can redo the commit.

jdidion avatar Oct 21 '22 00:10 jdidion

I'm good w/ this

geoffjentry avatar Oct 25 '22 20:10 geoffjentry

Should we just explicate that/whether expression placeholders work inside multi-line strings?

mlin avatar Oct 28 '22 04:10 mlin

Thanks @mlin - I added a note on this in the section on string interpolation.

jdidion avatar Oct 28 '22 17:10 jdidion

Looks good to me!

patmagee avatar Nov 01 '22 01:11 patmagee