wdl
wdl copied to clipboard
Multi-line strings
The ability to have multi-line strings has been requested several times. It would also make easier several of the other proposals under consideration, such as in-line Dockerfile and evaluation of arbitrary expressions. I went with Python/Scala-style rather than HEREDOC style as that seems most idiomatic for WDL.
This is implemented in the v2 ANTLR4 grammar in a wdlTools branch: https://github.com/dnanexus-rnd/wdlTools/tree/multiline-strings/src/main/antlr4/v2
Checklist
- [ ] Pull request details were added to CHANGELOG.md
@jdidion This is amazing. I believe this will be considered a highly welcome addition to the WDL specification!
- IMO the "least surprising" behavior would be that all whitespace in the multi-line string are preserved by default.
- I like the idea of starting a line with
|
to indicate "strip margin" but in practice, people might try to use this to generate bash commands, in which case erasing|
s and>
feels dangerous. Could that be a function (eg likestrip_margin(multiline_string)
rather than default behavior)?
We could also have a remove_newlines(multiline_string)
to replace the "we remove all newlines by default" behavior currently proposed
My feeling was that the most common usage for multi-line strings will be to break up long strings. If you want to write bash, you put that in the command block, no?
I don't think we need a special remove_newlines
function - just use sub(str, "\n", " ")
.
Coming from scala might be distorting my sense of "what people expect" w.r.t. newlines in multi-line strings.
Usually the "what would people expect" tie-breaker in cases of different aesthetic preferences is "what would python do", which seems appropriate here too.
As far as I can tell, python also preserves newlines in multi-line string literals (https://www.techbeamers.com/python-multiline-string/). Unless there's a really strong case against it, I think that should be what we do.
Yes, I think bash commands usually go in command blocks, but I still think having a syntax that erases very common bash-style syntax by default in multi-line string literals feels dangerous to me
Personally, I find it very annoying to have to litter my Scala code with .stripMargin.replaceAll("\n", " ")
.
I agree that "least surprising" is a good approach, but I also feel it can be over-ridden by the desire to simplify the most common use case. We can add more examples to make it clear that if you want to write a multi-line bash string (or whatever) that preserves newlines you have to use |
or >
prefixes.
That said, I admit I may be wrong about what will be the most common use-case.
I'd argue with your premise there though that the most common use case is "I want to split a single line string across many lines". I think the most common usage would be "I want a to write a multi-line string directly in my WDL". In which case the simplest way to provide that is to make it the default.
But we're arguing our own personal preferences here - which will never have a definitive answer. I think "choose the options that's most similar to python" is the most objective possible way to resolve these discussions.
Fair enough. Let's wait to get a bit more feedback from others. I'm happy to update the spec and grammar if the consensus is to leave in whitespace by default.
Agreed! An n > 2 is always a good idea...
Personally, I find the least surprising thing to be represent the string "as-is" with new lines intact. for me that is the main point of a multiline string, is to write across multiple lines. The tricky part I see is around preserving indents or removing them. I believe in python (since its tab/space delimeted) this is easy, since you simply strip out leading tabs corresponding to the current indent level. In WDL this is a bit trickier. we could
- Not touch anything
- Strip the least number of common tabs
- Define a syntax (ie
|
) for retaining those indents
Ok, how about:
- By default, leave newlines and strings alone
- If a line is prefixed by
|
, automatically strip all the leading whitespace before it (as well as the|
itself)- The
|
and whitespace is left in tact if it is escaped
- The
- If the user wants to remove newlines they have to
sub(str, "\n", " ")
I really worry about strings like
String command_prefix = """echo hi \
| grep foo \
| wc -l"""
Oh, good point.
In that case I guess we may need a trim_leading_whitespace
function. By default this would strip the least number of common spaces/tabs after each newline. We could have an optional second argument that would do any/all of the following:
- If it is an integer, strip (up to) that many leading whitespace characters
- If it is a whitespace-only string, strip exactly that whitespace from the beginning of each line (and leave in-tact lines that don't begin with exactly that whitespace)
- If it is a single non-whitespace character, strip all leading whitespace up to and including that character any time that character is the first non-whitespace character in the line
A simpler alternative would be a dedent
function (akin to the one in python textwrap https://docs.python.org/3/library/textwrap.html#textwrap.dedent) that just removes leading common whitespace, and not try to have a syntax for margin. It could take a second boolean parameter which says whether to also strip out newlines.
Those ideas all sound reasonable to me.
Have you looked at the spec for Java text blocks? The java spec supports both automatically stripping newlines and preserving relative indent and doesn't require a lot of extra syntax. https://openjdk.java.net/jeps/378
I do like that java text block spec's interpretation of "intrinsic whitespace", which matches very closely the command
block behavior.
Ive just started using Java's implementaiton of text blocks and they are very intuitive, so I give a big +1 to that!
is the discussion on how to handle whitespace still live or do we have general consensus?
This would be really helpful, we have a large CWL pipeline we want to convert to WDL which uses YAML text blocks extensively for SQL queries. Not having an equivalent in WDL has been frustrating, as the SQL needs to be squashed into a single line. This feature would be greatly appreciated.
@jdidion any update on the status of this? It would be great to get this to voting
FWIW I'd vote for the Java text block syntax. Since my last post we've resorted to a yaml → wdl converter which isn't fun.
:+1: to the usefulness of this feature!
I am good with Java text block syntax. I haven't read the spec in detail but it looks to me like the important things are:
- Strings start and end with three matching quote (
"""..."""
or'''...'''
) - The minimum indentation in all non-empty lines is stripped from all lines
- Newlines are preserved unless they are escaped by a
\
as the last character in the line
The only thing that is not clear to me is how to handle mixed indentation. For example, is "<space><tab><space>"
longer or shorter than "<space><space><space><space>"
? I think the two most reasonable options for mixed indentation are:
- Raise an error
- Treat each whitespace character as equal (i.e. a tab and a space both count as 1).
Preference?
For now I went with option 2. I also updated the section on stripping leading whitespace from commands to make it consistent.
It looks like my editor also reformatted all the tables in the spec. If that's a problem I can redo the commit.
I'm good w/ this
Should we just explicate that/whether expression placeholders work inside multi-line strings?
Thanks @mlin - I added a note on this in the section on string interpolation.
Looks good to me!