cylc-flow icon indicating copy to clipboard operation
cylc-flow copied to clipboard

parsec: keep whitespace inside quoted single-line values

Open hjoliver opened this issue 2 years ago • 8 comments

Check List

Address #5080 - parsec should only dedent and strip multi-line string values.

  • [x] I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • [x] Contains logically grouped changes (else tidy your branch by rebase).
  • [x] Does not contain off-topic changes (use other PRs for other changes).
  • [x] Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • [ ] Tests are included (or explain why tests are not needed).
  • [ ] CHANGES.md entry included if this is a change that can affect users
  • [x] Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • [ ] If this is a bug fix, PRs raised to both master and the relevant maintenance branch.

hjoliver avatar Aug 18 '22 03:08 hjoliver

Test drive it:

[scheduling]
   [[graph]]
      R1 = a
[runtime]
   [[a]]
      script = """
          echo "only space: >>>${ONLY_SPACE}<<<"
          echo "leading   : >>>${LEADING_SPACE}"
          echo "trailing  : ${TRAILING_SPACE}<<<"
          echo "unquoted 1: >>>${UNQUOTED1}<<<"
          echo "unquoted 2: >>>${UNQUOTED2}<<<"
      """
      [[[environment]]]
           ONLY_SPACE='   '
           LEADING_SPACE='   x'
           TRAILING_SPACE='y   '   # blah
           UNQUOTED1 =   z    
           UNQUOTED2 =   z   # blah

Result:

only space: >>>   <<<
leading   : >>>   x
trailing  : y   <<<
unquoted 1: >>>z<<<
unquoted 2: >>>z   <<<

If this doesn't break anything else, only unquoted2 needs fixing (if a trailing comment gets stripped from an unquoted value, also strip the whitespace).

hjoliver avatar Aug 18 '22 03:08 hjoliver

I think we have to be very careful with this one as we've been working with this "feature" for a long time.

After a quick scan it looks like a high proportion of workflows match the pattern "$ (which largely yields [environment] results).

Hard to say whether these will cause problems, one of the more common cases is ROSE_OPT_CONF_KEYS which should tolerate arbitrary whitespace so no problem there. However, I found an example like this too:

SOME_FILE = "$CYLC_SUITE_RUN_DIR/bla/bla/bla "

Which would almost certainly cause a breakage.

oliver-sanders avatar Aug 18 '22 08:08 oliver-sanders

After a quick scan it looks like a high proportion of workflows match the pattern "$ (which largely yields [environment] results).

Not sure I understand that, can you give an example?

However, I found an example like this too:

Sorry, that is just user error! Users could potentially end up relying on any Cylc bug, that doesn't mean we should never fix it.

I wonder if we can make validate and/or lint flag this kind of thing?

hjoliver avatar Aug 18 '22 09:08 hjoliver

Sorry, that is just user error!

Disagree. It is a parsec feature that arbitrary whitespace is always stripped and as a result workflows have grown into this designed behaviour for many years.

Now we have decided that this behaviour doesn't really make sense for one particular section [fair enough] and would like to change it. This could potentially cause problems. Given that this use case has never shown up before I'm not sure it's worth the effort, at least not unless we can confirm it's not going to cause significant issues or find a way to mitigate against them. Just saying we can't afford to be too hasty with this one, it's a potentially nasty breaking change.

Especially given that Jinja2 is often used to template environment variables there are plenty of ways arbitrary whitespace could find its way into env vars. This would be a very hard to debug error if it showed up in the wild.

It's a but difficult to scan for this one, will try and get a better handle on the likelyhood of issues and get back.

oliver-sanders avatar Aug 18 '22 09:08 oliver-sanders

It's a but difficult to scan for this one, will try and get a better handle on the likelyhood of issues and get back.

Cool, thanks.

It is a parsec feature that arbitrary whitespace is always stripped and as a result workflows have grown into this designed behaviour for many years.

Designer of parsec here :grin: (not that the design of parsec is entirely something to be proud of!). Stripping arbitrary whitespace from a config file is pretty much necessary to make the format usable, but whitespace inside a quoted value is not arbitrary. If a user writes foo = " xxx " then we should assume that value is supposed to contain those spaces.

Stripping that sort of space was definitely not a design choice as far as I recall. It was either an unfortunate omission (in which case, my bad) or pragmatic short-cut during implementation that I should have come back to fix (my bad) or a mistake that crept in sometime later (partly my bad).

That's not to say, of course, that users haven't come to rely on the old behaviour. But that doesn't necessarily mean it shouldn't be fixed. And so far we have hard evidence of at least one user who has a problem with the status quo. If fixing it really would cause chaos, which I doubt, we could always provide a switch to retain the old behaviour (and even make that the default if necessary, although I'd rather not).

hjoliver avatar Aug 18 '22 10:08 hjoliver

Since this is an interface change mind bumping to 8.1.0?

oliver-sanders avatar Aug 18 '22 12:08 oliver-sanders

Sure, fair enough.

hjoliver avatar Aug 18 '22 20:08 hjoliver

Re-targeted to master and 8.1.0

TBD:

  • fix the trailing comment case above
  • consider the likely impact on users, and hence whether or not we should continue optional support for the old way
  • tests

hjoliver avatar Aug 18 '22 20:08 hjoliver

I've scanned through 54377 cylc-run installations looking for spaces at the end of environment variables :sweat:.

I've found a fair few examples, for the most part they are harmless, however, I have identified a few cases where this change would result in breakages.

Mostly Harmless

Starting with the cases where this change should be harmless.

ROSE_APP_OPT_CONF_KEYS

For some unknown reason the overwhelming majority of examples are with ROSE_APP_OPT_CONF_KEYS.

ROSE_APP_OPT_CONF_KEYS = "foo bar baz " 

I have absolutely no idea why people add spaces onto the end of this variable but not others. The good news is from experimentation, Rose 2019 and 2 are both fine with arbitrary whitespace in this variable.

CLI Opts

There are a few examples of CLI options being held in env vars e.g:

SELECT_VARS_CMD = '--selected-vars ${SELECTED_VARS} '

However, when I checked usage these were expanded like so:

--abc ${SELECT_VARS_CMD} --def

So no problem.

Lists

There are a fair few space or space+comma separated lists out there:

REMOTE_HOST = e f

It's hard to say how any systems these lists are passed into would handle this, however, where these lists were used within run directory scripts they were mostly iterated over in a way which would ignore arbitrary whitespace e.g:

for host in $REMOTE_HOST; do

So no problem

Not So Harmless

I did manage to find some examples where I could confirm this change would definitely cause breakages by looking at the run dir files alone.

-n tests

I found one example (present in a few working copies but the same example each time) with arbitrary whitespace:

SSTBIAS_LIST="28 24 44 38 34 "

Which was used like so:

if [[ -n ${SSTBIAS_LIST} ]]; then

Which would be a change in behaviour.

Bunch

I found a couple of issues with rose_bunch scripts. Here's a double whammy with two env vars:

INPUT_DIRS = "$ROSE_DATAC/mix/*/blendgrids "
PREPROCESSING_STEPS = "'percentile_extract ecc netcdf_to_ff' "

Which would both cause breakages in this rose_bunch app:

[bunch]
command-format=set -eu
               = ...
               = ... "%(input_dir)s" ... "%(preprocessing_steps)s" --diags="$INPUT_DIAGS"

[bunch-args]                                                                                                  
stream=${STREAM_LIST}                                                                                         
preprocessing_steps=${PREPROCESSING_STEPS}  

Because Rose does not strip whitespace from variables and even preserves it when one variable is defined from another this whitespace would be preserved through several layers.

Unknown

There are few examples where there is no direct usage of the variable within the workflow run dir making it hard to tell whether the change is likely to be harmful or not. One example:

FCRLIST="1200 2400 3600 4800 "

As well as examples where corner cases may now result in buggy behaviour

{% set c = '' %}
FOO="a b {{ c }}"

oliver-sanders avatar Nov 18 '22 12:11 oliver-sanders

Meeting: we agreed that quoted whitespace should not be stripped in principle, but in the Cylc context (a) there's no obvious need to support that, and (b) there's evidence that it would break some existing workflows that rely on the existing behaviour... so, closing this.

hjoliver avatar Jul 17 '23 10:07 hjoliver

(@oliver-sanders - I just noticed you bumped the milestone way back - do you want to keep it that way rather than just close it?)

hjoliver avatar Jul 17 '23 10:07 hjoliver

Bumped the milestone to get it off 8.2.0, happy to close this PR. Can open an issue to consider any alternative proposal.

oliver-sanders avatar Jul 17 '23 10:07 oliver-sanders