snakefmt icon indicating copy to clipboard operation
snakefmt copied to clipboard

snakefmt corrupts heredoc string

Open dariober opened this issue 4 years ago • 18 comments
trafficstars

snakefmt v0.4.0 reformats this rule:

rule one:
    output:
        out_file="out.txt",
    shell:
        r"""
cat <<'EOF'> tmp.txt

touch {output}

EOF
bash tmp.txt
        """

into:

rule one:
    output:
        out_file="out.txt",
    shell:
        r"""
        cat <<'EOF'> tmp.txt

        touch {output}

        EOF
        bash tmp.txt
        """

the reformatted indentation makes the heredoc invalid since bash expects the closing EOF at the start of the line. When execute it I get:

snakemake -p -j 1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job      count    min threads    max threads
-----  -------  -------------  -------------
all          1              1              1
one          1              1              1
total        2              1              1

Select jobs to execute...

[Tue Oct 26 09:06:30 2021]
rule one:
    output: out.txt
    jobid: 1
    resources: tmpdir=/tmp


        cat <<'EOF'> tmp.txt

        touch out.txt

        EOF
        bash tmp.txt
        
/bin/bash: line 6: warning: here-document at line 1 delimited by end-of-file (wanted `EOF')
Waiting at most 5 seconds for missing files.

I tried to upgrade to snakefmt 0.4.4 using mamba. This maybe a separate issue, I get:

mamba update snakefmt # Completes OK and installs 0.4.4

snakefmt -h
Traceback (most recent call last):
  File "/home/dario/miniconda3/envs/tritume/bin/snakefmt", line 6, in <module>
    from snakefmt.snakefmt import main
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/snakefmt/snakefmt.py", line 9, in <module>
    from black import get_gitignore
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/__init__.py", line 48, in <module>
    from black.files import find_project_root, find_pyproject_toml, parse_pyproject_toml
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/files.py", line 26, in <module>
    from black.handle_ipynb_magics import jupyter_dependencies_are_installed
  File "/home/dario/miniconda3/envs/tritume/lib/python3.6/site-packages/black/handle_ipynb_magics.py", line 15, in <module>
    from typing_extensions import TypeGuard
ImportError: cannot import name 'TypeGuard'

dariober avatar Oct 26 '21 08:10 dariober

Hi @dariober can you please retry with v0.4.4 - we have made some changes to how shell strings are handled since v0.4.0

mbhall88 avatar Oct 27 '21 09:10 mbhall88

@mbhall88 Thanks for looking into this. I made a new conda environment, updated snakefmt and the problem persists in v0.4.4.

dariober avatar Oct 27 '21 13:10 dariober

Ok, this seems to be related to #39 where we made a conscious decision to format these shell strings.

@bricoletc maybe we need to add some functionality whereby r-strings (like in the above example) are left alone?

mbhall88 avatar Oct 29 '21 03:10 mbhall88

I would be happy for r-strings to be left alone (I got the habit of always using r-strings for shell directives unless required otherwise) but I would still consider it a bug. A user may write a heredoc without r-formatting and find it broken after snakefmt.

dariober avatar Oct 29 '21 08:10 dariober

@dariober sorry for the massive delay. Could you try https://github.com/mbhall88/snakefmt/commit/b591a812ff516469eb42c4f4734a44e9328ca32e and see if this works as you expect? I added a test case using your example above and this is indeed formatted (i.e. not formatted) as expected

mbhall88 avatar Jun 15 '22 06:06 mbhall88

@mbhall88 thanks for that and no worries about the delay! I can confirm that the version you linked works on the test file I posted.

I would still contend that the content of a string is "data" not "code" and as such it shouldn't be edited in any way. I mean, you cannot make assumptions about what the user put in a string. But I assume you guys have discussed this and opted for this solution...

dariober avatar Jun 16 '22 07:06 dariober

That's a good point made by @dariober , @mbhall88 black doesn't change the following python code:

def p():
    result = r"""
cat <<'EOF'> tmp.txt

touch {output}

EOF
bash tmp.txt
    """

So at some level we should also be doing that

bricoletc avatar Jun 16 '22 08:06 bricoletc

Yes, you're both correct.

@bricoletc, even when I make it a normal docstring it doesn't change that example.

I think this is something we have touched on in #118 and #121 and it may have gotten a little lost in the details.

I would still contend that the content of a string is "data" not "code" and as such it shouldn't be edited in any way.

At the end of the day, this should be handled inline with black - see black's code style for strings.

Black also processes docstrings. Firstly the indentation of docstrings is corrected for both quotations and the text within, although relative indentation in the text is preserved. Superfluous trailing whitespace on each line and unnecessary new lines at the end of the docstring are removed. All leading tabs are converted to spaces, but tabs inside text are preserved. Whitespace leading and trailing one-line docstrings is removed.

So after all of this, I'm a little confused as to what needs to change?

mbhall88 avatar Jun 17 '22 00:06 mbhall88

Black also processes docstrings. ...

@mbhall88 , maybe you are conflating docstrings, which is ok to reformat, with heredoc strings

dariober avatar Jun 17 '22 14:06 dariober

I think there is some confusion about docstrings and multiline strings. Python doesn't have heredocs

Other high-level languages such as Python, Julia and Tcl have other facilities for multiline strings.

So I guess black's actions on multiline strings is the behaviour we should mirror at the end of the day.

I think nearly all of the confusion about what to do here stems from the shell directive and whether to align it with the snakemake rule. Maybe we could get some input from @johanneskoester here?

mbhall88 avatar Jun 19 '22 23:06 mbhall88

Just a side question @dariober , can you describe why heredoc is used here? It seems to me like you can achieve the same effect by doing:

echo "touch {output}" > tmp.txt
bash tmp.txt

This is of course ultimately unrelated to the issue as we do have an issue with heredoc here

bricoletc avatar Jun 20 '22 07:06 bricoletc

@bricoletc,

can you describe why heredoc is used here?

Sometimes I use heredoc strings to execute small R scripts like this:

rule one:
    input:
        txt= 'input-table.txt',
    output:
        out= 'output-table.txt',
    shell:
        r"""
cat <<'EOF' > {rule}.$$.tmp.R

dat <- read.table('{input.txt}')
... do stuff ...
write.table(out, '{output.out}')

EOF
Rscript {rule}.$$.tmp.R
rm {rule}.$$.tmp.R
        """

Basically, dump the R code to a temporary file and execute it with Rscript

For small R scripts I find it more transparent than using the script directive. See also this question on SO

dariober avatar Jun 20 '22 07:06 dariober

To confirm I have this correct, I will test out the fix I implemented for this issue and see how it performs on other heredoc-style strings and ensure we don't indent multiline strings?

mbhall88 avatar Jun 21 '22 00:06 mbhall88

Neat @dariober ! Yes @mbhall88 that sounds good

bricoletc avatar Jun 21 '22 07:06 bricoletc

@mbhall88 how about we simply not indent if the shell string contains some regexp capturing EOF, e.g. "<<['\"]EOF ?

bricoletc avatar Jun 24 '22 12:06 bricoletc

I guess my only reservation with that approach is whether there are examples other than a heredoc where the indenting in a multiline string matters?

mbhall88 avatar Jun 26 '22 23:06 mbhall88

@bricoletc :

some regexp capturing EOF, e.g. "<<['"]EOF

I suspect that is going to be tricky because EOF is not a special keyword. You can delimit the heredoc with any string you like (well, not any string...). Other tricky cases could be like:

shell:
    """
    python -c "
    ...some properly indented python code here. Possibly containing raw-formatted strings...?
    "
    """

Personally, I would leave the content of the strings alone and just re-align the opening and closing quotes.

dariober avatar Jun 27 '22 07:06 dariober

@dariober this should now be fixed in https://github.com/mbhall88/snakefmt/commit/4c83e0692c1613f718671c414396a58aafe8bd5c.

Feel free to try it out if you like. I will create a PR soon to get this into a release

mbhall88 avatar Jun 29 '22 05:06 mbhall88