spaCy icon indicating copy to clipboard operation
spaCy copied to clipboard

Fix command parsing on Windows in Projects

Open polm opened this issue 3 years ago • 14 comments

Description

The way we parse commands from the project file before running them on Windows with shlex is incorrect. The practical effect of this is that quoting that should work normally in Windows, or works directly in the Windows command prompt, would fail if used in a project.

The issue is in split_command, which uses shlex(cmd, posix=False) on Windows. This looks reasonable but it turns out posix=False doesn't mean it's suitable for Windows, it's a non-POSIX compliant fallback mode. (There is no mode for Windows compatibility in shlex or the Python stdlib.)

It looks like the correct thing to do on Windows is to pass the command as a string without any splitting ahead of time. If you don't do that then a string is assembled using the list2cmdline function internally. In my simple tests using posix=True worked equally well though.

Here's a simple test I wrote and tried on Windows and Linux to show the difference in shlex behaviour.

import shlex
import subprocess

def run(cmd):
    # just so we can try all three of them
    try:
        subprocess.run(cmd)
    except Exception as ee:
        print("failed with exception:", ee)
        

# raw string: windows yes, linux no
cmd = 'bash -c "echo hi"'
print("run raw string:", cmd)
run(cmd)

# posix=False: windows no, linux no
cmd_list = shlex.split(cmd, posix=False)
print("run posix=False:", cmd_list)
run(cmd_list)

# posix=True: windows yes, linux yes
cmd_list_posix = shlex.split(cmd, posix=True)
print("run posix=True:", cmd_list_posix)
run(cmd_list_posix)

This change needs to be tested thoroughly, but I don't think it should break any existing scripts unless they relied on literal quotes in filenames. I think we can start by running it on all our projects on Windows. There should be no change in behavior on Linux.

Marking as draft until testing can be done.

Types of change

bug fix

Checklist

  • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [ ] I ran the tests, and all new and existing tests passed.
  • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

polm avatar Aug 25 '22 08:08 polm

Tests are failing because Windows in CI does not have bash, I'll have to think of another test.

This seems to be the issue in #10845. The proposed fix there was using shell=True, which would probably also work, but I don't think is necessary.

polm avatar Aug 25 '22 10:08 polm

The proposed fix there was using shell=True, which would probably also work, but I don't think is necessary.

The internal consensus is not to use shell=True anyway. But yeah, IIRC it would work.

rmitsch avatar Aug 25 '22 10:08 rmitsch

OK, got the test working. Now this needs to be tested on real projects.

polm avatar Aug 26 '22 05:08 polm

OK, I ran this on all the projects in the projects repo. Not all of them completed successfully, but there were no issues related to this change at all, so I think this fix is correct. That said, someone more familiar with Windows should take a look at it.

polm avatar Aug 30 '22 05:08 polm

I checked and it looks like this util function isn't used in any of our other repos, so it looks like it'll be safe to remove.

One thing is that we aren't just using this command to split things, we also inspect the first token sometimes to rewrite commands. With no way to split things on Windows that may be an issue. Anyway, I'll take a look at it and see what needs to be done.

polm avatar Sep 01 '22 08:09 polm

Still dealing with some issues with tests, but current status is:

The split_command function has been removed and replaced with shlex.split where appropriate. For Windows command strings are used instead, though it is possible to call a command list if you pass it to util.run_command, which should also work. This is occasionally used in spaCy, as in spacy package, and may need further checks.

shlex.split is still used on Windows commands to get the "tool" value for an error message if running the command fails. The output for this should be correct for reasonable command lines, but won't handle every valid quoted commandline the way Windows would. In order to be exactly correct we would need a full parser for Windows command lines; the sample function here might work but would need to be evaluated.

One other change in behaviour is that currently, if a command starts with python, pip, or either ending in "3", we rewrite it to use sys.executable. In the current state of the PR this behavior is just skipped on Windows. Detecting simple cases of the commands (like unquoted) isn't too hard, but I need to research more to verify the right way to quote sys.executable in order to re-assemble the command. (It's pretty easy to imagine sys.executable having a space in the path somewhere on Windows.) In tests with cmd.exe in Wine, simple quoting like "dir" didn't work.

polm avatar Sep 02 '22 05:09 polm

After review I believe the current state of this is reasonable.

The only thing not handled equivalently on Windows at the moment is rewriting commands that start with python or pip to use sys.executable. There are some ways to do this, but they're kind of complicated and may create the opportunity for weird failure cases. Given that, and that the rewriting functionality itself is already marked as "do we want to do this?", I think it's better to just leave it alone.

Other solutions considered:

  • Use string concatenation and simple quoting for Windows sys.executable path. Issue is that that won't always work, and could fail strangely. (Unlikely but concrete example failure: quotes in the path.)
  • Use executable argument to subprocess.run. Main issue is that it's not clear what depends on argv[0] and what the full implications of changing that are.

polm avatar Sep 06 '22 05:09 polm

Also, this comment hasn't been addressed - I was talking about shlex.quote specifically.

We also shouldn't be having an utility function in spacy.util that is unsafe for some platforms.

svlandeg avatar Sep 07 '22 12:09 svlandeg

Also, this comment hasn't been addressed - I was talking about shlex.quote specifically.

Do you mean join_command? I left that since it isn't used in run_command and won't be used in spacy project calls, but I can go ahead and remove it.

shlex.quote is still currently called unconditionally in run_command, but the output is not used on Windows. I can restructure the function to make that clearer / avoid calling it entirely on Windows.

polm avatar Sep 08 '22 03:09 polm

In its previous state, I had left out a line in run_commands, resulting in breaking projects on Windows (nothing would run). That has been resolved. I think the current state should work but I'm testing it on projects now.

polm avatar Sep 08 '22 05:09 polm

I have run several projects with the latest state of this on Windows and Linux without issue, so that should be working at least. I need to test the DVC related changes still.

polm avatar Sep 08 '22 08:09 polm

There were some issues with the DVC command that weren't directly related to quoting issues that have been fixed. Mainly, if you used the --verbose flag when calling spacy project dvc, you would get a dvc.yaml file with commands like spacy project run preprocess --verbose that wouldn't work. Also there was support for a --quiet flag but no way to actually call it from the command line. So those have both been addressed.

Having fixed those issues I need to run dvc on Windows again, but I don't foresee any other issues.

polm avatar Sep 12 '22 06:09 polm

OK, I ran DVC on Windows and checked that quoted paths work. I found one more minor issue in DVC (not related to quoting) that should be resolved now. If the tests go green I think this is actually ready.

polm avatar Sep 12 '22 09:09 polm

OK, tests are green. I have commented on a few places I'd like feedback, but I think the current state is acceptable. If no changes are required the one thing I'd like to change is to remove the TODO statement about subprocess.list2cmdline.

polm avatar Sep 12 '22 11:09 polm

On Windows I have a Git Bash shell which I'm using for these tests. I have a virtualenv with spaCy installed and the projects repo checked out, and I can run the projects without issue.

It looks like in your environment, the Python in your virtualenv is not in your PATH (or after the system Python anyway) in the subprocess. Checking locally with a simple script, the python picked up by subprocess.run in my environment is still the same Python as in my virtualenv. I've seen the different version getting picked up happen to users before but I'm not sure how to reproduce it without calling the virtualenv Python directly, for example. What shell are you using?

polm avatar Sep 28 '22 06:09 polm

It needs more testing, but I added an attempt at basic rewriting on Windows. This solution 1. does not include any special code for parsing or joining Windows commands 2. should still handle the majority of cases, and in other cases it will behave predictably (by literally using the user's command).

The way this works is it uses str.partition to get the first part of the Windows command. This will work if it's simply python or pip3 or whatever. If that is a command that needs rewriting, then sys.executable is passed as the executable parameter to subprocess.run, rather than being rewritten into the command directly.

The reason I'm doing it this way is that technically sys.executable can be anything; for example it's not hard to imagine it would contain spaces. If we allow arbitrary paths like that and want to put them in an existing command string, we would require a full Windows quoting solution or run the risk of breakage. So this avoids requiring that; if we're happy to include those implementations then we have other options.

polm avatar Sep 28 '22 10:09 polm

On Windows I have a Git Bash shell which I'm using for these tests. I have a virtualenv with spaCy installed and the projects repo checked out, and I can run the projects without issue.

It looks like in your environment, the Python in your virtualenv is not in your PATH (or after the system Python anyway) in the subprocess. Checking locally with a simple script, the python picked up by subprocess.run in my environment is still the same Python as in my virtualenv. I've seen the different version getting picked up happen to users before but I'm not sure how to reproduce it without calling the virtualenv Python directly, for example. What shell are you using?

The same setup as you described (Git Bash + spaCy-specific venv), so yea the culprit must be with the order of path environment variables then.

svlandeg avatar Sep 28 '22 11:09 svlandeg

OK, I have tested this on Windows using a Python executable with a space in the path and I think it should be fine.

I have still not been able to reproduce a situation where the python called by a subprocess doesn't match the python found in the calling bash shell, and am not quite sure what causes it. I'll write what I have figured out, though mostly it's just a memo and, since commands are being rewritten, shouldn't affect the correctness of the PR in its current state.

I assume PATH is used the same way in bash as in the Windows CreateProcess, and the variable value is not different if it's not explicitly modified, though I have had trouble verifying this.

The way lookups work when calling subprocess.run, which calls CreateProcess, is documented here. There are several potential ways for things to be looked up that would ignore PATH, but none of them obviously apply to normal usage with a venv. Referring to the numbered list in the lpCommandLine documentation:

  • cases 1-2 are about a python.exe in local directories / cwd, which seems unlikely.
  • cases 3-5 cover Windows system directories. I think Python can be installed there if it's installed "for all users", but I'm not sure that's always the case.
  • case 6 mentions "App Paths" registry keys that are ignored, but those don't seem to apply to venv Python executables.

I made a small set of scripts to just run Python in a subprocess and check what's called, you can see them here.

polm avatar Sep 30 '22 08:09 polm

Given this command:

python -m spacy download 'en_core_web_sm'

An error like this:

✘ No compatible package found for ''en_core_web_sm'' (spaCy v3.4.1)

is expected behavior, because by convention Windows doesn't use single quotes to quote arguments, so they are simply passed through. bash on Windows and Powershell both treat single quotes as quotes and would remove them, but cmd.exe would give the same error you're currently getting. CreateProcess, the OS API subprocess uses, behaves similarly.

polm avatar Oct 06 '22 04:10 polm

You're right that it does work for double quotes (which it didn't before this PR), yea.

svlandeg avatar Oct 06 '22 07:10 svlandeg

As a note, we'll need to update this PR and remove the DVC fixes that are now part of https://github.com/explosion/spaCy/pull/11592. I think we can probably get #11592 merged relatively soonish, and then redo this PR against weasel.

svlandeg avatar Oct 10 '22 16:10 svlandeg

Quick update, but since it looks like we might release weasel with v4, I'm going to go ahead and prep this PR for inclusion in 3.5. In that case the only change resulting from this PR is that the current weird behavior with quotes on Windows should be replaced by more consistent behavior - in particular double quotes should work.

polm avatar Nov 08 '22 06:11 polm

On review, this shouldn't need any further changes. There are a few places I'm uncertain about and left comments, but I think the current implementation is reasonable.

polm avatar Nov 08 '22 06:11 polm

I'd be happy to review this again, but can you have a look at the merge conflicts first, @polm?

svlandeg avatar Nov 14 '22 08:11 svlandeg

Looks like the merge conflict was because there was a little overlap in modified code in #11777, should be good now.

polm avatar Nov 14 '22 09:11 polm

After internal discussion, we decided to leave this out of spaCy and instead just put it in weasel.

polm avatar Nov 28 '22 06:11 polm