dvc icon indicating copy to clipboard operation
dvc copied to clipboard

dvc: windows dir out containing files with / in their name will explode

Open Suor opened this issue 5 years ago • 10 comments

Using master branch.

When we save dir cache we convert filenames in it to posix style in a lossy way:

"hey\there" -> "hey/there"
"hey/there" -> "hey/there"

which will lead to hey/there becoming hey\there after back transform, making filename contain path sep, which will lead to cache mismatch and other interesting behavior.

Suor avatar May 28 '19 10:05 Suor

Same issue with other places posix path strings are used as serialization:

  • relative local output paths saved to stage files
  • stage wdir

Suor avatar May 28 '19 11:05 Suor

@Suor Good catch! Do you suggest that we raise an error if we encounter illegal slash on particular os?

efiop avatar May 29 '19 01:05 efiop

So we may just not permit having / in filenames on Windows or escape it somehow when converting it to posix. That latter could be messy though.

Suor avatar May 29 '19 09:05 Suor

/ is not allowed on Windows, right?

how does GIt deal with \ in a path? does not allow them?

shcheklein avatar Jun 09 '19 06:06 shcheklein

You can use / on Windows, but not \, Unix is reverse.

As far as I can see git does nothing to mitigate that. It commits paths containing \ on Unix and / on Windows normally and then everything fails when you checkout on different system.

Suor avatar Jun 10 '19 10:06 Suor

I think this issue may be related to a plot generation bug that I have encountered on Windows. In params.yaml:dirs, I specify file paths. A JSON schema generated by Pydantic facilitates auto-completion of keys in params.yaml:dirs with sensible default paths, and uses \ as the path separator (due to underlying pathlib usage). For example, I can use params.yaml:dirs:results in a dvc.yaml stage as ${dirs:my_plot_file}, and that will lead to, say .\data\results\plot.png being used in dvc.yaml. If a plot is specified in this manner, it will not show up properly in experiments, or Iterative Studio, or via dvc plots show.

I will try to produce a minimal reproduction repo soon, perhaps all it takes is \ in a path in dvc.yaml without all the ${...} funny-business, to break plots. It seems to only break plots for me, as metrics and outs specified with \ don't fail. Its seemingly limited to plots.

This behavior is actually quite difficult to override in Python, due to pathlib uses some dynamic attribute loading to coerce Path to WindowsPath on Windows. I wish I could just tell pathlib to let me use PosixPath on Windows, because Windows supports such paths just fine.

The workaround is to just never use \ in your yaml files, but since tooling like pydantic and pathlib pass around WindowsPath by default, it makes it difficult to automate/de-duplicate things when coupling these tools with dvc.

blakeNaccarato avatar Nov 04 '22 03:11 blakeNaccarato

@blakeNaccarato, we don't use pydantic. Not sure about the issue here, but most likely we have some plots collection bug where we are assuming posixpath. Would be great if you could post a repro. Thanks.

skshetry avatar Nov 04 '22 04:11 skshetry

I will find time to repro this bug sometime this weekend, and also determine whether it's related to the issue at hand. Since my issue is related to ingesting \ in dvc.yaml causing issues, where the root issue is about /. The over-arching issue seems to be that DVC is not platform-agnostic regarding path separators in some corner cases.

I figured out a workaround for my specific issue, using Pydantic to generate schemas for params.yaml which then provide templated values to dvc.yaml deps/outs/plots/etc. In case anyone is using this particular combo, this is my workaround below...

Workaround for users of Pydantic and DVC on Windows with path issues

I orchestrate stages inside an editable-installed package in my repo, to facilitate Pylance/pyright auto-completion and refactoring project-wide. Fixing Pydantic schema generation to produce POSIX paths even on Windows amounts to using the schema_extra configuration attribute (see near the bottom of this page in Pydantic docs) to replace \\ with / in the default paths dumped to schema.

src/mypackage/models/dirs.py
from __future__ import annotations

from pathlib import Path
from typing import Any, Optional

from pydantic import BaseModel, DirectoryPath, FilePath, validator

from mypackage.models.common import StrPath

class Dirs(BaseModel):
    """Directories relevant to the project."""

    class Config:
        @staticmethod
        def schema_extra(schema: dict[str, Any], model: Dirs):
            for prop in schema.get("properties", {}).values():
                if default := prop.get("default"):
                    prop["default"] = default.replace("\\", "/")

    base: DirectoryPath = Path(".")
    package: DirectoryPath = base / "src/mypackage"
    stages: DirectoryPath = package / "stages"
    models: DirectoryPath = package / "models"
    data: DirectoryPath = base / "data"
    project_schema: DirectoryPath = data / "schema"

    ...

    literature: DirectoryPath = data / "literature"
    literature_results: DirectoryPath = data / "literature_results"
    file_literature_results: Path = literature_results / "lit.csv"

    modelfun: DirectoryPath = data / "modelfun"
    file_model: Path = modelfun / "model.dillpickle"

    ...

    # "always" so it'll run even if not in YAML
    # "pre" because dir must exist pre-validation
    @validator(
        "project_schema",
        "literature_results",
        "modelfun",
        ...,
        always=True,
        pre=True,
    )
    def validate_output_directories(cls, directory: StrPath):
        """Re-create designated output directories each run, for reproducibility."""
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        return directory

blakeNaccarato avatar Nov 04 '22 18:11 blakeNaccarato

Okay here's a repro that breaks a plot when there's a \ in the path. The plot path\fail.png doesn't show up in rendered HTML of dvc plots show, while path/pass.png does. The same goes for the plot rendering in the DVC VSCode extension.

The repro is in main, then in fix-offending-plot the \ is swapped for / in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml. Each branch has the output of dvc plots show committed as well.

blakeNaccarato avatar Nov 09 '22 04:11 blakeNaccarato

Okay here's a repro that breaks a plot when there's a \ in the path. The plot path\fail.png doesn't show up in rendered HTML of dvc plots show, while path/pass.png does. The same goes for the plot rendering in the DVC VSCode extension.

The repro is in main, then in fix-offending-plot the \ is swapped for / in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml. Each branch has the output of dvc plots show committed as well.

Closing in favor of https://github.com/iterative/dvc/issues/8689

daavoo avatar Dec 20 '22 15:12 daavoo