dvc
dvc copied to clipboard
dvc: windows dir out containing files with / in their name will explode
Using master branch.
When we save dir cache we convert filenames in it to posix style in a lossy way:
"hey\there" -> "hey/there"
"hey/there" -> "hey/there"
which will lead to hey/there
becoming hey\there
after back transform, making filename contain path sep, which will lead to cache mismatch and other interesting behavior.
Same issue with other places posix path strings are used as serialization:
- relative local output paths saved to stage files
- stage
wdir
@Suor Good catch! Do you suggest that we raise an error if we encounter illegal slash on particular os?
So we may just not permit having /
in filenames on Windows or escape it somehow when converting it to posix. That latter could be messy though.
/ is not allowed on Windows, right?
how does GIt deal with \ in a path? does not allow them?
You can use /
on Windows, but not \
, Unix is reverse.
As far as I can see git does nothing to mitigate that. It commits paths containing \
on Unix and /
on Windows normally and then everything fails when you checkout on different system.
I think this issue may be related to a plot generation bug that I have encountered on Windows. In params.yaml:dirs
, I specify file paths. A JSON schema generated by Pydantic facilitates auto-completion of keys in params.yaml:dirs
with sensible default paths, and uses \
as the path separator (due to underlying pathlib
usage). For example, I can use params.yaml:dirs:results
in a dvc.yaml
stage as ${dirs:my_plot_file}
, and that will lead to, say .\data\results\plot.png
being used in dvc.yaml
. If a plot is specified in this manner, it will not show up properly in experiments, or Iterative Studio, or via dvc plots show
.
I will try to produce a minimal reproduction repo soon, perhaps all it takes is \
in a path in dvc.yaml
without all the ${...}
funny-business, to break plots. It seems to only break plots for me, as metrics and outs specified with \
don't fail. Its seemingly limited to plots.
This behavior is actually quite difficult to override in Python, due to pathlib
uses some dynamic attribute loading to coerce Path
to WindowsPath
on Windows. I wish I could just tell pathlib
to let me use PosixPath
on Windows, because Windows supports such paths just fine.
The workaround is to just never use \
in your yaml
files, but since tooling like pydantic
and pathlib
pass around WindowsPath
by default, it makes it difficult to automate/de-duplicate things when coupling these tools with dvc
.
@blakeNaccarato, we don't use pydantic. Not sure about the issue here, but most likely we have some plots collection bug where we are assuming posixpath. Would be great if you could post a repro. Thanks.
I will find time to repro this bug sometime this weekend, and also determine whether it's related to the issue at hand. Since my issue is related to ingesting \
in dvc.yaml
causing issues, where the root issue is about /
. The over-arching issue seems to be that DVC is not platform-agnostic regarding path separators in some corner cases.
I figured out a workaround for my specific issue, using Pydantic to generate schemas for params.yaml
which then provide templated values to dvc.yaml
deps/outs/plots/etc. In case anyone is using this particular combo, this is my workaround below...
Workaround for users of Pydantic and DVC on Windows with path issues
I orchestrate stages inside an editable-installed package in my repo, to facilitate Pylance/pyright auto-completion and refactoring project-wide. Fixing Pydantic schema generation to produce POSIX paths even on Windows amounts to using the schema_extra
configuration attribute (see near the bottom of this page in Pydantic docs) to replace \\
with /
in the default paths dumped to schema.
src/mypackage/models/dirs.py
from __future__ import annotations
from pathlib import Path
from typing import Any, Optional
from pydantic import BaseModel, DirectoryPath, FilePath, validator
from mypackage.models.common import StrPath
class Dirs(BaseModel):
"""Directories relevant to the project."""
class Config:
@staticmethod
def schema_extra(schema: dict[str, Any], model: Dirs):
for prop in schema.get("properties", {}).values():
if default := prop.get("default"):
prop["default"] = default.replace("\\", "/")
base: DirectoryPath = Path(".")
package: DirectoryPath = base / "src/mypackage"
stages: DirectoryPath = package / "stages"
models: DirectoryPath = package / "models"
data: DirectoryPath = base / "data"
project_schema: DirectoryPath = data / "schema"
...
literature: DirectoryPath = data / "literature"
literature_results: DirectoryPath = data / "literature_results"
file_literature_results: Path = literature_results / "lit.csv"
modelfun: DirectoryPath = data / "modelfun"
file_model: Path = modelfun / "model.dillpickle"
...
# "always" so it'll run even if not in YAML
# "pre" because dir must exist pre-validation
@validator(
"project_schema",
"literature_results",
"modelfun",
...,
always=True,
pre=True,
)
def validate_output_directories(cls, directory: StrPath):
"""Re-create designated output directories each run, for reproducibility."""
directory = Path(directory)
directory.mkdir(parents=True, exist_ok=True)
return directory
Okay here's a repro that breaks a plot when there's a \
in the path. The plot path\fail.png
doesn't show up in rendered HTML of dvc plots show
, while path/pass.png
does. The same goes for the plot rendering in the DVC VSCode extension.
The repro is in main, then in fix-offending-plot the \
is swapped for /
in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml
. Each branch has the output of dvc plots show
committed as well.
Okay here's a repro that breaks a plot when there's a
\
in the path. The plotpath\fail.png
doesn't show up in rendered HTML ofdvc plots show
, whilepath/pass.png
does. The same goes for the plot rendering in the DVC VSCode extension.The repro is in main, then in fix-offending-plot the
\
is swapped for/
in the offending plot, then in remove-offending-plot the offending plot is commented out ofdvc.yaml
. Each branch has the output ofdvc plots show
committed as well.
Closing in favor of https://github.com/iterative/dvc/issues/8689