dvc
dvc copied to clipboard
dvc: windows dir out containing files with / in their name will explode
Using master branch.
When we save dir cache we convert filenames in it to posix style in a lossy way:
"hey\there" -> "hey/there"
"hey/there" -> "hey/there"
which will lead to hey/there becoming hey\there after back transform, making filename contain path sep, which will lead to cache mismatch and other interesting behavior.
Same issue with other places posix path strings are used as serialization:
- relative local output paths saved to stage files
- stage
wdir
@Suor Good catch! Do you suggest that we raise an error if we encounter illegal slash on particular os?
So we may just not permit having / in filenames on Windows or escape it somehow when converting it to posix. That latter could be messy though.
/ is not allowed on Windows, right?
how does GIt deal with \ in a path? does not allow them?
You can use / on Windows, but not \, Unix is reverse.
As far as I can see git does nothing to mitigate that. It commits paths containing \ on Unix and / on Windows normally and then everything fails when you checkout on different system.
I think this issue may be related to a plot generation bug that I have encountered on Windows. In params.yaml:dirs, I specify file paths. A JSON schema generated by Pydantic facilitates auto-completion of keys in params.yaml:dirs with sensible default paths, and uses \ as the path separator (due to underlying pathlib usage). For example, I can use params.yaml:dirs:results in a dvc.yaml stage as ${dirs:my_plot_file}, and that will lead to, say .\data\results\plot.png being used in dvc.yaml. If a plot is specified in this manner, it will not show up properly in experiments, or Iterative Studio, or via dvc plots show.
I will try to produce a minimal reproduction repo soon, perhaps all it takes is \ in a path in dvc.yaml without all the ${...} funny-business, to break plots. It seems to only break plots for me, as metrics and outs specified with \ don't fail. Its seemingly limited to plots.
This behavior is actually quite difficult to override in Python, due to pathlib uses some dynamic attribute loading to coerce Path to WindowsPath on Windows. I wish I could just tell pathlib to let me use PosixPath on Windows, because Windows supports such paths just fine.
The workaround is to just never use \ in your yaml files, but since tooling like pydantic and pathlib pass around WindowsPath by default, it makes it difficult to automate/de-duplicate things when coupling these tools with dvc.
@blakeNaccarato, we don't use pydantic. Not sure about the issue here, but most likely we have some plots collection bug where we are assuming posixpath. Would be great if you could post a repro. Thanks.
I will find time to repro this bug sometime this weekend, and also determine whether it's related to the issue at hand. Since my issue is related to ingesting \ in dvc.yaml causing issues, where the root issue is about /. The over-arching issue seems to be that DVC is not platform-agnostic regarding path separators in some corner cases.
I figured out a workaround for my specific issue, using Pydantic to generate schemas for params.yaml which then provide templated values to dvc.yaml deps/outs/plots/etc. In case anyone is using this particular combo, this is my workaround below...
Workaround for users of Pydantic and DVC on Windows with path issues
I orchestrate stages inside an editable-installed package in my repo, to facilitate Pylance/pyright auto-completion and refactoring project-wide. Fixing Pydantic schema generation to produce POSIX paths even on Windows amounts to using the schema_extra configuration attribute (see near the bottom of this page in Pydantic docs) to replace \\ with / in the default paths dumped to schema.
src/mypackage/models/dirs.py
from __future__ import annotations
from pathlib import Path
from typing import Any, Optional
from pydantic import BaseModel, DirectoryPath, FilePath, validator
from mypackage.models.common import StrPath
class Dirs(BaseModel):
"""Directories relevant to the project."""
class Config:
@staticmethod
def schema_extra(schema: dict[str, Any], model: Dirs):
for prop in schema.get("properties", {}).values():
if default := prop.get("default"):
prop["default"] = default.replace("\\", "/")
base: DirectoryPath = Path(".")
package: DirectoryPath = base / "src/mypackage"
stages: DirectoryPath = package / "stages"
models: DirectoryPath = package / "models"
data: DirectoryPath = base / "data"
project_schema: DirectoryPath = data / "schema"
...
literature: DirectoryPath = data / "literature"
literature_results: DirectoryPath = data / "literature_results"
file_literature_results: Path = literature_results / "lit.csv"
modelfun: DirectoryPath = data / "modelfun"
file_model: Path = modelfun / "model.dillpickle"
...
# "always" so it'll run even if not in YAML
# "pre" because dir must exist pre-validation
@validator(
"project_schema",
"literature_results",
"modelfun",
...,
always=True,
pre=True,
)
def validate_output_directories(cls, directory: StrPath):
"""Re-create designated output directories each run, for reproducibility."""
directory = Path(directory)
directory.mkdir(parents=True, exist_ok=True)
return directory
Okay here's a repro that breaks a plot when there's a \ in the path. The plot path\fail.png doesn't show up in rendered HTML of dvc plots show, while path/pass.png does. The same goes for the plot rendering in the DVC VSCode extension.
The repro is in main, then in fix-offending-plot the \ is swapped for / in the offending plot, then in remove-offending-plot the offending plot is commented out of dvc.yaml. Each branch has the output of dvc plots show committed as well.
Okay here's a repro that breaks a plot when there's a
\in the path. The plotpath\fail.pngdoesn't show up in rendered HTML ofdvc plots show, whilepath/pass.pngdoes. The same goes for the plot rendering in the DVC VSCode extension.The repro is in main, then in fix-offending-plot the
\is swapped for/in the offending plot, then in remove-offending-plot the offending plot is commented out ofdvc.yaml. Each branch has the output ofdvc plots showcommitted as well.
Closing in favor of https://github.com/iterative/dvc/issues/8689