ISO 8859-1 filenames break functionality such as dvc exp show
Bug Report
ISO 8859-1 filenames break functionality such as dvc exp show
Description
A file whose name contains an ISO 8859-1 character (in my case 'ç') was committed to the git repository. The repository was pushed to a remote server and then retrieved via a pull. After that, dvc exp show does not work properly. The filename causes a failure in scmrepo/git/backend/pygit2.py at line 57 (scmrepo==0.0.25, pygit2==1.9.2).
Reproduce
#!/bin/bash
set -exu
wsp=test_wspace
rep=test_repo
rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)
mkdir $rep && pushd $rep
git init
dvc init
echo "m: 1" > params.yaml
dvc run -d params.yaml -o output -n train cp params.yaml output
#git add -A
echo "breaking file" >> 'Fran'$'\347''ais.txt'
git add -A
git commit -am "initial"
dvc exp show
echo "m: 2" > params.yaml
dvc exp run
dvc exp show
Expected
The first dvc exp show is already broken. After quitting it with 'q', the second dvc exp show fails with the following error message:
ERROR: unexpected error - 'data'
Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
Environment information
The bug was reproduced in a conda environment, where dvc 2.12.1 was installed with pip.
Output of dvc doctor:
$ dvc doctor
DVC version: 2.12.1 (pip)
---------------------------------
Platform: Python 3.10.4 on Linux-5.4.0-121-generic-x86_64-with-glibc2.31
Supports:
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda3
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda3
Repo: dvc, git
Additional Information (if any):
@ldelphinpoulat, can you please share the verbose output from dvc exp show -v? It has tracebacks and more logging information.
Here is the log: dvc_exp_show.log
@skshetry a workaround is to rename the file 'Fran'$'\347''ais.txt' to 'Francais.txt'. However, git itself handles the original name correctly.
The issue is that exp show output is always UTF-8, but git filenames are encoding-agnostic (and use the system encoding). We should be handling git filenames with os.fsdecode() in the pygit2 scmrepo backend before passing them back to the caller (dvc).
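To illustrate the difference, here is a minimal sketch in plain Python. It is not the scmrepo code: the helper names decode_strict and decode_fs are hypothetical, and it assumes a POSIX system, where os.fsdecode() falls back to the surrogateescape error handler for bytes that are not valid in the filesystem encoding.

```python
import os

# Raw bytes of the file name from the reproduction script:
# "Français.txt" encoded as ISO 8859-1 (0xE7 is octal 347, i.e. 'ç').
raw_name = b"Fran\xe7ais.txt"


def decode_strict(name: bytes) -> str:
    """Hypothetical helper: decode assuming the name is valid UTF-8."""
    return name.decode("utf-8")


def decode_fs(name: bytes) -> str:
    """Hypothetical helper: decode with the filesystem encoding."""
    return os.fsdecode(name)


# A strict UTF-8 decode rejects the lone 0xE7 byte.
try:
    decode_strict(raw_name)
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# os.fsdecode() does not raise on POSIX, and the original bytes can be
# recovered with os.fsencode().
decoded = decode_fs(raw_name)
round_trip = os.fsencode(decoded)

# ascii() is used because a surrogate-escaped name cannot be written
# directly to a UTF-8 stream.
print("strict UTF-8 decode succeeded:", strict_ok)
print("fsdecode result:", ascii(decoded))
print("round trip preserved bytes:", round_trip == raw_name)
```

The decoded string may contain a lone surrogate in place of the 'ç' when the filesystem encoding is UTF-8, so any code that later prints it as UTF-8 still needs an explicit error handler (for example errors="replace" or "backslashreplace").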