Support for non-ASCII characters in metrics files

Open sudoandros opened this issue 4 years ago • 3 comments

Hello. I recently ran into a problem where DVC treats my metrics file, which contains Russian words, as binary and refuses to use it. Here is the stack trace:

------------------------------------------------------------
Traceback (most recent call last):
  File "dvc\repo\reproduce.py", line 196, in _reproduce_stages
  File "dvc\repo\reproduce.py", line 39, in _reproduce_stage
  File "funcy\decorators.py", line 45, in wrapper
  File "dvc\stage\decorators.py", line 36, in rwlocked
  File "funcy\decorators.py", line 66, in __call__
  File "dvc\stage\__init__.py", line 427, in reproduce
  File "funcy\decorators.py", line 45, in wrapper
  File "dvc\stage\decorators.py", line 36, in rwlocked
  File "funcy\decorators.py", line 66, in __call__
  File "dvc\stage\__init__.py", line 546, in run
  File "dvc\stage\__init__.py", line 457, in save
  File "dvc\stage\__init__.py", line 477, in save_outs
  File "dvc\output.py", line 531, in save
  File "dvc\output.py", line 688, in verify_metric
dvc.exceptions.DvcException: binary file 'data\full_metrics.yaml' cannot be used as metrics.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc\main.py", line 55, in main
  File "dvc\command\base.py", line 45, in do_run
  File "dvc\command\repro.py", line 12, in run
  File "dvc\repo\__init__.py", line 49, in wrapper
  File "dvc\repo\scm_context.py", line 14, in run
  File "dvc\repo\reproduce.py", line 135, in reproduce
  File "dvc\repo\reproduce.py", line 213, in _reproduce_stages
dvc.exceptions.ReproductionError: failed to reproduce 'dvc.yaml'
------------------------------------------------------------

If we look at the line that raises the error, we can see that the istextfile function is the real cause: https://github.com/iterative/dvc/blob/14cc7481390b50ae7c921f84091041e5d8068ed0/dvc/istextfile.py#L27-L31 (thanks to @isidentical for helping out). It uses heuristics to decide whether a file is binary, and those heuristics assume that text files consist primarily of ASCII characters. I don't see why DVC shouldn't allow non-ASCII metrics files; as it stands, this is an annoying restriction that comes out of nowhere. For example, I work on classifying Russian finance documents, and I would be happy not to have to come up with an English translation or some other Latin equivalent for every document class. Otherwise, every look at the metrics turns into a game of "what is it really called?"
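To illustrate why an ASCII-ratio heuristic rejects Cyrillic text, here is a minimal sketch. It is not DVC's actual code: the function names, the 30% threshold, and the sample data are assumptions made for illustration. The first function counts bytes outside printable ASCII; the second is one possible alternative that accepts any block that decodes as UTF-8.

```python
import codecs

# Printable ASCII plus common whitespace control characters.
TEXT_CHARS = bytes(range(32, 127)) + b"\n\r\t\f\b"


def looks_like_text_ascii(block: bytes, threshold: float = 0.30) -> bool:
    """Hypothetical ASCII-ratio heuristic (threshold is an assumption)."""
    if not block:
        return True  # an empty file is treated as text
    if b"\x00" in block:
        return False  # NUL bytes strongly suggest binary data
    # Delete every "text" byte; whatever remains is counted as non-text.
    nontext = block.translate(None, TEXT_CHARS)
    return len(nontext) / len(block) <= threshold


def looks_like_text_utf8(block: bytes) -> bool:
    """Hypothetical alternative: accept anything that decodes as UTF-8."""
    if b"\x00" in block:
        return False
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character that is cut off
        # at the end of a fixed-size block.
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return False
    return True


# Made-up sample metrics content, for illustration only.
russian = "класс: договор\nточность: 0.93\n".encode("utf-8")
english = b"class: contract\naccuracy: 0.93\n"

print(looks_like_text_ascii(english), looks_like_text_ascii(russian))
print(looks_like_text_utf8(english), looks_like_text_utf8(russian))
```

Every Cyrillic letter takes two bytes in UTF-8, and both fall outside the ASCII range, so a file with mostly Russian keys quickly exceeds any small non-ASCII threshold even though it is perfectly valid text.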

Is it possible to allow non-ASCII metrics files in future versions of DVC?

sudoandros avatar Aug 17 '21 11:08 sudoandros

Same problem here. JSON files are typically encoded in UTF-8, which should be supported.

qutang avatar Jan 05 '22 03:01 qutang

@skshetry What do you think?

dberenbaum avatar Jan 05 '22 13:01 dberenbaum

Do we even need the istextfile validation for metrics at all? If users are explicitly listing something as a metrics stage output, it seems like we should just assume that it is a valid metric, and only error out if/when we fail to parse it during diff or UI commands.
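A rough sketch of that idea, assuming a JSON metrics file: skip the up-front text/binary check and report a problem only when the file is actually read. `load_metrics` and `MetricsParseError` are hypothetical names for illustration, not existing DVC APIs.

```python
import json
from pathlib import Path


class MetricsParseError(Exception):
    """Hypothetical error raised only when a metrics file is actually read."""


def load_metrics(path):
    """Parse a JSON metrics file on demand, assuming UTF-8 encoding."""
    try:
        text = Path(path).read_text(encoding="utf-8")
        return json.loads(text)
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Binary or malformed files surface here, at parse time,
        # instead of being rejected when the stage output is saved.
        raise MetricsParseError(
            f"could not parse metrics file '{path}'"
        ) from exc
```

With this approach a UTF-8 file containing Russian keys loads normally, while a truly binary file fails with a parse error at the point where a diff or UI command needs its contents.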

pmrowla avatar Jan 06 '22 06:01 pmrowla