
String with hexadecimal character \x85 does not roundtrip via yaml serialization

Open rsokl opened this issue 2 years ago • 2 comments

Hello! 😄

This probably isn't a big deal, but this popped up while I was writing some tests for hydra-zen.

Describe the bug

UTF-8 strings containing the character \x85 do not roundtrip via yaml serialization.

To Reproduce

from dataclasses import dataclass
from omegaconf import OmegaConf

@dataclass
class A:
    x: str = '\x85'

OmegaConf.save(A, "tmp.yaml")
loaded = OmegaConf.load("tmp.yaml")
>>> loaded
{'x': ' '}
>>> loaded["x"] == A.x
False

Expected behavior

I would expect OmegaConf to either reject such strings as "invalid", or correctly roundtrip the value.

Additional context

  • [x] OmegaConf version: 2.1.1
  • [x] Python version: Python 3.8
  • [x] Operating system: Windows 10
  • [x] Please provide a minimal repro

rsokl, Nov 23 '21 00:11

Hi @rsokl, thanks for filing this! I can reproduce the round-tripping issue as: A.x != OmegaConf.create(OmegaConf.to_yaml(A))["x"].

Jasha10, Nov 23 '21 13:11

In case it's related: independently, I noticed that macOS has a bug with this character, where isspace(0x85) returns true.

pixelb, Apr 07 '22 17:04

This particular character is the Unicode "next line" (NEL) control character:

>>> "\u0085" == "\x85"
True

As an aside, this character exists in Unicode to support mapping to and from EBCDIC, which has a separate character for this (in addition to CR and LF). Since round-tripping through legacy encodings is a key feature of Unicode, this character was added.
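To illustrate the EBCDIC point, here is a small sketch using only the Python standard library. The choice of cp037 is my assumption for illustration; it is just one of the EBCDIC code pages that ship with Python.

```python
# Illustration only: cp037 is one EBCDIC code page bundled with Python.
NEL = "\x85"  # Unicode NEXT LINE

# Encode NEL to EBCDIC and decode it back again.
ebcdic_bytes = NEL.encode("cp037")
restored = ebcdic_bytes.decode("cp037")
print(ebcdic_bytes.hex(), restored == NEL)

# CR, LF and NEL each get their own EBCDIC code point.
distinct = {ch.encode("cp037") for ch in "\r\n\x85"}
print(len(distinct))
```

The round trip only works because Unicode reserves a code point for the EBCDIC next-line character instead of folding it into LF.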

Now PyYAML treats any such character the same way and will map it to \n; this also applies to, for example, "line separator" (\u2028) and "paragraph separator" (\u2029). One way to avoid this mapping is to set allow_unicode=False in OmegaConf.to_yaml(), which does preserve this particular round trip, but would have other side effects.
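A rough sketch of the difference, calling PyYAML directly rather than going through OmegaConf. It assumes the behavior described above comes down to the allow_unicode flag passed to yaml.dump; the variable names are only illustrative.

```python
import yaml

value = "\x85"  # Unicode NEXT LINE (NEL)

# allow_unicode=True writes NEL verbatim; the loader then reads it as a line break.
literal_yaml = yaml.dump({"x": value}, allow_unicode=True)
literal_back = yaml.safe_load(literal_yaml)["x"]

# allow_unicode=False escapes NEL inside a double-quoted scalar instead.
escaped_yaml = yaml.dump({"x": value}, allow_unicode=False)
escaped_back = yaml.safe_load(escaped_yaml)["x"]

print(repr(literal_yaml), repr(literal_back), literal_back == value)
print(repr(escaped_yaml), repr(escaped_back), escaped_back == value)

# One side effect: all other non-ASCII text gets escaped as well.
accented_yaml = yaml.dump({"x": "é"}, allow_unicode=False)
print(repr(accented_yaml))
```

The last print hints at the trade-off: with allow_unicode=False the value still loads back correctly, but the YAML file no longer shows accented or non-Latin text in readable form.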

Given that PyYAML already supports round-tripping these characters, I find it a bit surprising that it maps the newlines rather than escaping them by default. I suppose newline mapping is such a common cross-platform issue that this case is just wound up in it. For details on YAML multiline handling, see https://yaml-multiline.info/

Also note that if you use bytes rather than str, the round trip is fine, as the data is just base64 encoded.
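A minimal sketch of the bytes case, again using PyYAML directly (I have not checked how OmegaConf itself handles bytes values, so treat that part as out of scope here):

```python
import yaml

payload = b"\x85"

# PyYAML represents bytes as a !!binary scalar holding base64 text.
dumped = yaml.safe_dump({"x": payload})
restored = yaml.safe_load(dumped)["x"]

print(dumped)
print(restored == payload)
```

Because the YAML text contains only base64 characters, the loader never sees a character it could interpret as a line break.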

pixelb, Aug 18 '22 16:08