omegaconf
String with hexadecimal character \x85 does not roundtrip via yaml serialization
Hello! 😄
This probably isn't a big deal, but this popped up while I was writing some tests for hydra-zen.
Describe the bug
UTF-8 strings containing the character \x85 do not roundtrip via yaml serialization.
To Reproduce
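from dataclasses import dataclass
from omegaconf import OmegaConf

@dataclass
class A:
    x: str = '\x85'

# Write the structured config to YAML, then read it back
OmegaConf.save(A, "tmp.yaml")
loaded = OmegaConf.load("tmp.yaml")

>>> loaded
{'x': ' '}
>>> loaded["x"] == A.x
False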
Expected behavior
I would expect OmegaConf to either reject such strings as "invalid", or correctly roundtrip the value.
Additional context
- [x] OmegaConf version: 2.1.1
- [x] Python version: Python 3.8
- [x] Operating system: Windows 10
- [x] Please provide a minimal repro
Hi @rsokl, thanks for filing this!
I can reproduce the round-tripping issue as A.x != OmegaConf.create(OmegaConf.to_yaml(A))["x"].
In case it's related: independently, I noticed that macOS has a bug with this character, where isspace(0x85) returns true.
This particular character is the Unicode next line character (NEL, U+0085):
>>> "\u0085" == "\x85"
True
As an aside, this character exists in Unicode to support mapping to and from EBCDIC, which had a separate character for this (in addition to CR and LF). Since round tripping is a key feature of Unicode, this character was added.
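To make the character's properties concrete, here is a small standard-library sketch. It does not touch YAML at all; it only shows how Python itself classifies U+0085. The choice of cp037 as the EBCDIC codec is an assumption made for illustration (Python ships several EBCDIC code pages).

```python
import unicodedata

NEL = "\x85"  # U+0085, the Unicode "next line" control character

# Both escapes name the same code point.
same_code_point = NEL == "\u0085"

# General category of the character ("Cc" means control character).
category = unicodedata.category(NEL)

# Python's str methods treat it as whitespace and as a line boundary.
is_space = NEL.isspace()
pieces = ("a" + NEL + "b").splitlines()

# In the cp037 EBCDIC codec, CR, LF and NEL each get their own byte,
# and each byte decodes back to the character it came from.
ebcdic = {name: ch.encode("cp037") for name, ch in [("CR", "\r"), ("LF", "\n"), ("NEL", NEL)]}
ebcdic_roundtrip = all(
    raw.decode("cp037") == ch
    for raw, ch in zip(ebcdic.values(), ["\r", "\n", NEL])
)

print(same_code_point, category, is_space, pieces, ebcdic, ebcdic_roundtrip)
```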
Now PyYAML treats any such character the same and will map it to \n, for example "line separator" (\u2028) and "paragraph separator" (\u2029).
One way to avoid this mapping is to set allow_unicode=False in OmegaConf.to_yaml(), which does preserve this character across the round trip, but would have other side effects.
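The effect of that flag can be seen with PyYAML on its own. This is a minimal sketch that calls yaml.safe_dump and yaml.safe_load directly rather than going through OmegaConf, so it only illustrates the PyYAML behaviour discussed here; the variable names are made up for the example.

```python
import yaml  # PyYAML

value = {"x": "\x85"}

# With allow_unicode=True the character is written into the YAML text as-is,
# where the loader is free to treat it as a line break.
raw_text = yaml.safe_dump(value, allow_unicode=True)
raw_roundtrip = yaml.safe_load(raw_text) == value

# With allow_unicode=False the character is written as an escape sequence
# inside a double-quoted scalar instead.
escaped_text = yaml.safe_dump(value, allow_unicode=False)
escaped_roundtrip = yaml.safe_load(escaped_text) == value

print(repr(raw_text), raw_roundtrip)
print(repr(escaped_text), escaped_roundtrip)
```

The side effect is that every non-ASCII character in the output is escaped as well, which makes configs containing such text harder to read.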
Given PyYAML already supports round tripping these, I find it a bit surprising that it maps the new lines rather than escaping them by default, though I suppose new line mapping is such a common cross-platform issue that this is just wound up in that. For details on YAML multiline handling see https://yaml-multiline.info/
Also note that if using bytes rather than str, the round tripping is fine, as the data is just base64 encoded.
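Why base64 sidesteps the problem can be sketched with the standard library alone. This does not call PyYAML; it only shows that the encoded form of the same value is plain ASCII, so there is no line-break character left for a YAML loader to normalize.

```python
import base64

text = "\x85"
payload = text.encode("utf-8")  # the same value, held as bytes

# Base64 maps arbitrary bytes onto a small ASCII alphabet.
encoded = base64.b64encode(payload).decode("ascii")

# Decoding reverses the mapping exactly, byte for byte.
decoded = base64.b64decode(encoded)
restored = decoded.decode("utf-8")

print(encoded, restored == text)
```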