nix icon indicating copy to clipboard operation
nix copied to clipboard

Add support for parsing YAML

Open NaN-git opened this issue 1 year ago • 36 comments

Why builtins.fromYAML?

YAML is widely used amongst other package managers and deployment tools. If we want better compatibility to these ecosystems, the ability to parse yaml efficiently in nix is useful, as it is for TOML and JSON, which we already support. Further discussion can be found, i.a., in #4910.

Description

YAML 1.2 is a complex standard and nix has a limited set of data types. Thus only a subset of YAML can be represented in nix. For example attribute sets require String keys, i.e. attribute sets can represent YAML maps with String keys only, and nix has no data types for binary data or dates. If builtins.fromYAML encounters YAML with incompatible data types, then it fails similar to builtins.fromTOML.

First, the implementation uses rapidyaml to parse the YAML string and afterwards the nix objects are created while traversing the YAML tree similar to builtins.fromTOML. Tags are mostly ignored and affect only scalars. Custom tags are always ignored and I don't see how custom tags could be handled by nix.

Why rapidyaml?

As part of nix robustness and safety of the implementation of builtins.fromYAML are important.

Some reasons why rapidyaml was chosen:

  • C++11 library with limited usage of dynamic memory allocations
  • tested with a wide range of compilers and on different platforms
  • extensive set of unit tests
  • heavy usage of assertions within the library
  • available as single header file, i.e. easy to bootstrap, so that no build tool could link another version of rapidyaml
  • good speed in comparison to yaml-cpp and other YAML libraries

limitations of rapidyaml

rapidyaml has a few limitations. Most of these limitations are not really relevant for the nix use case because of the limitations of nix.

Some comments with respect to the limitations:

  • The macro RYML_WITH_TAB_TOKENS is defined for builtins.fromYAML.
  • Anchor names terminated with a colon are not fully supported by rapidyaml. { &anchor: key: val, anchor: *anchor: } is parsed correctly, but this test case fails. The test case is ignored in the builtins.fromYAML-tests because it is an edge case and it's no valid YAML 1.3.
  • ~Otherwise the only issue that I found is that a string/block containing only tab-, space- and new line-characters might be parsed incorrectly as empty string (test case). I don't think that this limits the real world usage.~ UPDATE: This is fixed in the newest rapidyaml release.
  • ADDED: Rapidyaml has some issues parsing flow mappings entries with missing :-token and missing value, e.g. this example cannot be parsed.

Also rapidyaml parses some invalid YAML successfully, but that is actually helpful. ~Otherwise the valid JSON {"a":"b"}, which is emitted by builtins.toJSON, could not be parsed by builtins.fromYAML because it is no valid YAML due to the missing separation white space after the :-token.~ UPDATE: Actually this is allowed in flow mappings.

Tested platforms

  • x86_64-linux
  • aarch64-linux

NaN-git avatar Nov 23 '22 23:11 NaN-git

Like mentioned in https://github.com/NixOS/nix/issues/4910#issuecomment-1214148665, we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

DavHau avatar Nov 24 '22 11:11 DavHau

Like mentioned in #4910 (comment), we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

If this should be implemented, then a decision about the versioning scheme is needed and how the versioning should be done, i.e. whether the builtins should have different names or whether a function with additional version argument should be used instead. Having explicitly different function names would be advantageous with respect to reproducibility.

The versioning should apply to different code versions, not only to different YAML standards, so that bug compatibility can be retained. The YAML standard is ugly and we need to parse real world data, i.e. the "YAML" might be invalid, so that some tradeoffs are required...

I'm wondering how updates of builtins.fromJSON and builtins.fromTOML are handled. builtins.fromJSON uses nlohmann-json, but if nix isn't built by nix, then the linked library version isn't fixed. Did this cause any problems yet?

NaN-git avatar Nov 24 '22 19:11 NaN-git

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/announcing-stacklock2nix-easily-build-a-haskell-project-that-contains-a-stack-yaml-lock-file/23563/11

nixos-discourse avatar Dec 03 '22 08:12 nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-convert-yaml-nix-object/23755/2

nixos-discourse avatar Dec 05 '22 22:12 nixos-discourse

Like mentioned in #4910 (comment), we might want to consider to encode the yaml version into the function somehow. Instead of fromYAML it could be named fromYAML1_2. In case the YAML spec ever gets updated in an incompatible way, we could make a second function, like for example fromYAML2_0 and won't have to update/break the existing builtin.

I discussed this once more with @NaN-git and it doesn't seem like a good idea to encode the yaml spec version in the function name. rapidyaml doesn't distinguish between yaml versions. It's hard to forsee the reason for which we would have to introduce a new fromYaml in the future, if ever.

Whatever naming schema we would come up right now, there is a high chance that it won't be meaningful. Therefore I think the builtin should just be introduced as a plain fromYaml. If it ever needs to be updated then we can still make a fromYaml_2 or whatever makes sense in that situation.

DavHau avatar Dec 21 '22 04:12 DavHau

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-02-nix-team-meeting-minutes-20/24403/1

nixos-discourse avatar Jan 03 '23 12:01 nixos-discourse

Discussed in Nix team meeting on 2023-01-09:

  • general sentiment is in favor of merging this
  • don't vendor the dependency, do it like with nlohmann
  • we will wait for @edolstra's input for the final decision
Complete discussion
  • @roberth: vendoring the library is actually nice, because it doesn't change underneath
    • @thufschmitt: should follow the decision on the nlohmann JSON library
      • that was a special case though, because it was very uncommon yet easy to vendor
    • @roberth: the problem with vendoring is that it requires re-exporting the symbols such that names don't collide
      • John: in short, vendoring is only a good fit for private deps, and so so for a public dep one needs to do these things.
    • @thufschmitt: not vendoring makes packaging for third parties more painful. e.g. there is no rapidyaml on Debian
  • @roberth: in favor of the addition after being against it. In general adding more (external) code does not help with the guarantee for evaluation reproducibility 10 years down the line
  • @Ericson2314: see the practical reasons. it would be better for IFD to work efficiently so one can do the conversion independent of Nix proper, but I also don't see that happening in the near term.
    • @roberth: the main deficiency of IFD is system specificity
  • @thufschmitt: in favor of the feature. we should provide strong control of inputs to derivations. IFD breaks the separation between evaluation and build, and therefore breaks reproducibility guarantees (one can create any inputs at build time, which may be non-deterministic)
  • @fricklerhandwerk: generally agree with @Ericson2314, we should be really careful about growing the API surface. not against the feature though, because it's obviously useful and adds symmetry to the API. we should just make sure we know how to deal with such additions in the future, and have a clear stance to present to contributors.
  • agreement:
    • don't vendor the dependency, do it like with nlohmann
  • @tomberek: agree in principle, it's a natural parallel to the to/fromJSON builtins.
  • we will wait for @edolstra's input

fricklerhandwerk avatar Jan 10 '23 12:01 fricklerhandwerk

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-09-nix-team-meeting-minutes-22/24577/1

nixos-discourse avatar Jan 10 '23 12:01 nixos-discourse

@edolstra any reason not to merge this?

domenkozar avatar Jan 20 '23 08:01 domenkozar

Decisions

  • [x] agreement on not vendoring
  • [x] agreement on making this an experimental feature

Issues to clarify

  • [ ] Nix's compile time impact?
  • [ ] closure size?
  • [ ] static link?
  • [ ] show various features and how they map to Nix values
    • [ ] add tests for the features
  • [ ] testing framework improvements
    • [ ] add explanation of testing
    • [ ] convert bash to google-test approach?

Versioning

If the underlying library returns different results over time, this impacts reproducibility

tomberek avatar Jan 23 '23 13:01 tomberek

Issues to clarify

[...] * [ ] show various features and how they map to Nix values

  * [ ]  add tests for the features

* [ ]  testing framework improvements
  
  * [ ]  add explanation of testing
  * [ ]  convert bash to google-test approach?

Regarding the tests:

  • The tests from yaml-test-suite are converted to gtest tests by a bash script. The resulting test suite is part of the PR.
  • The tests contain a JSON representation ("json") of the YAML ("yaml") test cases. Thus the tests check whether builtins.fromJSON json == builtins.fromYAML json == builtins.fromYAML yaml holds true by serializing the nix expressions. Several hundred test cases are executed and compatibility between builtins.fromJSON and builtins.fromYAML is verified at least for these cases. If yaml should be invalid then the test case is skipped because rapidyaml will often parse this without error.
  • builtins.fromYAML converts a YAML file to the following nix types: maps, lists, strings, integers (int64) and floats (double). Conversion to custom data types or interpreting data as nix expression would be very dangerous. Alias nodes (references) are automatically resolved, e.g.
builtins.fromYAML ''
---
values:
  - &value someValue
--- 
# some YAML template
template: *value
''

evaluates to (I'm not very happy with the handling of multiple documents)

[
  {
    values = [ "someValue" ];
  }
  {
    template = "someValue";
  }
]

Build questions

rapidyaml uses cmake as build system. I don't think that this a good choice for nix because bootstrapping of nix has to be easy. The only dependency of rapidyaml is c4core from the same author. Thus it should be easy to statically link the library. At the moment the library is included as single header file similar to builtins.fromTOML. What is the best way to include this library into nix?

Even with rapidyaml as single header file the build time of builtins.fromYAML seems to be faster than the build time of builtins.fromTOML and the size of the object file is much smaller than the size of the builtins.fromTOML object file.

NaN-git avatar Jan 31 '23 01:01 NaN-git

What is the best way to include this library into nix?

I would package it and its dependency in Nixpkgs. I am happy to help with that part if you want.


I personally am not to concerned about the bootstrapping because we could always make YAML a (compile time of Nix) optional feature.

Ericson2314 avatar Jan 31 '23 14:01 Ericson2314

I would package it and its dependency in Nixpkgs. I am happy to help with that part if you want.

I personally am not to concerned about the bootstrapping because we could always make YAML a (compile time of Nix) optional feature.

Packaging rapidyaml is rather easy:

{ cmake
, fetchFromGitHub
, git
, stdenv
, enableStatic ? true
}:
stdenv.mkDerivation rec {
  pname = "rapidyaml";
  version = "0.5.0";

  src = fetchFromGitHub {
    owner = "biojppm";
    repo = pname;
    rev = "v${version}";
    fetchSubmodules = true;
    hash = "sha256-1/P6Szgng94UU8cPFAtOKMS+EmiwfW/IJl2UTolDU5s=";
  };

  nativeBuildInputs = [ cmake git ];
  cmakeFlags = [
    "-DRYML_WITH_TAB_TOKENS=ON"
    "-DBUILD_SHARED_LIBS=${if enableStatic then "OFF" else "ON"}"
  ];
}

I prefer to statically link this because the static library is rather small (~500 KB).

In my opinion not only bootstrapping of nix has to be easy, but compiling it for other distributions should be easy, too. I don't know how to add rapidyaml as optional dependency to nix, so that builds without nix aren't messed up.

NaN-git avatar Feb 01 '23 22:02 NaN-git

I removed rapidyaml as single header file and made it an optional library instead. If the library cannot be found then builtins.fromYAML is not available. Rapidyaml is not part of nixpkgs yet, so that I added the package to flake.nix and it will be linked statically. With builtins.fromYAML the size of libnixexpr increases by 800KB and in my tests on an AMD 5800U the build time of nix increased by 4-5 seconds to ~176 seconds when compiling with 16 threads.

Regarding the testing framework: I need more information what's actually required/wanted. I commented some point in https://github.com/NixOS/nix/pull/7340#issuecomment-1409605494. Of course it would be possible to preprocess the tests differently so that less logic is needed in the testing code and it would be more obvious which tests are executed actually.

NaN-git avatar Feb 11 '23 19:02 NaN-git

If the library cannot be found then builtins.fromYAML is not available.

@edolstra what do you think about an optional built-in? Doesn't sound right to me.

fricklerhandwerk avatar Feb 13 '23 10:02 fricklerhandwerk

If the library cannot be found then builtins.fromYAML is not available.

@edolstra what do you think about an optional built-in? Doesn't sound right to me.

Supporting a build without yaml library might be useful for some build-from-source bootstrapping process, but in this case it must be an explicit choice to remove the yaml feature and create an incomplete Nix. It must not happen by accident.

roberth avatar Feb 13 '23 11:02 roberth

Whether that's even a worthwhile effort, I don't know. Personally I think bootstrapping Nix from source is not an important use case, but packaging by other distros is. Those distros must not package a Nix without yaml though! I'd rather require rapidyaml unconditionally.

roberth avatar Feb 13 '23 11:02 roberth

Whether that's even a worthwhile effort, I don't know. Personally I think bootstrapping Nix from source is not an important use case, but packaging by other distros is. Those distros must not package a Nix without yaml though! I'd rather require rapidyaml unconditionally.

Ok, I see two possible solutions:

  1. Always link against libryml, but then it won't be easy to compile it without rapidyaml. This would enable the removal of some lines of code.
  2. Require rapidyaml unless --disable-ryml is specified, i.e. a few lines of configure.ac have to be changed.

NaN-git avatar Feb 13 '23 13:02 NaN-git

@roberth I implemented option no. 2. Now configure fails if rapidyaml is not available unless it is disabled explicitly.

NaN-git avatar Feb 14 '23 14:02 NaN-git

Lazy evaluation?

I have another thought: Would it make sense to make the implementation lazy like in #7249? I think that it won't have a large impact on the runtime of builtins.fromYAML because in my large test cases most of the runtime was actually spent in parsing the YAML. Of course the runtime of the YAML parser depends a lot on the structure of the parsed YAML and its format because as example, block scalars, double-quoted and single-quoted scalars use different code paths. In case of large, deeply nested YAML (or JSON) lazy evaluation might speed up nix in general, if the YAML tree is only evaluated partially because many strings might not be created and the GC would track less objects.

NaN-git avatar Feb 15 '23 18:02 NaN-git

It is nice it is not vendored, but we should still have the dependency in Nixpkgs. Please make a Nixpkgs PR for that, and I will ensure it is merged and backported to 22.11 so it can be used in this PR right away.

enableStatic can go away, and instead one can pass it a static stdenv which will have the same effect.

The optional dependency alleviates the bootstrapping concerns.

Ericson2314 avatar Feb 16 '23 19:02 Ericson2314

Reproducibility...

This is a little concerning. The rapidyaml documentation says:

Also, ryml tends to be on the permissive side where the YAML standard dictates there should be an error; in many of these cases, ryml will tolerate the input. This may be good or bad, but in any case is being improved on (meaning ryml will grow progressively less tolerant of YAML errors in the coming releases). So we strongly suggest to stay away from those dark corners of YAML which are generally a source of problems, which is a good practice anyway.

What that means for us is that expressions that happen to depend on flawed yaml will become impossible to reproduce, and we have no way of knowing that in advance; not without invoking a more precise implementation first. Perhaps we should use a more compliant library?

roberth avatar Feb 16 '23 20:02 roberth

Reproducibility...

This is a little concerning. The rapidyaml documentation says:

Also, ryml tends to be on the permissive side where the YAML standard dictates there should be an error; in many of these cases, ryml will tolerate the input. This may be good or bad, but in any case is being improved on (meaning ryml will grow progressively less tolerant of YAML errors in the coming releases). So we strongly suggest to stay away from those dark corners of YAML which are generally a source of problems, which is a good practice anyway.

What that means for us is that expressions that happen to depend on flawed yaml will become impossible to reproduce, and we have no way of knowing that in advance; not without invoking a more precise implementation first. Perhaps we should use a more compliant library?

This point was already discussed previously and one of the reasons why I vendored the library initially. In my opinion rapidyaml is one of the most compliant YAML libraries when parsing valid YAML. You can have a look at YAML playground and see that for some edge cases basically each library returns a different result. YAML is a complex standard and it's neither the best standard. Therefore tradeoffs have to be made. Additionally only a subset of YAML can be mapped easily to nix.

I don't like to make promises regarding non-valid YAML, which implies undefined behaviour and undefined behaviour should be avoided (I don't want to rant too much about YAML). Nix is written in C++, so there are big chances that nix relies on UB and it is therefore not reproducible without restrictions. In #7249 simdjson was considered as JSON library. It has another issue of reproducibility because it introduces different code paths for different CPU architectures, i.e. willingness to trade reproducibility against performance seems to exist. One of the major use cases of builtins.fromYAML is to read (possible large) lock files, i.e. as exchange format and only a subset of (hopefully valid) YAML is required for this purpose. Using YAML just because it's possible is probably a bad idea for nix. The performance of the library cannot be neglected. If it would be 10x or 100x slower than rapidyaml, then its usefulness would be limited. One of my most important criteria was the security and safety of the library because it is part of nix and it might be exposed to untrusted input. Also it should not introduce more memory leaks into nix because the inputs might be huge.

NaN-git avatar Feb 17 '23 01:02 NaN-git

I don't like to make promises regarding non-valid YAML,

So then we need a very strict parser. Accepting a file today is a promise to keep accepting that file tomorrow.

Nix is written in C++, so there are big chances that nix relies on UB

We're doing a decent job at avoiding UB, and I don't think UB is a contributor to evaluation reproducibility problems in practice. The existence of UB is not an excuse to choose to undermine reproducibility.

In #7249 simdjson was considered as JSON library. It has another issue of reproducibility because it introduces different code paths for different CPU architectures, i.e. willingness to trade reproducibility against performance seems to exist.

This PR was rejected by the team because of the error messages and the extra maintenance. Why would it cause extra maintenance? Partly because it's more code, and partly because we need a high bar for correctness, and that's because of reproducibility.

One of the major use cases of builtins.fromYAML is to read (possible large) lock files, i.e. as exchange format and only a subset of (hopefully valid) YAML is required for this purpose.

I'd be ok with parsing some useful subset of YAML, as long as we can guarantee to keep parsing at least that subset in the same way in all future versions of Nix. What we can not do is have two subsets: one that we intend to support and one that we actually support. Now this seems like a familiar software engineering problem, but our forward compatibility needs flip the importance of the differences around. Most applications would only care about intended behavior not being supported, aka bugs. In our case however, the opposite difference is equally bad or worse - unintended behavior being supported. That's because this breaks our promise that non-reproducible evaluations are rejected. That's not just a bug, but an unintended increase in the surface area that we have to support. It may even lead to conflicting requirements about how certain cases should be handled.

Using YAML just because it's possible is probably a bad idea for nix. The performance of the library cannot be neglected. If it would be 10x or 100x slower than rapidyaml, then its usefulness would be limited. One of my most important criteria was the security and safety of the library because it is part of nix and it might be exposed to untrusted input. Also it should not introduce more memory leaks into nix because the inputs might be huge.

These are of course also valid requirements.

Regarding performance and lock files, we have a promising path to an alternative solution using RFC 92 dynamic derivations. That allows the lock parsing and fixed-output derivation generation to happen in a derivation, so that the lock parsing, processing and instantiation work is cached. Instantiating such a derivation is certainly faster than parsing and generating many derivations at evaluation time, and it avoids the reproducibility issues by having the expression pin the YAML implementation that's used.

roberth avatar Feb 17 '23 15:02 roberth

There is StrictYAML that parses and validates a restricted subset of the YAML specification. The reference implementation is in python, but there is also one in rust. I could not find a C++ implementation.

DavHau avatar Mar 06 '23 11:03 DavHau

StrictYAML parses booleans into strings. While it makes sense to require a schema for booleans support when you're writing a new application, I think this may be a deal breaker for a Nix builtin. It is not a meaning-preserving subset of YAML and not a superset of JSON either.

roberth avatar Mar 06 '23 11:03 roberth

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nix-ify-cloudformation/30671/4

nixos-discourse avatar Jul 21 '23 05:07 nixos-discourse

Generally looks good to me (i.e. to merge this as an experimental feature).

edolstra avatar Aug 07 '23 12:08 edolstra

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-08-07-nix-team-meeting-78/31488/1

nixos-discourse avatar Aug 09 '23 05:08 nixos-discourse

Bump! I'd really like to see some movement on this.

adisbladis avatar Mar 04 '24 11:03 adisbladis