Migrate from the .cabal format to a widely supported format
In the wake of the exact-printer initiative, I proposed another approach: why not say goodbye to the .cabal file format and switch to something that is widely supported?
There are a few alternatives; the most important attribute is that they are widely supported in industry.
- JSON
- YAML
- TOML
Note that all of these are (mostly) isomorphic to JSON (scalars, lists, dicts), which is important for easy translation between them (e.g. for config generation purposes).
What would this give the Haskell ecosystem?
- Editor support: every modern editor like vscode has a way of assigning a JSON schema to a file, which gives completion and inline documentation for free everywhere
- Also, syntax highlighting and auto-formatting come for free
- Cabal doesn’t need to implement its own parser. If JSON is chosen, the parser is even context-free.
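For illustration only (the schema and field names below are invented, not an actual proposal), a hypothetical cabal.json could be consumed with an off-the-shelf JSON library such as aeson rather than a bespoke parser:

{-# LANGUAGE DeriveGeneric #-}
module ReadCabalJson where

-- Hypothetical sketch: all field names here are made up for illustration.
import Data.Aeson (FromJSON, eitherDecodeFileStrict)
import GHC.Generics (Generic)

data Package = Package
  { name     :: String
  , version  :: String
  , synopsis :: Maybe String
  } deriving (Show, Generic)

instance FromJSON Package

main :: IO ()
main = do
  result <- eitherDecodeFileStrict "cabal.json"
  case result of
    Left err  -> putStrLn ("parse error: " ++ err)
    Right pkg -> print (pkg :: Package)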
What would it give to users?
- Instant familiarity with the format: you don’t touch cabal files all too often as a user, so you don’t want to learn yet another syntax
- Templating cabal config with standard tooling (e.g. jq, yj), which is important e.g. in a monorepo context
- Inline documentation without setup
What are others doing?
Most modern package managers that don't go the full Turing-complete configuration route (as e.g. Scala's sbt and Erlang's tooling do) usually converge their config on a widely supported syntax.
Examples:
- npm, yarn (package.json, package-lock.json)
- Stack (stack.yaml, stack.yaml.lock)
- hpack (package.yaml)
- cargo (Cargo.toml)
- Elm (elm-package.json)
- Maven (pom.xml)
- poetry (pyproject.toml, poetry.lock)
Counterexamples:
- go (go.mod), though flat shasums and go packages have no configuration file
- pip (requirements.txt), though see poetry above
- sbt, hex: both use their Turing-complete parent languages
- leiningen (project.clj); clojure is a lisp, and sexps are already a data format
I don't expect cabal would drop support for the cabal file format very soon; rather, it would start out by generating a .cabal file from the .json/.toml/.yaml for consumption by older versions of cabal. Then, after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
Note that some people have mentioned dhall as a possible alternative, but using it would destroy most benefits, namely:
- Familiarity
- Editor support
- Widely available tooling for ops integration
- Simple parser
- Isomorphic to JSON
However, I expect that there would be a dhall library for generating cabal.json files, which could help integrate dhall-based (dev)ops setups with Haskell packages.
Is this cabal.json the plan.json from the cabal docs, or is it a .cabal in JSON format?
plan.json (JSON) A JSON serialization of the computed install plan intended for integrating cabal with external tooling. The cabal-plan package provides a library for parsing plan.json files into a Haskell data structure as well as an example tool showing possible applications.
Note that some people have mentioned dhall as a possible alternative
Also note that this already exists as dhall-to-cabal. While its goal was to generate .cabal files, that's not the only solution. A more integrated solution would be a cabal-install that can actually consume these files. I'm not saying this is the solution, just mentioning this as prior art. I'll step out of the conversation for now and let others share their thoughts, but if anyone wants to talk about Dhall in particular here, I do have thoughts.
Is this cabal.json the plan.json from the cabal docs, or is it a .cabal in JSON format?
It is the current .cabal file in a not-home-grown syntax
I proposed another approach: why not say good bye to the .cabal file format and switch to something that is widely supported.
When I think about it, I don't think these are mutually exclusive tickets: an exact printer gets us a reasonable source representation. This is good for a few reasons: we can derive translational tools from the same representation; we only need to change the parser and the printer. This frees up effort to migrate between formats!
There are a few alternatives; the most important attribute is that they are widely supported in industry.
Of the three suggestions, TOML is the most attractive. YAML has too much variable syntax, and JSON is aesthetically (and mechanically) displeasing for me to write as a human. TOML's grammar is minimal and admits a small, easy-to-generate-and-verify parser and lexer (note: toml-parser is a little outdated), which eliminates the need for us to write one from scratch. In fact, maintaining this would be a dream, and it would be a boon for tooling, since we can derive an ABNF for our specific flavor quite easily.
Then after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
👍
As an experiment, I took a library I wrote a few years ago and manually converted its Cabal file to TOML. I like it! The conversion can be totally systematic.
TOML was noticeably nicer to edit—Emacs has a simple built-in TOML mode and I didn't have to worry about indentation/formatting. (I've done Haskell for over a decade now and I'm still not consistent in how I format Cabal files!) Structured commands for navigating and editing the TOML file would be nice; I don't know if something like this already exists, but if it doesn't, adding it to Emacs would be easy. I wouldn't even think of trying something like that for Cabal's custom syntax.
I've used YAML a lot more than TOML in the past. Compared to YAML, I found needing to quote all my strings a bit annoying; on the other hand, TOML was much nicer to pick up and doesn't have weird corner cases to worry about. At work I recently ran into some weird YAML files that used anchors in a way that didn't work in Python—not something that would happen with TOML.
In my dream world we would use an S-expression based syntax (like sexplib) but I know that is not to be :(.
I immediately found that multiline strings were useful. Multiline strings and comments seem like the bare minimum for a human-oriented format; YAML and TOML support them, JSON doesn't.
It's a bit long, but here's the whole file:
cabal-version = "2.2"
[package]
name = "modular-arithmetic"
version = "2.0.0.1"
synopsis = "A type for integers modulo some constant."
description = """
A convenient type for working with integers modulo some constant. It saves you from manually wrapping numeric operations all over the place and prevents a range of simple mistakes. @Integer `Mod` 7@ is the type of integers (mod 7) backed by @Integer@.
We also have some cute syntax for these types like @ℤ/7@ for integers modulo 7.
"""
homepage = "https://github.com/TikhonJelvis/modular-arithmetic"
bug-reports = "https://github.com/TikhonJelvis/modular-arithmetic/issues"
license = "BSD-3-Clause"
license-file = "LICENSE"
author = "Tikhon Jelvis <[email protected]>"
maintainer = "Tikhon Jelvis <[email protected]>"
category = "Math"
build-type = "Simple"
extra-source-files = ["README.md", "CHANGELOG.md"]
[source-repository.head]
type = "git"
location = "git://github.com/TikhonJelvis/modular-arithmetic.git"
[library]
hs-source-dirs = ["src"]
ghc-options = ["-Wall"]
default-language = "Haskell2010"
exposed-modules = [
"Data.Modular"
]
build-depends = [
"base >4.9 && <5",
"typelits-witnesses <0.5"
]
[test-suite.examples]
hs-source-dirs = ["test-suite", "src"]
main-is = "DocTest.hs"
default-language = "Haskell2010"
type = "exitcode-stdio-1.0"
build-depends = [
"base >4.9 && <5",
"doctest >= 0.9",
"typelits-witnesses <0.5"
]
Another benefit: the format would be naturally extensible. Cabal could provide a section for plugin/tool/etc config, and tools would have no issues parsing values from there. I'm imagining something like this:
[plugin.liquid-haskell]
smt-solver = "z3mem"
My experience has been that providing "extension points" in formats is always useful. We can't figure out everything people want to do with their libraries ahead of time but we can make the format adaptable. If people need something Cabal doesn't support, they can add it while still keeping a single canonical file for library-specific settings.
For yaml there's also of course hpack. So anyone who wants to write cabal files in either yaml or dhall is welcome to do so. Note that we don't have exactprinters for either of those formats either, as far as I know. As I recall, due to the semantics of yaml, conditional clauses are rather unpleasant there, among a few other issues (and pretty-printing reorders things in unpleasant ways as well). (And also as emily notes, the yaml grammar is rather complicated as is).
Toml does seem promising, but I worry that its support for conditionals or other more complex syntax wouldn't be particularly great either. Translations of some more complex files might be worthwhile, to experiment with this.
Btw, note that cabal is already extensible, via "x-" fields.
In any case, I think the right next step is to get the cabal grammar pinned down and to have an exactprinter for at least the format we already have and is widespread.
This is good for a few reasons: we can derive translational tools from the same representation; we only need to change the parser and the printer. This frees up effort to migrate between formats!
I like this and am the maintainer of the translational tool hpack-dhall that can translate:
dhall -> cabal
dhall -> json
dhall -> yaml # the package.yaml format of hpack
dhall -> dhall # with imports resolved
I am a bit wary about each format being capable of doing a faithful representation. For instance, hpack's conditionals can break dhall's typing. This is the trouble @gbaz just mentioned.
I want this to work, but unfortunately I see some issues with all the proposed formats so far. I think TOML / YAML / JSON will never work, but Dhall, while it might not be a good fit today, can be made to work.
TOML / YAML / JSON
If Cabal files were merely data they would be fine, but unfortunately they are code, due to the conditionals and the parameters used in those conditions. This is true of Cargo packages too, and the solution there has been to stuff syntax into strings. Firstly, this largely defeats the point, as we still need application-specific parsers (and pretty-printers!) to handle those strings.
But more worryingly, I have reason to believe this has warped the design process of Cargo. See, for example, the back and forth between @djc and me in https://github.com/rust-lang/rfcs/pull/3143, where @djc agrees Cargo has backed itself into a corner, but objects to my further use of strings, or to encoding the information in a more structured but awkward and verbose way. I may have disagreed with @djc on which unpleasant choice to take, but I absolutely do agree that TOML forcing Cargo into this awkward situation is tragic, and no one should have to pick between those unpleasant options in the first place.
Cabal converting from the existing design avoids some of the distortion from TOML's perverse incentives, but I have no doubt the language of Cabal files will continue to evolve, and I don't want "TOML goggles" to mess things up going forward.
Dhall
Dhall is an actual programming language, and therefore squarely fixes the above issues. And to be clear, I would really like to endorse Dhall, as it is the right sort of way to make these things conform to a standard. There are two quibbles with Dhall as it currently exists, however, that I think should be addressed first:
- Imports / IO. As far as I know, Dhall always allows downloading arbitrary stuff, etc., as long as you give it a content address of some sort to make it pure (like fixed-output derivations). This is a fine design in general, but I worry about e.g. cabal2nix needing internet access to do its job, which (ironically, given the nix inspiration!) would be a regression and a major pain. If there is a way to force Dhall programs to be more self-contained, that would assuage my concern.
- Abstract interpretation. The Dhall model somewhat assumes that dhall will evaluate a closed term, spitting out a value for the consuming application to deal with. But this doesn't totally reflect how Cabal works. Today, we have automatic flags, which means we need to vary parameters based on results. With a few flags this can be brute forced (see the toy sketch below), but with more, abstract interpretation is much more efficient. Dhall's strong normalization is good for making such static analysis tractable, but we might also want additional restrictions to make it efficient and also easier for humans to understand.
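To make the "brute forced" remark concrete, here is a toy Haskell sketch (the names Flag, Assignment and satisfiable are hypothetical, not Cabal's actual API) of enumerating automatic-flag assignments and picking the first satisfiable one:

import Data.Maybe (listToMaybe)

type Flag = String
type Assignment = [(Flag, Bool)]

-- Enumerate all 2^n assignments of the automatic flags...
assignments :: [Flag] -> [Assignment]
assignments []     = [[]]
assignments (f:fs) =
  [ (f, b) : rest | b <- [True, False], rest <- assignments fs ]

-- ...and pick the first one whose resulting dependencies can be solved.
-- With many flags this blows up exponentially, which is why an abstract
-- interpretation style analysis becomes attractive.
chooseFlags :: (Assignment -> Bool) -> [Flag] -> Maybe Assignment
chooseFlags satisfiable flags =
  listToMaybe (filter satisfiable (assignments flags))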
Maybe that is overkill for manual flags, but @mpickering's and my presentation on what comes after CPP (https://icfp19.sigplan.org/details/hiw-2019-papers/9/Configuration-but-without-CPP, https://www.youtube.com/watch?v=YupkE1vsZ4o) has gotten me thinking about abstract interpretation more broadly. Eventually we want to tackle the goal of "type safe packaging" i.e. ensuring all valid version solving solutions will in fact compile. It's hard, but not tackling it is anathema to our values, and abstract interpretation of various sorts is key to making it work.
So yeah, in conclusion, I want Dhall to work, but it's important that we be able to restrict ourselves to a sort of "mini Dhall" so we can do this analysis, and we will have to integrate Dhall with Cabal fairly deeply. I'm not sure whether the current Dhall implementation supports such a restricted "mini Dhall", but that can easily be fixed.
Then after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
Remember that Hackage is an append-only repository. It would be utterly disappointing if a future version of Cabal were unable to build an old package just because it no longer parses its very own package format. So I don't think it would be wise to abandon the parser of Cabal files even after a very long grace period. And if we are to retain the parser and all its complexity, then what exactly are we to gain? What about other tooling (e.g., Stack)?
With regards to editor support, why aim for a generic JSON autocompletion? These days we should not settle for anything less than a domain-specific language server, and custom format is not a hindrance for it.
I'm sorry if my tone sounds harsh, but I'm afraid we are chasing an ideal to the detriment of compatibility, as it's very customary in Haskell community.
Maybe Starlark is a decent option if logic is important to this project?
Given that the quality standards and popularity standings for configuration languages change every decade, I'd rather focus on a good internal representation, support the old cabal format (and only this format) with a forever guarantee, and let contributors add exact-parser-prettyprinters for whatever format works best for them. We also need a story for keeping in sync many files that contain the same information, or for translation on the fly (e.g., when showing a .cabal from the Hackage webpage of a package).
This might be a totally stupid idea, but how about using limited Haskell for configuration?
For example, the configuration is a module that exports one binding named config, which has type Config. Limited to Haskell98: no GHC extensions, no external packages, no IO.
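As a rough sketch of the idea (the Cabal.Config module and the Config/Dependency types below are hypothetical, purely for illustration):

module PackageConfig (config) where

-- Hypothetical module and types, just to show the shape of the idea.
import Cabal.Config (Config (..), Dependency (..))

config :: Config
config = Config
  { packageName    = "modular-arithmetic"
  , packageVersion = [2, 0, 0, 1]
  , exposedModules = ["Data.Modular"]
  , buildDepends   =
      [ Dependency "base" ">4.9 && <5"
      , Dependency "typelits-witnesses" "<0.5"
      ]
  }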
No one is suggesting it, so I assume there is an obvious reason this is not a good idea.
@kamoii the main argument against that is that a Haskell program is not guaranteed to terminate
The argument is also to not invent something new as much as possible. We want to leverage existing tooling, syntax highlighting, etc. A limited Haskell only lets us benefit from a fraction of this
Two forgotten things in this discussion:
First: JSON / YAML / ... and even Dhall would still need some stringly sublanguages, as @Ericson2314 hints. Consider the build-depends or mixins fields.
build-depends: foo (>=0.4.0.0 && <0.4.1) || (>=0.5 && <0.6)
mixins: foo (Foo.Bar as AnotherFoo.Bar, Foo.Baz as AnotherFoo.Baz)
"build-depends": {
"foo": {
"and": [ { "or": [ { ">=": "0.4.0.0" }
, { "<": "0.4.1" }
]
}
, { "or" : [ { ">=": "0.5" }
, { "<": "0.6" }
]
}
]
}
(Better would be to model version numbers as [0, 4, 0, 0], i.e. an array of integers - though what would [0.0, 4.0, 0.0, 0.0] mean?!)
I don't even try to model mixins. Dhall would look terrible as well (from the dhall-to-cabal README):
in    GitHub-project { owner = "ocharles", repo = "example" }
    ⫽ { version = prelude.v "1.0.0"
      , library =
          prelude.unconditional.library
            (   prelude.defaults.MainLibrary
              ⫽ { build-depends =
                    [ { package = "base"
                      , bounds = prelude.majorBoundVersion (prelude.v "4")
                      }
                    ]
                }
            )
      }
There is also license, which uses SPDX license expressions, which is a standard just for that. NPM embeds them as strings, i.e. there is no benefit from generic JSON string handling helping edit them (though honestly that field is rarely edited).
EDIT: Also file globs (though I think it was a mistake to add them to the .cabal format).
If we use stringly sublanguages (like in @TikhonJelvis's examples) we will need to explain their syntax anyway. Nothing changes in comparison with the current format.
Writing a tool to automatically edit bounds is still difficult with stringly build-depends (as difficult as today, I would say).
Second: Performance matters. The solver parses plenty of package descriptions while figuring out dependencies. Dhall's unbounded computation cost is asking for problems. Package descriptions in indices should be (close to) normal forms. Common stanzas make the current format not normal, but their substitution is cheap (linear cost).
Currently the hackage-tests test suite (cabal run hackage-tests parsec) reports on my machine:
Reading index from: /cabal/packages/hackage.haskell.org/01-index.tar
151055 files processed
41573 files contained warnings
0 files failed to parse
147.663162 seconds elapsed
0.977546 milliseconds per file
That 1ms per file is a good goal. cabal is used as an interactive tool.
A solution is that cabal sdist would normalise the package description files before packing a source tarball. That would work, but we would need to specify the normal form independently. The normal form would only need to be readable by humans, not necessarily convenient to write.
That approach would make sense for revisions too; it might be substantially easier to specify which edits are valid on the normal forms than on the "full" grammar. The current check is semi-syntactical, which is somewhat limiting.
Another solution is that cabal update would produce a cache with normalized descriptions. The drawback is that it would take at least 3 minutes! (Or be too clever and brittle trying to reuse older caches.)
If we really want to change the format to something "used elsewhere", then EDN is actually not that bad (I was taught scheme in school).
:build-depends
{ "foo"
(|| (&& (>= #(ver 0 4 0 0)) (< #(ver 0 4 1)))
    (&& (>= #(ver 0 5)) (< #(ver 0 6)))
)
}
:mixins
{ "foo"
(as [Foo Bar] [AnotherFoo Bar])
; the drawback is that everything is different, if EDN structure is used deeply:
; even the module names, as "Foo.Bar" is an expression in a sublanguage for module names,
; something general EDN tools are not aware of.
...
TL;DR, I challenge JSON, ..., Dhall suggestors to model e.g.
- https://hackage.haskell.org/package/transformers-compat-0.7/transformers-compat.cabal
- https://hackage.haskell.org/package/raaz-0.3.0/raaz.cabal
in their favourite "syntax" format. Otherwise this discussion is just wasting everyone's time by not being concrete.
(IMO simple examples don't tell much, simple stuff is easy).
Does "unlimited Haskell", as opposed to limited Haskell, qualify as not something new? IMO the argument that Haskell is Turing complete isn't that compelling, as the nix expression language is too. With a cabal file being just some Haskell expression of type Config, or a single-file program
import Cabal
main = buildPackage PackageOptions {...} -- dependencies and build configuration here
we get to use all of the existing Haskell tooling and get around the sublanguage issues by representing everything as normal Haskell values, which if I understand correctly cabal today does anyway.
Going further with this train of thought, it seems like any configuration format, JSON, YAML, Dhall, edn, TOML, etc. is basically some level of indirection that gets parsed into a Haskell value at build time, so why not just focus on making a more convenient EDSL for Cabal the library?
There's a reason we encourage cabal files rather than custom setups -- far easier for external consumption (even with a not fully specified grammar). To get values out of a haskell executable it needs to either emit them (in which case the format it emits in is the actual spec) or you need to build and link into it directly. Either way you're compiling and building a haskell program every time you want to ask "what modules does this package provide." That is not feasible for, e.g., a package store such as hackage.
The external consumption argument is very compelling. I guess we could have cabal generate a lockfile from a build specification in Haskell and have other tools read that. We already have cabal.project.freeze/stack.yaml.lock so there's precedent, but those files haven't historically been required.
...but then you get in the same situation as now, it's just that cabal is doing the conversion instead of dhall2cabal/hpack/... You still have to commit/upload/distribute the redundant (as opposed to freeze files) generated file
In this issue there is an interesting discussion about how to handle configuration formats other than the builtin cabal one: https://github.com/haskell/cabal/issues/5343
I see no problem that e.g. a YAML outer syntax has to be complemented by ad hoc expression syntaxes for certain fields (constraints etc.) that transcend YAML. Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
YAML-bombs can be avoided by restricting to a sublanguage of YAML.
The syntax examples in https://github.com/haskell/cabal/issues/7548#issuecomment-899557379 look like straw men to me.
Does "unlimited Haskell" as opposed to limited Haskell qualify as not something new?
I encourage anyone who thinks this is a good idea to think about how much fun Setup.hs is already (hint: extremely unfun). That is: if you need to compile and run a Haskell program to work out what your config is, now you need configuration to work out how to compile and run the config program. What compiler options does it use? What libraries does it have access to? What GHC version is it using? etc. And what if the level-2 configuration is also a non-trivial Haskell program? Time for level-3 configuration. Extremely unfun.
Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
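For what it's worth, a minimal example of that approach (module layout is roughly that of Cabal 3.x; it has moved around a bit between versions):

module Main where

-- Read a .cabal file with the Cabal library and print its package id.
import Distribution.PackageDescription (package, packageDescription)
import Distribution.PackageDescription.Parsec (readGenericPackageDescription)
import Distribution.Verbosity (silent)

main :: IO ()
main = do
  gpd <- readGenericPackageDescription silent "modular-arithmetic.cabal"
  print (package (packageDescription gpd))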
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
For the Haskell programmer, there is the obstacle of Cabal being a large package that regularly undergoes changes.
Some third parties might not even use Haskell to write code that extracts information from a .cabal file. YAML parsers are ubiquitous...
Anecdotally, I have just written a small tool (https://github.com/andreasabel/cabal-clean) to partially clean artefacts from dist-newstyle/build, and I originally considered drawing some information (version, tested-with) from the respective .cabal file. But I shied away as there was no light-weight parser for cabal files.
That's the old problem of the Cabal library being both the package-description-reading library and its interpretation, i.e. building. The former part barely changes (except for the normal "let's make the library better").
I'd welcome the split, as my tools use only that "parse a .cabal file" part. The Distribution.Simple namespace can be left for build-type: Custom packages and cabal-install use.
EDIT: even if the outer format were JSON or YAML, a library for working with it on a higher level than "it's some JSON" would still have to exist (c.f. the cabal-plan library for plan.json files).
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
To use Cabal as a library, I have to use Haskell and probably write a Cabal file for that script so that it can depend on Cabal as an external library. Figuring all of that out is a pretty steep up-front cost!
Making it easy to parse project metadata in any language would lower the barrier to entry for adding Haskell to a multi-language environment. I've worked on projects that combined Python, Rust and Haskell which all fit together pretty well. There is no real reason that writing a Python script to get project metadata across these languages should be difficult.
The new syntax (TOML or whatever) won't have to handle everything. Cabal files are structured as key-value pairs along with sections. TOML could replace that, leave other things (mixins, version bounds, etc) as strings and still be useful.
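As a sketch of that split: the outer TOML/JSON layer would hand the bounds over as plain strings, and the existing Cabal parsers (here via simpleParsec from the Cabal library) would handle the sublanguage.

import Distribution.Parsec (simpleParsec)
import Distribution.Types.VersionRange (VersionRange)

-- The outer format gives us the raw string; Cabal's existing parser
-- turns it into a structured version range.
parseBounds :: String -> Maybe VersionRange
parseBounds = simpleParsec

-- e.g. parseBounds ">=0.4.0.0 && <0.4.1" yields Just the parsed range,
-- while parseBounds "nonsense" yields Nothing.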
The new syntax (TOML or whatever) won't have to handle everything. Cabal files are structured as key-value pairs along with sections. TOML could replace that, leave other things (mixins, version bounds, etc) as strings and still be useful.
I see no problem that e.g. a YAML outer syntax has to be complemented by ad hoc expression syntaxes for certain fields (constraints etc.) that transcend YAML. Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
The problem is that there's very little that can actually be expressed directly in TOML/YAML, mainly top-level strings like package name, description, author, copyright...
As soon as you want other fields you need additional or cumbersome syntax, and extra validation on top of that.
For example:
Anecdotally, I have just written a small tool (https://github.com/andreasabel/cabal-clean) to partially clean artefacts from dist-newstyle/build, and I originally considered drawing some information (version, tested-with) from the respective .cabal file.
These two fields are structured, and would need to be represented as something like
version = [ 0, 1, 0, 0 ]
tested-with = [
{ compiler = GHC, version = [ 8, 10, 5 ] },
{ compiler = GHC, version = [ 9, 0, 1 ] }
]
where "GHC" needs to be validated and all the versions need to be checked for emptiness and component length.
This means that even in other languages you'd need a library to parse .cabal files for everything but the most basic stuff.
I think the most beneficial thing would be to split the parser from the rest of Cabal so that we'd have a light-weight library to depend on, like @phadej and @gbaz suggested: #7559
edit: I checked, and tested-with is even more complex than this: the version is a version range (even though most people just use ==), so it's as complex as build-depends!
To use Cabal as a library, I have to use Haskell and probably write a Cabal file for that script so that it can depend on Cabal as an external library
Technically, Cabal being a core/boot/builtin (I forget what they're called) library, you only need GHC unless you depend on something else.