Migrate from the .cabal format to a widely supported format
In the wake of the exact-printer initiative, I proposed another approach: why not say goodbye to the .cabal file format and switch to something that is widely supported?
There are a few alternatives; the most important attribute is that they are widely supported in industry.
- JSON
- YAML
- TOML
Note that all of these are (mostly) isomorphic to JSON (scalars, lists, dicts), which is important for easy translation between them (e.g. for config generation purposes).
What would this give the Haskell ecosystem?
- Editor support: every modern editor like vscode has a way of assigning a JSON schema to a file, which gives completion and inline documentation for free everywhere
- Also, syntax highlighting and auto-formatting come for free
- Cabal doesn’t need to implement its own parser. If JSON is chosen, the parser is even context-free.
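For illustration only (the schema and field names below are invented, not an actual proposal), a hypothetical cabal.json could be consumed with an off-the-shelf JSON library such as aeson rather than a bespoke parser:

{-# LANGUAGE DeriveGeneric #-}
module ReadCabalJson where

-- Hypothetical sketch: all field names here are made up for illustration.
import Data.Aeson (FromJSON, eitherDecodeFileStrict)
import GHC.Generics (Generic)

data Package = Package
  { name     :: String
  , version  :: String
  , synopsis :: Maybe String
  } deriving (Show, Generic)

instance FromJSON Package

main :: IO ()
main = do
  result <- eitherDecodeFileStrict "cabal.json"
  case result of
    Left err  -> putStrLn ("parse error: " ++ err)
    Right pkg -> print (pkg :: Package)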
What would it give to users?
- Instant familiarity with the format: you don’t touch cabal files all too often as a user, so you don’t want to learn yet another syntax
- Templating cabal config with standard tooling (e.g. jq, yj), which is important e.g. in a monorepo context
- Inline documentation without setup
What are others doing?
Most modern package managers that don't go the full Turing-complete configuration route (as e.g. Scala's sbt and Erlang's tooling do) usually converge their config on a widely supported syntax.
Examples:
- npm, yarn (package.json, package-lock.json)
- Stack (stack.yaml, stack.yaml.lock)
- hpack (package.yaml)
- cargo (Cargo.toml)
- Elm (elm-package.json)
- Maven (pom.xml)
- poetry (pyproject.toml, poetry.lock)
Counterexamples:
- go (go.mod), though flat shasums and go packages have no configuration file
- pip (requirements.txt), though see poetry above
- sbt, hex: both use their Turing-complete parent languages
- leiningen (project.clj); clojure is a lisp, and sexps are already a data format
I don't expect cabal would drop support for the cabal file format very soon; rather, it would start out by generating a .cabal file from the .json/.toml/.yaml for consumption by older versions of cabal. Then, after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
Note that some people have mentioned dhall as a possible alternative, but using it would destroy most benefits, namely:
- Familiarity
- Editor support
- Widely available tooling for ops integration
- Simple parser
- Isomorphic to JSON
However, I expect that there would be a dhall library for generating cabal.json files, which could help integrate dhall-based (dev)ops setups with Haskell packages.
Is this cabal.json the plan.json from the cabal docs, or is it a .cabal in JSON format?
plan.json (JSON) A JSON serialization of the computed install plan intended for integrating cabal with external tooling. The cabal-plan package provides a library for parsing plan.json files into a Haskell data structure as well as an example tool showing possible applications.
Note that some people have mentioned dhall as a possible alternative
Also note that this already exists as dhall-to-cabal. While its goal was to generate .cabal files, that's not the only solution. A more integrated solution would be a cabal-install that can actually consume these files. I'm not saying this is the solution, just mentioning this as prior art. I'll step out of the conversation for now and let others share their thoughts, but if anyone wants to talk about Dhall in particular here, I do have thoughts.
Is this cabal.json the plan.json from the cabal docs, or is it a .cabal in JSON format?
It is the current .cabal file in a not-home-grown syntax
I proposed another approach: why not say good bye to the .cabal file format and switch to something that is widely supported.
When I think about it, I don't think these are mutually exclusive tickets: an exact printer gets us a reasonable source representation. This is good for a few reasons: we can derive translational tools from the same representation; we only need to change the parser and the printer. This frees up effort to migrate between formats!
There are a few alternatives; the most important attribute is that they are widely supported in industry.
Of the three suggestions, TOML is the most attractive. YAML has too much variable syntax, and JSON is aesthetically (and mechanically) displeasing for me to write as a human. TOML's grammar is minimal and admits a small, easy-to-generate-and-verify parser and lexer (note: toml-parser is a little outdated), which eliminates the need for us to write one from scratch. In fact, maintaining this would be a dream, and it would be a boon for tooling, since we can derive an ABNF for our specific flavor quite easily.
Then after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
👍
As an experiment, I took a library I wrote a few years ago and manually converted its Cabal file to TOML. I like it! The conversion can be totally systematic.
TOML was noticeably nicer to edit—Emacs has a simple built-in TOML mode and I didn't have to worry about indentation/formatting. (I've done Haskell for over a decade now and I'm still not consistent in how I format Cabal files!) Structured commands for navigating and editing the TOML file would be nice; I don't know if something like this already exists, but if it doesn't, adding it to Emacs would be easy. I wouldn't even think of trying something like that for Cabal's custom syntax.
I've used YAML a lot more than TOML in the past. Compared to YAML, I found needing to quote all my strings a bit annoying; on the other hand, TOML was much nicer to pick up and doesn't have weird corner cases to worry about. At work I recently ran into some weird YAML files that used anchors in a way that didn't work in Python—not something that would happen with TOML.
In my dream world we would use an S-expression based syntax (like sexplib) but I know that is not to be :(.
I immediately found that multiline strings were useful. Multiline strings and comments seem like the bare minimum for a human-oriented format; YAML and TOML support them, JSON doesn't.
It's a bit long, but here's the whole file:
cabal-version = "2.2"
[package]
name = "modular-arithmetic"
version = "2.0.0.1"
synopsis = "A type for integers modulo some constant."
description = """
A convenient type for working with integers modulo some constant. It saves you from manually wrapping numeric operations all over the place and prevents a range of simple mistakes. @Integer `Mod` 7@ is the type of integers (mod 7) backed by @Integer@.
We also have some cute syntax for these types like @ℤ/7@ for integers modulo 7.
"""
homepage = "https://github.com/TikhonJelvis/modular-arithmetic"
bug-reports = "https://github.com/TikhonJelvis/modular-arithmetic/issues"
license = "BSD-3-Clause"
license-file = "LICENSE"
author = "Tikhon Jelvis <[email protected]>"
maintainer = "Tikhon Jelvis <[email protected]>"
category = "Math"
build-type = "Simple"
extra-source-files = ["README.md", "CHANGELOG.md"]
[source-repository.head]
type = "git"
location = "git://github.com/TikhonJelvis/modular-arithmetic.git"
[library]
hs-source-dirs = ["src"]
ghc-options = ["-Wall"]
default-language = "Haskell2010"
exposed-modules = [
"Data.Modular"
]
build-depends = [
"base >4.9 && <5",
"typelits-witnesses <0.5"
]
[test-suite.examples]
hs-source-dirs = ["test-suite", "src"]
main-is = "DocTest.hs"
default-language = "Haskell2010"
type = "exitcode-stdio-1.0"
build-depends = [
"base >4.9 && <5",
"doctest >= 0.9",
"typelits-witnesses <0.5"
]
Another benefit: the format would be naturally extensible. Cabal could provide a section for plugin/tool/etc config, and tools would have no issues parsing values from there. I'm imagining something like this:
[plugin.liquid-haskell]
smt-solver = "z3mem"
My experience has been that providing "extension points" in formats is always useful. We can't figure out everything people want to do with their libraries ahead of time but we can make the format adaptable. If people need something Cabal doesn't support, they can add it while still keeping a single canonical file for library-specific settings.
For yaml there's also of course hpack. So anyone who wants to write cabal files in either yaml or dhall is welcome to do so. Note that we don't have exactprinters for either of those formats either, as far as I know. As I recall, due to the semantics of yaml, conditional clauses are rather unpleasant there, among a few other issues (and pretty-printing reorders things in unpleasant ways as well). (And also as emily notes, the yaml grammar is rather complicated as is).
Toml does seem promising, but I worry that its support for conditionals or other more complex syntax wouldn't be particularly great either. Translations of some more complex files might be worthwhile, to experiment with this.
Btw, note that cabal is already extensible, via "x-" fields.
In any case, I think the right next step is to get the cabal grammar pinned down and to have an exactprinter for at least the format we already have and is widespread.
This is good for a few reasons: we can derive translational tools from the same representation; we only need to change the parser and the printer. This frees up effort to migrate between formats!
I like this and am the maintainer of the translational tool hpack-dhall that can translate:
dhall -> cabal
dhall -> json
dhall -> yaml # the package.yaml format of hpack
dhall -> dhall # with imports resolved
I am a bit wary about each format being capable of doing a faithful representation. For instance, hpack's conditionals can break dhall's typing. This is the trouble @gbaz just mentioned.
I want this to work, but unfortunately I see some issues with all the proposed formats so far. I think TOML / YAML / JSON will never work, but Dhall, while it might not be a good fit today, can be made to work.
TOML / YAML / JSON
If Cabal files were merely data they would be fine, but unfortunately they are code, due to the conditionals and the parameters used in those conditions. This is true of Cargo packages too, and the solution there has been to stuff syntax into strings. Firstly, this largely defeats the point, as we still need application-specific parsers (and pretty-printers!) to handle those strings.
But more worryingly, I have reason to believe this has warped the design process of Cargo. See, for example, the back and forth between @djc and me in https://github.com/rust-lang/rfcs/pull/3143, where @djc agrees Cargo has backed itself into a corner, but objects to my further use of strings, or to encoding the information in a more structured but awkward and verbose way. I may have disagreed with @djc on which unpleasant choice to take, but I absolutely do agree that TOML forcing Cargo into this awkward situation is tragic, and no one should have to pick between those unpleasant options in the first place.
Cabal converting from the existing design avoids some of the distortion from TOML's perverse incentives, but I have no doubt the language of Cabal files will continue to evolve, and I don't want "TOML goggles" to mess things up going forward.
Dhall
Dhall is an actual programming language, and therefore squarely fixes the above issues. And to be clear, I would really like to endorse Dhall, as it is the right sort of way to make these things conform to a standard. There are two quibbles with Dhall as it currently exists, however, that I think should be addressed first:
- Imports / IO. As far as I know, Dhall always allows downloading arbitrary stuff, etc., as long as you give it a content address of some sort to make it pure (like fixed-output derivations). This is a fine design in general, but I worry about e.g. cabal2nix needing internet access to do its job, which (ironically, given the nix inspiration!) would be a regression and a major pain. If there is a way to force Dhall programs to be more self-contained, that would assuage my concern.
- Abstract interpretation. The Dhall model somewhat assumes that dhall will evaluate a closed term, spitting out a value for the consuming application to deal with. But this doesn't totally reflect how Cabal works. Today, we have automatic flags, which means we need to vary parameters based on results. With a few flags this can be brute forced (see the toy sketch below), but with more, abstract interpretation is much more efficient. Dhall's strong normalization is good for making such static analysis tractable, but we might also want additional restrictions to make it efficient and also easier for humans to understand.
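To make the "brute forced" remark concrete, here is a toy Haskell sketch (the names Flag, Assignment and satisfiable are hypothetical, not Cabal's actual API) of enumerating automatic-flag assignments and picking the first satisfiable one:

import Data.Maybe (listToMaybe)

type Flag = String
type Assignment = [(Flag, Bool)]

-- Enumerate all 2^n assignments of the automatic flags...
assignments :: [Flag] -> [Assignment]
assignments []     = [[]]
assignments (f:fs) =
  [ (f, b) : rest | b <- [True, False], rest <- assignments fs ]

-- ...and pick the first one whose resulting dependencies can be solved.
-- With many flags this blows up exponentially, which is why an abstract
-- interpretation style analysis becomes attractive.
chooseFlags :: (Assignment -> Bool) -> [Flag] -> Maybe Assignment
chooseFlags satisfiable flags =
  listToMaybe (filter satisfiable (assignments flags))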
Maybe that is overkill for manual flags, but @mpickering's and my presentation on what comes after CPP (https://icfp19.sigplan.org/details/hiw-2019-papers/9/Configuration-but-without-CPP, https://www.youtube.com/watch?v=YupkE1vsZ4o) has gotten me thinking about abstract interpretation more broadly. Eventually we want to tackle the goal of "type safe packaging" i.e. ensuring all valid version solving solutions will in fact compile. It's hard, but not tackling it is anathema to our values, and abstract interpretation of various sorts is key to making it work.
So yeah, in conclusion, I want Dhall to work, but it's important that we be able to restrict ourselves to a sort of "mini Dhall" so we can do this analysis, and we will have to integrate Dhall with Cabal fairly deeply. I'm not sure whether the current Dhall implementation supports such a restricted "mini Dhall", but that can easily be fixed.
Then after a multi-year grace period, the new format would become the standard and projects could drop their autogenerated .cabal files.
Remember that Hackage is an append-only repository. It would be utterly disappointing if a future version of Cabal were unable to build an old package just because it no longer parses its very own package format. So I don't think it would be wise to abandon the parser of Cabal files even after a very long grace period. And if we are to retain the parser and all its complexity, then what exactly are we to gain? What about other tooling (e.g., Stack)?
With regards to editor support, why aim for a generic JSON autocompletion? These days we should not settle for anything less than a domain-specific language server, and custom format is not a hindrance for it.
I'm sorry if my tone sounds harsh, but I'm afraid we are chasing an ideal to the detriment of compatibility, as it's very customary in Haskell community.
Maybe Starlark is a decent option if logic is important to this project?
Given that the quality standards and popularity standings for configuration languages change every decade, I'd rather focus on a good internal representation, support the old cabal format (and only this format) with a forever guarantee, and let contributors add exact-parser-prettyprinters for whatever format works best for them. We also need a story for keeping in sync many files that contain the same information, or for translation on the fly (e.g., when showing a .cabal from the Hackage webpage of a package).
This might be a totally stupid idea, but how about using limited Haskell for configuration?
For example, the configuration is a module that exports one binding named config, which has type Config. Limited to Haskell98: no GHC extensions, no external packages, no IO.
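As a rough sketch of the idea (the Cabal.Config module and the Config/Dependency types below are hypothetical, purely for illustration):

module PackageConfig (config) where

-- Hypothetical module and types, just to show the shape of the idea.
import Cabal.Config (Config (..), Dependency (..))

config :: Config
config = Config
  { packageName    = "modular-arithmetic"
  , packageVersion = [2, 0, 0, 1]
  , exposedModules = ["Data.Modular"]
  , buildDepends   =
      [ Dependency "base" ">4.9 && <5"
      , Dependency "typelits-witnesses" "<0.5"
      ]
  }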
No one is suggesting it, so I assume there is an obvious reason this is not a good idea.
@kamoii the main argument against that is that a Haskell program is not guaranteed to terminate
The argument is also to not invent something new as much as possible. We want to leverage existing tooling, syntax highlighting, etc. A limited Haskell only lets us benefit from a fraction of this
Two forgotten things in this discussion:
First: JSON / YAML / ... and even Dhall would still need some stringly sublanguages, as @Ericson2314 hints. Consider the build-depends or mixins fields.
build-depends: foo (>=0.4.0.0 && <0.4.1) || (>=0.5 && <0.6)
mixins: foo (Foo.Bar as AnotherFoo.Bar, Foo.Baz as AnotherFoo.Baz)
"build-depends": {
"foo": {
"and": [ { "or": [ { ">=": "0.4.0.0" }
, { "<": "0.4.1" }
]
}
, { "or" : [ { ">=": "0.5" }
, { "<": "0.6" }
]
}
]
}
(Better would be to model version numbers as [0, 4, 0, 0], i.e. an array of integers - though what would [0.0, 4.0, 0.0, 0.0] mean?!)
I don't even try to model mixins. Dhall would look terrible as well (from the dhall-to-cabal README):
in    GitHub-project { owner = "ocharles", repo = "example" }
    ⫽ { version = prelude.v "1.0.0"
      , library =
          prelude.unconditional.library
            (   prelude.defaults.MainLibrary
              ⫽ { build-depends =
                    [ { package = "base"
                      , bounds = prelude.majorBoundVersion (prelude.v "4")
                      }
                    ]
                }
            )
      }
There is also license, which uses SPDX license expressions, which is a standard just for that. NPM embeds them as strings, i.e. there is no benefit from generic JSON string handling helping edit them (though honestly that field is rarely edited).
EDIT: Also file globs (though I think it was a mistake to add them to the .cabal format).
If we use stringly sublanguages (like in @TikhonJelvis's examples) we will need to explain their syntax anyway. Nothing changes in comparison with the current format.
Writing a tool to automatically edit bounds is still difficult with stringly build-depends (as difficult as today, I would say).
Second: Performance matters. The solver parses plenty of package descriptions while figuring out dependencies. Dhall's unbounded computation cost is asking for problems. Package descriptions in indices should be (close to) normal forms. Common stanzas make the current format not normal, but their substitution is cheap (linear cost).
Currently the hackage-tests test suite (cabal run hackage-tests parsec) reports on my machine:
Reading index from: /cabal/packages/hackage.haskell.org/01-index.tar
151055 files processed
41573 files contained warnings
0 files failed to parse
147.663162 seconds elapsed
0.977546 milliseconds per file
That 1ms per file is a good goal. cabal is used as an interactive tool.
A solution is that cabal sdist would normalise the package description files before packing a source tarball. That would work, but we would need to specify the normal form independently. The normal form would only need to be readable by humans, not necessarily convenient to write.
That approach would make sense for revisions too; it might be substantially easier to specify which edits are valid on the normal forms than on the "full" grammar. The current check is semi-syntactical, which is somewhat limiting.
Another solution is that cabal update would produce a cache with normalized descriptions. The drawback is that it would take at least 3 minutes! (Or be too clever and brittle trying to reuse older caches.)
If we really want to change the format to something "used elsewhere", then EDN is actually not that bad (I was taught scheme in school).
:build-depends
{ "foo"
(|| (&& (>= #(ver 0 4 0 0)) (< #(ver 0 4 1)))
    (&& (>= #(ver 0 5)) (< #(ver 0 6)))
)
}
:mixins
{ "foo"
(as [Foo Bar] [AnotherFoo Bar])
; the drawback is that everything is different, if EDN structure is used deeply:
; even the module names, as "Foo.Bar" is an expression in a sublanguage for module names,
; something general EDN tools are not aware of.
...
TL;DR, I challenge JSON, ..., Dhall suggestors to model e.g.
- https://hackage.haskell.org/package/transformers-compat-0.7/transformers-compat.cabal
- https://hackage.haskell.org/package/raaz-0.3.0/raaz.cabal
in their favourite "syntax" format. Otherwise this discussion is just wasting everyone's time by not being concrete.
(IMO simple examples don't tell much, simple stuff is easy).
Does "unlimited Haskell", as opposed to limited Haskell, qualify as not something new? IMO the argument that Haskell is Turing complete isn't that compelling, as the nix expression language is too. With a cabal file being just some Haskell expression of type Config, or a single-file program
import Cabal
main = buildPackage PackageOptions {...} -- dependencies and build configuration here
we get to use all of the existing Haskell tooling and get around the sublanguage issues by representing everything as normal Haskell values, which if I understand correctly cabal today does anyway.
Going further with this train of thought, it seems like any configuration format, JSON, YAML, Dhall, edn, TOML, etc. is basically some level of indirection that gets parsed into a Haskell value at build time, so why not just focus on making a more convenient EDSL for Cabal the library?
There's a reason we encourage cabal files rather than custom setups -- far easier for external consumption (even with a not fully specified grammar). To get values out of a haskell executable it needs to either emit them (in which case the format it emits in is the actual spec) or you need to build and link into it directly. Either way you're compiling and building a haskell program every time you want to ask "what modules does this package provide." That is not feasible for, e.g., a package store such as hackage.
The external consumption argument is very compelling. I guess we could have cabal generate a lockfile from a build specification in Haskell and have other tools read that. We already have cabal.project.freeze/stack.yaml.lock so there's precedent, but those files haven't historically been required.
...but then you get in the same situation as now, it's just that cabal is doing the conversion instead of dhall2cabal/hpack/... You still have to commit/upload/distribute the redundant (as opposed to freeze files) generated file
In this issue there is an interesting discussion about how to handle configuration formats other than the builtin cabal one: https://github.com/haskell/cabal/issues/5343
I see no problem that e.g. a YAML outer syntax has to be complemented by ad hoc expression syntaxes for certain fields (constraints etc.) that transcend YAML. Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
YAML-bombs can be avoided by restricting to a sublanguage of YAML.
The syntax examples in https://github.com/haskell/cabal/issues/7548#issuecomment-899557379 look like straw men to me.
Does "unlimited Haskell" as opposed to limited Haskell qualify as not something new?
I encourage anyone who thinks this is a good idea to think about how much fun Setup.hs is already (hint: extremely unfun). That is: if you need to compile and run a Haskell program to work out what your config is, now you need configuration to work out how to compile and run the config program. What compiler options does it use? What libraries does it have access to? What GHC version is it using? etc. And what if the level-2 configuration is also a non-trivial Haskell program? Time for level-3 configuration. Extremely unfun.
Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
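For what it's worth, a minimal example of that approach (module layout is roughly that of Cabal 3.x; it has moved around a bit between versions):

module Main where

-- Read a .cabal file with the Cabal library and print its package id.
import Distribution.PackageDescription (package, packageDescription)
import Distribution.PackageDescription.Parsec (readGenericPackageDescription)
import Distribution.Verbosity (silent)

main :: IO ()
main = do
  gpd <- readGenericPackageDescription silent "modular-arithmetic.cabal"
  print (package (packageDescription gpd))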
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
For the Haskell programmer, there is the obstacle of Cabal being a large package that regularly undergoes changes.
Some third parties might not even use Haskell to write code that extracts information from a .cabal file. YAML parsers are ubiquitous...
Anecdotally, I have just written a small tool (https://github.com/andreasabel/cabal-clean) to partially clean artefacts from dist-newstyle/build, and I originally considered drawing some information (version, tested-with) from the respective .cabal file. But I shied away as there was no light-weight parser for cabal files.
That's the old problem of the Cabal library being both the package-description-reading library and its interpretation, i.e. building. The former part barely changes (except for the normal "let's make the library better").
I'd welcome the split, as my tools use only that "parse a .cabal file" part. The Distribution.Simple namespace can be left for build-type: Custom packages and cabal-install use.
EDIT: even if the outer format were JSON or YAML, a library for working with it on a higher level than "it's some JSON" would still have to exist (c.f. the cabal-plan library for plan.json files).
What's wrong with using Cabal as a library? I had great success with that. You need it anyway for expression parsing.
To use Cabal as a library, I have to use Haskell and probably write a Cabal file for that script so that it can depend on Cabal as an external library. Figuring all of that out is a pretty steep up-front cost!
Making it easy to parse project metadata in any language would lower the barrier to entry for adding Haskell to a multi-language environment. I've worked on projects that combined Python, Rust and Haskell which all fit together pretty well. There is no real reason that writing a Python script to get project metadata across these languages should be difficult.
The new syntax (TOML or whatever) won't have to handle everything. Cabal files are structured as key-value pairs along with sections. TOML could replace that, leave other things (mixins, version bounds, etc) as strings and still be useful.
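As a sketch of that split: the outer TOML/JSON layer would hand the bounds over as plain strings, and the existing Cabal parsers (here via simpleParsec from the Cabal library) would handle the sublanguage.

import Distribution.Parsec (simpleParsec)
import Distribution.Types.VersionRange (VersionRange)

-- The outer format gives us the raw string; Cabal's existing parser
-- turns it into a structured version range.
parseBounds :: String -> Maybe VersionRange
parseBounds = simpleParsec

-- e.g. parseBounds ">=0.4.0.0 && <0.4.1" yields Just the parsed range,
-- while parseBounds "nonsense" yields Nothing.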
The new syntax (TOML or whatever) won't have to handle everything. Cabal files are structured as key-value pairs along with sections. TOML could replace that, leave other things (mixins, version bounds, etc) as strings and still be useful.
I see no problem that e.g. a YAML outer syntax has to be complemented by ad hoc expression syntaxes for certain fields (constraints etc.) that transcend YAML. Having an outer YAML syntax would still allow third-party tools easy access to certain contents of the .cabal file, and nice syntax (that is, the current syntax) for constraints can be parsed from string fields using/adapting the existing cabal parsers.
The problem is that there's very little that can actually be expressed directly in TOML/YAML, mainly top-level strings like package name, description, author, copyright...
As soon as you want other fields you need additional or cumbersome syntax, and extra validation on top of that.
For example:
Anecdotally, I have just written a small tool (https://github.com/andreasabel/cabal-clean) to partially clean artefacts from dist-newstyle/build, and I originally considered drawing some information (version, tested-with) from the respective .cabal file.
These two fields are structured, and would need to be represented as something like
version = [ 0, 1, 0, 0 ]
tested-with = [
{ compiler = GHC, version = [ 8, 10, 5 ] },
{ compiler = GHC, version = [ 9, 0, 1 ] }
]
where "GHC" needs to be validated and all the versions need to be checked for emptiness and component length.
This means that even in other languages you'd need a library to parse .cabal files for everything but the most basic stuff.
I think the most beneficial thing would be to split the parser from the rest of Cabal so that we'd have a light-weight library to depend on, like @phadej and @gbaz suggested: #7559
edit: I checked, and tested-with is even more complex than this: the version is a version range (even though most people just use ==), so it's as complex as build-depends!
To use Cabal as a library, I have to use Haskell and probably write a Cabal file for that script so that it can depend on Cabal as an external library
Technically, Cabal being a core/boot/builtin (I forget what they're called) library, you only need GHC unless you depend on something else.