cabal icon indicating copy to clipboard operation
cabal copied to clipboard

Cabal exact printer in stages? (and other grand design questions)

Open jappeace opened this issue 2 months ago • 33 comments

Previous take: https://github.com/haskell/cabal/issues/7544


Hello 👋🏽 Good news!

I've gotten commitment from Haskell Foundation to start working on the exact printer. @leana8959 has agreed to help, she'll do the most work and I'll support her when she gets stuck or lost.

Recently I've encountered a better idea when working on HIE files then the timeline described in the proposal. We can split up the exact print work into smaller PR's, which should make it easier to review them, for example:

  • Add comment parsing support (without printing, just store it in the tree)
  • Refactor common stanzas to be supported by a generic package description, instead of being merged and "forgotten".

If those features could be added independently, the exact printer PR itself no longer has to modify the parser, and it no longer has to do major refactors, so it becomes a lot smaller! Furthermore, these PR's on their own will make it easier to study performance impact.

Do the maintainers think these two features are useful independently of the exact printer? Or do you think it's preferable to make one big, exact print PR?

jappeace avatar Sep 24 '25 21:09 jappeace

As a reviewer, I would very much prefer smaller PRs. https://github.com/haskell/cabal/pull/9718 was a nightmare.

geekosaur avatar Sep 24 '25 21:09 geekosaur

Hello, I have opened a draft PR at #11252 :)

leana8959 avatar Oct 10 '25 04:10 leana8959

Second draft is opened here, for import merging: https://github.com/haskell/cabal/pull/11277

jappeace avatar Nov 13 '25 14:11 jappeace

This PR is for the changes to the test suite: https://github.com/haskell/cabal/pull/11285

(separated because it's such a large diff!)

jappeace avatar Nov 14 '25 11:11 jappeace

I've read the proposal document and it seems that we have exactComments :: Map Position Text, which stores the absolute positions of comments.

Could you explain how this works if you modify the cabal file in some way, how are the Position fields updated to new coordinates (if you add a new section for example).

mpickering avatar Nov 24 '25 13:11 mpickering

It looks like there’s now agreement that some coordinated work on the exact printer would be valuable, and given that, it would be good to formalise the direction in a cabal-proposal.

As far as I can tell, the TWG proposal on this topic pre-dates the cabal-proposals process coming into effect, so we have a bit of unfortunate timeline overlap. That said, I think we should still make sure we follow Cabal’s own decision-making process so that the history is clear and downstream tooling authors know exactly what has been agreed.

One way forward would be to copy the substance of the TWG proposal into a cabal-proposals entry, update it to reflect where the discussion has already converged, and then run it through the usual review period. That would let us "regularise" the decision from Cabal’s side without going over again the technical details that the TWG already worked through.

Perhaps @jmct could comment on how the TWG’s output is expected to fit into Cabal’s proposal workflow, and whether the above interpretation matches the intended relationship between the two processes?

Any thoughts on this @ulysses4ever @geekosaur @Mikolaj ?

mpickering avatar Nov 25 '25 10:11 mpickering

Could you explain how this works if you modify the cabal file in some way, how are the Position fields updated to new coordinates (if you add a new section for example).

The positions aren't updated when modifying a GenericPackageDescription. During pretty printing, we create new positions for the newly added sections/fields, and offset the rest of the document; for removals, we simply leave the place empty. The users of the exact print API will also be able to change the position if they choose to.

leana8959 avatar Nov 25 '25 12:11 leana8959

I think GenericPackageDescription is way too far in the pipeline to be able to recover the original contents of the file. The proposal explains how to approach comments and common stanzas, but there are other components which have undergone deep transformations by the time we arrive to GPD. For the example, the following two Cabal files have exactly the same GPD:

cabal-version:   3.0
name:            foo
version:         0.1
license:         BSD-3-Clause
license-file:    LICENSE
build-type:      Simple

executable foo
    main-is:          Main.hs
    hs-source-dirs:   app
    build-depends:
        base <5, 
        containers,
cabal-version:   3.0
name:            foo
version:         0.1
license:         BSD-3-Clause
license-file:    LICENSE
build-type:      Simple

executable foo
    main-is:          Main.hs
    hs-source-dirs:   app
    build-depends:
        , base< 5
        , containers >=0

What's the game plan to capture all the details? Which fields are to track leading / trailing commas, implicit >=0 present in every dependency, spaces inside version specification, etc.?

Bodigrim avatar Nov 25 '25 23:11 Bodigrim

I'm not averse to following the proposals process, but would like to note that this has been an open issue for some time looking for someone to take it on: we've already agreed on it internally. Working out details, yes, but I don't think that falls under proposals. It's generally what we use issues for.

geekosaur avatar Nov 26 '25 02:11 geekosaur

@Bodigrim We don't have that in our prototype, but it should be feasible by adding a map to store the position of the comma too like we do with the comments. This comma position map would need to be populated after readFields and before parseFieldGrammar.

leana8959 avatar Nov 26 '25 08:11 leana8959

Getting this through the cabal proposal process would have the advantage of (re-)spelling out the design and getting more people acquainted with it, which would make reviewing the PRs much easier. Also, after the few exploratory PRs, maybe the exact print project is at the development phase, where the design needs to be revisited, clarified, modified, etc.? In agile projects of certain complexity, such a revisit is expected and even desired IMHO. We'd gladly help.

Mikolaj avatar Nov 26 '25 15:11 Mikolaj

Above PR's make Cabal ready to do exact printing. They were also the issues I deemed most risky, so we did them first. They should be reviewable/mergable without the exact print proper, and give benefits, although the printer depends on them.

Getting this through the cabal proposal process would have the advantage of (re-)spelling out the design and getting more people acquainted with it, which would make reviewing the PRs much easier.

I think we can do this. It shouldn't be too hard to transcribe it to the cabal proposal format and add the recent findings. Correct me if I'm wrong, but we're talking about this: https://github.com/haskell/cabal-proposals

Also, after the few exploratory PRs, maybe the exact print project is at the development phase, where the design needs to be revisited, clarified, modified, etc.?

Looks pretty good up till now, but y'all see it in the proposal I'll make in a bit.

jappeace avatar Nov 26 '25 16:11 jappeace

I agree with @geekosaur (https://github.com/haskell/cabal/issues/11227#issuecomment-3578514201) on this one, i.e. my view is that proposals are more high-level than what we want to get to here. E.g. a proposal should convince people that exact-printer is desired, and the existing HF/TWG proposal does that well I think. But take, for instance, Matt's concern on the comments PR that it's not clear at all how the concrete technical solution in that PR can serve the intended purpose (programmatical modification of cabal files).

As with the two other proposals that predate cabal-proposals process, I'd suggests to copy the TWG proposal to cabal-proposals for posterity, but then move on to discussing technical solutions in issues (preferably) or PRs (that maybe wasteful for the authors because they may learn that their implementation doesn't fit some constraints that we have in mind).

If people feel that it is possible to address issues like "how this concrete solution will work for the intended purpose" as part of the proposal process, I'm happy go that way. I'm just worried that we may push more bureaucracy on people than what's necessary, and they may get discouraged before getting to actual coding. On the other hand, the proposal process is new, and maybe I'm wrong in how I see its purpose. I'm happy to be convinced otherwise.

ulysses4ever avatar Nov 26 '25 16:11 ulysses4ever

Getting this through the cabal proposal process would have the advantage of (re-)spelling out the design and getting more people acquainted with it, which would make reviewing the PRs much easier.

I want to address this specifically: I found the TWG proposal completely useless for evaluating the comments PR. Maybe it's my fault, maybe it's proposal's fault, maybe it's noone's fault and proposals, as I say above, are just at a completely different level of abstraction...

ulysses4ever avatar Nov 26 '25 16:11 ulysses4ever

FWIW @ulysses4ever's observation about the TWG proposal WRT reviewing (which he'd also made on Matrix) is part of why I think a proposal is the wrong place for discussion of implementation. As I said, we generally do that in issues, and then the implementation itself is in PRs (of course); that's already being followed for this, so I don't think there's much if any difference from how we normally do things.

geekosaur avatar Nov 26 '25 17:11 geekosaur

Okay then, to summarise:

  • we don't really need a cabal proposal because most people are convinced this is useful
  • we do need a place to discuss the high-level design (which is evolving as people learn more)

We can use this issue for high-level design questions like how does exactComments :: Map Position Text deal with new fields being added. which, for the record, Leana did answer. Does this work for everyone?

jappeace avatar Nov 26 '25 17:11 jappeace

@Bodigrim We don't have that in our prototype, but it should be feasible by adding a map to store the position of the comma too like we do with the comments. This comma position map would need to be populated after readFields and before parseFieldGrammar.

Could you possibly provide more details on this? Which changes and additions to GenericPackageDescription are you planning to be able to represent faithfully examples from above? It would be nice to understand where all the information about spaces and commas is going to be stored.

While exploring possibilities to preserve comments and stanzas is important, I cannot find anywhere a design for more mundane syntactic variability as in the syntax for build-depends and exposed-modules. GHC goes to great lengths in GHC.Parser.Annotation - are we looking to implement something similar?

Bodigrim avatar Nov 26 '25 19:11 Bodigrim

There are many syntactic elements that aren't a part of abstract syntax (often called "trivia"):

  • spaces of all sorts
  • comments of various sorts
  • punctuation (commas, colons, etc.)
  • grouping: parentheses (possibly brackets, curleys)

There are general designs addressing all these. It'd be strange to merge a solution that deals with comments but not with spaces, I feel. I'm not terribly familiar with GHC Annotations, but when I hear about it, I usually see people struggling, so maybe it's not the best example to follow?.. OTOH, other languages' ecosystems dealt with what we call "exact-printing" (and often called "lossless parsing", which I like more) and seemingly succeeded. I'm talking about rust-analyzer's rowan parsing library and its source of inspiration Swift's lib/Syntax (look at the diagram in this example in particular). Has anyone who has thought about dealing with the issue tried to evaluate these or other examples besides GHC Annotations?

ulysses4ever avatar Nov 26 '25 20:11 ulysses4ever

Part of the problem with GHC annotations is that GHC likes to arbitrarily change how it parses things. The realignment of parsing prefixes in (IIRC) 9.0 was intended to clean that up substantially, but subsequent changes (including OverloadedRecordDot) complicated the story again.

geekosaur avatar Nov 26 '25 20:11 geekosaur

It'd be strange to merge a solution that deals with comments but not with spaces, I feel.

Not necessarily: there is often a difference in representation between free-standing elements (like block comments) and attached ones (like, number of spaces preceding a given entity), so it is likely to be reasonable to merge them separately. But it would be nice to have a grand design first.

I'm talking about rust-analyzer's rowan parsing library and its source of inspiration Swift's lib/Syntax (look at the diagram in this example in particular).

I think Cabal syntax is closer to YAML and such (because of common stanzas aka anchors and overall "nested dictionaries" vibe), is there any prior art there?

(But I feel that we are talking over the authors of the proposal and probably generating overwhelming amount of noise. Being in their position before I think it's better we give them ample thinking space)

Bodigrim avatar Nov 26 '25 22:11 Bodigrim

Re: Yaml: sure enough, rowan, the lossless parsing library powering rust-analyzer, was applied to build a lossless YAML parser: https://crates.io/crates/yaml-edit the core principles don't depend on the input language.

ulysses4ever avatar Nov 27 '25 01:11 ulysses4ever

I'm talking about rust-analyzer's rowan parsing library and its source of inspiration Swift's lib/Syntax (look at the diagram in this example in particular). Has anyone who has thought about dealing with the issue tried to evaluate these or other examples besides GHC Annotations?

It seems to me to be roughly the same approach as GHC Exact Print Annotations, with entities capturing trivia around them and terminal tokens for various syntactic elements.

People usually struggle with EPA simply because Haskell syntax is very large, not because of an inherent deficiency of design.

Bodigrim avatar Nov 27 '25 02:11 Bodigrim

I think layout also complicates things: the AST for source using layout and source using explicit braces and semicolons should be the same, but the EPA will be totally different.

geekosaur avatar Nov 27 '25 02:11 geekosaur

I would request that the proposal does make it to cabal-proposals, so we don't set the precedent that TWG decisions are honoured without going through the cabal-proposals process. Even if it is the same text copied into the proposals repo, it is good to note the motivation and thinking at the time a proposal was accepted.

I have done some reading around and it seems that the ruamel.yaml is one of the more mature libraries for modifying yaml.

  • The parser records additional information about whitespace, comments etc.
  • The user edits the result of the parser.
  • When printing, if a node is unchanged, dump the same as was there before.
  • If a node has changed, print it out fresh but with the same "style" as was there before.

I suppose one of the key questions about the whole approach as well is whether we do want to be modifying GenericPackageDescription in order to perform these kinds of edits. As I understand it, the point Andrew makes about cabal-add and I think Oleg makes with his cabal-fields project is

  • It works using the Field data type, and performs untyped manipulation.
  • This is more robust, since you don't have to link against a specific Cabal version, and you are isolated from changes in the package description file.
  • It is a pragmatic choice and works most of the time.
  • It is much easier to exact print, since the format is much simpler.

It does seem like you could make this work for GenericPackageDescription as well, since that is a typed version of Field.

I think I would benefit from having a simple and brief roadmap which explains what the proposed design looks like, and how that will handle the desired outcomes. It seems quite an overwhelming topic at the moment with lots of historical attempts. Let's try and get this one over the line.

mpickering avatar Nov 28 '25 18:11 mpickering

This still feels wrong to me. Do we also have to moot every other proposal accepted before the proposals process existed?

geekosaur avatar Nov 28 '25 18:11 geekosaur

This proposal isn't "every other proposal" because it address a long-standing problem that isn't resolved still; its authors are still around and eager to see it happen, that's also something we don't have in many other cases. Generally, when I don't have a strong opinion, I always try to stay positive and support suggestions coming from anywhere, especially fellow Cabal maintainers. Therefore, I'm inclined to follow Matt's suggestion here.

ulysses4ever avatar Nov 28 '25 20:11 ulysses4ever

@ulysses4ever

It'd be strange to merge a solution that deals with comments but not with spaces, I feel.

The original plan was to use the position information to reconstruct leading spaces. This is indeed imperfect because we don't distinguish tabs and space, nor do we handle trailing trivia.

We should support tabs because representing faithfully the source code should be the goal of the exact printer. There are a lot of tabs used in hackage (7701 lines matched \t pattern), which is another reason that we should make this work.

I think it can be done outside of (and after) the comment parser PR, we just need to change the annotation type returned by readFields.

leana8959 avatar Dec 01 '25 13:12 leana8959

@Bodigrim

Could you possibly provide more details on this? Which changes and additions to GenericPackageDescription are you planning to be able to represent faithfully https://github.com/haskell/cabal/issues/11227#issuecomment-3578004369? It would be nice to understand where all the information about spaces and commas is going to be stored.

We can add the a map called commaMap to AnnotatedGenericPackageDescription. NameSpace is used to track the path down the rose tree of Field ann. Using this design we can look up the commas of a given field, and repair the exact print output if we have the position of the comma (that is, the field is not added programatically later on).

AnnotatedGenericPackageDescription = AnnotatedGenericPackageDescription
  { exactComments :: ExactComments Position
  , exactCommas :: Map NameSpace [Position]
  , unannotatedGpd :: GenericPackageDescription
  }

data NameSpace = NameSpace
  { nameSpaceName :: String
  , nameSpaceArgs :: [String]
  }
Example 1
cabal-version:   3.0
name:            foo
version:         0.1
license:         BSD-3-Clause
license-file:    LICENSE
build-type:      Simple

executable foo
    main-is:          Main.hs
    hs-source-dirs:   app
    build-depends:
        base <5,
        containers,
exactCommas = M.fromList
  [ ( [ NameSpace "executable" ["foo"]
      , NameSpace "build-depends"
      ]
    , [ (Position {-row-} 12 {-col-} 16)
      , (Position {-row-} 13 {-col-} 19)
      ]
    )
  ]
Example 2
cabal-version:   3.0
name:            foo
version:         0.1
license:         BSD-3-Clause
license-file:    LICENSE
build-type:      Simple

executable foo
    main-is:          Main.hs
    hs-source-dirs:   app
    build-depends:
        , base< 5
        , containers >=0
exactCommas = M.fromList
  [ ( [ NameSpace "executable" ["foo"]
      , NameSpace "build-depends"
      ]
    , [ (Position {-row-} 12 {-col-} 9)
      , (Position {-row-} 13 {-col-} 9)
      ]
    )
  ]

During exact printing, for a given comma separated field, we can then fix up the output.

While exploring possibilities to preserve comments and stanzas is important, I cannot find anywhere a design for more mundane syntactic variability as in the syntax for build-depends and exposed-modules. GHC goes to great lengths in GHC.Parser.Annotation - are we looking to implement something similar?

I think there's a trade-off off here, specifically whether we want a more "flat" or "structured" annotation. The more structure we remove from the concrete syntax data (by isolating them to their own fields indexed by the path in the rose tree), the less direct the exact printer will become as we need to reconstruct the original trivia by looking them up. On the other hand having more structure we might make downstream code not work out of the box with the new representation because they would need to "unannotate" everything before they use it. We can also provide a compatibility shim to annotate/unannotate.

Furthermore, contrary to GHC (and likely rowan), we have two parsers to deal with: the field parser and the field grammar parser. Carrying the annotation around in a structured way means I would have to change both the field parser and the field grammar parser, and this might make the task more complex. Furthermore, the field parser can't see the commas in build-depends as it's handled by the field grammar parser, we would need to change the FieldGrammar type class so it knows how to decorate concrete syntax info like commas into its output.

Originally Jappie's design is to store all concrete syntax data like common stanza or comments in their own fields. If we want to preserve another kind of concrete syntax we just add another field. This is not intrusive at all and helps with backward compatibility (we don't change the shape of the AST and just add fields).

What do you think about this tradeoff?

leana8959 avatar Dec 01 '25 13:12 leana8959

During exact printing, for a given comma separated field, we can then fix up the output.

In this example you have two elements in the field and two commas. How do you match field elements and commas? You must store somewhere whether it was leading commas or trailing commas. And potentially you could have one comma or three commas as well, which makes restoring the original layout even more challenging unless you store more information about concrete syntax.

The original plan was to use the position information to reconstruct leading spaces.

How would you reconstruct spaces inside fields? It could be base<5, base < 5, base>=4 && <5 and so on. If all the spaces go into another map, I struggle to imagine how one can match them back.

Bodigrim avatar Dec 01 '25 19:12 Bodigrim

@Bodigrim @mpickering

You are correct, Jappie's prototype doesn't have the granularity to handle trivia from the field grammar parser. I'll spend two working weeks to investigate and adapt the design to come up a new prototype that takes these features into account.

leana8959 avatar Dec 06 '25 13:12 leana8959