
Option to change whitespace in token parsing.

Open Chobbes opened this issue 8 years ago • 6 comments

Currently there is no way to alter what the token parser considers to be whitespace. This is a problem if one wishes to parse certain indentation-sensitive languages, for instance with the following package:

https://hackage.haskell.org/package/indents

If newlines are consumed by the lexeme parsers in Text.Parsec.Token, then it is difficult to work with languages that depend upon indentation and newlines, such as Python 3.

https://docs.python.org/3/reference/grammar.html

As an example, an if statement in Python requires either semicolon-separated statements on the same line after the conditional, e.g.:

if blah: stmt1; stmt2

Or the statements must start on a new line and be indented further. Not both. This makes it difficult to use Text.Parsec.Token.makeTokenParser in its current form: it decides what is whitespace with Data.Char.isSpace, so its lexeme parsers silently consume every newline.
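One possible workaround, sketched here rather than taken from parsec itself, is to skip makeTokenParser and wrap tokens in a hand-written lexeme combinator that only skips spaces and tabs. The names hspace, lexeme, identifier and line are hypothetical helpers, not part of Text.Parsec.Token; only the Text.Parsec combinators are real library functions.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Hypothetical helper: skip spaces and tabs only, never newlines.
hspace :: Parser ()
hspace = skipMany (oneOf " \t")

-- Hypothetical lexeme wrapper that uses hspace instead of the
-- isSpace-based whiteSpace from Text.Parsec.Token.
lexeme :: Parser a -> Parser a
lexeme p = p <* hspace

-- A toy identifier token: one or more letters, then horizontal space.
identifier :: Parser String
identifier = lexeme (many1 letter)

-- One line of identifiers, terminated by an explicit newline.
line :: Parser [String]
line = many1 identifier <* newline

main :: IO ()
main = print (parse (many1 line) "" "a b\nc d\n")
-- prints: Right [["a","b"],["c","d"]]
```

Because the newline is never swallowed by a token, the grammar can still see where each line ends, which is what an indentation-aware parser needs.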

Chobbes avatar Aug 21 '15 01:08 Chobbes

Seems I managed to miss these. I'm not the only one who wants this ;). I'm in favor of changing the LanguageDef, but I was worried about backwards compatibility.

https://github.com/aslatter/parsec/pull/41 https://github.com/aslatter/parsec/issues/24

Chobbes avatar Aug 21 '15 03:08 Chobbes

@Chobbes, take a look at @minad's solution from #41; this is one option. But as he points out, this is not enough to elegantly parse languages where indentation matters. This issue is one of our goals in Megaparsec.

I propose you close the PR because, with respect, #41 provides a solution that is not ideal but much better.

mrkkrp avatar Aug 21 '15 08:08 mrkkrp

@mrkkrp https://github.com/aslatter/parsec/pull/41 is a better solution, but not if you wish to maintain backwards compatibility. It will break any existing token parsers that manually construct a LanguageDef. I'm fine with the change in #41 (and would prefer it), but I intentionally avoided it because I thought this option was less destructive (and as a result more likely to be merged). I will leave this for the maintainer to decide.
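To illustrate the compatibility concern in general terms, here is a minimal sketch using a made-up record. LangDef, isWhite, defaultDef and pythonish are hypothetical names and do not reflect parsec's actual LanguageDef or the exact change in #41; the point is only how adding a record field affects existing code.

```haskell
import Data.Char (isSpace)

-- Hypothetical stand-in for a language definition record.
data LangDef = LangDef
  { commentLine :: String
  , isWhite     :: Char -> Bool  -- imagine this field arrived in a later version
  }

-- Library-provided default that fills in every field.
defaultDef :: LangDef
defaultDef = LangDef { commentLine = "--", isWhite = isSpace }

-- Record update from the default keeps compiling when fields are added.
pythonish :: LangDef
pythonish = defaultDef { commentLine = "#", isWhite = (`elem` " \t") }

-- By contrast, a definition written before isWhite existed, such as
--   LangDef { commentLine = "#" }
-- leaves the new field uninitialised (GHC warns about missing fields),
-- and a positional `LangDef "#"` no longer type-checks as a LangDef.

main :: IO ()
main = print (map (isWhite pythonish) " \t\n", commentLine pythonish)
-- prints: ([True,True,False],"#")
```

Code that starts from a default and overrides fields is unaffected, while code that builds the record from scratch is the part that would need changes.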

"But as he points out this is not enough to elegantly parse languages where indentation matters."

@mrkkrp #41 makes no mention of this aside from references to Layout.hs and the IndentParser package. Layout.hs is nearly identical to what the indents package does:

https://hackage.haskell.org/package/indents

and will suffer from the same issues with respect to the Python example I mentioned. IndentParser provides its own token library, and appears to have the best alternative thus far.

Megaparsec does not appear to solve this problem currently, and at the moment it no longer seems to handle comments the way the Token module does in Parsec?

https://github.com/mrkkrp/megaparsec/commit/3661da90e52a8b15f05b033be1da0f47d08acce8#diff-60a69ee7900f16e4c15e0edaf4cce9a3R548

Is this a long term goal of megaparsec, or does it provide a solution now?

Chobbes avatar Aug 21 '15 17:08 Chobbes

@Chobbes, Megaparsec is less than one month old and it's not finished yet, although it already does many things much better than Parsec; see CHANGELOG.md and the closed issues (some of them mirror issues of Parsec) for details. Currently only the modules Text.Megaparsec.Expr and Text.Megaparsec.Token are unfinished, and they are not supposed to work at this moment as they are awaiting their turn.

Anyway, good luck with this PR. I would be glad if Parsec finally moved forward and extended its functionality in any way. But as I see it, Parsec is really old and its development is dead. It lacks tests; I've studied all its issues, its changelog, etc. The goal of Parsec now is to preserve its functionality without breaking anything. So if I were the maintainer of Parsec, I would first fix its (well-reproducible and well-known) bugs, see the issue tracker, rather than add new functionality. To move forward now, it needs a really passionate maintainer who would, for starters, write a complete test suite for it. I hope that when we finish our test suite for Megaparsec, Parsec can adopt a variant of it to test its own code (although they will need to adapt it manually; I'm not interested in doing that for Parsec without any guarantee that my work won't just be ignored by this sleepy project).

If you need to do something about this issue quickly, do it in your own project (possibly in an ugly way) and you will be fine.

mrkkrp avatar Aug 21 '15 18:08 mrkkrp

@Chobbes, I've started work on this (and related) issues in Megaparsec. See the branch new-lexer. Here is my comment in the dedicated issue thread: https://github.com/mrkkrp/megaparsec/issues/5.

You could try out our lexer and give your feedback. It's possible that we will release Megaparsec sooner than this PR is merged, by the way.

mrkkrp avatar Sep 02 '15 14:09 mrkkrp

@Chobbes, Megaparsec 4.0.0 is ready and it provides a solution now. It will be tagged and released tomorrow, I think. You can clone the repo and try it.

mrkkrp avatar Sep 22 '15 10:09 mrkkrp