parsec
parsec copied to clipboard
integer dosen't fail for leading white spaces
integer doesn't behave properly for input with leading white spaces.
import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)
lexer =
P.makeTokenParser emptyDef
integer = P.integer lexer
natural = P.natural lexer
integer' :: Parser Integer
integer' = do
f <- option id sign'
n <- natural
return $ f n
sign' :: Parser (Integer -> Integer)
sign' = (char '-' >> return negate)
<|> (char '+' >> return id)
main = do
print . (parse natural "input") $ "1"
print . (parse integer "input") $ "1"
print . (parse integer' "input") $ "1"
print . (parse natural "input") $ " 1"
print . (parse integer "input") $ " 1"
print . (parse integer' "input") $ " 1"
print . (parse natural "input") $ " 1"
print . (parse integer "input") $ " +1"
print . (parse integer' "input") $ " +1"
The prime version (integer'
) is provided to illustrate the expected behavior. The output of above program is:
Right 1
Right 1
Right 1
Left "input" (line 1, column 1):
unexpected " "
expecting natural
Right 1
Left "input" (line 1, column 1):
unexpected " "
expecting "-", "+" or natural
Left "input" (line 1, column 1):
unexpected " "
expecting natural
Left "input" (line 1, column 2):
unexpected "+"
expecting digit
Left "input" (line 1, column 1):
unexpected " "
expecting "-", "+" or natural
The integer
parser doesn't fail for <space>1
, while it should, and it gives misleading error message for <space>+1
. On the other hand, the error message for integer'
is more understandable.
I don't have a lot of experience with the TokenParser
module, but I believe the intent was that the different lexeme parsers would work together in concert and jointly handle whitespace among them - the original guide gives some useful examples and background:
https://web.archive.org/web/20140529211116/http://legacy.cs.uu.nl/daan/download/parsec/parsec.html#Lexical%20analysis
Am I posting this in the wrong place? I thought this issue tracker is for parsec.
Thanks for the link; I am not aware of it. The lexeme part is identical, AFAIC.
According to both of them, lexeme parsers could take care of all trailing spaces, but not leading spaces.
The only point where the whiteSpace parser should be called explicitly is the start of the main parser in order to skip any leading white space.
Therefore, in the original example I provided, natural
, integer
, and integer'
, which are the main parsers, for they are used for parse input directly, should fail for input with leading spaces, but the actual behavior is not, as shown in above snippet.
Hope I made myself clear now.
@albertnetymk, I would consider it a bug. All lexemes are expected to consume trailing, but not leading whitespace. In this case, in some circumstances integer
won't fail when it should, thus some alternative branch of parsing logic won't be used, for example. While this is not a severe bug, it's still a bug.
This happens because sign is parsed in int
parser (in Text.Parsec.Token
) as a lexeme and sign can be missing, in this case we consume leading whitespace:
integer = lexeme int <?> "integer"
int = do{ f <- lexeme sign
; n <- nat
; return (f n)
}
sign = (char '-' >> return negate)
<|> (char '+' >> return id)
<|> return id
This is easy to fix, though. I can fix it when I will be fixing #35, what do you think, @aslatter? Is it desirable? If you want to keep present behavior, you should document it in description of integer
parser.
@mrkkrp It's good that we reach consensus on this. However, I think it's best that the fix for this goes into a separate PR, for this issue is different from #35, I think.
@albertnetymk, yes, of course. I meant that I would do it at the same time, but not in the same PR.
To make sure I'm understanding this - the bug is that int
consumes leading whitespace because of the lexeme
combinator wrapping the sign
parser, which is internally structure to optionally parse a sign. The the sign
combinator consumes nothing we still consume the trailing spaces, which leads to wrong errors (or wrong behavior) when int
is used as part of some other branching structure.
It seems to me that the whole point of the structure of the TokeParser
module is to defer white-space consuming as late as possible, so fixing this would be good.
@aslatter, great. Should we add new test for this bug or it's not necessary? What do you think?
@aslatter I agree with your understanding above. @mrkkrp I think it makes sense to add one new test. My original test snippet might be of any use.
I think the best way to fix this is by rewriting of sign
parser so it consumes only +
or -
, but fails when they are not present. This sounds very intuitive to me. Next, we can just rewrite parsing of sign in int
as f <- option id (lexeme sign)
. Of course we should make sure that new version of sign
is used properly in exponent'
too (to allow valid inputs like 0.5e3
as well as 0.5e+3
, etc).
Exactly. That's my thought as well. See the top snippet in this ticket.
@albertnetymk, oh indeed. I just paid more attention to tests that demonstrate the flaw, I should have noticed your solution. Anyway I'm done with it, but I'm waiting decision on my current PR, so I won't push it until this PR is merged. I could create a separate branch, though, if necessary.
I've added test that checks against these inputs:
shouldFail :: [String]
shouldFail = [" 1", " +1", " -1"]
shouldSucceed :: [String]
shouldSucceed = ["1", "+1", "-1", "+ 1 ", "- 1 ", "1 "]
this should be ample.