parsec integer dosen't fail for leading white spaces

integer doesn't behave properly for input with leading white spaces.

import Text.Parsec
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)

lexer =
  P.makeTokenParser emptyDef

integer    = P.integer lexer
natural    = P.natural lexer

integer' :: Parser Integer
integer' = do
  f <- option id sign'
  n <- natural
  return $ f n

sign' :: Parser (Integer -> Integer)
sign' = (char '-' >> return negate)
     <|> (char '+' >> return id)

main = do
  print . (parse natural "input") $ "1"
  print . (parse integer "input") $ "1"
  print . (parse integer' "input") $ "1"

  print . (parse natural "input") $ " 1"
  print . (parse integer "input") $ " 1"
  print . (parse integer' "input") $ " 1"

  print . (parse natural "input") $ " 1"
  print . (parse integer "input") $ " +1"
  print . (parse integer' "input") $ " +1"

The prime version (integer') is provided to illustrate the expected behavior. The output of above program is:

Right 1
Right 1
Right 1
Left "input" (line 1, column 1):
unexpected " "
expecting natural
Right 1
Left "input" (line 1, column 1):
unexpected " "
expecting "-", "+" or natural
Left "input" (line 1, column 1):
unexpected " "
expecting natural
Left "input" (line 1, column 2):
unexpected "+"
expecting digit
Left "input" (line 1, column 1):
unexpected " "
expecting "-", "+" or natural

The integer parser doesn't fail for <space>1, while it should, and it gives misleading error message for <space>+1. On the other hand, the error message for integer' is more understandable.

Jul 02 '15 23:07 albertnetymk

I don't have a lot of experience with the TokenParser module, but I believe the intent was that the different lexeme parsers would work together in concert and jointly handle whitespace among them - the original guide gives some useful examples and background:

https://web.archive.org/web/20140529211116/http://legacy.cs.uu.nl/daan/download/parsec/parsec.html#Lexical%20analysis

Jul 02 '15 23:07 aslatter

Am I posting this in the wrong place? I thought this issue tracker is for parsec.

Thanks for the link; I am not aware of it. The lexeme part is identical, AFAIC.

According to both of them, lexeme parsers could take care of all trailing spaces, but not leading spaces.

The only point where the whiteSpace parser should be called explicitly is the start of the main parser in order to skip any leading white space.

Therefore, in the original example I provided, natural, integer, and integer', which are the main parsers, for they are used for parse input directly, should fail for input with leading spaces, but the actual behavior is not, as shown in above snippet.

Hope I made myself clear now.

Jul 03 '15 10:07 albertnetymk

@albertnetymk, I would consider it a bug. All lexemes are expected to consume trailing, but not leading whitespace. In this case, in some circumstances integer won't fail when it should, thus some alternative branch of parsing logic won't be used, for example. While this is not a severe bug, it's still a bug.

This happens because sign is parsed in int parser (in Text.Parsec.Token) as a lexeme and sign can be missing, in this case we consume leading whitespace:

    integer         = lexeme int      <?> "integer"

    int             = do{ f <- lexeme sign
                        ; n <- nat
                        ; return (f n)
                        }

    sign            =   (char '-' >> return negate)
                    <|> (char '+' >> return id)
                    <|> return id

This is easy to fix, though. I can fix it when I will be fixing #35, what do you think, @aslatter? Is it desirable? If you want to keep present behavior, you should document it in description of integer parser.

Jul 05 '15 17:07 mrkkrp

@mrkkrp It's good that we reach consensus on this. However, I think it's best that the fix for this goes into a separate PR, for this issue is different from #35, I think.

Jul 05 '15 17:07 albertnetymk

@albertnetymk, yes, of course. I meant that I would do it at the same time, but not in the same PR.

Jul 05 '15 17:07 mrkkrp

To make sure I'm understanding this - the bug is that int consumes leading whitespace because of the lexeme combinator wrapping the sign parser, which is internally structure to optionally parse a sign. The the sign combinator consumes nothing we still consume the trailing spaces, which leads to wrong errors (or wrong behavior) when int is used as part of some other branching structure.

It seems to me that the whole point of the structure of the TokeParser module is to defer white-space consuming as late as possible, so fixing this would be good.

Jul 05 '15 17:07 aslatter

@aslatter, great. Should we add new test for this bug or it's not necessary? What do you think?

Jul 05 '15 17:07 mrkkrp

@aslatter I agree with your understanding above. @mrkkrp I think it makes sense to add one new test. My original test snippet might be of any use.

Jul 05 '15 17:07 albertnetymk

I think the best way to fix this is by rewriting of sign parser so it consumes only + or -, but fails when they are not present. This sounds very intuitive to me. Next, we can just rewrite parsing of sign in int as f <- option id (lexeme sign). Of course we should make sure that new version of sign is used properly in exponent' too (to allow valid inputs like 0.5e3 as well as 0.5e+3, etc).

Jul 05 '15 19:07 mrkkrp

Exactly. That's my thought as well. See the top snippet in this ticket.

Jul 05 '15 22:07 albertnetymk

@albertnetymk, oh indeed. I just paid more attention to tests that demonstrate the flaw, I should have noticed your solution. Anyway I'm done with it, but I'm waiting decision on my current PR, so I won't push it until this PR is merged. I could create a separate branch, though, if necessary.

I've added test that checks against these inputs:

shouldFail :: [String]
shouldFail = [" 1", " +1", " -1"]

shouldSucceed :: [String]
shouldSucceed = ["1", "+1", "-1", "+ 1 ", "- 1 ", "1 "]

this should be ample.

Jul 06 '15 04:07 mrkkrp

parsec parsec copied to clipboard

integer dosen't fail for leading white spaces

parsec
parsec copied to clipboard