parser
parser copied to clipboard
int parser fails on trailing decimal separator
Hi,
I have found that the int
parser fails on the input "42."
. I would have expected that int
parses the "42"
, and leaves "."
to be consumed later on.
Here is an Ellie demonstrating the problem: https://ellie-app.com/3gJS4VwRJrfa1
If this is not a bug, I would propose to add documentation describing that behavior and how to work around it.
I suspect this is not a bug, but it's definitely a legitimate concern. IP address parsing provides a concrete example of how the int
parser can seem like the right tool, but isn't.
Potential solution
Note: Someone more skilled than I may know how to change Parser (Maybe Int)
to Parser Int
. I couldn't figure out how to convert the Nothing
side of String.toInt into a parser error.
digitChain : Parser (Maybe Int)
digitChain =
getChompedString (chompWhile Char.isDigit)
|> Parser.map String.toInt
Demo: https://ellie-app.com/4hJzJjnPQ3Ma1 Chomp approach, thanks to dmy.
Of course, even this isn't general purpose. As hinted at in the docs, you may still need to make choices about how to handle -
. The answer is likely to be different if you're parsing, say, "123. -429. 3482."
vs "123-429-3482."
.
Some thoughts on why this isn't a bug
Here's the underlying implementation of int
:
From Advanced.elm
int : x -> x -> Parser c x Int
int expecting invalid =
number
{ int = Ok identity
, hex = Err invalid
, octal = Err invalid
, binary = Err invalid
, float = Err invalid
, invalid = invalid
, expecting = expecting
}
Here we see that the int
parser, like hex
et al, is actually a disambiguating layer built on top of the number
parser, which is explicitly designed to handle number processing without backtracking.
I'm pretty sure this means that the digits-followed-by-a-dot-that's-not-a-decimal problem can't be solved without a fundamental change in both architecture and behavior for all of the parsers in the number family. I'm also pretty sure that such a shift would go against the library's core goals of simplicity and speed.
@clozach
In my opinion this is clearly a bug, caused by incorrect implementation/architecture which tries to reuse number
parser when it is clearly not correct in this case. Fundamental rewrite seems like the only correct option here.
Also you are completely wrong about current parser following core goals - actually it's exact opposite as it does NOT follow first two of the three core goals:
- "Make writing parsers as simple and fun as possible" - strange special cases like not allowing period after int is not simple nor fun
- "Produce excellent error messages." - error message is awful
Hey @malaire. Sounds like this is causing you some frustration. I'm tempted to mount an argument that I'm not completely wrong, but I'm too familiar with Evan's thoughts on building community to expect any joy from that direction. But if you'll concede that I'm only partially wrong, well…I'd be surprised if that were not the case! 😉
If you're currently stuck trying to resolve something that int
can't help you with, would you be up for sharing it on elm-discourse or Slack?
Or, if you're primarily interested in proving a point (more power to you), how about implementing your own number parsing functions and posting them as a package so the whole community can benefit?
If neither of those appeal, how about writing up an experience report? I'd love to learn more about what it is you were trying to do, and what you experienced when int
didn't behave as you expected.
As one of the individual bitten by this (I was the Discourse thread author whom @malaire and others assisted), allow me to mount an argument for why this is an unexpected behaviour.
Expectation based on Documentation
In the Parser library, there is the end
function. The documentation here states:
Parsers can succeed without parsing the whole string. Ending your parser with end guarantees that you have successfully parsed the whole string.
The implication is that a parser ought to succeed greedily by consuming until the last character that would not produce a parse error. If the user expects an explicit end, one can use the end
function to indicate that "nothing should come after this."
This would allow int
to be chained with decimal delimiters for use cases like IP Addresses or even splitting the integer portion from the mantissa:
type alias IntegerAndMantissa =
{ integer: Int
, mantissa: Int
}
splitIntegerAndMantissa : Parser IntegerAndMantissa
splitIntegerAndMantissa =
succeed IntegerAndMantissa
|= int
|. symbol "."
|= int
main = Html.text <| Debug.toString (run splitIntegerAndMantissa "42.56")
Without the ability to use or symbol "."
, it becomes extremely cumbersome to parse int
when all we want is the raw numeral.
If the user did not want the int to be followed by a period, then perhaps:
succeed identity
|= int
|. end
-- succeeds with the input "42" but fails with "42."
Potential Solution at the Library level
Changing the default behaviour of int
would be a very disruptive change, as I'm sure there are plenty of existing users of this library depending on the fact that parsing 42.
throws an error in the current Parser.
My solution eschews changing the current behaviour, and instead introduce a new function:
numeral
The behaviour as described in the expectation above will be mapped onto this new functions, leaving the current int
behaviour intact (thereby not breaking existing parsers for consumers of this library).
The idea is that in cases where we want to just parse the raw number in a greedy fashion, ignoring interpretations of the period for decimal point (which isn't even universal, to begin with!), we can use numeral
, and thus we can chain it with symbol "."
or symbol ","
as the need arises.
The expected behaviour for numeral
would be to return an Int
type, however internally it will only ever be a positive integer, as it will not attempt to parse and interpret signs (-
and +
) as part of the numeral
.
Summary
Concepts such as "int"
and "float"
(as well as positive and negative handling) are great if the user is writing a programming language, and wishes for a standard interpretation of what an int
and float
are (e.g. IEEE-754), but there are many cases where we would like to simply parse the numeral without a regard for its interpretation (and let's be clear, treating .
as a decimal marker is definitely an "interpretation"!).
Introducing a new function numeral
that specifically consumes digit characters in a greedy fashion until it can no longer do so seems to be an obvious way to solve this problem. Leave the -
/+
/.
interpretation as default for int
and float
functions. For every other use case, numeral
can give us the raw digits as an Int
.
Even this numeral
function could have several variations, for example:
- does it accept leading zeroes? What if it does and I don't want to, or vice-versa?
- what should be the accepted range? What should be the behavior when the number is out of range, an error or an overflow?
Therefore even if the current int
behavior may not be fully intuitive, why not write the exact variant you need on a case-per-case basis? For example your numeral
function could be:
import Parser exposing (Parser)
numeral : Parser Int
numeral =
Parser.getChompedString (Parser.chompWhile Char.isDigit)
|> Parser.andThen
(\str ->
case String.toInt str of
Just n ->
Parser.succeed n
Nothing ->
Parser.problem "expected an integer"
)
This one will only accept digits, including leading zeroes, without consuming more characters.
One alternative could be to write a very configurable function in a user package, maybe something like number
, but with even more options (leading zeros, minus sign, exponent support, etc.). Its configuration might be unnecessarily complex for it to be in elm/parser
though. Or if people find it useful enough, it could become one day part of an hypothetical elm-community/parser-extra
package for example, before its inclusion in elm/parser
is considered.
@clozach My use-case is parsing version string like 1.2.3 which I solved by just replacing "." with ":" before parsing.
But note that using number-parser where it should not be used doesn't cause just this issue with period - issues https://github.com/elm/parser/issues/25 and https://github.com/elm/parser/issues/28 are also caused by this incorrect design.
@clozach
Or, if you're primarily interested in proving a point (more power to you), how about implementing your own number parsing functions and posting them as a package so the whole community can benefit?
This library contains operators. Its API can not be replicated by third-party programmers.