better-parse
Does not parse across multiple Lines
I have to apologize in advance: I have no ideas about lexing and parsing.
When I try to build a simple parser and feed it content with multiple lines, the tokenizer fails:
object : Grammar<String>() {
    val singleToken by token(""".+""")
    override val rootParser: Parser<String> by zeroOrMore(singleToken) map { it.joinToString("#") }
}.parseToEnd("fuu \nbar")
com.github.h0tk3y.betterParse.parser.ParseException: Could not parse input: UnparsedRemainder(startsWith=no token matched for "bar" at 4 (1:5)) at com.github.h0tk3y.betterParse.parser.ParserKt.toParsedOrThrow(Parser.kt:66) at com.github.h0tk3y.betterParse.parser.ParserKt.parseToEnd(Parser.kt:26)
Somehow, after the newline, the \G in the wrapping allInOnePattern of the DefaultTokenizer (Tokenizer.kt#L42) does not match anymore.
What am I doing wrong here?
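For readers unfamiliar with \G: it anchors each match at the position where the previous match ended. The following is a minimal sketch using only java.util.regex, not the library's actual tokenizer; scanAnchored is a hypothetical helper, and the wrapping "\G(?:...)" only approximates the pattern described above.

```kotlin
import java.util.regex.Pattern

// Hypothetical helper: repeatedly applies "\G(?:pattern)" and returns the matched
// chunks plus the index where scanning stopped. A stand-in, not better-parse code.
fun scanAnchored(pattern: String, input: String, flags: Int = 0): Pair<List<String>, Int> {
    val matcher = Pattern.compile("\\G(?:$pattern)", flags).matcher(input)
    val chunks = mutableListOf<String>()
    var stoppedAt = 0
    while (matcher.find()) {
        chunks += matcher.group()   // each match must start where the last one ended
        stoppedAt = matcher.end()
    }
    return chunks to stoppedAt
}

fun main() {
    val input = "fuu \nbar"
    println(scanAnchored(".+", input))                  // default flags
    println(scanAnchored(".+", input, Pattern.DOTALL))  // dot also matches line terminators
}
```

With the default flags, scanning should stop at index 4, the position reported in the exception: \G pins the next attempt to the newline, which a plain . cannot consume.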
Your regex does not match every character: it won't match a newline.
better-parse uses standard Kotlin regular expressions, by default with the default flags. You need to use the DOT_MATCHES_ALL option for a . regex to match a newline character.
Alternatively, you should be able to use a regex like (.|\n|\r)+ instead of .+ with the default options.
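The difference between the three variants can be checked with plain Kotlin regexes, independent of better-parse. This is only a sketch; firstMatch is a hypothetical helper introduced for illustration.

```kotlin
// Hypothetical helper: returns the first substring matched by the regex, or null.
fun firstMatch(regex: Regex, input: String): String? = regex.find(input)?.value

fun main() {
    val input = "fuu \nbar"

    // Default flags: . stops at the line terminator.
    println(firstMatch(Regex(".+"), input) == "fuu ")

    // DOT_MATCHES_ALL: . also matches the newline, so the whole input is consumed.
    println(firstMatch(Regex(".+", RegexOption.DOT_MATCHES_ALL), input) == input)

    // Explicit alternation with default flags also consumes the whole input.
    println(firstMatch(Regex("(.|\\n|\\r)+"), input) == input)
}
```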
Thanks @silmeth, but that doesn't seem to be the problem.
I've tried with DOT_MATCHES_ALL:
val singleToken by token(""".+""".toRegex(RegexOption.DOT_MATCHES_ALL))
But at this point it doesn't help, because the DefaultTokenizer just takes all the tokens, extracts their pattern strings, and builds its own regex (Tokenizer.kt#L42). As far as I can tell, it does not retain the regex options.
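The effect of keeping only the pattern string can be reproduced with the Kotlin standard library alone. This sketch assumes the rebuild works roughly as described above; rebuildFromPatternOnly is a hypothetical stand-in, not the tokenizer's code.

```kotlin
// Hypothetical stand-in: rebuilds a regex from the pattern string only,
// so any RegexOption passed to the original is dropped.
fun rebuildFromPatternOnly(original: Regex): Regex = Regex(original.pattern)

fun main() {
    val input = "fuu \nbar"
    val original = Regex(".+", RegexOption.DOT_MATCHES_ALL)
    val rebuilt = rebuildFromPatternOnly(original)

    println(original.pattern)                                // the bare pattern string
    println(RegexOption.DOT_MATCHES_ALL in original.options) // option is stored separately
    println(RegexOption.DOT_MATCHES_ALL in rebuilt.options)  // and is missing after the rebuild
    println(original.matches(input))
    println(rebuilt.matches(input))
}
```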
Consequently, I reimplemented the DefaultTokenizer by modifying the existing one. This is what I came up with:
https://gist.github.com/IARI/91011233658d386f1f1aefd2450537f2#file-mytokenizer-kt-L14
I made sure that it receives the regex options, and used it in my grammar like this:
override val tokenizer: Tokenizer by lazy {
    MyTokenizer(tokens, RegexOption.MULTILINE, RegexOption.DOT_MATCHES_ALL)
}
The result is still the error described originally.
Could this have to do with the behavior of Java's Scanner, which is used by the tokenizer, and with the delimiter set for it?
@IARI, in the 0.3.5 update, I added a transformation of regex options into embedded flags. Before 0.3.5, you could also just add the embedded regex flags to the pattern string yourself.
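As an illustration of the embedded-flag approach: (?s) is the embedded form of DOT_MATCHES_ALL in Java/Kotlin regexes, and it travels with the pattern string. The sketch below uses only the standard library; wrapPatternString is a hypothetical helper that mimics wrapping a token's pattern string in a larger regex, not the library's implementation.

```kotlin
// Hypothetical helper: wraps a pattern string in a group, as a combined
// tokenizer regex might. Only the string is used, so flags must be embedded.
fun wrapPatternString(pattern: String): Regex = Regex("(?:$pattern)")

fun main() {
    val input = "fuu \nbar"

    // Without an embedded flag, . does not cross the newline.
    println(wrapPatternString(".+").matches(input))

    // With (?s) inside the pattern string, the flag survives the wrapping.
    println(wrapPatternString("(?s).+").matches(input))
}
```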
That's nice, but as stated, as far as I understand it doesn't solve the problem. So far, it has only worked without the \G.
Any hints?