better-parse

Does not parse across multiple Lines

Open IARI opened this issue 5 years ago • 5 comments

I have to apologize in advance: I have no idea about lexing and parsing.

When I try to write a simple parser and feed it content with multiple lines, the tokenizer fails:

object : Grammar<String>() {
    val singleToken by token(""".+""")
    override val rootParser: Parser<String> by zeroOrMore(singleToken) map { it.joinToString("#") }
}.parseToEnd("fuu \nbar")

com.github.h0tk3y.betterParse.parser.ParseException: Could not parse input: UnparsedRemainder(startsWith=no token matched for "bar" at 4 (1:5))
    at com.github.h0tk3y.betterParse.parser.ParserKt.toParsedOrThrow(Parser.kt:66)
    at com.github.h0tk3y.betterParse.parser.ParserKt.parseToEnd(Parser.kt:26)

Somehow, after the new line the \G in the wrapping allInOnePattern of the DefaultTokenizer (Tokenizer.kt#L42) does not match anymore. What am I doing wrong here?

IARI, Aug 16 '18 13:08

Your regex does not match every character: it won't match a newline.

better-parse uses standard Kotlin regular expressions, by default with the default flags. For . to match a newline character, you need to use the DOT_MATCHES_ALL option.

Alternatively, you should be able to use a regex like (.|\n|\r)+ instead of .+ with the default options.
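
A minimal sketch of the difference, using only the Kotlin standard library on the JVM (no better-parse involved). The function names are made up for illustration; the input is the one from the original report.

```kotlin
// Input from the report: two lines separated by '\n'.
val reportedInput = "fuu \nbar"

// Default flags: '.' does not match line terminators, so the
// first match stops right before the '\n'.
fun firstMatchWithDefaults(): String? =
    Regex(".+").find(reportedInput)?.value

// DOT_MATCHES_ALL lets '.' match '\n' as well, so the whole input matches.
fun matchesWithDotMatchesAll(): Boolean =
    Regex(".+", RegexOption.DOT_MATCHES_ALL).matches(reportedInput)

// The alternation spells out the line terminators explicitly and
// works with the default flags.
fun matchesWithAlternation(): Boolean =
    Regex("""(.|\n|\r)+""").matches(reportedInput)
```

This only shows how the regex itself behaves in isolation; whether the option survives inside the tokenizer is a separate question.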

silmeth, Aug 17 '18 08:08

Thanks @silmeth, but that doesn't seem to be the problem. I've tried with DOT_MATCHES_ALL:

val singleToken by token(""".+""".toRegex(RegexOption.DOT_MATCHES_ALL))

But at this point it doesn't help, because the DefaultTokenizer just takes all the tokens, extracts their pattern strings, and builds its own regex (Tokenizer.kt#L42). From what I can tell, it does not remember the regex options.
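
The effect described above can be sketched with the standard library alone. The `combineBySource` function below is a hypothetical stand-in for such a combining step, written only from the description in this thread; it is not the library's actual code.

```kotlin
// Hypothetical stand-in for the combining step: it keeps only each
// regex's pattern string, so options passed to Regex(...) are dropped.
fun combineBySource(vararg tokenRegexes: Regex): Regex =
    Regex(
        tokenRegexes.joinToString(separator = "|", prefix = "\\G(?:", postfix = ")") {
            "(${it.pattern})"
        }
    )

val multiLineInput = "fuu \nbar"
val dotAllToken = Regex(".+", RegexOption.DOT_MATCHES_ALL)

// Used directly, the option is honoured and the match spans both lines.
fun firstMatchDirect(): String? = dotAllToken.find(multiLineInput)?.value

// After combining by pattern string, '.' is back to its default
// behaviour and the first match ends before the '\n'.
fun firstMatchCombined(): String? =
    combineBySource(dotAllToken).find(multiLineInput)?.value
```

Under this assumption, the token regex works on its own but loses DOT_MATCHES_ALL once only its pattern string is reused.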

Consequently, I reimplemented the DefaultTokenizer by modifying the existing one. This is what I came up with: https://gist.github.com/IARI/91011233658d386f1f1aefd2450537f2#file-mytokenizer-kt-L14

I made sure that it receives regex options, and used it in my grammar like this:

        override val tokenizer: Tokenizer by lazy {
            MyTokenizer(tokens, RegexOption.MULTILINE, RegexOption.DOT_MATCHES_ALL)
        }

The result is still the originally described error.

Could this have to do with the behavior of Java's Scanner, which the tokenizer uses, and the delimiter set for it?

IARI, Aug 17 '18 16:08

@IARI, in the 0.3.5 update, I've added transformation of regex options into embedded flags. Before 0.3.5, you could also just add the regex embedded flags to the pattern string.
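
For readers unfamiliar with embedded flags, here is a rough sketch of what such a transformation could look like on the JVM. The `embedFlags` helper is hypothetical and written only for illustration; it is not the code added in 0.3.5.

```kotlin
// Hypothetical helper: prefixes a pattern string with the embedded-flag
// form of the given options, e.g. DOT_MATCHES_ALL becomes "(?s)".
fun embedFlags(pattern: String, vararg options: RegexOption): String {
    val flags = options.mapNotNull { option ->
        when (option) {
            RegexOption.IGNORE_CASE -> 'i'
            RegexOption.MULTILINE -> 'm'
            RegexOption.DOT_MATCHES_ALL -> 's'
            RegexOption.COMMENTS -> 'x'
            RegexOption.UNIX_LINES -> 'd'
            else -> null // LITERAL and CANON_EQ have no embedded form
        }
    }.joinToString("")
    return if (flags.isEmpty()) pattern else "(?$flags)$pattern"
}

// The flag now travels inside the pattern string itself.
val embeddedDotAll: String = embedFlags(".+", RegexOption.DOT_MATCHES_ALL)
```

Because the flag is part of the string, a pattern like `(?s).+` keeps its meaning even when only the pattern string is reused, for example when written directly into `token("""(?s).+""")` as in the original snippet. Whether that resolves the \G behaviour reported in this issue is not something this sketch shows.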

h0tk3y, Aug 17 '18 18:08

That's nice, but as stated, as far as I understand it doesn't solve the problem. So far, it has worked only without the \G.

IARI, Aug 17 '18 20:08

Any hints?

IARI, Sep 18 '18 16:09