Line-beginning anchor (^) does not work.

Open neizod opened this issue 14 years ago • 12 comments

In the lexer part:

"a"     return 'BODY';
^"a"    return 'HEAD';

With the test input a a, the returned tokens are BODY BODY, while with

^"a"    return 'HEAD';
"a"     return 'BODY';

the returned tokens are HEAD HEAD (expected: HEAD BODY).

neizod avatar Oct 24 '11 21:10 neizod

Have you tried using:

 "^a"

jklmli avatar Nov 14 '11 18:11 jklmli

I just tried it and nothing happened, as I expected.

neizod avatar Nov 14 '11 19:11 neizod

This is tricky because the lexer uses JavaScript regular expressions, which don't allow you to start matching from an arbitrary position in a string. This means a new string is created each time, starting at the end of the last match, so ^ is technically always true.
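
To illustrate the point, here is a minimal sketch of a lexer that slices off the consumed input after every match. It is a simplified model, not Jison's actual implementation; the `tokenize` function and the rule table are hypothetical names made up for this example. Because each rule only ever sees the remaining string, the extra ^ in the HEAD rule matches at every token.

```javascript
// Simplified model (not Jison's real code): every rule is tested against the
// remaining input, so "^" always refers to the start of the remaining string.
function tokenize(input, rules) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    // Pick the first rule that matches at the start of the remaining input.
    const rule = rules.find(([regex]) => regex.test(rest));
    if (!rule) throw new Error("Unrecognized input: " + rest);
    const [regex, token] = rule;
    const text = rest.match(regex)[0];
    if (token !== null) tokens.push(token);
    // Consumed text is dropped, so the next "a" is again at position 0.
    rest = rest.slice(text.length);
  }
  return tokens;
}

const rules = [
  [/^\s+/, null],       // skip whitespace
  [/^(?:^a)/, "HEAD"],  // models ^"a"
  [/^(?:a)/, "BODY"],   // models "a"
];

console.log(tokenize("a a", rules)); // [ 'HEAD', 'HEAD' ]
```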

A possible workaround would be to prepend the input with a unique character and replace ^ with that character in the rules.
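
A rough sketch of that workaround, under some assumptions: the sentinel is the control character U+0001 (assumed never to occur in real input), it is inserted at the start of the input and after every newline, and the lexer is the same simplified slicing model rather than Jison itself. `SENTINEL`, `markLineStarts` and `tokenize` are hypothetical names.

```javascript
// Assumption: U+0001 never appears in real input.
const SENTINEL = "\u0001";

// Mark the start of the input and the start of every following line.
function markLineStarts(input) {
  return SENTINEL + input.replace(/\n/g, "\n" + SENTINEL);
}

// Simplified slicing lexer (a model, not Jison's real code).
function tokenize(input, rules) {
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const rule = rules.find(([regex]) => regex.test(rest));
    if (!rule) throw new Error("Unrecognized input: " + rest);
    const [regex, token] = rule;
    const text = rest.match(regex)[0];
    if (token !== null) tokens.push(token);
    rest = rest.slice(text.length);
  }
  return tokens;
}

const rules = [
  [/^\s+/, null],        // skip whitespace
  [/^\u0001a/, "HEAD"],  // sentinel followed by "a": "a" at a line start
  [/^\u0001/, null],     // drop a sentinel that no other rule consumed
  [/^a/, "BODY"],        // any other "a"
];

console.log(tokenize(markLineStarts("a a\na"), rules)); // [ 'HEAD', 'BODY', 'HEAD' ]
```

One cost of this approach is that matched text and reported column positions include the sentinel unless the actions strip it again.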

zaach avatar Jan 20 '12 21:01 zaach

@zaach The y flag [0] may help with this, though I don't know how well it is supported in browsers other than Gecko-based ones.

[0] https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/RegExp

victorporof avatar Jun 12 '12 16:06 victorporof

Quick and dirty hack to solve this:

"a" %{
  this.yy_ = this;
  return (this.yylloc.first_column === 0) ? 'HEAD' : 'BODY';
%}

alvaro-cuesta avatar Dec 22 '13 19:12 alvaro-cuesta

What about using custom scanners? I have written a library called Lexer in the spirit of Flex which allows you to match arbitrary expressions as follows:

var Parser = require("jison").Parser;
var Lexer = require("lex");

var grammar = {
    "bnf": {
        // ...
    }
};

var parser = new Parser(grammar);
var lexer = parser.lexer = new Lexer;

lexer.addRule(/^a/, function (lexeme) {
    this.yytext = lexeme;
    return "HEAD"; // "a" at the beginning of a line
});

lexer.addRule(/a/, function (lexeme) {
    this.yytext = lexeme;
    return "BODY"; // any other "a"
});

Perhaps we could integrate it into Jison to be the default scanner? Advantages:

  1. It's easier to use regular expressions themselves instead of string descriptions of regular expressions.
  2. It's easier to use functions themselves instead of string descriptions of function bodies.
  3. Lexer currently supports some very powerful features such as start conditions, global patterns, optional case insensitive matching, optionally matching beginning and end of lines, etc.

I've also wanted to improve the performance of Lexer for quite a while by using Finite State Automata instead of native regular expressions. Perhaps we could work on that collaboratively?

aaditmshah avatar Jun 29 '14 03:06 aaditmshah

@aaditmshah A more JavaScript friendly lexer is definitely a nice thing to have, but one of the qualifications for the default lexer is that it can be expressed in a way that's familiar to Flex users.

I've thought about implementing a regex engine in JS, but building one with enough features and speed to be useful is more than I have time for. Another option I believe others have explored is compiling a C/C++ regex engine using emscripten.

zaach avatar Jul 01 '14 00:07 zaach

I have enough time to implement a regex engine in pure JavaScript. What is the interface required to integrate a regex engine with jison? Is it the same interface that's exposed by jison-lex?

aaditmshah avatar Jul 02 '14 03:07 aaditmshah

Now that we have the "sticky" flag, could we make every regex sticky and multiline (/my) and, before testing each regex, manually set its lastIndex to the lastIndex of the last regex that matched?

var match, rule, lastIndex, i;
lastIndex = lastMatchRegex.lastIndex;
for (i = 0; i < rules.length; i++) {
    rule = rules[i];
    rule.regex.lastIndex = lastIndex;
    match = input.match(rule.regex);
    if (match) {
        return match[0];
    }
}
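
The idea can be fleshed out into a small runnable sketch. This is only an illustration of the sticky approach in a standalone tokenizer, not a patch for Jison; `tokenize` and the rule table are hypothetical, and it assumes an engine that supports the y flag (ES2015 or later). The input string is never sliced, so with the m flag ^ matches only at position 0 or right after a line terminator.

```javascript
// Sticky-flag tokenizer sketch: rules are matched in place at `pos`
// instead of against a freshly sliced string.
function tokenize(input, rules) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [regex, token] of rules) {
      regex.lastIndex = pos;          // sticky regexes match only at lastIndex
      const m = regex.exec(input);
      if (m && m[0].length > 0) {
        if (token !== null) tokens.push(token);
        pos = regex.lastIndex;        // exec advanced lastIndex past the match
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("Unrecognized input at index " + pos);
  }
  return tokens;
}

const rules = [
  [/\s+/my, null],    // skip whitespace
  [/^a/my, "HEAD"],   // "a" at the start of the input or of a line
  [/a/my, "BODY"],    // any other "a"
];

console.log(tokenize("a a\na", rules)); // [ 'HEAD', 'BODY', 'HEAD' ]
```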

amobiz avatar Jun 10 '16 14:06 amobiz

@amobiz hey, are you DDOSing?

yosbelms avatar Jun 10 '16 14:06 yosbelms

Sorry, I thought no one was here. I was just trying to update the information.

amobiz avatar Jun 10 '16 14:06 amobiz

@amobiz, I think the solution to this issue is the one pointed out by @alvaro-cuesta above. Because the lexer "eats" the input, ^"a" and "a" are effectively the same rule, so handling such custom cases in a %{ %} block is a very straightforward and explicit way to do it.

yosbelms avatar Jun 10 '16 15:06 yosbelms