[Feature request] optional end_of_input token just before Lexer starts returning None

Open legeana opened this issue 1 year ago • 3 comments

It would be really convenient to be able to inject a custom EndOfInput token just before the lexer starts returning None.

  #[derive(Logos)]
  #[logos(error = LexerError)]
  #[logos(extras = LineTracker)]
  #[logos(skip r"#.*")] // comments
  pub enum Token {
      #[end_of_input] // proposed attribute: emitted once, right before the lexer returns None
      EndOfInput,
      #[token("\n")]
      Newline,
  }

For some shell-like grammars where statements are terminated by a newline, having EndOfInput (or even injecting Newline itself at the end) makes parsing unterminated trailing statements much easier, because you can define Statement = Command+ (Newline | EndOfInput).
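
To illustrate why a shared terminator helps, here is a minimal sketch of a parser for that rule. The `Token` type and `parse_statements` function are hypothetical and independent of Logos; the sketch assumes the token stream already ends with `EndOfInput`, and it relaxes the rule slightly by skipping empty statements (blank lines).

```rust
// Hypothetical token type for a tiny shell-like grammar (not generated by Logos).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Command(String),
    Newline,
    EndOfInput,
}

/// Repeatedly parses `Statement = Command+ (Newline | EndOfInput)`.
/// Returns the command words of each statement, in order.
fn parse_statements(tokens: &[Token]) -> Result<Vec<Vec<String>>, String> {
    let mut statements = Vec::new();
    let mut current: Vec<String> = Vec::new();

    for token in tokens {
        match token {
            Token::Command(word) => current.push(word.clone()),
            // Both terminators share one arm, so the trailing statement
            // needs no special case.
            Token::Newline | Token::EndOfInput => {
                // Assumption: empty statements (blank lines) are skipped.
                if !current.is_empty() {
                    statements.push(std::mem::take(&mut current));
                }
            }
        }
    }

    // Commands left over here mean the stream had no final terminator.
    if current.is_empty() {
        Ok(statements)
    } else {
        Err("statement is missing its terminator".to_string())
    }
}
```

With the tokens for `ls -l`, a newline, `echo hi` and a final `EndOfInput`, this returns two statements. Without the final `EndOfInput`, the trailing `echo hi` is reported as an error, which is the special case the proposed token would remove.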

Without this feature, I wrote a wrapper that returns one additional token after Logos returns None.
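
Such a wrapper could look roughly like the sketch below. This is not the code from the issue: `WithEndOfInput` is a hypothetical name, and the adapter is written against any `Iterator` rather than `logos::Lexer` so that it stands alone.

```rust
/// Hypothetical adapter: yields every item of `inner`, then one extra
/// end-of-input item, then `None`.
struct WithEndOfInput<I: Iterator> {
    inner: I,
    // `Some` until the extra token has been handed out.
    end_of_input: Option<I::Item>,
}

impl<I: Iterator> WithEndOfInput<I> {
    fn new(inner: I, end_of_input: I::Item) -> Self {
        WithEndOfInput {
            inner,
            end_of_input: Some(end_of_input),
        }
    }
}

impl<I: Iterator> Iterator for WithEndOfInput<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // After the extra token was emitted, stop polling the inner iterator.
        if self.end_of_input.is_none() {
            return None;
        }
        match self.inner.next() {
            Some(token) => Some(token),
            // Inner iterator is exhausted: emit the extra token exactly once.
            None => self.end_of_input.take(),
        }
    }
}
```

Wrapping `vec!["a", "b"].into_iter()` with `"<eoi>"` as the extra item yields `a`, `b`, `<eoi>` and then `None`.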

legeana • Jul 18 '23 23:07

Hello, thanks for your suggestion!

Performance-wise, I don't see any advantage over using the Iterator::chain method:

#[derive(Debug)]
enum Token {
    A,
    B,
    C,
    EOF,
}

fn main() {
    use Token::*;

    // Stand-in for a token stream; chain appends EOF once it is exhausted.
    let lexer = vec![A, B, C, A, B, C]
        .into_iter()
        .chain(Some(EOF));

    for token in lexer {
        println!("{:?}", token);
    }
}

I understand that this requires manually adding the last token with chain, but I don't think Logos can actually do anything better than that :-/

jeertmans • Jul 19 '23 10:07

I think we could handle that, I'll keep that in mind when I get to coding!

maciejhirsz • Jul 19 '23 11:07

My feeling is that if you use chain you lose the logos::Lexer type, so you can no longer easily access lexer.span(), lexer.slice() and lexer.extras, because the fields of Chain are private: pub struct Chain<A, B> { /* private fields */ }. Having this functionality as part of logos would make a difference.
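
One way to work around this outside of Logos is a wrapper that keeps the lexer in a public field instead of hiding it as `Chain` does. The sketch below is only an illustration: `WordLexer` is a made-up stand-in that imitates the `span()` and `slice()` accessors of `logos::Lexer`, and `WithEndOfInput` is a hypothetical name.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum Token {
    Word,
    EndOfInput,
}

/// Stand-in for `logos::Lexer` (an assumption for this sketch): splits the
/// source on ASCII spaces and remembers the span of the last match.
struct WordLexer<'s> {
    source: &'s str,
    pos: usize,
    span: Range<usize>,
}

impl<'s> WordLexer<'s> {
    fn new(source: &'s str) -> Self {
        WordLexer { source, pos: 0, span: 0..0 }
    }

    fn span(&self) -> Range<usize> {
        self.span.clone()
    }

    fn slice(&self) -> &'s str {
        &self.source[self.span.clone()]
    }
}

impl<'s> Iterator for WordLexer<'s> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        let bytes = self.source.as_bytes();
        // Skip separating spaces.
        while self.pos < bytes.len() && bytes[self.pos] == b' ' {
            self.pos += 1;
        }
        let start = self.pos;
        // Consume one word.
        while self.pos < bytes.len() && bytes[self.pos] != b' ' {
            self.pos += 1;
        }
        self.span = start..self.pos;
        if start == self.pos { None } else { Some(Token::Word) }
    }
}

/// Hypothetical wrapper: the lexer stays reachable through a public field.
struct WithEndOfInput<'s> {
    lexer: WordLexer<'s>,
    finished: bool,
}

impl<'s> Iterator for WithEndOfInput<'s> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        if self.finished {
            return None;
        }
        match self.lexer.next() {
            Some(token) => Some(token),
            None => {
                // Emit the extra token once, then stop.
                self.finished = true;
                Some(Token::EndOfInput)
            }
        }
    }
}
```

After each call to `next()`, `tokens.lexer.span()` and `tokens.lexer.slice()` remain available. The cost is that every caller has to build and carry this wrapper, which is what built-in support would avoid.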

legeana • Jul 19 '23 11:07