
Support for zero-copy parsing

Open zesterer opened this issue 2 years ago • 19 comments

This is now being developed in #82

zesterer avatar Oct 29 '21 13:10 zesterer

Does this imply being able to do something like this?

fn parse() -> impl Parser<char, &str, Error = ....> {}

mainrs avatar Dec 08 '21 20:12 mainrs

Yes, among other things. There are a few complications though: feeding lifetimes through is quite an intricate process, and it's going to require a fairly fundamental rework of the Stream trait.

There are also complications with existing parsers: if just('!').repeated() results in an impl Parser<char, &str>, then what does just('!') result in? A char? But then it's owned, which implies that .repeated() is getting an unowned &str out of thin air. It might require some extra combinators to be used effectively.

One option I'm considering is a combinator that complements map_with_span, something like a map_with_slice, that allows fetching the slice (&[T] or &str) of the input that was parsed instead of its span. This would allow building zero-copy parsing on top of the existing API, avoiding allocation without needing to significantly change the API.
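
As a chumsky-free illustration of the idea (take_bangs is a hypothetical helper, not part of the library): track how far the parser consumed and hand back a borrowed slice of the original input rather than building an owned value.

```rust
// Sketch of zero-copy slicing: consume a run of '!' characters and
// return the matched slice plus the remainder. Both halves borrow
// from `input`; nothing is allocated.
fn take_bangs(input: &str) -> (&str, &str) {
    let end = input
        .char_indices()
        .find(|&(_, c)| c != '!')
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    (&input[..end], &input[end..])
}

fn main() {
    let (bangs, rest) = take_bangs("!!!abc");
    assert_eq!(bangs, "!!!");
    assert_eq!(rest, "abc");
}
```

A map_with_slice-style combinator would do the same bookkeeping internally, mapping the byte offsets a sub-parser covered back onto the original input.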

zesterer avatar Dec 09 '21 12:12 zesterer

@zesterer Are you still looking for any help with the zero-copy implementation? I'd love to lend a helping hand. I just finished browsing the whole branch and it looks fantastic so far. Really interesting!

damien-white avatar Jun 30 '22 17:06 damien-white

@dark-fusion Definitely! I've been on a bit of a hiatus over the past few weeks (a mixture of burnout and a lot of work), but it's beginning to look like something that's of merge quality. I think it's probably time for me to write a list of things that need doing before merge (aside from waiting for GAT stabilisation). What sort of things were you interested in working on?

zesterer avatar Jun 30 '22 20:06 zesterer

I'm most interested in adding support for binary-encoded formats, however I also understand that it would involve quite a bit of work and may be best suited as an extension crate (at least for now). This was also already discussed in #48.

While waiting for certain features to land on Rust's stable channel, I'm not sure where an extra set of hands would be most welcome.

damien-white avatar Jul 01 '22 21:07 damien-white

Hello hello :D

I am looking to contribute more to chumsky, and am curious whether there are any open tasks related to zero-copy parsing that need to be worked on, and if so, what those might be!

Thanks for your time!

Zij-IT avatar Aug 17 '22 18:08 Zij-IT

am curious whether there are any open tasks that need to be worked on

Hey, thanks for showing interest! There are a few things on the list, but it really depends on what you're interested in.

  • Looking into overhauling the error recovery system (the current system in master works okay for simple cases, but it would be nice to be able to generalise it a bit, perhaps by letting users provide their own recovery parsers instead of a handful of pre-built ones, as mentioned in #77)
  • Finish off implementing primitives (some aren't being ported across; for example, filter/filter_map are now both just methods on Parser: filter and try_map)
  • Port over (or, more likely, reimplement) the parsers from the text module, perhaps making them a bit more flexible in the process
  • Figure out how to make something akin to the custom parser in master, but actually useful by strategically making methods like InputRef::next public so that people can write their own inline parsers that manually pull tokens out of the input
  • Add Parser implementations for a bunch of types to improve the utility of things (&impl Parser, Box<impl Parser>, Rc<impl Parser>, etc.)
  • Port over 'recent' changes from master (.not(), etc.)
  • If you're really looking for a challenge, something I'm grappling with at the moment is figuring out how to make error generation in zero-copy very fast. master works fine, but each parser returns a rather complex type to represent the errors. It would be better to return just a Result<O, ()> and have the errors stored inline within InputRef (I've already started working on this, but it needs work), then figure out how best to choose between error sets with combinators like Or and Repeated
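
To sketch the "errors stored inline" idea (all names here are hypothetical stand-ins; the real InputRef differs): parsers return a cheap Result<O, ()> and push rich error data into shared parser state instead of threading it through their return types.

```rust
// Hypothetical sketch, not chumsky code: rich errors accumulate in
// shared state, keeping the Result on the hot path zero-sized.
struct State<'a> {
    input: &'a str,
    pos: usize,
    errors: Vec<String>, // detailed errors live here, off the hot path
}

fn expect_char(state: &mut State, c: char) -> Result<char, ()> {
    match state.input[state.pos..].chars().next() {
        Some(found) if found == c => {
            state.pos += found.len_utf8();
            Ok(found)
        }
        other => {
            state.errors.push(format!("expected {c:?}, found {other:?}"));
            Err(()) // the Err carries no data; details are in `state.errors`
        }
    }
}

fn main() {
    let mut state = State { input: "ab", pos: 0, errors: Vec::new() };
    assert!(expect_char(&mut state, 'a').is_ok());
    assert!(expect_char(&mut state, 'x').is_err());
    assert_eq!(state.errors.len(), 1);
}
```

Combinators like Or then need a policy for merging or discarding the error sets each failed branch pushed, which is the open design question mentioned above.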

Any one or part of these would be really appreciated and would definitely help speed up the process of getting the branch ready for merge. I'd love to dedicate more time to it myself, but I'm going through quite a busy period in my life right now and have a bit less time to spare than I normally do :(

zesterer avatar Aug 17 '22 20:08 zesterer

Thanks for the list :D I will take a crack at text and get a pull request going when I can. And no worries about being busy! If someone blames you for that they can write it themselves ;)

Zij-IT avatar Aug 18 '22 13:08 Zij-IT

Specs

  • Modified version of zero-copy with a pratt-parser, which can be found here.
  • rustc 1.65.0-nightly (addacb587 2022-08-24)
  • Cpu: AMD Ryzen 5 4500U with Radeon Graphics (6) @ 2.375GHz
  • Memory: 16GB

Issue

I am not sure what the exact issue is, but I don't know if zero-copy is going to have a great time with compile times. For reference, I am compiling my Lua parser, and you know you are in for a great time when cargo check takes almost 10 minutes. When I first ran cargo run --release I killed the process after about 60 minutes.

I am up for any and all recommendations you have here, as I am kinda at a loss. I know you are busy, so there's no rush! Thanks for your time in advance.

Zij-IT avatar Aug 25 '22 19:08 Zij-IT

@Zij-IT The very long or chain here is a bit suspect to me. Due to deficiencies in rustc's ability to memoise trait obligations, Or parsers tend to have exponential compilation times. I'd recommend trying choice instead; it tends not to have this issue.

That aside, make sure to throw .boxed() on the end of any large parsers. It has a relatively insubstantial runtime cost, but improves compilation times very significantly!
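
As a rough stdlib analogy (tag and first_match are made-up helpers, not chumsky's API): each chained .or(...) nests the parser's type one level deeper, like Or<Or<Or<A, B>, C>, D>, which is what stresses trait resolution. Boxing erases the concrete type the way .boxed() does, leaving a flat list of identically-typed alternatives:

```rust
// Type-erased "parser": a boxed closure, analogous to a .boxed() parser.
type BoxedParser = Box<dyn Fn(&str) -> Option<&'static str>>;

// Build a trivial parser recognising a fixed keyword at the start of input.
fn tag(word: &'static str) -> BoxedParser {
    Box::new(move |input: &str| input.starts_with(word).then_some(word))
}

// Analogue of `choice(...)`: try each alternative in order. All entries
// share one erased type, so adding alternatives doesn't deepen any type.
fn first_match(alternatives: &[BoxedParser], input: &str) -> Option<&'static str> {
    alternatives.iter().find_map(|p| p(input))
}

fn main() {
    let alts = vec![tag("let"), tag("fn"), tag("if")];
    assert_eq!(first_match(&alts, "fn main"), Some("fn"));
    assert_eq!(first_match(&alts, "while"), None);
}
```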

zesterer avatar Aug 25 '22 20:08 zesterer

I mentioned in #197 that I was able to get it to compile, but I figured I would show you some of the fruits of your labor :D

Specs

  • rustc 1.65.0-nightly (addacb587 2022-08-24)
  • Cpu: AMD Ryzen 5 4500U with Radeon Graphics (6) @ 2.375GHz
  • Memory: 16GB

Task

  • Parsing every .lua file in the luvit Lua library
  • Total lines: 21972
  • Total characters: 608323

Results

I would say that is quite the achievement! I am curious where there is still room for improvement, and excited to see that number working its way down!

Zij-IT avatar Aug 26 '22 18:08 Zij-IT

That's quite a nice improvement! Not as much as I'd hoped though, unfortunately. I think there's still more to be done here. Are these just the numbers for the lexer alone, or the parser too?

I've realised that the foldl/foldr combinators are extremely commonly hit in most parsers, so I'm interested to see whether we could make them ingest a Repeated/SeparatedBy directly instead of requiring a .collect() into a vector beforehand, saving a lot of allocation and memory juggling.
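
The saving can be illustrated with plain iterators (a chumsky-free analogy, not the combinator API itself): folding as items are produced skips the intermediate Vec entirely.

```rust
// Collect-then-fold: what foldl/foldr currently require via .collect().
fn sum_collect_then_fold(items: impl Iterator<Item = u64>) -> u64 {
    let buffered: Vec<u64> = items.collect(); // extra allocation
    buffered.into_iter().fold(0, |acc, x| acc + x)
}

// Fold directly: the hypothetical "ingest Repeated directly" approach.
fn sum_fold_directly(items: impl Iterator<Item = u64>) -> u64 {
    items.fold(0, |acc, x| acc + x) // no intermediate buffer
}

fn main() {
    assert_eq!(sum_collect_then_fold(1u64..=4), 10);
    assert_eq!(sum_fold_directly(1u64..=4), 10);
}
```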

zesterer avatar Aug 26 '22 19:08 zesterer

It's for both the lexer and the parser! Each of the 200ish files goes from &str to Chunk!

Zij-IT avatar Aug 26 '22 19:08 Zij-IT

It's me again, with some fantastic news for me and uncertain news for zero-copy :D I was going back through my parser looking for those optimizations you were hoping for, and I found THE problem in my implementation. Looking at a flamegraph, I noticed that BinaryInfixOperator followed by Foldr was taking up a large portion of the time. That seemed odd, because I only use foldr in two locations, so I went looking at them, and it hit me. This little portion of my expression parser has one CRITICAL issue:

let power = atom
    .clone()
    .then(just(TokenKind::OpPow).to(BinaryOp::Power))
    .repeated()
    .then(atom.labelled("binary operand"))
    .foldr(|(lhs, op), rhs| Expr::Binary {
        lhs: Box::new(lhs),
        rhs: Box::new(rhs),
        op,
    });

It parses EVERY atom twice if that atom is not followed by a TokenKind::OpPow. This is wonderfully terrifying, when the singular expression is, say, a giant function definition like this one:

test('fs chmod', function(expect)
  local mode_async
  local mode_sync
  -- On Windows chmod is only able to manipulate read-only bit
  -- TODO: test on windows
  if is_windows then
    mode_async = 256 --[[tonumber('0400', 8)]] -- read-only
    mode_sync = 438  --[[tonumber('0600', 8)]] -- read-write
  else
    mode_async = 511 --[[tonumber('0777', 8)]]
    mode_sync = 420 --[[tonumber('0644', 8)]]
  end

  local file1 = Path.join(__dirname, 'fixtures', 'a.lua')
  local file2 = Path.join(__dirname, 'fixtures', 'a1.lua')

  local function maskMode(mode, mask)
    return bit.band(mode, mask or 511 --[[tonumber('0777',8)]])
  end

  fs.chmod(file1, mode_async, expect(function(err)
    assert(not err)
    if is_windows then
      assert(maskMode(maskMode(fs.statSync(file1).mode), mode_async))
    else
      assert(mode_async == maskMode(fs.statSync(file1).mode))
    end

    -- TODO: accept mode in number
    assert(fs.chmodSync(file1, mode_sync))
    if is_windows then
      assert(maskMode(maskMode(fs.statSync(file1).mode), mode_sync))
    else
      assert(mode_sync == maskMode(fs.statSync(file1).mode))
    end
  end))
  fs.open(file2, 'a', tonumber('0666', 8), expect(function(err, fd)
    assert(not err)
    fs.fchmod(fd, mode_async, expect(function(err)
      assert(not err)
      if is_windows then
        assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_async))
      else
        assert(mode_async == maskMode(fs.fstatSync(fd).mode))
      end

      -- TODO: accept mode in number
      assert(fs.fchmodSync(fd, mode_sync))
      if is_windows then
        assert(maskMode(maskMode(fs.fstatSync(fd).mode), mode_sync))
      else
        assert(mode_sync == maskMode(fs.fstatSync(fd).mode))
      end
      fs.close(fd)
    end))
  end))

  -- lchmod
  if fs.lchmod then
    local link = Path.join(__dirname, 'tmp', 'symbolic-link')
    fs.unlinkSync(link)
    fs.symlinkSync(file2, link)

    fs.lchmod(link, mode_async, expect(function(err)
      assert(not err)
      p(fs.lstatSync(link).mode)
      assert(mode_async == maskMode(fs.lstatSync(link).mode))

      -- TODO: accept mode in number
      fs.lchmodSync(link, string.format('%o', mode_sync))
      assert(mode_sync == maskMode(fs.lstatSync(link).mode))
    end))
  end
end)

There are of course, as you know, many ways to solve this problem; I just went ahead and used the binary_infix_operator from #144 (though I had to fix the Clone implementation). This led to the following improved times:

  • master Parser - 1600ms (750ms lexing)
  • zero-copy Parser - 1600ms (750ms lexing)

Zij-IT avatar Aug 26 '22 22:08 Zij-IT

It's for both the lexer and the parser! Each of the 200ish files goes from &str to Chunk!

Out of interest, how do just the lexing times compare? My suspicion is that lexing probably got a lot faster, but parsing didn't (because parsing already requires a lot of collecting into vectors, recursive data structures, etc. so there probably wasn't much that zero-copy could improve on).

I was going back through my Parser looking for those optimizations you were hoping for and I found THE problem in my implementation.

Ah yes, this is quite a common pitfall. Chumsky doesn't have memoisation, so it'll eagerly attempt both interpretations, parsing the same input twice, which can rapidly become exponential!

I'm quite surprised that master and zero-copy now have similar timings though. I would have hoped that at least some of its improvements were visible...

zesterer avatar Aug 26 '22 22:08 zesterer

I will give it a look when I can tomorrow, but I have only moved over the parser so far, so both lexers are using the non-zero-copy parser combinators and are identical.

Sorry for the confusion 😅 My branch naming scheme when working on my own stuff needs work.

Zij-IT avatar Aug 26 '22 22:08 Zij-IT

Welp, I am back :D I had to fix some issues that I found in the zero-copy branch, but after I got them done I was able to do some testing, and here are the results:

Lex and Parse

Lex

Zij-IT avatar Aug 27 '22 19:08 Zij-IT

It parses EVERY atom twice if that atom is not followed by a TokenKind::OpPow

@Zij-IT Could you further elaborate? I do not really understand why this is the case.

mainrs avatar Aug 27 '22 19:08 mainrs

@Zij-IT Could you further elaborate? I do not really understand why this is the case.

Here is the parser again to prevent you having to scroll up :D

let power = atom
    .clone()
    .then(just(TokenKind::OpPow).to(BinaryOp::Power))
    .repeated()
    .then(atom.labelled("binary operand"))
    .foldr(|(lhs, op), rhs| Expr::Binary {
        lhs: Box::new(lhs),
        rhs: Box::new(rhs),
        op,
    });

To put this into pseudocode and hopefully make it more understandable, here it is as EBNF:

power := (atom Pow)* atom

Now, the important thing here is that the left and right hand side of the TokenKind::Pow are the same item. I am going to go through parsing this with the following example:

12 ^ 12 ^ 12

To accurately parse the above expression, we have to repeat the (atom Pow) parser until it fails, and then parse the final atom. Here are the steps:

  1. Parse (atom Pow) results in 12 ^ | 12 ^ 12
  2. Parse (atom Pow) results in 12 ^ 12 ^ | 12
  3. Parse (atom Pow) results in an error, because after parsing the last 12, there is no Pow.
  4. Parse atom results in 12 ^ 12 ^ 12 |

The problem is that in both steps 3 and 4 we have to parse 12, because there is no way to pass the successfully parsed 12 to the next atom.

For this tiny example, that is no issue. But in the Lua example I have above, some of the lines are nested in three lambda definitions, and because each expression will be parsed twice, every item within the third call will be parsed a minimum of 2^3 times.
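
A toy simulation of the (atom Pow)* atom strategy (plain Rust over a token stream, not chumsky) makes the double parse countable:

```rust
#[derive(Clone, Copy, PartialEq)]
enum Token { Num, Pow }

// Count how many times `atom` runs while parsing with the
// `(atom Pow)* atom` strategy. The atom not followed by Pow is
// parsed twice: once inside the failing `(atom Pow)` repetition
// (then thrown away on backtrack), and once by the trailing `atom`.
fn parse_power(tokens: &[Token]) -> usize {
    let mut atom_parses = 0;
    let mut pos = 0;
    // Repeat `(atom Pow)` until it fails...
    loop {
        let start = pos;
        if tokens.get(pos) == Some(&Token::Num) {
            atom_parses += 1; // atom attempted
            pos += 1;
            if tokens.get(pos) == Some(&Token::Pow) {
                pos += 1;
                continue;
            }
        }
        pos = start; // backtrack: the atom's work is discarded
        break;
    }
    // ...then parse the final atom again from scratch.
    if tokens.get(pos) == Some(&Token::Num) {
        atom_parses += 1;
    }
    atom_parses
}

fn main() {
    use Token::*;
    // `12 ^ 12 ^ 12`: three atoms, four atom parses.
    assert_eq!(parse_power(&[Num, Pow, Num, Pow, Num]), 4);
}
```

When the "atom" is itself a huge sub-expression rather than a single token, that wasted parse is no longer a single step, and nesting multiplies the cost.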

Zij-IT avatar Aug 27 '22 20:08 Zij-IT

Heyyo :) Me again! Life's been hectic for the last couple months, but I am excited to start helping out again.

Any issues you are currently hoping to get marked off the todo list for zero-copy that I can help with?

Zij-IT avatar Oct 25 '22 20:10 Zij-IT

Hey, hello again! I think a big one would be porting docs / text parsers over, and perhaps some of the newer parsers like not and and_is (suggested in #208).

I'm also considering something big (but fundamentally uncomplicated): switching out the Output associated type with an O type parameter, as with master. The primary reason is to allow implementing Parser for !, but in hindsight it just makes more sense and consistency with master is good. That's quite a big & boring job though (lots of copy & paste, renaming, etc.), so I can completely understand if you're not interested in it.

zesterer avatar Oct 26 '22 19:10 zesterer

lots of [boring] copy & paste, renaming, etc.

If you can describe the tasks (or make a small example to replicate), I might try to play with it, trying to automate most of it with vim if it can help you 🤔

bew avatar Oct 26 '22 20:10 bew

It's a relatively simple process. Everywhere you see Parser<I, Output = O>, you just turn it into Parser<I, O>. Obviously this gets slightly more complicated than a pure copy & paste job in some of the trait bounds, but there shouldn't be anything particularly wild.
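
For anyone picking this up, here's a minimal sketch of the two trait shapes (ParserAssoc and ParserParam are stand-ins, not chumsky's real trait):

```rust
// Before: the output as an associated type, fixed per (Self, I) pair.
trait ParserAssoc<I> {
    type Output;
    fn parse(&self, input: I) -> Self::Output;
}

// After: the output as a type parameter, as in master. Among other
// things, this lets a single type implement the trait for several
// output types (and, on nightly, for uninhabited types like `!`).
trait ParserParam<I, O> {
    fn parse(&self, input: I) -> O;
}

struct Len;
impl<'a> ParserAssoc<&'a str> for Len {
    type Output = usize;
    fn parse(&self, input: &'a str) -> usize { input.len() }
}

struct LenParam;
impl<'a> ParserParam<&'a str, usize> for LenParam {
    fn parse(&self, input: &'a str) -> usize { input.len() }
}

fn main() {
    assert_eq!(Len.parse("abc"), 3usize);
    assert_eq!(LenParam.parse("abc"), 3usize);
}
```

The migration itself is mostly mechanical: every `P: Parser<I, Output = O>` bound becomes `P: Parser<I, O>`, with the occasional bound needing a little more care.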

zesterer avatar Oct 26 '22 23:10 zesterer