oak icon indicating copy to clipboard operation
oak copied to clipboard

fix(tokenizer): set correct column when moving back

Open shannonrothe opened this issue 2 years ago • 2 comments

First time contributing any Go code in my life, let me know if this doesn't make sense 😄

shannonrothe avatar Aug 10 '22 02:08 shannonrothe

Thanks @shannonrothe for the contribution! This does indeed do the job, but I'm concerned that splitting a (potentially very large) file into lines every time the code calls .back() might be an unexpected/unwarranted performance cost. .back() is called at least once for every identifier and number that appears in Oak programs, so the impact is significant.

I considered a few solutions when I wrote that TODO comment:

  • One solution would be to store somewhere in the tokenizer state the last column number when .next() was called, and have .back() just roll that back. It adds a bit more bookkeeping, but I think it would slow things down the least. The problem with this is that I can't call .back() twice in a row, because it won't have a column to roll back to the second time around.
  • I also considered having .back() count backwards until it ran into a \n, but that would hit perf issues when running very large compiled Oak programs which can contain super long lines that are entire programs, minified.

So... neither of those solutions are ideal. I'm not sure what other tokenizers do (they may not keep track of columns explicitly — I know Zig's compiler only lazily computes line/column info if it's needed). But at the moment, I think I'm waiting for me or someone else to have an idea for a O(1) implementation.

There are probably clever things I can do if I re-architected the tokenizer a bit, like first only emitting tokens that contain byte offsets into the file, and then at the end computing all the line breaks at once and assigning (line, col) to every token. But I don't think I want to do a refactor that large at the moment.

If you have an idea for a more efficient implementation, I'm all ears! If not, I may come back and close the issue later until I run into a solution that fits better or I have a chance to do the refactor.

thesephist avatar Aug 10 '22 11:08 thesephist

Thanks @shannonrothe for the contribution! This does indeed do the job, but I'm concerned that splitting a (potentially very large) file into lines every time the code calls .back() might be an unexpected/unwarranted performance cost. .back() is called at least once for every identifier and number that appears in Oak programs, so the impact is significant.

I considered a few solutions when I wrote that TODO comment:

* One solution would be to store somewhere in the tokenizer state the last column number when `.next()` was called, and have `.back()` just roll that back. It adds a bit more bookkeeping, but I think it would slow things down the least. The problem with this is that I can't call `.back()` twice in a row, because it won't have a column to roll back to the second time around.

* I also considered having `.back()` count backwards until it ran into a `\n`, but that would hit perf issues when running very large compiled Oak programs which can contain super long lines that are entire programs, minified.

So... neither of those solutions are ideal. I'm not sure what other tokenizers do (they may not keep track of columns explicitly — I know Zig's compiler only lazily computes line/column info if it's needed). But at the moment, I think I'm waiting for me or someone else to have an idea for a O(1) implementation.

There are probably clever things I can do if I re-architected the tokenizer a bit, like first only emitting tokens that contain byte offsets into the file, and then at the end computing all the line breaks at once and assigning (line, col) to every token. But I don't think I want to do a refactor that large at the moment.

If you have an idea for a more efficient implementation, I'm all ears! If not, I may come back and close the issue later until I run into a solution that fits better or I have a chance to do the refactor.

If the program source is not expected to change after the tokenizer is first instantiated then couldn't we pre-compute the lines array once upfront and pay the cost, and then reuse that whenever calling .back()?

I know that upfront cost could still be hefty for a large file, but still significantly better than the implementation in this PR?

shannonrothe avatar Aug 10 '22 22:08 shannonrothe