crystal icon indicating copy to clipboard operation
crystal copied to clipboard

StringScanner#scan(Int), #skip(Int), #current_char, etc, and significantly speed up operations on multibyte characters

Open jneen opened this issue 3 months ago • 5 comments

Supersedes #11785, addresses most of #11259

This includes the following non-breaking changes to StringScanner:

  • Refactor #peek to trust that @byte_offset is a valid char index, and use Char::Reader to read a specified number of characters ahead. The previous implementation would cause at least two full scans from the beginning of the string if it was not single byte optimizable, which could result in O(n^2) parses if #peek is used during a hot loop.
  • Add overloads #scan(Int) and #skip(Int) to advance a specified number of characters, again using Char::Reader for non-optimizable strings. It is also a much more performant alternative to scanner.offset += n, which can cause two full string scans, both to retrieve the offset, and find the target offset.
  • Adds #read_char : Char? to scan forward by one (possibly multibyte) character and return it, or nil if at the end of the string. This is equivalent to Ruby's #getch, but the method name was chosen to be closer to IO.
  • Adds #current_char and #previous_char, as well as UInt8-typed #current_byte and #previous_byte, which do not modify any state.
  • Adds #rewind(Int) and #peek_behind(Int) to rewind the head or scan backwards by a specified number of chars, using Char::Reader#previous_char to find the byte offset in the presence of multibyte chars.
  • Adds #beginning_of_line? (checks backwards one char for a newline, same as the Ruby impelementation) and #unscan, which rewinds the bytesize of the last match.
  • Adds #matched? : Bool to check the presence of @last_match, similar to Ruby's implementation.
  • Adjusts #inspect to show where the scan head is.

jneen avatar Nov 29 '25 22:11 jneen

Related: #11259

HertzDevil avatar Nov 30 '25 18:11 HertzDevil

Happy to add those in this PR as well tbh, I don't think it'd be that difficult

jneen avatar Nov 30 '25 20:11 jneen

Actually, ideally we'd be able to get an encoded character from the byte offset without scanning the whole string for multibyte chars, which String#[] (used by #peek) is wont to do. May require some extra functionality on String to be performant, I'll look into it.

jneen avatar Nov 30 '25 20:11 jneen

...in fact, now that I look at it, it might be better for StringScanner to be backed by Char::Reader instead of String

jneen avatar Nov 30 '25 20:11 jneen

Forgot to mention here, but this is ready for a review.

jneen avatar Dec 05 '25 15:12 jneen

Happy to split it up! I just noticed the issue that @HertzDevil linked and figured I'd knock a few of those out while I was here.

jneen avatar Dec 11 '25 14:12 jneen

Question, which makes more sense for something like:

# example A
s = StringScanner.new("foobar")
puts s.scan(1000)

# example B
s = StringScanner.new("foobar")
puts s.skip(1000)

Should this:

  • Both be a no-op and return nil (equivalent to a non-matching regex or string)
  • Both scan or skip to the end of the string and return the rest of the stream
  • Scan is a no-op, but skip goes to the end of the string? In isolation this makes the most sense but put together is strange.

edit: Also for #peek and #peek_behind. I do feel like these ones should probably just return a best-effort String, which is what crystal currently does.

s = StringScanner.new("foobar")
s.peek(1000) # => "foobar"
s.peek_behind(1000) # => ""

jneen avatar Dec 11 '25 17:12 jneen

Yeah, whether the number of characters should be an exact match or upper limit is a tricky question. I would probably expect an exact match. Thus scan(1000) should be equivalent to scan(/.{1000}/), not scan(/.{0,1000}/).

Either way, both scan and skip should agree on this because they're closely related.

peek on the other hand is an entirely different operation, so aligning with that could be nice but it's not critical.

If there is a need for upper-limit matching, it could perhaps be implemented with a range: scan(..1000). This is mostly an idea for the future, I don't think this is necessary right now.

straight-shoota avatar Dec 11 '25 18:12 straight-shoota

I've edited the behaviour of #scan(Int) and #skip(Int) to both no-op and return nil if they go off the end of the string. I think this makes sense as you said, as well as considering that #scan(1) in a loop should return nil eventually. I also shored up the specs quite a bit and added missing failsafes on #rewind, which now just puts you back at the start if the length is too big.

As for documentation, I've added some doc comments - is there somewhere else I need to look to update docs?

jneen avatar Dec 11 '25 19:12 jneen