maruku CharSourceStrscan does not work correctly with UTF-8 strings. Remove it.

Open outcassed opened this issue 6 years ago • 4 comments

CharSourceStrscan, an alternate CharSource implementation that is not enabled by default, assumes every character is 1 byte wide. UTF-8 strings that contain multi-byte characters break it.

This removes it entirely.

Example:

Rendering

<p>ö <strong>a</strong></p>

In Ruby 1.9.x:

<p>ö &lt;strong&gt;a&lt;/strong&gt;</p>

In Ruby 2.1 and above:

parse_span.rb:32:in `read_span': invalid byte sequence in UTF-8 (ArgumentError)
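The mismatch behind this can be sketched with plain StringScanner. This is only an illustration, assuming the failing code paths read with byte-oriented calls such as `peek` and treat `pos` as a character index; it is not taken from Maruku's source.

```ruby
require 'strscan'

# "\u00F6" is "ö", which takes 2 bytes in UTF-8.
s = "\u00F6 <strong>a</strong>"
scanner = StringScanner.new(s)

char_len = s.length    # number of characters
byte_len = s.bytesize  # number of bytes, one more than char_len here

# peek(n) extracts n *bytes*, so peek(1) cuts "ö" in half and
# yields a string that is not valid UTF-8.
half_char = scanner.peek(1)
half_valid = half_char.valid_encoding?

# getch is multi-byte aware and returns the whole character,
# but pos is a byte offset and has advanced by 2.
first = scanner.getch
pos_after_first = scanner.pos

# Using that byte offset as a character index skips the space.
wrong_rest = s[pos_after_first..-1]
right_rest = scanner.rest
```

Any later operation that requires valid UTF-8 (a regexp match, for instance) can raise on a string like `half_char`, which is consistent with the `invalid byte sequence` error above.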

Dec 12 '17 21:12 outcassed

Coverage increased (+1.4%) to 78.793% when pulling d68f7855df41555823a8186a87b882b245827689 on caseyf:caseyf-remove-charsourcestrscan into ec44b2709d6c617f6c5f7d79caec9b40570cdd68 on bhollis:master.

Dec 12 '17 21:12 coveralls

Alternatively, one can fix CharSourceStrscan to be multi-byte-aware.

I would still make CharSourceManual the default, 'cuz it's faster.

Dec 12 '17 21:12 distler

A multi-byte aware implementation would replace these methods. Here is a stab at it:

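class CharSourceStrscan
  # Current character, without advancing. /./m matches one whole
  # character (multi-byte included), newlines too.
  def cur_char
    @scanner.match?(/./m) && @scanner.matched
  end

  # Up to n characters from the current position, without advancing.
  def cur_chars(n)
    r = Regexp.new(".{0,#{n}}", Regexp::MULTILINE)
    @scanner.match?(r) && @scanner.matched
  end

  # Character after the current one, without advancing.
  # String#last exists only with ActiveSupport; [-1] works in plain Ruby.
  def next_char
    @scanner.match?(/../m) && @scanner.matched[-1]
  end

  # Consume and return one character (getch is multi-byte aware).
  def shift_char
    @scanner.getch
  end

  # Consume one character and discard it.
  def ignore_char
    @scanner.getch
    nil
  end

  # Consume and discard n characters.
  def ignore_chars(n)
    n.times { @scanner.getch }
    nil
  end
end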
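A quick way to try the idea outside Maruku is a small stand-in class. `SketchCharSource` and its constructor are hypothetical names used only for this sketch, and `next_char` indexes with `[-1]` so that it runs without ActiveSupport.

```ruby
require 'strscan'

# Hypothetical stand-in for illustration; not Maruku's actual class.
class SketchCharSource
  def initialize(text)
    @scanner = StringScanner.new(text)
  end

  # One whole character at the current position, without advancing.
  def cur_char
    @scanner.match?(/./m) && @scanner.matched
  end

  # The character after the current one, without advancing.
  def next_char
    @scanner.match?(/../m) && @scanner.matched[-1]
  end

  # Consume and return one character.
  def shift_char
    @scanner.getch
  end
end

src = SketchCharSource.new("\u00F6ab")  # "öab"
first = src.cur_char        # the full 2-byte character
second = src.next_char      # the character after it
shifted = src.shift_char    # consumes "ö"
after_shift = src.cur_char  # now the second character
```

Because `match?` is anchored at the scan pointer and works on characters rather than bytes, none of these calls splits "ö".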

Dec 12 '17 21:12 outcassed

If there's interest in a multi-byte-aware version, I can make a pull request out of the above-linked commits.

Dec 12 '17 21:12 distler
