commonmark-hs

Fix parsing of link destinations that look like `code` or <html>

Open notriddle opened this issue 1 year ago • 3 comments

stack bench output:

Before
commonmark-cli       > build (exe)
commonmark-cli       > Preprocessing executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > Building executable 'commonmark' for commonmark-cli-0.2.1..
commonmark-cli       > copy/register
commonmark-cli       > Installing executable commonmark in /home/michael/Development/commonmark-hs/.stack-work/install/x86_64-linux/d6a7ffd91072a14c336d55f594f5fe3da20e2523e0123ffc206169986025661e/8.10.7/bin
commonmark           > benchmarks
Running 1 benchmarks... 
Benchmark benchmark-commonmark: RUNNING...
All                     
  tokenize              
    tokenize sample.md: OK (0.24s)
      3.6 ms ± 250 μs   
  parse sample.md       
    commonmark default: OK (0.34s)
       48 ms ± 2.3 ms   
  pathological          
    nested strong emph  
      commonmark        
        1000:           OK (0.29s)
          1.1 ms ±  56 μs
        2000:           OK (0.28s)
          2.2 ms ± 103 μs
        3000:           OK (0.43s)
          3.3 ms ± 109 μs
        4000:           OK (0.28s)
          4.5 ms ± 323 μs
    many emph closers with no openers
      commonmark        
        1000:           OK (0.54s)
          1.0 ms ±  44 μs
        2000:           OK (0.14s)
          2.2 ms ± 200 μs
        3000:           OK (0.42s)
          3.3 ms ±  91 μs
        4000:           OK (0.27s)
          4.3 ms ± 322 μs
    many emph openers with no closers
      commonmark        
        1000:           OK (0.26s)
          1.0 ms ±  54 μs
        2000:           OK (0.27s)
          2.0 ms ± 127 μs
        3000:           OK (0.20s)
          3.2 ms ± 214 μs
        4000:           OK (4.48s)
          4.3 ms ± 355 μs
    many link closers with no openers
      commonmark        
        1000:           OK (0.32s)
          1.2 ms ±  91 μs
        2000:           OK (0.39s)
          2.9 ms ± 279 μs
        3000:           OK (1.05s)
          4.0 ms ±  57 μs
        4000:           OK (0.37s)
          5.5 ms ± 342 μs
    many link openers with no closers
      commonmark        
        1000:           OK (0.33s)
          1.2 ms ±  63 μs
        2000:           OK (0.17s)
          2.6 ms ± 175 μs
        3000:           OK (0.54s)
          4.3 ms ± 419 μs
        4000:           OK (0.36s)
          5.5 ms ± 274 μs
    mismatched openers and closers
      commonmark        
        1000:           OK (0.35s)
           11 ms ± 698 μs
        2000:           OK (0.30s)
           43 ms ± 3.3 ms
        3000:           OK (0.29s)
           97 ms ± 3.3 ms
        4000:           OK (0.53s)
          175 ms ± 5.0 ms
    openers and closers multiple of 3
      commonmark        
        1000:           OK (0.37s)
          3.0 ms ± 144 μs
        2000:           OK (0.32s)
           10 ms ± 824 μs
        3000:           OK (0.35s)
           23 ms ± 1.2 ms
        4000:           OK (0.12s)
           40 ms ± 2.7 ms
    link openers and emph closers
      commonmark        
        1000:           OK (0.30s)
          1.1 ms ± 113 μs
        2000:           OK (0.30s)
          2.3 ms ± 213 μs
        3000:           OK (0.23s)
          3.6 ms ± 190 μs
        4000:           OK (0.31s)
          4.9 ms ± 295 μs
    nested brackets     
      commonmark        
        1000:           OK (0.44s)
          6.8 ms ± 251 μs
        2000:           OK (0.38s)
           25 ms ± 925 μs
        3000:           OK (0.17s)
           56 ms ± 3.0 ms
        4000:           OK (0.29s)
           96 ms ± 5.9 ms
    inline link openers without closers
      commonmark        
        1000:           OK (0.55s)
          2.1 ms ±  71 μs
        2000:           OK (0.28s)
          4.4 ms ± 291 μs
        3000:           OK (0.44s)
          6.8 ms ± 256 μs
        4000:           OK (0.15s)
           10 ms ± 729 μs
    repeated pattern '[ (]('
      commonmark        
        1000:           OK (0.36s)
          1.4 ms ±  48 μs
        2000:           OK (0.37s)
          2.8 ms ± 187 μs
        3000:           OK (0.26s)
          4.3 ms ± 170 μs
        4000:           OK (0.39s)
          5.9 ms ± 239 μs
    nested block quotes 
      commonmark        
        1000:           OK (0.49s)
          937 μs ±  28 μs
        2000:           OK (0.27s)
          2.0 ms ± 162 μs
        3000:           OK (0.85s)
          3.3 ms ± 183 μs
        4000:           OK (0.15s)
          4.7 ms ± 387 μs
    nested list         
      commonmark        
        1000:           OK (0.14s)
          529 μs ±  50 μs
        2000:           OK (0.40s)
          769 μs ±  25 μs
        3000:           OK (0.27s)
          1.0 ms ±  50 μs
        4000:           OK (0.33s)
          1.3 ms ±  57 μs
    nested list 2       
      commonmark        
        1000:           OK (0.34s)
          2.8 ms ± 118 μs
        2000:           OK (0.42s)
          6.4 ms ± 395 μs
        3000:           OK (0.32s)
           10 ms ± 620 μs
        4000:           OK (0.48s)
           15 ms ± 518 μs
    backticks           
      commonmark        
        1000:           OK (0.20s)
          400 μs ±  29 μs
        2000:           OK (0.23s)
          857 μs ±  52 μs
        3000:           OK (0.32s)
          1.2 ms ±  43 μs
        4000:           OK (0.22s)
          1.6 ms ± 147 μs
    CDATA               
      commonmark        
        1000:           OK (0.23s)
          898 μs ±  74 μs
        2000:           OK (0.23s)
          1.8 ms ±  89 μs
        3000:           OK (0.37s)
          2.9 ms ± 127 μs
        4000:           OK (0.49s)
          3.8 ms ± 173 μs
    <?                  
      commonmark        
        1000:           OK (24.57s)
          1.5 ms ±  42 μs
        2000:           OK (0.81s)
          3.1 ms ± 187 μs
        3000:           OK (0.17s)
          5.5 ms ± 398 μs
        4000:           OK (0.84s)
          6.3 ms ± 305 μs
    <!A                 
      commonmark        
        1000:           OK (0.31s)
          1.2 ms ± 105 μs
        2000:           OK (0.32s)
          2.4 ms ± 132 μs
        3000:           OK (0.24s)
          3.7 ms ± 262 μs
        4000:           OK (0.33s)
          5.0 ms ± 423 μs
                        
All 74 tests passed (54.37s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.37s)
     50 ms ± 4.2 ms                
  commonmark +autolink:   OK (0.36s)
     52 ms ± 2.5 ms                
  commonmark +attributes: OK (1.47s)
     47 ms ± 1.9 ms                
  commonmark +pipe_table: OK (0.32s)
     43 ms ± 4.1 ms                
                                   
All 4 tests passed (2.55s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 3 action(s).
After
$ stack bench
commonmark           > benchmarks
Running 1 benchmarks...
Benchmark benchmark-commonmark: RUNNING...
All         
  tokenize  
    tokenize sample.md: OK (0.24s)
      3.7 ms ± 221 μs
  parse sample.md
    commonmark default: OK (0.14s)
       47 ms ± 4.6 ms
  pathological
    nested strong emph
      commonmark
        1000:           OK (0.28s)
          1.1 ms ±  43 μs
        2000:           OK (0.62s)
          2.5 ms ± 181 μs
        3000:           OK (0.44s)
          3.4 ms ± 198 μs
        4000:           OK (0.30s)
          4.6 ms ± 213 μs
    many emph closers with no openers
      commonmark
        1000:           OK (0.55s)
          1.1 ms ±  50 μs
        2000:           OK (0.28s)
          2.1 ms ± 112 μs
        3000:           OK (0.21s)
          3.2 ms ± 300 μs
        4000:           OK (0.28s)
          4.4 ms ± 240 μs
    many emph openers with no closers
      commonmark
        1000:           OK (0.26s)
          1.0 ms ±  46 μs
        2000:           OK (0.27s)
          2.1 ms ± 144 μs
        3000:           OK (0.43s)
          3.3 ms ± 209 μs
        4000:           OK (0.29s)
          4.3 ms ± 199 μs
    many link closers with no openers
      commonmark
        1000:           OK (0.33s)
          1.2 ms ±  74 μs
        2000:           OK (0.35s)
          2.6 ms ± 193 μs
        3000:           OK (0.31s)
          4.7 ms ± 275 μs
        4000:           OK (0.75s)
          5.4 ms ± 151 μs
    many link openers with no closers
      commonmark
        1000:           OK (0.34s)
          1.3 ms ±  48 μs
        2000:           OK (0.35s)
          2.6 ms ± 123 μs
        3000:           OK (0.16s)
          5.1 ms ± 499 μs
        4000:           OK (0.38s)
          5.6 ms ± 261 μs
    mismatched openers and closers
      commonmark
        1000:           OK (0.37s)
           11 ms ± 624 μs
        2000:           OK (0.14s)
           45 ms ± 4.5 ms
        3000:           OK (0.30s)
           98 ms ± 3.6 ms
        4000:           OK (0.54s)
          179 ms ± 8.1 ms
    openers and closers multiple of 3
      commonmark
        1000:           OK (0.19s)
          3.1 ms ± 229 μs
        2000:           OK (0.36s)
           12 ms ± 1.0 ms
        3000:           OK (0.34s)
           22 ms ± 1.5 ms
        4000:           OK (0.61s)
           40 ms ± 1.6 ms
    link openers and emph closers
      commonmark
        1000:           OK (0.63s)
          1.1 ms ± 103 μs
        2000:           OK (0.32s)
          2.4 ms ± 104 μs
        3000:           OK (0.94s)
          3.6 ms ± 196 μs
        4000:           OK (0.65s)
          5.0 ms ± 249 μs
    nested brackets
      commonmark
        1000:           OK (0.45s)
          6.8 ms ± 464 μs
        2000:           OK (0.18s)
           25 ms ± 1.8 ms
        3000:           OK (0.17s)
           55 ms ± 4.8 ms
        4000:           OK (0.29s)
           96 ms ± 4.5 ms
    inline link openers without closers
      commonmark
        1000:           OK (0.29s)
          2.1 ms ± 130 μs
        2000:           OK (0.30s)
          4.5 ms ± 251 μs
        3000:           OK (0.46s)
          6.9 ms ± 558 μs
        4000:           OK (0.63s)
          9.8 ms ± 402 μs
    repeated pattern '[ (]('
      commonmark
        1000:           OK (0.38s)
          1.5 ms ±  55 μs
        2000:           OK (0.39s)
          2.8 ms ± 238 μs
        3000:           OK (0.29s)
          4.3 ms ± 279 μs
        4000:           OK (0.40s)
          6.0 ms ± 554 μs
    nested block quotes
      commonmark
        1000:           OK (0.50s)
          941 μs ±  23 μs
        2000:           OK (0.54s)
          2.0 ms ±  43 μs
        3000:           OK (0.21s)
          3.2 ms ± 316 μs
        4000:           OK (0.29s)
          4.2 ms ± 223 μs
    nested list
      commonmark
        1000:           OK (0.28s)
          529 μs ±  24 μs
        2000:           OK (0.42s)
          813 μs ±  47 μs
        3000:           OK (0.30s)
          1.1 ms ±  66 μs
        4000:           OK (0.36s)
          1.3 ms ± 128 μs
    nested list 2
      commonmark
        1000:           OK (0.36s)
          2.8 ms ± 108 μs
        2000:           OK (0.42s)
          6.3 ms ± 387 μs
        3000:           OK (1.33s)
           10 ms ± 167 μs
        4000:           OK (0.49s)
           15 ms ± 796 μs
    backticks
      commonmark
        1000:           OK (0.21s)
          406 μs ±  23 μs
        2000:           OK (0.22s)
          862 μs ±  57 μs
        3000:           OK (0.33s)
          1.3 ms ±  57 μs
        4000:           OK (0.23s)
          1.7 ms ±  97 μs
    CDATA   
      commonmark
        1000:           OK (0.24s)
          916 μs ±  43 μs
        2000:           OK (0.49s)
          1.8 ms ±  51 μs
        3000:           OK (0.38s)
          2.9 ms ±  93 μs
        4000:           OK (0.50s)
          3.8 ms ±  93 μs
    <?      
      commonmark
        1000:           OK (0.40s)
          1.5 ms ±  53 μs
        2000:           OK (0.40s)
          3.0 ms ±  95 μs
        3000:           OK (0.31s)
          4.6 ms ± 390 μs
        4000:           OK (0.22s)
          6.2 ms ± 570 μs
    <!A     
      commonmark
        1000:           OK (0.32s)
          1.2 ms ±  55 μs
        2000:           OK (0.33s)
          2.4 ms ± 144 μs
        3000:           OK (0.48s)
          3.7 ms ± 157 μs
        4000:           OK (0.33s)
          4.9 ms ± 202 μs
            
All 74 tests passed (28.22s)
Benchmark benchmark-commonmark: FINISH
commonmark-extensions> benchmarks
Running 1 benchmarks...            
Benchmark benchmark-commonmark-extensions: RUNNING...
All                                
  commonmark +smart:      OK (0.78s)
     50 ms ± 2.0 ms                
  commonmark +autolink:   OK (0.39s)
     53 ms ± 4.1 ms                
  commonmark +attributes: OK (0.74s)
     48 ms ± 1.0 ms                
  commonmark +pipe_table: OK (0.35s)
     47 ms ± 3.0 ms                
                                   
All 4 tests passed (2.25s)         
Benchmark benchmark-commonmark-extensions: FINISH
Completed 2 action(s).

Fixes #136

This works by re-parsing the tokens that come after the link, but only when the end delimiter isn't on a chunk boundary (since that's the only way this problem can happen).

Re-parsing a specific chunk won't work, because the part that needs to be re-interpreted can span more than one chunk. For example, we can draw the bounds of the erroneous code chunk in this example:

[x](`) <a href="`">
    ^-----------^

If we re-parse the underlined part in isolation, we'll fix the first link, but won't find the HTML (since the closing angle bracket is in the next chunk).
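To make the boundary check concrete, here is a toy model of the idea, not the actual commonmark-hs code. It assumes a first pass that only groups single-backtick code spans; the names `Chunk`, `chunk`, `boundaries`, and `chunksAfterLink` are hypothetical and exist only for this sketch.

```haskell
module Main where

-- Toy model: a chunk is a code span grouped by the first pass,
-- or a single literal character. (Hypothetical types, not commonmark-hs.)
data Chunk = Code String | Plain Char
  deriving (Eq, Show)

chunkText :: Chunk -> String
chunkText (Code s)  = s
chunkText (Plain c) = [c]

-- First pass: group single-backtick code spans; everything else stays literal.
chunk :: String -> [Chunk]
chunk [] = []
chunk ('`' : rest) =
  case break (== '`') rest of
    (body, _ : rest') -> Code ('`' : body ++ "`") : chunk rest'
    (_, [])           -> Plain '`' : chunk rest   -- unmatched backtick
chunk (c : rest) = Plain c : chunk rest

-- Source offsets at which a chunk starts or ends.
boundaries :: [Chunk] -> [Int]
boundaries = scanl (+) 0 . map (length . chunkText)

-- Chunks that follow a link whose closing ')' ends just before offset `end`.
-- If `end` is a chunk boundary, the first-pass chunks are reused;
-- otherwise everything after the link is chunked again from scratch.
chunksAfterLink :: Int -> String -> [Chunk]
chunksAfterLink end src
  | end `elem` bs = map snd (dropWhile ((< end) . fst) (zip bs cs))
  | otherwise     = chunk (drop end src)
  where
    cs = chunk src
    bs = boundaries cs

main :: IO ()
main = do
  let src = "[x](`) <a href=\"`\">"
  print (chunk src)              -- first pass: one code chunk swallows the ')'
  print (chunksAfterLink 6 src)  -- offset 6 is just past the link's ')'
```

In this model the first pass groups everything from the first backtick to the second into one code chunk, so offset 6 is not a boundary. Re-chunking from that offset leaves the remaining backtick unmatched, which means the text after the link is all literal characters and a later pass is free to recognise the HTML tag.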

On the other hand, parsing links, code, and HTML in a single pass would make writing extensions more complicated. For example, LaTeX math is supposed to have the same binding strength as code spans:

$first[$](about)
^------^ this is a math span, not a link

[first]($)$5/8$
        ^-^ this is an analogue of the original bug
            it shouldn't be a math span, but looks like one
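The two math examples can be checked against a similar toy model. This is only a sketch under the assumption that an extension adds `$` to the set of tightest-binding delimiters; `chunkWith`, `Span`, and `insideSpan` are made-up names, not part of commonmark-hs or commonmark-extensions.

```haskell
module Main where

-- A span grouped by the first pass (tagged with its delimiter),
-- or a single literal character. Hypothetical types for illustration.
data Chunk = Span Char String | Plain Char
  deriving (Eq, Show)

-- First pass, parameterised by the delimiters that bind tightest:
-- "`" for the core syntax, "`$" once a math extension is assumed.
chunkWith :: [Char] -> String -> [Chunk]
chunkWith _ [] = []
chunkWith ds (c : rest)
  | c `elem` ds
  , (body, _ : rest') <- break (== c) rest
      = Span c (c : body ++ [c]) : chunkWith ds rest'
  | otherwise = Plain c : chunkWith ds rest

chunkLen :: Chunk -> Int
chunkLen (Span _ s) = length s
chunkLen (Plain _)  = 1

-- True when the offset is not a chunk boundary, i.e. it falls inside a span.
insideSpan :: [Char] -> Int -> String -> Bool
insideSpan ds off src =
  off `notElem` scanl (+) 0 (map chunkLen (chunkWith ds src))

main :: IO ()
main = do
  -- The math span wins, so the '[' never gets a chance to open a link.
  print (take 1 (chunkWith "`$" "$first[$](about)"))
  -- Offset 10 is just past the link's ')'; it lands inside the "$)$" span.
  print (insideSpan "`$" 10 "[first]($)$5/8$")
  -- Without the math delimiter there is no span to collide with.
  print (insideSpan "`" 10 "[first]($)$5/8$")
```

The point of keeping the delimiter set as a parameter is that the boundary check itself does not need to know about math: the same test flags the collision for any span type the first pass produces.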

notriddle avatar Jan 18 '24 23:01 notriddle

@jgm other than the merge conflict in regression.md (a simple rebase), is there anything blocking this fix?

notriddle avatar Jul 04 '24 05:07 notriddle

This probably just arrived at a busy time; I will look at it. It would help if you could rebase.

jgm avatar Jul 04 '24 06:07 jgm

Okay, it’s rebased.

notriddle avatar Jul 16 '24 02:07 notriddle

Any feedback on this PR?

notriddle avatar Sep 11 '24 15:09 notriddle

Sorry, I was on vacation when this posted and it got ignored. I'll try to have a look in the near future!

jgm avatar Sep 11 '24 15:09 jgm

Do the new comments and variable names help?

notriddle avatar Sep 11 '24 17:09 notriddle

Yes, this looks great, thanks!

jgm avatar Sep 11 '24 18:09 jgm