cmark Fix source positions for inlines.

This PR fixes the source positions for:

Inlines inside inconsistently indented blocks.
Links/images.
Autolinks.
Escaped newlines.
Non-matched backticks.

Fixing the source positions for inlines inside inconsistently indented blocks is accomplished by maintaining the leading trivia when constructing the block (see add_line in blocks.c), and removing it during the inline parsing stage (see changes in inlines.c). The three source position tests added demonstrate the current implementation's shortfalls.

These changes are minimally invasive, and thus, slightly degrade performance—due to stripping the leading whitespace during the inline parsing phase. This can be avoided for normal parse/render operations by wrapping the functionality introduced in this PR with the conditional if (CMARK_OPT_SOURCEPOS & options). However, this will lead to the output of invalid source positions when the CMARK_OPT_SOURCEPOS option is specified for a rendering operation (i.e. cmark_render_xml) but not the original parsing operation (i.e. cmark_parse_document).

Apr 26 '19 18:04 chriszielinski

The performance penalty encountered when running the benchmark 100 times is ~14ms to the mean execution time (and ~10ms to the median). Given that benchinput.md is 110.6 MB containing 1,484,151 lines (including blank lines), this is a minuscule penalty.

That being said, I still advocate only using this functionality when the CMARK_OPT_SOURCEPOS option is specified for the original parsing operation (as described above).

As an additional thought, aside from complexity, the overhead of introducing some other means of tracking the starting columns for each input line https://github.com/commonmark/cmark/issues/296#issuecomment-487398016 may have a worse penalty than just stripping whitespace—thoughts?

System:

OS: macOS Mojave (10.14.4)
Model Identifier: MacBookPro12,1
Processor: 2.7 GHz Intel Core i5
Memory: 8 GB 1867 MHz DDR3

This PR

sudo renice -10 $$; make bench
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5747, median = 1.5700, stdev = 0.0258

cmark master (a61c4902f07789d40a717daef710da29e7899485)

sudo renice -10 $$; make bench
Password:
{ for x in `seq 1 100` ; do \
        /usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
        /usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
        done \
    } 2>&1  | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5607, median = 1.5600, stdev = 0.0261

May 08 '19 20:05 chriszielinski

It's a small enough performance difference that I don't think we need to worry about it.

There's still the issue whether to make the parser, as well as the renderer, sensitive to CMARK_OPT_SOURCEPOS. Is there any reason, besides performance, to do this?

May 10 '19 03:05 jgm

Not ostensibly. The only other reason supporting sensitivity is the isolation it affords from any potentially introduced side effects of the functionality. That being said, if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

May 14 '19 19:05 chriszielinski

if there is confidence in the comprehensiveness of the unit tests, then this is no concern.

@jgm @chriszielinski There is high confidence that the patch breaks nothing while improving a lot. Some minor sourcepos issues likely remain, but the current status is just broken. Will explain below together with some more rationale why this patch is critical for some users and good for cmark's reputation.

Why the confidence:

The existing tests passes before and after applying the patch - for the sake of completeness.
When I try to hardcode SOURCEPOS option value on parser only (cmark_parser_new_with_mem), the tests pass both with this flag asserted and cleared. Which does not mean much more than the previous point, as this patch does not evaluate it. It means the already existing code sourcepos is not test-covered. :)
We had made an integration test on around ~250k Markdown texts, where we tried to locate all node delimiters using sourcepos. It is based on patched cmark-gfm, but this patch was included without changes. See the results in cmark-gfm-sourcepos-histogram.md.txt. My interpretation:
- Cmark parsing is likely as expected even on large scale, because the delimiter chars located using sourcepos are as expected.
- This mass test actually cross-checks both parsing and sourcepos generating and would not be even possible without correct sourcepos.
- The sourcepos implementation for links and footnote definitions should be checked. But this does not worsen the current situation.

I think the histogram speaks for itself and I was actually impressed by the resulting accuracy, especially because most of the texts contain UTF-8 outside US-ASCII.

Why to merge this patch: It allows valuable use cases as our GfmEditor. This tool actually serves for migrating from Redcarpet :). Currently part of a Redmine plugin, but we can make it a separate tool for users and the uses cases can go beyond Redcarpet migration. You can check its description here.

Why not to wait (my 2 cents):

The confidence in not breaking things has more sources than unit tests and leak check.
As agreed above, the perfomance penalty is not an issue.
Sourcepos is a game-changing feature for some users, but it is currently just broken
And I totally understand that CMARK_OPT_SOURCEPOS should be more considered as a parser option and it should be done subsequently, but ironically it is a reason not to wait, as the option can't even be communicated to the users atm. The current situation is:
- When I do not enable CMARK_OPT_SOURCEPOS on parser, sourcepos is broken on all inlines.
- When I do enable CMARK_OPT_SOURCEPOS on parser, sourcepos gets improved on code, but remains broken on all inlines.

Hope it won't be turned down because our work was made on the cmark-gfm fork. But everything except extensions applies to this repo as well.

Thank you for your work and let me know if I can help with something. :)

Apr 13 '20 13:04 martincizek

cmark cmark copied to clipboard

Fix source positions for inlines.

System:

This PR

cmark master (a61c4902f07789d40a717daef710da29e7899485)

cmark
cmark copied to clipboard