cmark
cmark copied to clipboard
Fix source positions for inlines.
This PR fixes the source positions for:
- Inlines inside inconsistently indented blocks.
- Links/images.
- Autolinks.
- Escaped newlines.
- Non-matched backticks.
Fixing the source positions for inlines inside inconsistently indented blocks is accomplished by maintaining the leading trivia when constructing the block (see add_line
in blocks.c), and removing it during the inline parsing stage (see changes in inlines.c). The three source position tests added demonstrate the current implementation's shortfalls.
These changes are minimally invasive, and thus, slightly degrade performance—due to stripping the leading whitespace during the inline parsing phase. This can be avoided for normal parse/render operations by wrapping the functionality introduced in this PR with the conditional if (CMARK_OPT_SOURCEPOS & options)
. However, this will lead to the output of invalid source positions when the CMARK_OPT_SOURCEPOS
option is specified for a rendering operation (i.e. cmark_render_xml
) but not the original parsing operation (i.e. cmark_parse_document
).
The performance penalty encountered when running the benchmark 100 times is ~14ms to the mean execution time (and ~10ms to the median). Given that benchinput.md is 110.6 MB containing 1,484,151 lines (including blank lines), this is a minuscule penalty.
That being said, I still advocate only using this functionality when the CMARK_OPT_SOURCEPOS
option is specified for the original parsing operation (as described above).
As an additional thought, aside from complexity, the overhead of introducing some other means of tracking the starting columns for each input line https://github.com/commonmark/cmark/issues/296#issuecomment-487398016 may have a worse penalty than just stripping whitespace—thoughts?
System:
- OS: macOS Mojave (10.14.4)
- Model Identifier: MacBookPro12,1
- Processor: 2.7 GHz Intel Core i5
- Memory: 8 GB 1867 MHz DDR3
This PR
sudo renice -10 $$; make bench
{ for x in `seq 1 100` ; do \
/usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
/usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
done \
} 2>&1 | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5747, median = 1.5700, stdev = 0.0258
cmark master (a61c4902f07789d40a717daef710da29e7899485)
sudo renice -10 $$; make bench
Password:
{ for x in `seq 1 100` ; do \
/usr/bin/env time -p build/src/cmark --sourcepos </dev/null >/dev/null ; \
/usr/bin/env time -p build/src/cmark --sourcepos bench/benchinput.md >/dev/null ; \
done \
} 2>&1 | grep 'real' | awk '{print $2}' | python3 'bench/stats.py'
mean = 1.5607, median = 1.5600, stdev = 0.0261
It's a small enough performance difference that I don't think we need to worry about it.
There's still the issue whether to make the parser, as well as the renderer, sensitive to CMARK_OPT_SOURCEPOS. Is there any reason, besides performance, to do this?
Not ostensibly. The only other reason supporting sensitivity is the isolation it affords from any potentially introduced side effects of the functionality. That being said, if there is confidence in the comprehensiveness of the unit tests, then this is no concern.
if there is confidence in the comprehensiveness of the unit tests, then this is no concern.
@jgm @chriszielinski There is high confidence that the patch breaks nothing while improving a lot. Some minor sourcepos issues likely remain, but the current status is just broken. Will explain below together with some more rationale why this patch is critical for some users and good for cmark's reputation.
Why the confidence:
- The existing tests passes before and after applying the patch - for the sake of completeness.
- When I try to hardcode
SOURCEPOS
option value on parser only (cmark_parser_new_with_mem
), the tests pass both with this flag asserted and cleared. Which does not mean much more than the previous point, as this patch does not evaluate it. It means the already existingcode
sourcepos is not test-covered. :) - We had made an integration test on around ~250k Markdown texts, where we tried to locate all node delimiters using sourcepos. It is based on patched cmark-gfm, but this patch was included without changes. See the results in cmark-gfm-sourcepos-histogram.md.txt. My interpretation:
- Cmark parsing is likely as expected even on large scale, because the delimiter chars located using sourcepos are as expected.
- This mass test actually cross-checks both parsing and sourcepos generating and would not be even possible without correct sourcepos.
- The sourcepos implementation for links and footnote definitions should be checked. But this does not worsen the current situation.
I think the histogram speaks for itself and I was actually impressed by the resulting accuracy, especially because most of the texts contain UTF-8 outside US-ASCII.
Why to merge this patch: It allows valuable use cases as our GfmEditor. This tool actually serves for migrating from Redcarpet :). Currently part of a Redmine plugin, but we can make it a separate tool for users and the uses cases can go beyond Redcarpet migration. You can check its description here.
Why not to wait (my 2 cents):
- The confidence in not breaking things has more sources than unit tests and leak check.
- As agreed above, the perfomance penalty is not an issue.
- Sourcepos is a game-changing feature for some users, but it is currently just broken
- And I totally understand that
CMARK_OPT_SOURCEPOS
should be more considered as a parser option and it should be done subsequently, but ironically it is a reason not to wait, as the option can't even be communicated to the users atm. The current situation is:- When I do not enable
CMARK_OPT_SOURCEPOS
on parser, sourcepos is broken on all inlines. - When I do enable
CMARK_OPT_SOURCEPOS
on parser, sourcepos gets improved oncode
, but remains broken on all inlines.
- When I do not enable
Hope it won't be turned down because our work was made on the cmark-gfm
fork. But everything except extensions applies to this repo as well.
Thank you for your work and let me know if I can help with something. :)