cmark More conditional sourcepos

trafficstars

There you go, three commits:

The source map code, which generation is now conditional to the SOURCEPOS option, when not enabled make bench results are identical to the results without the patch. Note that I had a bunch of follow-up commits on my end which I ended up fixing up with that patch, mostly fixes except for one patch that defines reference nodes. I also squashed up @nwellnhof's commits in there, please tell me if you mind.
The wrapper code, also updated but nothing major at all
The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.

Jan 04 '17 23:01 MathieuDuponchelle

That is a timeout, pretty sure the code is fine, @jgm if you feel like triggering a new run :)

Jan 05 '17 01:01 MathieuDuponchelle

Rebased and pushed, no timeout this time :)

Jan 05 '17 15:01 MathieuDuponchelle

Is it possible to break this into smaller, logical chunks? This is currently such a big ball of code, touching so many parts of the code base, that it's hard to survey.

For example, I see changes in buffer.h like this:

-  /* Oversize the buffer by 50% to guarantee amortized linear time
 -   * complexity on append operations. */
 -  bufsize_t new_size = target_size + target_size / 2;
 -  new_size += 1;
 -  new_size = (new_size + 7) & ~7;
 +  // Oversize the buffer by 50% to guarantee amortized linear time
 +  // complexity on append operations.
 +  bufsize_t add = target_size / 2;
 +  // Account for terminating NUL byte.

I wonder: Is that a mistake, going from target_size + target_size / 2 to target_size / 2? If it's intentional, what's the motivation? There is no explanation of this in the commit history.

That's just one example.

Can we have a commit, for example, that just adds reference nodes? After all, that's a single logical change that might make sense completely indepenedently of the "extent" stuff.

As for the demo script, I wasn't sure from your description what it did. But I think it would be better to develop it outside of cmark itself, so we don't get cmark too cluttered up. It certainly shouldn't go in the src/ tree with the other library code.

Jan 05 '17 20:01 jgm

@jgm, the first change you mention is from the commit from @nwellnhof that you reverted, which you had squashed together from a bunch of his commits ..

Jan 05 '17 20:01 MathieuDuponchelle

My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say

Jan 05 '17 20:01 MathieuDuponchelle

As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent.

Jan 05 '17 20:01 MathieuDuponchelle

+++ Mathieu Duponchelle [Jan 05 17 12:36 ]:

[1]@jgm, the first change you mention is from the commit from [2]@nwellnhof that you reverted, which you had squashed together from a bunch of his commits ..

I shouldn't have done that - GitHub now makes squashing the default, and I sometimes forget that.

Looking back at his commit, and looking at the code in more detail, I see that this isn't a change in functionality; it's just that he's operating with the delta rather than the total size (as in the original).

Still, could you divide this into some logically separated changes? At the very least we should have the introduction of reference nodes as a separate commit, and nwellnhof's changes a separate commit; the extent changes should not be mixed with these. And as I suggested the demo program should not be in this PR.

Jan 05 '17 20:01 jgm

+++ Mathieu Duponchelle [Jan 05 17 12:40 ]:

My initial branch was also 4 commits that you ended up squashing as one, I had re-extracted the python stuff from it as it clearly needed to be separate, but squashing things together, merging them, reverting them then asking me to re-split them up is a bit frustrating I have to say

I understand that, the frustration is legitimate and I take the blame. Still, I think going forward, for maintainability and for code review, we need smaller, more atomic, commits. (And this has generally what has made it hard for me to process your PRs in the past.)

It's unfair to ask you to re-split this, so I won't. But I'm also reluctant to merge the ball-of-twine without some splitting, and I don't know if I'll get around to doing it any time soon, because this isn't a high-priority feature for me.

Jan 05 '17 20:01 jgm

+++ Mathieu Duponchelle [Jan 05 17 12:42 ]:

As for putting the formatting script in src/, I put it there because the cmark executable is also located in there, even though it's not part of the library, just wanted to be consistent.

Yes, that's true. But I think a python script belongs somewhere else. (And really, probably not in this repository...it makes more sense to me for you to have your own repository for that, where you can develop it as you see fit and rapidly.)

Jan 05 '17 20:01 jgm

And this has generally what has made it hard for me to process your PRs in the past.

Really ?? My other unmerged PR is split into 20 commits, the source pos stuff was initially four commits!

Jan 05 '17 20:01 MathieuDuponchelle

+++ Mathieu Duponchelle [Jan 04 17 15:30 ]:

The actual fancy stuff, a reindenting script, tested on every test of the spec with 1 to 80 columns, the test checks that the new AST is identical to the old one, minus softbreaks. The actual test I propose only checks for reindenting on one column, because 80 * 610 starts to take a little time I'm afraid. Note that the script could break in a few more places than it currently does, and certainly be massively optimized, but I think it's already in a state worth merging because, as far as I can tell, it is now provably reliable.

What would help me is to hear a bit more about how this differs from what you can do already with cmark's commonmark -> commonmark transitions. I assume the answer is that your script won't make adventitious changes to things like list indentation, list number delimiter, type of header, etc.?

Jan 05 '17 20:01 jgm

What would help me is to hear a bit more about how this differs from what you can do already with cmark's commonmark -> commonmark transitions. I assume the answer is that your script won't make adventitious changes to things like list indentation, list number delimiter, type of header, etc.?

Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors.

Jan 05 '17 20:01 MathieuDuponchelle

Let me try out the formatting script and see if it is something that would make sense to keep in the repo as a demo.

Jan 05 '17 20:01 jgm

Example use case:

Let's say you're the maintainer of "progit", and want to enforce a certain set of rules (no soft breaks, lists have to use * as the marker, that sort of things, you could also want to follow this markdown style guide) , and the last thing you want to do when reviewing contributions is nitpick all day long about style issues.

The solution I'd like to adopt as such a maintainer, is to have a pre-commit hook that runs cmark-format (which could figure out the style from a well-known-named configuration file for example), and be reasonably certain that any commits submitted for inclusion have gone through and be validated by the formatting script.

On the contributor side, you could use the tool to automatically refactor your changes in order to make them comply with the required style, which would thus eliminate tedious and error-prone manual changes.

Jan 05 '17 21:01 MathieuDuponchelle

+++ Mathieu Duponchelle [Jan 05 17 12:56 ]:

Yes, but far from only that, the end goal with this is to have a tool similar to clang-format for commonmark, allowing to control the style in large projects with multiple contributors.

Yes, this would be useful and it's certainly something people have wanted.

Jan 05 '17 21:01 jgm

@jgm, well that's what this "demo" script aims at becoming, it only handles width for now because that was the only difficult thing to do, the rest can easily be added by anyone with minimum python's working knowledge, which is part of the reason why I submitted it now, the other being indeed that I wanted to demo the script to you as I figured you would see the point :)

Jan 05 '17 21:01 MathieuDuponchelle

Another important thing with that script is also that it does not do unnecessary escaping, which is really what bugged me with -t commonmark, ie given:

foo > bar

--width 80 will give you:

foo > bar

whereas --width 1 will correctly give you:

foo
\>
bar

Jan 05 '17 21:01 MathieuDuponchelle

If splitting is too difficult, could you at least amend the commit message to include a more detailed accounting of the changes made and their motivations? The changelog will need to contain something like that, in any case, and I usually construct that from commit messages.

All API changes, especially, need to be very carefully documented in the changelog.

In this case that would include the addition of reference nodes and all the new extent-related API.

Jan 05 '17 21:01 jgm

If splitting is too difficult, could you at least amend the commit message to include a more detailed accounting of the changes made and their motivations? The changelog will need to contain something like that, in any case, and I usually construct that from commit messages.

Sure, I didn't do that because I figured it would all get squashed anyway, I usually tend to be quite specific about these things, but I understand now that you got tricked by github :)

I'll see what I can do for splitting up the commit anyway.

Jan 05 '17 21:01 MathieuDuponchelle

@jgm, better like that ?

Jan 06 '17 21:01 MathieuDuponchelle

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository? Two more questions occur to me:

With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?
Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

Jan 07 '17 11:01 jgm

Much better, thanks! This looks good, but can you remove the cmark-format part and put it in a separate repository?

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

With the addition of reference nodes, the xml output will not match the DTD (CommonMark.dtd in jgm/CommonMark). We could add reference nodes to the DTD, or alternatively we could simply skip them in xml output. What do you think?

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Previously we had start/end line/column info on nodes, with an option to print this as attributes in HTML and XML. (The information was also accessible via API functions exposed in cmark.h.) Can you explain what has become of these? It seems to me that they should be obsoleted by your new sourcepos functionality. It does no good to have two separate mechanisms for tracking source pos. Unless you're using your sourcepos information to populate these, I think we should just remove all the legacy sourcecode stuff on nodes, and the code in parsers that populates these.

We will want to update that, however I'm not sure in which manner. Attributes in the html output are definitely helpful for things like side-by-side autoscrolling, syntax highlighting etc, and we want some form of this. A few questions:

The current API advertises line / columns indices, however it is a bit misleading as shown by:

 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat test.md
’
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark --sourcepos test.md 
<p data-sourcepos="1:1-1:3">’</p>
 meh   more-conditional-sourcepos  ~  devel  cmark 

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

The current line / column API doesn't break down extents for nodes, for example:

 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat test.md
>     foo
>     bar
 meh   more-conditional-sourcepos  ~  devel  cmark  1  ./build/src/cmark --sourcepos test.md 
<blockquote data-sourcepos="1:1-2:9">
<pre data-sourcepos="1:3-2:9"><code>foo
bar
</code></pre>
</blockquote>
 meh   more-conditional-sourcepos  ~  devel  cmark  1 

I think that's actually an interesting level of details, and can complement the more detailed new API by letting blocks overlap, this could be trivially implemented in the add_extent source map method, which would, whenever a new extent is added, update the start offset of the associated node, and recurse up its ancestors to update their stop offset.

Jan 09 '17 17:01 MathieuDuponchelle

@jgm, can you answer that last comment ?

Jan 13 '17 17:01 MathieuDuponchelle

Of course, but I would really have liked to have it live alongside the library in order to ensure its test suite ran upon new commits / spec changes :) I would not mind at all moving the code outside of src/ , and I'd stick around for maintaining it, do you have any particular reason not to merge it under these conditions? As I explained in earlier comments, I think it would be a valuable tool to offer. By the way, you said you'd give it a try, did you find any issues that made you want to hold back on it?

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ). It also seems a bit strange to have it written in Python, in the context of cmark. I haven't had a chance yet to look at it carefully, though, and see what it does.

Question: could some of its aims be achieved by modifying cmark's existing commonmark renderer, or a modification of it? E.g. you could pass in the extents + the original source, and the renderer could peek in the source to see if a backslash escape was used, check indentations, bullet types, and so on, and try to preserve these in the output. One thing the existing renderer does very well I think is line breaks.

My initial revision did not include it in the xml output, but my opinion is that this format should expose as faithfully as possible the ast, I am mainly using it for ease of development and debugging, and I like to have as much information as possible in that case. If you agree with this, I will update the schema.

Yes, I think this makes sense.

my expectation of a column API would be that utf8 is accounted for, however that is not the case. I would rework this to simply expose byte offsets.

Right. That makes more sense, I agree.

The current line / column API doesn't break down extents for nodes, for example:

Wasn't sure what you meant here, can you explain?

Jan 13 '17 21:01 jgm

I did try it and one thing I remember noticing were lots of line breaks in bad places, e.g. between a word and ).

Weird, can you please give specific examples, cause that is just not the case here, example:

 meh   more-conditional-sourcepos  ~  devel  cmark  1  cat plop.md 
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
foo(bar)baz
 meh   more-conditional-sourcepos  ~  devel  cmark 

Jan 13 '17 21:01 MathieuDuponchelle

@jgm, maybe you're thinking about links, but IMHO my tool's behaviour is way better than cmark -t commonmark in that case:

 meh   more-conditional-sourcepos  ~  devel  cmark  130  cat plop.md 
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark plop.md --width 1 -t commonmark
[foo](bar)
 meh   more-conditional-sourcepos  ~  devel  cmark  ./build/src/cmark-format plop.md --width 1
[
foo
](
bar
)
 meh   more-conditional-sourcepos  ~  devel  cmark 

ie, my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

Jan 13 '17 22:01 MathieuDuponchelle

It also seems a bit strange to have it written in Python, in the context of cmark.

Well, the thing is I needed some sort of hash table for example, strict C89 with no dependencies makes things a bit awkward. As this is a developer tool, I considered that speed was not an absolute necessity. By the way, the glib now builds with no issues in Visual Studio ;)

Jan 13 '17 22:01 MathieuDuponchelle

my tool tries its best to stick to the requested width, and will break on punctuation, cmark will not

I disagree that this is better behavior. These are cases where both tools are going above the requested width (and such cases are going to be inevitable). cmark avoids the awkward punctuation breaks. I personally think we should never break after the [ or ( or before the ] or ) in a link.

Here's another example where I think cmark does the right thing:

cmark:

It provides a shared library
(`libcmark`) with functions for parsing

your tool:

It provides a shared library (
`libcmark`) with functions for parsing

Surely in this case the line break after the opening ( is undesirable! (I got this with cmark-format --width=40 README.md). Another awkward case:

- **Robust.** The library has been
  extensively fuzz-tested using [
  american fuzzy lop]. The test suite

I don't think anyone will think that's a desirable place to break.

Of course, all these things are easily fixable.

Jan 15 '17 10:01 jgm

I'm still not completely decided about whether it makes sense to keep cmark-format in the cmark repository, but at the very least I think it should be in a separate directory with a separate CMakeLists.txt.

Jan 15 '17 10:01 jgm

Hey folks, out of curiosity, have there been any updates on this? It looks like the code is pretty much ready to go, and I’d love to be able to use source positions in a project I’m working on pretty soon. Any ideas about when it might land on master, and/or make it into a “released” version?

Thanks!

Jan 31 '17 21:01 mattpat

cmark cmark copied to clipboard

More conditional sourcepos

cmark
cmark copied to clipboard