comrak icon indicating copy to clipboard operation
comrak copied to clipboard

Unexpected sourcepos.

Open ttakuru88 opened this issue 1 year ago • 6 comments

Hi. "sourcepos" is exactly what I was looking for. What do you think of this result?

markdown

[AB
CD](/)

my expected

<p data-sourcepos="1:1-2:6"><a data-sourcepos="1:1-2:6" href="/">AB
CD</a></p>

comrak v0.18.0

<p data-sourcepos="1:1-2:6"><a data-sourcepos="2:1-2:6" href="/">AB
CD</a></p>

Similar pattern.

markdown

- A
[![B](/B.png)](/B)

comrak

<ul data-sourcepos="1:1-2:18">
<li data-sourcepos="1:1-2:18">A
<a data-sourcepos="2:3-2:20" href="/B"><img data-sourcepos="2:4-2:15" src="/B.png" alt="B" /></a></li>

cmark

<ul data-sourcepos="1:1-2:18">
<li data-sourcepos="1:1-2:18">A
<a href="/B"><img src="/B.png" alt="B" /></a></li>
</ul>

ttakuru88 avatar May 03 '23 13:05 ttakuru88

That is a bit surprising! It's actually almost per upstream. Here's cmark-gfm:

$ cat o
[AB
CD](/)

$ build/src/cmark-gfm o -t xml --sourcepos
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-2:6" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-2:6">
    <link sourcepos="2:1-2:6" destination="/" title="">
      <text sourcepos="1:2-1:3" xml:space="preserve">AB</text>
      <softbreak />
      <text sourcepos="2:1-2:2" xml:space="preserve">CD</text>
    </link>
  </paragraph>
</document>

$ build/src/cmark-gfm o --sourcepos
<p data-sourcepos="1:1-2:6"><a href="/">AB
CD</a></p>

So, in terms of the AST, comrak is doing the same thing as upstream, as shown in <link sourcepos="2:1-2:6"; the difference is that upstream doesn't actually put sourcepos on <a> in output HTML at all.

I'd be happy with fixing this in either direction; either fix how we record sourcepos on the link node, or stop outputting in it HTML. (It's possible that "fixing" the former might have unintended consequences.)

kivikakk avatar May 04 '23 01:05 kivikakk

I see. I prefer the latter(stop outputting). I don't think writing logic that has nothing to do with the upstream is a good idea since it will make things complex.

ttakuru88 avatar May 04 '23 05:05 ttakuru88

Me too! PRs happily accepted; I don't have the time for this right now myself.

kivikakk avatar May 04 '23 07:05 kivikakk

Wait...the idea is to not have sourcepos on the <a>? Are you wanting to remove from other non-block html? Please don't...

I tried running it against markdown-it.rs, I get what was expected

<p data-sourcepos="1:1-2:6"><a data-sourcepos="1:1-2:6" href="/">AB
CD</a></p>

digitalmoksha avatar May 08 '23 20:05 digitalmoksha

In general, I'd rather "matches the upstream we explicitly follow" than not. Right now we have "doesn't match upstream and is incorrect". I'd accept a solution either way.

kivikakk avatar May 09 '23 05:05 kivikakk

But it does match the upstream, doesn't it? Block-level sourcepos works correctly I think.

Inline sourcepos is a great enhancement. I'd certainly rather see a fix, but if that's not feasible at the moment, then how about an option to enable it? Those that want it can enable it, and work on a fix. comrak is already going beyond what upstream offers (thankfully) while trying to maintain backward compatibility. So I think ripping it out should be last resort.

digitalmoksha avatar May 11 '23 22:05 digitalmoksha

Funny thing I found while getting to work on fixing this. I tried pasting the last example, i.e. this:

- A
[![B](/B.png)](/B)

directly into a GitHub comment. Imagine my surprise:

  • A B

kivikakk avatar Jul 12 '24 08:07 kivikakk

#439 fixes this, but only by no longer reporting sourcepos information for inlines in output HTML. (Block-level is unaffected, and the data's still there in the AST or XML output if you want it.)

See https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960 for comments on effort required to get inline sourcepos working consistently.

kivikakk avatar Jul 12 '24 09:07 kivikakk

@kivikakk ouch. This is a huge breaking change for us. We've built significant functionality on top of inline sourcepos. Its removal is a real problem for us.

I understand it's not 100% accurate, but it's been pretty darn accurate for most things.

Also releasing it along with the other great fixes you just put in means we can't use them.

Not sure what I'm going to do yet. Probably the only thing I can do is build our own HTML outputter. Or if an inline-sourcepos option was added to re-enable it, obviously with the caveat that it's sometimes inaccurate.

digitalmoksha avatar Jul 12 '24 14:07 digitalmoksha

I'd be happy to add it back in with an option, yeah! But "sometimes inaccurate" to me reads as "garbage in, garbage out"; of course, if your pipeline means those cases aren't hit, it's not such a big deal, but we can't even enumerate them all because there's some fundamental brokenness here. (It's why, I'm guessing, they aren't exposed in cmark.) That brokenness then becomes the sourcepos user's brokenness in unexpected ways.

So, happy to take an option to re-add them in the HTML output, but it needs to be called something like experimental-inline-sourcepos. I think removing it from the default sourcepos output is correct, and I'm sorry that this hurts your workflow!

kivikakk avatar Jul 12 '24 15:07 kivikakk

Thanks @kivikakk 🙇. I agree with naming an option experimental-inline-sourcepos.

I'm not happy with "garbage in, garbage out" either. So far we haven't hit these particular cases, but as we come to rely more heavily on it, I'm sure we will eventually run across them.

I'll see about reallocating some of my time to work on some type of fix, though I may have to lean on your expertise 😄

digitalmoksha avatar Jul 12 '24 15:07 digitalmoksha

To better clarify what we use it for, in GitLab we have a rich text editor for our markdown fields (comments, descriptions, etc). Markdown is the single source of truth, and we convert from markdown to rich text and back. There is actually a button in the edit field that allow you to switch back and forth.

Sourcepos allows us to have fidelity when converting between to the two representations - in general we don't want to change markdown the user has created but not edited. It's still a work in progress, but it's evolving.

You can test it out in any comment field in an issue, for example in https://gitlab.com/gitlab-org/gitlab/-/issues/456980

digitalmoksha avatar Jul 12 '24 15:07 digitalmoksha

I don't quite understand how sourcepos is used in that conversion (my experiments resulted in changes on every roundtrip), but I'm glad it's helpful!

kivikakk avatar Jul 12 '24 16:07 kivikakk

Yeah, it's very much a work in progress 😅

Though I am going to have to go back and adjust expectations.

digitalmoksha avatar Jul 12 '24 16:07 digitalmoksha

:D I see! It looks like you might be after more CST-like support as well (thinking of #422 here). In general I don't mind going that route, but for sourcepos fidelity, the big change forewarned about in https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960 will be unavoidable.

kivikakk avatar Jul 12 '24 16:07 kivikakk

Possibly, though I don't understand the ramifications of that yet, particularly since at the moment it's the generated HTML that gets converted into the editor.

the big change forewarned about in https://github.com/kivikakk/comrak/pull/439#issuecomment-2225129960 will be unavoidable.

yeah I figured

digitalmoksha avatar Jul 12 '24 17:07 digitalmoksha