orgmode-parse

Markup parsers should only consider markup on word boundaries

Open ixmatus opened this issue 6 years ago • 16 comments

Discovered while testing the HyperLink parser. The parser will incorrectly parse the following:

/[[https://orgmode.org/manual/Link-format.html][The Org Manual: Link format]]/

... as:

Right [Paragraph [Italic [Plain "[[https:"],Italic [Plain "orgmode.org"],Plain "manual",Italic [Plain "Link-format.html][The Org Manual: Link format]]"]]]

This should be easy to fix since formatting markup is only treated as such if the beginning sentinel character is preceded by whitespace and followed by a non-whitespace character.
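The boundary rule described above can be sketched as a small predicate. This is only an illustration, not code from orgmode-parse: `canOpenAt` and `isMarker` are hypothetical names, the marker set is an assumption, and treating the start of the line like preceding whitespace is also an assumption. Org mode's actual rule may admit more preceding characters than whitespace.

```haskell
import Data.Char (isSpace)

-- | Characters assumed to act as emphasis markers in this sketch.
isMarker :: Char -> Bool
isMarker c = c `elem` "*/_=~+"

-- | Hypothetical helper (not part of orgmode-parse): can the character at
-- index @i@ of @line@ open a markup span?  It must be a marker, be preceded
-- by whitespace (or sit at the start of the line, an assumption), and be
-- followed by a non-whitespace character.
canOpenAt :: String -> Int -> Bool
canOpenAt line i
  | i < 0 = False
  | otherwise =
      case drop i line of
        (m : next : _) -> isMarker m && precededOk && not (isSpace next)
        _              -> False
  where
    precededOk = i == 0 || isSpace (line !! (i - 1))
```

Under this predicate only the first `/` of `/http://someurl.com//` can open a span; the slashes inside the URL are preceded by non-whitespace and are rejected.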

ixmatus avatar Nov 22 '18 16:11 ixmatus

CC: @zhujinxuan (only CC'ing to let you know: I have a branch with a fix already in place, but I haven't pushed it yet because I want to add more thorough tests).

ixmatus avatar Nov 26 '18 16:11 ixmatus

Hi, I think we should write the HyperLink parser like the LaTeX parser. The elements inside the [] should be treated as Text rather than Markup Text.

zhujinxuan avatar Nov 26 '18 18:11 zhujinxuan

Like https://github.com/ixmatus/orgmode-parse/blob/master/src/Data/OrgMode/Parse/Attoparsec/Content/Markup.hs#L77-L84

zhujinxuan avatar Nov 26 '18 18:11 zhujinxuan

@zhujinxuan I agree with you. However, this is still a problem with the markup parser, as it treats some text as "marked up" that org-mode's fontification does not. I have the latter fixed and I will also implement your suggestion.

ixmatus avatar Nov 26 '18 18:11 ixmatus

@ixmatus I think we can guard that by typing. If we define

data Markup a = LaTeX Text

Then we will not need to worry about whether the content of LaTeX is parsed as markup.
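A slightly fuller sketch of that typing idea follows. It is a hypothetical simplification, not the library's actual definition: the type parameter is dropped, and the `HyperLink` constructor with its two fields and the `render` function are assumptions made for illustration. Constructors that hold `Text` directly leave nothing for the markup parser to recurse into.

```haskell
import           Data.Maybe (fromMaybe)
import           Data.Text  (Text)
import qualified Data.Text  as T

-- | Simplified, hypothetical inline type.  'LaTeX' and 'HyperLink' carry raw
-- 'Text', so their contents can never be re-parsed as nested markup.
data Markup
  = Plain Text
  | Italic [Markup]
  | Bold [Markup]
  | LaTeX Text
  | HyperLink Text (Maybe Text) -- ^ target and optional description (assumed shape)
  deriving (Show, Eq)

-- | Flatten a node to its visible text.  Only 'Italic' and 'Bold' recurse;
-- the 'Text'-carrying constructors are leaves.
render :: Markup -> Text
render (Plain t)            = t
render (Italic ms)          = T.concat (map render ms)
render (Bold ms)            = T.concat (map render ms)
render (LaTeX t)            = t
render (HyperLink url desc) = fromMaybe url desc
```

With a type like this, the link from the original report would be represented as an `Italic` wrapping a single `HyperLink` leaf, and the slashes in its URL are never seen by the emphasis parser.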

zhujinxuan avatar Nov 26 '18 21:11 zhujinxuan

I don't think that's a problem. I mean that:

/http://someurl.com//

... is parsed incorrectly. Disregarding org-mode hyperlink markup syntax, we expect /http://someurl.com// to parse to an Italic [ Plain "http://someurl.com/" ]; however, it parses into the following:

Paragraph
[ Italic [ Plain "http:"]
, Italic [Plain "someurl.com"]
, Plain "/"
]

ixmatus avatar Nov 26 '18 21:11 ixmatus

Some of the existing tests demonstrate incorrect behavior too. For instance:

*text *

... should not parse as Bold [ Plain "text" ] but as Plain "*text *".
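Both expectations (the URL case and `*text *`) can be reproduced with a small boundary-aware scanner. This is a minimal sketch over `String`, not the attoparsec parser in orgmode-parse; `Inline`, `parseInline` and `findClose` are hypothetical names. It assumes only `*` and `/` are markers, that a closing marker must follow a non-whitespace character and be followed by whitespace or the end of input, and it does not handle nesting.

```haskell
import Data.Char (isSpace)

-- | Hypothetical, simplified inline tree for this sketch.
data Inline = Plain String | Bold [Inline] | Italic [Inline]
  deriving (Show, Eq)

-- | Map a marker character to the constructor it produces.
markerFor :: Char -> Maybe ([Inline] -> Inline)
markerFor '*' = Just Bold
markerFor '/' = Just Italic
markerFor _   = Nothing

-- | Find the first closing marker: it must follow a non-whitespace character
-- and be followed by whitespace or the end of input (simplified rule).
-- Returns the span body and the remaining input.
findClose :: Char -> String -> Maybe (String, String)
findClose m = go []
  where
    go _ [] = Nothing
    go acc (x : xs)
      | x == m, (p : _) <- acc, not (isSpace p), atBoundary xs =
          Just (reverse acc, xs)
      | otherwise = go (x : acc) xs
    atBoundary []      = True
    atBoundary (y : _) = isSpace y

-- | Scan a line, emitting a span only where both boundary rules hold.
parseInline :: String -> [Inline]
parseInline = merge . go Nothing
  where
    go _ [] = []
    go prev (c : rest)
      | Just wrap <- markerFor c
      , opensAfter prev rest
      , Just (body, after) <- findClose c rest =
          wrap [Plain body] : go (Just c) after
      | otherwise = Plain [c] : go (Just c) rest

    -- An opening marker needs whitespace (or start of input) before it
    -- and a non-whitespace character after it.
    opensAfter prev rest =
      maybe True isSpace prev && case rest of
        (n : _) -> not (isSpace n)
        []      -> False

    -- Coalesce adjacent plain fragments into one node.
    merge (Plain a : Plain b : xs) = merge (Plain (a ++ b) : xs)
    merge (x : xs)                 = x : merge xs
    merge []                       = []
```

In GHCi, `parseInline "/http://someurl.com//"` evaluates to `[Italic [Plain "http://someurl.com/"]]`, while `parseInline "*text *"` and `parseInline "* text *"` both stay as a single `Plain` node.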

ixmatus avatar Nov 26 '18 21:11 ixmatus

@ixmatus Do you have a reference for org-mode markup syntax? Many corner cases do not seem to be documented in https://orgmode.org/manual/Markup.html

zhujinxuan avatar Nov 26 '18 21:11 zhujinxuan

@ixmatus I agree. I tested this in Emacs org-mode. Should we consider * test * as marked up? [screenshot 2018-11-26 16 59 16]

zhujinxuan avatar Nov 26 '18 21:11 zhujinxuan

No, I don't think we should. I think we should follow org-mode's fontification behavior and treat * text * as plain text (that is what my stashed change does now).

ixmatus avatar Nov 27 '18 00:11 ixmatus

I'm finding lots of corner cases (by adding tests) that we didn't account for in the markup parser that I need to resolve before I can push up my work.

ixmatus avatar Dec 03 '18 00:12 ixmatus

@ixmatus Can you open up a PR with some of the tests?

zhujinxuan avatar Dec 03 '18 14:12 zhujinxuan

I will when I clean up some of my experiments :) I will probably get to it throughout the week, no stress!

ixmatus avatar Dec 03 '18 14:12 ixmatus

I haven't had much free time to finish this up, but I do have free time coming up for the holidays and I will be working on this then.

ixmatus avatar Dec 18 '18 16:12 ixmatus

@ixmatus Hi, have you been working on this recently? If not, I will begin the fix next Saturday (Mar 30th).

zhujinxuan avatar Mar 24 '19 02:03 zhujinxuan

@zhujinxuan my real life and job have taken an intense turn. I won't get to this until late May now, so if you're able to pick it up, that would be great.

Thank you.

ixmatus avatar Mar 24 '19 03:03 ixmatus