Use of escape.Markdown for #text elements

Open chamilad opened this issue 5 years ago • 3 comments

Hello,

I'm using your library for a markdown generation tool for static site generators. The Rule interface is just perfect!

As I read through the code, the use of escape for #text elements mostly seems like a problem to me. Could you explain why it was added? I couldn't understand why certain characters need to be escaped.

Thanks!

chamilad • Oct 16 '19 21:10

@chamilad great that you like the library!

If the snippet <p>**Not Strong**</p> is run through the library without escaping, it would produce **Not Strong**, which Markdown renders as bold text, and that is not what we expect. These side effects happen with quite a few characters ("*" for bold, "_" for italic, "-" for list items, four leading spaces accidentally creating a code block, ...).
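To make the side effect concrete, here is a minimal sketch of what escaping plain text can look like. The helper `escapeMarkdown` and its character list are assumptions for illustration only; this is not the library's actual escape.Markdown implementation, which handles more cases.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeMarkdown is a hypothetical helper (not the library's escape.Markdown).
// It puts a backslash in front of a few characters that Markdown would
// otherwise interpret as syntax. The backslash itself is escaped too, which
// is exactly what doubles the backslashes in TeX-style input.
func escapeMarkdown(text string) string {
	replacer := strings.NewReplacer(
		`\`, `\\`,
		`*`, `\*`,
		`_`, `\_`,
		"`", "\\`",
		`#`, `\#`,
	)
	return replacer.Replace(text)
}

func main() {
	// The text content of <p>**Not Strong**</p> stays literal once escaped.
	fmt.Println(escapeMarkdown("**Not Strong**")) // prints \*\*Not Strong\*\*
}
```

Because `strings.NewReplacer` scans the input once and never rescans its own output, each character is escaped exactly one time.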


When a header (eg. <h3>) contains any new lines in its body, it will split the header contents over multiple lines, breaking the header in Markdown (because in Markdown, a header just starts with #'s and anything on the next line is not part of the header). Since in HTML and Markdown all white space is treated the same, I chose to replace line endings with spaces. -> https://github.com/lunny/html2md/pull/6

With escaping, this input will generate this output which is not perfect but close to the original.


@chamilad if you send me some snippets that behave unexpectedly, I'm happy to add some test cases and fix that.

As background information: this library was designed to pipe whole websites through it, meaning it is supposed to handle some weird edge cases.

JohannesKaufmann • Mar 23 '20 19:03

Hi there! First, thanks for a great library! Second, I have an example that behaves unexpectedly:

The document I'm converting contains maths equations such as <span class="tex2jax_process">$L’ = (1+n \cdot C) \cdot L$</span>. Amazingly, this almost works out of the box since the $$ syntax is apparently used in some Markdown flavors as well. However, I get $L’ = (1+n \\cdot C) \\cdot L$, i.e. the backslashes before cdot are escaped. I would need them "raw": $L’ = (1+n \cdot C) \cdot L$.

If this is a corner case that breaks something else, then I'm happy to just write my own rule to override the default one, just thought I'd mention this.

estyrke • Feb 02 '21 06:02

@estyrke Yeah, you are right, that is a bug. Unfortunately, it's not that easy to fix.

I have thought about a new approach that might make escaping more reliable (also resolving #19), but that requires a substantial refactor. And I don't have time for that at the moment 🤷‍♂️


For now, you can create a custom rule for "span" and register it using AddRules.

Then check whether the element has the class name "tex2jax_process" using selec.HasClass.

If it does, return selec.Text() instead of content. That gives you the original text, which is not escaped.

If it does not have the class name, return nil, which makes the default rule run.
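The decision logic of such a rule can be sketched without the library. The `selection` type below is a stand-in written for this example; it only mimics the HasClass and Text methods mentioned above, and `texSpanRule` is a hypothetical name. In real use, the function body would live in the replacement callback of a rule for "span" that is registered with AddRules.

```go
package main

import "fmt"

// selection is a stand-in for the selection object passed to a rule.
// It is an assumption for this sketch and only models class names and text.
type selection struct {
	classes []string
	text    string
}

// HasClass reports whether the element carries the given class name.
func (s selection) HasClass(name string) bool {
	for _, c := range s.classes {
		if c == name {
			return true
		}
	}
	return false
}

// Text returns the raw, unescaped text of the element.
func (s selection) Text() string { return s.text }

// texSpanRule returns the raw text for TeX spans and nil for everything
// else. A nil result means "not handled here, use the default rule".
func texSpanRule(content string, selec selection) *string {
	if selec.HasClass("tex2jax_process") {
		raw := selec.Text()
		return &raw
	}
	return nil
}

func main() {
	raw := `$L = (1+n \cdot C) \cdot L$`
	escaped := `$L = (1+n \\cdot C) \\cdot L$`
	span := selection{classes: []string{"tex2jax_process"}, text: raw}
	if out := texSpanRule(escaped, span); out != nil {
		fmt.Println(*out) // the backslashes are not doubled
	}
}
```

Returning a pointer rather than a plain string is what lets the rule distinguish "fall back to the default" (nil) from "replace with an empty string".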

Let me know if you have any problems...

JohannesKaufmann • Feb 02 '21 09:02