
Convert html to markdown

Open emansih opened this issue 5 years ago • 7 comments

The script to convert html files to md files is https://github.com/emansih/TaskerDocumentation/blob/ec316935bd02e3412a0b6d7aba63ad304da3d7ae/converter.sh

All files under the en directory are OK. However, the userguide_summary.md file is borked.

emansih avatar Jun 15 '19 12:06 emansih

Many of the output files (e.g. ah_copy_file.md, ah_delete_file.md) have backslashes before newlines and quotes, which seems unnecessary (and less readable). GitHub's parser doesn't require them, does Tasker's?

goldfndr avatar Jun 16 '19 18:06 goldfndr

Yes, there are some issues. Pandoc seems to have trouble with <br>: it writes the backslashes @goldfndr noticed at those positions. Also, the Tasker documentation uses empty anchors like <A NAME="g"/> or even malformed ones like

<A NAME="diff"/A>
<H4>Differences Between Widgets and Shortcuts</H4>

Although that works in HTML, it isn't good style and doesn't translate well to Markdown:

[]{#diff}
#### Differences Between Widgets and Shortcuts

It would be better to change the HTML to <H4 id="diff">Differences Between Widgets and Shortcuts</H4>, or to write #### Differences Between Widgets and Shortcuts{#diff} in Markdown.
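The HTML-side fix suggested above could be applied before running the converter. The following is a minimal sketch in Python, assuming the anchors always sit directly before a plain heading tag with no attributes, as in the example; the function name `fold_anchor_into_heading` is hypothetical and not part of converter.sh.

```python
import re

# Matches an empty named anchor written as <A NAME="x"/>, <A NAME="x"/A>,
# or <A NAME="x"></A>, followed by optional whitespace and an opening
# heading tag without attributes (<H1> to <H6>).
ANCHOR_BEFORE_HEADING = re.compile(
    r'<a\s+name="([^"]+)"\s*(?:/\s*a?\s*>|>\s*</a\s*>)\s*<(h[1-6])>',
    re.IGNORECASE,
)


def fold_anchor_into_heading(html: str) -> str:
    """Move the anchor name onto the following heading as an id attribute."""
    return ANCHOR_BEFORE_HEADING.sub(
        lambda m: f'<{m.group(2)} id="{m.group(1)}">', html
    )


if __name__ == "__main__":
    sample = '<A NAME="diff"/A>\n<H4>Differences Between Widgets and Shortcuts</H4>'
    print(fold_anchor_into_heading(sample))
```

Anchors that are not followed by a heading are left untouched, so they would still need manual attention.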

CennoxX avatar Jun 16 '19 19:06 CennoxX

Well, we can either manually edit those Markdown files or edit the HTML files before running the conversion again.

emansih avatar Jun 17 '19 06:06 emansih

I'd expect a sed script to handle stripping the superfluous backslashes. It's just a matter of creating a blacklist or whitelist and applying it; sed could help there too.
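The comment proposes sed; the same substitution is sketched here in Python instead. It assumes the only unwanted escapes are a backslash directly before a newline or before a quote character, as reported for ah_copy_file.md. The function name and the `keep_hard_breaks` parameter are hypothetical. Because a backslash at the end of a line is one way Markdown spells a hard line break, the sketch can swap it for two trailing spaces rather than dropping the break.

```python
import re

# A single backslash (not itself escaped) directly before a newline.
BACKSLASH_NEWLINE = re.compile(r'(?<!\\)\\\n')
# A single backslash (not itself escaped) directly before a quote character.
BACKSLASH_QUOTE = re.compile(r'(?<!\\)\\(["\'])')


def strip_superfluous_backslashes(markdown: str, keep_hard_breaks: bool = True) -> str:
    """Remove backslashes before newlines and quotes.

    With keep_hard_breaks=True the line break is kept as two trailing
    spaces; with False the lines simply flow together when rendered.
    """
    break_text = "  \n" if keep_hard_breaks else "\n"
    markdown = BACKSLASH_NEWLINE.sub(break_text, markdown)
    return BACKSLASH_QUOTE.sub(r"\1", markdown)


if __name__ == "__main__":
    text = 'Copy a \\"file\\".\\\nNext line'
    print(strip_superfluous_backslashes(text))
```

The sketch does not skip code blocks or inline code, so files containing literal backslashes in code would need a whitelist, as the comment suggests.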

Has a bug been filed for Pandoc?

goldfndr avatar Jun 18 '19 11:06 goldfndr

No bug has been filed with Pandoc, since I am unsure whether it's a bug in Pandoc or bad code in the HTML.

emansih avatar Jun 18 '19 12:06 emansih

Backslashes seem to be in e.g. ah_copy_file.md because of the <br/> tags present in the source. So it's rather bad code in the HTML. I guess Pent used to have some tool that produced XHTML. Look at the head of en/index.html, for example. The <br/> tag is part of XHTML, not HTML. Perhaps Pandoc can be instructed to handle <br/> tags, but the userguide has numerous mistakes in its HTML formatting anyway; en/variables.html is a prominent example. I'd recommend feeding all those pages to an HTML checker first and fixing all reported errors before converting the sources to Markdown.
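As a rough illustration of that pre-check, here is a minimal sketch using Python's standard `html.parser` module. It only tracks whether tags are balanced and is not a real validator. The class and function names are hypothetical, and it will also report elements whose end tags HTML allows to be omitted (for example `<li>`), so its output is a list of places to look at rather than a list of definite errors.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they are ignored by the check.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalanceChecker(HTMLParser):
    """Collects unbalanced tags; tag names arrive lowercased from HTMLParser."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, line where it was opened)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if tag not in [name for name, _ in self.open_tags]:
            line = self.getpos()[0]
            self.problems.append(f"line {line}: </{tag}> has no matching opening tag")
            return
        # Pop back to the matching opener, reporting anything left open inside it.
        while self.open_tags:
            name, opened_at = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(f"line {opened_at}: <{name}> is never closed")


def check_html(text: str) -> list:
    """Return a list of human-readable tag-balance problems for one page."""
    checker = TagBalanceChecker()
    checker.feed(text)
    checker.close()
    leftovers = [f"line {line}: <{name}> is never closed" for name, line in checker.open_tags]
    return checker.problems + leftovers


if __name__ == "__main__":
    for problem in check_html("<p><b>bold</p>"):
        print(problem)
```

Running `check_html` over each file under en and fixing the reported spots by hand would be one way to follow the recommendation before converting again.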

git-core avatar Jun 19 '19 10:06 git-core

Yes, Pent probably did have a tool, but for different reasons (en/index.html is hand-crafted, e.g. some list elements have closing tags and some don't). You can see the XML source for the actions and events and states. The XML's actions include 5.0's Take Screenshot and Set App Shortcuts, so it was definitely in use pre-João. I would expect that the tool was included with what Pent provided, but I don't see it here. The tool probably reads the source (res/values/*.xml), as the A-Z files and individual files have names that the XML doesn't (e.g. "Clear Key"), and the A-Z file is obviously alphabetically sorted (the XML seems to be randomized). Some of the entries (e.g. action_help_clear_encryption and action_help_airplane_radios) do include HTML (italic and bold respectively) so that's allowed.

It's also possible that a tool could convert Markdown files into XML and we can come full circle.

goldfndr avatar Jun 20 '19 00:06 goldfndr