
Support for plain text without Markdown syntax

Open mwaterfall opened this issue 12 years ago • 15 comments

I'm working on a fork to add an option (--no-markdown) that will allow the conversion of HTML to pure plain text. For example, this will add quotation marks around blockquotes, remove any markdown syntax for headers (and several other places), and basically present things nicely when markdown will not be used to render it.

I believe this is a valid use case; this is the best project that I've found that converts HTML into plain text, but I think it would be nice to have an option to output things straight to plain text without any Markdown syntax.

Here's an example output:

Output (Markdown):
# Title of my document:

**Lorem** Ipsum is simply dummy text of the printing and typesetting 
industry. Lorem Ipsum has been the industry's standard dummy 
text ever since the 1500s, when an unknown printer took a galley 
of type and scrambled it to make a type specimen book.

Check out an awesome project here: [https://github.com/aaronsw/html2text](https://github.com/aaronsw/html2text)

> It was popularised in the 1960s with the release of Letraset sheets 
containing Lorem Ipsum passages, and more recently with desktop 
publishing software like Aldus PageMaker including versions 
of Lorem Ipsum.

  * bit
  * bold italic
    * orange
    * apple
  * final

Output (No Markdown):
Title of my document:

Lorem Ipsum is simply dummy text of the printing and typesetting
industry. Lorem Ipsum has been the industry's standard dummy
text ever since the 1500s, when an unknown printer took a galley
of type and scrambled it to make a type specimen book.

Check out an awesome project here: https://github.com/aaronsw/html2text

“It was popularised in the 1960s with the release of Letraset sheets
containing Lorem Ipsum passages, and more recently with desktop
publishing software like Aldus PageMaker including versions
of Lorem Ipsum.”

  – bit
  – bold italic
    – orange
    – apple
  – final

I've got things rolling here: mwaterfall/html2text@6e288c3

I'd love to hear views on this. I'm happy to put some more work into it so it's ready to eventually merge into the main project.

Mar 18 '13 16:03 mwaterfall

I don’t think it is a good idea ... I would rather keep html2text a minimal HTML-to-Markdown converter, and if you need anything more, then it would IMHO be much better to create a deMarkDowner (an MD-to-plain-text converter) rather than pushing it all into html2text.

Apr 09 '14 08:04 mcepl
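
A "deMarkDowner" of the kind suggested above could be a small post-processing step applied to html2text's Markdown output. The following is a minimal sketch, assuming only the constructs that appear in the example output of the opening post (headers, bold text, links, blockquotes, and bullet lists). The function name `demarkdown` and the regular expressions are hypothetical illustrations, not part of html2text and not the code from the fork mentioned above.

```python
import re

# Patterns for the small Markdown subset shown in the example output.
LINK = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")   # [text](url)
BOLD = re.compile(r"\*\*(.+?)\*\*")             # **text**
HEADER = re.compile(r"^#{1,6}\s+")              # leading '#' markers
BULLET = re.compile(r"^(\s*)[*+-]\s+")          # '*', '+' or '-' list bullets
QUOTE = re.compile(r"^>\s?")                    # blockquote marker

OPEN_QUOTE, CLOSE_QUOTE, EN_DASH = "\u201c", "\u201d", "\u2013"


def _plain_link(match):
    """Keep only the URL when the link text merely repeats it."""
    text, url = match.group(1), match.group(2)
    return url if text == url else f"{text} ({url})"


def demarkdown(markdown):
    """Rewrite a limited Markdown subset as plain text (hypothetical helper)."""
    out = []
    in_quote = False
    for line in markdown.splitlines():
        # A blank line ends a blockquote: close the quotation on the previous line.
        if in_quote and not line.strip():
            out[-1] = out[-1].rstrip() + CLOSE_QUOTE
            in_quote = False
        if QUOTE.match(line):
            line = QUOTE.sub("", line)
            if not in_quote:
                line = OPEN_QUOTE + line
                in_quote = True
        line = HEADER.sub("", line)
        line = BULLET.sub(lambda m: m.group(1) + EN_DASH + " ", line)
        line = BOLD.sub(r"\1", line)
        line = LINK.sub(_plain_link, line)
        out.append(line)
    if in_quote:
        out[-1] = out[-1].rstrip() + CLOSE_QUOTE
    return "\n".join(out)
```

This deliberately ignores everything else in Markdown (single-asterisk emphasis, code spans, horizontal rules, numbered lists, images), so treat it as a starting point rather than a general converter.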

@mwaterfall, I happen to have exactly the same use case, and could really use your patch! Can you guide me on what would be the best and quickest way to get it working with the latest code maintained by @Alir3z4?

Mar 13 '15 10:03 amitembibe

@amitembibe @mwaterfall IMO @mcepl is right about it.

Even calling that kind of output "Text" is wrong. What I see is that you only replaced > with " and * with -.

As a matter of fact, you just introduced a new markdown syntax.

Mar 15 '15 17:03 Alir3z4

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text.

It also "happens to be valid Markdown". Markdown maintains some structure so it can be converted to HTML if required. All I'm suggesting is an output option for "humans" that uses a different markup (remove #, remove > and enclose the paragraph in quotation marks, use different list bullets, etc.) so that the result is _clean, readable prose_ when the intent is to show plain text to a user.

It would still fulfil the project brief: converting HTML into clean, _easy-to-read_ plain ASCII text.

I see that as a perfectly good use-case.

Mar 19 '15 13:03 mwaterfall

Yes, because it is SO difficult to read Markdown. Please, get out your sed and write those five lines (or however many it takes) of that md2asciitextasIlikeit script.

Unfortunately, I don't have a CLOSE button here, otherwise this creeping featurism would be the fastest WONTFIX in the history of the universe. At least I can hopefully unsubscribe from this issue.

Mar 19 '15 13:03 mcepl

Try to be civil; you'll live a happier life :-)

Markdown might be human readable, but it's an ugly and confusing format for displaying to non-technical end-users. Why create and pass it through something else?

This project is called html2text, not html2markdown.

Mar 19 '15 14:03 mwaterfall

+1 for @mwaterfall for civility and I too want text for non-technical end-users.

Feb 08 '16 19:02 kcrawford

Also interested in this feature. My aim is to get plain text from HTML so that other scripts can parse sentences, so I'm not interested in any formatting characters. Otherwise this tool would be better called html2markdown!

Sep 04 '16 00:09 geoffroy-noel-ddh
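
For the sentence-parsing use case described above, where no formatting characters are wanted at all, one alternative is to skip Markdown entirely and collect only the text nodes of the HTML. The sketch below uses Python's standard-library `html.parser` rather than html2text; the class name `TextExtractor`, the helper `html_to_plain_text`, and the choice of which tags count as block-level are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes only; no Markdown or other markup is emitted."""

    # Tags that should start a new line in the output (an illustrative choice).
    BLOCK_TAGS = {"p", "div", "br", "li", "blockquote", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    # Tags whose contents are never shown to a reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Normalise whitespace within each line and drop empty lines.
        raw = "".join(self._parts)
        lines = (" ".join(line.split()) for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_plain_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each block-level element ends up on its own line, which is usually enough for a downstream sentence splitter. Link targets and list nesting are discarded, so this is not a substitute for html2text when structure matters.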

+1

Jul 09 '18 12:07 johnfrancisgit

Heads-up, the maintainer of this repo, Aaron Swartz, is no longer with us. The successor of this repo had a similar issue, but it was closed.

Mar 20 '20 05:03 Garrett-R

I support everything @mwaterfall says here.

Also, the reactions he gets are plain rude, as if the owners of this project were walking around with shotguns, ready to down anyone who might make any suggestion. Ugly. Your beloved Markdown, my friends, is not Text and you know it. I repeat: Markdown is not Text.

In short: this project is a con. Its name promises something that in reality it can't and won't do -- ever.

May 31 '21 11:05 bjd-pfq

@bjd-pfq please be more respectful to open source maintainers who generously volunteer their time. When you make comments like this, it burns them out. Even if they seem prickly at times, it doesn't help to escalate.

In the case of the maintainer of this repo, you can read more about him here and everything he gave us. Please read that before criticizing further. Also, it's just good to know about him in general and make sure his memory lives on.

May 31 '21 18:05 Garrett-R

@Garrett-R I think @bjd-pfq has a valid point though, particularly regarding the comments made by @mcepl, who was very rude to @mwaterfall. But I agree that whatever your opinion is, there is a constructive way to voice it. Let's not get emotional over text and md lol

May 31 '21 18:05 johnfrancisgit

True, that was being rude, although just a heads-up that mcepl is not even a maintainer on this repo.

> Let's not get emotional over text and md lol

:laughing:

May 31 '21 18:05 Garrett-R

The latest open ticket on the new repo gives specific configuration advice:

  • https://github.com/Alir3z4/html2text/issues/359

Aug 17 '21 16:08 jeremydouglass