
html2text to use libcurl to retrieve remote documents

Open snooppr opened this issue 2 years ago • 6 comments

I found this package in the Termux repository. The problem is that there are no examples in the README, and the -help output does not make it clear enough how to parse a URL (it gives an error).

Test: curl https://habr.com/en/post/488432/ | html2text

UTF-8 text is displayed incorrectly (see screenshot).

[Screenshot: 1 cleaned]

Please fix automatic encoding selection and add examples to the README. The CLI syntax might also need to be improved, for example -help --> --help and -h, -version --> --version and -v, ...

snooppr avatar Feb 02 '22 07:02 snooppr

GNU-ification of the options is certainly an option (standard getopt-long), but it breaks all backwards compatibility, which I have tried to retain for now.

curl https://habr.com/en/post/488432/ | html2text -from_encoding utf-8 seems to work here. Does that work for you too? html2text doesn't do any guessing at this point, and it keeps the traditional default for the same backwards-compatibility reason as above. For the record, I think utf-8 would be a much more sensible default today.
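To illustrate why the output looks garbled without -from_encoding, here is a minimal Python sketch. It assumes the traditional default is a single-byte encoding such as ISO-8859-1, which this thread does not state; the helper name decode_as and the sample string are made up for the illustration and are not part of html2text.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding, replacing invalid sequences."""
    return raw.decode(encoding, errors="replace")


# Hypothetical sample: Cyrillic text as a UTF-8 page would deliver it.
sample = "Привет"
raw = sample.encode("utf-8")

# Assumed single-byte default: every byte becomes its own character,
# so each two-byte Cyrillic letter turns into two unrelated symbols.
garbled = decode_as(raw, "iso-8859-1")

# Declaring the real encoding (the -from_encoding utf-8 case) restores the text.
restored = decode_as(raw, "utf-8")
```

A single-byte decode never fails, it just produces the wrong characters, which is probably why the tool prints garbage rather than reporting an error.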

grobian avatar Feb 06 '22 11:02 grobian

curl https://habr.com/en/post/488432/ | html2text -from_encoding utf-8 seems to work here. Does that work for you too?

It's about autodetection of the encoding (based on the received headers). Here is an example where both the default and utf-8 fail (because the encoding is windows-1251 this time). [Screenshot: Screenshot_20220207-062911_Termux]

It only works in this case: curl http://forum.igromania.ru/member.php?username=adam | html2text -from_encoding windows-1251

snooppr avatar Feb 07 '22 03:02 snooppr

Right, but html2text doesn't do the HTTP request, so it has no headers. What you want is basically for html2text to link against libcurl, so that it can pick up the encoding declared in the response headers.
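For context, the information in question is the charset parameter of the Content-Type response header. A rough Python sketch of extracting it follows; the function name charset_from_content_type is hypothetical, and this only shows what a client doing the HTTP request itself could read, not anything html2text does.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() parses the header parameters and lowercases the result.
    return msg.get_content_charset()
```

A wrapper script could fetch the headers, call this on the Content-Type value, and pass the result to html2text via -from_encoding, falling back to a default when it returns None.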

grobian avatar Feb 07 '22 07:02 grobian

So the request is OK, but the actual problem was solved by the implementation for issue #20. Since the document contains a meta charset tag, we don't need HTTP headers and can display the document fine without setting a from_encoding.
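As a sketch of the general idea only (this is not the code from issue #20, and the name sniff_meta_charset and the 4096-byte window are assumptions for the example), detecting a meta charset declaration can look roughly like this in Python:

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(html: bytes, window: int = 4096) -> Optional[str]:
    """Return the charset declared in a meta tag near the top of the document."""
    match = _META_CHARSET.search(html[:window])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

Charset names are plain ASCII, so the raw bytes can be searched before the document's encoding is known.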

grobian avatar Apr 01 '22 18:04 grobian

html2text doesn't do the http request

Perhaps the confusion arises because README-1.3.2a gives the impression that it does: it states

"html2text reads HTML documents from standard input or a (local or remote) URI".

The old html2text homepage and the even older html2text homepage had further wording that certainly made it sound like it does indeed do http requests:

"Each HTML document is loaded from a location indicated by a URI or read from standard input, [...]. The input-URI may specify a remote site, from that the documents are loaded via the Hypertext Transfer Protocol (HTTP)."

ryandesign avatar May 05 '22 10:05 ryandesign

I've removed the old README, so this confusion shouldn't happen any more.

grobian avatar May 06 '22 11:05 grobian