html2text
html2text to use libcurl to retrieve remote documents
I found this package in the Termux repository. The problem is that the README has no examples, and the -help output does not make it clear how to pass a URL (doing so gives an error).
Test: curl https://habr.com/en/post/488432/ | html2text
UTF-8 text is displayed incorrectly (see screenshot).
Please fix the automatic encoding selection and add examples to the README. The CLI syntax might also need to be improved, for example: -help --> --help and -h, -version --> --version and -v, ...
GNU-ification of the options (standard getopt_long) is certainly an option, but it breaks all backwards compatibility, which I have tried to retain for now.
curl https://habr.com/en/post/488432/ | html2text -from_encoding utf-8
seems to work here. Does that work for you too? html2text doesn't do any guessing at this point, and it keeps the traditional default for the same reason as above (backwards compatibility). For the record, I think utf-8 would be a much more sensible default today.
curl https://habr.com/en/post/488432/ | html2text -from_encoding utf-8
seems to work here. Does that work for you too?
It's about autodetection of the encoding (based on the received headers). Here is an example where both the default and utf-8 fail, because the encoding is windows-1251 this time.
It only works when the encoding is given explicitly: curl http://forum.igromania.ru/member.php?username=adam | html2text -from_encoding windows-1251
Right, but html2text doesn't do the http request, so it has no headers. What you want is basically for html2text to link against libcurl, so it can use the content encoding properly.
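Until something like that exists, the header lookup can be done outside html2text. The following is only a sketch of a workaround, not part of html2text: the function names charset_from_content_type and fetch_as_text are made up for illustration, and the utf-8 fallback is an assumption rather than html2text's default. It relies on curl's -w '%{content_type}' output and on the -from_encoding option shown above.

```shell
#!/bin/sh
# Hypothetical helper: extract the charset parameter from a Content-Type
# value such as "text/html; charset=windows-1251".
# Prints nothing when no charset parameter is present.
charset_from_content_type() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -n 's/.*charset=\([^ ;]*\).*/\1/p' \
        | tr -d "\"'"
}

# Hypothetical wrapper: download the page once, ask curl which Content-Type
# the server sent, and pass the charset on to html2text.
# Assumption: fall back to utf-8 when the server names no charset.
fetch_as_text() {
    tmp=$(mktemp) || return 1
    ctype=$(curl -sL -o "$tmp" -w '%{content_type}' "$1")
    charset=$(charset_from_content_type "$ctype")
    html2text -from_encoding "${charset:-utf-8}" < "$tmp"
    rm -f "$tmp"
}
```

With these defined, fetch_as_text 'http://forum.igromania.ru/member.php?username=adam' should pick up windows-1251 on its own, provided the server actually declares it in the Content-Type header.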
So the request is OK, but the actual problem was solved by the implementation for issue #20. Since the document contains a meta charset tag, we don't need HTTP headers or anything else, and can display the document fine without setting -from_encoding.
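To illustrate what a meta charset lookup involves, here is a rough shell sketch. It is not how html2text implements issue #20; charset_from_meta is a hypothetical helper, and it only handles a meta tag that sits on a single line.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the charset declared in
# the first <meta ...charset=...> tag, lowercased. Prints nothing if no such
# tag is found. Covers both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
charset_from_meta() {
    LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C grep -a -o '<meta[^>]*charset=[^>]*>' \
        | head -n 1 \
        | sed 's#.*charset=["'\'']\{0,1\}\([^"'\'' ;/>]*\).*#\1#'
}
```

For a saved page (page.html is a placeholder name), enc=$(charset_from_meta < page.html) followed by html2text -from_encoding "${enc:-utf-8}" < page.html reproduces the effect by hand on versions that lack the built-in detection.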
html2text doesn't do the http request
Perhaps the confusion arises because README-1.3.2a gives the impression that it does: it states
"html2text reads HTML documents from standard input or a (local or remote) URI".
The old html2text homepage and the even older html2text homepage had further wording that certainly made it sound like it does indeed do http requests:
"Each HTML document is loaded from a location indicated by a URI or read from standard input, [...]. The input-URI may specify a remote site, from that the documents are loaded via the Hypertext Transfer Protocol (HTTP)."
I've removed the old README, so this confusion shouldn't happen any more.