html2text

-ascii and transliteration

Open brunonaibert opened this issue 5 years ago • 4 comments

The "-ascii" parameter is not performing the transliteration. Here is an example:

echo atenção | html2text -ascii 
aten????o

brunonaibert avatar Oct 14 '20 19:10 brunonaibert

Hmmm

% echo atenção | html2text -from_encoding UTF-8 -ascii
atenc~ao

Not sure if that's the expected output, but is your environment using Latin-1 as its encoding (as opposed to UTF-8)?

grobian avatar Oct 18 '20 08:10 grobian

It's really confusing indeed:

% echo atenção | html2text -from_encoding UTF-8 -to_encoding ascii
aten????o
% echo atenção | html2text -from_encoding UTF-8 -to_encoding ascii//translit
atenc~ao

so on my system (macOS) it seems like -ascii is doing the translit as advertised.

grobian avatar Oct 18 '20 09:10 grobian

You're right. I forgot that the standard input is ISO-8859-1.
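As a rough illustration of why the number of question marks differs between runs, here is a minimal Python sketch. It only mimics the symptom and is not html2text's actual code path: it assumes the four-mark output comes from UTF-8 bytes being read as ISO-8859-1, which is one plausible explanation (a converter that substitutes one `?` per undecodable byte would look the same). The variable names are made up for the example.

```python
# Illustration only: reproduces the symptom in Python, not html2text's internals.
text = "atenção"
utf8_bytes = text.encode("utf-8")  # ç and ã each take two bytes in UTF-8

# Case 1: the UTF-8 bytes are misread as ISO-8859-1, so every byte becomes
# its own character, and each non-ASCII character turns into "?".
misread = utf8_bytes.decode("iso-8859-1")
four_marks = misread.encode("ascii", errors="replace").decode("ascii")

# Case 2: the bytes are decoded correctly as UTF-8, then forced to ASCII
# without any transliteration, giving one "?" per accented character.
two_marks = utf8_bytes.decode("utf-8").encode("ascii", errors="replace").decode("ascii")

print(four_marks)
print(two_marks)
```

Under that assumption, four question marks point to an input-encoding mismatch, while two question marks mean the input was decoded correctly but no transliteration was applied.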

Here are the tests and their results again:

$ echo atenção | html2text -from_encoding UTF-8 -ascii
aten??o
$ echo atenção | html2text -from_encoding UTF-8 -to_encoding ascii
aten????o
$ echo atenção | html2text -from_encoding UTF-8 -to_encoding ascii//translit
aten??o

To explain better, here is the result of iconv when using ascii//translit:

$ echo atenção | iconv -f UTF-8 -t ascii//translit
atencao

In my view, this is the expected result.
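For comparison, the same expected result can be approximated in Python. This is a sketch under stated assumptions, not how glibc's `//translit` works internally (glibc relies on locale transliteration tables): it decomposes accented letters with Unicode NFKD normalization and drops the combining marks. The helper name `strip_accents_to_ascii` is hypothetical.

```python
import unicodedata


def strip_accents_to_ascii(text: str) -> str:
    """Approximate ASCII transliteration by dropping combining accents."""
    # NFKD splits "ç" into "c" + combining cedilla, "ã" into "a" + combining tilde.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the base characters; combining marks have a non-zero class.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII (no decomposition available) becomes "?".
    return without_marks.encode("ascii", errors="replace").decode("ascii")


print(strip_accents_to_ascii("atenção"))
```

Other transliteration styles exist as well: the `atenc~ao` output reported on macOS earlier in this thread keeps the tilde as a separate ASCII character instead of dropping it.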

-- System Information:
Debian Release: bullseye/sid
APT prefers unstable
APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 4.19.0-12-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to C.UTF-8), LANGUAGE=C.UTF-8
Shell: /bin/sh linked to /usr/bin/dash
Init: unable to detect

Versions of packages html2text depends on:
ii libc6 2.31-4
ii libgcc-s1 10.2.0-15
ii libstdc++6 10.2.0-15

brunonaibert avatar Oct 22 '20 02:10 brunonaibert

Are you sure html2text is linked against the same libiconv as the iconv utility you use? I guess you use glibc, so it should be using the same libc. Can you try this with 2.0.0 or latest git?

grobian avatar Apr 02 '22 11:04 grobian

need info

grobian avatar Jul 30 '23 14:07 grobian

In version 2.2.3 the behavior persists.

brunonaibert avatar Jan 20 '24 06:01 brunonaibert

:(

grobian avatar Jan 20 '24 08:01 grobian