reader
Fails to parse an email's HTML containing French punctuation and a quote
I use reader as the first step in a script that produces output for the Neomutt email client's pager.
The script receives the raw HTML, converts it to Markdown, and pipes it to pandoc, elinks, and then less (to add references and colors).
That's the best solution I have found to get something clean, formatted, and highlighted for Neomutt's HTML display.
Issue
But a few days ago I noticed a message for which reader was not displaying the sender's text, only the quoted part.
It may be related to Gmail's HTML formatting or to the text itself.
Example
The message is a reply to my previous message and was sent from Gmail. (I replaced private text with Xs.)
- HTML
<div dir="auto">Hello,<div dir="auto"><br></div><div dir="auto">Merci d'y avoir pensé. 🙂</div><div dir="auto">X'xxx xxxxxxxxx. X'xx xxxxx x'xxxxxxxx.</div><div dir="auto"><br></div><div dir="auto">X'xx xxxxxxxxxxx.</div><div dir="auto">Xxx xxxxxxxx 🙂</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">Le lun. 30 oct. 2023 à 17:28, Tomasz Kapias <<a href="mailto:[email protected]">[email protected]</a>> a écrit :<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><u></u><div><div>Xxxx xxxxxxx,</div><div><br></div><div>Xxxxx xx xxxxxx. Xxxxxxxx xxxx x'xxx xxxxxx, xxxxx x xxxxxxx.</div><div><br></div><div>Xx xxxx x'xx xxxxxx x'xxx xxxxx xxx xx xxx x'xxx xxxx, xx x'xx xxxx xx xxx. xxxx x xxx x'xxx xxxx x'xx xxxx x xxx, xxxxxx, xx x'xxx xxx x'xxxx xxxx. xxx xxx x'xxx xxxx.</div><div><br></div><div><br></div><div>Xx x'xxxxxx, bonne soirée.</div><div><br></div><div>Tomasz<br></div></div></blockquote></div>
- Output of reader --image-mode none --markdown-output --verbose message.html:
Le lun. 30 oct. 2023 à 17:28, Tomasz Kapias < [[email protected]](mailto:[email protected]) \> a écrit :
> Xxxx xxxxxxx,
>
> Xxxxx xx xxxxxx. Xxxxxxxx xxxx x'xxx xxxxxx, xxxxx x xxxxxxx.
>
> Xx xxxx x'xx xxxxxx x'xxx xxxxx xxx xx xxx x'xxx xxxx, xx x'xx xxxx xx xxx. xxxx x xxx x'xxx xxxx x'xx xxxx x xxx, xxxxxx, xx x'xxx xxx x'xxxx xxxx. xxx xxx x'xxx xxxx.
>
> Xx x'xxxxxx, bonne soirée.
>
> Tomasz
- Display in Firefox:
The part above the quote is not parsed by reader.
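A quick way to check for this behavior from a script is to search the converted output for a phrase that only appears in the reply, not in the quoted part. The helper below is a minimal sketch; the function name has_reply_text is hypothetical and not part of reader.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of reader): succeeds if the text on stdin
# contains the given phrase. -F makes grep match the phrase literally, so
# apostrophes and dots in French text are not interpreted as a pattern.
has_reply_text() {
  grep -qF -- "$1"
}

# Usage, assuming message.html is the sample above:
# reader --image-mode none --markdown-output message.html |
#   has_reply_text "Merci d'y avoir pensé" || echo "reply text missing" >&2
```

With the output shown above, the check would fail, because only the quoted block survives the conversion.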
@tkapias first of all: That's a super nifty use case! Do you happen to have dotfiles with the Neomutt config? Would be curious to try it myself. :-)
I'll have a look at the specific issue. My gut feeling is that the cause lies in github.com/JohannesKaufmann/html-to-markdown, which reader uses to convert the HTML to Markdown.
While this is something I will dig deeper into, I have a different idea to solve this issue more elegantly. It sounds like you're already dealing with Markdown, which you pipe to Pandoc. Would it work for you if reader provided a --markdown-input option, so that the conversion from Markdown to HTML and from HTML to Markdown could be cut out?
About the second point: a new feature for reader
Before settling on the current pipeline with reader, I tried maybe 20 other tools and a lot of combinations with iconv, many pagers, and highlighters.
The issue was that nothing combined the specific format needed by the Neomutt pager, the display of URLs as references, and good parsing of tables and nested elements.
Email solution providers love deeply nested elements and strange tables.
So the only solution I found is to use reader to extract the most important part of the message, then clean it with pandoc to get nice tables that elinks can read; elinks then adds references and some colors. And I wrap it all at 80 columns.
But if you find a way to shorten all that, it would be huge.
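Because the pipeline depends on four separate programs, a wrapper script can fail in confusing ways when one of them is missing. The following is a small sketch of a pre-flight check; the function name require_cmds is an assumption made for illustration and is not part of any of these tools.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: print every missing command on stderr and
# return non-zero if at least one of them is not found on PATH.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf 'missing command: %s\n' "$cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the top of a wrapper script:
# require_cmds reader pandoc elinks less || exit 1
```

Checking all names before returning means a single run lists everything that still needs to be installed.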
My Neomutt setup
My Neomutt setup is a huge work in progress. I use mbsync, notmuch, and afew to sync my IMAP accounts and sort the messages, and I use msmtp as a sender. All of that is run by a systemd timer.
This is what the latest GitHub notification message looks like.
To get that pager display, I customized a lot of Neomutt's settings and colors, and used a script referenced in the mailcap file to convert text/html messages.
- I will clean the private references out of my dotfiles and upload them to my git server, but for now you can check these:
- ~/.config/neomutt/mailcap:
text/html; auto-view_html %s %{charset} ${COLUMNS}; nametemplate=%s.html; copiousoutput; x-neomutt-nowrap;
- ~/.config/neomutt/scripts/auto-view_html.sh:
#!/usr/bin/env bash
# Takes a temporary HTML attachment from Neomutt's autoview and returns cleaned,
# formatted, colored output, ready for the built-in pager.
# Requires 3 arguments: filename, charset, columns
shopt -s extglob
export LC_ALL="C.UTF-8"
export TZ=:/etc/localtime

# Cap the rendering width at 80 columns.
if [[ $3 -lt 80 ]]; then
  _columns=$3
else
  _columns=80
fi

# HTML -> Markdown (reader) -> HTML with clean tables (pandoc)
# -> text with references and colors (elinks) -> pager (less)
reader --image-mode none --markdown-output --terminal-width "$_columns" "$1" |
  pandoc -f commonmark+emoji+pipe_tables -t html+empty_paragraphs --wrap auto --columns "$_columns" --preserve-tabs --tab-stop 2 |
  elinks -no-connect 1 -localhost 1 -dump 1 -dump-color-mode 4 --force-html -dump-width "$_columns" |
  LESS_COLUMNS="$_columns" less -QRXs
- My Elinks config is custom too, and it may be important:
## ELinks 0.16.1.1 configuration file
set config.comments = 3
set config.indentation = 2
set config.saving_style = 3
set document.browse.images.display_style = 2
set document.browse.images.image_link_tagging = 1
set document.browse.images.image_link_prefix = "["
set document.browse.images.image_link_suffix = "]"
set document.browse.images.label_maxlen = 0
set document.browse.images.show_as_links = 1
set document.browse.images.show_any_as_links = 1
set document.browse.links.active_link.enable_color = 1
set document.browse.links.color_dirs = 1
set document.browse.links.numbering = 1
set document.browse.links.show_goto = 1
set document.browse.links.label_key = "0123456789"
set document.browse.margin_width = 2
set document.browse.preferred_document_width = 80
set document.browse.use_preferred_document_width = 1
set document.codepage.force_assumed = 0
set document.colors.text = "#c3c3c3"
set document.colors.background = "#011627"
set document.colors.link = "#5555ff"
set document.colors.vlink = "#5555ff"
set document.colors.image = "#ff8888"
set document.colors.bookmark = "#5555ff"
set document.colors.use_link_number_color = 1
set document.colors.link_number = "#21c7a8"
set document.colors.increase_contrast = 0
set document.colors.ensure_contrast = 0
set document.colors.use_document_colors = 0
set document.dump.codepage = "System"
set document.dump.color_mode = 4
set document.dump.numbering = 1
set document.dump.references = 1
set document.dump.terminal_hyperlinks = 0
set document.dump.separator = "
"
set document.dump.width = 80
set document.html.display_frames = 1
set document.html.display_iframes = 0
set document.html.display_tables = 1
set document.html.display_subs = 1
set document.html.display_sups = 1
set document.html.link_display = 2
set document.html.underline_links = 1
set document.html.wrap_nbsp = 1
set document.plain.display_links = 0
set document.plain.compress_empty_lines = 1
set document.plain.fixup_tables = 1
set terminal.rxvt-unicode.charset = "UTF-8"
set terminal.rxvt-unicode.underline = 1
set terminal.rxvt-unicode.italic = 1
set terminal.rxvt-unicode.transparency = 1
set terminal.rxvt-unicode.colors = 4
set terminal.rxvt-unicode.block_cursor = 1
set terminal.rxvt-unicode.restrict_852 = 0
set terminal.rxvt-unicode.combine = 1
set terminal.rxvt-unicode.utf_8_io = 1
set terminal.rxvt-unicode.m11_hack = 1
set terminal.rxvt-unicode.latin1_title = 0
set terminal.rxvt-unicode.type = 2
set terminal.tmux-256color.underline = 1
set terminal.tmux-256color.italic = 1
set terminal.tmux-256color.transparency = 1
set terminal.tmux-256color.colors = 4
set terminal.tmux-256color.block_cursor = 1
set terminal.tmux-256color.restrict_852 = 0
set terminal.tmux-256color.combine = 1
set terminal.tmux-256color.utf_8_io = 1
set terminal.tmux-256color.m11_hack = 0
set terminal.tmux-256color.latin1_title = 0
set terminal.tmux-256color.type = 2
set terminal.tmux-direct.charset = "UTF-8"
set terminal.tmux-direct.underline = 1
set terminal.tmux-direct.italic = 1
set terminal.tmux-direct.transparency = 1
set terminal.tmux-direct.colors = 4
set terminal.tmux-direct.block_cursor = 1
set terminal.tmux-direct.restrict_852 = 0
set terminal.tmux-direct.combine = 1
set terminal.tmux-direct.utf_8_io = 1
set terminal.tmux-direct.m11_hack = 0
set terminal.tmux-direct.latin1_title = 0
set terminal.tmux-direct.type = 2
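The width handling in auto-view_html.sh above could be isolated into a small function. As far as I can tell, an empty third argument (for example when COLUMNS is not set in the mailcap environment) would pass the -lt test and leave the width empty, so the sketch below falls back to the maximum in that case. The name clamp_columns and the optional second parameter are assumptions made for illustration, not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the render width, capped at a maximum
# (default 80). Falls back to the maximum when the requested value is
# empty, zero, or not a plain non-negative integer.
clamp_columns() {
  local requested=${1:-} max=${2:-80}
  case $requested in
    '' | *[!0-9]*)
      printf '%s\n' "$max"
      return 0
      ;;
  esac
  if [ "$requested" -gt 0 ] && [ "$requested" -lt "$max" ]; then
    printf '%s\n' "$requested"
  else
    printf '%s\n' "$max"
  fi
}

# Usage inside the script:
# _columns=$(clamp_columns "$3")
```

Unlike the original test, this treats a width of 0 as invalid rather than passing it on to the tools in the pipeline.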
Sorry for the long delay. I have found the reason why your mail is being mangled, and I have started implementing a fix in Journalist, which is needed before the fix can land in reader. However, it turns out that one crucial dependency of reader, github.com/tinoquang/go-cloudflare-scraper, has vanished, making it impossible for me to build a new version of reader at the moment.
I am working on fixing the dependency issue and, after that, will implement the fix for your use case.
A fix for this issue has been implemented. You can now use the -r option of reader in your scripts and it won't mangle your mails.