Some Unicode characters cause garbage in HTML output
- Platform: Linux server 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 GNU/Linux
- Node Version v10.9.0
- [email protected]
mercury-parser https://www.greaterwrong.com/posts/SqF8cHjJv43mvJJzx/feeling-rational
outputs:
{
"title": "Feeling Rational - LessWrong 2.0 viewer",
"author": null,
"date_published": null,
"dek": null,
"lead_image_url": null,
"content": "<main class=\"post\"><div class=\"body-text post-body\"><p>Since cu­ri­os­ity is an emo­tion, I sus­pect that some peo­ple will ob­ject to treat­ing cu­ri­os­ity as a part of ra­tio­nal­ity. A pop­u­lar be­lief about “ra­tio­nal­ity” is that ra­tio­nal­ity op­poses all emo­tion—that all our sad­ness and all our joy are au­to­mat­i­cally anti-log­i­cal by virtue of be­ing <em>feel­ings</em>. Yet strangely enough, I can’t find any the­o­rem of prob­a­bil­ity the­ory which proves that I should ap­pear ice-cold and ex­pres­sion­less. </p> <p>When peo­ple think of “emo­tion” and “ra­tio­nal­ity” as op­posed, I sus­pect that they are re­ally think­ing of Sys­tem 1 and Sys­tem 2—fast per­cep­tual judg­ments ver­sus slow de­liber­a­tive judg­ments. Sys­tem 2’s de­liber­a­tive judg­ments aren’t always true, and Sys­tem 1’s per­cep­tual judg­ments aren’t always false; so it is very im­por­tant to dis­t­in­guish that di­chotomy from “ra­tio­nal­ity.” Both sys­tems can serve the goal of truth, or defeat it, de­pend­ing on how they are used.</p> <p>For my part, I la­bel an emo­tion as “not ra­tio­nal” if it rests on mis­taken be­liefs, or rather, on mis­take-pro­duc­ing e
…
I don’t want this weird “cu<U+00AD>ri<U+00AD>os<U+00AD>ity”; it should just be “curiosity”. What can I do?
I examined the source html:
<p>Since cu<U+00AD>ri<U+00AD>os<U+00AD>ity is an emo<U+00AD>tion, I sus<U+00AD>pect that some peo<U+00AD>ple will ob<U+00AD>ject to treat<U+00AD>ing cu<U+00AD>ri<U+00AD>os<U+00AD>ity as a part of ra<U+00AD>tio<U+00AD>nal<U+00AD>ity.
A pop<U+00AD>u<U+00AD>lar be<U+00AD>lief about “ra<U+00AD>tio<U+00AD>nal<U+00AD>ity” is that ra<U+00AD>tio<U+00AD>nal<U+00AD>ity op<U+00AD>poses all emo<U+00AD>tion—that all our sad<U+00AD>ness and all our joy are au<U+00AD>to<U+00AD>mat<U+00AD>i<U+00AD>cally anti-log<U+00AD>i<U+00AD>cal by virtue of be<U+00AD>ing <em>feel<U+00AD>ings</em>.
Yet strangely enough, I can’t find any the<U+00AD>o<U+00AD>rem of prob<U+00AD>a<U+00AD>bil<U+00AD>ity the<U+00AD>ory which proves that I should ap<U+00AD>pear ice-cold and ex<U+00AD>pres<U+00AD>sion<U+00AD>less.
</p>
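For anyone who wants to check their own input the same way, here is a minimal Node sketch that counts soft hyphens (U+00AD) in a string and makes them visible using the `<U+00AD>` notation from the excerpt above. The helper name `revealSoftHyphens` is made up for illustration and is not part of mercury-parser.

```javascript
// Hypothetical helper (not part of mercury-parser): count the soft hyphens
// in a string and return a copy where each one is shown as "<U+00AD>".
function revealSoftHyphens(text) {
  const matches = text.match(/\u00AD/g) || [];
  return {
    count: matches.length,
    visible: text.replace(/\u00AD/g, '<U+00AD>'),
  };
}

const report = revealSoftHyphens('cu\u00ADri\u00ADos\u00ADity');
console.log(report.count);   // 3
console.log(report.visible); // cu<U+00AD>ri<U+00AD>os<U+00AD>ity
```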
I don’t know why they are doing this, but it’d be nice if mercury was able to deal with it.
I have found a workaround for removing the extra characters: tr -cd "[:print:]\n"
It works only for the text format and not HTML, which is where I really need it to work.
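One possible way to get the same effect on HTML output is to post-process the string and remove only the soft hyphen, rather than every non-printable character. The sketch below is an assumption about how that could be done, not something mercury-parser provides: `stripSoftHyphens` is a hypothetical helper, and it also removes the entity spellings of U+00AD (`&shy;`, `&#173;`, `&#xAD;`) in case the page uses those instead of the literal character.

```javascript
// Hypothetical helper (not part of mercury-parser): remove soft hyphens
// from an HTML string while leaving tags and all other characters intact.
function stripSoftHyphens(html) {
  return html
    // literal U+00AD characters
    .replace(/\u00AD/g, '')
    // entity forms of the same character: &shy;, &#173;, &#xAD;
    .replace(/&(?:shy|#0*173|#[xX]0*[aA][dD]);/g, '');
}

const sample = '<p>Since cu\u00ADri\u00ADos\u00ADity is an emo&shy;tion</p>';
console.log(stripSoftHyphens(sample)); // <p>Since curiosity is an emotion</p>
```

Under that assumption, the helper would be applied to the `content` string of the JSON result shown earlier. Unlike the `tr` workaround, it leaves other non-ASCII characters such as curly quotes untouched.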
Facing the same issue