Some Unicode characters cause garbage in HTML output

Open NightMachinery opened this issue 5 years ago • 3 comments

  • Platform: Linux server 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 GNU/Linux
  • Node Version v10.9.0
  • [email protected]

mercury-parser https://www.greaterwrong.com/posts/SqF8cHjJv43mvJJzx/feeling-rational

outputs:

{
  "title": "Feeling Rational - LessWrong 2.0 viewer",
  "author": null,
  "date_published": null,
  "dek": null,
  "lead_image_url": null,
  "content": "<main class=\"post\"><div class=\"body-text post-body\"><p>Since cu&#xAD;ri&#xAD;os&#xAD;ity is an emo&#xAD;tion, I sus&#xAD;pect that some peo&#xAD;ple will ob&#xAD;ject to treat&#xAD;ing cu&#xAD;ri&#xAD;os&#xAD;ity as a part of ra&#xAD;tio&#xAD;nal&#xAD;ity. A pop&#xAD;u&#xAD;lar be&#xAD;lief about &#x201C;ra&#xAD;tio&#xAD;nal&#xAD;ity&#x201D; is that ra&#xAD;tio&#xAD;nal&#xAD;ity op&#xAD;poses all emo&#xAD;tion&#x2014;that all our sad&#xAD;ness and all our joy are au&#xAD;to&#xAD;mat&#xAD;i&#xAD;cally anti-log&#xAD;i&#xAD;cal by virtue of be&#xAD;ing <em>feel&#xAD;ings</em>. Yet strangely enough, I can&#x2019;t find any the&#xAD;o&#xAD;rem of prob&#xAD;a&#xAD;bil&#xAD;ity the&#xAD;ory which proves that I should ap&#xAD;pear ice-cold and ex&#xAD;pres&#xAD;sion&#xAD;less. </p> <p>When peo&#xAD;ple think of &#x201C;emo&#xAD;tion&#x201D; and &#x201C;ra&#xAD;tio&#xAD;nal&#xAD;ity&#x201D; as op&#xAD;posed, I sus&#xAD;pect that they are re&#xAD;ally think&#xAD;ing of Sys&#xAD;tem 1 and Sys&#xAD;tem 2&#x2014;fast per&#xAD;cep&#xAD;tual judg&#xAD;ments ver&#xAD;sus slow de&#xAD;liber&#xAD;a&#xAD;tive judg&#xAD;ments. Sys&#xAD;tem 2&#x2019;s de&#xAD;liber&#xAD;a&#xAD;tive judg&#xAD;ments aren&#x2019;t always true, and Sys&#xAD;tem 1&#x2019;s per&#xAD;cep&#xAD;tual judg&#xAD;ments aren&#x2019;t always false; so it is very im&#xAD;por&#xAD;tant to dis&#xAD;t&#xAD;in&#xAD;guish that di&#xAD;chotomy from &#x201C;ra&#xAD;tio&#xAD;nal&#xAD;ity.&#x201D; Both sys&#xAD;tems can serve the goal of truth, or defeat it, de&#xAD;pend&#xAD;ing on how they are used.</p> <p>For my part, I la&#xAD;bel an emo&#xAD;tion as &#x201C;not ra&#xAD;tio&#xAD;nal&#x201D; if it rests on mis&#xAD;taken be&#xAD;liefs, or rather, on mis&#xAD;take-pro&#xAD;duc&#xAD;ing e
…

I don’t want this weird ‘cu&#xAD;ri&#xAD;os&#xAD;ity’; it should just be ‘curiosity’. What can I do?

NightMachinery avatar Jul 04 '19 07:07 NightMachinery

I examined the source HTML:

<p>Since cu<U+00AD>ri<U+00AD>os<U+00AD>ity is an emo<U+00AD>tion, I sus<U+00AD>pect that some peo<U+00AD>ple will ob<U+00AD>ject to treat<U+00AD>ing cu<U+00AD>ri<U+00AD>os<U+00AD>ity as a part of ra<U+00AD>tio<U+00AD>nal<U+00AD>ity. A pop<U+00AD>u<U+00AD>lar be<U+00AD>lief about “ra<U+00AD>tio<U+00AD>nal<U+00AD>ity” is that ra<U+00AD>tio<U+00AD>nal<U+00AD>ity op<U+00AD>poses all emo<U+00AD>tion—that all our sad<U+00AD>ness and all our joy are au<U+00AD>to<U+00AD>mat<U+00AD>i<U+00AD>cally anti-log<U+00AD>i<U+00AD>cal by virtue of be<U+00AD>ing <em>feel<U+00AD>ings</em>. Yet strangely enough, I can’t find any the<U+00AD>o<U+00AD>rem of prob<U+00AD>a<U+00AD>bil<U+00AD>ity the<U+00AD>ory which proves that I should ap<U+00AD>pear ice-cold and ex<U+00AD>pres<U+00AD>sion<U+00AD>less. </p>

I don’t know why they are doing this, but it’d be nice if mercury was able to deal with it.

NightMachinery avatar Jul 04 '19 07:07 NightMachinery

I have found the workaround tr -cd "[:print:]\n" for removing the extra characters, but it works only for the text format and not html, which is where I really need it to work.
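
For the HTML case, one possible approach is to post-process the "content" string yourself. U+00AD is the Unicode soft hyphen, and in the JSON output above it shows up as the entity `&#xAD;`. That is plain printable ASCII, which is presumably why the `tr` filter has no effect there. Below is a minimal sketch in Node; `stripSoftHyphens` is a hypothetical helper name and not part of mercury-parser, and the sample string is copied from the output above.

```javascript
// Hypothetical helper (not part of mercury-parser): removes soft hyphens
// from an HTML string. It covers the raw U+00AD character as well as the
// entity forms &#xAD; (hex), &#173; (decimal) and &shy; (named).
const SOFT_HYPHEN = /\u00AD|&#x0*AD;|&#0*173;|&shy;/gi;

function stripSoftHyphens(html) {
  return html.replace(SOFT_HYPHEN, '');
}

// Sample taken from the "content" field in the output above.
const sample = '<p>Since cu&#xAD;ri&#xAD;os&#xAD;ity is an emo&#xAD;tion</p>';
console.log(stripSoftHyphens(sample));
// prints: <p>Since curiosity is an emotion</p>
```

This only touches soft hyphens, so other entities such as `&#x201C;` and tags such as `<em>` are left alone. It would be applied to the `content` value after parsing the JSON that mercury-parser prints.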

NightMachinery avatar Jul 04 '19 07:07 NightMachinery

Facing the same issue

sidhantpanda avatar Aug 22 '19 17:08 sidhantpanda