
Some Unicode (or non-7-bit ASCII) characters cause grief

Open RexJaeschke opened this issue 7 years ago • 4 comments

a. Any paragraph of text or Hack … -delimited example containing an ellipsis (U+2026), left double quote (U+201C), or right double quote (U+201D) is rendered as a blank line.
b. Em-dashes (U+2014) and en-dashes (U+2013) cause the text to be swallowed, with no output.
c. I have cross-references of the form §§, but rather than displaying §§ linked to xxx, the whole construct is swallowed, with no output. BTW, § is U+00A7, so the high bit is set, putting it outside the ASCII range.

Are all code points > U+007F handled in this manner?
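One way to answer that question for a given file is to list every character above U+007F with its position and Unicode name. The following is a minimal Python sketch, not part of the documentation build; the function name `find_non_ascii` and the sample text are hypothetical and chosen only for illustration.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, char, code point, name) for each character above U+007F."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, ch, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample containing the characters reported in this issue.
sample = "See \u00a7\u00a7 for details\u2026\nplain ASCII line\n\u201cquoted\u201d \u2014 dash"
for hit in find_non_ascii(sample):
    print(hit)
```

Running this over each Markdown source would show which lines contain characters outside the ASCII range, which can then be compared against the lines that fail to render.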

BTW, I discovered these when pasting text from MS Word. I've replaced each of these characters with ones that are accepted, but it took me a while to figure out why they "disappeared into the void".
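The manual replacement described above could be scripted. This is a rough Python sketch, assuming plain ASCII substitutes are acceptable; the `REPLACEMENTS` table and the function name `to_ascii_punctuation` are hypothetical, and the specific substitutes are illustrative choices rather than the ones actually used. The section sign (U+00A7) is left alone because it has no obvious ASCII equivalent.

```python
# Hypothetical mapping from the punctuation named in this issue to ASCII substitutes.
REPLACEMENTS = {
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": "--",   # em dash
    "\u2013": "-",    # en dash
}

def to_ascii_punctuation(text):
    """Replace the mapped characters; every other character is left untouched."""
    return text.translate(str.maketrans(REPLACEMENTS))

print(to_ascii_punctuation("\u201cHello\u201d\u2026 a\u2014b"))
```

A workaround like this only hides the symptom; it does not explain why the renderer drops the original characters.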

RexJaeschke avatar Aug 27 '18 16:08 RexJaeschke

What does locale output on your server?

Does running the following before building help?

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

fredemmott avatar Aug 27 '18 23:08 fredemmott

Here's my locale when I log on:

ubuntu@ip-172-31-36-66:~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

I changed it, as follows:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

Here are the locale settings:

ubuntu@ip-172-31-36-66:~/user-documentation/public$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

These changes had no effect; the characters in question (and surrounding text) are still not rendered.
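Since the locale settings make no difference, one further thing that could be checked is how the source file itself is encoded, because text pasted from a Windows application is sometimes saved as Windows-1252 rather than UTF-8. That is only a guess, not a confirmed cause. A minimal Python sketch, with the hypothetical helper name `guess_encoding`:

```python
def guess_encoding(raw: bytes) -> str:
    """Classify raw file bytes as ASCII, valid UTF-8, or neither."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Bytes that are not valid UTF-8; Windows-1252 is one possibility.
        return "not utf-8 (possibly windows-1252)"

# Hypothetical samples: the same curly-quoted text encoded two ways.
print(guess_encoding("\u201cquote\u201d".encode("utf-8")))
print(guess_encoding("\u201cquote\u201d".encode("cp1252")))
```

To check a real file, read it in binary mode (`open(path, "rb").read()`) and pass the bytes to the function.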

RexJaeschke avatar Aug 28 '18 11:08 RexJaeschke

  • Can you create a pull request with a complete example? I'm unable to reproduce this in isolation.
  • Does the result depend on which browser you are using?

fredemmott avatar Aug 28 '18 22:08 fredemmott

Here's my test md file (with suffix .txt added to accommodate the upload constraint):

Non-ASCII Character Tests.md.txt

And here's the captured display for the first few tests:

non-ascii character tests

The Word versions (that is, with text copied straight from MS Word) of each test result in a blank line (except for the section marker, which doesn't render correctly either) on both Chrome and my old IE (I'm running Win8.1).

I did not change any environment variables; I just used the defaults (whose values I reported yesterday).

RexJaeschke avatar Aug 29 '18 15:08 RexJaeschke