Character "©" in comment leads to exception: commitAndReleaseBuffer: invalid argument (invalid character)

Open JoergBrueggmann opened this issue 3 years ago • 14 comments

The following haddock comment

...
-- | ...a ....
{-|
Prefix: xyz

Characteristics

* does this and this

    * e.g. bla... (' \0')
    * e.g. ... (' ')
    * e.g. ... 'A'
    * e.g. ... '©'

* regarding particular functions

    * ...

        * ...
    
    * ...

        * ...

    * ...

-}

led to the following output, after which Haddock (version 2.24.0) stopped:

Warning: '<stderr>: commitAndReleaseBuffer: invalid argument (invalid character)

After removing the line with '©', the error (reported as a warning) disappeared.

Now, I am using

Haddock version 2.25.1, (c) Simon Marlow 2006

and it now just stops, without any message at all.

JoergBrueggmann avatar Apr 10 '22 18:04 JoergBrueggmann

@JoergBrueggmann Could you tell me what locales your shell uses? :)

On my machine with UTF-8 locales, it is successfully converted to &#169;.

Kleidukos avatar Apr 12 '22 11:04 Kleidukos
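
For anyone unsure what "locale" means here: the following is a minimal sketch, using only `base`, that prints the text encoding GHC's runtime has picked for the process and for the standard handles. The helper name `describe` is made up for this example. Whether Haddock ends up with the same encodings in a given setup is an assumption; the sketch only shows what a freshly started GHC-compiled program sees.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO (TextEncoding, hGetEncoding, stderr, stdout)

-- Hypothetical helper: render a handle's encoding, or note that the
-- handle is in binary mode (hGetEncoding returns Nothing in that case).
describe :: Maybe TextEncoding -> String
describe = maybe "binary (no encoding)" show

main :: IO ()
main = do
  locale <- getLocaleEncoding
  putStrLn ("locale encoding: " ++ show locale)
  out <- hGetEncoding stdout
  err <- hGetEncoding stderr
  putStrLn ("stdout encoding: " ++ describe out)
  putStrLn ("stderr encoding: " ++ describe err)
```

If the reported encoding is not UTF-8, writing a character such as '©' to that handle may fail, depending on whether the encoding can represent it.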

Just another data point. In en_US.UTF-8, I get

'©' is out of scope.

with Haddock 2.24 and 2.26 (presumably, because quotation marks are supposed to denote Haskell identifiers that Haddock will try to hyperlink).

ulysses4ever avatar Apr 13 '22 02:04 ulysses4ever

@JoergBrueggmann Could you tell me what locales your shell uses? :)

On my machine with UTF-8 locales, it is successfully converted to &#169;.

I am not exactly sure what you mean by "what locales your shell uses". Maybe this answers your question: I am using VSCode, which can deal with different encodings. The encoding of the file is UTF-8 with BOM. If this doesn't answer your question, please let me know. Thank you.

JoergBrueggmann avatar Apr 13 '22 06:04 JoergBrueggmann

Just another data point. In en_US.UTF-8, I get

'©' is out of scope.

Well, the code point of "©" is U+00A9 (Unicode), and in UTF-8 it is encoded as the two bytes 0xC2 0xA9.

with Haddock 2.24 and 2.26 (presumably, because quotation marks are supposed to denote Haskell identifiers that Haddock will try to hyperlink).

Exactly, the single quotation marks denote Haskell identifiers. It seems that Haddock cannot deal with (all) Haskell identifiers that are encoded in UTF-8 and are above code point U+007F.

JoergBrueggmann avatar Apr 13 '22 06:04 JoergBrueggmann
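
To make the byte-level encoding of U+00A9 concrete, here is a small sketch of the standard UTF-8 encoding rules written against `base` only. The function name `utf8Encode` is made up for this example and is not part of Haddock or GHC; the sketch does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Numeric (showHex)

-- Hypothetical helper: encode one character as UTF-8 bytes.
utf8Encode :: Char -> [Word8]
utf8Encode c
  | n < 0x80    = [fromIntegral n]                        -- 1 byte (ASCII)
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]                 -- 2 bytes
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]        -- 3 bytes
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0] -- 4 bytes
  where
    n = ord c
    -- Leading bits of the code point, placed in the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = print (map (\b -> showHex b "") (utf8Encode '\xA9'))
-- prints ["c2","a9"]
```

So any code point above U+007F takes at least two bytes in UTF-8, which is why a handle with a non-UTF-8 encoding can fail on it.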

Please try enabling {-# LANGUAGE UnicodeSyntax #-} to handle Unicode identifiers. Is it any better if you save file without BOM?

(The error message commitAndReleaseBuffer: invalid argument (invalid character) is truly abhorrent. Any volunteers to make https://gitlab.haskell.org/ghc/ghc/-/blob/master/libraries/base/GHC/IO/Encoding/Failure.hs more helpful?)

Bodigrim avatar Apr 13 '22 07:04 Bodigrim

Nice to "see" you again.

Please try enabling {-# LANGUAGE UnicodeSyntax #-} to handle Unicode identifiers. Is it any better if you save file without BOM?

Neither {-# LANGUAGE UnicodeSyntax #-} nor saving the file without BOM (using Notepad++) makes it work any better.

(The error message commitAndReleaseBuffer: invalid argument (invalid character) is truly abhorrent. Any volunteers to make https://gitlab.haskell.org/ghc/ghc/-/blob/master/libraries/base/GHC/IO/Encoding/Failure.hs more helpful?)

What exactly are you looking for? I do not know the concept behind "...IO/Encoding/Failure.hs". Can you provide some links with more details? Background: I am going to build a compiler-compiler in Haskell. Therefore, for file IO, I am currently creating a library to deal with different character encodings in a completely different way. Maybe we can find some synergy.

JoergBrueggmann avatar Apr 13 '22 08:04 JoergBrueggmann

I raised https://gitlab.haskell.org/ghc/ghc/-/issues/21389 to improve relevant error messages.

@Kleidukos @ulysses4ever I assume Haddock could have caught this exception to provide a better user experience.

@JoergBrueggmann I'm not a maintainer here, but it could help if you share a standalone reproducer.

Bodigrim avatar Apr 13 '22 17:04 Bodigrim
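
As a rough illustration of what catching this exception could look like, here is a minimal sketch using only `base`. It is not Haddock's code: the names `warn` and `escapeNonAscii` are hypothetical, and the fallback strategy (re-emitting the message with non-ASCII characters escaped) is just one possible choice. It assumes the handle remains usable after the failed write.

```haskell
import Control.Exception (IOException, try)
import Data.Char (isAscii, ord)
import Numeric (showHex)
import System.IO (hFlush, hPutStrLn, stderr)

-- Hypothetical helper: replace every non-ASCII character with a
-- \x<hex> escape so the message can be written in any ASCII-compatible encoding.
escapeNonAscii :: String -> String
escapeNonAscii = concatMap esc
  where
    esc c
      | isAscii c = [c]
      | otherwise = "\\x" ++ showHex (ord c) ""

-- Hypothetical helper: try to print the message as-is; if the handle's
-- encoding rejects a character (an IOException), print an escaped version.
warn :: String -> IO ()
warn msg = do
  r <- try (hPutStrLn stderr msg >> hFlush stderr) :: IO (Either IOException ())
  case r of
    Right () -> pure ()
    Left _   -> hPutStrLn stderr (escapeNonAscii msg)

main :: IO ()
main = warn "Warning: '\xA9' is out of scope."
```

With a UTF-8 stderr the first attempt should simply succeed; the fallback only matters when the handle's encoding cannot represent the character.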

@JoergBrueggmann I'm not a maintainer here, but it could help if you share a standalone reproducer.

Do you mean a small Haskell project, e.g. on GitHub, that reproduces the bug?

JoergBrueggmann avatar Apr 13 '22 17:04 JoergBrueggmann

Yes, a small package such that cabal haddock fails on it.

Bodigrim avatar Apr 13 '22 18:04 Bodigrim

Yes, a small package such that cabal haddock fails on it.

OK, I will do that. I started to create such a package. Unfortunately, it behaves differently once the original version is reduced to a smaller package. :-( The original version stops without any message, whereas the reduced one writes an error message. I will resume tomorrow.

JoergBrueggmann avatar Apr 13 '22 21:04 JoergBrueggmann

@Bodigrim yup'. This goes in the TODO list. :)

Kleidukos avatar Apr 13 '22 22:04 Kleidukos

@Kleidukos, @ulysses4ever, @Bodigrim, please find the standalone project to reproduce the bug in repository https://github.com/JoergBrueggmann/HaddockIssue1472

If you have questions regarding the project, please let me know.

JoergBrueggmann avatar Apr 14 '22 07:04 JoergBrueggmann

@JoergBrueggmann thanks for the standalone reproducer. Unfortunately, it builds and renders okay on my end. Assuming you're on Linux, could you copy and paste here the result of executing env | grep LANG in your terminal?

ulysses4ever avatar Apr 21 '22 02:04 ulysses4ever

@ulysses4ever, I am working with stack on Windows and hence MSYS2. There is env but no grep. env prints the following:

...
LANG=en_US.UTF-8
...

JoergBrueggmann avatar Apr 21 '22 07:04 JoergBrueggmann
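
Where grep is not available, the same information can be gathered from Haskell itself. The following is a small sketch using only `base`; the function name `localeVars` is made up for this example, and its filter (LANG, LANGUAGE, and anything starting with LC_) is an assumption about which variables are of interest, so it is not an exact equivalent of env | grep LANG.

```haskell
import Data.List (isPrefixOf)
import System.Environment (getEnvironment)

-- Hypothetical helper: keep only the locale-related environment variables.
localeVars :: [(String, String)] -> [(String, String)]
localeVars = filter (isLocaleKey . fst)
  where
    isLocaleKey k = k == "LANG" || k == "LANGUAGE" || "LC_" `isPrefixOf` k

main :: IO ()
main = do
  env <- getEnvironment
  mapM_ (\(k, v) -> putStrLn (k ++ "=" ++ v)) (localeVars env)
```

Note that these variables describe the shell's locale; which encoding a GHC-built program actually uses on Windows may be determined differently, so treat the output as one data point only.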