PyF How are bytestrings encoded?

λ> let bø = [fmt|ø|] :: ByteString; in T.putStrLn $ "ø vs " <> [fmt|{T.decodeUtf8With T.lenientDecode bø}|]
ø vs �

λ> ([fmt|ø|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")


λ> ([fmt|{T.pack "ø"}|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")

("ø" in UTF-8 is 0xC3 0xB8 which is 195 184)

Does fmt encode bytestrings using some other encoding?

My current workaround is to restrict it to text

{-# LANGUAGE TemplateHaskell #-}
module PyF.Foo where
import Language.Haskell.TH.Quote
import Data.Text qualified as T
import PyF
import PyF.Internal.QQ (Config (..))
t :: QuasiQuoter
t =
  mkFormatter
    "t"
    ( fmtConfig
        { postProcess = \e -> [|T.pack $(e)|]
        }
    )

Feb 10 '24 09:02 unhammer

Hey, that's super interesting. Thank you for the report and sorry for the delay (I kinda missed the notification).

I had a look and to be honest it is unclear for me.

The code [fmt|ø|] is actually spliced as fromString "ø", as simple as it can be, it really behaves like a "string literal" in the context of OverloadedString.

Note that if you do not use OverloadedString, you are forced to manually do the conversion and there won't be any problem if you pass by text. However, if you use Data.ByteString.Char8.pack, it will return \248. I'm unable to find a documentation on what is the encoding used by Data.ByteString.Char8.pack.

Your workaround is great.

All of that being said, I'm really wondering if PyF should really work with ByteString. For sure it is convenient, but encoding issues are obvious and we never really know how to deal with them.

Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantic, such as fmtAsUtf8ByteString.

I'll be happy to have your input on that, and especially, in which context did you tried to use pyf to generate ByteString which may contain utf8 chars?

Apr 22 '24 17:04 guibou

Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantic, such as fmtAsUtf8ByteString.

That might be useful, though I'm not sure how to read that (is the "input" assumed to be utf8 or the output?).

I think at least it would be good to have some kind of note about this in the docs, perhaps that one should avoid using plain fmt when producing bytestrings which may contain non-ASCII.

I'll be happy to have your input on that, and especially, in which context did you tried to use pyf to generate ByteString which may contain utf8 chars?

It's been some months so I can't recall exactly, but it may have been that I was appending a bytestring with show Int to create another bytestring (as a key for rocksdb)

Apr 23 '24 07:04 unhammer

PyF PyF copied to clipboard

How are bytestrings encoded?

PyF
PyF copied to clipboard