PyF
How are bytestrings encoded?
λ> let bø = [fmt|ø|] :: ByteString; in T.putStrLn $ "ø vs " <> [fmt|{T.decodeUtf8With T.lenientDecode bø}|]
ø vs �
λ> ([fmt|ø|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")
λ> ([fmt|{T.pack "ø"}|]:: ByteString, T.encodeUtf8 "ø")
("\248","\195\184")
("ø" in UTF-8 is 0xC3 0xB8 which is 195 184)
Does fmt encode bytestrings using some other encoding?
My current workaround is to restrict it to Text:
{-# LANGUAGE TemplateHaskell #-}
module PyF.Foo where

import Language.Haskell.TH.Quote
import Data.Text qualified as T
import PyF
import PyF.Internal.QQ (Config (..))

t :: QuasiQuoter
t =
  mkFormatter
    "t"
    ( fmtConfig
        { postProcess = \e -> [|T.pack $(e)|]
        }
    )
Hey, that's super interesting. Thank you for the report and sorry for the delay (I kinda missed the notification).
I had a look and, to be honest, it is unclear to me.
The code [fmt|ø|] is actually spliced as fromString "ø". It is as simple as it can be: it really behaves like a "string literal" in the context of OverloadedStrings.
Note that if you do not use OverloadedStrings, you are forced to do the conversion manually, and there won't be any problem if you go through Text. However, if you use Data.ByteString.Char8.pack, it will return \248. I'm unable to find documentation on which encoding Data.ByteString.Char8.pack uses.
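For what it's worth, the Data.ByteString.Char8 module documentation notes that Chars are truncated to 8 bits, which would explain the \248: "ø" is code point U+00F8 (248), so its low byte is kept as-is rather than being UTF-8 encoded. Below is a minimal sketch of the difference, assuming only the bytestring and text packages. The helper names (latin1Bytes, viaFromString, utf8Bytes) are made up for illustration.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.String (fromString)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- Char8.pack keeps only the low 8 bits of each Char's code point.
latin1Bytes :: String -> [Word8]
latin1Bytes = BS.unpack . BS8.pack

-- What a ByteString literal (and thus a plain [fmt|...|]) goes through.
viaFromString :: String -> [Word8]
viaFromString s = BS.unpack (fromString s :: BS.ByteString)

-- Going through Text and encoding explicitly as UTF-8.
utf8Bytes :: String -> [Word8]
utf8Bytes = BS.unpack . TE.encodeUtf8 . T.pack

main :: IO ()
main = do
  print (latin1Bytes "ø")   -- [248]
  print (viaFromString "ø") -- [248]
  print (utf8Bytes "ø")     -- [195,184]
  -- A code point above 255 loses its high bits: U+03BB (955) becomes 187.
  print (latin1Bytes "λ")   -- [187]
```

So for code points up to 255 the result happens to match Latin-1, and anything above that is silently truncated.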
Your workaround is great.
All of that being said, I'm really wondering whether PyF should work with ByteString at all. For sure it is convenient, but encoding issues are obvious and we never really know how to deal with them.
Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantics, such as fmtAsUtf8ByteString.
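A rough sketch of what such a quoter could look like, following the same mkFormatter pattern as the workaround above. This is only an illustration, not an existing PyF API: the module name PyF.Utf8 is hypothetical, and it assumes that postProcess receives a String expression, as the T.pack in the workaround suggests. Here the formatted output is what gets encoded as UTF-8.

```haskell
{-# LANGUAGE TemplateHaskell #-}
-- Hypothetical module; not part of PyF.
module PyF.Utf8 (fmtAsUtf8ByteString) where

import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Language.Haskell.TH.Quote (QuasiQuoter)
import PyF (fmtConfig, mkFormatter)
import PyF.Internal.QQ (Config (..))

-- Always produces a strict ByteString holding the UTF-8 encoding of the
-- formatted string, instead of going through fromString.
fmtAsUtf8ByteString :: QuasiQuoter
fmtAsUtf8ByteString =
  mkFormatter
    "fmtAsUtf8ByteString"
    ( fmtConfig
        { -- Assumption: $(e) has type String here.
          postProcess = \e -> [|TE.encodeUtf8 (T.pack $(e))|]
        }
    )
```

Because of the Template Haskell stage restriction, the quoter has to live in a different module from the one that uses it. With it in scope, [fmtAsUtf8ByteString|ø|] would be expected to yield the bytes 195 184 rather than 248.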
I'll be happy to have your input on that, and especially on the context in which you tried to use PyF to generate a ByteString which may contain UTF-8 chars.
Maybe it could be convenient to introduce monomorphised quasi quoters with precise encoding semantic, such as fmtAsUtf8ByteString.
That might be useful, though I'm not sure how to read that (is the "input" assumed to be utf8 or the output?).
I think at least it would be good to have some kind of note about this in the docs, perhaps that one should avoid using plain fmt when producing bytestrings which may contain non-ASCII characters.
I'll be happy to have your input on that, and especially, in which context did you tried to use pyf to generate ByteString which may contain utf8 chars?
It's been some months so I can't recall exactly, but it may have been that I was appending the result of show on an Int to a bytestring to create another bytestring (as a key for rocksdb).
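If the use case was along those lines, a key can be built without any quasi quoter by encoding the textual part explicitly. This is only a sketch under that assumption: mkKey and the "ø:" prefix are made up for illustration, and nothing here is specific to rocksdb. Since show on an Int only produces ASCII characters, Char8.pack is safe for that part.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: the prefix is encoded as UTF-8, and the number is
-- rendered with show, which only yields ASCII for Int.
mkKey :: T.Text -> Int -> BS.ByteString
mkKey prefix n = TE.encodeUtf8 prefix <> BS8.pack (show n)

main :: IO ()
main =
  -- "ø" becomes 195 184, then ':' is 58, '4' is 52 and '2' is 50.
  print (BS.unpack (mkKey (T.pack "ø:") 42)) -- [195,184,58,52,50]
```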