bytestring
bytestring copied to clipboard
Surprising behavior of ByteString literals via IsString
At work, we discovered a somewhat surprising behavior of ByteString's IsString
instance and interaction with OverloadedStrings
.
The following REPL session demonstrates the issue:
λ> BS.unpack $ T.encodeUtf8 ("bla語" :: Text)
[98,108,97,232,170,158]
λ> BS.unpack $ ("bla語" :: BS.ByteString)
[98,108,97,158]
λ> T.decodeUtf8 $ ("bla語" :: BS.ByteString)
*** Exception: Cannot decode byte '\x9e': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8
The IsString
instance calls packChars
which calls c2w
, which silently truncates the bytes.
I'd be happy to put together a PR to document the behavior of the IsString
instance.
I think I expected it to encode the string using the source encoding. I don't know whether or not that's a feasible or desirable change.
The behavior you're experiencing is documented at http://hackage.haskell.org/package/bytestring-0.10.8.2/docs/Data-ByteString-Char8.html
However, it may make sense to attach a docstring to the IsString instance since the IsString instance transcends module barriers.
It seems unusual to me that packChars
is the basis for fromString
yet is not part of the API. I think packChars
ought to be exported, and the IsString
instance haddock can just link to the packChars
doc, and the packChars
doc should explain what it does.
I don't understand the rationale of having to export packChars
if we already have access to it via other means (i.e. fromString
& pack
) which alias it.
Ah, sorry, I didn't notice the pack
function in the Char8
module. Yeah, the important thing is for the IsString
instance doc to have a link to Char8.pack
.
This issue is still confusing users today: https://stackoverflow.com/questions/58777439/haskell-data-yaml-utf-8-decoding
This is a landmine to anyone using RawFilePath. Consider:
{-# LANGUAGE OverloadedStrings #-} import System.Posix.Directory.ByteString main = removeDirectory "bla語"
I agree that this seems to assume the source encoding will be used.
Since the IsString
instance is still confusing both beginners and experienced users, how about removing it?
@sjakobi Since this instance has been in place for as long as I can remember, it'd be interesting in order to inform a decision to know how many packages would be affected (and also whether those packages were in full compliance with the PVP by having the mandated upper bounds in place) on Hackage if this instance was removed in a future release of bytestring
.
@hvr I'm sure that a removal would cause tons of upgrade hassle.
Maybe it would be better to change the instance to UTF8-encode the input – that should limit any "breakage" to code that intentionally or more likely unintentionally triggered the current truncating behaviour.
I think the instance should be removed. It isn't well defined. Carrying such pitfalls around for backwards compatibility reasons isn't what the haskell ecosystem should be about in my opinion. It should be about correctness.
We could mark it deprecated and remove it in 1 year.
Yet another alternative to removal: Raise an error if any Char
in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…
Yet another alternative to removal: Raise an error if any
Char
in the input would be truncated. IMO that's better than truncating silently, but runtime errors aren't that great either…
As much as I'd like for ""bla語" :: ByteString
to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this :smile:
I'm in favor of removing the instance over a long-enough timeline. With qq-literals
and the statically checked overloaded strings trick, the fix could be as simple as inserting the relevant quasiquoter ([bs| ... |]
) or quotation $$(...)
(or even $$"..."
in 8.12!). This is an annoyance but would dramatically improve safety.
In stages:
- Replace the
fromString
function with one that throws a runtime error in the next major version bump. - Add a warning to the
IsString
instance in the next major version bump with migration instructions - Add a
TypeError
to theIsString
instance in the next major version bump with instructions on how to migrate away.
@parsonsmatt I really like the idea of offering quasiquoters as an alternative to literals as a first step. The $$(...)
and $$"..."
quotations seem slightly less readable to me.
bs
and lbs
seem like good names to me. Does everyone agree or have better suggestions?
bs "foo"
is syntactically lighter-weight than quasiquotation, and
trivial to implement:
bs :: Text -> ByteString
bs = Data.Text.Encoding.encodeUtf8
-- see shy jo
@joeyh text
depends on bytestring
, so bytestring
can't have a dependency on text
.
I guess it would be convenient if bytestring
could UTF8-encode String
s itself – I would be surprised if that hadn't been discussed before!
To clarify: The bs
and lbs
quasiquoters would only validate the input, rejecting any input containing non-Latin-1 characters, i.e. characters that don't fit into a single byte.
The documentation could direct users who want UTF-8 encoding to the utf8-string
package.
I guess it would be convenient if
bytestring
could UTF8-encodeString
s itself – I would be surprised if that hadn't been discussed before!
Turns out that there is some UTF-8 functionality in bytestring
– for Builder
:
In fact the IsString
instance for Builder
supports UTF-8 via stringUtf8
:
https://github.com/haskell/bytestring/blob/95fe6bdf13c9cc86c1c880164f7844d61d989574/Data/ByteString/Builder.hs#L452-L453
That's a big surprise to me, and I think it changes the picture.
I think we should aim for consistency with this (superior) instance then, and change the instances for strict and lazy ByteString
to support UTF-8. That should be much more convenient than having to resort to Builder
, text
or utf8-string
to get a UTF-8-encoded ByteString
.
Thoughts on this?
Thoughts on this?
My problem with this is that it's not a good API. ByteString has less structure than String and fromString
is just misleading here. Once you turned it into a bytestring, you lost explicit information (about the encoding) and must be aware that it is utf8 and not something else.
I'd rather have people use those utf8 functions explicitly, so they write less bugs. The semantics between IsString instances are just not clear enough, IMO. If you need additional documentation for an instance, then it's a good sign you shouldn't have that instance in the first place.
Maybe someone will expect it to truncate... and someone expect it to be utf8.
As much as I'd like for ""bla語" :: ByteString to be a compile-time error, I'd rather have a runtime error telling me why something is wrong rather than having this fact stick around in the back of my head, and I'd certainly rather have received the error instead of investigating and discovering this
@parsonsmatt @sjakobi Fortunately for you guys I had this epiphany years ago and already implemented the required typed TH in a library (https://hackage.haskell.org/package/validated-literals) that makes it very simple to turn these problems into compile time errors :p
In fact, a compile time checked ByteString is literally one of the examples in that library ;)
One issue with changing the ByteString
IsString
instances to UTF-8 is that users would need to be careful to have proper lower bounds on bytestring
to avoid truncation with older versions.
@hasufell
Maybe someone will expect it to truncate...
I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.
I would be very interested in knowing what @hvr and @dcoutts think about the idea of changing the instances to UTF-8.
A lot of people I know (including me) have shot their feet by this instance. As a user of Japanese writing system, I strongly support changing the instance to use UTF-8 (also to be consistent with the instance for Builder).
I really really don't believe that anyone would intentionally write non-Latin-1 literals knowing that they are truncated.
The point is, you are proposing to change behavior without renaming the function and you have no idea about the bugs in other peoples codebases, which may depend on unexpected behavior. PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).
Removing the instance is safe. Old code will stop compiling, people will be forced to change it and think about its semantics.
PVP doesn't work, people don't read ChangeLogs of 200+ packages to figure out such things (especially when they are not even aware of what they are relying on).
The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it. Quite frankly, I recommend to stay away from those people's packages as they don't instill me with great confidence and obviously it makes little sense for a PVP-compliant package to depend upon packages which don't themselves make any effort to uphold the PVP contract.
That being said, luckily almost all packages I rely on are maintained by people who care to write correct software and therefore also appreciate the goals and benefits of the PVP formal framework.
To summarise the basic options,instance IsString ByteString
could be either
- Decoding as ISO-8859-1 (status quo)
- Decoding as UTF-8
- not supported
The nice property of 1. is that you can roundtrip the ByteString literals serialized by Show
; especially those that aren't valid UTF-8 encodings -- after all, ByteString
is about binary data, which is a superset of valid UTF-8 encodings.There's also code out there which relies on the IsString instance to efficiently (still O(n) but with a relatively small factor; but way better than using pack
on [Word8]
-- but see also @phadej and @andrewthad's https://github.com/ghc-proposals/ghc-proposals/pull/292) embed binary blobs into haskell code (NB: to make things more fun to reason about; GHC currently uses a modified UTF8 encoding to represent string literals as CStrings; and there's been a long-time desire to have the data-length be known statically at compile time... but I've seen @andrewthad make some progress in this department recently).
The original complaint about silent failure could be addressed by turning it into a non-silent runtime exception. This probably wouldn't add any significant overhead since currently string literals have a O(n) overhead as they need to be transcoded; we'd just need to signal errors instead of truncating code-points during this transcoding that needs to occur anyway.
Variant 2. would make it consistent with instance IsString Builder
which currently uses UTF-8 encoding (however, it could be argued that IsString Builder
should be using ISO-8859-1 as well!). But other than that, it's still a weird choice to pick UTF-8 for a binary ByteString type. Also note that you still have to deal transcoding GHC's modified UTF8 serialization into proper UTF-8 as well as dealing with improper UTF-8 strings (either silently or by runtime exceptions). And you lose the ability to represent all ByteString
values as string literals.
The appeal of variant 3. is to turn the silent failures into compile-time failures but also all currently legitimate uses of bytestring literals! thereby throwing out the baby with the bathwater.
Note however that TH or QQ is not a proper replacement for the lack of string literals. Not all GHC platforms support TH (and also for those that do, TH adds quite a bit of compile-time overhead); so I'd hate to see packages starting depending on TH support for something common such as bytestring literals which should be IMO provided by GHC w/o the need for the quite heavy TH machinery.
And if we go one step further by making the instance illegal (i.e. so you can't define your own local instance), it'd effectively represent a regression to me as I couldn't rely on TH for libraries (which I strive to be portable, which means avoiding TH when possible). Consequently, I'd strongly object with to any TypeError which would primarily recommend the use of TH/QQ as replacement as I'd consider this harmful for the ecosystem.
In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio; after all, string literals are constant values; if you manage to evaluate them at least once (hint hint, test-coverage via hpc
) as part of your CI, you'll be able for force those runtime errors. And typically you don't write that many Bytestring literals anyway -- and if you do, I'd expect you to be mindful about why you're using ByteString
ltierals in the first place -- and if you're abusing ByteString
literals for encoding Unicde text instead of using text
/utf8-text
/short-text
/..., I'd have little sympathy if you're not being careful. ;-)
I could also see hlint
be able to detect silent code-point truncation when the type is trivially inferrable -- if it doesn't already support this; i.e. code like
s :: Bytestring
s = "€uro"
could be something that hlint
could heuristically detect IMO.
The PVP only doesn't work if you're dealing with irresponsible programmers/maintainers which can't be bothered to honor it.
Well, I think it's fair to raise this issue here. PVP says:
Breaking change. If any entity was removed, or the types of any entities or the definitions of datatypes or classes were changed, or orphan instances were added or any instances were removed, then the new A.B MUST be greater than the previous A.B. Note that modifying imports or depending on a newer version of another package may cause extra orphan instances to be exported and thus force a major version change.
This doesn't say anything about changing semantics (not types) of existing functions. There is no way for a maintainer to know when this has happened. In fact, even a minor version bump may introduce changed semantics of a function as part of a "bugfix" (since it's even debatable what a bug is). And if that "bug" has been "fixed" after a few years, your chances of knowing the impact are very low.
In this instance, PVP doesn't give us any guarantees whatsoever. The only proper way for that is:
- deprecation phase
- removing the function and providing a new one
This is the only way that makes a significant change in semantics visible to your users. They will see the deprecation warning in their build logs, especially with -Werror
and after a year or two they will experience compilation failure and will have to look up the documentation/ChangeLog.
In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio
I think this is the most dangerous one. An error may crash someones backend. We don't know that. Just imagine IsString
used in the context of foreign input (from a user, another http server). Now you have a DoS? :fearful:
@hasufell You're are right about the incomplete wording in the PVP! I.e. that the current wording of the PVP could be interpreted that way; even though most people have interpreted the intent behind the PVP to mean that obviously semantically observable changes are considered breaking changes. And we've been discussing already a clarification (see haskell/pvp#30) of the wording to state this detail more explicit so there's really no doubt about what the intent is (the enumerated rules merely encode the general principle from the POV of a consumer; i.e. they follow from that principle; not the other way around -- but that's a different topic). In any case, I'd suggest to take the PVP specific discussion to haskell/pvp#30 where we're trying to address this minor oversight.
In summary, I think that adding a runtime error to ISO-8859-1 string literals represents the approach which provides the incremental improvement over the status quo with the best power-to-weight ratio
I think this is the most dangerous one. An error may crash someones backend. We don't know that. Just imagine
IsString
used in the context of foreign input (from a user, another http server). Now you have a DoS?
How realistic is this scenario? For one, typically you have proper exception handling in place for critical components which take untrusted input -- we're specifically talking about the IsString
which is primarily intended for string literals; so you'd have to assume that somebody had been using code that was incorrect to begin with and silently truncated code-points of a string literal. Or you might be considering the unintended use of IsString
for non-static input from untrusted input. But then again, you'd have to assume that code would be doing input validation given that IsString
has . I feel like you're trying to come up with pathological scenarios which may be technically correct but not very common in practice. At least I can't think of code of mine I've written over the years where my components would crash and burn without deserving it if non-ISO-8859-1 ByteString literals would start triggering runtime exceptions. That being said, I'd find ways to cope if the IsString instance went missing (and without an annoying TypeError-fake-instance in its place), resulting in some busywork finding ways to express the previous concise ByteString in less verbose forms; I just fail to see whether this is worth the cost (especially since to me personally I can't remember the last time I encountered this class of error).
PS: There's still the option of leaving the IsString instance as-is and accept the truncating behaviour legitimate if for the sake of having a total function (at the cost of being non-injective). It's not like this is a "bug" of IsString ByteString
; it's a documented behaviour that's been in place for a very long-time; and yes, it may bite people; yes it's not aligned with the Builder
instance (but you can also argue that the Builder instance having been added later is at fault here for diverging); but you can just as well justify the current semantics being legitimate for a specific tradeoff.
The whole point of the discussion at hand is gauge the cost/benefit ratios of a change from the current status quo to inform whether a change is worth doing, and most importantly, all choices are merely tradeoffs; there doesn't appear to be one variant which is absolutely superior to the others to me.
Otoh, I partly blame that GHC or the core Haskell standard didn't anticipate the need to have a richer support for annotating the purpose/subtype of string literals (but then again; the Haskell Report considered a "String" merely an alias for [Char]
); other languages I've worked with had some kind of charset annotation (and most importantly without requiring the heavy machinery of something like QQ/TH for something so basic); even Python whose typesystem is in a totally different corner of the typesystem space has modifiers to annotate and distinguish regex, bytestrings and text literals...
I've been bitten by changing instances before. The last major release of time
changed the instance Read UTCTime
silently (accidentally?) to require the trailing time zone. This broke a decent amount of code across the internet with *** Prelude.read: no parse
exceptions, which are a bunch of fun to debug.
Personally, I really want the instance IsString ByteString
to do UTF-8 decoding. That's what I initially expected, and it seems like the expectation of many other people as well. It is possible that people are relying on the current behavior somehow (accidentally?), but I'd suspect that more people have silent/hidden bugs based on this truncation. That belief is based on the fact that the bug that inspired this issue was in production for quite some time before it was eventually discovered.
A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new stackage
resolver), and a runtime exception like this would be satisfactory. Especially if that runtime exception noted that a future version of the library would change the behavior to use UTF-8 encoding instead of truncation, and provided a pointer to functions that could either a) have the old truncation behavior or b) have the future encoding behavior.
Removing the instance would be costly, and I wouldn't want to do it without a TypeError
that gave you a note saying "Please just use this function
instead to fix this error." Unfortunately, we can't attach a warning to an instance - this might be a good GHC feature to add.
I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important!
Builder’s UTF8 instance is probably blaze-builder/markup legacy.
I’d hope that if we do something, it will be change to
TypeError ... => IsString ByteString/Builder
Herbert’s argument that Latin1 instance roundtrips Show instance is one in favor of it. UTF8 is somewhat arbitrary, why not UTF16/32LE/BE.
I’d also prefer not introducing any pure exceptions.
Fwiw, http://hackage.haskell.org/package/overloaded-0.2/docs/Overloaded-Symbols.html instance for ByteStrings accepts only ASCII (7bit) literals. That’s somewhat accidentally by implementation; but it’s a proof that if we really want we can make GHC to fail compile time on ”invalid” literals.
On 31. Dec 2019, at 18.08, Matt Parsons [email protected] wrote:
I've been bitten by changing instances before. The last major release of time changed the instance Read UTCTime silently (accidentally?) to require the trailing time zone. This broke a decent amount of code across the internet with *** Prelude.read: no parse exceptions, which are a bunch of fun to debug.
Personally, I really want the instance IsString ByteString to do UTF-8 decoding. That's what I initially expected, and it seems like the expectation of many other people as well. It is possible that people are relying on the current behavior somehow (accidentally?), but I'd suspect that more people have silent/hidden bugs based on this truncation. That belief is based on the fact that the bug that inspired this issue was in production for quite some time before it was eventually discovered.
A runtime exception with an informative error message would be surprising, but at least it should point to exactly where you need to fix your code. I usually expect to find some weird behavior when I do a major version upgrade (typically a new stackage resolver), and a runtime exception like this would be satisfactory. Especially if that runtime exception noted that a future version of the library would change the behavior to use UTF-8 encoding instead of truncation, and provided a pointer to functions that could either a) have the old truncation behavior or b) have the future encoding behavior.
Removing the instance would be costly, and I wouldn't want to do it without a TypeError that gave you a note saying "Please just use this function instead to fix this error." Unfortunately, we can't attach a warning to an instance - this might be a good GHC feature to add.
I wish there was a good way of reaching out and polling the community about their preferences on this - it seems pretty important!
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Just to throw out another bikeshedding idea; there is precedent for silently remapping invalid code-points in text literals, specifically text
/utf8-text
/text-short
remap invalid code-points (specifically surrogate pairs as documented in the IsString
instance documentation) to the replacement character U+FFFD
.
So instead of truncating code-points beyond the ISO-8859-1 plane to whatever low 8-bits the code-point value has; pick a specific code-point to map invalid code-points to (there's multiple potential choices here; one could e.g. pick 0xFF
which is an octet that would be easy to spot and also would make sure that the resulting bytestring literal can't be a valid UTF-8 encoding anymore as such octects are not allowed to occur in a well-formed UTF-8 string).
Btw, whatever we do, we're on a relatively long time-scale; bytestring has been on the major API version 0.9
between 2007 and 2012; and the last 7 years since 2012 we've been on major version 0.10; And I'd point out that this level of API stability is something I've grown to greatly appreciate for fundamental libraries such as bytestring
which almost every package depends upon. So any migration scheme involving multiple major versions is not something I'd expect to play out in a reasonable timeframe. Consequently, if (say) we decide to drop IsString ByteString
(with or without TypeError =>
), that's gonna happen in say bytestring-0.11.0
, end of story. There's really no point in any long-winded multi-stage process IMO. Also we should try to bundle up any other breaking changes that have been accumulated since 2012, to make good use of this costly major version increment.