text
text copied to clipboard
`decodeUtf8` / `decodeUtf8'` situation
There are a few issues with the current situation:
- Decoding/encoding from UTF-8 is a really common use case, but requires a separate import (
Data.Text.Encoding) - The best name (
decodeUtf8) is taken by the partial version of the function decodeUtf8'doesn't mean anything, you just have to know the'means it's the total version
Is there anything we could do to improve this situation? It doesn't bother me, but it's tough on people that are new to haskell. I know we're constrained by backwards compatibility here because so many packages use text, but I thought it would be worth thinking about.
The one thing we can do with little risk is improve the documentation...
Opened an issue for docs: https://github.com/haskell/text/issues/248
That can only do so much when the best names are taken up by partial variants of course. Is there interest in adding new functions with names like decodeUtf8Partial? Then (on a very long timescale) we could add deprecation notices to the decodeUtf8 style partial functions and get people to either switch to the *Partial variants or handle failure.
This would definitely improve this library. It might improve the ecosystem in general as well (since I bet a lot of people are using decodeUtf8 without realizing it can throw exceptions in pure code).
The problem I see is with the "very long timescale" -- what if the timescale is 10 years, does it still make sense to do this? as you point out text has way too many consumers which might be affected; way too much for a change which is primarily motivated by aesthetics/discoverability (which I do sympathize with though). This isn't the only instance where there's regret in a naming or type-signature choice that was done a decade ago, and where we'd just have to stick with it until there's a technically motivated excuse to fix it up presents itself...
Btw, the same argument about these functions being stuffed away in a separate module could equally be made about the I/O ops being hidden away in Data.Text.IO, couldn't it?
Partial functions are traditionally called "unsafe..." in Haskell. Please rename decodeUtf8 to unsafeDecodeUtf8.
I'd like to re-raise this issue. While it would be nice to rename decodeUtf8, I'd want to suggest a much more direct and uncomfortable approach:
- Deprecate
decodeUtf8. - Provide
unsafeDecodeUtf8with the behaviour of the currentdecodeUtf8.
I don't mind if we also rename decodeUtf8' to something nicer, but I think it's quite important that we take some action to stop people using decodeUtf8 blindly.
Let me give an example of why this is a big deal. Just recently I noticed that an upstream binary protocol library I was using used decodeUtf8 in its instance for Text. This meant that although it presented a nominally pure interface, it could in fact throw, in a way that was easy to trigger with malicious user input.
I think this is likely to be a common situation: decodeUtf8 is likely to be often used when parsing input, which can often be malicious or untrusted. As it stands, a use of decodeUtf8 anywhere in your dependency tree poses a potential risk of user-triggered exceptions in unexpected places.
This is why I think we should bite the bullet and just deprecate it. Yes, it's a big hammer, but I think it is likely that almost every use of decodeUtf8 in the wild should not be using it, and at most should be using decodeUtf8With lenientDecode.
@michaelpj Let me put it plainly. Is there a volunteer to facilitate such migration? One has to go through top N downstream dependencies (where N is > 20, ideally more like 50) and raise PRs to use decodeUtf8' or decodeUtf8With before we can deprecate decodeUtf8 without halting the world.
Hmm, I guess I'm missing something. I didn't think we needed a migration path for a deprecation. It should just result in warnings for downstream packages.
I'm not proposing to rush a removal, I agree that would be quite hard.
Sorry to post again, I'd just like to understand if it's really a problem to do a deprecation in this way? If it is I'll give up on this for now and pursue a hlint hint or something.
Sorry, @michaelpj I don't have a mental capacity to mull this over at the moment. I think deprecation is a plausible approach, but we need to plan carefully a) deprecation of decodeUtf8, b) removal of decodeUtf8, c) introducing decodeUtf8 as a synonym of decodeUtf8', d) deprecating decodeUtf8', e) removing decodeUtf8' - and each step takes a major version bump, which altogether means a decade of time.
My gut preference is to replace decodeUtf8 with decodeUtf8' in one step and employ a volunteer to update major packages. But as I said, unfortunately I cannot think properly about pros and cons right now. Maybe other maintainers have more resurces.
Off the top of my head we could consider this:
- Create two new functions with the nicest names possible given that
decodeUtf8is taken. Export these by default fromData.Text. Something likedecodeUtf8UnsafeanddecodeUtf8Either. - Deprecate
decodeUtf8anddecodeUtf8', with comments pointing at the new functions. - Resign ourselves to the fact that we won't be able to remove the deprecated functions-- at least in the 2020s-- and that's OK.