text icon indicating copy to clipboard operation
text copied to clipboard

`decodeUtf8` / `decodeUtf8'` situation

Open seagreen opened this issue 6 years ago • 11 comments

There are a few issues with the current situation:

  • Decoding/encoding from UTF-8 is a really common use case, but requires a separate import (Data.Text.Encoding)
  • The best name (decodeUtf8) is taken by the partial version of the function
  • decodeUtf8' doesn't mean anything, you just have to know the ' means it's the total version

Is there anything we could do to improve this situation? It doesn't bother me, but it's tough on people that are new to haskell. I know we're constrained by backwards compatibility here because so many packages use text, but I thought it would be worth thinking about.

seagreen avatar Jan 20 '19 00:01 seagreen

The one thing we can do with little risk is improve the documentation...

hvr avatar Jan 20 '19 18:01 hvr

Opened an issue for docs: https://github.com/haskell/text/issues/248

seagreen avatar Jan 21 '19 19:01 seagreen

That can only do so much when the best names are taken up by partial variants of course. Is there interest in adding new functions with names like decodeUtf8Partial? Then (on a very long timescale) we could add deprecation notices to the decodeUtf8 style partial functions and get people to either switch to the *Partial variants or handle failure.

This would definitely improve this library. It might improve the ecosystem in general as well (since I bet a lot of people are using decodeUtf8 without realizing it can throw exceptions in pure code).

seagreen avatar Jan 21 '19 19:01 seagreen

The problem I see is with the "very long timescale" -- what if the timescale is 10 years, does it still make sense to do this? as you point out text has way too many consumers which might be affected; way too much for a change which is primarily motivated by aesthetics/discoverability (which I do sympathize with though). This isn't the only instance where there's regret in a naming or type-signature choice that was done a decade ago, and where we'd just have to stick with it until there's a technically motivated excuse to fix it up presents itself...

Btw, the same argument about these functions being stuffed away in a separate module could equally be made about the I/O ops being hidden away in Data.Text.IO, couldn't it?

hvr avatar Jan 21 '19 22:01 hvr

Partial functions are traditionally called "unsafe..." in Haskell. Please rename decodeUtf8 to unsafeDecodeUtf8.

cblp avatar Sep 24 '19 17:09 cblp

I'd like to re-raise this issue. While it would be nice to rename decodeUtf8, I'd want to suggest a much more direct and uncomfortable approach:

  1. Deprecate decodeUtf8.
  2. Provide unsafeDecodeUtf8 with the behaviour of the current decodeUtf8.

I don't mind if we also rename decodeUtf8' to something nicer, but I think it's quite important that we take some action to stop people using decodeUtf8 blindly.

Let me give an example of why this is a big deal. Just recently I noticed that an upstream binary protocol library I was using used decodeUtf8 in its instance for Text. This meant that although it presented a nominally pure interface, it could in fact throw, in a way that was easy to trigger with malicious user input.

I think this is likely to be a common situation: decodeUtf8 is likely to be often used when parsing input, which can often be malicious or untrusted. As it stands, a use of decodeUtf8 anywhere in your dependency tree poses a potential risk of user-triggered exceptions in unexpected places.

This is why I think we should bite the bullet and just deprecate it. Yes, it's a big hammer, but I think it is likely that almost every use of decodeUtf8 in the wild should not be using it, and at most should be using decodeUtf8With lenientDecode.

michaelpj avatar Aug 15 '21 16:08 michaelpj

@michaelpj Let me put it plainly. Is there a volunteer to facilitate such migration? One has to go through top N downstream dependencies (where N is > 20, ideally more like 50) and raise PRs to use decodeUtf8' or decodeUtf8With before we can deprecate decodeUtf8 without halting the world.

Bodigrim avatar Aug 15 '21 16:08 Bodigrim

Hmm, I guess I'm missing something. I didn't think we needed a migration path for a deprecation. It should just result in warnings for downstream packages.

I'm not proposing to rush a removal, I agree that would be quite hard.

michaelpj avatar Aug 15 '21 16:08 michaelpj

Sorry to post again, I'd just like to understand if it's really a problem to do a deprecation in this way? If it is I'll give up on this for now and pursue a hlint hint or something.

michaelpj avatar Aug 23 '21 13:08 michaelpj

Sorry, @michaelpj I don't have a mental capacity to mull this over at the moment. I think deprecation is a plausible approach, but we need to plan carefully a) deprecation of decodeUtf8, b) removal of decodeUtf8, c) introducing decodeUtf8 as a synonym of decodeUtf8', d) deprecating decodeUtf8', e) removing decodeUtf8' - and each step takes a major version bump, which altogether means a decade of time.

My gut preference is to replace decodeUtf8 with decodeUtf8' in one step and employ a volunteer to update major packages. But as I said, unfortunately I cannot think properly about pros and cons right now. Maybe other maintainers have more resurces.

Bodigrim avatar Aug 23 '21 18:08 Bodigrim

Off the top of my head we could consider this:

  1. Create two new functions with the nicest names possible given that decodeUtf8 is taken. Export these by default from Data.Text. Something like decodeUtf8Unsafe and decodeUtf8Either.
  2. Deprecate decodeUtf8 and decodeUtf8', with comments pointing at the new functions.
  3. Resign ourselves to the fact that we won't be able to remove the deprecated functions-- at least in the 2020s-- and that's OK.

seagreen avatar Aug 23 '21 21:08 seagreen