
Allow Unicode in nicknames

Open ttepasse opened this issue 9 years ago • 37 comments

RFC 1459 only allows ASCII letters, numerals and some special characters in nicknames, leaving people from non-anglophone countries at a disadvantage. Using the wealth of human writing is possible in the body of messages; it should be possible in nicknames too.

ttepasse avatar May 03 '16 16:05 ttepasse

There are existing implementations of this (e.g. InspIRCd's m_nationalchars) but nothing standard. I believe that @DanielOaks was looking into trialling RFC 3454 in @mammon-ircd with a desire for standardising it though.

It isn't as simple as just allowing it though. Compatibility is a concern (there are clients which break when they get a CASEMAPPING which is not ascii or rfc1459) as well as masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").
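To make the look-alike problem concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names `scripts_in` and `is_mixed_script`, and the shortcut of treating the first word of a character's Unicode name as its script, are assumptions made for illustration; they are not how any IRCd mentioned in this thread works.

```python
import unicodedata

LATIN_A = "a"          # code point 97
CYRILLIC_A = "\u0430"  # code point 1072, renders like the Latin letter


def scripts_in(nick):
    """Crude script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in nick
        if ch.isalpha()
    }


def is_mixed_script(nick):
    """True when letters from more than one script appear in the nick."""
    return len(scripts_in(nick)) > 1


# Compatibility normalisation on its own does not merge the two letters.
assert unicodedata.normalize("NFKC", CYRILLIC_A) != LATIN_A

spoofed = "p" + CYRILLIC_A + "ypal"
print(sorted(scripts_in(spoofed)), is_mixed_script(spoofed))
```

A mixed-script check like this is only a heuristic: it flags legitimate mixed-script names and misses look-alikes that sit within a single script.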

SadieCat avatar May 03 '16 16:05 SadieCat

There are also cases of servers improperly implementing rfc1459 vs. strict-rfc1459 (see inspircd/inspircd#1017).
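For reference, the difference between the traditional CASEMAPPING values can be sketched in a few lines of Python. The function name `fold` is hypothetical; the tables follow the usual definitions, where `rfc1459` treats `{}|^` as the lowercase of `[]\~` and `strict-rfc1459` leaves `~` and `^` alone.

```python
import string

_UPPER = string.ascii_uppercase
_LOWER = string.ascii_lowercase

# One translation table per CASEMAPPING token.
_TABLES = {
    "ascii": str.maketrans(_UPPER, _LOWER),
    "strict-rfc1459": str.maketrans(_UPPER + "[]\\", _LOWER + "{}|"),
    "rfc1459": str.maketrans(_UPPER + "[]\\~", _LOWER + "{}|^"),
}


def fold(name, casemapping="rfc1459"):
    """Lowercase a nick or channel name under the given casemapping."""
    return name.translate(_TABLES[casemapping])


print(fold("Nick[A]~", "rfc1459"))
print(fold("Nick[A]~", "strict-rfc1459"))
```

A server that advertises one token but folds with another table will disagree with its clients about whether two names are the same.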

Ideally, wouldn't we want this to match how it is done for channel names?

clokep avatar May 03 '16 16:05 clokep

For what it's worth, I made a test branch for hexchat supporting rfc3454, though afaik no network implements it, so there is nowhere to try it.

TingPing avatar May 03 '16 17:05 TingPing

For what it's worth, we just experimented a bit on moznet, and things like the zero-width space character are accepted as a valid room name, which shows up as empty in whois:

(Additionally, there's also a channel which is just the prefix, #, which is a bit funky.)

[Screenshot, 2016-05-03 at 2:06:50 pm]
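A server or client could catch names like these with a check along the following lines. This is a sketch under assumptions: the function names are hypothetical, the set of "invisible" general categories is a judgment call, and whether a bare prefix such as `#` should be refused at all is a policy decision.

```python
import unicodedata

# General categories that draw nothing on their own: format characters,
# controls, separators and combining marks.
INVISIBLE_CATEGORIES = {"Cf", "Cc", "Zs", "Zl", "Zp", "Mn", "Me"}


def has_visible_character(name):
    """True if at least one character would actually render."""
    return any(
        unicodedata.category(ch) not in INVISIBLE_CATEGORIES for ch in name
    )


def channel_name_is_displayable(channel, prefixes="#&"):
    """Ignore one leading channel prefix, then require visible content."""
    body = channel[1:] if channel[:1] in prefixes else channel
    return has_visible_character(body)


print(channel_name_is_displayable("#\u200b"))  # zero-width space only
print(channel_name_is_displayable("#irc"))
```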

clokep avatar May 03 '16 18:05 clokep

Relevant reading:

UTR 36: Unicode Security Considerations

UTS 39: Unicode Security Mechanisms

Bitlbee has a 'utf8_nicks' setting, disabled by default and with a small warning about potential breakage in the help text. It doesn't perform any cleanup, deferring that to the IM server (XMPP for example cleans them with the nodeprep/resourceprep profiles of stringprep), but I'd really like to change this.

I haven't heard of clients with big issues when enabling this, just minor visual issues like miscalculating the width when displaying the nicks in a terminal.

dequis avatar May 03 '16 19:05 dequis

How are we going to maintain the backward compatibility? I'd upvote this otherwise

MicroDroid avatar May 03 '16 21:05 MicroDroid

In practice, it's probably already compatible because many clients don't care.

grawity avatar May 03 '16 21:05 grawity

Hmm, well then this should really be in IRCv3.2, it's awesome

MicroDroid avatar May 03 '16 22:05 MicroDroid

masquerading with characters that look similar (e.g. character 97 "a" looks very similar to character 1072 "а").

With rfc3454 casemapping I believe we use the nameprep profile to prevent issues like this. It would be good to read through documents like those in detail to make sure we do things right if we're standardising it though.

So long as you continue to disallow characters that break the protocol (i.e. commas, periods in client names, etc), and reject nicks/channel names that fail to casefold (i.e. strings that fail because they contain a character prohibited by the profile), I haven't seen too many issues with it.
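The rule described above (refuse protocol-breaking characters, then refuse anything that fails to casefold) might look roughly like this in Python, using the standard library's `encodings.idna.nameprep`. The function name `casefold_nick` and the exact contents of `PROTOCOL_BREAKING` are assumptions for illustration, not taken from any existing server.

```python
from encodings.idna import nameprep

# Assumption: an illustrative set of characters that break IRC framing
# or hostmask matching when they appear in a nickname.
PROTOCOL_BREAKING = set(" ,*?!@.")


def casefold_nick(nick):
    """Return the folded form of nick, or None if it must be rejected."""
    if not nick or any(ch in PROTOCOL_BREAKING for ch in nick):
        return None
    try:
        folded = nameprep(nick)
    except UnicodeError:
        # The nick contains a character the profile prohibits.
        return None
    # Nicks made only of characters that map to nothing are rejected too.
    return folded or None


print(casefold_nick("\u00c4rger") == casefold_nick("\u00e4rger"))
print(casefold_nick("evil\u202enick"))  # right-to-left override
```

Two nicks would then count as the same user exactly when their folded forms are equal.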

DanielOaks avatar May 03 '16 22:05 DanielOaks

In charybdis, we plan to implement rfc7700 "casemapping", which is the same as rfc3454 nameprep except using IDN2008 rules, with specific requirements for "nicknames".

kaniini avatar May 03 '16 23:05 kaniini

How are we going to maintain the backward compatibility?

It is a joke, but it is a solution: convert it to punycode (or similar) for non-Unicode clients.
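As a sketch of what that would involve, Python's built-in `punycode` codec can do the conversion. The helper names are hypothetical, and nothing in this thread specifies how such a form would be signalled on the wire.

```python
def to_ascii_nick(nick):
    """Encode a Unicode nick as plain ASCII using punycode."""
    return nick.encode("punycode").decode("ascii")


def from_ascii_nick(ascii_nick):
    """Reverse of to_ascii_nick."""
    return ascii_nick.encode("ascii").decode("punycode")


encoded = to_ascii_nick("b\u00fccher")
print(encoded, from_ascii_nick(encoded))
```

Without a marker comparable to the `xn--` prefix used by IDNA, a client could not tell an encoded nick from an ordinary ASCII one.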

In practice, it's probably already compatible because many clients don't care.

Not sure what you mean by that, many clients respect the casemapping and rely upon its behavior.

TingPing avatar May 04 '16 00:05 TingPing

@kaniini That makes sense, once it's implemented/specced out give me a yell and I can see about switching my personal stuff over to use it as well.

DanielOaks avatar May 04 '16 00:05 DanielOaks

How are we going to maintain the backward compatibility?

There is no plan in charybdis for backwards compatibility. Deployments which switch from rfc1459 to rfc7700 casemapping will assume clients support UTF-8 properly. Networks will decide on their own when to make the switch, or whether to make it at all.

kaniini avatar May 04 '16 01:05 kaniini

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

What if the client is configured to use not-UTF-8-charset?

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild? (ref: https://github.com/weechat/weechat/issues/79)

Mikaela avatar May 04 '16 06:05 Mikaela

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

Presumably the same as it does with the latin alphabet.

What if the client is configured to use not-UTF-8-charset?

I think detecting a specifically UTF-8-based casemapping from the server should make the client default to using UTF-8, if they're not already. If the user decides not to use it, they may get corrupted characters, just like what happens today when two clients using utf8 and non-utf8 try to send weird characters to each other.

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild?

Then it will not show those characters because your client (or the system you're using it on) does not work properly. I don't think this is an issue for us to worry about, it's a bug that will get fixed by more distros over time, and I especially think will be fixed enough for us to not care about it by the time a unicode casemapping actually gets into proper usage.

DanielOaks avatar May 04 '16 06:05 DanielOaks

I fully support moving away from legacy rfc1459 towards rfc7700.

attilamolnar avatar May 04 '16 08:05 attilamolnar

How would tab completion work if someone used a nick on international channel that is not in latin alphabet?

Not sure if possible, but ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab>[Attila]).

What if the client is configured to use not-UTF-8-charset?

Clients which support CASEMAPPING=rfc7700 would always decode nicknames as UTF-8, regardless of the configured message encoding.

Existing clients would work the same way they already do when someone sends a UTF-8 message (i.e. some would detect UTF-8 anyway, others would mis-decode it as ISO-8859-42 or whatever such).
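A client-side sketch of that split, assuming a much-simplified line parser: the nickname is decoded as UTF-8 whenever the server advertises a Unicode casemapping, while the message body keeps the user's configured encoding. The function names and the `UTF8_CASEMAPPINGS` set are assumptions; the token values are the ones discussed in this thread, not a published standard.

```python
# Assumption: CASEMAPPING values that imply UTF-8 nicknames.
UTF8_CASEMAPPINGS = {"rfc7700", "rfc3454"}


def casemapping_from_isupport(tokens):
    """Pick CASEMAPPING out of RPL_ISUPPORT (005) tokens, default rfc1459."""
    for token in tokens:
        key, _, value = token.partition("=")
        if key == "CASEMAPPING" and value:
            return value
    return "rfc1459"


def decode_privmsg(raw, casemapping, message_encoding):
    """Split ':nick!user@host PRIVMSG target :text' and decode each part."""
    prefix, _, rest = raw.rstrip(b"\r\n").partition(b" ")
    _, _, text = rest.partition(b" :")
    nick_bytes = prefix.lstrip(b":").split(b"!", 1)[0]
    nick_encoding = (
        "utf-8" if casemapping in UTF8_CASEMAPPINGS else message_encoding
    )
    nick = nick_bytes.decode(nick_encoding, errors="replace")
    return nick, text.decode(message_encoding, errors="replace")


line = (
    b":" + "\u0105\u00e3\u00e5".encode("utf-8")
    + b"!u@h PRIVMSG #chan :" + "h\u00e9llo".encode("latin-1") + b"\r\n"
)
print(decode_privmsg(line, "rfc7700", "latin-1"))
```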

What if the person using UTF-8 nick uses something that my client cannot show due to old glibc in my system which is in the wild?

🤷

I guess it'd be less likely to happen if only "word" characters were accepted, similar to how Python etc. filter characters allowed in variable names.
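Python exposes its own identifier rule as `str.isidentifier()`, so the idea can be tried directly. The wrapper name `is_wordlike_nick` is hypothetical, and this is only a demonstration of the filter, not a proposed nickname grammar.

```python
def is_wordlike_nick(nick):
    """Accept only strings Python itself would allow as a variable name."""
    return nick.isidentifier()


for candidate in ["\u0105\u00e3\u00e5", "\u3042", "a\u200bb", "\U0001F937", "1abc"]:
    print(repr(candidate), is_wordlike_nick(candidate))
```

The rule is stricter than current practice in one respect: it would also refuse `-`, `[` and other punctuation that ASCII nicknames use today.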

grawity avatar May 04 '16 11:05 grawity

ideally a<Tab> would also include nicks beginning with ą ã å あ etc., similar to how in some clients it already skips over any leading punctuation (a<Tab>[Attila]).

So long as the client takes the casefolding into account when evaluating tab-complete matches, should work without an issue I'd imagine.
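A minimal client-side sketch of such matching, using only `unicodedata`. It goes one step beyond casefolding by also dropping accents and leading punctuation, as suggested in the quoted comment; the function names are hypothetical.

```python
import unicodedata


def completion_key(nick):
    """Fold a nick for matching: drop leading punctuation, accents, case."""
    start = 0
    # Skip leading punctuation and symbols such as "[" or "_".
    while start < len(nick) and unicodedata.category(nick[start])[0] in "PS":
        start += 1
    decomposed = unicodedata.normalize("NFKD", nick[start:])
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def complete(prefix, nicks):
    """Return the nicks whose folded form starts with the folded prefix."""
    wanted = completion_key(prefix)
    return [n for n in nicks if completion_key(n).startswith(wanted)]


print(complete("a", ["\u0105\u00e3\u00e5", "[Attila]", "\u3042", "bob"]))
```

Accent stripping covers ą, ã and å, but not あ: matching a kana against a Latin prefix would need transliteration, which is outside what Unicode normalization provides.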

DanielOaks avatar May 04 '16 11:05 DanielOaks

Maybe we can do some math in the IRC server to create an alias and send to the client? so the client uses the alias to complete the actual nick?

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa<Tab> → ąãå

Or, the math part might be left up to the clients, as the whole thing is really client side anyways.

MicroDroid avatar May 04 '16 13:05 MicroDroid

I'd suggest it's just up to the clients to implement tab completion in a sane manner. User interfaces shouldn't be specced in a protocol.

clokep avatar May 04 '16 13:05 clokep

Right. So either way this problem is avoidable.

MicroDroid avatar May 04 '16 13:05 MicroDroid

Hmm, how do people use tab-completion in the existing ISO-2022-JP networks?

grawity avatar May 04 '16 13:05 grawity

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa<Tab> → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with every variation of ąãå that the IRCd would interpret as aaa, which would get very confusing? This is why I gave :-1: to your comment.
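The collision is easy to demonstrate. In this sketch `ascii_alias` stands in for the "math" the server would do (strip accents, then casefold), which is an assumption, since the proposal never defines it.

```python
import unicodedata
from collections import defaultdict


def ascii_alias(nick):
    """Hypothetical server-side alias: strip accents, then casefold."""
    decomposed = unicodedata.normalize("NFKD", nick)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def alias_collisions(nicks):
    """Group nicks by alias and keep only aliases shared by several nicks."""
    groups = defaultdict(list)
    for nick in nicks:
        groups[ascii_alias(nick)].append(nick)
    return {alias: same for alias, same in groups.items() if len(same) > 1}


print(alias_collisions(["\u0105\u00e3\u00e5", "aaa", "\u00e5\u00e5\u00e5", "bob"]))
```

Every nick in a colliding group would answer to the same alias, so the alias cannot identify a single user on its own.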

Mikaela avatar May 04 '16 15:05 Mikaela

Like if the nick is ąãå, then the IRC server does some math and create aaa out of it, and send to the client during NAMES or something

This way the user can just put like aa<Tab> → ąãå

Wouldn't this just mean that aaa and ąãå were the same nick, along with every variation of ąãå that the IRCd would interpret as aaa, which would get very confusing? This is why I gave :-1: to your comment.

Another idea could be to treat it like capitalizations? ąãå == aaa == AAA but it's not translated to the same characters.

RyanSquared avatar May 04 '16 15:05 RyanSquared

Proposed client behaviour would be in a non-normative part of the spec at best, so it's not even worth bothering with. I suspect with the way this discussion is going, this will be an area where the IRCv3 process fails us and we just form a coalition of IRCd vendors to make it happen, and then IRCv3 maybe documents it after the point.

kaniini avatar May 04 '16 15:05 kaniini

So business as usual, then?

grawity avatar May 04 '16 16:05 grawity

Pretty much what @kaniini says. It's not a huge issue to worry about.

DanielOaks avatar May 04 '16 18:05 DanielOaks

I'd still be concerned about breakage - even clients which support UTF-8 messages likely have made assumptions about nicknames, particularly any clients which support tab completion or which maintain a cached member list for channels for some purpose. I'd be afraid that this is likely to expose a lot of undefined behaviors around input sanitization of nicknames received from the server (or the lack thereof).

Some possible manifestations of incompatibility with UTF-8 nicknames

  • Commands applied to the wrong user
  • Broken tab completion
  • "null" users in internal caches
  • garbled characters
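As an illustration of the cache failure mode, here is a sketch of a member list keyed by folded nick. `MemberCache` and both fold functions are hypothetical, and NFKC plus `str.casefold()` is only a stand-in for whatever casemapping the server would really advertise.

```python
import string
import unicodedata

_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_fold(nick):
    """What a client that assumes ASCII nicknames might do."""
    return nick.translate(_ASCII_LOWER)


def unicode_fold(nick):
    """Stand-in for a Unicode-aware casemapping."""
    return unicodedata.normalize("NFKC", nick).casefold()


class MemberCache:
    """Channel member list keyed by folded nick."""

    def __init__(self, fold):
        self._fold = fold
        self._members = {}

    def add(self, nick):
        self._members[self._fold(nick)] = nick

    def lookup(self, nick):
        # Returns None when the folded key is unknown: the "null" user.
        return self._members.get(self._fold(nick))


old_client = MemberCache(ascii_fold)
old_client.add("\u00c4rger")
print(old_client.lookup("\u00e4rger"))

new_client = MemberCache(unicode_fold)
new_client.add("\u00c4rger")
print(new_client.lookup("\u00e4rger"))
```

A client whose folding disagrees with the server's ends up with missing or duplicate entries, which is one route to commands landing on the wrong user.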

Some of these issues already exist today with channel names and chat messages, but nicknames are more fundamental, as they are identifiers that the client absolutely has to deal with correctly. If a channel name breaks a client, the user can avoid that channel, but a user can't necessarily choose to avoid all users with UTF-8 nicknames.

There's also a severe usability concern that needs to be addressed - a channel operator MUST be able to quickly and unambiguously specify nicknames for use in commands with only keyboard input, regardless of what language's characters might happen to be in those nicknames. Even if that client properly supports UTF-8 nicknames, if the use of such nicknames complicates the effective management of channels in the slightest, then user acceptance of internationalized nicknames will either be dead in the water as a feature users rebel against, or there will be demands for restrictive channel modes to prohibit all internationalized nicknames on a channel.

(Yes, I realize that in most cases, a user has access to a GUI, tab completion, or copy/paste, but there is no guarantee of this - there are environments where none of these will be a viable option. Tab completion, for example, often requires the user specify at least a partial match, or requires them to iterate through every nickname on the channel, copy/paste may not be available if the user is at an actual console session rather than running a terminal inside a GUI, GUI userlists aren't available in a terminal, and so on.)

sdaugherty avatar Jul 28 '16 11:07 sdaugherty

rfc7700, when properly implemented, handles all of those issues and more. have you read it?

kaniini avatar Jul 29 '16 18:07 kaniini

I have, and it is so extremely light on practical details about exactly how it would be implemented within the IRC protocol that it leaves more questions than answers. While IRC is mentioned as a possible application, aside from that mention, the rest of the RFC consists of a set of guidelines that can be generically applied to problems inherent with nickname internationalization across a wide variety of existing and future protocols.

While the specifications set out in the RFC address a number of potential issues, the lack of any formal guidance on how to integrate them into the IRC protocol, combined with a lack of IRC-specific recommendations, effectively makes it nothing more than a building block, and my concerns above, from a user standpoint, about IRC-specific implementation details remain at most partially addressed by RFC7700.

Of more concern, there are some security considerations that should be readily apparent to any long time user of IRC, which are not mentioned - specifically, the potential for disruption if the effective use of channel management and ignore functionality is obstructed or defeated by internationalized nicknames. This is especially important here because users might first have to learn how to deal with inputting i18n nicknames while under the pressure of an ongoing disruption or attack.

Any demonstration or reference implementations will have to be especially aware of these and other considerations, to avoid an implementation that is perceived as creating more problems than it solves.

sdaugherty avatar Jul 29 '16 23:07 sdaugherty