Bringing idna into the Python core library

Open NCommander opened this issue 6 years ago • 15 comments

I am looking at bringing native IDNA2008 support into the Python core library, and had a long conversation with Nathaniel J. Smith (@njsmith) and Christian Heimes (@tiran) about the work required. The end goal of this work is for Python to handle IDNs natively as first-class citizens, reusing as much existing code as possible.

To summarize the conversation so far, the first step would be implementing a new codec in Python's core library, and then extending the standard library to handle IDNs natively, so that the following code snippet could work:

#!/usr/bin/env python3
import urllib.request

# Desired behaviour: the non-ASCII hostname is accepted as-is.
req = urllib.request.Request('http://fuß.standcore.com')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode(encoding='utf-8'))

From the conversation on Zulip, the first step would be implementing idna2008 as a new encoding codec, and then work on modifying the core library to be able to accept and interoperate with IDNs seamlessly.
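
As a rough illustration of what "a new encoding codec" means mechanically, the sketch below registers a codec through the stdlib `codecs.register` hook. The codec name `idna2008_sketch` and the helper functions are hypothetical, and the body only Punycode-encodes non-ASCII labels: it performs none of the validity checks, mapping, or length limits that a real IDNA2008 implementation would need.

```python
import codecs

CODEC_NAME = "idna2008_sketch"  # hypothetical name, not a real stdlib codec


def _encode(text, errors="strict"):
    """Punycode each non-ASCII label; NO IDNA2008 validation is performed."""
    labels = []
    for label in text.split("."):
        if label.isascii():
            labels.append(label.encode("ascii"))
        else:
            labels.append(b"xn--" + label.encode("punycode"))
    return b".".join(labels), len(text)


def _decode(data, errors="strict"):
    """Reverse of _encode: turn xn-- labels back into Unicode."""
    labels = []
    for label in bytes(data).split(b"."):
        if label.lower().startswith(b"xn--"):
            labels.append(label[4:].decode("punycode"))
        else:
            labels.append(label.decode("ascii"))
    return ".".join(labels), len(data)


def _search(name):
    # codecs passes a normalized (lower-cased) name to every search function.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None


codecs.register(_search)

print("fuß.standcore.com".encode(CODEC_NAME))  # b'xn--fu-hia.standcore.com'
```

Once a codec is registered this way, `str.encode` and `bytes.decode` pick it up by name, which is the property that would let the rest of the standard library call it without importing anything extra.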

I'm willing to do much of the legwork required to get the code integrated into CPython. My first question is what blockers (if any) would make it difficult to bring the implementation into CPython, along with any tips or suggestions for moving things forward. Right now, I'm just trying to get the ball rolling on a solid plan, hopefully so that Python 3.8 can treat IDNs as first-class citizens.

NCommander avatar Jan 13 '19 02:01 NCommander

the first step would be implementing idna2008 as a new encoding codec

This is one possible approach (I think @NCommander was planning to hard-code the equivalent of uts46=True, transitional=False?), but I think the first question for the python maintainers is whether this is actually the best strategy, versus other possibilities like bringing the idna library in as a whole. So if you have any thoughts on that it'd be great!

njsmith avatar Jan 13 '19 05:01 njsmith

My biggest concern with the current implementation of the idna module is the size of the UTS46 mapping table. The uts46data data file is almost 200 kB. Importing the module consumes about 1.5 MB of RSS:

>>> import psutil, os
>>> p = psutil.Process(os.getpid())
>>> p.memory_info()
pmem(rss=15568896, vms=237490176, shared=8200192, text=8192, lib=0, data=7569408, dirty=0)
>>> import idna.uts46data
>>> p.memory_info()
pmem(rss=17170432, vms=240320512, shared=8368128, text=8192, lib=0, data=9420800, dirty=0)
>>> (17170432 - 15568896) // 1024
1564

The uts46_remap method is fairly straightforward. It's basically just a bisect search plus a couple of checks. The lookup table can easily be implemented in C and added to the unicodedata module. This would avoid boxing all the ints and strs as Python objects and would reduce RSS.
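
To make the "bisect search plus a couple of checks" idea concrete, here is a minimal Python sketch of a range-table lookup. The table rows, the status letters, and the function names `lookup` and `remap` are invented for illustration; they are not the real UTS46 data or the idna library's code.

```python
import bisect

# Toy rows, hand-written for illustration (NOT the real UTS46 data).
# Each row is (first code point of the range, status, mapping); a range runs
# until the next row starts. Status letters here: V=valid, M=mapped,
# D=deviation, X=disallowed.
TOY_ROWS = [
    (0x00, "X", None),
    (0x30, "V", None),   # digits
    (0x3A, "X", None),
    (0x41, "M", "a"),    # 'A' -> 'a'
    (0x42, "M", "b"),    # 'B' -> 'b'
    (0x43, "X", None),   # toy table: rest of the upper-case letters omitted
    (0x61, "V", None),   # a-z
    (0x7B, "X", None),
    (0xDF, "D", "ss"),   # ß: kept as-is unless transitional processing is on
    (0xE0, "X", None),
]
STARTS = [row[0] for row in TOY_ROWS]


def lookup(code_point):
    """Binary-search for the row whose range contains code_point."""
    return TOY_ROWS[bisect.bisect_right(STARTS, code_point) - 1]


def remap(text, transitional=False):
    """Apply the toy table to every character of text."""
    out = []
    for ch in text:
        _, status, mapping = lookup(ord(ch))
        if status == "V" or (status == "D" and not transitional):
            out.append(ch)
        elif status in ("M", "D"):
            out.append(mapping)
        else:
            raise ValueError(f"U+{ord(ch):04X} not allowed")
    return "".join(out)


print(remap("Ab1"), remap("fuß"), remap("fuß", transitional=True))
```

In pure Python every row is a tuple of boxed objects, which is where the memory goes; a C array of structs holding the same rows would support the same binary search without that overhead.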

Here is some code for https://github.com/kjd/idna/blob/master/tools/idna-data to dump the table to a header file:

def uts46_cranges(ucdata):
    last = (None, None)
    for cp in ucdata.codepoints():
        fields = cp.uts46_data
        if not fields:
            continue
        status, mapping = UTS46_STATUSES[fields[0]]
        if mapping:
            mapping = "".join(chr(int(codepoint, 16)) for codepoint in fields[1].split())
            # No C escaping is needed here: every byte is emitted as a \xNN escape below.
        else:
            mapping = None
        if cp.value > 255 and (status, mapping) == last:
            continue
        last = (status, mapping)

        if mapping:
            mapping = ''.join("\\x{:02X}".format(c) for c in mapping.encode('utf-8'))
            mapping = '"' + mapping + '"'
        else:
            mapping = 'NULL'

        yield "{{0x{0:X}, '{1}', {2}}}".format(cp.value, status, mapping)

def uts46_cdata(ucdata):

    yield "/* This file is automatically generated by tools/idna-data"
    yield " * vim: set fileencoding=utf-8 :\n"
    yield " * IDNA Mapping Table from UTS46."
    yield "*/ \n\n"

    yield "#include <stddef.h>"
    yield "typedef struct {long cp; char status; const char* mapping;} uts46_map_t;"
    yield "const uts46_map_t uts46_map[] = {"

    for row in uts46_cranges(ucdata):
        yield "    {0},".format(row)
    yield "};\n"

def make_cdata(args, ucdata):
    dest_dir = args.dir or '.'
    target_filename = os.path.join(dest_dir, 'uts46data.h')
    with open(target_filename, 'wb') as target:
        for line in uts46_cdata(ucdata):
            target.write((line + "\n").encode('utf-8'))

tiran avatar Jan 13 '19 13:01 tiran

I have implemented a PoC unicodedata.uts_remap function, https://github.com/python/cpython/compare/master...tiran:uts46_remap . The table is created with a modified tools/idna-data script.

tiran avatar Jan 14 '19 21:01 tiran

the first step would be implementing idna2008 as a new encoding codec

This is one possible approach (I think @NCommander was planning to hard-code the equivalent of uts46=True, transitional=False?), but I think the first question for the python maintainers is whether this is actually the best strategy,

UTS46 is obsolete. When the IETF defined IDNA2008 they got rid of all those mappings, so each Unicode U-label uniquely maps to one ASCII A-label and vice versa. I looked at every IDN registered in ICANN-contracted top-level domains (three letters and longer) and can report that the number of names that depend on IDNA2003 is well under 0.1% of the total; they are all old junk with punctuation characters like ®≤€. None that I could find are in active use, and even so none of those depend on the UTS46 mappings. So how about only loading the mappings if the user specifically asks for IDNA2003? Then it only affects the very few people who think they need them. By the way, the UTS46 mappings are now actively harmful and break correct, recently registered names. That's what the fuß.standcore.com example demonstrates.

jrlevine avatar Feb 03 '19 00:02 jrlevine

@jrlevine When you say UTS46, do you mean the UTS46 core mapping (what idna calls uts46=True, transitional=False), or the UTS46 transitional mapping (what idna calls uts46=True, transitional=True)?

The core mapping is what lets you do things like use capital letters in domain names (Google.com), and it doesn't break fuß. The transitional mapping is the one that breaks fuß. I haven't done an exhaustive check, but all the real-world users I've seen so far use uts46=True, transitional=False.
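
The ß behaviour can be reproduced with the standard library alone. This sketch assumes the built-in `idna` codec (which implements IDNA2003) is an acceptable stand-in for the transitional mapping as far as this one character is concerned; it does not exercise the idna package itself.

```python
# Stdlib only. The built-in "idna" codec implements IDNA2003 (RFC 3490),
# which, like the UTS46 transitional mapping, rewrites ß as "ss".
name = "fuß.standcore.com"

idna2003 = name.encode("idna")
print(idna2003)  # b'fuss.standcore.com'
print(idna2003.decode("idna") == name)  # False: the round trip loses the ß

# Without the mapping, the ß label is Punycode-encoded as-is; this is
# the A-label form that keeps ß distinct from "ss".
label = "fuß"
a_label = b"xn--" + label.encode("punycode")
print(a_label)  # b'xn--fu-hia'
print(a_label[4:].decode("punycode") == label)  # True
```

The two encodings name different hosts in the DNS, which is why the choice of mapping matters for names like this one.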

njsmith avatar Feb 03 '19 00:02 njsmith

Take another look at IDNA2008. That's been the standard for a decade and handles Google.com just fine. UTS46 always maps U+3002 to dot, which breaks IDNA2008 Chinese and Japanese names. Mapping accented upper or lower case characters correctly varies by language (Swedish, Turkish, etc.) and domain names don't come with language tags.

I believe that people use UTS46 but it's by default, not because it solves any current problems. ICANN has mandated IDNA2008 for all of its contracted registries, so using anything else risks breaking current and future valid names.

jrlevine avatar Feb 03 '19 02:02 jrlevine

@jrlevine From this project's readme:

>>> import idna
>>> idna.encode(u'Königsgäßchen')
...
idna.core.InvalidCodepoint: Codepoint U+004B at position 1 of 'Königsgäßchen' not allowed
>>> idna.encode('Königsgäßchen', uts46=True)
b'xn--knigsgchen-b4a3dun'
>>> print(idna.decode('xn--knigsgchen-b4a3dun'))
königsgäßchen

njsmith avatar Feb 03 '19 02:02 njsmith

You're right, but ICANN mandates IDNA2008 for all new names. If you tried to register that name with the capital K, no registry would accept it, so it hardly matters. Google.com is of course OK because it's not an IDN. I know the people involved; they're quite serious that they're only allowing IDNA2008 going forward.

jrlevine avatar Feb 03 '19 02:02 jrlevine

Anyway, I'm far from an expert in IDNA. If there's a better thing to use then I would like to know :-). If we simply disable UTS-46 entirely across the Python ecosystem though then we'll have issues like requests.get("http://Königsgäßchen.com") suddenly starting to crash complaining that "K" is an illegal character in domain names, which is going to confuse and annoy people, and make them demand some workaround.

njsmith avatar Feb 03 '19 02:02 njsmith

I happen to know the people who wrote the IDNA specs and they would prefer that UTS46 die. Its fuzzy mappings cause problems when you go from U -> A -> U' and U' isn't the same as U. It'd certainly make sense to leave UTS46 as an option for people who for some reason need the stuff it does and don't miss the names it won't let them use. If people deeply care about ASCII upper case, let me see if I can find out what Safari and Firefox, which are generally IDNA2008 compatible, do.

jrlevine avatar Feb 03 '19 03:02 jrlevine

@jrlevine I'm afraid you don't seem to understand what UTS#46 is or how IDNA2008 works. UTS#46 has two different purposes: (a) the necessary mapping phase as discussed by IDNA2008, and (b) transitional considerations between IDNA2003 and IDNA2008.

I agree that (b) is likely best avoided these days (although note that the Chromium developers appear to disagree, and they presumably have some expertise in the area). This is indeed the default setting of the Python idna library.

However (a), the mapping phase, is a vital function of the Python idna library. Without it domain names will not resolve correctly, including but not limited to the case-insensitive behaviour that users expect as discussed above. And no this isn't just a case of lower-casing "ASCII upper case", as that wouldn't even cope with trivial Western names such as "CAFÉ.FR" let alone situations such as Japanese or Cherokee names.
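
A small sketch of why lower-casing only "ASCII upper case" falls short for a name like "CAFÉ.FR". Here `str.lower()` is used purely as a stand-in for a Unicode-aware mapping step; it is not the UTS#46 mapping, and the translation table name is made up for the example.

```python
import string

# Lower-case only A-Z, leaving every non-ASCII character untouched.
ASCII_ONLY_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

label = "CAFÉ"
ascii_lowered = label.translate(ASCII_ONLY_LOWER)
fully_lowered = label.lower()  # stand-in for a Unicode-aware mapping step

print(ascii_lowered, fully_lowered)
# The two results are different strings, so they Punycode-encode differently.
print(ascii_lowered.encode("punycode"), fully_lowered.encode("punycode"))
```

The ASCII-only version still contains an upper-case É, so it would not produce the same label as the lower-case name a user expects to reach.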

I think you are attempting to relay a confused and half-understood conversation from somewhere else about UTS#46 transitional processing. If you want this library to replace the standard UTS#46 mapping with some other novel mapping of its own invention then this is a fairly extraordinary request which would require better support than "unnamed experts agree with you in private".

jribbens avatar Feb 03 '19 14:02 jribbens

@jribbens Sorry, I wasn't being very clear. IDNA2008 deliberately took the preliminary mapping out of the spec. (See RFC 5895.) Mapping operations are very language dependent, e.g., case folding in German is different from case folding in Swedish or Turkish. (Arguably, case folding in Canadian French is different from French or Belgian French.) ASCII and Arabic digits may or may not be equivalent and one or the other is preferred depending on which Arabic speaking country you're in.

UTS46 tries to be a universal mapping but it turns out that the farther you get from Western Europe, the worse it does. Its case folding is wrong for Turkish, the ZWJ are wrong for Persian, the treatment of digits is wrong for some versions of Arabic, and the whole thing is wrong for Chinese where users expect ASCII pinyin to be turned into Chinese characters.

A good implementation of IDNA2008 should use a mapping table based on the user's locale. Unfortunately, the supply of locale mappings remains pretty sparse. I suppose that you need some default so existing code with .encode('idna') doesn't break, and a default of UTS46 nontransitional at least has known wrongness, but if you're opening up the code, it would be a really good idea to put in the scaffolding to load per-locale mappings when they're available.

By the way, the experts in question are the guys who wrote RFCs 5890, 5891, 5892, and 5894.

jrlevine avatar Feb 04 '19 16:02 jrlevine

Ah ok that makes much more sense. But this library as it is doesn't prevent you from applying any mapping you like, as you can just apply that mapping and then call encode/decode with uts46=False. If your point is simply that any inclusion into the Python stdlib should retain the ability to choose to exclude UTS#46 then I agree that's sensible - but the default should be to use UTS#46 in non-transitional mode.

jribbens avatar Feb 05 '19 00:02 jribbens

@jribbens This may be too far into the weeds, but I'd rather have some way to set a mapping at the beginning of your program and then have all .encode('idna') calls use it, so if your user speaks, say, Turkish, you can provide a mapping that makes sense for Turkish, and the rest of your application will use it automagically. I realize this will make programs behave differently, but I believe that it will generally be seen as fixing bugs, like, oh, now the case folding or the joiners work the way I expect.
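
One way such scaffolding could look, as a hedged sketch only: a process-wide mapping function that is installed once and then applied before every label is encoded. All names here (`set_label_mapping`, `to_a_label`, `turkish_mapping`) are hypothetical, the default of `str.lower` is a placeholder rather than UTS46, and no IDNA validation is done.

```python
# Hypothetical scaffolding: none of these names exist in the stdlib or in idna.
_current_mapping = str.lower  # placeholder default, NOT the UTS46 mapping


def set_label_mapping(func):
    """Install the mapping applied to every label before encoding."""
    global _current_mapping
    _current_mapping = func


def to_a_label(label):
    """Map the label with the installed function, then Punycode it (no validation)."""
    mapped = _current_mapping(label)
    if mapped.isascii():
        return mapped.encode("ascii")
    return b"xn--" + mapped.encode("punycode")


def turkish_mapping(label):
    # Turkish casing: dotted İ lowers to i, dotless I lowers to ı.
    return label.replace("İ", "i").replace("I", "ı").lower()


print(to_a_label("ISTANBUL"))      # default mapping: b'istanbul'
set_label_mapping(turkish_mapping)
print(to_a_label("ISTANBUL"))      # now starts with xn--, because I became ı
```

A global hook like this is the simplest form of the idea; it also shows the trade-off raised above, since the same input now encodes differently depending on what the program installed at startup.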

jrlevine avatar Feb 05 '19 00:02 jrlevine

As a coda to this thread, having discovered it a year late due to some GitHub notification issues:

I would say having native IDNA 2008 support in Python's core is probably a natural evolution of this work, if it is considered by the core maintainers to be in scope and they are willing to maintain timely updates against new versions of Unicode. I think the status quo of having a deprecated, incompatible version of IDNA in the core, and the current version not in the core, is the worst of both worlds. Either update the core to the modern spec, or deprecate the core's IDNA codec for the old standard.

I'm not sure of the current status of the work by @NCommander, but if I can be of assistance I am happy to help.

kjd avatar Feb 25 '20 21:02 kjd

Closing this issue as there is no activity on this front, for now.

kjd avatar Nov 23 '23 04:11 kjd