Support i18n in URLs

Open torhve opened this issue 9 years ago • 9 comments

Would love support for both internationalized domain names (IDN), such as http://øl.no/, and non-ASCII paths and query args, such as http://google.com/?q=æøå or http://google.com/å

Relevant RFC: https://www.ietf.org/rfc/rfc3987.txt
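
For illustration, here is a rough Python sketch of what mapping such an IRI to a plain ASCII URI could look like. The helper name `iri_to_uri` is hypothetical, and the sketch makes simplifying assumptions: it Punycode-encodes host labels without any IDNA/UTS#46 mapping or validation, drops userinfo, and percent-encodes the remaining components as UTF-8. It is not how lpeg_patterns handles URLs.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an IRI to an ASCII-only URI (simplified)."""
    parts = urlsplit(iri)

    # Host: convert each non-ASCII label to its "xn--" ACE form.
    # Assumption: no IDNA mapping/validation is applied, only raw Punycode.
    labels = []
    for label in (parts.hostname or "").split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    netloc = ".".join(labels)
    if parts.port is not None:
        netloc += f":{parts.port}"  # userinfo is dropped in this sketch

    # Path, query, fragment: percent-encode non-ASCII characters as UTF-8.
    # "%" stays safe so existing escapes are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    for example in ("http://øl.no/", "http://google.com/?q=æøå", "http://google.com/å"):
        print(example, "->", iri_to_uri(example))
```

The host and the other components need different treatment: the host goes through Punycode, while path and query are percent-encoded byte by byte.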

torhve avatar Jun 28 '16 11:06 torhve

Normalisation for domain names is hard.

Links

  • https://curl.haxx.se/mail/lib-2016-11/0030.html
  • http://unicode.org/reports/tr46/

daurnimator avatar Nov 15 '16 13:11 daurnimator

Started work on a new module to provide the functionality required: https://github.com/daurnimator/lua-unistring

Though I'm not sure how I feel about adding a dependency to lpeg_patterns.

daurnimator avatar Nov 16 '16 15:11 daurnimator

Interesting discussion in https://tools.ietf.org/html/draft-ietf-iri-3987bis-13 (found via http://blog.jclark.com/2008/11/what-allowed-in-uri.html, thanks @jclark) about the 'ucschar' production
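
As a reference point, the 'ucschar' production in RFC 3987 is a list of code point ranges. A minimal Python sketch of a membership check follows; the names `UCSCHAR_RANGES` and `is_ucschar` are made up for this example, and the ranges are the ones from RFC 3987 itself, not from the 3987bis draft.

```python
# Code point ranges of the 'ucschar' production in RFC 3987, section 2.2.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
UCSCHAR_RANGES += [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
# Plane 14 starts at E1000 rather than E0000.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch: str) -> bool:
    """Return True if the single character ch falls in a ucschar range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


if __name__ == "__main__":
    for ch in ("a", "å", "\uE000", "\uFDD0"):
        print(f"U+{ord(ch):04X}", is_ucschar(ch))
```

Note that ASCII, the BMP private use area (which belongs to 'iprivate'), and the noncharacters U+FDD0 to U+FDEF all fall outside these ranges.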

daurnimator avatar Nov 17 '16 08:11 daurnimator

More URL problems are also detailed in: https://tools.ietf.org/html/draft-ruby-url-problem-01 and I blogged about a few a while ago: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/

There really is no good URL standard right now.

bagder avatar Nov 17 '16 08:11 bagder

> Normalisation for domain names is hard.

libicu has TR46/UTS#46 support (transitional and non-transitional), but as you said (@daurnimator), your code has to work as a plugin on systems without libicu. I mention it just for the record, since it is an 'easy' solution. libidn (=IDNA 2003) is obsolete and risky to use, and libidn2 currently lacks UTS#46. Yesterday I found idnkit-2, which has UTS#46 as well (used on Dragonfly BSD).

rockdaboot avatar Nov 17 '16 09:11 rockdaboot

I came up with this snippet that generates the IdnaMappingTable in pure lua: https://gist.github.com/daurnimator/be276c5d32329e2a9250f4aabeea48a8

The generated file is 880K, but loading it into memory seems to take up ~5.5M, which makes me think it's not a good solution.

daurnimator avatar Nov 21 '16 12:11 daurnimator

@rockdaboot do I recall you saying libidn2 had some fixes and is now a good solution?

daurnimator avatar Jan 09 '17 14:01 daurnimator

Yes, libidn2 0.14 (in Debian unstable, maybe also already in testing) has TR46 support. I condensed the mapping table so that it is now under 100k; the stripped libidn2 is now 179592 bytes. There is still room for improvement.

When using idn2_lookup_*, add either IDN2_TRANSITIONAL or IDN2_NONTRANSITIONAL to the flags to get TR46 transitional or TR46 non-transitional behavior.

Another good thing about TR46 is that you don't have to lowercase and/or NFC-normalise the input yourself: the TR46 processing does this automatically.
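
To make concrete what a caller would otherwise have to do by hand, here is a small Python sketch of a manual lowercase-plus-NFC step. The function name `approx_preprocess` is hypothetical, and `str.lower()` is only a stand-in: the real TR46 mapping is driven by the IdnaMappingTable and does more than lowercasing.

```python
import unicodedata


def approx_preprocess(domain: str) -> str:
    """Rough stand-in for the manual step: lowercase, then normalise to NFC.

    Assumption: str.lower() approximates the case mapping; the actual
    UTS#46 mapping table covers many more characters.
    """
    return unicodedata.normalize("NFC", domain.lower())


if __name__ == "__main__":
    # "A" followed by U+030A COMBINING RING ABOVE, in upper case.
    decomposed = "A\u030ALESUND.NO"
    result = approx_preprocess(decomposed)
    print([f"U+{ord(c):04X}" for c in result[:2]])
```

Without this step, the decomposed and precomposed spellings of the same name would produce different Punycode labels.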

rockdaboot avatar Jan 09 '17 15:01 rockdaboot

Today I packaged libidn2 for Arch: https://aur.archlinux.org/packages/libidn2/ and wrote bindings for Lua: https://github.com/daurnimator/lua-idn2

daurnimator avatar Jan 10 '17 14:01 daurnimator