lpeg_patterns
Support i18n in URLs
Would love support for both IDN-encoded domains (e.g. http://øl.no/) and encoded paths and query args (e.g. http://google.com/?q=æøå or http://google.com/å).
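For reference, here is a minimal Python sketch of what the ASCII forms of those example URLs look like. It is only an illustration, not a proposal for lpeg_patterns itself: `iri_to_uri` is a hypothetical helper name, and Python's built-in `idna` codec implements the older IDNA 2003 rules, so it is assumed to be good enough for these simple labels only.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper, sketch only)."""
    parts = urlsplit(iri)
    # Host: encode each label with Punycode via the stdlib IDNA 2003 codec.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Path, query, fragment: percent-encode the UTF-8 bytes of non-ASCII
    # characters, leaving existing delimiters and %-escapes untouched.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="/?%")
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri("http://øl.no/"))             # http://xn--l-4ga.no/
print(iri_to_uri("http://google.com/?q=æøå"))  # http://google.com/?q=%C3%A6%C3%B8%C3%A5
print(iri_to_uri("http://google.com/å"))       # http://google.com/%C3%A5
```

The sketch drops any userinfo part and does no normalisation of the host beyond what the codec does.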
Relevant RFC: https://www.ietf.org/rfc/rfc3987.txt
Normalisation for domain names is hard.
Links
- https://curl.haxx.se/mail/lib-2016-11/0030.html
- http://unicode.org/reports/tr46/
Started work on a new module to provide the functionality required: https://github.com/daurnimator/lua-unistring
Though I don't know how I feel about adding a dependency to lpeg_patterns.
Interesting discussion in https://tools.ietf.org/html/draft-ietf-iri-3987bis-13 (found via http://blog.jclark.com/2008/11/what-allowed-in-uri.html, thanks @jclark) about the 'ucschar' production
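For context, the 'ucschar' production as defined in RFC 3987 itself (not the 3987bis draft, which may differ) is just a set of code point ranges. A minimal Python sketch of a membership test, with `is_ucschar` as a hypothetical name:

```python
# ucschar ranges as listed in RFC 3987, section 2.2.
UCSCHAR_RANGES = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD.
UCSCHAR_RANGES += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
# Plane 14 starts later: %xE1000-EFFFD.
UCSCHAR_RANGES.append((0xE1000, 0xEFFFD))


def is_ucschar(ch):
    """Return True if the single character ch matches RFC 3987 ucschar."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR_RANGES)


print(is_ucschar("ø"))       # True
print(is_ucschar("a"))       # False (plain ASCII is covered by other productions)
print(is_ucschar("\ue000"))  # False (private use, covered by iprivate)
```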
More URL problems are also detailed in: https://tools.ietf.org/html/draft-ruby-url-problem-01 and I blogged about a few a while ago: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
There really is no good URL standard right now.
> Normalisation for domain names is hard.
libicu has TR46/UTS#46 support (transitional and non-transitional), but as you said (@daurnimator), your code has to work as a plugin on systems without libicu. I mention it for the record, since it is an 'easy' solution. libidn (=IDNA 2003) is obsolete and risky to use, and libidn2 currently lacks UTS#46. Yesterday I found idnkit-2, which has UTS#46 as well (used on DragonFly BSD).
I came up with this snippet that generates the IdnaMappingTable in pure Lua: https://gist.github.com/daurnimator/be276c5d32329e2a9250f4aabeea48a8
The generated file is 880K, but loading it into memory seems to take up ~5.5M, which makes me think it's not a good solution.
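One common way to keep a table like this small in memory is to store sorted code point ranges and binary-search them, rather than keeping a separate entry per code point. The Python sketch below is an assumption about how that could look, not a description of what the gist does. The four rows are illustrative entries in the style of IdnaMappingTable.txt, not a copy of the real file, and `status` is a hypothetical name.

```python
from bisect import bisect_right

# (first code point, last code point, status), sorted by first code point.
RANGES = [
    (0x0041, 0x005A, "mapped"),     # A-Z are mapped to lowercase
    (0x0061, 0x007A, "valid"),      # a-z
    (0x00AD, 0x00AD, "ignored"),    # SOFT HYPHEN
    (0x00DF, 0x00DF, "deviation"),  # LATIN SMALL LETTER SHARP S
]
STARTS = [first for first, _, _ in RANGES]


def status(ch):
    """Look up the status of ch, or None if no range in this sample covers it."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None


print(status("A"))  # mapped
print(status("ß"))  # deviation
print(status("["))  # None
```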
@rockdaboot do I recall you saying libidn2 had some fixes and is now a good solution?
Yes, libidn2 0.14 (in Debian unstable, maybe also already in testing) has TR46 support. I condensed the mapping table so that it is now under 100k, and the stripped libidn2 is now 179592 bytes. There is still room for improvement.
When using idn2_lookup_*, add either IDN2_TRANSITIONAL or IDN2_NONTRANSITIONAL to the flags to get TR46 transitional or TR46 non-transitional behavior.
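To illustrate what the two modes change, here is a small Python sketch of how the four TR46 "deviation" characters are treated. It does not call libidn2 and is not its API. `map_deviations` is a hypothetical name, and real TR46 processing does much more (mapping, normalisation, validation, Punycode).

```python
# The four deviation characters in UTS #46 and their transitional mappings.
DEVIATIONS = {
    "\u00df": "ss",      # LATIN SMALL LETTER SHARP S
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u200c": "",        # ZERO WIDTH NON-JOINER is removed
    "\u200d": "",        # ZERO WIDTH JOINER is removed
}


def map_deviations(domain, transitional):
    """Apply only the deviation-character step of TR46 (sketch)."""
    if not transitional:
        # Non-transitional processing keeps deviation characters as they are.
        return domain
    return "".join(DEVIATIONS.get(ch, ch) for ch in domain)


print(map_deviations("faß.de", transitional=True))   # fass.de
print(map_deviations("faß.de", transitional=False))  # faß.de
```

So the same input can resolve to two different names depending on which flag is passed.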
Another good thing about TR46 is that you don't have to lowercase and/or NFC-normalise the input yourself: the TR46 processing does this automatically.
Today I packaged libidn2 for Arch: https://aur.archlinux.org/packages/libidn2/ and wrote bindings for Lua: https://github.com/daurnimator/lua-idn2