url Expand which hostnames are considered IPv4 addresses

Follow-up to #560

I wonder if it would be possible to broaden the linked change to cover all hostnames whose final label begins with an ASCII digit. The reason is that this would match RFC 2396 from the IETF very nicely:

Hostnames take the form described in Section 3 of [RFC1034] and Section 2.1 of [RFC1123]: a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters. The rightmost domain label of a fully qualified domain name will never start with a digit, thus syntactically distinguishing domain names from IPv4 addresses

https://datatracker.ietf.org/doc/html/rfc2396#section-3.2.2

This kind of alignment is valuable for interoperability with older standards. As more software starts to use the WHATWG standard, there is more opportunity for mismatches when some subsystems use older standards, and that can be an opportunity for SSRF attacks and other bugs. So understanding the differences between the standards (and minimising them where practical/possible) is really important.

If we could broaden the change in this way, it would mean we can say with confidence that both standards agree about which hostnames are domains vs. which are IPv4. Of course, we accept more than just dotted-decimal IPv4, but we'd at least agree about what the hostname is supposed to mean. The IETF has been promising since the late 90s that TLDs won't ever begin with a digit, so it seems... maybe safe?

This change would make the "ends-in-a-number" checker accept a superset of what its current implementation accepts: more hostnames would be considered IPv4, and nothing that is currently considered IPv4 would become a domain. It's also slightly cheaper computationally. It means that URLs like http://hello.0a would fail to parse, rather than being valid as they are today.
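To make the difference concrete, here is a rough Python sketch of the two checks. It assumes my reading of the current "ends-in-a-number" steps in the URL Standard (split on ".", drop one trailing empty label, then test the last label), and it only approximates the IPv4 number parser as a success/failure test. The function names are hypothetical and not taken from any implementation.

```python
ASCII_DIGITS = "0123456789"


def parses_as_ipv4_number(part: str) -> bool:
    """Approximate the IPv4 number parser, reporting only success or failure."""
    if part == "":
        return False
    allowed = ASCII_DIGITS
    if len(part) >= 2 and part[:2] in ("0x", "0X"):
        # Hexadecimal: strip the prefix; an empty remainder counts as zero.
        part = part[2:]
        allowed = "0123456789abcdefABCDEF"
    elif len(part) >= 2 and part[0] == "0":
        # Octal: strip the leading zero.
        part = part[1:]
        allowed = "01234567"
    return all(c in allowed for c in part)


def last_label(host: str):
    """Return the final label, ignoring one trailing dot; None if there is none."""
    parts = host.split(".")
    if parts[-1] == "":
        if len(parts) == 1:
            return None
        parts.pop()
    return parts[-1]


def ends_in_a_number_current(host: str) -> bool:
    """Sketch of today's check: all-digit last label, or a valid IPv4 number."""
    last = last_label(host)
    if last is None:
        return False
    if last != "" and all(c in ASCII_DIGITS for c in last):
        return True
    return parses_as_ipv4_number(last)


def ends_in_a_number_proposed(host: str) -> bool:
    """Sketch of the proposal: the last label merely begins with an ASCII digit."""
    last = last_label(host)
    return last is not None and last != "" and last[0] in ASCII_DIGITS
```

Under these assumptions, `hello.0a` is the kind of host that changes category: the current check treats it as a domain, while the proposed check sends it to the IPv4 parser, where it fails. The proposed check also never needs to look past the first character of the last label, which is where the small cost saving comes from.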

Jan 03 '22 05:01 karwa

CC @MattMenke2 who created the original issue

Jan 03 '22 05:01 karwa

I suspect that would be more disruptive than my change (which was deliberately scoped very narrowly, and generated no bug reports for Chrome, at least). This could break both domain squatters and folks who rely on suffix search. It may be a reasonable thing to do, but I think we'd want to gather data on breakage and figure out whether the benefits are worth it.

Jan 03 '22 06:01 MattMenke2

We'd have to solve #397 first, I think.

Jan 03 '22 08:01 annevk

I missed that the rationale here was the original URI RFC. Since that's been obsoleted by something that has far fewer restrictions, I think the cat's out of the bag with respect to interoperability issues.

If this came as a recommendation from the IETF or ICANN community it would be something we should strongly consider, but as-is I don't think this is workable.

Feb 23 '23 07:02 annevk