Links with "_" in the domain name are not regarded as links

Open ZibanPirate opened this issue 4 years ago • 2 comments

what is the issue?

Links with "_" in the domain name, for example:

  • https://api_stage.dzcode.io
  • https://api_stage.dz_code.io

are not recognized as links, which is incorrect; see: https://stackoverflow.com/a/2183140/8113942

The same goes for fuzzy links, for example:

  • api_stage.dzcode.io
  • api_stage.dz_code.io
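
The difference being reported can be sketched with a small, self-contained check. This is only an illustration: `LETTER`, `labelPattern` and `hostRegExp` are hypothetical helpers, and the ASCII-only pattern is a simplified stand-in for linkify-it's real, Unicode-aware patterns.

```javascript
// Simplified, ASCII-only stand-in for the characters of a domain label.
// (Assumption: linkify-it's actual patterns are more involved.)
const LETTER = '[a-z0-9]';

// Build a label pattern; `inner` lists what may appear between the first
// and last character of a label (for example '-' or '-|_').
function labelPattern(inner) {
  return `(?:${LETTER}(?:${inner}|${LETTER}){0,61}${LETTER}|${LETTER})`;
}

// A host is two or more dot-separated labels, anchored to the whole string.
function hostRegExp(inner) {
  const label = labelPattern(inner);
  return new RegExp(`^(?:${label}\\.)+${label}$`, 'i');
}

const strictHost = hostRegExp('-');     // hyphen only
const relaxedHost = hostRegExp('-|_');  // hyphen or underscore

const hosts = ['api-stage.dzcode.io', 'api_stage.dzcode.io', 'api_stage.dz_code.io'];
const results = hosts.map((host) => ({
  host,
  strict: strictHost.test(host),   // true, false, false
  relaxed: relaxedHost.test(host), // true, true, true
}));
console.log(results);
```

With the hyphen-only label, both underscore hosts from this issue are rejected; allowing "_" between the first and last character of a label accepts them.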

ZibanPirate avatar Jan 17 '21 18:01 ZibanPirate

As far as I've been able to research, api_stage.dzcode.io is an alias for api-stage.dzcode.io, and dz_code.io simply isn't a thing.

Please provide an example of widely used domains with underscores in them.

Underscores in domain names are very rare because:

Linkify-it isn't meant to find every single link (which is impossible), so we have to restrict ourselves to the most common cases. I'm not sure domains with underscores are worth supporting, especially given the potential for false positives if underscores were allowed in fuzzy links.
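
One possible shape of such a false positive can be sketched as follows. This is a guess at the concern, not something spelled out in this thread, and the matcher is a toy: `fuzzyHosts` and the three-entry `TLDS` list are made up for illustration and are not linkify-it's algorithm. The example relies on `.py` being both a common file extension and Paraguay's country-code TLD.

```javascript
// Toy fuzzy matcher: one or more labels followed by a TLD from a tiny,
// made-up list. (Assumption: illustration only, not linkify-it's logic.)
const TLDS = 'com|io|py';

function fuzzyHosts(text, inner) {
  const letter = '[a-z0-9]';
  // A label starts and ends with a letter/digit; `inner` lists the extra
  // characters allowed in between.
  const label = `${letter}(?:(?:${inner}|${letter})*${letter})?`;
  const pattern = new RegExp(
    // Must not start in the middle of a word, after "_", "." or "-".
    `(?<![a-z0-9_.-])(?:${label}\\.)+(?:${TLDS})(?![a-z0-9_-])`,
    'gi'
  );
  return text.match(pattern) || [];
}

const text = 'run build_all.py then open api_stage.dzcode.io';
console.log(fuzzyHosts(text, '-'));   // → []
console.log(fuzzyHosts(text, '-|_')); // → [ 'build_all.py', 'api_stage.dzcode.io' ]
```

In this toy setup, allowing underscores picks up the intended host but also turns a file name into a link candidate.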

rlidwka avatar Apr 18 '22 16:04 rlidwka

Could we get this resolved already? We seem to be debating whether this is a valid case, but such domains clearly exist on the web. This library has 100% test coverage, so the change can be added without worrying that it would break something. "False-positive potential" has been mentioned before, but which exact cases could be false positives?

Another option that gets suggested is to use onCompile to override the src_domain regexp. However, since most of the regexps depend on one another, this simple change has to be applied like this:

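// Rebuild every pattern that embeds src_domain so that "_" is allowed inside a label.
LinkifyIt.prototype.onCompile = function onCompile() {
  const re = this.re;
  // Local copy of the separator class used by tpl_email_fuzzy below.
  const text_separators = '[><\uff5c]';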

  re.src_domain =
    '(?:' +
    re.src_xn +
    '|' +
    '(?:' + re.src_pseudo_letter + ')' +
    '|' +
    '(?:' + re.src_pseudo_letter + '(?:-|_|' + re.src_pseudo_letter + '){0,61}' + re.src_pseudo_letter + ')' +
    ')';

  re.src_host =
    '(?:' +
    '(?:(?:(?:' + re.src_domain + ')\\.)*' + re.src_domain/* _root */ + ')' +
    ')';

  re.tpl_host_fuzzy =
    '(?:' +
    re.src_ip4 +
    '|' +
    '(?:(?:(?:' + re.src_domain + ')\\.)+(?:%TLDS%))' +
    ')';

  re.src_host_strict =
    re.src_host + re.src_host_terminator;

  re.tpl_host_fuzzy_strict =
    re.tpl_host_fuzzy + re.src_host_terminator;

  re.src_host_port_strict =
    re.src_host + re.src_port + re.src_host_terminator;

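  re.tpl_host_port_fuzzy_strict =
    re.tpl_host_fuzzy + re.src_port + re.src_host_terminator;

  // tpl_link_no_ip_fuzzy below reads tpl_host_port_no_ip_fuzzy_strict, so that
  // chain is rebuilt too (assumes the same shape as the non-IP branch of
  // tpl_host_fuzzy above).
  re.tpl_host_no_ip_fuzzy =
    '(?:(?:(?:' + re.src_domain + ')\\.)+(?:%TLDS%))';

  re.tpl_host_port_no_ip_fuzzy_strict =
    re.tpl_host_no_ip_fuzzy + re.src_port + re.src_host_terminator;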

  re.tpl_email_fuzzy =
    '(^|' + text_separators + '|"|\\(|' + re.src_ZCc + ')' +
    '(' + re.src_email_name + '@' + re.tpl_host_fuzzy_strict + ')';

  re.tpl_link_fuzzy =
    '(^|(?![.:/\\-_@])(?:[$+<=>^`|\uff5c]|' + re.src_ZPCc + '))' +
    '((?![$+<=>^`|\uff5c])' + re.tpl_host_port_fuzzy_strict + re.src_path + ')';

  re.tpl_link_no_ip_fuzzy =
    '(^|(?![.:/\\-_@])(?:[$+<=>^`|\uff5c]|' + re.src_ZPCc + '))' +
    '((?![$+<=>^`|\uff5c])' + re.tpl_host_port_no_ip_fuzzy_strict + re.src_path + ')';

};

I don't think that's maintainable on our codebase.
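
A small sketch of why overriding only src_domain is not enough: patterns built by string concatenation copy the text of their parts at build time, so later changes to a part do not propagate. The pattern table here is a hypothetical, heavily simplified stand-in (`buildPatterns` and `src_letter` are made up; only the names `src_domain` and `src_host` mirror the override above).

```javascript
// Hypothetical miniature pattern table built by string concatenation.
function buildPatterns() {
  const re = {};
  re.src_letter = '[a-z0-9]';
  re.src_domain =
    '(?:' + re.src_letter + '(?:-|' + re.src_letter + ')*' + re.src_letter +
    '|' + re.src_letter + ')';
  // src_host copies the *current* text of src_domain into its own string.
  re.src_host = '(?:(?:' + re.src_domain + '\\.)+' + re.src_domain + ')';
  return re;
}

const re = buildPatterns();

// Override only the leaf pattern to allow "_" inside a label...
re.src_domain = re.src_domain.replace('(?:-|', '(?:-|_|');
const staleHost = new RegExp('^' + re.src_host + '$', 'i');

// ...versus rebuilding the dependent pattern from the new leaf.
re.src_host = '(?:(?:' + re.src_domain + '\\.)+' + re.src_domain + ')';
const rebuiltHost = new RegExp('^' + re.src_host + '$', 'i');

console.log(staleHost.test('api_stage.dzcode.io'));   // false: still the old label
console.log(rebuiltHost.test('api_stage.dzcode.io')); // true
```

Every pattern downstream of src_domain has the same problem, which is why the override ends up re-declaring so many of them.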

I actually see a couple of options here:

  1. Merge https://github.com/markdown-it/linkify-it/pull/96 which adds test coverage for these cases and fixes the issue.
  2. Make this library extensible/configurable in a better way, one that doesn't require consumers to keep half of the regexp code on their side, while maintaining backwards compatibility.

Please make some kind of decision; doing nothing and ignoring open-source community issues for years is not a valid solution.

domakas avatar Dec 27 '23 07:12 domakas