url-regex icon indicating copy to clipboard operation
url-regex copied to clipboard

Update path regex to not accept closing parens

Open skyebook opened this issue 8 years ago • 18 comments

This fixes a few cases, one of which is documented in #34:

  • css url's of the format background-image:url(http://example.com/img.jpg);
  • markdown url's such as look at [my image](http://example.com/img.jpg) blah blah

skyebook avatar Feb 09 '17 18:02 skyebook

Parentheses at the end of URLs are valid though and commonly used by for example Wikipedia. We need to check if the URL starts with a parens too and then skip the closing one. Until we have lookbehind assertions, my best advice is just to remove them manually.

kevva avatar May 21 '17 13:05 kevva

@kevva Maybe you can simulate lookbehind: https://stackoverflow.com/questions/7376238/javascript-regex-look-behind-alternative

Also see the JavaScript section in https://stackoverflow.com/a/35271017/64949 Could maybe use XRegExp.matchRecursive.

sindresorhus avatar Jun 02 '17 04:06 sindresorhus

It's almost impossible to take all edge cases into account while still following the spec. It's also somewhat common in PHP to use the following syntax when passing arrays into query strings.

/foo.php?bar[]=1&bar[]=2&bar[]=3

The spec says the following though:

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.

kevva avatar Jun 07 '17 17:06 kevva

@kevva Since we can't easily do anything about it right now, I think we should at least give the user a choice, so we should add an option whether to match parens or not. For my use-case, I'm ok with not supporting a few Wikipedia links with parens in favor of not matching Markdown syntax.

sindresorhus avatar Jun 13 '17 00:06 sindresorhus

Happy to. Do you think it fits better as a standalone option or using it with the strict one we have today?

kevva avatar Jun 13 '17 11:06 kevva

Standalone option.

sindresorhus avatar Jun 13 '17 13:06 sindresorhus

Lookbehinds are now in Chrome beta and will be out in Chrome stable in Oktober and probably come to Node.js this year. We could feature detect lookbehinds and use it when possible.

We need parens support for Refined GitHub and we can target latest Chrome.

// @bfred-it

sindresorhus avatar Sep 25 '17 18:09 sindresorhus

Cool! I've keeping an eye on any babel plugins or likewise that adds lookbehinds but seems like there are none yet.

kevva avatar Sep 25 '17 18:09 kevva

@kevva Yeah, I'm not surprised. It's very difficult to polyfill lookbehind.

sindresorhus avatar Sep 25 '17 18:09 sindresorhus

Yup, https://github.com/DmitrySoshnikov/regexp-tree/issues/66.

kevva avatar Sep 25 '17 18:09 kevva

Chrome 62 is out with lookbehind support. I suggest implementing lookbehind support behind an option for now.

sindresorhus avatar Nov 08 '17 09:11 sindresorhus

Lookbehind support is available in Node.js 10, so I suggest resolving this with lookbehinds, target Node.js 10, and do a major version bump.

sindresorhus avatar Nov 07 '18 08:11 sindresorhus

@skyebook, @kevva Given that lookbehind is now available is this something either of you could take a look at?

gajus avatar Apr 11 '19 11:04 gajus

@skyebook @gajus Please see #72 and also I have discovered that if a link starts with a [] brackets, or a mix, e.g. ][][][][][][ then the links returned contain these starting bracket prefixes.

niftylettuce avatar Aug 12 '20 07:08 niftylettuce

For anyone stumbling on that, you can clean up the results from url-regex matches by running them with:

const someLink = "[][][][][[[]])())))]]]google.com][[](()()";
someLink.replace(/^[\[\]\(\)]*|[\[\]\(\)]*$/g, ''); // google.com

niftylettuce avatar Aug 12 '20 09:08 niftylettuce

Another issue for folks to be aware of is that URL's returned from this regex will include stuff like this:

{background-image:url('someurl.com/some/path with {background-image:url(' prefixed, or some other combination

niftylettuce avatar Aug 12 '20 22:08 niftylettuce

The fix for my previous comment is this:

let someLink = "{background-image:url('someurl.com/some/path";
const quoteIndex = someLink.lastIndexOf('"');
const apostropheIndex = someLink.lastIndexOf("'");
const index = apostropheIndex > quoteIndex ? apostropheIndex : quoteIndex;
if (index !== -1) someLink = someLink.substring(index + 1); // 'someurl.com/some/path'

niftylettuce avatar Aug 12 '20 22:08 niftylettuce

This issue is fixed in my maintained and modern version of this package at https://github.com/niftylettuce/url-regex-safe. You should be able to switch from url-regex to url-regex-safe now. See the updated list of options as I added some new ones, and changed a few defaults to more sensible ones (since not everyone is parsing Markdown for instance).

niftylettuce avatar Aug 15 '20 08:08 niftylettuce