url-regex
url-regex copied to clipboard
Update path regex to not accept closing parens
This fixes a few cases, one of which is documented in #34:
- css url's of the format
background-image:url(http://example.com/img.jpg); - markdown url's such as
look at [my image](http://example.com/img.jpg) blah blah
Parentheses at the end of URLs are valid though and commonly used by for example Wikipedia. We need to check if the URL starts with a parens too and then skip the closing one. Until we have lookbehind assertions, my best advice is just to remove them manually.
@kevva Maybe you can simulate lookbehind: https://stackoverflow.com/questions/7376238/javascript-regex-look-behind-alternative
Also see the JavaScript section in https://stackoverflow.com/a/35271017/64949 Could maybe use XRegExp.matchRecursive.
It's almost impossible to take all edge cases into account while still following the spec. It's also somewhat common in PHP to use the following syntax when passing arrays into query strings.
/foo.php?bar[]=1&bar[]=2&bar[]=3
The spec says the following though:
A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.
@kevva Since we can't easily do anything about it right now, I think we should at least give the user a choice, so we should add an option whether to match parens or not. For my use-case, I'm ok with not supporting a few Wikipedia links with parens in favor of not matching Markdown syntax.
Happy to. Do you think it fits better as a standalone option or using it with the strict one we have today?
Standalone option.
Lookbehinds are now in Chrome beta and will be out in Chrome stable in Oktober and probably come to Node.js this year. We could feature detect lookbehinds and use it when possible.
We need parens support for Refined GitHub and we can target latest Chrome.
// @bfred-it
Cool! I've keeping an eye on any babel plugins or likewise that adds lookbehinds but seems like there are none yet.
@kevva Yeah, I'm not surprised. It's very difficult to polyfill lookbehind.
Yup, https://github.com/DmitrySoshnikov/regexp-tree/issues/66.
Chrome 62 is out with lookbehind support. I suggest implementing lookbehind support behind an option for now.
Lookbehind support is available in Node.js 10, so I suggest resolving this with lookbehinds, target Node.js 10, and do a major version bump.
@skyebook, @kevva Given that lookbehind is now available is this something either of you could take a look at?
@skyebook @gajus Please see #72 and also I have discovered that if a link starts with a [] brackets, or a mix, e.g. ][][][][][][ then the links returned contain these starting bracket prefixes.
For anyone stumbling on that, you can clean up the results from url-regex matches by running them with:
const someLink = "[][][][][[[]])())))]]]google.com][[](()()";
someLink.replace(/^[\[\]\(\)]*|[\[\]\(\)]*$/g, ''); // google.com
Another issue for folks to be aware of is that URL's returned from this regex will include stuff like this:
{background-image:url('someurl.com/some/path with {background-image:url(' prefixed, or some other combination
The fix for my previous comment is this:
let someLink = "{background-image:url('someurl.com/some/path";
const quoteIndex = someLink.lastIndexOf('"');
const apostropheIndex = someLink.lastIndexOf("'");
const index = apostropheIndex > quoteIndex ? apostropheIndex : quoteIndex;
if (index !== -1) someLink = someLink.substring(index + 1); // 'someurl.com/some/path'
This issue is fixed in my maintained and modern version of this package at https://github.com/niftylettuce/url-regex-safe. You should be able to switch from url-regex to url-regex-safe now. See the updated list of options as I added some new ones, and changed a few defaults to more sensible ones (since not everyone is parsing Markdown for instance).