spaCy icon indicating copy to clipboard operation
spaCy copied to clipboard

Match private networks as URLs

Open antonpibm opened this issue 1 year ago • 1 comments

Currently, URLs under private networks won't be matched as URLs by the tokenizer e.g.

https://10.140.12.13/foo/bar?arg1=baz&arg2=taz # not a URL
https://142.140.12.13/foo/bar?arg1=baz&arg2=taz # A URL

From my understanding, at some point the URL matching was changed, initial issue: https://github.com/explosion/spaCy/issues/840 relevant PR: https://github.com/explosion/spaCy/pull/879

It looks like the regex was taken from some validation code, where maybe it made sense to neglect private networks, but let me argue that for our usage it doesn't. In my usecase I have a dump of corporate logs where some of the systems URLs are reported as their private networks, one can refer to the URL structure's Wikipedia page and see that there is no exception for private networks.

In terms of NLP, both examples above are URLs, but only one would be matched as single token, the prefix of URL with 10.140.12.13 won't even have the like_url attribute set to true (the token is https://10.140.12.13).

Note: I haven't removed the exception for broadcast addresses as they are unlikely to appear as URL, but I don't know whether a NLP lib should make this validation or also match the broadcast address.

Description

Types of change

enhancement

Checklist

  • [X] I confirm that I have the right to submit this contribution under the project's MIT license.
  • [X] I ran the tests, and all new and existing tests passed.
  • [X] My changes don't require a change to the documentation, or if they do, I've added all required information.

antonpibm avatar Jul 12 '22 09:07 antonpibm

Thanks for submitting this PR! We'll evaluate the changes.

rmitsch avatar Jul 13 '22 08:07 rmitsch

Hi @antonpibm, thanks again for the PR! As the changes could influence tokenization in existing pipelines/applications, we've decided to add these changes to the v4 branch that we're using to prepare the upcoming 4.0.0. We'll also be looking into some other exceptions that perhaps can be removed - if you'd find any other odd cases we'd be happy to consider those as well!

We'll merge this as soon as the tests go green again.

svlandeg avatar Aug 11 '22 07:08 svlandeg

@svlandeg thx for the update. Would gladly help with any relevant consideration

antonpibm avatar Aug 11 '22 11:08 antonpibm