spaCy
spaCy copied to clipboard
Match private networks as URLs
Currently, URLs under private networks won't be matched as URLs by the tokenizer e.g.
https://10.140.12.13/foo/bar?arg1=baz&arg2=taz # not a URL
https://142.140.12.13/foo/bar?arg1=baz&arg2=taz # A URL
From my understanding, at some point the URL matching was changed, initial issue: https://github.com/explosion/spaCy/issues/840 relevant PR: https://github.com/explosion/spaCy/pull/879
It looks like the regex was taken from some validation code, where maybe it made sense to neglect private networks, but let me argue that for our usage it doesn't. In my usecase I have a dump of corporate logs where some of the systems URLs are reported as their private networks, one can refer to the URL structure's Wikipedia page and see that there is no exception for private networks.
In terms of NLP, both examples above are URLs, but only one would be matched as single token, the prefix of URL with 10.140.12.13
won't even have the like_url
attribute set to true (the token is https://10.140.12.13
).
Note: I haven't removed the exception for broadcast addresses as they are unlikely to appear as URL, but I don't know whether a NLP lib should make this validation or also match the broadcast address.
Description
Types of change
enhancement
Checklist
- [X] I confirm that I have the right to submit this contribution under the project's MIT license.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.
Thanks for submitting this PR! We'll evaluate the changes.
Hi @antonpibm, thanks again for the PR! As the changes could influence tokenization in existing pipelines/applications, we've decided to add these changes to the v4
branch that we're using to prepare the upcoming 4.0.0. We'll also be looking into some other exceptions that perhaps can be removed - if you'd find any other odd cases we'd be happy to consider those as well!
We'll merge this as soon as the tests go green again.
@svlandeg thx for the update. Would gladly help with any relevant consideration