IPv6 recognizer not working properly
I was trying to use Presidio to identify and remove IP addresses and ran into the following issue: '::' is recognized as a string containing an IP address, while '2345:0425:2CA1:0000:0000:0567:5673:23b5' is not recognized as an IP address at all. I ran a few tests:
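from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

# A bare '::' is reported as an IP address
results = analyzer.analyze(text='::',
                           entities=['IP_ADDRESS'],
                           language='en')
print(results)

# A full, uncompressed IPv6 address is not detected
results2 = analyzer.analyze(text='2345:0425:2CA1:0000:0000:0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results2)

# A compressed IPv6 address is only partially matched
results3 = analyzer.analyze(text='2345:0425:2CA1::0567:5673:23b5',
                            entities=['IP_ADDRESS'],
                            language='en')
print(results3)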
Output:
[type: IP_ADDRESS, start: 0, end: 2, score: 0.6]
[]
[type: IP_ADDRESS, start: 13, end: 30, score: 0.6]
This suggests the recognizer treats any string containing two consecutive colons as an IPv6 address, rather than matching the full address. I then checked the source code and found this in the tests:
https://github.com/microsoft/presidio/blob/4777d1759e9bddc45317d9b2689e6df9f75eec05/presidio-analyzer/tests/test_ip_recognizer.py#L24
Can the IPv6 regex be fixed?
Thanks for raising this. We'd be happy to review a PR if you're interested in contributing.
It seems this was broken in https://github.com/microsoft/presidio/pull/312. I opened a PR to fix the regex, although it is still not optimal (see the comments on the test scenarios). IMO this should be transitioned to use the standard-library ipaddress module for a much simpler implementation: https://docs.python.org/3/library/ipaddress.html?highlight=ipaddress#convenience-factory-functions
@omri374 Your thoughts?
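For reference, here is a rough sketch of what the ipaddress-based approach could look like. This is not Presidio's implementation: the function name find_ipv6_spans and the skip_unspecified flag are made up for illustration, and it assumes candidates are whitespace-separated tokens. Note that '::' is itself a valid IPv6 address (the unspecified address), so ipaddress accepts it; whether to report it is a separate decision, handled here by a flag.

```python
import ipaddress
import re
from typing import List, Tuple


def find_ipv6_spans(text: str, skip_unspecified: bool = True) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of whitespace-separated tokens that parse as IPv6.

    Hypothetical helper for illustration only; not part of Presidio.
    """
    spans = []
    for match in re.finditer(r"\S+", text):
        try:
            # ip_address() raises ValueError for anything that is not a valid address
            address = ipaddress.ip_address(match.group())
        except ValueError:
            continue
        if address.version != 6:
            continue  # ignore IPv4 addresses in this sketch
        if skip_unspecified and address.is_unspecified:
            continue  # '::' parses as valid IPv6 but is rarely worth flagging
        spans.append((match.start(), match.end()))
    return spans


print(find_ipv6_spans('::'))
print(find_ipv6_spans('2345:0425:2CA1:0000:0000:0567:5673:23b5'))
print(find_ipv6_spans('2345:0425:2CA1::0567:5673:23b5'))
```

A limitation of this sketch is that a token with attached punctuation (for example an address followed directly by a period) will fail to parse, so a real recognizer would probably still need a loose regex to pick out candidates before validating them with ipaddress.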