IPV6 recognizer not working properly

Open morrissharp opened this issue 2 years ago • 1 comments

I was trying to use presidio to identify and remove IP addresses, and I ran into the following issue. It was recognizing '::' as a string containing an IP address, and '2345:0425:2CA1:0000:0000:0567:5673:23b5' was not being recognized as an IP address. I ran a couple of tests as follows:

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

results = analyzer.analyze(text='::',
        entities=['IP_ADDRESS'],
        language='en')
print(results)

results2 = analyzer.analyze(text='2345:0425:2CA1:0000:0000:0567:5673:23b5',
        entities=['IP_ADDRESS'],
        language='en')
print(results2)


results3 = analyzer.analyze(text='2345:0425:2CA1::0567:5673:23b5',
        entities=['IP_ADDRESS'],
        language='en')
print(results3)

Output:

[type: IP_ADDRESS, start: 0, end: 2, score: 0.6]
[]
[type: IP_ADDRESS, start: 13, end: 30, score: 0.6]

This makes it look as if the recognizer treats any text containing two consecutive colons as an IPv6 address, while a fully written-out address with no '::' is missed. I then checked the source code and found this in the tests:

https://github.com/microsoft/presidio/blob/4777d1759e9bddc45317d9b2689e6df9f75eec05/presidio-analyzer/tests/test_ip_recognizer.py#L24

Can the IPv6 regex be fixed?

morrissharp avatar Aug 16 '22 17:08 morrissharp

Thanks for raising this. We'd be happy to review a PR if you're interested in contributing.

omri374 avatar Aug 22 '22 08:08 omri374

It seems this was broken in https://github.com/microsoft/presidio/pull/312. I opened a PR to fix the regex, although it is still not optimal (see the comments in the test scenarios). IMO the recognizer should transition to Python's standard-library ipaddress module for a much simpler implementation: https://docs.python.org/3/library/ipaddress.html?highlight=ipaddress#convenience-factory-functions
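To illustrate the ipaddress idea, here is a minimal sketch, not presidio code. It assumes a loose candidate regex followed by validation with ipaddress.ip_address; the names CANDIDATE and find_ip_addresses are hypothetical, and a real recognizer would still need to wrap this in presidio's recognizer interface.

```python
import ipaddress
import re

# Hypothetical, deliberately loose pattern: any run of hex digits, colons, and dots.
# It only proposes candidates; the standard library decides what is a real address.
CANDIDATE = re.compile(r"[0-9A-Fa-f:.]{2,}")


def find_ip_addresses(text):
    """Return (start, end, token) for every candidate that parses as IPv4 or IPv6."""
    found = []
    for match in CANDIDATE.finditer(text):
        token = match.group()
        try:
            # ip_address() raises ValueError for anything that is not a valid address.
            ipaddress.ip_address(token)
        except ValueError:
            continue
        found.append((match.start(), match.end(), token))
    return found


print(find_ip_addresses("2345:0425:2CA1:0000:0000:0567:5673:23b5"))
print(find_ip_addresses("2345:0425:2CA1::0567:5673:23b5"))
print(find_ip_addresses("::"))
```

Two caveats with this sketch: '::' on its own is a valid IPv6 address (the unspecified address), so the standard library accepts it and it would need an explicit filter if it should not be reported. Also, a candidate with trailing punctuation, such as an address followed by a full stop, fails validation as written and would need trimming first.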

@omri374 Your thoughts?

SharonHart avatar Nov 29 '22 10:11 SharonHart