circuit-maintenance-parser
circuit-maintenance-parser copied to clipboard
Add a CI check to validate IP address present in tests, and a helper to clean them
This PR addresses issue #166 by adding a helper script anonymize-ip-addresses.py, to be used in the CI to detect the presence of some IPv4 or IPv6 addresses, and to actually replace them by documentary IPs
Content
- [x] New script
anonymize-ip-addresses.pyto check for IPv4 and IPv6 addresses to detect them, and replace by documentation ones - [x] Created two invoke tasks:
check-anonymize-ipsandclean-anonymize-ips - [x] Add a first step in the CI to detect the presence of IP addresses in the test data
Just a friendly thought when I think of anonymizing IPs and parsing
- When you are parsing, there can be differences introduced by changing the octet size, such as a table
- When viewing the data, sometimes IPs are related and anonimzing them all to the same, it makes it difficult to correlate
Does it make sense to consider swapping the first subnet as the following:
- 1.x.x.x for single digit first subnets
- 10.x.x.x for double digit first subnets
- 172.x.x.x for triple digit first subnets
and keep everything else the same?
and keep everything else the same?
The parsing should work the same for any kind of addresses, so if changing the octet size fails, it would be great to catch the error earlier. So far, in all the test data sets, I have found only one IP address in the actual data (description), in the Megaport one. And I don't see a big problem to lose it by replacing by a generic one. Maybe I am missing some good point, but so far I think it's easier to do a simple replace to a reference value for all the IP addresses.
Case in point:

Additionally, I have had bad luck with parsing where a column would be based on 15 characters as an example and then when reduced would run into issues such as:
IP Status
100.100.100.101 active
IP Status
192.2.0.1 active
and creates a state that would never exist in the real world, but can only exist in the testing infra. So what win? The correct parsing strategy of 15 characters (imagine a real scenario where this is the 4th column and all columns are based on spacing) of the incorrect reality that was created by auto-anonymizing without spacial awareness?
Long and short, parsing is a fickle beast and I have been personally bitten by this one several times.
I'm not as concerned as @itdependsnetworks about specifically preserving exact character counts, but I do agree that it would be good to keep some uniqueness across different IPs rather than making all addresses identical. I'd vote for something like just changing the first octet of IPv4 addresses to 10. and changing the first two tokens of IPv6 addresses (do we even have any in our examples?) to 2001:db8:.
I will give a try to https://github.com/intentionet/netconan thanks @jvanderaa