
Add a CI check to detect IP addresses present in tests, and a helper to clean them

Open chadell opened this issue 3 years ago • 5 comments

This PR addresses issue #166 by adding a helper script anonymize-ip-addresses.py, to be used in the CI to detect the presence of IPv4 or IPv6 addresses and to replace them with documentation IPs.

Content

  • [x] New script anonymize-ip-addresses.py to detect IPv4 and IPv6 addresses and replace them with documentation ones
  • [x] Created two invoke tasks: check-anonymize-ips and clean-anonymize-ips
  • [x] Added a first step in the CI to detect the presence of IP addresses in the test data
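
The script itself isn't pasted in this thread, so here is only a rough, hypothetical sketch of the approach described above; the file locations, regexes, documentation addresses, and the --clean flag are all assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch of the approach described above; not the PR's actual
# anonymize-ip-addresses.py. Paths, regexes, and replacement values are assumptions.
import ipaddress
import re
import sys
from pathlib import Path

# RFC 5737 / RFC 3849 documentation addresses used as replacements (assumption).
DOC_IPV4 = "192.0.2.1"
DOC_IPV6 = "2001:db8::1"

# Loose candidate patterns; each match is validated with ipaddress before replacing.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
IPV6_CANDIDATE = re.compile(r"\b[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f:]+)+\b")


def replace_ips(text: str) -> str:
    """Replace every valid IPv4/IPv6 literal with a documentation address."""

    def _sub(match: re.Match, cls, replacement: str) -> str:
        try:
            cls(match.group(0))
        except ValueError:  # e.g. "999.1.1.1" or a timestamp like "15:09"
            return match.group(0)
        return replacement

    text = IPV4_CANDIDATE.sub(lambda m: _sub(m, ipaddress.IPv4Address, DOC_IPV4), text)
    return IPV6_CANDIDATE.sub(lambda m: _sub(m, ipaddress.IPv6Address, DOC_IPV6), text)


def main(check_only: bool) -> int:
    """Return 1 (failing CI) when any test data file still contains an IP other
    than the documentation ones; with --clean, rewrite the files instead."""
    dirty = False
    for path in Path("tests").rglob("*"):  # test data location is an assumption
        if not path.is_file():
            continue
        original = path.read_text(errors="ignore")
        cleaned = replace_ips(original)
        if cleaned != original:
            dirty = True
            if check_only:
                print(f"IP addresses found in {path}")
            else:
                path.write_text(cleaned)
    return 1 if (dirty and check_only) else 0


if __name__ == "__main__":
    sys.exit(main(check_only="--clean" not in sys.argv))
```

Presumably the check-anonymize-ips task would call this in check-only mode (failing CI on any hit) and clean-anonymize-ips would pass --clean to rewrite the files in place, though that wiring is also an assumption.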

chadell avatar Sep 17 '22 15:09 chadell

Just a friendly thought as I think about anonymizing IPs and parsing:

  • When you are parsing, changing the octet size can introduce differences, for example in a fixed-width table
  • When viewing the data, sometimes IPs are related, and anonymizing them all to the same value makes it difficult to correlate them

Does it make sense to consider swapping the first octet as follows:

  • 1.x.x.x for single-digit first octets
  • 10.x.x.x for double-digit first octets
  • 172.x.x.x for triple-digit first octets

and keep everything else the same?
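
A rough sketch of that length-preserving swap (hypothetical code, not anything from the PR; the buckets just mirror the bullets above):

```python
# Hypothetical sketch of the length-preserving first-octet swap suggested above.
def swap_first_octet(ipv4: str) -> str:
    """Map the first octet to 1 / 10 / 172 based on its digit count,
    keeping the remaining octets (and therefore the overall length) unchanged."""
    first, rest = ipv4.split(".", 1)
    replacement = {1: "1", 2: "10", 3: "172"}[len(first)]
    return f"{replacement}.{rest}"


assert swap_first_octet("8.8.8.8") == "1.8.8.8"
assert swap_first_octet("52.10.0.1") == "10.10.0.1"
assert swap_first_octet("203.0.113.7") == "172.0.113.7"
```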

itdependsnetworks avatar Sep 19 '22 14:09 itdependsnetworks

and keep everything else the same?

The parsing should work the same for any kind of address, so if changing the octet size breaks it, it would be great to catch that error earlier. So far, in all the test data sets, I have found only one IP address in the actual data (the description), in the Megaport one, and I don't see a big problem with losing it by replacing it with a generic one. Maybe I am missing a good point, but so far I think it's easier to do a simple replacement with a reference value for all the IP addresses.

chadell avatar Sep 19 '22 14:09 chadell

Case in point:

[image]

Additionally, I have had bad luck with parsing where a column would be based on, say, 15 characters and then, when the value is shortened, would run into issues such as:

IP               Status
100.100.100.101  active

IP               Status
192.0.2.1  active

and creates a state that would never exist in the real world, but can only exist in the testing infra. So which wins? The correct parsing strategy of 15 characters (imagine a real scenario where this is the 4th column and all columns are based on spacing), or the incorrect reality that was created by auto-anonymizing without spatial awareness?
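
To make that concrete, here is a small hypothetical illustration (the column offset and rows are made up): a fixed-width parse that works on the real row but silently breaks on the anonymized one.

```python
# Hypothetical illustration of the fixed-width concern above: the Status column
# starts at character 17 in the real data, but the anonymized IP is shorter,
# so the same slice no longer lands on the column.
REAL_ROW = "100.100.100.101  active"
ANON_ROW = "192.0.2.1  active"


def status_column(row: str) -> str:
    """Naive fixed-width parse: assume the Status column starts at offset 17."""
    return row[17:].strip()


print(status_column(REAL_ROW))  # "active"
print(status_column(ANON_ROW))  # "" -- the column boundary no longer lines up
```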

Long and short, parsing is a fickle beast and I have been personally bitten by this one several times.

itdependsnetworks avatar Sep 19 '22 14:09 itdependsnetworks

I'm not as concerned as @itdependsnetworks about specifically preserving exact character counts, but I do agree that it would be good to keep some uniqueness across different IPs rather than making all addresses identical. I'd vote for something like just changing the first octet of IPv4 addresses to 10. and changing the first two tokens of IPv6 addresses (do we even have any in our examples?) to 2001:db8:.
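
A rough sketch of that suggestion (hypothetical code, not from this PR; it assumes IPv6 addresses start with two explicit groups rather than a compressed "::"):

```python
# Hypothetical sketch of the suggestion above: rewrite only the leading part of
# each address so different IPs stay distinguishable after anonymization.
def anonymize_v4(addr: str) -> str:
    """Replace just the first octet with 10, keeping the rest of the address."""
    return "10." + addr.split(".", 1)[1]


def anonymize_v6(addr: str) -> str:
    """Replace the first two groups with the 2001:db8: documentation prefix.
    Assumes the address starts with two explicit groups (no leading ::)."""
    return "2001:db8:" + addr.split(":", 2)[2]


assert anonymize_v4("203.0.113.42") == "10.0.113.42"
assert anonymize_v6("2606:4700:4700::1111") == "2001:db8:4700::1111"
```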

glennmatthews avatar Sep 21 '22 13:09 glennmatthews

I will give https://github.com/intentionet/netconan a try, thanks @jvanderaa

chadell avatar Sep 21 '22 14:09 chadell